The Basics of Text Parsing in RPA
I’ve been learning UiPath for about two weeks, and I found that text parsing plays a central role. We can parse text extracted from various file formats, such as PDF, Excel, and Word documents, using this awesome stuff called regex.
Regex, or regular expressions, simplifies the search process. A regex is a pattern, written as a string of text, that helps you match, locate, and manage text in any text file. Two of the regex activities UiPath provides are Is Match and Matches.
Is Match: indicates whether the specified regular expression finds a match in the specified input string, and returns either true or false.
Matches: searches the input string for all occurrences of a regular expression and returns the successful matches as a collection.
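To make the difference concrete, here is a minimal sketch in Python rather than UiPath. It is only an analogue of the two activities, and the sample text, the "Field: value" layout, and the pattern are all assumptions made for illustration.

```python
import re

# Hypothetical sample input; the "Field: value" layout is an assumption.
text = "Name: Asha\nDesignation: Engineer\nName: Ravi\nDesignation: Analyst"

# Hypothetical pattern for a line that holds a name.
name_pattern = re.compile(r"^Name:\s*(.+)$", re.MULTILINE)

# Analogue of Is Match: a single true/false answer.
has_name = name_pattern.search(text) is not None

# Analogue of Matches: a collection with every successful match.
all_names = [m.group(1) for m in name_pattern.finditer(text)]

print(has_name)
print(all_names)
```

The first result only says whether the pattern occurs at all, while the second gives every occurrence so you can loop over them.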
So that is what these activities do. Now let us see how to use them.
Consider a scenario where we need to read text from a PDF file and write it into an Excel file. The PDF contains a list of items under name, designation, company, and location.
Each value should be written under its respective title, i.e., data read from name (PDF file) goes under name (Excel file). To do this, we need two things:
1. Create a variable to store the entire text from the PDF file. We iterate over this text line by line, fetch the required info from each line, and write it to its respective cell in the Excel file.
2. Create a variable called counter. The counter tracks which field of the current record we are reading. Let name be the first field, so the counter is set to 1 when we read the name field. The counter is incremented each time we move to the next field, and it resets to 1 when we find a name field again.
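The counter idea can be sketched in Python (not UiPath) as follows. The sample lines and the helper name label_lines are hypothetical, and the sketch assumes every record starts with a line beginning with "Name:".

```python
import re

def label_lines(lines):
    """Pair each line with the counter value it is read under."""
    labelled = []
    counter = 0  # 0 means no record has started yet
    for line in lines:
        if re.match(r"Name:", line):
            counter = 1  # a Name line starts a new record
        if counter == 0:
            continue  # skip anything before the first record
        labelled.append((counter, line))
        counter += 1  # move on to the next field
    return labelled

# Hypothetical sample data with two records of four fields each.
sample_lines = [
    "Name: Asha", "Designation: Engineer", "Company: Acme", "Location: Pune",
    "Name: Ravi", "Designation: Analyst", "Company: Globex", "Location: Delhi",
]

for counter, line in label_lines(sample_lines):
    print(counter, line)
```

Each line is tagged 1 to 4 within a record, and the tag drops back to 1 at the next name line, which is the behaviour the workflow relies on.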
To read the text from a PDF file, we need the Read PDF Text activity. If it doesn’t show up in the Activities pane, search for pdf in Manage Packages and install UiPath.PDF.Activities.
Next, drag and drop the Read PDF Text activity into your workflow. Then specify your file path in the FileName text box in the Properties pane. Create a string variable for the output text and enter it in the Text field under Output.
Now write the extracted text to a text file using the Write Text File activity. We can then get the lines of the extracted text using File.ReadAllLines("text file path"). File.ReadAllLines reads every line of the file and returns them as an array of strings.
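For readers who think better in code, this write-then-read step looks roughly like this in Python. It is an analogue, not the UiPath workflow; the file name extracted.txt and the sample text are made up, and a temporary folder is used so the sketch leaves nothing behind.

```python
import tempfile
from pathlib import Path

# Hypothetical text, standing in for the output of Read PDF Text.
extracted_text = "Name: Asha\nDesignation: Engineer\n"

with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp) / "extracted.txt"  # hypothetical file name
    # Analogue of the Write Text File activity.
    text_file.write_text(extracted_text, encoding="utf-8")
    # Analogue of File.ReadAllLines: one list entry per line.
    lines = text_file.read_text(encoding="utf-8").splitlines()

print(lines)
```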
To iterate through the collection of lines, we need the For Each activity. Here comes the important part: inside the loop, we use the Is Match activity to check whether the current line is the start of a record. Click on Configure Regular Expression, then under RegEx select Advanced. In the Value box, type the regex that matches the Name line and click the Save button.
The output of Is Match is either true or false. We use an If activity below it to handle that output.
If the condition is true, set the counter to 1 and use the Regex.Match function to extract the value of the name field.
Now create an Int32 variable rcount with a default value of 1, and assign rcount = rcount + 1. The first record is therefore written to row 2.
Drag and drop the Excel Application Scope activity and specify the file path. Use the Write Cell activity to write the data to the Excel sheet. Specify the sheet name, and in the cell field type "A" + rcount.ToString.
After writing to the Excel sheet, assign counter = counter + 1. The counter is now 2.
If the condition is false, we use a Switch activity. In the Expression field, specify the counter. Click on Add new case and specify 2, so that when the counter is 2 the activities in that case run.
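When the counter is 2, assign counter = counter + 1, then create a variable and assign it the matched data: Regex.Match(item, "\:\s.\w.").Value
Then assign variable = variable.Replace(": ", "") (the Replace function removes the colon and the space). Note that, as written, this pattern matches only three characters after the colon and space. If your values are longer than that, the pattern needs quantifiers, for example "\:\s.*" to take the rest of the line.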
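Here is a small Python sketch of the match-then-replace step, so you can see what each pattern actually returns. Python's re module stands in for .NET's Regex class, the sample line is hypothetical, and the second, broader pattern is my own suggestion rather than something from the workflow.

```python
import re

item = "Designation: Engineer"  # hypothetical line from the text file

# The pattern as written above: colon, one whitespace character,
# then exactly three more characters (any, word character, any).
narrow = re.search(r"\:\s.\w.", item).group(0).replace(": ", "")

# A broader pattern (an assumption, not from the workflow) that
# takes everything after the colon and space.
wide = re.search(r":\s.*", item).group(0).replace(": ", "")

print(narrow)
print(wide)
```

The first pattern keeps only "Eng", while the second keeps the whole value, so it is worth testing your regex against a real line before running the workflow.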
Now add an Excel Application Scope activity with a Write Cell activity inside it, and specify the same sheet name and "B" + rcount.ToString to write to cell B2.
Similarly, case 3 matches the company field and writes it to cell C2, and case 4 matches the location field and writes it to cell D2.
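Putting the pieces together, the whole loop can be sketched in Python as below. This is not the XAML workflow, just the same logic under some assumptions: a dictionary stands in for the Excel sheet, the sample lines and the names fill_sheet and COLUMN_FOR_COUNTER are hypothetical, and the value pattern is the broader ":\s.*" rather than the one typed into the workflow.

```python
import re

# Plays the role of the Switch cases: counter value -> column letter.
COLUMN_FOR_COUNTER = {1: "A", 2: "B", 3: "C", 4: "D"}

def fill_sheet(lines):
    """Return a dict of cell address -> value, standing in for Write Cell."""
    sheet = {}
    rcount = 1   # incremented before the first write, so data starts at row 2
    counter = 0  # 0 means no record has started yet
    for item in lines:
        if re.match(r"Name:", item):  # analogue of Is Match
            counter = 1
            rcount += 1
        column = COLUMN_FOR_COUNTER.get(counter)
        if column is None:
            continue  # line is outside the four known fields
        found = re.search(r":\s.*", item)
        if found:
            # Same idea as "A" + rcount.ToString in the Write Cell activity.
            sheet[column + str(rcount)] = found.group(0).replace(": ", "")
        counter += 1
    return sheet

# Hypothetical sample data with two records.
sample = [
    "Name: Asha", "Designation: Engineer", "Company: Acme", "Location: Pune",
    "Name: Ravi", "Designation: Analyst", "Company: Globex", "Location: Delhi",
]
cells = fill_sheet(sample)
print(cells)
```

A lookup table replaces the four near-identical Switch cases here, since the only thing that changes between them is the column letter.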
So that’s it. Click the Save button and run the file. The extracted text now appears in the Excel sheet.
You can use the below GitHub link to get the XAML file