Processing Financial Reports in XML
Part 3
In part 1 and 2 of this 3 part series we have demonstrated how to turn a typical xml file into a tabular format file. In the third and final section we will articulate how to validate for missing fields and mismatch datatypes. This is an example of the “Checks” file already in the tabular format:
DevOperacoesDev_montante DevOperacoesDev_quantidade ImpLegal_montante ImpLegal_quantidade OperacoesDev_montante OperacoesDev_quantidade Operacoes_montante Operacoes_quantidade chs_id divisa_operacao pais_operacao processador scheme setor_institucional tipo_cheque
8.42 898 7.77 7 1040.03 671 75191.47 30337 CHS.000000000a EUR PT 1 1 S7 9
2286.91 948 1.85 1 2357.33 801 44864.07 75483 CHS.0000000002 EUR PT 1 1 S4 7
6890.53 878 14.47 2 3972.38 568 52403.68 @ CHS.0000000003 EUR PT 1 1 S4 7
647.40 173 7.76 6 3414.78 726 302043.31 89854 CHS.0000000004 EUR PT 1 1 S4 9
898.11 183 10.27 6 1929.11 448 479434.94 67094 CHS.0000000005 EUR PT 1 1 S2 7
1282.50 382 21.54 3 10.19 10 242343.41 79813 CHS.0000000006 EUR PT 1 1 S1 9
805.10 325 83.92 10 891.31 273 98729.25 11235 CHS.0000000007 EUR PT 1 1 S5 3
First we replace every missing field with “@”, with this line of code:
cat teste.xml | sed 's/></>@</g'
We use the “sed” command to identify empty tags in the xml file and fill them with the “@” character. After this is done, we run the following code:
# --- Reads file --- cat $file | # --- Selects every field self 1/NF | # --- Regular expression to catch every mismatch data type --- grep -vP "^[0-9]*\.\d\d [0-9]* [0-9]*\.\d\d [0-9]* [0-9]*\.\d\d [0-9]* [0-9]*\.\d\d [0-9]* [A-Z]{3}\.[0-9]* [A-Z]{2,4} [A-Z]{2,4} \d \d [A-Z][0-9] \d$" | fcols -- > INVALID_DATA/$filename
We use a regular expression to instruct the “grep” command to exclude all the records that do not present the desired format. The result of this command is a list all the records that conform to a given data format.
In the first example, the column “chs_id” contains the value “CHS.000000000a”. This value does not conform to the format the we specified using the regular expression. After we run this code we get the following file:
DevOperacoesDev_montante DevOperacoesDev_quantidade ImpLegal_montante ImpLegal_quantidade OperacoesDev_montante OperacoesDev_quantidade Operacoes_montante Operacoes_quantidade chs_id divisa_operacao pais_operacao processador scheme setor_institucional tipo_cheque
8.42 898 7.77 7 1040.03 671 75191.47 30337 CHS.000000000a EUR PT 1 1 S7 9
6890.53 878 14.47 2 3972.38 568 52403.68 @ CHS.0000000003 EUR PT 1 1 S4 7
As one can see, we were able to isolate the records with “@”, i.e., records with missing fields, and the record with the incorrect code “CHS.000000000a”, i.e., a mismatched datatype. This file is now in a ready state to be analyzed.
If you want to know more about this demonstration, you can send an email in our contacts section our fill out the form bellow.
Check out the other articles in this series:
Find out more
Request a demo and speak with our team about how you can leverage the power of Unicage in your organization.