Processing Financial Reports in XML

Part 1

This demonstration is based on an xml file which is the standard format  used for financial reporting from big institutions to regulatory authorities, banks and commissions. In this example, we use a file containing dummy data that mimics the report of a financial institution to a National Bank.

The first 20 lines of our file are shown below:
[uspeu@lecture-green DEMO_XML]$ head -20 teste.xml

This xml file is separated in different sections that correspond to different types of elements to be reported.   For instance, Checks, Transfers, etc. Having this information in a single xml file is very convenient for the communications phase, where the institution sends the information over a network to the National Bank, but the analysis phase becomes too burdensome. To simplify the analysis phase, one can parse the xml file and organize that data by instance in a tabular file. This process of parsing and organization includes a validation phase, where one can identify missing fields and mismatched data types. After performing this processing, the data can be more easily manipulated by visualization tools or other software packages that are used to post-process the data. By accomplishing this, not only we are able to turn a very complicated, hard to read file into several files parsed and validated but we are also able to save about 60% of memory space by only getting the essential data of the file without any loss of quality.

We will describe how to deal with this type of file using the Unicage method in 3 parts:

  • Converting the data to a “space-separated value” format;
  • Organizing the data into a tabular format;
  • Assessing the quality of the data;

The xml structure can be looked at as a tree, if we see the example above, the “IPSYS_PSP” would be the root of this tree, then the first branch would be “Header” followed by “period”, “PSP_reporter”, and then the value of such field. The first part of this process is to define every branch we want to collect from the file to then feed this information to a Unicage command called xmldir. The branches we selected are the following:

cat << EOF > ./tmp-XML_PATH

We save this information on a file called “tmp-XML_Path”, so we can use this later.

The xmldir command allows to collect every line, in the original xml file, according to the brance passed as argument, as you can see below:

[uspeu@lecture-green DEMO_XML]$ xmldir /IPSYS_PSP/Instrumento teste.xml | head
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details chs_id CHS.000000000a
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details scheme
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details processor 1
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details institutional_sector S3
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details check_type 9
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details operating_country PT
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details operating_division EUR
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details Operation quantity 8988
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details Operation amout 15412.53
IPSYS_PSP Instrument Checks Ch_Raised ChS_Details Operation_Dev amount 587

By giving this input “/IPSYS_PSP/Instrument”, the xmldir command returns only the values we want.

Using the previously created file “tmp-XML_PATH”, we loop over every branch and input it to this command so we can only get the information to the specific branch we need, we also use another Unicage’s command, called self, that selects the fields we want, in this case the last 3 fields, after this we use a command called AWK to create a single file with the data in “space-seperated value” format .

# --- Parses file of the specific branch ---
xmldir $(echo $line) $xmlfile             |
# --- Selects last 3 fields ---
self NF-2 NF-1 NF                         |
# --- Creates a numeric value field that increments from id to id, ---
# --- which means from a full instance to another ---
awk 'BEGIN{count=0}{if($2 ~ "^[a-zA-Z]+_id$"){count++}; print count, $0}'       >> TABULAR_FILES/$dirname/tmp-$filename

After runing this code we get the data into this format:

1 ChS_Details chs_id CHS.000000000a
1 ChS_Details scheme @
1 ChS_Details processors 1
1 ChS_Details institutional_sector S3
1 ChS_Details check_type 9
1 ChS_Details operation_country PT
1 ChS_Details operation_division EUR
1 Operations quantity 8988
1 Operations amount 15412.53
1 Operations_Dev quantity 587

In part 2, we are going to explain how to get from this type of data, “space-seperated value” format, and how to transform it into a tabular format.

Check out the other articles in this series:

Find out more

Request a demo and speak with our team about how you can leverage the power of Unicage in your organization.

Privacy Policy