How to combine legacy sensor data with new sensor data systems for reporting

Use Case (Part 1)

Taming data in different formats is a common challenge in any data pipeline. In what follows, we will consider a scenario in which such a challenge occurs: combining data coming from different types of sensors. Sensors from different manufacturers follow different standards, which demands the inclusion of a normalization procedure. This procedure must reconcile all the data formats, so that all the information potential can be extracted from the data.

In this article we are going to use the Unicage tools to build a data pipeline that is able to normalize two different data formats and combine them in a ready-to-plot csv file. To make this example more concrete, we assume that the files we receive come from two different models of vending machines: one model produces json files and the other produces xml files. Here are some example files:

[user@unicage]$ ls -lh
total 0
-rw-rw-r-- 1 user user 49K abr 20 18:11 model01_file.json
-rw-rw-r-- 1 user user 52K abr 20 18:11 model02_file.xml

Each file contains information about the items that are delivered by each vending machine. Every item that is delivered represents an event and generates packet of data that includes the code that identifies the item, date and time coordinates at which the item was delivered, the ID of the machine that delivered the item and the shelf from where the item was taken (there are three different types of shelf: upper, middle and lower). This information is organized in the json file as:

[user@unicage]$ head -15 model01_file.json
{
"1": {
"prateleira": "superior",
"produto": "34WE4",
"maquina": "C010",
"hora": "16:29",
"data": "24-01-2021"
}
},{
"2": {
"prateleira": "meio",
"produto": "21WX5",

and in the XML file as:

[user@unicage]$ head -15 model02_file.xml
<events>
<event>
<date>2021-04-21</date>
<time>11:05</time>
<equipment>C012</equipment>
<item>87WV2</item>
<shelf>upper</shelf>
</event>
<event>
<date>2021-04-21</date>
<time>03:23</time>
<equipment>C014</equipment>

The most immediate observation is that the data fields are labelled in two different languages: the data fields in the xml file are written in English whereas the data fields in the json file are written in Portuguese. A more careful look also reveals that the format in which the date field is presented is different in both files and the values of the field “shelf”/“prateleira” are written in different languages too. Because the goal is combine all the data in a ready-to-plot csv file, the data from both files must be normalized to satisfy the following constraints:

  • the final csv file must have a header describing the data fields written in english;
  • the date field of the final csv file must be presented in the format YYYY/MM/DD;
  • the values of the “shelf” field must be written in english.

We will describe the data pipeline that harmonizes the different data formats in three parts:

  • Converting the json file to tabular format;
  • Converting the xml file to tabular format;
  • Reconciling the converted data and producing the csv file.

In part 1, we will convert the json file to tabular format, i.e., to the space-separated value (SSV) file format. This file format is very close to the csv file format and is the prefered file format byt the Unicage tools. After that we will translate the field’s names and the values of the field “shelf”/“prateleira”.

The first step in converting the json file into a tabular format is to use Unicage’s rjson command to parse the file

[user@unicage]$ rjson model01_file.json
1 prateleira superior
1 produto 34EW4
1 maquina C010
1 hora 16:29
1 data 24-01-2021
2 prateleira meio
2 produto 21WX5
...

where the number of the event is in the first column, the name of the data field is in the second column and the value of the data field in the third. Having the data in this format, we can proceed to the task of translating the field’s names according to the rule

shelf     <-->  prateleira
item      <-->  produto
equipment <-->  maquina
time      <-->  hora
date      <-->  data

This rule is saved in the file translations_types

[user@unicage]$ cat translations_types
data date
hora time
maquina equipment
prateleira shelf
produto item

where we have on the left column the names in English and, on the right column, the corresponding names in Portuguese. We will use this file in conjunction with the tabular form data as inputs to Unicage’s cjoin2 command, like this

[user@unicage]$ rjson model01_file.json | cjoin2 key=2 translations_type -
1 prateleira shelf superior
1 produto item 34WE4
1 maquina equipment C010
1 hora time 16:29
1 data date 24-01-2021
2 prateleira shelf meio
2 produto item 58W77
...

where in front of each portuguese word, the corresponding english word was added. We can do the same thing for the values of the field “shelf”/“prateleira”. Using the file translations_field

[user@unicage]$ cat translations_field
inferior lower
meio middle
superior upper

we apply Unicage’s cjoin2 command again

[user@unicage]$ rjson model01_file.json | cjoin2 key=2 translations_type - | cjoin2 +? key=4 translations_field -
1 prateleira shelf superior upper
1 produto item 34WE4 ?
1 maquina equipment C010 ?
1 hora time 16:29 ?
1 data date 24-01-2021 ?
2 prateleira shelf meio middle
2 produto item 58W77 ?
...

Now the data stream has 5 fields. The rightmost field contains the translation of the value of the field “shelf”/“prateleira” in the corresponding lines and “?” in all the other lines. To complete this process, we need to select the relevant fields and we’ll do it with Unicage’s self command and with the well-known awk command

[user@unicage]$ rjson model01_file.json | cjoin2 key=2 translations_type - | 
cjoin2 +? key=4 translations_field -
| self 1 3 4 5 | awk '{if($4 != "?")
{$3 = $4}; print $1, $2, $3}'
1 shelf upper 1 item 34WE4 1 equipment C010 1 time 16:29 1 date 24-01-2021 2 shelf middle 2 item 58W77 ...

We completed the conversion of the json file to a tabular format and we were able to translate some words from Portuguese to English. Now that we have the complete pipeline we can save the result in a file we will call converted_json_UF.

We finish with the complete pipeline of commands with comments:

# 
# **** PART 1 ****
# ================
#
# --- a) parse the json file
rjson model01_file.json                                 |
# 1:event 2:field_name(portuguese) 3:field_value
# --- b) add the translation of the field names
cjoin2 key=2 translations_type -                        |
# 1:event 2:field_name(portuguese) 3:field_name(english) 4:field_value
# --- c) add the translation of values of one field
cjoin2 +? key=4 translations_field -                    |
# 1:event 2:field_name(portuguese) 3:field_name(english) 4:field_value 5:translated_field_value
# --- d) discard the field names in portuguese
self 1 3 4 5                                            |
# 1:event 2:field_name(english) 3:field_value 4:translated_field_value
# --- e) select the values of "shelf" in english
awk '{if($4 != "?"){$3 = $4}; print $1, $2, $3}'        > converted_json_UF
# 1:event 2:field_name(english) 3:field_value

Find out more

Request a demo and speak with our team about how you can leverage the power of Unicage in your organization.

Privacy Policy