How to combine legacy sensor data with new sensor data systems for reporting

Use Case (Part 3)

Taming data in different formats is a common challenge in any data pipeline. In what follows, we will consider a scenario in which such a challenge occurs: combining data coming from different types of sensors. Sensors from different manufacturers follow different standards, which demands the inclusion of a normalization procedure. This procedure must reconcile all the data formats, so that all the information potential can be extracted from the data.
 
In part 1, we:
  • Identified the differences between the two datasets from the two sensor types
  • Converted the JSON file to a tabular format SSV file
  • Translated field names that were in different languages
  • Completed the conversion to tabular format to create a complete data pipeline
In part 2, we:
  • Converted the XML file to a tabular format SSV file
  • Merged Event / Events columns
  • Completed the conversion to tabular format to create a complete data pipeline

In part 3, we will combine the files converted_json_UF (produced in part 1) and converted_xml_UF (produced in part 2) into a single file csv file. We start by transposing the data with Unicage’s command map

[user@unicage]$ map num=1 converted_json_UF
* date equipment item shelf time
1 24-01-2021 C010 34WE4 upper 16:29
2 24-01-2021 C024 58W77 middle 20:11
...
[user@unicage]$ map num=1 converted_xml_UF
* date equipment item shelf time
1 2021-04-21 C012 87WV2 upper 11:05
2 2021-04-21 C014 32WI7 middle 03:23
...

Having the data in this format, we will harmonize the dates so that they conform to the requirement. We notice that the file converted_xml_UF has the date in the format YYYY-MM-DD, which is almost the same as the date format of the requirement. Since the file converted_json_UF has the date in the format DD-MM-YYYY, we will start by using Unicage’s command dayslash

[user@unicage]$ map num=1 converted_json_UF | tail -n+2 | dayslash --input 
dd-mm-yyyy --output yyyy-mm-dd 2

1 2021-01-24 C010 34WE4 upper 16:29
2 2021-01-24 C024 58W77 middle 20:11
...

where we used the command tail to remove the header and we used the options –input and –output of the dayslash command to specify the date formats involved. We can save the data in this format in a temporary file named temp-json_data

[user@unicage]$ map num=1 converted_json_UF | tail -n+2 | dayslash --input dd-mm-yyyy --output yyyy-mm-dd 2 > temp-json_data

Then, we will isolate the header of the file using the head command

[user@unicage]$ map num=1 converted_json_UF | head -1
* date equipment item shelf time

and we save the result in a temporary file named temp-header

[user@unicage]$ map num=1 converted_json_UF | head -1 > temp-header

Finally we can go back to the file converted_xml_UF and prepare it to be concatenated with temp-json_data by removing the header

[user@unicage]$ map num=1 converted_xml_UF | tail -n+2
1 2021-04-21 C012 87WV2 upper 11:05
2 2021-04-21 C014 32WI7 middle 03:23
...

Now, the concatenation is made with the cat command

[user@unicage]$ map num=1 converted_xml_UF | tail -n+2 | cat temp-json_data - 
1 2021-04-21 C012 87WV2 upper 11:05
2 2021-04-21 C014 32WI7 middle 03:23
...
1 2021-01-24 C010 34WE4 upper 16:29
2 2021-01-24 C024 58W77 middle 20:11
...

which leaves the data ready for the last operation on the date field: we will use Unicage’s dayslash command to convert the date from YYYY-MM-DD to YYYY/MM/DD

[user@unicage]$ map num=1 converted_xml_UF | tail -n+2 | cat temp-json_data - 
| dayslash --input yyyy-mm-dd --output yyyy/mm/dd 2
1 2021/01/24 C010 34WE4 upper 16:29
2 2021/01/24 C024 58W77 middle 20:11
...
1 2021/04/21 C012 87WV2 upper 11:05
2 2021/04/21 C014 32WI7 middle 03:23
...

Finally, we concatenate the header and remove the field with the event number (using Unicage’s self command)

[user@unicage]$ map num=1 converted_xml_UF | tail -n+2 | cat temp-json_data - 
| dayslash --input yyyy-mm-dd --output yyyy/mm/dd 2 | cat temp-header - | self 2 3 4 5 6
date equipment item shelf time
2021/01/24 C010 34WE4 upper 16:29
2021/01/24 C024 58W77 middle 20:11
...
2021/04/21 C012 87WV2 upper 11:05
2021/04/21 C014 32WI7 middle 03:23
...

and convert to csv format using Unicage’s tocsv command

[user@unicage]$ map num=1 converted_xml_UF | tail -n+2 | cat temp-json_data - 
| dayslash --input yyyy-mm-dd --output yyyy/mm/dd 2 | cat temp-header - | self 2 3 4 5 6 | tocsv
date,equipment,item,shelf,time
2021/01/24,C010,34WE4,upper,16:29
2021/01/24,C024,58W77,middle,20:11
...
2021/04/21,C012,87WV2,upper,11:05
2021/04/21,C014,32WI7,middle,03:23
...

We save the final result in a file named combined_data.csv

[user@unicage]$ map num=1 converted_xml_UF | tail -n+2 | cat temp-json_data - 
| dayslash --input yyyy-mm-dd --output yyyy/mm/dd 2 | cat temp-header - |
self 2 3 4 5 6
| tocsv > combined_data.csv

and we finish this sequence of articles with a single script containing all the pipelines used:

# 
# **** PART 1 ****
# ================
#
# --- a) parse the json file
rjson model01_file.json                                 |
# 1:event 2:field_name(portuguese) 3:field_value
# --- b) add the translation of the field names
cjoin2 key=2 translations_type -                        |
# 1:event 2:field_name(portuguese) 3:field_name(english) 4:field_value
# --- c) add the translation of values of one field
cjoin2 +? key=4 translations_field -                    |
# 1:event 2:field_name(portuguese) 3:field_name(english) 4:field_value 5:translated_field_value
# --- d) discard the field names in portuguese
self 1 3 4 5                                            |
# 1:event 2:field_name(english) 3:field_value 4:translated_field_value
# --- e) select the values of "shelf" in english
awk '{if($4 != "?"){$3 = $4}; print $1, $2, $3}'        > converted_json_UF
# 1:event 2:field_name(english) 3:field_value


#
# **** PART 2 ****
# ================
#
# --- a) parse the xml file
xmldir events/event model02_file.xml                    |
# 1:hierarchy_tag 2:hierarchy_tag 3:field_name 4:field_value
# --- b) discard the hierarchy tags
self 3 4                                                |
# 1:field_name 2:field_value
# --- c) adding an event number to the data
awk 'BEGIN{event_number=1;event_counter=0}
          {if(event_counter == 5){event_counter = 1;event_number++}
          else{event_counter++};
          print event_number, $0}'                     > converted_xml_UF
# 1:event 2:field_name 3:field_value

#
# **** PART 3 ****
# ================
#

# *** 3.1: preparing the json data
# --------------------------------
# --- a) transposing the json data
map num=1 converted_json_UF                 |
# 1:event 2:date(DD-MM-YYYY) 3:equipment 4:item 5:shelf 6:time
# --- b) discarding the header
tail -n+2                       |
# 1:event 2:date(DD-MM-YYYY) 3:equipment 4:item 5:shelf 6:time
# --- c) converting the date format
dayslash --input dd-mm-yyyy --output yyyy-mm-dd 2   > temp-json_data
# 1:event 2:date(YYYY-MM-DD) 3:equipment 4:item 5:shelf 6:time


# *** 3.2: separating the header
# ------------------------------
# --- a) transposing the json data
map num=1 converted_json_UF                 |
# 1:event 2:date 3:equipment 4:item 5:shelf 6:time
# --- b) isolating the header
head -1                         > temp-header
# 1:event 2:date 3:equipment 4:item 5:shelf 6:time


# *** 3.3: combining all the data and producing the csv file
# ----------------------------------------------------------
# --- a) transposing the xml data
map num=1 converted_xml_UF              |
# 1:event 2:date(YYYY-MM-DD) 3:equipment 4:item 5:shelf 6:time
# --- b) discarding the header
tail -n+2                       |
# 1:event 2:date(YYYY-MM-DD) 3:equipment 4:item 5:shelf 6:time
# --- c) combining with the json data in the temporary file
cat temp-json_data -                    |
# 1:event 2:date(YYYY-MM-DD) 3:equipment 4:item 5:shelf 6:time
# --- d) converting the format of the date field
dayslash --input yyyy-mm-dd --output yyyy/mm/dd 2   |
# 1:event 2:date(YYYY/MM/DD) 3:equipment 4:item 5:shelf 6:time
# --- e) concatenating the header
cat temp-header -                   |
# 1:event 2:date(YYYY/MM/DD) 3:equipment 4:item 5:shelf 6:time
# --- f) discarding the field "event"
self 2 3 4 5 6                      |
# 1:date(YYYY/MM/DD) 2:equipment 3:item 4:shelf 5:time
# --- g) converting to csv format
tocsv                           > combined_data.csv
# date(YYYY/MM/DD),equipment,item,shelf,time

Find out more

Request a demo and speak with our team about how you can leverage the power of Unicage in your organization.

Privacy Policy