Suppose the most important televised event of the year happens tonight. You sit comfortably in front of the TV, smartphone at hand, ready to be updated by your social media feeds. The reporters and commentators you see on TV and the notifications from the different social media platforms on your phone are sources of information for which you are the endpoint. Data flows towards you at an incredible rate, but how much of that data do you use to build knowledge about the event?
Whatever amount of data you retain after the event is minute compared to all the content you missed. Of all the data you received from the many available sources of information, most was lost simply because you were receiving it in a careless, relaxed way. Bringing this example back to the domain of data analysis, one can ask: given the effort put into obtaining and storing business-related data, shouldn't all of its contents be used?
To release the full potential of any data stream, the pipelines that deal with it must be carefully organized. After the data is received and stored, the next two steps in the pipeline are data preparation and data exploration. In the exploration step, it is important to look into every detail of the data, so the success of data exploration rests on the quality of the data preparation. If the data preparation process is efficient and transparent, all the energy needed to explore the data can be devoted to that task instead of being dispersed elsewhere.
Unicage provides the ideal tools to make the data preparation stage a maintainable and transparent process that can be easily understood and adapted.
All the tools provided by Unicage are brought together in scripts that are applied to data files. Each script is a list of elementary operations applied to the files in order to transform their content.
Suppose the first 5 lines of originalFILE.csv are:
0001,20201220,0921,azsw,24,SW
0005,20201222,1916,dfgv,99,N
0007,20201217,1243,fasd,05,SE
0083,20201215,2311,dkjg,76,NW
0932,20201228,1451,cjkd,73,W
If we want to isolate the data in the first two columns and interpret the second column as a date, we can use Unicage commands to write the following script:
fromcsv originalFILE.csv |
self 1 2 |
dayslash --input yyyymmdd --output dd-mm-yyyy 2 |
tocsv > newFILE.csv
This script is no more difficult to read than a list of sequential tasks. It processes the file line by line, applying all the required filters and transformations. The final result is the file newFILE.csv, which looks like:
0001,20-12-2020
0005,22-12-2020
0007,17-12-2020
0083,15-12-2020
0932,28-12-2020
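For readers who want to try this transformation without Unicage installed, a rough equivalent can be sketched with standard POSIX tools. The awk one-liner below is an illustrative stand-in for the fromcsv/self/dayslash/tocsv pipeline above, assuming a simple CSV with no quoted fields:

```shell
# Create a small sample in the same layout as the article's example.
printf '%s\n' \
  '0001,20201220,0921,azsw,24,SW' \
  '0005,20201222,1916,dfgv,99,N' > originalFILE.csv

# Keep columns 1 and 2, rewriting the yyyymmdd date as dd-mm-yyyy.
awk -F, '{
  d = $2
  printf "%s,%s-%s-%s\n", $1, substr(d,7,2), substr(d,5,2), substr(d,1,4)
}' originalFILE.csv > newFILE.csv

cat newFILE.csv
# 0001,20-12-2020
# 0005,22-12-2020
```

The Unicage version spreads the same work over named, single-purpose commands, which is what keeps the script readable as a list of tasks.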
By orchestrating the data preparation process with scripts of Unicage commands, the whole pipeline is made transparent: it can be readily communicated, easily edited, and intuitively scaled. In fact, a script written once to prepare data in a given file can be applied, without extra effort, to an arbitrary number of similar files, an arbitrary number of times. A script is written once and reused indefinitely.
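This write-once, reuse-everywhere pattern can be sketched with an ordinary shell loop. The script name prepare.sh and the awk body standing in for the Unicage commands are illustrative assumptions, not part of the Unicage toolset itself:

```shell
# Save the preparation steps once, as an executable script.
cat > prepare.sh <<'EOF'
#!/bin/sh
# Keep columns 1-2 and reformat the yyyymmdd date as dd-mm-yyyy;
# awk stands in here for the Unicage pipeline shown in the article.
awk -F, '{
  printf "%s,%s-%s-%s\n", $1, substr($2,7,2), substr($2,5,2), substr($2,1,4)
}' "$1"
EOF
chmod +x prepare.sh

# Two similar files arriving over time (sample data).
printf '0001,20201220,0921,azsw,24,SW\n' > jan.csv
printf '0005,20201222,1916,dfgv,99,N\n'  > feb.csv

# The same script, applied to every similar file without extra effort.
mkdir -p prepared
for f in *.csv; do
  ./prepare.sh "$f" > "prepared/$f"
done
```

Because the script takes its input file as a parameter, scaling to new files means adding them to the loop, not editing the preparation logic.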
The process of data preparation using the Unicage method is as simple as it can possibly get. All the necessary steps are organized in easy-to-understand scripts that can be applied directly to data files without any intermediary software. The resources used are devoted entirely to data preparation, which frees up memory for the analysis and visualization tools that operate on the cleaned (or prepared) data. Only by optimizing every step in a data pipeline can its full informative potential be released.