Reporting from CDR Data:

Use Case (Part 2)

In order to keep track of the usage of their infrastructure, telecommunications companies rely on Call Detail Records (CDRs). These are standardized (see records that are created to document every exchange a given communication device makes with the network. CDR data sit at the base of the billing calculations, which means that its analysis is a fundamental element in the Telco industry.

A CDR contains a wealth of information that characterizes the usage of a network by its users. For instance, a phone call generates a CDR that contains the phone numbers involved in the call, the time, the date and the duration of the call. We will describe how to deal with this type of data using the Unicage method in 5 parts:

  • Converting the data to “space-separated value”format;
  • Cleaning the data and assessing its quality (part 1);
  • Cleaning the data and assessing its quality (part 2);
  • Performing summations over all CDRs;
  • Joining the information and producing the results of the complete analysis.

This is part 2 of the process. In part 1 we converted a csv file Bogus_CDR_Data.csv with some bogus CDR data into a “space-separated value”file (the Unicage format), which was named Bogus_CDR_Data_UF:

[ user@unicage ]$ ls -lh
-rw -rw -r-- 1 user user 5.8G Jan 12 10:18 Bogus_CDR_Data .csv
-rw -rw -r-- 1 user user 4.5G Jan 12 10:20 Bogus_CDR_Data_UF

Notice that the converted file is smaller; its first line looks like this:

[ user@unicage ]$ head -1 Bogus_CDR_Data_UF
G 0 +351999990665 +351999994370 12/02/2020 01:00:00 2865 837384086 206787737 Faro _
Special3 .2577 1221 +351999990321 2747 BT312CR Name 1 E PT NET2 _ _ _ +351999996172 719
_ EUR _ _ _ _ 1044171823 userID52260 _ _ 361464 _ _

Now that we have the data in the Unicage format, we can start focusing on the calculations we want to make. Because we just want to calculate the total duration of the calls, we won’t need all the fields contained in the CDR data. We will prepare the necessary fields in three steps:

  • Isolate the relevant data;
  • Guarantee the quality of the data;
  • Sort the data.

To isolate the relevant CDR data, we will use the self command. This is a Unicage command that selects fields out of each line on a file. Since we are interested only in the fields “CustomerIdentifier”, “CallDate”, “CallTime”, “NumberDialled” and “Duration”, and we know that these fields are at positions 3, 5, 6, 4 and 7, respectively, in the complete CDR data file, we will write:

self 3 5 6 4 7 Bogus_CDR_Data_UF        |

to isolate those fields. Notice the vertical bar (|) after the command: this means that the command we write after this vertical bar will receive only the five fields that we isolated using the self command. By doing this we are effectively building a pipeline of commands, in which each command passes on to the next the result of its action. 

The next command in the pipeline will be the awk command. This is not a Unicage command. awk is one of the most popular commmands that comes pre-installed with almost any Unix-based operating system. The use of this command illustrates the simplicity with which one can integrate the Unicage tools with the time-tested and popular tools the grew in the Unix environment. We will use the awk command to implement a data-quality filter with the following requirements:

  1. we want the phone numbers to have exactly 13 characters (including the “+”);
  2. we want that the 1st character is a + and from the 2nd to the 13th character we have digits (the actual number including the country code);
  3. we want the 2nd, 3rd and 4th digit to be, respectively, 3, 5 and 1, the country code for Portugal.

We can use awk to implement a filter on the number of characters of the phone number which placed the call by writing

awk 'length ($1) == 13'            |

This means that only the phone numbers with 13 digits will be sent to the next stage of the pipeline. We instruct awk by using the built-in function length and we indicate that we want that function to be applied to the phone number that placed the call by writing $01. This means “apply the length function to the first field of each line”. This makes sense because, from the point of view of the awk command, all the lines that it receives have only the following fields: “CustomerIdentifier”, “CallDate”, “CallTime”, “NumberDialled” and “Duration”, i.e., the fields that were isolated, in this order, by the self command. Notice, again, that after the awk command, we put a vertical line: we are continuing the pipeline.

Next, we apply another filter:

awk '$1 ~ /^+[0 -9]{12}/ '           |

In this filter, we are saying that only the phone numbers that start with a “+” sign followed by exactly 12 digits should pass on to the next stage of the pipeline. We do this using the language of Regular Expressions, which awk understands. Notice, again, that we indicate explicitly that we want this filter to be applied to the first field by using $1.

In part 3 we will complete this process by applying the filter that selects phone numbers with Portugal’s country code and by sorting the data. 

Want to learn more?

Find out more about how Unicage can help telecommunications businesses handle large volumes of call detail records.

Check out the orther articles in this series:

Find out more

Request a demo and speak with our team about how you can leverage the power of Unicage in your organization.

Privacy Policy