Reporting from CDR Data:

Use Case (Part 3)

In order to keep track of the usage of their infrastructure, telecommunications companies rely on Call Detail Records (CDRs). These are standardized (see records that are created to document every exchange a given communication device makes with the network. CDR data sit at the base of the billing calculations, which means that its analysis is a fundamental element in the Telco industry.

A CDR contains a wealth of information that characterizes the usage of a network by its users. For instance, a phone call generates a CDR that contains the phone numbers involved in the call, the time, the date and the duration of the call. We will describe how to deal with this type of data using the Unicage method in 5 parts:

  • Converting the data to “space-separated value”format;
  • Cleaning the data and assessing its quality (part 1);
  • Cleaning the data and assessing its quality (part 2);
  • Performing summations over all CDRs;
  • Joining the information and producing the results of the complete analysis.

This is part 3. In part 1 we selected, out of the full set of CDR elements, only the fields “CustomerIdentifier”, “CallDate”, “CallTime”, “NumberDialled” and “Duration”, and then we applied two filters to the field “CustomerIdentifier” (which is a phone number):

  • all the phone numbers have to have exactly 13 characters (including the “+”);
  • all the phone numbers must have as the 1st character a “+” and from the 2nd to the 13th character only digits;

In this part of the process we will describe how to apply the filter that selects phone numbers with Portugal’s country code to field “CustomerIdentifier” and how to expand the filters described in part 2 and part 3 of this series of articles to other fields in the CDR data.

We select only the numbers which have the country code of Portugal by writing:

awk 'substr ($1 ,2 ,3) == "351"'          |

Similarly to what was used in part 3, we are using a built-in function of awk: the substr function. In this case, we are saying “in the first field, starting to count on the second character, the next three characters should be equal to 351”. The complete pipeline looks like this:

awk 'length ($1) == 13'             |
awk '$1 ~ /^+[0 -9]{12}/ ' |
awk 'substr ($1 ,2 ,3) == "351"' |

Now, we have to do the same for the phone number that received the call, i.e., the phone number that is present in the fourth field (“NumberDialed”). Well, that is easy, we just write:

awk 'length ($4) == 13'           |
awk '$4 ~ /^+[0 -9]{12}/ ' |
awk 'substr ($4 ,2 ,3) == "351"' |

The only difference between these two blocks of commands is that the first contains $1 where the second contains $4. This happens because from the point of view of these commands, the data they see is the data that was isolated by the self command, and the self command placed the phone numbers in the 1st and 4th field.

The last command of this pipeline is Unicage’s msort command. This command sorts data. In this case, it will sort by the phone number that placed the call, i.e., the element in the first field.

msort key =1                                    > tmp_phone_calls_duration

We indicate that we want the sorting to be made according to the elements in the first field by writing key=1. Because this command is the last one in the pipeline, it writes the results of its action to the file tmp_phone_calls_duration. This is a temporary file because it contains an intermediary result that will be used as we continue our data manipulation task. 

We can look at this temporary file:

[ user@unicage ]$ ls -lh
-rw -rw -r-- 1 user user 5.8G Jan 12 10:18 Bogus_CDR_Data .csv
-rw -rw -r-- 1 user user 4.5G Jan 12 10:20 Bogus_CDR_Data_UF
-rw -rw -r-- 1 user user 1.1G Jan 12 10:22 tmp_phone_calls_duration

tmp_phone_calls_duration is smaller than Bogus_CDR_Data_UF because it contains only 5 fields out of the 42 that exist in Bogus_CDR_Data_UF. Moreover, some inconsistent lines may have been discarded after the application of the data quality filter with the awk command. Here’s how the first five lines of tmp_phone_calls_duration look like:

[ user@unicage ]$ head -5 tmp_phone_calls_duration
+351999990000 12/02/2020 01:00:00 +351999994375 3558
+351999990000 12/02/2020 01:00:00 +351999991637 1076
+351999990000 12/02/2020 01:00:00 +351999990235 1271
+351999990000 12/02/2020 01:00:00 +351999993094 1804
+351999990000 12/02/2020 01:00:00 +351999990902 102

We see only five fields. So far, so good.

Want to learn more?

Find out more about how Unicage can help telecommunications businesses handle large volumes of call detail records.

Check out the previous articles in this series:

Find out more

Request a demo and speak with our team about how you can leverage the power of Unicage in your organization.

Privacy Policy