Reporting from CDR Data:

Use Case (Part 5)

In order to keep track of the usage of their infrastructure, telecommunications companies rely on Call Detail Records (CDRs). These are standardized (see https://www.fcs.org.uk/membergroups/billing) records that are created to document every exchange a given communication device makes with the network. CDR data sit at the base of the billing calculations, which means that its analysis is a fundamental element in the Telco industry.

A CDR contains a wealth of information that characterizes the usage of a network by its users. For instance, a phone call generates a CDR that contains the phone numbers involved in the call, the time, the date and the duration of the call. We will describe how to deal with this type of data using the Unicage method in 5 parts:

  • Converting the data to “space-separated value”format;
  • Cleaning the data and assessing its quality (part 1);
  • Cleaning the data and assessing its quality (part 2);
  • Performing summations over all CDRs;
  • Joining the information and producing the results of the complete analysis.

This is part 5.

Read the previous parts

In this final step, we want to bring together all the work developed in the previous four parts. At this point, this is how our working directory looks like

[ user@unicage ]$ ls -lh
-rw -rw -r-- 1 user user 5.8G Jan 12 10:18 Bogus_CDR_Data .csv
-rw -rw -r-- 1 user user 4.5G Jan 12 10:20 Bogus_CDR_Data_UF
-rw -rw -r-- 1 user user 1.1G Jan 12 10:22 tmp_phone_calls_duration
-rw -rw -r-- 1 user user 215K Feb 11 10:23 tmp_number_total_duration

We have the original CDR data in csv format (Bogus_CDR_Data.csv), the same CDR data in the Unicage format (Bogus_CDR_Data_UF), the filtered file tmp_phone_calls_duration containing data a CDR data subset:

[ user@unicage ]$ head -5 tmp_phone_calls_duration
+351999990000 12/02/2020 01:00:00 +351999994375 3558
+351999990000 12/02/2020 01:00:00 +351999991637 1076
+351999990000 12/02/2020 01:00:00 +351999990235 1271
+351999990000 12/02/2020 01:00:00 +351999993094 1804
+351999990000 12/02/2020 01:00:00 +351999990902 102

and a the file tmp_number_total_duration containing the sum of the duration of all the calls made by each phone number:

[ user@unicage ]$ head -5 tmp_number_total_duration
+351999990000 4101500
+351999990001 3427680
+351999990002 3873560
+351999990003 4553480
+351999990004 3258580

Now we want to add to each record in the file tmp_phone_calls_duration, where we have the records for all the phone calls, the total call duration of each phone number, which is the content of file tmp_number_total_duration.

Figure 1: How the join2 command works. Notice that the key used is the first field of both files.

We use Unicage’s join2 command to add the information of the latter file to the information of tmp_phone_calls_duration:

join2 key =1 tmp_number_total_duration tmp_phone_calls_duration

After the join2 command, the data looks like this:

+351999990000 4101500 12/02/2020 01:00:00 +351999994375 3558
+351999990000 4101500 12/02/2020 01:00:00 +351999991637 1076
+351999990000 4101500 12/02/2020 01:00:00 +351999990235 1271
+351999990000 4101500 12/02/2020 01:00:00 +351999993094 1804
+351999990000 4101500 12/02/2020 01:00:00 +351999990902 102

Notice that, instead of 5 fields, we have 6: after the caller’s phone number we have a new field, that corresponds to the total duration of all the phone calls made by the corresponding number – precisely the information that was in the file tmp_phone_calls_duration (see Figure 1 for a description of the join2 command). With this information, we can calculate the percentage of time a given phone call took with respect to the total duration of calls; we do this using Unicage’s lcalc command:

lcalc '$1 , $3 , $4 , $5 , $6 , $6/$2 *100 '

After using this command, the data in the pipeline is organized as follows

+351999990000 12/02/2020 01:00:00 +351999994375 3558 0.086748750457149800
+351999990000 12/02/2020 01:00:00 +351999991637 1076 0.026234304522735500
+351999990000 12/02/2020 01:00:00 +351999990235 1271 0.030988662684383700
+351999990000 12/02/2020 01:00:00 +351999993094 1804 0.043983908326222100
+351999990000 12/02/2020 01:00:00 +351999990902 102 0.002486895038400500

i.e., we organized the fields so that the field of the total duration is excluded, but before excluding it, we used its value in the following calculation

duration_of_call
total_duration x100

The result of the calculation was placed in the sixth field of each line. This indicates, for each call, the percentage of the call’s duration with respect to the duration of all the calls made by the caller’s phone number. 

Finally, and in order to get a good presentation of the numbers, we apply Unicage’s round command to the value of the percentage so that only three decimal places are kept and we save the result in the file FINAL_RESULT.

round 6.3                 > FINAL_RESULT

The file FINAL_RESULT looks like this:

[ user@unicage ]$ head -5 FINAL_RESULT
+351999990000 12/02/2020 01:00:00 +351999994375 3558 0.087
+351999990000 12/02/2020 01:00:00 +351999991637 1076 0.026
+351999990000 12/02/2020 01:00:00 +351999990235 1271 0.031
+351999990000 12/02/2020 01:00:00 +351999993094 1804 0.044
+351999990000 12/02/2020 01:00:00 +351999990902 102 0.002

The complete script 

One can put all the commands just described in a script that can be schedulled to run with as many CDR files as we want and that will perform all these tasks to each of the files. 

In a script, the steps happen one after the other. I.e., what is written first is executed first. This means that we can write a sequence of commands that will run without ambiguity and will accomplish, without failures, the task that is described.

# ------------------------------------
# convert from .csv to Unicage Format
# ------------------------------------
fromcsv Bogus_CDR_Data .csv > Bogus_CDR_Data_UF
# ----------------------------------
# 1. Isolating the data needed for our analysis
self 3 5 6 4 7 Bogus_CDR_Data_UF |
# 1: CustomerIdentifier 2: CallDate 3: CallTime 4: NumberDialled 5: Duration
# ----------------------------------
# 2. Applying filters to the phone number making the call

awk 'length ($1) == 13' |
awk '$1 ~ /^+[0 -9]{12}/ ' |
awk 'substr ($1 ,2 ,3) == "351"' |
# ----------------------------------
# 3. Applying filters to the phone number receiving the call
awk 'length ($4) == 13' |
awk '$4 ~ /^+[0 -9]{12}/ ' |
awk 'substr ($4 ,2 ,3) == "351"' |
# ----------------------------------
# 4. sorting by phone number tha made the call
msort key =1 > tmp_phone_calls_duration
# 1: CustomerIdentifier 2: CallDate 3: CallTime 4: NumberDialled 5: Duration
self 1 5 tmp_phone_calls_duration |
# 1: CustomerIdentifier 2: Duration
# ----------------------------------
# 4. calculate the total " call duration " for each number
sm2 1 1 2 2 > tmp_number_total_duration
# 1: CustomerIdentifier 2: Duration ( total )
# --- 4. joining the total durantion for each number
join2 key =1 tmp_number_total_duration tmp_phone_calls_duration |
# 1: CustomerIdentifier 2: Duration ( total ) 3: CallDate 4: CallTime 5: NumberDialled 6: Duration
# ----------------------------------
# --- 5. Calculating the percentage of duration of each call with
# --- respect to the total durantion of each client
lcalc '$1 , $3 , $4 , $5 , $6 , $6/$2 *100 ' |
# 1: CustomerIdentifier 2: CallDate 3: CallTime 4: NumberDialled 5: Duration 6: Duration (
percentage )
# ----------------------------------
# --- 6. Rounding percentage to two decimal places
round 6.3 > FINAL RESULT

Want to learn more?

Find out more about how Unicage can help telecommunications businesses handle large volumes of call detail records.

Find out more

Request a demo and speak with our team about how you can leverage the power of Unicage in your organization.

Privacy Policy