#9 Data Engineering-TRANSFORM DATA-Cleaning the data(ETL Pipelines)

Sakshi Agarwal
8 min readNov 8, 2021

--

This is the ninth blog in the series of posts related to Data Engineering. I have been writing all the important things that I learn as a part of the Data Scientist Nanodegree Program, Udacity. I have realized it is the best way to test my understanding of the course material and maintain the discipline to study. Please check the other posts as well.

Now that we know how to combine data, let’s clean the dirty data. Dirty data is the data that contain errors. For eg. One might forget to put a decimal in the number 10.198 and the number ends up being 10198. This kind of data can come from data entry mistakes, duplicate data, missing values, incomplete records, or inconsistencies between datasets. So, the first step to do after compiling the data is to audit it and look for errors.

We should find missing values, duplicates, inconsistent values, and incorrect encodings. If we don’t clean the data, our ML model will not give us the right results.

Let’s get to the practice. As usual, the GitHub code can be found here.

Part -1 Explore the data

Let’s first read in the dataset using pandas and drop the last column.

import pandas as pd

# read in the population data and drop the final column
df_indicator = pd.read_csv('../data/population_data.csv',skiprows=4)
df_indicator.drop(['Unnamed: 62'], axis=1, inplace=True)

# read in the projects data set with all columns type string
df_projects = pd.read_csv('../data/projects_data.csv', dtype=str)
df_projects.drop(['Unnamed: 56'], axis=1, inplace=True)

Then we will output the unique country names and ISO abbreviations in the population indicator data set. You’ll notice a few values that represent world regions such as ‘East Asia & Pacific’ and ‘East Asia & Pacific (excluding high income)’.

df_indicator[['Country Name', 'Country Code']].drop_duplicates()

We will run the next code cell to see the unique country names in the project data set. Notice that the projects data has two columns for country name. One is called ‘countryname’ and the other is called ‘Country’. The ‘Country’ column only has NaN values.

Another thing of note: It would’ve been easier to join the two data sets together if the projects data had the ISO country abbreviations like the indicator data has. Unfortunately, the projects data does not have the ISO country abbreviations. To join these two data sets together, you essentially have two choices:

  • add a column of ISO 3 codes to the projects data set
  • find the difference between the projects data country names and indicator data country names. Then clean the data so that they are the same.

Run the code cell below to see what the project countries look like:

df_projects[‘countryname’].unique()

Part 2 — Use the Pycountry library

Did you notice a pattern in the projects data country names? The entries are repeated and separated by a semi-colon like this:

'Kingdom of Spain;Kingdom of Spain'
'New Zealand;New Zealand'

The first step is to clean the country name column and get rid of the semi-colon. Do that below:

# TODO: In the df_projects dataframe, create a new column called #'Official Country Name' so that the country name only appears once. 
# For example, `Republic of Malta;Republic of Malta` should be #`Republic of Malta`.
#
# HINT: use the split() method - see https://pandas.pydata.org/pandas-docs/stable/text.html for examples
# HINT: with pandas, you can do all of this with just one line of code
###

df_projects['Official Country Name'] = df_projects['countryname'].str.split(";").str.get(0)

Let’s see it now

df_projects.head()

It looks like the projects data set has official country names. Hence, this data set has an entry like “Kingdom of Spain” whereas the indicators data has just “Spain”.

Luckily, someone has developed a Python library called pycountry. This library has country names, ISO abbreviations, and official country names. While you might not be able to clean all of the data with the help of this Python library, it will probably help.

Run the code cells below to install the pycountry library and see how it works.

# Run this code cell to install and import the pycountry library
!pip install pycountry
from pycountry import countries
OUT:Collecting pycountry
Downloading https://files.pythonhosted.org/packages/76/73/6f1a412f14f68c273feea29a6ea9b9f1e268177d32e0e69ad6790d306312/pycountry-20.7.3.tar.gz (10.1MB)
100% |████████████████████████████████| 10.1MB 3.2MB/s eta 0:00:01 18% |██████ | 1.9MB 35.5MB/s eta 0:00:01 47% |███████████████▍ | 4.9MB 32.0MB/s eta 0:00:01 62% |████████████████████ | 6.3MB 29.3MB/s eta 0:00:01
Building wheels for collected packages: pycountry
Running setup.py bdist_wheel for pycountry ... done
Stored in directory: /root/.cache/pip/wheels/33/4e/a6/be297e6b83567e537bed9df4a93f8590ec01c1acfbcd405348
Successfully built pycountry
Installing collected packages: pycountry
Successfully installed pycountry-20.7.3

Let us check the library now:

#Run this code cell to see an example of how the library works
countries.get(name='Spain')
OUT:Country(alpha_2='ES', alpha_3='ESP', name='Spain', numeric='724', official_name='Kingdom of Spain')#Run this code cell to see how you can also look up countries #without specifying the keycountries.lookup('Kingdom of Spain')OUT:Country(alpha_2='ES', alpha_3='ESP', name='Spain', numeric='724', official_name='Kingdom of Spain')

The goal is to add the ISO codes to the projects data set. To start, use the pycountry library to make a dictionary mapping the unique countries in ‘Official Country Name’ to the ISO code.

Iterate through the unique countries in df_projects[‘Official Country Name’]. Create a dictionary mapping the ‘Country Name’ to the alpha_3 ISO abbreviations.

The dictionary should look like: {'Kingdom of Spain':'ESP'}

If a country name cannot be found in the pycountry library, add it to a list called country_not_found.

# set up the libraries and variables
from collections import defaultdict
country_not_found = [] # stores countries not found in the pycountry library
project_country_abbrev_dict = defaultdict(str) # set up an empty dictionary of string values

# TODO: iterate through the country names in df_projects.
# Create a dictionary mapping the country name to the alpha_3 ISO code
for country in df_projects['Official Country Name'].drop_duplicates().sort_values():
try:
# TODO: look up the country name in the pycountry library
# store the country name as the dictionary key and the ISO-3 code as the value
project_country_abbrev_dict[country] = countries.lookup(country).alpha_3
except:
# If the country name is not in the pycountry library, then print out the country name
# And store the results in the country_not_found list
print(country, ' not found')
country_not_found.append(country)

Quite a few country names were not in the pycountry library. Some of these are regions like “South Asia” or “Southern Africa”, so it makes sense that these would not show up in the pycountry library.

Part 3 — Making a Manual Mapping

Perhaps some of these missing df_projects countries are already in the indicators data set. In the next cell, check if any of the countries in the country_not_found list are in the indicator list of countries.

# Run this code cell to 
# iterate through the country_not_found list and check if the country name is in the df_indicator data set
indicator_countries = df_indicator[['Country Name', 'Country Code']].drop_duplicates().sort_values(by='Country Name')

for country in country_not_found:
if country in indicator_countries['Country Name'].tolist():
print(country)

OUT:
South Asia
St. Kitts and Nevis
St. Lucia
St. Vincent and the Grenadines
West Bank and Gaza
World

Unfortunately, there aren’t too many country names that match between df_indicator and df_projects. This is where data cleaning becomes especially tedious, but in this case, we’ve done a lot of the work for you.

We’ve manually created a dictionary that maps all of the countries in country_not_found to the ISO-3 alpha codes. You could try to do this programatically using some sophisticated string matching rules. That might be worth your time for a larger data set. But in this case, it’s probably faster to type out the dictionary.

# run this code cell to load the dictionary

country_not_found_mapping = {'Co-operative Republic of Guyana': 'GUY',
'Commonwealth of Australia':'AUS',
'Democratic Republic of Sao Tome and Prin':'STP',
'Democratic Republic of the Congo':'COD',
'Democratic Socialist Republic of Sri Lan':'LKA',
'East Asia and Pacific':'EAS',
'Europe and Central Asia': 'ECS',
'Islamic Republic of Afghanistan':'AFG',
'Latin America':'LCN',
'Caribbean':'LCN',
'Macedonia':'MKD',
'Middle East and North Africa':'MEA',
'Oriental Republic of Uruguay':'URY',
'Republic of Congo':'COG',
"Republic of Cote d'Ivoire":'CIV',
'Republic of Korea':'KOR',
'Republic of Niger':'NER',
'Republic of Kosovo':'XKX',
'Republic of Rwanda':'RWA',
'Republic of The Gambia':'GMB',
'Republic of Togo':'TGO',
'Republic of the Union of Myanmar':'MMR',
'Republica Bolivariana de Venezuela':'VEN',
'Sint Maarten':'SXM',
"Socialist People's Libyan Arab Jamahiriy":'LBY',
'Socialist Republic of Vietnam':'VNM',
'Somali Democratic Republic':'SOM',
'South Asia':'SAS',
'St. Kitts and Nevis':'KNA',
'St. Lucia':'LCA',
'St. Vincent and the Grenadines':'VCT',
'State of Eritrea':'ERI',
'The Independent State of Papua New Guine':'PNG',
'West Bank and Gaza':'PSE',
'World':'WLD'}

Next, update the project_country_abbrev_dict variable with these new values.

# TODO: Update the project_country_abbrev_dict with the country_not_found_mapping dictionary
# HINT: This is relatively straightforward. Python dictionaries have a method called update(), which essentially
# appends a dictionary to another dictionary
project_country_abbrev_dict.update(country_not_found_mapping)

Part 5 — Make a ‘Country Code’ Column

Next, create a ‘Country Code’ column in the data_projects data frame. Use the project_country_abbrev_dict and df_projects[‘Official Country Name’] column to create a new columns called ‘Country ‘Code’.

# TODO: Use the project_country_abbrev_dict and the df_projects['Country Name'] column to make a new column
# of the alpha-3 country codes. This new column should be called 'Country Code'.
# HINT: Use the apply method and a lambda function
# HINT: The lambda function will use the project_country_abbrev_dict that maps the country name to the ISO code
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
df_projects['Country Code'] = df_projects['Official Country Name'].apply(lambda x: project_country_abbrev_dict[x])# Run this code cell to see which projects in the df_projects data frame still have no country code abbreviation.
# In other words, these projects do not have a matching population value in the df_indicator data frame.
df_projects[df_projects['Country Code'] == '']

You’ll notice that there are still a few entries without country abbreviations. This includes projects that were labeled as “Africa” rather than a specific country. It also includes “Yugoslavia”, which is a country that ceased to exist in the 1990s.

Conclusion

Now the df_projects dataframe and the df_indicator dataframe have a matching column called ‘Country Code’. But these two data frames can’t be merged quite yet.

Each project in the df_projects dataframe also has a date associated with it. The idea would be to merge the df_projects dataframe with the df_indicator dataframe so that each project also had a population value associated with it. There are still more data transformations to do in order for that to be possible.

In fact, the challenge problem from the previous exercise on merging data would help quite a bit. In that exercise, the indicator data was transformed from a wide format to a long format.

You could then merge the df_projects dataframe and the df_indicator dataframe using the alpha-3 country abbreviation and the project or indicator year. You can start to see how data transformations become a series of processes that pipeline data from one format into a different format.

In the next blog, we’ll practice working with the data types.

This blog series is my journey through the course and in the field. I will try my best to write the blogs in layman’s terminologies. Please join along if you are learning and share any insights that I might have missed.

Feel free to write to me at agarwalsak10@gmail.com if you have any questions or suggestions.

--

--

Sakshi Agarwal
Sakshi Agarwal

Written by Sakshi Agarwal

Computer Science Engineering graduate. Specialisation-Python Programming, Javascript, React, three.js

No responses yet