#11 Data Engineering — TRANSFORM DATA — Parsing Dates (ETL Pipeline)

Sakshi Agarwal
7 min readNov 26, 2021

--

This is the eleventh blog in the series of posts related to Data Engineering. I have been writing all the important things that I learn as a part of the Data Scientist Nanodegree Program, Udacity. I have realized it is the best way to test my understanding of the course material and maintain the discipline to study. Please check the other posts as well.

Parsing Dates

Another common data transformation involves parsing dates. Parsing generally means that you start with a string and then transform that string into a different data type. In this case, that means taking a date in the format of a string and transforming the string into a date type. The Github code for the exercise is here.

Run the next cell to see an example.

In:

import pandas as pdparsed_date = pd.to_datetime('January 1st, 2017')parsed_date

Out:

Timestamp('2017-01-01 00:00:00')

In:

parsed_date.month

Out:

1

In:

parsed_date.year

Out:

2017

In:

parsed_date.second

Out:

0

Sometimes date strings are formatted in unexpected ways. For example, in the United States, dates are given with the month first and then the day. That is what pandas expect by default. However, some countries write the date with the day first and then the month. Run the next three examples to see Panda’s default behavior and how you can specify the date formatting.

In:

parsed_date = pd.to_datetime('5/3/2017 5:30')parsed_date.month

Out:

5

In:

parsed_date = pd.to_datetime('3/5/2017 5:30', format='%d/%m/%Y %H:%M')parsed_date.month

Out:

5

In:

parsed_date = pd.to_datetime('5/3/2017 5:30', format='%m/%d/%Y %H:%M')parsed_date.month

Out:

5

The formatting abbreviations are actually part of the python standard. You can see examples at this link.

Part 1 — Practice Parsing Dates

Run the code cells below to import the World Bank projects data. The last line of the code outputs all of the column names in the data frame.

In:

# Run this code cell. Read in the projects data set with all columns type stringdf_projects = pd.read_csv('../data/projects_data.csv', dtype=str)df_projects.drop(['Unnamed: 56'], axis=1, inplace=True)df_projects.columns

Out:

Index(['id', 'regionname', 'countryname', 'prodline', 'lendinginstr',
'lendinginstrtype', 'envassesmentcategorycode', 'supplementprojectflg',
'productlinetype', 'projectstatusdisplay', 'status', 'project_name',
'boardapprovaldate', 'board_approval_month', 'closingdate',
'lendprojectcost', 'ibrdcommamt', 'idacommamt', 'totalamt', 'grantamt',
'borrower', 'impagency', 'url', 'projectdoc ', 'majorsector_percent ',
'sector1', 'sector2', 'sector3', 'sector4', 'sector5', 'sector',
'mjsector1', 'mjsector2', 'mjsector3', 'mjsector4', 'mjsector5',
'mjsector', 'theme1', 'theme2', 'theme3', 'theme4', 'theme5', 'theme ',
'goal', 'financier', 'mjtheme1name', 'mjtheme2name', 'mjtheme3name',
'mjtheme4name', 'mjtheme5name', 'location', 'GeoLocID', 'GeoLocName',
'Latitude', 'Longitude', 'Country'],
dtype='object')

Notice there are three columns associated with dates: boardapprovaldate, board_approval_month, and closingdate. Run the code cell below to see what these values look like.

In:

# Run this code celldf_projects.head(15)[['boardapprovaldate', 'board_approval_month', 'closingdate']]

Out:

boardapprovaldateboard_approval_monthclosingdate02018–06–28JuneNaT12018–06–28June2023–12–3122018–06–28JuneNaT32018–06–27June2023–06–2842018–06–27June2023–05–3152018–06–27June2019–12–3162018–06–27June2023–10–3172018–06–27June2023–12–3182018–06–27June2022–12–3192018–06–27June2023–12–31102018–06–27June2023–12–31112018–06–27JuneNaT122018–06–27JuneNaT132018–06–27JuneNaT142018–06–27JuneNaT

Use the pandas to_datetime method to convert the boardapprovaldate and closingdate columns into datetime objects.

In:

# TODO: Use the pandas to_datetime method to convert these two columns#   (boardapprovaldate, closingdate) into date times.# HINT: It's easier to do this one column at a timedf_projects['boardapprovaldate'] = pd.to_datetime(df_projects['boardapprovaldate'])df_projects['closingdate'] = pd.to_datetime(df_projects['closingdate'])

In :

df_projects['closingdate']

Out:

0              NaT
1 2023-12-31
2 NaT
3 2023-06-28
4 2023-05-31
5 2019-12-31
6 2023-10-31
7 2023-12-31
8 2022-12-31
9 2023-12-31
10 2023-12-31
11 NaT
12 NaT
13 NaT
14 NaT
15 2020-06-30
16 NaT
17 2023-12-31
18 2024-07-31
19 2024-12-31
20 2023-12-31
21 2019-12-31
22 2021-12-31
23 NaT
24 NaT
25 NaT
26 2023-12-31
27 2021-12-31
28 2023-06-30
29 2019-12-31
...
18218 1954-03-01
18219 1956-03-31
18220 1952-12-31
18221 1958-08-30
18222 1961-06-30
18223 1955-12-31
18224 1956-12-31
18225 1953-06-30
18226 1956-02-29
18227 1955-01-01
18228 1950-12-31
18229 1951-09-30
18230 1953-12-31
18231 1951-04-01
18232 1951-03-31
18233 1953-12-31
18234 1948-08-03
18235 1952-06-01
18236 1954-12-31
18237 1956-03-31
18238 1950-07-01
18239 1948-08-03
18240 1948-08-03
18241 1948-08-03
18242 1950-01-01
18243 1954-12-31
18244 1949-03-31
18245 1949-03-31
18246 1948-06-30
18247 1947-12-31
Name: closingdate, Length: 18248, dtype: datetime64[ns]

In :

# Run the code cells below to see how you can access the different parts of the datetime objects# Series.dt gives access to the datetime object as explained here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.htmldf_projects['boardapprovaldate'].dt.second

Out:

0        0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
21 0.0
22 0.0
23 0.0
24 0.0
25 0.0
26 0.0
27 0.0
28 0.0
29 0.0
...
18218 0.0
18219 0.0
18220 0.0
18221 0.0
18222 0.0
18223 0.0
18224 0.0
18225 0.0
18226 0.0
18227 0.0
18228 0.0
18229 0.0
18230 0.0
18231 0.0
18232 0.0
18233 0.0
18234 0.0
18235 0.0
18236 0.0
18237 0.0
18238 0.0
18239 0.0
18240 0.0
18241 0.0
18242 0.0
18243 0.0
18244 0.0
18245 0.0
18246 0.0
18247 0.0
Name: boardapprovaldate, Length: 18248, dtype: float64

In:

# Run this code cell to see the outputdf_projects['boardapprovaldate'].dt.month

Out:

0         6.0
1 6.0
2 6.0
3 6.0
4 6.0
5 6.0
6 6.0
7 6.0
8 6.0
9 6.0
10 6.0
11 6.0
12 6.0
13 6.0
14 6.0
15 6.0
16 6.0
17 6.0
18 6.0
19 6.0
20 6.0
21 6.0
22 6.0
23 6.0
24 6.0
25 6.0
26 6.0
27 6.0
28 6.0
29 6.0
...
18218 9.0
18219 8.0
18220 8.0
18221 7.0
18222 7.0
18223 6.0
18224 5.0
18225 4.0
18226 4.0
18227 12.0
18228 10.0
18229 10.0
18230 9.0
18231 8.0
18232 8.0
18233 7.0
18234 7.0
18235 2.0
18236 1.0
18237 1.0
18238 1.0
18239 7.0
18240 7.0
18241 7.0
18242 3.0
18243 3.0
18244 8.0
18245 8.0
18246 8.0
18247 5.0
Name: boardapprovaldate, Length: 18248, dtype: float64

In:

# Run this code to see the output# weekday represents the day of the week from 0 (Monday) to 6 (Sunday).df_projects['boardapprovaldate'].dt.weekday

Out:

0        3.0
1 3.0
2 3.0
3 2.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
9 2.0
10 2.0
11 2.0
12 2.0
13 2.0
14 2.0
15 2.0
16 2.0
17 2.0
18 1.0
19 1.0
20 1.0
21 1.0
22 1.0
23 0.0
24 5.0
25 4.0
26 4.0
27 4.0
28 4.0
29 4.0
...
18218 2.0
18219 4.0
18220 1.0
18221 4.0
18222 4.0
18223 2.0
18224 4.0
18225 4.0
18226 1.0
18227 2.0
18228 0.0
18229 3.0
18230 3.0
18231 4.0
18232 3.0
18233 1.0
18234 2.0
18235 1.0
18236 3.0
18237 3.0
18238 3.0
18239 3.0
18240 3.0
18241 3.0
18242 3.0
18243 3.0
18244 3.0
18245 4.0
18246 3.0
18247 4.0
Name: boardapprovaldate, Length: 18248, dtype: float64

Part 2 — Create new columns

Now that the boardapprovaldate and closingdates are in datetime formats, create a few new columns in the df_projects data frame:

  • approvalyear
  • approvalday
  • approvalweekday
  • closingyear
  • closingday
  • closingweekday

In:

# TODO create the follwing new columns in the df_projects data frame# approvalyeardf_projects['approvalyear'] = df_projects['boardapprovaldate'].dt.year
# approvaldaydf_projects['approvalday'] = df_projects['boardapprovaldate'].dt.day
# approvalweekdaydf_projects['approvalweekday'] = df_projects['boardapprovaldate'].dt.weekday
# closingyeardf_projects['closingyear'] = df_projects['boardapprovaldate'].dt.year
# closingdaydf_projects['closingday'] = df_projects['boardapprovaldate'].dt.day# closingweekdaydf_projects['closingweekday'] = df_projects['boardapprovaldate'].dt.weekday###

Part 3 (Challenge)

Use what you’ve practiced in this exercise to make a visualization of the year on the x-axis and the sum of the totalamt columns per year on the y-axis.

You’ll first need to clean the totalamt column to get rid of commas and convert the values to numeric. Then you’ll need to use pandas’ groupby method to sum the totalamt column for each year.

Finally, you can use the pandas plot() method to output the visualization.

In :

#### TODO: Make a visualization with year on the x-axis and the sum of the totalamt columns per year on the y-axis# HINT: The totalamt column is currently a string with commas. For example 100,250,364. You'll need to remove the commas and convert the column to a numeric variable.# HINT: pandas groupby, sum, and plot methods should also be helpful####import matplotlib.pyplot as plt%matplotlib inline# TODO: Step 1 - convert the totalamt column from string to numeric. Be sure to remove the commas in this columndf_projects['totalamt'] = pd.to_numeric(df_projects['totalamt'].replace(',','', regex=True))# TODO: Step 2 - calculate the sum of the totalamt column by year and plot the results with a line plotax = df_projects.groupby('approvalyear')['totalamt'].sum().plot(x='approvalyear', y='totalamt',title ='Total Amount Approved per Year'ax.set_xlabel('year')ax.set_ylabel('amount $')plt.show()

Conclusion

Parsing dates is a common task data transformation task. This is true not just with pandas but with other data systems like SQL.

Next in the series will be Matching Encodings.

--

--

Sakshi Agarwal

Computer Science Engineering graduate. Specialisation-Python Programming, Javascript, React, three.js