#11 Data Engineering — TRANSFORM DATA — Parsing Dates (ETL Pipeline)

7 min readNov 26, 2021

This is the eleventh blog in the series of posts related to Data Engineering. I have been writing all the important things that I learn as a part of the Data Scientist Nanodegree Program, Udacity. I have realized it is the best way to test my understanding of the course material and maintain the discipline to study. Please check the other posts as well.

Parsing Dates

Another common data transformation involves parsing dates. Parsing generally means that you start with a string and then transform that string into a different data type. In this case, that means taking a date in the format of a string and transforming the string into a date type. The Github code for the exercise is here.

Run the next cell to see an example.

In:

import pandas as pdparsed_date = pd.to_datetime('January 1st, 2017')parsed_date

Out:

Timestamp('2017-01-01 00:00:00')

In:

parsed_date.month

Out:

In:

parsed_date.year

Out:

In:

parsed_date.second

Out:

Sometimes date strings are formatted in unexpected ways. For example, in the United States, dates are given with the month first and then the day. That is what pandas expect by default. However, some countries write the date with the day first and then the month. Run the next three examples to see Panda’s default behavior and how you can specify the date formatting.

In:

parsed_date = pd.to_datetime('5/3/2017 5:30')parsed_date.month

Out:

In:

parsed_date = pd.to_datetime('3/5/2017 5:30', format='%d/%m/%Y %H:%M')parsed_date.month

Out:

In:

parsed_date = pd.to_datetime('5/3/2017 5:30', format='%m/%d/%Y %H:%M')parsed_date.month

Out:

The formatting abbreviations are actually part of the python standard. You can see examples at this link.

Part 1 — Practice Parsing Dates

Run the code cells below to import the World Bank projects data. The last line of the code outputs all of the column names in the data frame.

In:

# Run this code cell. Read in the projects data set with all columns type stringdf_projects = pd.read_csv('../data/projects_data.csv', dtype=str)df_projects.drop(['Unnamed: 56'], axis=1, inplace=True)df_projects.columns

Out:

Index(['id', 'regionname', 'countryname', 'prodline', 'lendinginstr',
       'lendinginstrtype', 'envassesmentcategorycode', 'supplementprojectflg',
       'productlinetype', 'projectstatusdisplay', 'status', 'project_name',
       'boardapprovaldate', 'board_approval_month', 'closingdate',
       'lendprojectcost', 'ibrdcommamt', 'idacommamt', 'totalamt', 'grantamt',
       'borrower', 'impagency', 'url', 'projectdoc ', 'majorsector_percent ',
       'sector1', 'sector2', 'sector3', 'sector4', 'sector5', 'sector',
       'mjsector1', 'mjsector2', 'mjsector3', 'mjsector4', 'mjsector5',
       'mjsector', 'theme1', 'theme2', 'theme3', 'theme4', 'theme5', 'theme ',
       'goal', 'financier', 'mjtheme1name', 'mjtheme2name', 'mjtheme3name',
       'mjtheme4name', 'mjtheme5name', 'location', 'GeoLocID', 'GeoLocName',
       'Latitude', 'Longitude', 'Country'],
      dtype='object')

Notice there are three columns associated with dates: boardapprovaldate, board_approval_month, and closingdate. Run the code cell below to see what these values look like.

In:

# Run this code celldf_projects.head(15)[['boardapprovaldate', 'board_approval_month', 'closingdate']]

Out:

boardapprovaldateboard_approval_monthclosingdate02018–06–28JuneNaT12018–06–28June2023–12–3122018–06–28JuneNaT32018–06–27June2023–06–2842018–06–27June2023–05–3152018–06–27June2019–12–3162018–06–27June2023–10–3172018–06–27June2023–12–3182018–06–27June2022–12–3192018–06–27June2023–12–31102018–06–27June2023–12–31112018–06–27JuneNaT122018–06–27JuneNaT132018–06–27JuneNaT142018–06–27JuneNaT

Use the pandas to_datetime method to convert the boardapprovaldate and closingdate columns into datetime objects.

In:

# TODO: Use the pandas to_datetime method to convert these two columns#   (boardapprovaldate, closingdate) into date times.# HINT: It's easier to do this one column at a timedf_projects['boardapprovaldate'] = pd.to_datetime(df_projects['boardapprovaldate'])df_projects['closingdate'] = pd.to_datetime(df_projects['closingdate'])

In :

df_projects['closingdate']

Out:

0              NaT
1       2023-12-31
2              NaT
3       2023-06-28
4       2023-05-31
5       2019-12-31
6       2023-10-31
7       2023-12-31
8       2022-12-31
9       2023-12-31
10      2023-12-31
11             NaT
12             NaT
13             NaT
14             NaT
15      2020-06-30
16             NaT
17      2023-12-31
18      2024-07-31
19      2024-12-31
20      2023-12-31
21      2019-12-31
22      2021-12-31
23             NaT
24             NaT
25             NaT
26      2023-12-31
27      2021-12-31
28      2023-06-30
29      2019-12-31
           ...    
18218   1954-03-01
18219   1956-03-31
18220   1952-12-31
18221   1958-08-30
18222   1961-06-30
18223   1955-12-31
18224   1956-12-31
18225   1953-06-30
18226   1956-02-29
18227   1955-01-01
18228   1950-12-31
18229   1951-09-30
18230   1953-12-31
18231   1951-04-01
18232   1951-03-31
18233   1953-12-31
18234   1948-08-03
18235   1952-06-01
18236   1954-12-31
18237   1956-03-31
18238   1950-07-01
18239   1948-08-03
18240   1948-08-03
18241   1948-08-03
18242   1950-01-01
18243   1954-12-31
18244   1949-03-31
18245   1949-03-31
18246   1948-06-30
18247   1947-12-31
Name: closingdate, Length: 18248, dtype: datetime64[ns]

In :

# Run the code cells below to see how you can access the different parts of the datetime objects# Series.dt gives access to the datetime object as explained here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.htmldf_projects['boardapprovaldate'].dt.second

Out:

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
5        0.0
6        0.0
7        0.0
8        0.0
9        0.0
10       0.0
11       0.0
12       0.0
13       0.0
14       0.0
15       0.0
16       0.0
17       0.0
18       0.0
19       0.0
20       0.0
21       0.0
22       0.0
23       0.0
24       0.0
25       0.0
26       0.0
27       0.0
28       0.0
29       0.0
        ... 
18218    0.0
18219    0.0
18220    0.0
18221    0.0
18222    0.0
18223    0.0
18224    0.0
18225    0.0
18226    0.0
18227    0.0
18228    0.0
18229    0.0
18230    0.0
18231    0.0
18232    0.0
18233    0.0
18234    0.0
18235    0.0
18236    0.0
18237    0.0
18238    0.0
18239    0.0
18240    0.0
18241    0.0
18242    0.0
18243    0.0
18244    0.0
18245    0.0
18246    0.0
18247    0.0
Name: boardapprovaldate, Length: 18248, dtype: float64

In:

# Run this code cell to see the outputdf_projects['boardapprovaldate'].dt.month

Out:

0         6.0
1         6.0
2         6.0
3         6.0
4         6.0
5         6.0
6         6.0
7         6.0
8         6.0
9         6.0
10        6.0
11        6.0
12        6.0
13        6.0
14        6.0
15        6.0
16        6.0
17        6.0
18        6.0
19        6.0
20        6.0
21        6.0
22        6.0
23        6.0
24        6.0
25        6.0
26        6.0
27        6.0
28        6.0
29        6.0
         ... 
18218     9.0
18219     8.0
18220     8.0
18221     7.0
18222     7.0
18223     6.0
18224     5.0
18225     4.0
18226     4.0
18227    12.0
18228    10.0
18229    10.0
18230     9.0
18231     8.0
18232     8.0
18233     7.0
18234     7.0
18235     2.0
18236     1.0
18237     1.0
18238     1.0
18239     7.0
18240     7.0
18241     7.0
18242     3.0
18243     3.0
18244     8.0
18245     8.0
18246     8.0
18247     5.0
Name: boardapprovaldate, Length: 18248, dtype: float64

In:

# Run this code to see the output# weekday represents the day of the week from 0 (Monday) to 6 (Sunday).df_projects['boardapprovaldate'].dt.weekday

Out:

0        3.0
1        3.0
2        3.0
3        2.0
4        2.0
5        2.0
6        2.0
7        2.0
8        2.0
9        2.0
10       2.0
11       2.0
12       2.0
13       2.0
14       2.0
15       2.0
16       2.0
17       2.0
18       1.0
19       1.0
20       1.0
21       1.0
22       1.0
23       0.0
24       5.0
25       4.0
26       4.0
27       4.0
28       4.0
29       4.0
        ... 
18218    2.0
18219    4.0
18220    1.0
18221    4.0
18222    4.0
18223    2.0
18224    4.0
18225    4.0
18226    1.0
18227    2.0
18228    0.0
18229    3.0
18230    3.0
18231    4.0
18232    3.0
18233    1.0
18234    2.0
18235    1.0
18236    3.0
18237    3.0
18238    3.0
18239    3.0
18240    3.0
18241    3.0
18242    3.0
18243    3.0
18244    3.0
18245    4.0
18246    3.0
18247    4.0
Name: boardapprovaldate, Length: 18248, dtype: float64

Part 2 — Create new columns

Now that the boardapprovaldate and closingdates are in datetime formats, create a few new columns in the df_projects data frame:

approvalyear
approvalday
approvalweekday
closingyear
closingday
closingweekday

In:

# TODO create the follwing new columns in the df_projects data frame# approvalyeardf_projects['approvalyear'] = df_projects['boardapprovaldate'].dt.year
# approvaldaydf_projects['approvalday'] = df_projects['boardapprovaldate'].dt.day
# approvalweekdaydf_projects['approvalweekday'] = df_projects['boardapprovaldate'].dt.weekday
# closingyeardf_projects['closingyear'] = df_projects['boardapprovaldate'].dt.year
# closingdaydf_projects['closingday'] = df_projects['boardapprovaldate'].dt.day# closingweekdaydf_projects['closingweekday'] = df_projects['boardapprovaldate'].dt.weekday###

Part 3 (Challenge)

Use what you’ve practiced in this exercise to make a visualization of the year on the x-axis and the sum of the totalamt columns per year on the y-axis.

You’ll first need to clean the totalamt column to get rid of commas and convert the values to numeric. Then you’ll need to use pandas’ groupby method to sum the totalamt column for each year.

Finally, you can use the pandas plot() method to output the visualization.

In :

#### TODO: Make a visualization with year on the x-axis and the sum of the totalamt columns per year on the y-axis# HINT: The totalamt column is currently a string with commas. For example 100,250,364. You'll need to remove the commas and convert the column to a numeric variable.# HINT: pandas groupby, sum, and plot methods should also be helpful####import matplotlib.pyplot as plt%matplotlib inline# TODO: Step 1 - convert the totalamt column from string to numeric. Be sure to remove the commas in this columndf_projects['totalamt'] = pd.to_numeric(df_projects['totalamt'].replace(',','', regex=True))# TODO: Step 2 - calculate the sum of the totalamt column by year and plot the results with a line plotax = df_projects.groupby('approvalyear')['totalamt'].sum().plot(x='approvalyear', y='totalamt',title ='Total Amount Approved per Year'ax.set_xlabel('year')ax.set_ylabel('amount $')plt.show()

Conclusion

Parsing dates is a common task data transformation task. This is true not just with pandas but with other data systems like SQL.

Next in the series will be Matching Encodings.

#11 Data Engineering — TRANSFORM DATA — Parsing Dates (ETL Pipeline)

Parsing Dates

Part 1 — Practice Parsing Dates

Part 2 — Create new columns

Part 3 (Challenge)

Conclusion

Written by Sakshi Agarwal

No responses yet