#14 Data Engineering - TRANSFORM DATA - Duplicate Data (ETL Pipeline)

Sakshi Agarwal
7 min read · Jan 12, 2022

This is the fourteenth blog in the series of posts related to Data Engineering. I have been writing all the important things that I learn as a part of the Data Scientist Nanodegree Program, Udacity. I have realized it is the best way to test my understanding of the course material and maintain the discipline to study. Please check the other posts as well.

Duplicate Data

Apart from missing data, another major problem is duplicate data, where the same record is represented multiple times.

One easy way to remove duplicates is the drop_duplicates method in pandas. However, it isn’t always easy to detect duplicates, because two rows can describe the same record without being exactly identical.
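To make this concrete, here is a minimal sketch of drop_duplicates on a tiny DataFrame. The column names and rows are made up for illustration and are not from the course material. It shows that exact duplicates are removed, while a spelling variant of the same name is kept because pandas only compares values for exact equality.

```python
import pandas as pd

# Hypothetical records, invented for illustration only.
df = pd.DataFrame(
    {
        "first": ["Amber", "Amber", "Cristina", "Christina"],
        "last": ["Fitz", "Fitz", "Real", "Real"],
    }
)

# drop_duplicates keeps the first occurrence of each fully identical row.
# The two "Amber Fitz" rows collapse into one, but "Cristina" and
# "Christina" both survive because the strings are not exactly equal.
deduped = df.drop_duplicates().reset_index(drop=True)

# The comparison can also be limited to a subset of columns.
# Here rows count as duplicates if they share the same last name.
by_last = df.drop_duplicates(subset=["last"]).reset_index(drop=True)

print(deduped)
print(by_last)
```

Deduplicating on a subset of columns is a blunt tool: in this sketch it would also merge two different people who happen to share a last name, so the choice of columns needs care.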

For example, suppose we have two tables that we want to merge:

Table 1:

+----------+------+---------------+------------------+
| First    | Last | Phone         | Address          |
+----------+------+---------------+------------------+
| Cristina | Real | 000-0000-0001 | X Street, Y city |
| Amber    | Fitz | 000-0000-0002 | A Street, Y city |
+----------+------+---------------+------------------+

Table 2:

+------------+------+---------------+
| First Name | Last | Email         |
+------------+------+---------------+
| Christina  | Real | xyz@email.com |
|…

Sakshi Agarwal

Computer Science Engineering graduate. Specialisation-Python Programming, Javascript, React, three.js