#2 Data Engineering — PIPELINES

Sakshi Agarwal
3 min read · Mar 6, 2024

This is the second post in my series on Data Engineering. I am writing down the important things I learn as part of Udacity's Data Scientist Nanodegree Program. I have found this is the best way to test my understanding of the course material and stay disciplined about studying. Please check out the other posts as well.

The first thing we learn as a Data Engineer is data pipelining.

Pipelines — INTRODUCTION

Pipelining simply means moving data from one place to another. There are two major types of pipelines:

ETL

An ETL pipeline is a specific kind of data pipeline and is very common. ETL stands for Extract, Transform, Load. Imagine that you have a database containing web log data. Each entry contains the IP address of a user, a timestamp, and the link that the user clicked.

What if your company wanted to run an analysis of links clicked by city and by day? You would need another data set that maps an IP address to a city, and you would also need to extract the day from each timestamp. With an ETL pipeline, you could run code once per day that would extract the previous day’s log data, map each IP address to a city, aggregate link clicks by city, and then load these results into a new database. That way, a data analyst or scientist would have access to a table of log data by city and day. That is far more convenient than always having to run the same complex data transformations on the raw data.
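The steps above can be sketched in plain Python. This is a minimal illustration, not the course's implementation: the sample log rows, the `ip_to_city` lookup table, and the in-memory "database" are all hypothetical stand-ins (in practice the extract step would query a real database and the IP-to-city mapping would come from a GeoIP dataset).

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of raw web-log rows: (ip_address, timestamp, link_clicked)
raw_logs = [
    ("203.0.113.5", "2024-03-05T09:15:00", "/pricing"),
    ("203.0.113.5", "2024-03-05T09:20:00", "/docs"),
    ("198.51.100.7", "2024-03-05T11:02:00", "/pricing"),
]

# Hypothetical IP-to-city lookup table (a real pipeline would use a GeoIP database).
ip_to_city = {"203.0.113.5": "Delhi", "198.51.100.7": "Mumbai"}

def extract(logs):
    """Extract: read the previous day's raw log rows from the source."""
    return logs

def transform(rows):
    """Transform: map each IP to a city, pull the day out of the
    timestamp, and aggregate link clicks by (city, day)."""
    clicks = Counter()
    for ip, ts, _link in rows:
        city = ip_to_city.get(ip, "unknown")
        day = datetime.fromisoformat(ts).date().isoformat()
        clicks[(city, day)] += 1
    return clicks

def load(clicks, target_table):
    """Load: write the aggregated results into the target table
    (a plain dict here, standing in for a new database table)."""
    target_table.update(clicks)
    return target_table

# Run the pipeline end to end: E -> T -> L.
results = load(transform(extract(raw_logs)), {})
```

After this runs, `results` maps each `(city, day)` pair to its click count, which is exactly the table the analyst would query instead of re-transforming the raw logs.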
