#15 Data Engineering — TRANSFORM DATA — Dummy Variables (ETL Pipeline)

Sakshi Agarwal
11 min read · Jan 13, 2022

This is the fifteenth blog in the series of posts related to Data Engineering. I have been writing all the important things that I learn as a part of the Data Scientist Nanodegree Program, Udacity. I have realized it is the best way to test my understanding of the course material and maintain the discipline to study. Please check the other posts as well.

Dummy Variables — What are they and when do we use them?

Often, when we use linear regression, we need to convert categorical data into a set of numeric columns called dummy variables.

For example, the World Bank project data contains the total amount of money associated with each project. Let’s assume we want to predict what amount a newly proposed project will receive.

We can use the project theme as one of the model’s features. Say there are five categories: “agriculture”, “banking”, “retail”, “roads”, and “government”. We can then convert them into dummy variables: one column per category, holding 1 when the project has that theme and 0 otherwise.
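Here is a minimal sketch of that conversion using pandas' `get_dummies`. The `projects` DataFrame and its `theme` column are made-up sample data for illustration, not the actual World Bank dataset.

```python
import pandas as pd

# Hypothetical sample data: one project per theme (not the real dataset).
projects = pd.DataFrame(
    {"theme": ["agriculture", "banking", "retail", "roads", "government"]}
)

# Build one 0/1 column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(projects["theme"], dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column that matches the project's theme. pandas orders the new columns alphabetically by category name.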

However, you only need four of those five categories for dummy variables. If all those four categories are…
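The four-column version can be sketched with the `drop_first` option of `get_dummies`. The `themes` Series is again made-up sample data, and which category gets dropped ("agriculture", the first in sorted order) is simply a consequence of pandas' default ordering, not a choice made in the original material.

```python
import pandas as pd

# Hypothetical sample data: the five themes from the example above.
themes = pd.Series(
    ["agriculture", "banking", "retail", "roads", "government"], name="theme"
)

# drop_first=True removes the first category in sorted order ("agriculture"),
# leaving four dummy columns for five categories.
reduced = pd.get_dummies(themes, drop_first=True, dtype=int)
print(reduced)
```

In this sketch the "agriculture" row contains only zeros, while every other row still has a single 1, so the five themes remain distinguishable with four columns.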

Sakshi Agarwal

Computer Science Engineering graduate. Specialisation: Python Programming, JavaScript, React, three.js