#1 Data Engineering Course — INTRODUCTION

Sakshi Agarwal
Aug 31, 2021


This is the first blog in a series of posts on Data Engineering. I am going to write down all the important things I learn as part of Udacity's Data Scientist Nanodegree Program. I have realised this is the best way to test my understanding of the course material and maintain the discipline to study.

For this first blog, let's cover three main things:

  1. What do Data Engineers do?
  2. A roadmap of what is coming in future posts.
  3. A preview of the final project.

So, let’s dive into the first topic:

What exactly is a Data Engineer’s role?

One, they are responsible for fetching data from many different kinds of sources. Two, they process and clean it. Finally, they store the data so that it can be accessed and utilized by others (Data Scientists, Business Analysts, etc.) in an organisation.

They may also automate this whole process into what are called data pipelines, and they work with large systems that they build, test, and maintain.
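
To make that concrete, here is a minimal sketch of an ETL pipeline in Python, assuming a single CSV source and an SQLite destination (the file names, table name, and cleaning steps are placeholders of my own, not anything from the course):

```python
import sqlite3

import pandas as pd

# Extract: pull raw records from a source (a CSV file in this sketch)
df = pd.read_csv("raw_messages.csv")

# Transform: clean the data, e.g. drop duplicates and rows with missing values
df = df.drop_duplicates().dropna()

# Load: store the cleaned data where others in the organisation can query it
conn = sqlite3.connect("warehouse.db")
df.to_sql("clean_messages", conn, if_exists="replace", index=False)
conn.close()
```

Real pipelines do these same three steps at a much larger scale, with scheduling, monitoring, and many sources instead of one file.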

As a Data Scientist, it is important to understand at least a bit of data engineering, so that you can communicate your needs and requirements to a data engineer.

Even as a Machine Learning Engineer, it is important that you learn data engineering, as it is almost impossible to train and test a model on poorly formatted data.

Course Roadmap

Data Engineering

  • Data Pipelines
  • ETL (Extract, Transform, Load) Pipelines

NLP Pipelines

  • Text Processing
  • Modeling
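
As a small taste of the text-processing step, here is a sketch using NLTK, which is one common choice for this (the exact cleaning steps in the course posts may differ):

```python
import re

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")    # tokenizer data
nltk.download("wordnet")  # lemmatizer dictionary

def tokenize(text):
    """Normalize, tokenize, and lemmatize a raw message."""
    # Lowercase and replace anything that is not a letter or digit
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text)]

print(tokenize("Houses were flooded after the storm!"))
# ['house', 'were', 'flooded', 'after', 'the', 'storm']
```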

Machine Learning Pipelines

  • Scikit-learn pipelines
  • Feature Union
  • Grid Search
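
To show how those three pieces fit together, here is a rough scikit-learn sketch: a Pipeline chains the steps, a FeatureUnion runs two feature extractors in parallel and concatenates their outputs, and GridSearchCV tunes hyperparameters. The specific features and parameter grid are illustrative choices of mine:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# A Pipeline chains steps so raw text goes in and predictions come out
pipeline = Pipeline([
    # FeatureUnion applies both extractors and concatenates their features
    ("features", FeatureUnion([
        ("word_tfidf", TfidfVectorizer(analyzer="word")),
        ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
    ])),
    ("clf", RandomForestClassifier()),
])

# Grid Search tunes any step's hyperparameters via the step__param syntax
params = {"clf__n_estimators": [50, 100]}
model = GridSearchCV(pipeline, param_grid=params, cv=3)
# model.fit(texts, labels)  # texts: list of strings, labels: class labels
```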

Data Engineering Project

  • Classify disaster response messages
  • Skills: data pipelines, NLP pipelines, machine learning pipelines, supervised learning

Final Project Preview

In this project, I will be analyzing thousands of real messages, provided by Figure 8, that were sent during natural disasters either via social media or directly to disaster response organizations. I'll build an ETL pipeline that processes message and category data from CSV files and loads them into an SQLite database, which my machine learning pipeline will then read from to create and save a multi-output supervised learning model. Then, my web app will extract data from this database to provide data visualizations and use my model to classify new messages into 36 categories.
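
As a rough sketch of the machine-learning half of that workflow (the database name, table name, and column layout below are my assumptions, not necessarily the project's actual ones):

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

# Read the table the ETL pipeline produced (names are placeholders)
engine = create_engine("sqlite:///DisasterResponse.db")
df = pd.read_sql_table("messages", engine)
X = df["message"]
Y = df.iloc[:, 4:]  # assume the 36 category columns come after the metadata

# MultiOutputClassifier fits one classifier per category column
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])
model.fit(X, Y)

# Save the trained model so the web app can load it to classify new messages
with open("classifier.pkl", "wb") as f:
    pickle.dump(model, f)
```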

Machine learning is critical to helping different organizations understand which messages are relevant to them and which messages to prioritize. During these disasters, organizations have the least capacity to filter out the messages that matter, and basic methods such as keyword searches provide only trivial results. In this course, you'll learn the skills you need in ETL pipelines, natural language processing, and machine learning pipelines to create an amazing project with real-world significance.

In conclusion, this blog series will be my journey through the course and into the field. I will try my best to write the blogs in layman's terms. Please follow along if you are learning too, and share any insights that I might have missed.

Note: I have already completed the first two parts of the course, which are Introduction to Data Science and Software Engineering. As a part of the first assignment, I have written a blog post, which you can find here. The GitHub code for the same can be found here.
