#4 Data Engineering — EXTRACT DATA from JSON and XML files

7 min read · Mar 6, 2024

This is the fourth post in my series on Data Engineering. I have been writing up the important things I learn in the Data Scientist Nanodegree Program at Udacity. I have found it is the best way to test my understanding of the course material and to maintain the discipline to study. Please check out the other posts as well.

In this article, we focus on the first part of the ETL pipeline: extracting the data. We extract data from two kinds of sources, JSON and XML files. The GitHub code for it is here. If you want to learn how to extract data from CSV files, please refer to my previous post.

EXTRACT JSON

Let us first look at what the JSON file format is.

JSON (JavaScript Object Notation) is a lightweight data-interchange format.

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

These are universal data structures. Virtually all modern programming languages support…
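To make the two structures concrete, here is a minimal sketch using Python's built-in json module. The sample string and its field names (name, tags, details) are made up for illustration and do not come from the course dataset.

```python
import json

# Hypothetical sample document; the field names and values are made up.
raw = '{"name": "example", "tags": ["a", "b", "c"], "details": {"rank": 1}}'

# json.loads parses a JSON string into native Python objects.
record = json.loads(raw)

# A JSON object (collection of name/value pairs) becomes a dict.
object_type = type(record).__name__

# A JSON array (ordered list of values) becomes a list.
array_type = type(record["tags"]).__name__

# Objects can nest, so values are reached by chaining keys.
rank = record["details"]["rank"]

print(object_type, array_type, rank)
```

The same mapping applies when reading from a file: json.load takes an open file object instead of a string.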

Written by Sakshi Agarwal