#12 Data Engineering — TRANSFORM DATA — Matching Encodings (ETL Pipeline)

Sakshi Agarwal
4 min read · Dec 24, 2021

This is the twelfth post in my series on Data Engineering. I have been writing up the important things I learn as part of Udacity's Data Scientist Nanodegree Program. I have found it is the best way to test my understanding of the course material and to maintain the discipline to study. Please check out the other posts as well.

Encodings

Encodings are sets of rules that map string characters to their binary representations. Python supports dozens of different encodings, which are listed under "Standard Encodings" in the documentation for the codecs module. Because early computing was centred on English, the first encoding rules, such as ASCII, mapped binary codes to the English alphabet.
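To make the idea of "characters mapped to binary" concrete, here is a minimal sketch using only the Python standard library. The sample string and the variable names are my own, chosen for illustration; they are not from the course material.

```python
from encodings.aliases import aliases

# Sample text chosen for illustration.
text = "data"

# The same characters become different bytes under different rules.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Show each UTF-8 byte as eight bits.
bits = " ".join(f"{b:08b}" for b in utf8_bytes)

print(utf8_bytes)    # b'data'
print(bits)          # 01100100 01100001 01110100 01100001
print(utf16_bytes)   # b'd\x00a\x00t\x00a\x00'

# Decoding with the same rules recovers the original string.
roundtrip = utf8_bytes.decode("utf-8")

# The alias table gives a rough idea of how many codecs Python ships with.
codec_names = sorted(set(aliases.values()))
print(len(codec_names), "codec names found in the alias table")
```

The exact number of codecs depends on the Python version, so I have not written it down here.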

The English alphabet has only 26 letters, but other languages use many more characters, including letters with accents, tildes, and umlauts. As time went on, more encodings were invented to handle languages other than English. The UTF-8 standard tries to provide a single encoding scheme that can represent all text.
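A short sketch of why an English-only encoding is not enough, and how UTF-8 copes by using a variable number of bytes per character. The word and the sample characters are illustrative choices of mine.

```python
# A word with a tilde, which plain ASCII cannot represent.
word = "señor"

try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # ASCII only covers 128 code points, so "ñ" has no mapping.
    ascii_ok = False

# UTF-8 can encode the same word without trouble.
utf8_bytes = word.encode("utf-8")

# UTF-8 is variable-width: different characters need different byte counts.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in "añ€"}

print(ascii_ok)       # False
print(utf8_bytes)     # b'se\xc3\xb1or'
print(byte_lengths)   # {'a': 1, 'ñ': 2, '€': 3}
```

Plain English letters still take one byte each in UTF-8, which is why old ASCII files are also valid UTF-8.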

The problem is that it's difficult to know which encoding rules were used to create a file unless somebody tells you. The most common encoding by far is UTF-8, and pandas assumes that files are UTF-8 when you read them in or write them out unless you specify a different encoding.
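When nobody tells you the encoding, one rough approach is to try a short list of candidates in order. The sketch below uses only the standard library; the helper name decode_with_fallback, the candidate list, and the sample bytes are all assumptions of mine for illustration, not something from the course or from pandas.

```python
# Pretend these bytes were read from a file saved as Latin-1, not UTF-8.
raw = "Côte d'Ivoire".encode("latin-1")


def decode_with_fallback(data, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in order (hypothetical helper).

    Returns the decoded text and the name of the encoding that worked.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # These bytes are not valid under this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(raw)
print(text)   # Côte d'Ivoire
print(used)   # latin-1
```

Latin-1 accepts any byte sequence, so it belongs at the end of the list, and a successful decode does not prove the guess was right. Once you have a likely encoding, pandas lets you pass it along through the encoding parameter of read_csv, for example pd.read_csv(path, encoding="latin-1"), where path stands for your own file.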

Sakshi Agarwal

Computer Science Engineering graduate. Specialisation: Python programming, JavaScript, React, three.js.