#5 Data Engineering — EXTRACT DATA from SQL Databases
This is the fifth blog in the series of posts related to Data Engineering. I have been writing all the important things that I learn as a part of the Data Scientist Nanodegree Program, Udacity. I have realized it is the best way to test my understanding of the course material and maintain the discipline to study. Please check the other posts as well.
In this article, we’ll gain experience extracting data from SQL databases. This is an overview of Python tools and assumes you already have experience writing SQL queries.
Pandas and sqlite3
You can use Pandas to open a SQL database or to run a SQL query against a database. There is more than one way to do this depending on the type of SQL database you are working with: the sqlite3 library or the sqlalchemy library.
In the same folder as this Jupyter notebook, there is a SQLite database file called “population_data.db”. SQLite is a database engine meant for single applications. The entire database is contained in one file. You can read more about SQLite here.
In this example, the “population_data.db” database contains only one table called “population_data”. The following code is used to use a SQLite database with pandas.
Demo: SQLite3 and Pandas
import sqlite3
import pandas as pd# connect to the database
conn = sqlite3.connect(‘population_data.db’)# run a query
pd.read_sql(‘SELECT * FROM population_data’, conn)
pd.read_sql(‘SELECT “Country_Name”, “Country_Code”, “1960” FROM population_data’, conn)
Demo: SQLAlchemy and Pandas
If you are working with a different type of databases such as MySQL or PostgreSQL, you can use the SQLAlchemy library with pandas. Here are the instructions for connecting to different types of databases using SQLAlchemy.
Run the code below in the jupyter notebook to see how to connect to the population_data.db database.
import pandas as pd
from sqlalchemy import create_engine###
# create a database engine
# to find the correct file path, use the python os library:
# import os
# print(os.getcwd())
#
###engine = create_engine(‘sqlite:////home/workspace/3_sql_exercise/population_data.db’)
pd.read_sql(“SELECT * FROM population_data”, engine)
Bonus Exercise
Connect to the population_data.db SQLite database, and answer the following questions:
- Write a query that finds the change in population from 1960 to 1961 in Aruba
- Write a query that finds the population of Belgium and also Luxembourg in 1975. The output should have two rows.
If you are done, match the answers:
pd.read_sql(‘SELECT “Country_name”,”1961" — “1960” FROM population_data WHERE Country_Name = “Aruba”’, engine)
pd.read_sql(‘SELECT “Country_name”,”1975" FROM population_data WHERE Country_Name = “Belgium”or””’, engine)
Conclusion
This was the basics of how we extracted the data from SQL databases. The next article will be about extracting data using APIs. Keep an eye here and let’s learn together.
Feel free to write to me at agarwalsak10@gmail.com if you have any questions or suggestions.