Extract, transform, and load (ETL) pipelines process data from multiple sources into central repositories to serve various business purposes. This project uses a toy pipeline that ingests data from a REST API using PySpark to model an ETL pipeline in the wild.
This project creates an ETL pipeline that (a condensed code sketch follows the list):
- Imports data from a public API (using PySpark, the Python API for Spark)
- Creates a DataFrame
- Creates a temporary view or Hive table for SQL queries
- Cleans and transforms the data based on business requirements
- Converts and stores the data in requested file formats such as CSV, ORC, JSON, and Parquet
- Creates visualizations of the processed data using Matplotlib
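The sketch below walks through those stages end to end. It is a minimal illustration, not the notebook itself: the endpoint URL and the column names (`agency_name`, `budget_authority_amount`) are assumptions about the Toptier Agencies data, and the actual notebook may clean and aggregate differently.

```python
import json
import urllib.request

import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-rest-api").getOrCreate()

# Extract: pull JSON from the REST API (assumed endpoint).
url = "https://api.usaspending.gov/api/v2/references/toptier_agencies/"
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)

# Create a DataFrame from the list of result records; the schema is inferred.
df = spark.createDataFrame(payload["results"])

# Register a temporary view so the data can be queried with SQL.
df.createOrReplaceTempView("agencies")

# Transform: an illustrative cleaning/aggregation query
# (column names are assumptions about this dataset).
top10 = spark.sql("""
    SELECT agency_name, budget_authority_amount
    FROM agencies
    WHERE budget_authority_amount IS NOT NULL
    ORDER BY budget_authority_amount DESC
    LIMIT 10
""")

# Load: write the result in the requested file formats.
top10.write.mode("overwrite").parquet("output/top10.parquet")
top10.write.mode("overwrite").option("header", True).csv("output/top10.csv")
top10.write.mode("overwrite").json("output/top10.json")
top10.write.mode("overwrite").orc("output/top10.orc")

# Visualize the processed data with Matplotlib.
pdf = top10.toPandas()
plt.barh(pdf["agency_name"], pdf["budget_authority_amount"])
plt.xlabel("Budget authority (USD)")
plt.tight_layout()
plt.savefig("output/top10_agencies.png")
```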
For a detailed explanation of the ETL processes used, check out the accompanying article on Medium.
You can find the code for this project here.
File overview:
ETL_Pipeline_with_Apache_Spark_using_a_Rest_API.ipynb
- the full code for this project
To follow this project, please install the following locally (a quick sanity check follows the list):
- Python 3.8+ (the os and sys modules used in the notebook ship with the standard library)
- Spark 3.2.1
- Python packages:
  - pyspark (pyspark.sql.functions is included)
  - matplotlib
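A minimal sanity check that the local Spark/PySpark install works, assuming the versions listed above: start a session, confirm the version, and stop cleanly.

```python
from pyspark.sql import SparkSession

# Start a local session and verify the runtime version.
spark = SparkSession.builder.appName("install-check").getOrCreate()
print("Spark version:", spark.version)  # expect 3.2.1 per the list above
spark.stop()
```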
You can access the USA Spending API used for the project here; a short snippet below shows how to inspect the response:
- Toptier Agencies - REST API data source
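Before wiring the API into the pipeline, it helps to look at the raw response shape. This is a sketch assuming the standard Toptier Agencies endpoint, where the records live under a "results" key.

```python
import json
import urllib.request

# Assumed endpoint for the Toptier Agencies data source.
url = "https://api.usaspending.gov/api/v2/references/toptier_agencies/"
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)

print(len(payload["results"]), "agencies returned")
print(json.dumps(payload["results"][0], indent=2))  # inspect one record's fields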
To understand the terminology used to describe the columns in the data, you can preview this document:
- Basic Federal Budgeting Terminology