The purpose of the data engineering capstone project is to give you a chance to combine what you've learned throughout the program. This project will be an important part of your portfolio that will help you achieve your data engineering-related career goals.
In this project, you can choose to complete the project provided for you, or define the scope and data for a project of your own design. Either way, you'll be expected to go through the same steps outlined below.
- Identify and gather the data you'll be using for your project (at least two sources and more than 1 million rows). See Project Resources for ideas of what data you can use.
- Explain what end use cases you'd like to prepare the data for (e.g., analytics table, app back-end, source-of-truth database, etc.)
- Explore the data to identify data quality issues, like missing values, duplicate data, etc.
- Document steps necessary to clean the data
- Map out the conceptual data model and explain why you chose that model
- List the steps necessary to pipeline the data into the chosen data model
- Create the data pipelines and the data model
- Include a data dictionary
- Run data quality checks to ensure the pipeline ran as expected
  - Integrity constraints on the relational database (e.g., unique key, data type, etc.)
  - Unit tests for the scripts to ensure they are doing the right thing
  - Source/count checks to ensure completeness
- What's the goal? What queries will you want to run? How would Spark or Airflow be incorporated? Why did you choose the model you chose?
- Clearly state the rationale for the choice of tools and technologies for the project.
- Document the steps of the process.
- Propose how often the data should be updated and why.
- Post your write-up and final data model in a GitHub repo.
- Include a description of how you would approach the problem differently under the following scenarios:
  - If the data was increased by 100x.
  - If the pipelines were run on a daily basis by 7 a.m.
  - If the database needed to be accessed by 100+ people.
The project I created expands U.S. immigration data with additional dimensions, giving analysts wider possibilities when analyzing the data. I used a star schema with one fact table. For cleaning the data, manipulating it, and building the data warehouse structure I used Spark, because it makes operating on large data sets easy; one consequence of this choice is that the data warehouse is saved as parquet files. The use case for this analytical database is to look at the data through the prism of different dimensions and their connections. For example, we can check whether there is a connection between immigration and demographic or temperature data: the warehouse lets us look for correlations between large agglomerations and immigration, or between warmer places and immigration.
I implemented a star schema, the typical schema for a data warehouse; together with the snowflake schema, it is one of the two most popular data warehouse designs. For cleaning the data, building the warehouse structure, and all operations on data frames and files I used Spark, and the warehouse is saved as parquet files. I chose Spark because it makes manipulating large files easy; although the data for this task is not yet that large, I decided it was big enough to justify using Spark.
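The write-to-parquet step described above can be sketched roughly as follows. This is an illustrative sketch, not the repo's actual code: the function names and the choice of partition columns for the fact table are assumptions.

```python
def parquet_path(output_dir, table):
    """Build the output path for one table of the warehouse (assumed layout)."""
    return f"{output_dir.rstrip('/')}/{table}.parquet"

def save_warehouse(spark_tables, output_dir):
    """Write every DataFrame of the star schema to its own parquet file.

    spark_tables: dict mapping table name -> Spark DataFrame.
    The immigration fact table is partitioned by arrival year/month
    (an assumption) so analysts can prune partitions when querying a
    date range; dimension tables are small and written unpartitioned.
    """
    for name, df in spark_tables.items():
        writer = df.write.mode("overwrite")
        if name == "immigration":
            # Fact table: partition for faster time-ranged scans.
            writer = writer.partitionBy("arrival_year", "arrival_month")
        writer.parquet(parquet_path(output_dir, name))
```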
- immigration - records with immigration data
- cicid bigint
- code_port string
- code_state string
- visapost string
- occup string
- entdepa string
- entdepd string
- entdepu string
- matflag string
- dtaddto string
- gender string
- insnum string
- airline string
- admnum double
- fltno string
- visatype string
- code_visa integer
- code_mode integer
- code_country_origin integer
- code_country_city integer
- year integer
- month integer
- birth_year integer
- age integer
- counter_summary integer
- arrival_date date
- departure_date date
- arrival_year integer
- arrival_month integer
- arrival_day integer
- demographic - demographic data
- state_code string
- city string
- state string
- median_age string
- male_population string
- female_population string
- total_population string
- number_of_veterans string
- foreign_born string
- average_household_size string
- american_indian_and_alaska_native integer
- asian integer
- black_or_african_american integer
- hispanic_or_atino integer
- white integer
- temperature - temperature data
- port_code string
- month integer
- avg_tempertature double
- airport - airports data
- ident string
- type string
- name string
- elevation_ft string
- continent string
- iso_country string
- iso_region string
- municipality string
- gps_code string
- iata_code string
- local_code string
- coordinates string
- state - state codes data
- state_code string
- state_name string
- country - country codes data
- country_code string
- country_name string
- visa - visa codes data
- visa_id integer
- visa_type string
- mode - mode codes data
- mode_id integer
- mode_name string
- port - port codes data
- port_code string
- port_name string
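An example of the kind of analytical query this star schema enables, relating immigration volume to state population (the "big agglomerations" use case mentioned earlier). The table and column names come from the data dictionary above; registering the parquet tables as temp views, and the helper name, are assumptions. Note that `total_population` is typed as a string in the dictionary, hence the explicit cast.

```python
# Arrivals per destination state alongside that state's total population,
# aggregated from the city-level demographic table.
IMMIGRATION_BY_POPULATION_SQL = """
WITH state_pop AS (
    SELECT state_code,
           SUM(CAST(total_population AS BIGINT)) AS population
    FROM demographic
    GROUP BY state_code
)
SELECT i.code_state,
       p.population,
       COUNT(*) AS arrivals
FROM immigration i
JOIN state_pop p ON i.code_state = p.state_code
GROUP BY i.code_state, p.population
ORDER BY arrivals DESC
"""

def immigration_by_population(spark):
    """Run the query, assuming both tables are registered as temp views
    (e.g. via df.createOrReplaceTempView)."""
    return spark.sql(IMMIGRATION_BY_POPULATION_SQL)
```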
- csv_data - folder that contains raw data for immigration, demographics, airports and column labels
- static_files - folder that contains static files (for example: picture of database schema)
- data_cleaning.py - functions for cleaning and preparing tables
- data_gathering.py - functions for gathering raw data
- data_paths.py - paths to the project's data files
- data_saving.py - functions for saving database to PySpark parquet
- data_validation.py - functions for data quality checks
- etl.py - ETL script for the whole process
- etl_jupyter.ipynb - Jupyter Notebook with details and explanations
- README.md - project description
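A sketch of how etl.py could chain the modules above. The per-step function names here are illustrative assumptions; only the module names come from the project layout.

```python
def run_pipeline(steps):
    """Run named ETL steps in order; the first failure stops the pipeline.

    steps: list of (name, callable) pairs, e.g.
        [("gather",   gather_raw_data),     # data_gathering.py
         ("clean",    clean_tables),        # data_cleaning.py
         ("save",     save_to_parquet),     # data_saving.py
         ("validate", run_quality_checks)]  # data_validation.py
    """
    results = {}
    for name, step in steps:
        print(f"Running step: {name}")
        results[name] = step()
    return results
```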
- Read the raw data from CSV and SAS files.
- Process the data:
  - Extract the information for the dimension tables from the SAS label files.
  - Clean and prepare all the dimension and fact tables.
- Insert the whole data warehouse into parquet:
  - Insert the immigration fact table into parquet.
  - Insert all the dimension tables into parquet.
- Run quality checks on the data warehouse:
  - Quality check for rows.
  - Quality check for joins.
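The two checks above can be sketched as simple guard functions; the function names are illustrative, and the counts are assumed to come from Spark's `DataFrame.count()` on each table and on the fact-to-dimension join.

```python
def check_has_rows(table_name, count):
    """Row check: the pipeline should have produced at least one row."""
    if count <= 0:
        raise ValueError(f"Quality check failed: {table_name} is empty")
    return count

def check_join_keys(fact_count, joined_count, dimension_name):
    """Join check: joining the fact table to a dimension must not drop rows,
    i.e. every fact row must find a matching dimension key."""
    if joined_count < fact_count:
        missing = fact_count - joined_count
        raise ValueError(
            f"Quality check failed: {missing} fact rows have no match "
            f"in {dimension_name}"
        )
    return joined_count
```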
We can run the whole ETL process with a single command:

```
python etl.py
```

Here is a screenshot of how the whole ETL process looks in the console: