
data-engineering-practice's Introduction

Data Engineering Practice Problems

One of the main obstacles in Data Engineering is the large and varied set of technical skills that can be required on a day-to-day basis.

*** Note - If you email a link to your GitHub repo with all the completed exercises, I will send you back a free copy of my ebook Introduction to Data Engineering. ***

The aim of this repository is to help you develop and learn those skills. Generally, here are the high-level topics that these practice problems will cover.

  • Python data processing.
  • csv, flat-file, parquet, json, etc.
  • SQL database table design.
  • Python + Postgres, data ingestion and retrieval.
  • PySpark
  • Data cleansing / dirty data.

How to work on the problems.

You will need two things to work effectively on almost all of these problems.

  • Docker
  • docker-compose

All the tools and technologies you need will be packaged into the Dockerfile for each exercise.

For each exercise you will need to cd into that folder and run the docker build command. That command will be listed in the README for each exercise; follow those instructions.

Beginner Exercises

Exercise 1 - Downloading files.

The first exercise tests your ability to download a number of files from an HTTP source and unzip them, storing them locally with Python. cd Exercises/Exercise-1 and see README in that location for instructions.
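
For illustration, here is a minimal sketch of the general download-and-unzip pattern, assuming a hypothetical download_uris list (the real URLs are in the exercise README):

import zipfile
from pathlib import Path

import requests

# Hypothetical URL for illustration -- the real list is in the exercise README.
download_uris = [
    "https://example.com/data/trips_2018_Q4.zip",
]

def main():
    downloads = Path("downloads")
    downloads.mkdir(exist_ok=True)
    for uri in download_uris:
        zip_path = downloads / uri.split("/")[-1]
        response = requests.get(uri, timeout=60)
        response.raise_for_status()
        zip_path.write_bytes(response.content)
        # Extract the contents next to the zip, then remove the zip itself.
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(downloads)
        zip_path.unlink()

if __name__ == "__main__":
    main()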

Exercise 2 - Web Scraping + Downloading + Pandas

The second exercise tests your ability to perform web scraping, build URIs, download files, and use Pandas to do some simple cumulative actions. cd Exercises/Exercise-2 and see README in that location for instructions.
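
As a rough sketch of the scraping half, assuming a hypothetical directory-listing page where each table row holds a file name and a Last Modified timestamp (BASE_URL and TARGET_TS are placeholders; the real values are in the exercise README):

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/data/"   # placeholder
TARGET_TS = "2022-02-07 14:03"           # placeholder

def main():
    html = requests.get(BASE_URL, timeout=60).text
    soup = BeautifulSoup(html, "html.parser")
    # Directory listings typically put the file name and the Last Modified
    # timestamp in adjacent cells of each table row.
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2 and cells[1] == TARGET_TS:
            df = pd.read_csv(BASE_URL + cells[0])
            print(df.max(numeric_only=True))  # a simple cumulative action
            break

if __name__ == "__main__":
    main()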

Exercise 3 - Boto3 AWS + s3 + Python.

The third exercise tests a few skills. This time we will be using a popular AWS package called boto3 to perform a multi-step action: downloading some open source s3 data files. cd Exercises/Exercise-3 and see README in that location for instructions.
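
A minimal sketch of the multi-step pattern, using the Common Crawl layout the exercise references (note that a default boto3 client needs AWS credentials configured; see the NoCredentialsError issue further down for an anonymous-access workaround):

import gzip

import boto3

def main():
    s3 = boto3.client("s3")
    # Step 1: download the gzipped index file, which lists s3 keys line by line.
    s3.download_file("commoncrawl", "crawl-data/CC-MAIN-2022-05/wet.paths.gz", "wet.paths.gz")
    with gzip.open("wet.paths.gz", "rt") as f:
        first_key = f.readline().strip()
    # Step 2: fetch the first referenced object and peek at its leading bytes
    # (the object itself is gzip-compressed).
    obj = s3.get_object(Bucket="commoncrawl", Key=first_key)
    print(obj["Body"].read(100))

if __name__ == "__main__":
    main()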

Exercise 4 - Convert JSON to CSV + Ragged Directories.

The fourth exercise focuses on the file types json and csv, and on working with them in Python. You will have to traverse a ragged directory structure, finding any json files and converting them to csv.
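
A minimal sketch of the traversal, assuming each json file holds a single (possibly nested) object; the data/ root directory is an assumption:

import csv
import json
from pathlib import Path

def flatten(record, parent_key=""):
    # Flatten nested objects into dot-separated column names.
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key))
        else:
            items[new_key] = value
    return items

def main():
    # rglob walks the directory tree however ragged it is.
    for json_path in Path("data").rglob("*.json"):
        row = flatten(json.loads(json_path.read_text()))
        with open(json_path.with_suffix(".csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=row.keys())
            writer.writeheader()
            writer.writerow(row)

if __name__ == "__main__":
    main()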

Exercise 5 - Data Modeling for Postgres + Python.

The fifth exercise is going to be a little different from the rest. In this problem you will be given a number of csv files. You must create a data model / schema to hold these data sets, including indexes, then create all the tables inside Postgres by connecting to the database with Python.
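
A minimal sketch of the connect-create-insert flow with psycopg2, using a hypothetical accounts table (the real schemas come from the exercise's csv files, and the connection values from its docker-compose file):

import psycopg2

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS accounts (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    city        TEXT
);
CREATE INDEX IF NOT EXISTS idx_accounts_city ON accounts (city);
"""

def main():
    # Hypothetical connection values -- match them to the docker-compose file.
    conn = psycopg2.connect(host="postgres", database="postgres", user="postgres", password="postgres")
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(CREATE_SQL)
            cur.execute(
                "INSERT INTO accounts (customer_id, first_name, last_name, city) VALUES (%s, %s, %s, %s)",
                (1, "Ada", "Lovelace", "London"),
            )
    conn.close()

if __name__ == "__main__":
    main()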

Intermediate Exercises

Exercise 6 - Ingestion and Aggregation with PySpark.

The sixth exercise is going to step it up a little and move on to more popular tools. In this exercise we are going to load some files using PySpark and then be asked to do some basic aggregation. Best of luck!
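
A minimal sketch of the load-then-aggregate shape, with placeholder paths and column names (the exercise ships zipped csv files; Spark reads plain csvs, so extract them first):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def main():
    spark = SparkSession.builder.appName("Exercise6").getOrCreate()
    # "data/*.csv" and the column names below are placeholders.
    df = spark.read.csv("data/*.csv", header=True, inferSchema=True)
    out = df.groupBy("start_station_name").agg(
        F.avg("tripduration").alias("avg_duration")
    )
    out.show()
    spark.stop()

if __name__ == "__main__":
    main()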

Exercise 7 - Using Various PySpark Functions

The seventh exercise takes a page out of the previous one, focusing on using a few of the more common built-in PySpark functions (pyspark.sql.functions) and applying them to real-life problems.

Many times, to solve simple problems, we have to find and use multiple functions available from a library. This will test your ability to do that.
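
For a taste of what that looks like, here is a sketch chaining a few common built-ins (the input path and column names are placeholders):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def main():
    spark = SparkSession.builder.appName("Exercise7").getOrCreate()
    df = spark.read.csv("data/example.csv", header=True)  # placeholder path
    df = (
        df.withColumn("source_file", F.input_file_name())
          .withColumn("file_date", F.to_date(F.col("date"), "yyyy-MM-dd"))
          .withColumn("brand", F.split(F.col("model"), " ").getItem(0))
    )
    df.select("source_file", "file_date", "brand").show(5, truncate=False)
    spark.stop()

if __name__ == "__main__":
    main()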

Exercise 8 - Using DuckDB for Analytics and Transforms.

Using new tools is imperative to growing as a Data Engineer, and DuckDB is one of those new tools. In this eighth exercise you will have to complete a number of analytical and transformation tasks using DuckDB. This will require an understanding of the functions and documentation of DuckDB.
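
A minimal sketch of the DuckDB flow, with a placeholder csv path and column names:

import duckdb

def main():
    con = duckdb.connect()  # in-memory database
    # read_csv_auto infers the schema from the file.
    con.execute("CREATE TABLE trips AS SELECT * FROM read_csv_auto('data/trips.csv')")
    rows = con.execute("""
        SELECT start_station, count(*) AS rides
        FROM trips
        GROUP BY start_station
        ORDER BY rides DESC
        LIMIT 5
    """).fetchall()
    print(rows)

if __name__ == "__main__":
    main()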

Exercise 9 - Using Polars lazy computation.

Polars is a new Rust-based tool with a wonderful Python package that has taken Data Engineering by storm. It's better than Pandas because it has both a SQL context and support for lazy evaluation of larger-than-memory data sets! Show your lazy skills!
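
A minimal sketch of a lazy query, with a placeholder path and column names (recent Polars releases spell the method group_by; older ones used groupby):

import polars as pl

def main():
    # scan_csv builds a LazyFrame: nothing is read until .collect(), which
    # lets Polars optimize the plan and stream larger-than-memory files.
    lazy = (
        pl.scan_csv("data/trips.csv")
          .filter(pl.col("trip_distance") > 0)
          .group_by("pickup_zone")
          .agg(pl.col("trip_distance").mean().alias("avg_distance"))
    )
    print(lazy.collect())

if __name__ == "__main__":
    main()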

data-engineering-practice's People

Contributors

cclauss, danielbeach

data-engineering-practice's Issues

Exercise-2 docker build hanging on building wheel for pandas (pyproject.toml)

On exercise 2 when I try to run docker build --tag=exercise-2 . it gets to Building wheel for pandas (pyproject.toml)... then hangs for 20+ minutes. Is this expected?

I tried upgrading pip (pip install --upgrade pip) before building the image, as suggested here, but no luck.

I cancelled and then attempted again so you can see my terminal, but I previously left it running for 20+ minutes.
[screenshot of terminal output]

Exercise 2 - 102 files matching last updated ts of `2022-02-07 14:03`

Instructions: You are looking for the file that was Last Modified on 2022-02-07 14:03, you can't cheat and lookup the file number yourself.

I am planning on having the code identify the proper file, but even when checking manually, there are 102 files with this same last updated timestamp:
[screenshot showing 102 files with the same timestamp]

Is this intended as part of the exercise?

Also, it would be nice to have some sort of solutions/answers available to check whether we completed the exercise properly.

Exercise-3 "botocore.exceptions.NoCredentialsError: Unable to locate credentials"

When I try to print a list of the files in the s3 bucket, the console says "botocore.exceptions.NoCredentialsError: Unable to locate credentials"

Here is my code:

import boto3

def main():
    s3 = boto3.client('s3')
    # download_file saves the object to the local path and returns None,
    # so there is no response object to read afterwards.
    s3.download_file('commoncrawl', 'crawl-data/CC-MAIN-2022-05/wet.paths.gz', 'wet.paths.gz')

if __name__ == "__main__":
    main()
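
For reference, one way around this error for a public bucket like commoncrawl is to make anonymous (unsigned) requests, so no credentials are needed:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

def main():
    # An unsigned client skips credential lookup entirely.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file("commoncrawl", "crawl-data/CC-MAIN-2022-05/wet.paths.gz", "wet.paths.gz")
    # Note: download_file writes to disk and returns None -- there is no
    # response object with a .content attribute.

if __name__ == "__main__":
    main()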

Exercise-6 docker image won't start

Hey Daniel, I've been loving this so far, thanks for putting it together! I finally made it to exercise 6 but when I run "docker-compose up run" I get this error message (see below) and the docker container won't start. I've never used pyspark before, so I have no idea how I could troubleshoot this.

s\GitHub\data-engineering-practice\Exercises\Exercise-6> docker-compose up run
[+] Running 1/0
 - Container exercise-6-run-1  Created                                                                             0.0s
Attaching to exercise-6-run-1
exercise-6-run-1  | WARNING: An illegal reflective access operation has occurred
exercise-6-run-1  | WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
exercise-6-run-1  | WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
exercise-6-run-1  | WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
exercise-6-run-1  | WARNING: All illegal access operations will be denied in a future release
exercise-6-run-1  | 22/10/13 01:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
exercise-6-run-1  | Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkContext: Running Spark version 3.0.1
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: Resources for spark.driver:
exercise-6-run-1  |
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkContext: Submitted application: Exercise6
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls to: root
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls to: root
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls groups to:
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls groups to:
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
exercise-6-run-1  | 22/10/13 01:03:55 INFO Utils: Successfully started service 'sparkDriver' on port 36347.
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering MapOutputTracker
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMaster
exercise-6-run-1  | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
exercise-6-run-1  | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
exercise-6-run-1  | 22/10/13 01:03:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-72b58301-91c5-4d58-b06a-c81c8deea2bc
exercise-6-run-1  | 22/10/13 01:03:55 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering OutputCommitCoordinator
exercise-6-run-1  | 22/10/13 01:03:56 INFO Utils: Successfully started service 'SparkUI' on port 4040.
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://06fa963d87c7:4040
exercise-6-run-1  | 22/10/13 01:03:56 INFO Executor: Starting executor ID driver on host 06fa963d87c7
exercise-6-run-1  | 22/10/13 01:03:56 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37915.
exercise-6-run-1  | 22/10/13 01:03:56 INFO NettyBlockTransferService: Server created on 06fa963d87c7:37915
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMasterEndpoint: Registering block manager 06fa963d87c7:37915 with 434.4 MiB RAM, BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/app/spark-warehouse').
exercise-6-run-1  | 22/10/13 01:03:56 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
exercise-6-run-1  | ['Divvy_Trips_2019_Q4.zip', 'Divvy_Trips_2020_Q1.zip']
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkContext: Invoking stop() from shutdown hook
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkUI: Stopped Spark web UI at http://06fa963d87c7:4040
exercise-6-run-1  | 22/10/13 01:03:56 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
exercise-6-run-1  | 22/10/13 01:03:56 INFO MemoryStore: MemoryStore cleared
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: BlockManager stopped
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: BlockManagerMaster stopped
exercise-6-run-1  | 22/10/13 01:03:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkContext: Successfully stopped SparkContext
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Shutdown hook called
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0/pyspark-12e6df8c-60f3-4687-8d09-651ece254651
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-7da8b58f-f65f-4ac9-beb2-c4128e23bb07
exercise-6-run-1 exited with code 0
PS E:\Documents\GitHub\data-engineering-practice\Exercises\Exercise-6>

Practice with Orchestrator

Hi,

Maybe it's a good idea to also add a section regarding Data Orchestrators (Dagster, Prefect, Mage, Airflow, etc.).
Understanding and debugging a data pipeline is a crucial part of Data Engineering.

What do you think?

Riccardo

Exercise 2 -- No files last modified in 2022

Hey everyone, I think the page might have changed; I can't find a file modified in 2022 at the link provided. Could you confirm? I assume the web scraping is only supposed to be done on that link. I guess I could search all the links, but would that be the objective?
Thanks!!!

Unable to connect to the Postgres DB

I ran the commands docker build --tag=exercise-5 . and docker-compose up run and I get this error:
[screenshot of the error]

I noticed when I check the logs for the exercise-5-postgres-1 container it says this:

[screenshot of the exercise-5-postgres-1 container logs]
