
data-engineering-practice's Introduction

Data Engineering Practice Problems

One of the main obstacles in Data Engineering is the large and varied set of technical skills that can be required on a day-to-day basis.

*** Note - If you email a link to your GitHub repo with all the completed exercises, I will send you back a free copy of my ebook Introduction to Data Engineering. ***

The aim of this repository is to help you develop and learn those skills. Generally, here are the high-level topics that these practice problems will cover.

  • Python data processing.
  • csv, flat-file, parquet, json, etc.
  • SQL database table design.
  • Python + Postgres, data ingestion and retrieval.
  • PySpark
  • Data cleansing / dirty data.

How to work on the problems.

You will need two things to work effectively on almost all of these problems.

  • Docker
  • docker-compose

All the tools and technologies you need will be packaged into the Dockerfile for each exercise.

For each exercise you will need to cd into that folder and run the docker build command. That command will be listed in the README for each exercise; follow those instructions.

Beginner Exercises

Exercise 1 - Downloading files.

The first exercise tests your ability to download a number of files from an HTTP source and unzip them, storing them locally with Python. cd Exercises/Exercise-1 and see README in that location for instructions.
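
For illustration, here is a minimal sketch of the general download-and-unzip pattern, assuming a hypothetical download_uris list (the real URLs are in the exercise README):

import zipfile
from pathlib import Path

import requests

# Hypothetical URL for illustration -- the real list is in the exercise README.
download_uris = [
    "https://example.com/data/trips_2018_Q4.zip",
]

def main():
    downloads = Path("downloads")
    downloads.mkdir(exist_ok=True)
    for uri in download_uris:
        zip_path = downloads / uri.split("/")[-1]
        response = requests.get(uri, timeout=60)
        response.raise_for_status()
        zip_path.write_bytes(response.content)
        # Extract the contents next to the zip, then remove the zip itself.
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(downloads)
        zip_path.unlink()

if __name__ == "__main__":
    main()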

Exercise 2 - Web Scraping + Downloading + Pandas

The second exercise tests your ability to perform web scraping, build URIs, download files, and use Pandas to do some simple cumulative actions. cd Exercises/Exercise-2 and see README in that location for instructions.
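
As a rough sketch of the scraping half, assuming a hypothetical directory-listing page where each table row holds a file name and a Last Modified timestamp (BASE_URL and TARGET_TS are placeholders; the real values are in the exercise README):

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/data/"   # placeholder
TARGET_TS = "2022-02-07 14:03"           # placeholder

def main():
    html = requests.get(BASE_URL, timeout=60).text
    soup = BeautifulSoup(html, "html.parser")
    # Directory listings typically put the file name and the Last Modified
    # timestamp in adjacent cells of each table row.
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2 and cells[1] == TARGET_TS:
            df = pd.read_csv(BASE_URL + cells[0])
            print(df.max(numeric_only=True))  # a simple cumulative action
            break

if __name__ == "__main__":
    main()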

Exercise 3 - Boto3 AWS + s3 + Python.

The third exercise tests a few skills. This time we will be using a popular AWS package called boto3 to perform a multi-step action: downloading some open source s3 data files. cd Exercises/Exercise-3 and see README in that location for instructions.
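
A minimal sketch of the multi-step pattern, using the Common Crawl layout the exercise references (note that a default boto3 client needs AWS credentials configured; see the NoCredentialsError issue further down for an anonymous-access workaround):

import gzip

import boto3

def main():
    s3 = boto3.client("s3")
    # Step 1: download the gzipped index file, which lists s3 keys line by line.
    s3.download_file("commoncrawl", "crawl-data/CC-MAIN-2022-05/wet.paths.gz", "wet.paths.gz")
    with gzip.open("wet.paths.gz", "rt") as f:
        first_key = f.readline().strip()
    # Step 2: fetch the first referenced object and peek at its leading bytes
    # (the object itself is gzip-compressed).
    obj = s3.get_object(Bucket="commoncrawl", Key=first_key)
    print(obj["Body"].read(100))

if __name__ == "__main__":
    main()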

Exercise 4 - Convert JSON to CSV + Ragged Directories.

The fourth exercise focuses on the file types json and csv, and on working with them in Python. You will have to traverse a ragged directory structure, finding any json files and converting them to csv.
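
A minimal sketch of the traversal, assuming each json file holds a single (possibly nested) object; the data/ root directory is an assumption:

import csv
import json
from pathlib import Path

def flatten(record, parent_key=""):
    # Flatten nested objects into dot-separated column names.
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key))
        else:
            items[new_key] = value
    return items

def main():
    # rglob walks the directory tree however ragged it is.
    for json_path in Path("data").rglob("*.json"):
        row = flatten(json.loads(json_path.read_text()))
        with open(json_path.with_suffix(".csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=row.keys())
            writer.writeheader()
            writer.writerow(row)

if __name__ == "__main__":
    main()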

Exercise 5 - Data Modeling for Postgres + Python.

The fifth exercise is going to be a little different from the rest. In this problem you will be given a number of csv files. You must create a data model / schema to hold these data sets, including indexes, then create all the tables inside Postgres by connecting to the database with Python.
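
A minimal sketch of the connect-create-insert flow with psycopg2, using a hypothetical accounts table (the real schemas come from the exercise's csv files, and the connection values from its docker-compose file):

import psycopg2

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS accounts (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    city        TEXT
);
CREATE INDEX IF NOT EXISTS idx_accounts_city ON accounts (city);
"""

def main():
    # Hypothetical connection values -- match them to the docker-compose file.
    conn = psycopg2.connect(host="postgres", database="postgres", user="postgres", password="postgres")
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(CREATE_SQL)
            cur.execute(
                "INSERT INTO accounts (customer_id, first_name, last_name, city) VALUES (%s, %s, %s, %s)",
                (1, "Ada", "Lovelace", "London"),
            )
    conn.close()

if __name__ == "__main__":
    main()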

Intermediate Exercises

Exercise 6 - Ingestion and Aggregation with PySpark.

The sixth exercise is going to step it up a little and move on to more popular tools. In this exercise we are going to load some files using PySpark and then be asked to do some basic aggregation. Best of luck!
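
A minimal sketch of the load-then-aggregate shape, with placeholder paths and column names (the exercise ships zipped csv files; Spark reads plain csvs, so extract them first):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def main():
    spark = SparkSession.builder.appName("Exercise6").getOrCreate()
    # "data/*.csv" and the column names below are placeholders.
    df = spark.read.csv("data/*.csv", header=True, inferSchema=True)
    out = df.groupBy("start_station_name").agg(
        F.avg("tripduration").alias("avg_duration")
    )
    out.show()
    spark.stop()

if __name__ == "__main__":
    main()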

Exercise 7 - Using Various PySpark Functions

The seventh exercise takes a page out of the previous one, focusing on using a few of the more common built-in PySpark functions (pyspark.sql.functions) and applying them to real-life problems.

Many times, to solve simple problems, we have to find and use multiple functions available from a library. This will test your ability to do that.
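
For a taste of what that looks like, here is a sketch chaining a few common built-ins (the input path and column names are placeholders):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def main():
    spark = SparkSession.builder.appName("Exercise7").getOrCreate()
    df = spark.read.csv("data/example.csv", header=True)  # placeholder path
    df = (
        df.withColumn("source_file", F.input_file_name())
          .withColumn("file_date", F.to_date(F.col("date"), "yyyy-MM-dd"))
          .withColumn("brand", F.split(F.col("model"), " ").getItem(0))
    )
    df.select("source_file", "file_date", "brand").show(5, truncate=False)
    spark.stop()

if __name__ == "__main__":
    main()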

Exercise 8 - Using DuckDB for Analytics and Transforms.

Using new tools is imperative to growing as a Data Engineer, and DuckDB is one of those new tools. In this eighth exercise you will have to complete a number of analytical and transformation tasks using DuckDB. This will require an understanding of the functions and documentation of DuckDB.
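
A minimal sketch of the DuckDB flow, with a placeholder csv path and column names:

import duckdb

def main():
    con = duckdb.connect()  # in-memory database
    # read_csv_auto infers the schema from the file.
    con.execute("CREATE TABLE trips AS SELECT * FROM read_csv_auto('data/trips.csv')")
    rows = con.execute("""
        SELECT start_station, count(*) AS rides
        FROM trips
        GROUP BY start_station
        ORDER BY rides DESC
        LIMIT 5
    """).fetchall()
    print(rows)

if __name__ == "__main__":
    main()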

Exercise 9 - Using Polars lazy computation.

Polars is a new Rust-based tool with a wonderful Python package that has taken Data Engineering by storm. It's better than Pandas because it has both a SQL context and support for lazy evaluation of larger-than-memory data sets! Show your lazy skills!
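
A minimal sketch of a lazy query, with a placeholder path and column names (recent Polars releases spell the method group_by; older ones used groupby):

import polars as pl

def main():
    # scan_csv builds a LazyFrame: nothing is read until .collect(), which
    # lets Polars optimize the plan and stream larger-than-memory files.
    lazy = (
        pl.scan_csv("data/trips.csv")
          .filter(pl.col("trip_distance") > 0)
          .group_by("pickup_zone")
          .agg(pl.col("trip_distance").mean().alias("avg_distance"))
    )
    print(lazy.collect())

if __name__ == "__main__":
    main()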

data-engineering-practice's People

Contributors

cclauss, danielbeach

data-engineering-practice's Issues

Exercise-2 docker build hanging on building wheel for pandas (pyproject.toml)

On exercise 2 when I try to run docker build --tag=exercise-2 . it gets to Building wheel for pandas (pyproject.toml)... then hangs for 20+ minutes. Is this expected?

I tried upgrading pip (pip install --upgrade pip) before building the image, as suggested here, but no luck.

I cancelled and then attempted again so you can see my terminal, but I previously left it running for 20+ minutes.
[screenshot of terminal output]

Exercise 2 - 102 files matching last updated ts of `2022-02-07 14:03`

Instructions: You are looking for the file that was Last Modified on 2022-02-07 14:03, you can't cheat and lookup the file number yourself.

I am planning on having the code identify the proper file, but even when checking manually, there are 102 files with this same last updated timestamp:
[screenshot showing 102 files with the same timestamp]

Is this intended as part of the exercise?

Also, it would be nice to have some sort of solutions/answers available to check whether we completed the exercise properly.

Exercise-3 "botocore.exceptions.NoCredentialsError: Unable to locate credentials"

When I try to print a list of the files in the s3 bucket, the console says "botocore.exceptions.NoCredentialsError: Unable to locate credentials"

Here is my code:

import boto3

def main():
    s3 = boto3.client('s3')
    # download_file saves the object to the local path and returns None,
    # so there is no response object to read afterwards.
    s3.download_file('commoncrawl', 'crawl-data/CC-MAIN-2022-05/wet.paths.gz', 'wet.paths.gz')

if __name__ == "__main__":
    main()
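
For reference, one way around this error for a public bucket like commoncrawl is to make anonymous (unsigned) requests, so no credentials are needed:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

def main():
    # An unsigned client skips credential lookup entirely.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file("commoncrawl", "crawl-data/CC-MAIN-2022-05/wet.paths.gz", "wet.paths.gz")
    # Note: download_file writes to disk and returns None -- there is no
    # response object with a .content attribute.

if __name__ == "__main__":
    main()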

Exercise-6 docker image won't start

Hey Daniel, I've been loving this so far, thanks for putting it together! I finally made it to exercise 6 but when I run "docker-compose up run" I get this error message (see below) and the docker container won't start. I've never used pyspark before, so I have no idea how I could troubleshoot this.

s\GitHub\data-engineering-practice\Exercises\Exercise-6> docker-compose up run
[+] Running 1/0
 - Container exercise-6-run-1  Created                                                                             0.0s
Attaching to exercise-6-run-1
exercise-6-run-1  | WARNING: An illegal reflective access operation has occurred
exercise-6-run-1  | WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
exercise-6-run-1  | WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
exercise-6-run-1  | WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
exercise-6-run-1  | WARNING: All illegal access operations will be denied in a future release
exercise-6-run-1  | 22/10/13 01:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
exercise-6-run-1  | Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkContext: Running Spark version 3.0.1
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: Resources for spark.driver:
exercise-6-run-1  |
exercise-6-run-1  | 22/10/13 01:03:55 INFO ResourceUtils: ==============================================================
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkContext: Submitted application: Exercise6
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls to: root
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls to: root
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing view acls groups to:
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: Changing modify acls groups to:
exercise-6-run-1  | 22/10/13 01:03:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
exercise-6-run-1  | 22/10/13 01:03:55 INFO Utils: Successfully started service 'sparkDriver' on port 36347.
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering MapOutputTracker
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMaster
exercise-6-run-1  | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
exercise-6-run-1  | 22/10/13 01:03:55 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
exercise-6-run-1  | 22/10/13 01:03:55 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-72b58301-91c5-4d58-b06a-c81c8deea2bc
exercise-6-run-1  | 22/10/13 01:03:55 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
exercise-6-run-1  | 22/10/13 01:03:55 INFO SparkEnv: Registering OutputCommitCoordinator
exercise-6-run-1  | 22/10/13 01:03:56 INFO Utils: Successfully started service 'SparkUI' on port 4040.
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://06fa963d87c7:4040
exercise-6-run-1  | 22/10/13 01:03:56 INFO Executor: Starting executor ID driver on host 06fa963d87c7
exercise-6-run-1  | 22/10/13 01:03:56 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37915.
exercise-6-run-1  | 22/10/13 01:03:56 INFO NettyBlockTransferService: Server created on 06fa963d87c7:37915
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMasterEndpoint: Registering block manager 06fa963d87c7:37915 with 434.4 MiB RAM, BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 06fa963d87c7, 37915, None)
exercise-6-run-1  | 22/10/13 01:03:56 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/app/spark-warehouse').
exercise-6-run-1  | 22/10/13 01:03:56 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'.
exercise-6-run-1  | ['Divvy_Trips_2019_Q4.zip', 'Divvy_Trips_2020_Q1.zip']
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkContext: Invoking stop() from shutdown hook
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkUI: Stopped Spark web UI at http://06fa963d87c7:4040
exercise-6-run-1  | 22/10/13 01:03:56 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
exercise-6-run-1  | 22/10/13 01:03:56 INFO MemoryStore: MemoryStore cleared
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManager: BlockManager stopped
exercise-6-run-1  | 22/10/13 01:03:56 INFO BlockManagerMaster: BlockManagerMaster stopped
exercise-6-run-1  | 22/10/13 01:03:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
exercise-6-run-1  | 22/10/13 01:03:56 INFO SparkContext: Successfully stopped SparkContext
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Shutdown hook called
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0/pyspark-12e6df8c-60f3-4687-8d09-651ece254651
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-2b13cb58-fac7-4349-88dc-add6ed84faf0
exercise-6-run-1  | 22/10/13 01:03:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-7da8b58f-f65f-4ac9-beb2-c4128e23bb07
exercise-6-run-1 exited with code 0
PS E:\Documents\GitHub\data-engineering-practice\Exercises\Exercise-6>

Practice with Orchestrator

Hi,

Maybe it's a good idea to also add a section regarding Data Orchestrators (Dagster, Prefect, Mage, Airflow, etc.).
Understanding and debugging a data pipeline is a crucial part of Data Engineering.

What do you think?

Riccardo

Exercise 2 -- No files last modified in 2022

Hey everyone, I think the page might have changed; I can't find a file modified in 2022 at the link provided. Could you confirm? I assume the web scraping is only supposed to be done on that link. I guess I could search all the links, but would that be the objective?
Thanks!!!

Unable to connect to the Postgres DB

I ran the commands docker build --tag=exercise-5 . and docker-compose up run and I get this error:
[screenshot of the error]

I noticed when I check the logs for the exercise-5-postgres-1 container it says this:

[screenshot of the exercise-5-postgres-1 container logs]
