
Data Engineering Zoomcamp

This is my collection of notes and code from following the DataTalks Data Engineering Zoomcamp.

Deadlines

πŸ—“οΈ Project's timeline

| Module | Start Date | Homework Due | Weeks to complete | Videos | Duration | Notes |
|---|---|---|---|---|---|---|
| 1. Introduction & Prerequisites | 15 Jan | 25 Jan | 2 | x9 | 2h 50m | 📝 |
| 2. Workflow Orchestration | 29 Jan | 05 Feb | 1 | x11 | 1h 32m | 📝 |
| 3. Data Warehouse | 05 Feb | 12 Feb | 1 | x6 | 1h 01m | 📝 |
| dlt workshop | 05 Feb | 15 Feb | 1.5 | x1 | 1h 20m | 📝 |
| 4. Analytics Engineering | 15 Feb | 22 Feb | 1 | x10 | 2h 41m | 📝 |
| 5. Batch processing | 22 Feb | 04 Mar | 1.5 | | | 📝 |
| 6. Streaming | 04 Mar | 15 Mar | 1.5 | | | 📝 |
| RisingWave workshop | 04 Mar | 18 Mar | n/a | | | 📝 |
| Project (attempt 1) | 18 Mar | 01 Apr | 2 | | | 📝 |
| Project evaluation (attempt 1) | 01 Apr | 08 Apr | 1 | | | 📝 |
| Project (attempt 2) | 01 Apr | 15 Apr | 2 | | | 📝 |
| Project evaluation (attempt 2) | 15 Apr | 29 Apr | 1 | | | 📝 |

Prep

Here is a checklist of what you need:

  • Set up a virtual environment for Python development
  • Install Docker Desktop
  • Get a Google Cloud account
  • Install Terraform (you can follow the docs or, like me, install it in a conda environment)

Create a Python virtual environment

I use mamba to manage my virtual environments; see env.yaml for the requirements (this file will be updated as I move through the course).

Install Docker Desktop

Setting up Docker with Windows 11 and WSL is very easy. Assuming WSL is already installed, install Docker Desktop on Windows. To enable the docker CLI on your distro of choice within WSL, just adjust the settings in Docker Desktop:

  • Settings > Resources > WSL integration
  • Select the distros in which you want to use docker commands.
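
To confirm that the integration took effect, a quick check from inside the WSL distro can help. The following is a minimal sketch, not part of the course material: the function name `docker_cli_status` is hypothetical, and it only asks the docker client for its version, so it does not prove that the Docker Desktop daemon is running.

```python
import shutil
import subprocess


def docker_cli_status(timeout: float = 10.0) -> str:
    """Describe whether the docker CLI can be found and run in this shell.

    Hypothetical helper for illustration only.
    """
    # shutil.which returns None when no `docker` executable is on PATH.
    if shutil.which("docker") is None:
        return "docker not found on PATH"
    try:
        # `docker --version` queries only the client, not the daemon.
        result = subprocess.run(
            ["docker", "--version"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"docker found but could not be run: {exc}"
    if result.returncode != 0:
        return f"docker exited with code {result.returncode}"
    return result.stdout.strip()


if __name__ == "__main__":
    print(docker_cli_status())
```

If this reports that docker is not on PATH, the distro is probably not ticked under WSL integration, or the WSL session was opened before the setting was applied.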

Modules

1. Introduction and Prerequisites

This section will cover Docker, running postgres and pgAdmin containers, some SQL basics and setting up cloud resources in Google Cloud using Terraform.
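
As a taste of the SQL basics involved, here is a small sketch of a join with a grouped aggregate. It makes several assumptions: it uses Python's built-in sqlite3 as a stand-in for the postgres container, and the table names, column names and rows are made up for illustration rather than taken from the course dataset.

```python
import sqlite3


def zone_revenue() -> list:
    """Join toy trips to toy zones and aggregate revenue per zone."""
    # In-memory SQLite database: a stand-in for postgres, used here only
    # so the example runs without any container.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE zones (zone_id INTEGER PRIMARY KEY, zone_name TEXT);
        CREATE TABLE trips (
            trip_id INTEGER PRIMARY KEY,
            pickup_zone_id INTEGER,
            total_amount REAL
        );
        -- Hypothetical sample rows, not real taxi data.
        INSERT INTO zones VALUES (1, 'Airport'), (2, 'Downtown');
        INSERT INTO trips VALUES (1, 1, 52.0), (2, 1, 48.0), (3, 2, 12.5);
        """
    )
    rows = conn.execute(
        """
        SELECT z.zone_name, COUNT(*) AS trip_count, SUM(t.total_amount) AS revenue
        FROM trips AS t
        JOIN zones AS z ON z.zone_id = t.pickup_zone_id
        GROUP BY z.zone_name
        ORDER BY revenue DESC
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for zone_name, trip_count, revenue in zone_revenue():
        print(zone_name, trip_count, revenue)
```

The same SELECT should run unchanged against postgres once the tables exist there; only the connection code would differ.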

📚 Resources

📺 Videos

Bonus videos:

2. Workflow Orchestration

This section covers workflow orchestration with Mage.

📚 Resources

📺 Videos

Deployment videos (marked as optional, but these were pretty crucial for me):

Office hours recording here.

3. Data Warehouse

In this section we will talk about data warehousing in general and use Google BigQuery as an example.

📚 Resources

📺 Videos

4. Analytics Engineering

📚 Resources

📺 Videos

Optional video (but watch this first if, like me, you still don't have the full green and yellow taxi data in GCP or a local postgres db): Hack for loading data to BigQuery

5. Batch Processing

📚 Resources

📺 Videos

  • 1: Introduction to Batch Processing
  • 2: Introduction to Spark
  • 3: First Look at Spark/PySpark
  • 4: Spark Dataframes
  • 5: SQL with Spark
  • 6: Anatomy of a Spark Cluster
  • 7: GroupBy in Spark
  • 8: Joins in Spark

9m 30s +

Optional:

Workshops

dlt

The workshop quickly covers how to build data ingestion pipelines using dlt. It includes:

  • Extracting data from APIs or files
  • Normalizing and loading data
  • Incremental loading
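
The three ideas above can be sketched in plain Python. This is deliberately not the dlt API: the helper names `flatten` and `load_incrementally`, the `__` separator, the `updated_at` cursor field and the sample records are all assumptions made for illustration. A list stands in for the destination table.

```python
from typing import Any, Dict, Iterable, List, Optional


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = "__") -> Dict[str, Any]:
    """Normalize a nested dict into a single flat row."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so nested keys become prefixed column names.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def load_incrementally(
    source: Iterable[Dict[str, Any]],
    destination: List[Dict[str, Any]],
    cursor_field: str,
    last_value: Optional[str] = None,
) -> Optional[str]:
    """Append only rows newer than last_value and return the new cursor."""
    newest = last_value
    for record in source:
        row = flatten(record)
        value = row[cursor_field]
        # Skip rows already loaded in a previous run.
        if last_value is not None and value <= last_value:
            continue
        destination.append(row)
        if newest is None or value > newest:
            newest = value
    return newest


if __name__ == "__main__":
    # Hypothetical records; ISO dates compare correctly as strings.
    first_batch = [
        {"id": 1, "updated_at": "2024-01-01", "rider": {"name": "a"}},
        {"id": 2, "updated_at": "2024-01-02", "rider": {"name": "b"}},
    ]
    second_batch = first_batch + [
        {"id": 3, "updated_at": "2024-01-03", "rider": {"name": "c"}},
    ]
    table: List[Dict[str, Any]] = []
    cursor = load_incrementally(first_batch, table, "updated_at")
    cursor = load_incrementally(second_batch, table, "updated_at", cursor)
    print(len(table), cursor)
```

The second run sees all three records but appends only the one whose `updated_at` is past the stored cursor, which is the core of incremental loading: the pipeline remembers how far it got and skips everything up to that point.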

📚 Resources

📺 Video
