Build, run, and monitor data pipelines at scale
Prepared for O'Reilly Media
- Kalise Richmond - Sales Engineer, Prefect
- Nathan Nowack - Solutions Engineer, Prefect
Data engineers and scientists spend much of their time on negative, or defensive, engineering: writing code to handle unpredictable failures such as resources going down, APIs failing intermittently, or malformed data corrupting pipelines. Workflow orchestration tools help eliminate negative engineering, freeing engineers and scientists to focus on the problems they are actually solving. Modern data applications have evolved, and orchestrators such as Prefect now provide more runtime flexibility and the ability to leverage distributed compute through Dask.
Discover how workflow orchestration can free you up to build solutions, not just avert failures. You’ll learn about basic orchestration features such as retries, scheduling, parameterization, caching, and secret management, and you’ll construct real data pipelines.
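To make "negative engineering" concrete, here is a minimal plain-Python sketch of the retry boilerplate that an orchestrator absorbs for you. This is illustrative only, not Prefect's implementation; in Prefect 2 the same behavior is a single argument on the task decorator (e.g. retries and retry_delay_seconds).

```python
import time
from functools import wraps

def with_retries(max_attempts=3, delay_seconds=0.0):
    """Retry a function on any exception -- the kind of defensive
    boilerplate an orchestrator provides out of the box."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the failure
                    time.sleep(delay_seconds)
        return wrapper
    return decorator

# A hypothetical flaky call that succeeds on its third attempt
calls = {"n": 0}

@with_retries(max_attempts=3)
def flaky_api_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("intermittent failure")
    return "ok"
```

With an orchestrator, none of this wrapper code needs to live in your pipeline: retries, scheduling, and caching are declared as configuration on the task.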
For this course you will need:
- Python 3.7 or later (3.6 is approaching end of life)
- The packages in the requirements.txt file:
  - prefect==2.0b2, for workflow orchestration
  - beautifulsoup4, for web scraping
  - jupyter, for interactive notebooks
Ideally, you should create a virtual environment (conda, pipenv, poetry) to install the dependencies.
To install the requirements with pip:
pip install -r requirements.txt
Docker is a great entrypoint (pun somewhat intended) into the world of engineering! We'll be using it to provide reproducible environments in which to execute our workflows. We also have a section devoted to Docker.
These are optional dependencies, but they were added to requirements.txt for convenience.
For the advanced section of this course, we will use a couple of common data engineering tools:
- your own Airbyte instance
- a Snowflake trial account
- a dbt installation, to run transforms on your warehouse objects
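As an illustrative sketch (not from the course materials), a dbt transform step is often driven from Python by shelling out to the dbt CLI. The helper below is hypothetical; --project-dir is a real dbt CLI flag, and actually running the command requires dbt to be installed with a configured profile.

```python
import subprocess

def dbt_command(subcommand, project_dir="."):
    """Build the argument list for a dbt CLI invocation.

    `subcommand` is a dbt subcommand such as "run" or "test".
    Returned as a list so it can be passed safely to subprocess.run
    without shell quoting issues.
    """
    return ["dbt", subcommand, "--project-dir", project_dir]

# Example (requires dbt installed and a profiles.yml configured):
# subprocess.run(dbt_command("run"), check=True)
```

In a workflow orchestrator, a call like this would typically be wrapped in a task so that failures in the transform step get the same retries and observability as the rest of the pipeline.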
To clone the repo and run it locally:
git clone https://github.com/zzstoatzz/oreilly-workflow-orchestration.git
Each notebook can then be viewed and executed. Some of the code extends beyond the notebooks, since data workflows glue together other tools (sometimes non-Python ones).
For any questions, feel free to reach out to us!
- Kalise - [email protected]
- Nate - [email protected]
The Prefect Slack is also a good resource for questions about Prefect and workflow orchestration.
Listed below are the documentation pages for the tools used:
Data Movement
Distributed Computing