
whirl's Introduction

whirl

Fast iterative local development and testing of Apache Airflow workflows.

The idea of whirl is pretty simple: use Docker containers to start up Apache Airflow and the other components used in your workflow. This gives you a copy of your production environment that runs on your local machine. You can run your DAG locally from start to finish - with the same code as in production. Seeing your pipeline succeed gives you more confidence about the logic you are creating/refactoring and how it integrates with other components. It also gives new developers an isolated environment for experimenting with your workflows.

whirl connects the code of your DAG and your (mock) data to the Apache Airflow container that it spins up. Using volume mounts you are able to make changes to your code in your favorite IDE and immediately see the effect in the running Apache Airflow UI on your machine. This also works with custom Python modules that you are developing and using in your DAGs.

NOTE: whirl is not intended to replace proper (unit) testing of the logic you are orchestrating with Apache Airflow.

Prerequisites

whirl relies on Docker and Docker Compose. Make sure you have both installed. If you are using Docker for Mac or Windows, ensure that you have configured it with sufficient RAM (8 GB or more recommended) for running all your containers.

When you want to use whirl in your CI pipeline, you need to have jq installed. For example, with Homebrew:

brew install jq

The current implementation was developed on macOS but is intended to work on any platform supported by Docker. In our experience, Linux and macOS work fine. You can run it on native Windows 10 using WSL. Unfortunately, Docker on Windows 10 (version 1809) is hamstrung because it relies on Windows File Sharing (CIFS) to establish the volume mounts. Airflow hammers the volume a little harder than CIFS can handle, and you will see intermittent FileNotFound errors on the volume mount. This may improve in the future; for now, running whirl inside a Linux VM in Hyper-V gives more reliable results.

Airflow Versions

As of January 2021, whirl uses Airflow 2.x.x as the default version. A specific tag was made for Airflow 1.10.x, which can be found among this repository's tags.

Getting Started

Development

Clone this repository:

git clone https://github.com/godatadriven/whirl.git <target directory of whirl>

For ease of use, you can add the base directory to your PATH environment variable so that the whirl command is available everywhere.

export PATH=<target directory of whirl>:${PATH}

Use the release

Download the latest Whirl release artifact

Extract the file (for example into /usr/local/opt)

tar -xvzf whirl-release.tar.gz -C /usr/local/opt

Make sure the whirl script is available on your PATH:

export PATH=/usr/local/opt/whirl:$PATH

Usage

The whirl script is used to perform all actions.

Getting usage information

$ whirl -h
$ whirl --help

Starting whirl

The default action is to start the DAG in your current directory.

With the [-x example] command-line argument you can run whirl from anywhere and tell it which example DAG to run. The example refers to a directory with the same name in the examples directory located next to the whirl script.

whirl expects an environment to be configured. You can pass this as a command-line argument [-e environment] or configure it as the environment variable WHIRL_ENVIRONMENT in a .whirl.env file (see Configuring environment variables). The environment refers to a directory with the same name in the envs directory located next to the whirl script.

$ whirl [-x example] [-e <environment>] [start]

Specifying the start command is a more explicit way to start whirl.

Stopping whirl

$ whirl [-x example] [-e <environment>] stop

Stops the configured environment.

If you want to stop all containers from a specific environment, you can add the -e or --environment command-line argument with the name of the environment. This name corresponds to a directory in the envs directory.

Usage in a CI Pipeline

We run most of the examples from within our own CI (GitHub Actions); see our GitHub workflow for the implementation details.

You can run an example in CI mode on your local system by using the whirl ci command. This will:

  • run the Docker containers daemonized in the background;
  • ensure the DAG(s) are unpaused; and
  • wait for the pipeline to either succeed or fail.

Upon success, the containers are stopped and whirl exits successfully.

In case of failure (or success, when a failure is expected), we print the logs of the failed task and clean up before indicating that the pipeline has failed.
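For instance, to run one example in CI mode locally (angle-bracketed names are placeholders, as elsewhere in this README):

$ whirl -x <example to run> -e <environment to use> ci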

Configuring Environment Variables

Instead of using the environment option each time you run whirl, you can also configure your environment in a .whirl.env file. This file can live in three places, which are applied in order (a sketch of such a file follows this list):

  • A .whirl.env file in the root of this repository. This can also specify a default environment to be used when starting whirl. You do this by setting WHIRL_ENVIRONMENT, which references a directory in the envs folder. This repository contains an example you can modify; it specifies the default PYTHON_VERSION to be used in any environment.
  • A .whirl.env file in your envs/{your-env} subdirectory. The environment directory to use can be set by any of the other .whirl.env files or specified on the command line. This is helpful for setting environment-specific variables. Of course it doesn't make much sense to set WHIRL_ENVIRONMENT here.
  • A .whirl.env in your DAG directory to override any environment variables. This can be useful, for example, to override the (default) WHIRL_ENVIRONMENT.
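As a sketch, a .whirl.env that selects an environment and a Python version might look like this (both values are illustrative):

WHIRL_ENVIRONMENT=<your-environment>
PYTHON_VERSION=3.8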

Internal environment variables

Inside the whirl script the following environment variables are set:

Environment Variable    Value                              Description
DOCKER_CONTEXT_FOLDER   ${SCRIPT_DIR}/docker               Base build context folder for Docker builds referenced in Docker Compose.
ENVIRONMENT_FOLDER      ${SCRIPT_DIR}/envs/<environment>   Base folder of the environment to start; contains docker-compose.yml and environment-specific preparation scripts.
DAG_FOLDER              $(pwd)                             Current working directory; used as the Airflow DAG folder and can contain preparation scripts for this specific DAG.
PROJECTNAME             $(basename ${DAG_FOLDER})          Name of the DAG directory.
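As an illustration, the last two variables are derived from the directory you start whirl from; in shell terms (a sketch, not the literal script contents):

DAG_FOLDER=$(pwd)
PROJECTNAME=$(basename ${DAG_FOLDER})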

Structure

This project is based on docker-compose and the notion of different environments in which Airflow is the central component. The rest of each environment depends on the tools/setup of the production environment in your situation.

The whirl script combines the DAG and the environment to make a fully functional setup.

To accommodate different examples:

  • The environments are split up into separate environment-specific directories inside the envs/ directory.
  • The DAGs are split into sub-directories in the examples/ directory.

Environments

Environments use Docker Compose to start containers which together mimic your production environment. The basis of the environment is the docker-compose.yml file, which at a minimum declares the Airflow container to run. Extra tools (e.g. S3, SFTP) can be linked together in the docker-compose file to form your specific environment.

Each environment also contains some setup code needed for Airflow to understand the environment, for example Connections and Variables. Each environment has a whirl.setup.d/ directory which is mounted in the Airflow container. On startup all scripts in this directory are executed. This is a location for installing and configuring extra client libraries that are needed to make the environment function correctly; for example awscli if S3 access is required.
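For example, a hypothetical environment script in whirl.setup.d/ could install a client library and register a connection using the standard Airflow 2.x CLI (the connection name and endpoint below are made up):

#!/usr/bin/env bash
# Hypothetical setup script: install the S3 client and register a connection.
pip install awscli
airflow connections add local_s3 \
    --conn-type aws \
    --conn-extra '{"host": "http://s3server:4563"}'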

DAGs

The DAGs in this project are inside the examples/ directory. In your own project you can have your code in its own location outside this repository.

Each example directory contains at least one example DAG. Project-specific code can also be placed there. As with the environments, the DAG directory can contain a whirl.setup.d/ directory, which is also mounted inside the Airflow container. Upon startup, all scripts in this directory are executed; the environment-specific whirl.setup.d/ runs first, followed by the DAG-specific one.

This is also a location for installing and configuring extra client libraries that are needed to make the DAG function correctly; for example a mock API endpoint.
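As an illustrative sketch, a DAG-level script could start a simple mock HTTP endpoint in the background (the fixture path and port are hypothetical):

#!/usr/bin/env bash
# Hypothetical: serve static fixture files as a mock API while the DAG runs.
cd /mock-data && python3 -m http.server 8080 &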

Examples

This repository contains some example environments and workflows. The components used might serve as a starting point for your own environment. If you have a good example you'd like to add, please submit a pull request!

Each example contains its own README file explaining the specifics of that example.

Generic running of examples

To run an example:

$ cd ./examples/<example-dag-directory>
# Note: here we pass the whirl environment as a command-line argument. It can also be configured with the WHIRL_ENVIRONMENT variable
$ whirl -e <environment to use>

or

# Note: here we pass the whirl environment as a command-line argument. It can also be configured with the WHIRL_ENVIRONMENT variable
$ whirl -x <example to run> -e <environment to use>

Open your browser to http://localhost:5000 to access the Airflow UI. Manually enable the DAG and watch the pipeline run to successful completion.
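If you prefer the command line, the DAG can also be unpaused from inside the Airflow container using the standard Airflow 2.x CLI (the container name below is illustrative; check docker ps for the actual one):

$ docker exec <airflow-container> airflow dags unpause <dag-id>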

References

An early version of whirl was brought to life at ING. Bas Beelen gave a presentation describing how whirl was helpful in their infrastructure during the 2nd Apache Airflow Meetup, held on January 23, 2019, at Google's London HQ.

Whirl explained at Apache Airflow Meetup

whirl's People

Contributors

abij, asnare, barend-xebia, basph, basvdl, bjgbeelen, danielvdende, jasperhg90, krisgeus, mwallace582, pugillum, sechitwood


whirl's Issues

Add >2.1.x Support

Airflow removed the AIRFLOW_GID parameter, causing the whirl build to break for Airflow versions 2.2+ (see the relevant Airflow commit and the failing line in whirl).

Exact error message when setting airflow_version=2.2.2:

 => [ 3/10] RUN mkdir -p "/etc/airflow/whirl.setup.d/dag.d"                                                                                                                                                                                                      0.3s
 => ERROR [ 4/10] RUN chown -R airflow:airflow "/etc/airflow/whirl.setup.d"                                                                                                                                                                                      0.3s
------
 > [ 4/10] RUN chown -R airflow:airflow "/etc/airflow/whirl.setup.d":
#7 0.246 chown: invalid group: 'airflow:airflow'
------
executor failed running [/bin/bash -o pipefail -c chown -R airflow:airflow "${WHIRL_SETUP_FOLDER}"]: exit code: 1

This can be fixed by updating whirl's Python Dockerfile like so:

RUN chown -R airflow:0 "${WHIRL_SETUP_FOLDER}"

Happy to help remediate!

Use Whirl in your CI

Let's see if we can use whirl in our CI to actually run a DAG and test that the DagRun succeeded.

Whirl fails to build base image

When trying to run an example I am getting an error:

cd $WHIRL_DIR/examples/localhost-ssh-example
whirl start -e local-ssh

Package openjdk-8-jre-headless is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'openjdk-8-jre-headless' has no installation candidate
The command '/bin/sh -c mkdir -p /usr/share/man/man1   && update-ca-certificates -f   && apt-get update   && apt-get install -y --no-install-recommends --reinstall build-essential   && apt-get install -y --no-install-recommends --allow-unauthenticated      software-properties-common      wget      dnsutils      vim      git      default-libmysqlclient-dev      gcc      openjdk-8-jre-headless      ${AIRFLOW_BUILD_DEPS}      ${AIRFLOW_APT_DEPS}      nginx      gosu      sudo   && apt-get clean   && apt-get autoremove -yqq --purge   && rm -rf /var/cache/apk/* /var/lib/apt/lists/*   && (find /usr/share/doc -type f -and -not -name copyright -print0 | xargs -0 -r rm)' returned a non-zero code: 100
Pulling airflow (docker-whirl-airflow:py-3.6-local)...
ERROR: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.

Continue with the new image? [yN]y
Pulling airflow (docker-whirl-airflow:py-3.6-local)...
ERROR: pull access denied for docker-whirl-airflow, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Allow users to specify the folder to look for environments

Feature request:
Users should be able to specify where whirl looks for environment directories, e.g. a -d flag to specify the folder that contains the environments (defaulting to SCRIPT_DIR).

Use cases:
Embedding the environments in source control with DAGs to allow test environments to move around with code.
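A hypothetical invocation of the proposed flag could look like:

$ whirl -d ./my-project/environments -e my-env start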

Extract Spark from Airflow image

Right now, we install the complete Spark distribution in the Airflow container.
We want Spark to run in a separate container.
We only need the Spark client/driver in the Airflow container.

Add commandline arguments for Whirl

Provide environment and start/stop commands.

e.g.

whirl -e default (start)
whirl -e default start (start daemon)
whirl -e default stop (stop)

Divolte issue

Hi @krisgeus,
Sorry for writing to you here, but the Divolte repository has been archived by its owner and I am stuck on a problem with the Divolte session ID:

1. On every page refresh I get a new session ID. Why doesn't the session ID persist until the configured session timeout?
2. How can we implement a function to close the session manually, without waiting for inactivity?

Can you please guide me on how to solve this? It would be really helpful for me.

Thanks

No module named 'werkzeug.wrappers.json' error

Hi @bjgbeelen,
Can you advise on the error below? This is my first time experimenting with whirl, using the first example, "SSH to Localhost".
There are a couple of "No module named 'werkzeug.wrappers.json'" errors at step 29/29, after which the container exits with code 1.

airflow_1  | ModuleNotFoundError: No module named 'werkzeug.wrappers.json'; 'werkzeug.wrappers' is not a package
local-ssh_airflow_1 exited with code 1

Change detect_dag function

Right now this is hardcoded to the filename dag.py. See if we can check the contents of the file to detect a DAG instead.
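A minimal sketch of such a content-based check (hypothetical; not the current implementation):

# Hypothetical: treat any Python file mentioning both Airflow and DAG as a DAG file.
for f in *.py; do
  grep -q 'airflow' "$f" && grep -q 'DAG' "$f" && echo "DAG candidate: $f"
done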

Add README

Describe what Whirl is and how to use it
