Before following the next steps, make sure you have `docker` and `docker-compose` installed and that you are in the folder containing the `Dockerfile` and `docker-compose.yml`.
- Create a `.env` file following the `.env.sample` (a minimal example is shown below):
  - `AIRFLOW_UID`: user ID that is running the docker commands
  - `AIRFLOW_GID`: group ID of that user
  - `_AIRFLOW_WWW_USER_USERNAME`: username for the Airflow webserver. You can erase it and `airflow` will be the default username
  - `_AIRFLOW_WWW_USER_PASSWORD`: same as above but for the password
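  A minimal `.env` could look like this (the values are only illustrative; use `id -u` and `id -g` to find your own IDs, and pick your own credentials):

  ```
  AIRFLOW_UID=1000
  AIRFLOW_GID=1000
  _AIRFLOW_WWW_USER_USERNAME=airflow
  _AIRFLOW_WWW_USER_PASSWORD=airflow
  ```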
- Build the image with `docker image build -t extending_airflow .`
- Initialize Airflow with `docker-compose up airflow-init` (or `docker compose`, depending on your docker version)
- After initialization is complete, you should see a message like this: `start_airflow-init_1 exited with code 0`
- Run `docker-compose up -d`
- Check with `docker ps` that you have the DB, scheduler and webserver running (see the example after this list).
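One quick way to check is with the standard `docker ps` `--format` flag (container names depend on your compose project name, so yours may differ):

```bash
docker ps --format "table {{.Names}}\t{{.Status}}"
```

You should see one container each for the metadata database, the scheduler and the webserver, all with an `Up` (and ideally `healthy`) status.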
The pipeline uses a database running on DigitalOcean (`dvdrental`), which is a PostgreSQL sample database.
The following diagram gives you more details about how the pipeline works.
All scripts executed in Dataproc are in the `pyspark` folder.
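For reference, here is a minimal sketch of how such a PySpark script could read a table from the `dvdrental` PostgreSQL database over JDBC. The host, credentials, table name and JDBC driver coordinates below are placeholders/assumptions; the actual scripts in the `pyspark` folder may use different settings.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read one table from the dvdrental PostgreSQL database over JDBC.
# The driver package, host, credentials and table name are placeholders.
spark = (
    SparkSession.builder
    .appName("dvdrental-example")
    # Make the PostgreSQL JDBC driver available; on Dataproc it can also be
    # supplied with --jars or spark.jars.packages at submit time.
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0")
    .getOrCreate()
)

films = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<db-host>:5432/dvdrental")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "film")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

films.show(5)
```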