auto-compose is a utility for dynamically generating Google Cloud Composer (managed Apache Airflow) DAGs from YAML configuration files. It is a fork of dag-factory and uses its logic to parse YAML files and convert them into Airflow DAGs.
To run auto-compose without checking out the GitHub repository, run:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/suchitpuri/auto-compose/master/scripts/bootstrap.sh)"
```

It requires Docker, which has all the required dependencies baked in.
You can also check out the repository and run:

```bash
/bin/bash ./scripts/bootstrap.sh
```
Once you run auto-compose, it will ask you for the following details:

- project-id: your GCP project ID. auto-compose uses the environment's existing GCP authentication, so if you are not logged in, run `gcloud auth login` (or a similar command) before running auto-compose.
- composer-id: the name/ID of the Composer environment. You can get it from the Name column at https://console.cloud.google.com/composer/environments (or via gcloud, as sketched after this list).
- composer-location: the region (e.g. asia-northeast1) where Composer is running. You can get it from the Location column at https://console.cloud.google.com/composer/environments.
- YAML file absolute path: the absolute path of the YAML file. A correct absolute path is needed so that Docker can mount the file.
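If you prefer the command line over the console for looking up these values, the Cloud SDK can list your Composer environments. A minimal sketch, assuming `gcloud` is installed; `my-gcp-project` and `asia-northeast1` are placeholder values:

```bash
# Authenticate first if this environment is not already logged in to GCP.
gcloud auth login

# List Composer environments (name and state) in the given project/region.
# Replace my-gcp-project and asia-northeast1 with your own values.
gcloud composer environments list \
  --project=my-gcp-project \
  --locations=asia-northeast1
```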
To deploy a DAG to Airflow managed by Google Cloud Composer, you first need to create a YAML configuration file. For example:
```yaml
default:
  default_args:
    owner: 'default_owner'
    start_date: 2019-08-02
    email: ['[email protected]']
    email_on_failure: True
    retries: 1
    email_on_retry: True
  max_active_runs: 1
  schedule_interval: '0 * * * */1'

bq_dag_complex:
  default_args:
    owner: 'add_your_ldap'
    start_date: 2019-02-14
  description: 'this is a sample bigquery dag which runs every day'
  tasks:
    query_1:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2018`'
      use_legacy_sql: false
    query_2:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2017`'
      dependencies: [query_1]
      use_legacy_sql: false
    query_3:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2016`'
      dependencies: [query_1]
      use_legacy_sql: false
    query_4:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2015`'
      dependencies: [query_1, query_2]
      use_legacy_sql: false
    query_5:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2014`'
      dependencies: [query_3]
      use_legacy_sql: false

bq_dag_simple:
  default_args:
    owner: 'add_your_ldap'
    start_date: 2019-02-14
  description: 'this is a sample bigquery dag which runs every 12 hours'
  schedule_interval: '0 */12 * * *'
  tasks:
    query_1:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2018`'
      use_legacy_sql: false
    query_2:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2017`'
      dependencies: [query_1]
      use_legacy_sql: false
    query_3:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2016`'
      dependencies: [query_1]
      use_legacy_sql: false
```
You can see that it uses all the familiar Airflow semantics, like default args, schedule interval, max active runs, and more. You can find a complete list here.
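For a sense of what this saves you, here is roughly the hand-written DAG that the `bq_dag_simple` entry above replaces. This is a sketch against the Airflow 1.x contrib operator named in the YAML, not the exact code auto-compose generates:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'add_your_ldap',
    'start_date': datetime(2019, 2, 14),
}

with DAG(
    dag_id='bq_dag_simple',
    default_args=default_args,
    description='this is a sample bigquery dag which runs every 12 hours',
    schedule_interval='0 */12 * * *',
) as dag:
    query_1 = BigQueryOperator(
        task_id='query_1',
        bql='SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2018`',
        use_legacy_sql=False,
    )
    query_2 = BigQueryOperator(
        task_id='query_2',
        bql='SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2017`',
        use_legacy_sql=False,
    )
    query_3 = BigQueryOperator(
        task_id='query_3',
        bql='SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2016`',
        use_legacy_sql=False,
    )

    # dependencies: [query_1] in the YAML becomes explicit ordering here.
    query_1 >> [query_2, query_3]
```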
The best part is that you can currently use any of the following operators in your YAML file directly, without any configuration:
- Logging
- GoogleCloudBaseHook
- BigQuery
- Cloud Spanner
- Cloud SQL
- Cloud Bigtable
- Compute Engine
- Cloud Functions
- Cloud DataFlow
- Cloud DataProc
- Cloud Datastore
- Cloud ML Engine
- Cloud Storage
- Transfer Service
- Google Kubernetes Engine
And this DAG will be generated and ready to run in Airflow!
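Under the hood this relies on the upstream dag-factory machinery: a small Python loader in the Composer DAGs folder reads the YAML and registers the resulting DAGs. auto-compose handles this step for you, but a minimal sketch of the upstream dag-factory pattern, with an illustrative YAML path, looks like this:

```python
# Loader script placed in the Airflow/Composer dags/ folder.
# Based on the upstream dag-factory usage; auto-compose deploys the
# equivalent for you, so you normally never write this by hand.
from airflow import DAG  # noqa: F401  (the scheduler scans for this import)
import dagfactory

# Path to the YAML configuration shown above (illustrative).
dag_factory = dagfactory.DagFactory("/home/airflow/gcs/dags/config.yml")

# Register the DAGs defined in the YAML into this module's globals
# so the Airflow scheduler picks them up.
dag_factory.generate_dags(globals())
```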
Why use auto-compose?

- Construct DAGs without knowing Python
- Construct DAGs without learning Airflow primitives
- Avoid duplicative code
- Use any of the available Google Cloud operators
- Everyone loves YAML! ;)
Contributions are welcome! Just submit a Pull Request or GitHub Issue.