Docker ETL

This repo is a collection of dockerized ETL jobs, gathered in one place to make the source code of scheduled ETL easier to discover. It also contains tools that automate the common steps of creating and scheduling an ETL job: defining a Docker image, setting up CI, and generating language boilerplate. The primary use of this repo is to create Dockerized jobs that are pushed to GCR so they can be scheduled via the Airflow GKE pod operator.

Project Structure

Jobs

Each job is located in its own directory under jobs/. For example, the contents of a job named my-job go into jobs/my-job.

All job directories should have a Dockerfile, a ci_job.yaml, a ci_workflow.yaml, and a README.md in the root directory. ci_job.yaml and ci_workflow.yaml contain the YAML that will be placed in the jobs: and workflows: sections of the CircleCI config.yml, respectively.
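The required-files rule above can be checked mechanically. The following is a minimal sketch, not part of the repo's tooling: the helper name missing_job_files is hypothetical, and only the four file names come from this README.

```python
from pathlib import Path

# Files this README says every job directory should contain.
REQUIRED_JOB_FILES = ("Dockerfile", "ci_job.yaml", "ci_workflow.yaml", "README.md")


def missing_job_files(job_dir):
    """Return the required file names that are absent from job_dir.

    Hypothetical helper for illustration; the repo may validate jobs differently.
    """
    job_dir = Path(job_dir)
    return [name for name in REQUIRED_JOB_FILES if not (job_dir / name).is_file()]
```

Calling missing_job_files("jobs/my-job") returns an empty list when the job directory is complete.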

Templates

Templates for job creation and the CI config file are located in templates/.

The CI config template is templates/config.template.yml. Modify this file rather than .circleci/config.yml, which is generated from it.

Each job template is located in a directory in templates/ that is the name of the template, e.g. a python template is in templates/python/. Within the directory of a template is a directory named job/ that contains all the contents that will be copied when the template is used. Other files in the directory of a particular template are used for job creation, e.g. ci_job.template.yaml.

Example Directory Structure:

+--docker-etl/
|  +--jobs/
|     +--example-python-1/
|        +--ci_job.yaml
|        +--ci_workflow.yaml
|        +--Dockerfile
|        +--README.md
|        +--script
|  +--templates/
|     +--python/
|        +--job/
|           +--module/
|           +--tests/
|           +--Dockerfile
|           +--README.md
|           +--requirements.txt
|        +--ci_job.template.yaml
|        +--ci_workflow.template.yaml

Development

The tools in this repository are intended for Python 3.8+.

To install dependencies:

pip install -r requirements.txt

This project uses pip-tools to pin dependencies. New dependencies go in requirements.in and pip-compile is used to generate requirements.txt:

pip install pip-tools
pip-compile --generate-hashes requirements.in

To run tests:

pytest --flake8 --black tests/

Adding a new job

To add a new job:

./script/create_job --job-name example-job --template python

job-name is the name of the directory that will be created in jobs/.

template is an optional argument that will populate the created directory with the contents of a template. If no template is given, a directory with only the required files is created.
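Conceptually, using a template amounts to copying its job/ directory into a new directory under jobs/. The sketch below illustrates that step only. It is an assumption-laden illustration rather than the actual script/create_job implementation: the function name is hypothetical, it assumes the default template is stored like any other template under templates/, and it leaves out how the ci_*.template.yaml files are turned into the job's CI files.

```python
import shutil
from pathlib import Path


def create_job_from_template(repo_root, job_name, template="default"):
    """Copy templates/<template>/job/ into jobs/<job_name>/ and return the new path.

    Hypothetical helper: generating ci_job.yaml and ci_workflow.yaml from the
    ci_*.template.yaml files is deliberately left out.
    """
    repo_root = Path(repo_root)
    source = repo_root / "templates" / template / "job"
    target = repo_root / "jobs" / job_name
    if not source.is_dir():
        raise ValueError(f"unknown template: {template}")
    if target.exists():
        raise ValueError(f"job already exists: {job_name}")
    # copytree creates the target itself, so only its parent must exist.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source, target)
    return target
```

Refusing to overwrite an existing job directory is a design choice of this sketch; it keeps a mistyped job name from clobbering existing work.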

Available Templates:

Template name   Description
default         Base directory with readme, Dockerfile, and CI config files
python          Simple Python module with unit test and lint config

Modifying the CI config

This repo uses CircleCI, which allows only a single global config file. To simplify adding jobs to CI and removing them, the config file is generated from templates. This means the config.yml in .circleci/ should not be modified directly.

Generate .circleci/config.yml:

./script/update_ci_config

To make changes to the config that are not ETL job specific (e.g. add a command), changes should be made to templates/config.template.yml and the output config should be re-generated.

Each job has a ci_job.yaml and a ci_workflow.yaml, which define the steps that go into the jobs and workflows sections of the CircleCI config. After any change to these files, update the global config by running script/update_ci_config. When a job is created, its CI files are generated from the ci_*.template.yaml files in the template directory.
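To make the generation step concrete, here is a rough sketch of how per-job fragments could be merged into one config. It is not the logic of script/update_ci_config: the __JOBS__ and __WORKFLOWS__ placeholders are invented for this example, the function name is hypothetical, and the real template syntax may be entirely different.

```python
import textwrap
from pathlib import Path


def render_ci_config(template_text, jobs_dir):
    """Fill a config template with every job's CI fragments.

    Assumes hypothetical __JOBS__ and __WORKFLOWS__ placeholders, each on its
    own line in the template; the real template syntax may differ.
    """
    job_parts, workflow_parts = [], []
    # Sort so the generated file is stable from run to run.
    for job_dir in sorted(Path(jobs_dir).iterdir()):
        if not job_dir.is_dir():
            continue
        job_parts.append((job_dir / "ci_job.yaml").read_text().rstrip())
        workflow_parts.append((job_dir / "ci_workflow.yaml").read_text().rstrip())
    # Indent fragments so they nest under the jobs: and workflows: keys.
    jobs_block = textwrap.indent("\n".join(job_parts), "  ")
    workflows_block = textwrap.indent("\n".join(workflow_parts), "  ")
    return template_text.replace("__JOBS__", jobs_block).replace(
        "__WORKFLOWS__", workflows_block
    )
```

Because the output is derived entirely from the template and the per-job files, regenerating it after every change keeps the global config in step with the jobs directory.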

Adding a template

To add a new template, create a new directory in templates/ with the name of the template. This directory must have a ci_job.template.yaml, a ci_workflow.template.yaml, and a job/ directory which contains all the files that will be copied to any job that uses this template.
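A new template is only usable once all three pieces are in place. As a small sketch under the layout described above, the following lists the directories in templates/ that qualify; the function name usable_templates is hypothetical and not part of the repo.

```python
from pathlib import Path

# Files this README says every template directory must contain.
REQUIRED_TEMPLATE_FILES = ("ci_job.template.yaml", "ci_workflow.template.yaml")


def usable_templates(templates_dir):
    """Return sorted names of template directories that have every required piece."""
    names = []
    for path in Path(templates_dir).iterdir():
        # Skip plain files that live alongside the template directories.
        if not path.is_dir():
            continue
        has_files = all((path / name).is_file() for name in REQUIRED_TEMPLATE_FILES)
        if has_files and (path / "job").is_dir():
            names.append(path.name)
    return sorted(names)
```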
