Comparison of Python pipeline packages: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX

This article compares open-source Python packages for pipeline/workflow development: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX.

In this article, the terms "pipeline", "workflow", and "DAG" are used almost interchangeably.

Summary

  • πŸ‘: good
  • πŸ‘πŸ‘: better
Package Airflow LuigiΒ Β Β  Gokart Metaflow KedroΒ Β Β  PipelineX
Developer, Maintainer Airbnb, Apache Spotify M3 Netflix Quantum-Black (McKinsey) Yusuke Minami
Wrapped packages Luigi Kedro, MLflow
Easiness/flexibility to define DAG πŸ‘ πŸ‘ πŸ‘ πŸ‘πŸ‘
Modularity of DAG definition πŸ‘πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘
Unstructured data can be passed between tasks πŸ‘πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘
Built-in various data (file/database) existence check wrappers πŸ‘πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘
Built-in various data (file/database) operation (read/write) wrappers πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘
Modularity, reusability, testability of data operation πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘
Automatic resuming option by detecting the intermediate data πŸ‘πŸ‘ πŸ‘πŸ‘ πŸ‘πŸ‘
Force rerun of tasks by detecting parameter change πŸ‘πŸ‘
Save parameters for experiments πŸ‘πŸ‘ πŸ‘πŸ‘
Parallel execution πŸ‘ πŸ‘ πŸ‘ πŸ‘ πŸ‘ πŸ‘
Distributed parallel execution with Celery πŸ‘πŸ‘
Visualization of DAG πŸ‘πŸ‘ πŸ‘ πŸ‘ πŸ‘ πŸ‘
Execution status monitoring in GUI πŸ‘πŸ‘ πŸ‘ πŸ‘
Scheduling, Triggering in GUI πŸ‘
Notification to Slack πŸ‘ πŸ‘

Airflow

https://github.com/apache/airflow

Released in 2015 by Airbnb.

Airflow enables you to define your DAG (workflow) of tasks in Python code (an independent Python module).

(Optionally, unofficial plugins such as dag-factory enable you to define DAGs in YAML.)

Pros:

  • Provides rich GUI with features including DAG visualization, execution progress monitoring, scheduling, and triggering.
  • Provides distributed computing option (using Celery).
  • DAG definition is modular; independent from processing functions.
  • Workflow can be nested using SubDagOperator.
  • Supports Slack notification.
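
The separation between DAG definition and processing functions can be illustrated without Airflow itself. The following is a standard-library sketch of the idea, not Airflow's API: `TASKS`, `DEPENDENCIES`, and `run_dag` are hypothetical names, and the ordering is delegated to Python's `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


# Processing functions know nothing about the DAG they will run in.
def extract():
    return "extract"


def transform():
    return "transform"


def load():
    return "load"


# Hypothetical registry mapping task names to processing functions.
TASKS = {"extract": extract, "transform": transform, "load": load}

# The DAG definition lives apart from the functions:
# each task name maps to the set of task names it depends on.
DEPENDENCIES = {"transform": {"extract"}, "load": {"transform"}}


def run_dag(tasks, dependencies):
    """Run tasks in an order that respects dependencies; return that order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


if __name__ == "__main__":
    print(run_dag(TASKS, DEPENDENCIES))
```

Because the dependencies are plain data, the same processing functions can be rewired into a different DAG without being edited.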

Cons:

  • Not designed to pass data between dependent tasks without using a database. There is no good way to pass unstructured data (e.g. images, videos, pickles) between dependent tasks in Airflow.
  • You need to write file access (read/write) code.
  • Does not support automatically resuming a pipeline from intermediate data files or databases.

Luigi

https://github.com/spotify/luigi

Released in 2012 by Spotify.

Luigi enables you to define your pipeline in Python code as subclasses of Task, each implementing three methods (requires, output, run).

Pros:

  • Supports automatically resuming a pipeline by detecting intermediate data files, stored locally or in the cloud (AWS, GCP, Azure), or databases, as defined in the Task.output method using Target classes.
  • You can write code so any data can be passed between dependent tasks.
  • Provides GUI with features including DAG visualization, execution progress monitoring.
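
The requires/output/run pattern and the resuming behavior can be sketched with the standard library alone. This is a simplified illustration, not Luigi's implementation: the `Task` base class, the `build` function, and the two example tasks are hypothetical, and plain file paths stand in for Luigi's Target classes.

```python
import tempfile
from pathlib import Path


class Task:
    """Hypothetical base class mirroring the requires/output/run pattern."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []  # upstream tasks; none by default

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError


def build(task, executed):
    """Build upstream tasks first; skip any task whose output already exists."""
    for upstream in task.requires():
        build(upstream, executed)
    if not task.output().exists():
        task.run()
        executed.append(type(task).__name__)


class MakeNumbers(Task):
    def output(self):
        return self.workdir / "numbers.txt"

    def run(self):
        self.output().write_text("1\n2\n3\n")


class SumNumbers(Task):
    def requires(self):
        return [MakeNumbers(self.workdir)]

    def output(self):
        return self.workdir / "sum.txt"

    def run(self):
        numbers = self.requires()[0].output().read_text().split()
        self.output().write_text(str(sum(int(n) for n in numbers)))


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        first, second = [], []
        build(SumNumbers(workdir), first)   # no outputs yet: both tasks run
        build(SumNumbers(workdir), second)  # outputs exist: nothing reruns
        total = (Path(workdir) / "sum.txt").read_text()
    return first, second, total
```

Note that reading and writing the files still happens inside each task's `run` method, which is the coupling described in the cons below.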

Cons:

  • You need to write file/database access (read/write) code.
  • Pipeline definition, task processing (Transform of ETL), and data access (Extract and Load of ETL) are tightly coupled and not modular. You need to modify the task classes to reuse them in future projects.

Gokart

https://github.com/m3dev/gokart

Released in Dec 2018 by M3.

Gokart works on top of Luigi.

Pros:

In addition to Luigi's advantages:

  • Can split task processing (Transform of ETL) from pipeline definition using TaskInstanceParameter so you can easily reuse them in future projects.
  • Provides built-in file access (read/write) wrappers as FileProcessor classes for pickle, npz, gz, txt, csv, tsv, json, xml.
  • Saves parameters for each experiment to assure reproducibility. A viewer called thunderbolt is available.
  • Reruns tasks upon parameter change based on hash string unique to the parameter set in each intermediate file name. This feature is useful for experimentation with various parameter sets.
  • Provides syntactic sugar for Luigi's requires method using a class decorator.
  • Supports Slack notification.
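
The rerun-on-parameter-change behavior can be sketched as follows. This is an illustration of the general idea only, not Gokart's code: `param_hash`, `cached_run`, and `scale` are hypothetical names, and the choice of SHA-256 truncated to 8 characters and of JSON as the file format are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def param_hash(params):
    """Return a short, stable hash of a parameter dict (key order is ignored)."""
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:8]


def cached_run(workdir, name, params, func, log):
    """Run func(**params) unless a file for this exact parameter set exists."""
    path = Path(workdir) / f"{name}_{param_hash(params)}.json"
    if path.exists():
        return json.loads(path.read_text())  # same parameters: reuse the result
    result = func(**params)
    path.write_text(json.dumps(result))
    log.append(path.name)  # record that the task actually ran
    return result


def scale(values, factor):
    return [v * factor for v in values]


def demo():
    log = []
    with tempfile.TemporaryDirectory() as workdir:
        a = cached_run(workdir, "scale", {"values": [1, 2], "factor": 2}, scale, log)
        # Identical parameters map to the same file name, so nothing reruns.
        b = cached_run(workdir, "scale", {"values": [1, 2], "factor": 2}, scale, log)
        # A changed parameter yields a new hash and therefore a rerun.
        c = cached_run(workdir, "scale", {"values": [1, 2], "factor": 3}, scale, log)
    return a, b, c, len(log)
```

Since each parameter set gets its own file, results from earlier experiments are kept side by side rather than overwritten.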

Cons:

  • Supported data formats for file access wrappers are limited. You need to write file/database access (read/write) code to use unsupported formats.

Metaflow

https://github.com/Netflix/metaflow

Released in Dec 2019 by Netflix.

Metaflow enables you to define your pipeline in Python code as a subclass of FlowSpec whose methods are decorated with the step decorator.
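
The decorator-based style can be mimicked in a few lines of plain Python. The sketch below is not Metaflow and does not import it: the `step` decorator and the `Flow` base class are hypothetical stand-ins that only show how decorated methods can be chained, with instance attributes carrying data from one step to the next.

```python
def step(func):
    """Hypothetical decorator: mark a method as a pipeline step."""
    func.is_step = True
    return func


class Flow:
    """Hypothetical stand-in for a flow base class; not Metaflow's FlowSpec."""

    def next(self, method):
        self._next = method  # record which step should run after this one

    def run(self):
        """Run from `start`, following next() calls until a step sets none."""
        executed = []
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not decorated with @step")
            self._next = None
            current()
            executed.append(current.__name__)
            current = self._next
        return executed


class WordCountFlow(Flow):
    @step
    def start(self):
        self.text = "to be or not to be"  # attributes carry data between steps
        self.next(self.count)

    @step
    def count(self):
        self.n_words = len(self.text.split())
        self.next(self.end)

    @step
    def end(self):
        pass  # no next() call, so the flow stops here
```

In this style the pipeline structure is written inside the step bodies themselves, which is why the definition and the processing code are hard to separate.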

Pros:

  • Integration with AWS services (especially AWS Batch).

Cons:

  • You need to write file/database access (read/write) code.
  • Pipeline definition, task processing (Transform of ETL), and data access (Extract and Load of ETL) are tightly coupled and not modular. You need to modify the task classes to reuse them in future projects.
  • Does not support GUI.
  • Not much support for GCP & Azure.
  • Does not support automatically resuming a pipeline from intermediate data files or databases.

Kedro

https://github.com/quantumblacklabs/kedro

Released in May 2019 by QuantumBlack, part of McKinsey & Company.

Kedro enables you to define a pipeline in Python code (an independent Python module) as a list of nodes. Each node is created by the node function with 3 arguments: func (the task processing function), inputs (the input data name; a list or dict if multiple), and outputs (the output data name; a list or dict if multiple).
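
To show why naming inputs and outputs is enough to define a DAG, here is a standard-library sketch. It does not use Kedro: `Node`, `run_pipeline`, and the example functions are hypothetical, and for brevity each node takes a list of input names and produces a single named output.

```python
from collections import namedtuple

# Hypothetical stand-in for a node: a function plus input and output data names.
Node = namedtuple("Node", ["func", "inputs", "outputs"])


def run_pipeline(nodes, catalog):
    """Run each node once all of its named inputs are available."""
    data = dict(catalog)
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("unresolvable inputs; check the data names")
        for n in ready:
            data[n.outputs] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


# Processing functions are plain functions with no knowledge of the pipeline.
def clean(raw):
    return [x for x in raw if x is not None]


def mean(values):
    return sum(values) / len(values)


# Nodes are listed out of order on purpose: data names decide execution order.
PIPELINE = [
    Node(func=mean, inputs=["cleaned"], outputs="average"),
    Node(func=clean, inputs=["raw"], outputs="cleaned"),
]

if __name__ == "__main__":
    print(run_pipeline(PIPELINE, {"raw": [1, None, 3]}))
```

The dependency between the two nodes is never stated directly; it follows from the shared data name "cleaned".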

Pros:

  • Provides built-in file/database access (read/write) wrappers as DataSet classes for CSV, Pickle, YAML, JSON, Parquet, Excel, and text in local or cloud (S3 in AWS, GCS in GCP), as well as SQL, Spark, etc.
  • Any data format support can be added by users.
  • Pipeline definition, task processing (Transform of ETL), and data access (Extract and Load of ETL) are independent and modular. You can easily reuse them in future projects.
  • Pipelines can be nested. (A pipeline can be used as a sub-pipeline of another pipeline.)
  • GUI (kedro-viz) provides DAG visualization feature.
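
The idea of keeping data access in dataset wrappers, separate from processing functions, can be sketched with the standard library. The classes below are hypothetical and are not Kedro's DataSet classes; they only assume that a dataset object exposes `load` and `save`, and that a catalog maps data names to such objects.

```python
import csv
import json
import tempfile
from pathlib import Path


class JSONDataSet:
    """Hypothetical dataset wrapper: knows where and how to read and write."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        return json.loads(self.filepath.read_text())

    def save(self, data):
        self.filepath.write_text(json.dumps(data))


class CSVRowsDataSet:
    """A user-added format with the same interface: rows as lists of strings."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        with self.filepath.open(newline="") as f:
            return list(csv.reader(f))

    def save(self, data):
        with self.filepath.open("w", newline="") as f:
            csv.writer(f).writerows(data)


def count_rows(rows):
    """Processing function: contains no file access code at all."""
    return {"n_rows": len(rows)}


def demo():
    with tempfile.TemporaryDirectory() as d:
        # The catalog maps data names to dataset objects.
        catalog = {
            "table": CSVRowsDataSet(Path(d) / "table.csv"),
            "summary": JSONDataSet(Path(d) / "summary.json"),
        }
        catalog["table"].save([["a", "1"], ["b", "2"]])
        catalog["summary"].save(count_rows(catalog["table"].load()))
        return catalog["summary"].load()
```

Swapping a storage format then means changing one catalog entry, while `count_rows` stays untouched and can be unit-tested with in-memory data.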

Cons:

  • Does not support automatically resuming a pipeline from intermediate data files or databases.
  • GUI (kedro-viz) does not provide execution progress monitoring feature.
  • Package dependencies which are not used in many cases (e.g. pyarrow) are included in the requirements.txt.

PipelineX

https://github.com/Minyus/pipelinex

Released in Nov 2019 by a Kedro user (me).

PipelineX works on top of Kedro and MLflow.

PipelineX enables you to define your pipeline in YAML (an independent YAML file).

Pros:

In addition to Kedro's advantages:

  • Supports automatically resuming a pipeline from intermediate data files or databases.
  • Optional syntactic sugar for Kedro Pipeline. (e.g. Sequential API similar to PyTorch (torch.nn.Sequential) and Keras (tf.keras.Sequential))
  • Optional syntactic sugar for Kedro DataSet catalog. (e.g. Use file name in the file path as the dataset instance name)
  • Backward-compatible with pure Kedro.
  • Integration with MLflow to save parameters, metrics, and other output artifacts such as models for each experiment.
  • Integration with common packages for Data Science: PyTorch, Ignite, pandas, OpenCV.
  • Additional DataSet including image set (a folder including images) useful for computer vision applications.
  • Lean project template compared with pure Kedro.

Cons:

  • GUI (kedro-viz) does not provide execution progress monitoring feature.
  • Package dependencies which are not used in many cases (e.g. pyarrow) are included in the requirements.txt of Kedro.
  • PipelineX is developed and maintained by an individual (me) at this moment.

Platform-specific options

Argo

https://github.com/argoproj/argo

Uses Kubernetes to run pipelines.

Kubeflow Pipelines

https://github.com/kubeflow/pipelines

Works on top of Argo.

Oozie

https://github.com/apache/oozie

Manages Hadoop jobs.

Azkaban

https://github.com/azkaban/azkaban

Manages Hadoop jobs.

GitLab CI/CD

https://docs.gitlab.com/ee/ci/

  • Runs pipelines defined in YAML.
  • Supports triggering by git push, CRON-style scheduling, and manual clicking.
  • Supports Docker containers.

Inaccuracies

Please kindly let me know if you find anything inaccurate.

Pull requests for https://github.com/Minyus/Python_Packages_for_Pipeline_Workflow/blob/master/README.md are welcome.
