
pipeline's Issues

Project configuration and global variables.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

Currently, the complete project configuration (the rendered content of .pipeline.yaml) is available in every task template. This causes two issues:

  1. If a variable in the project configuration has the same name as one in the task definition, a TypeError is raised because _render_task_template receives the same keyword argument twice.
  2. The explicit mention of global variables does not make sense. You do not have to define a special dictionary; any variable created in the yaml is available anyway.

Describe the solution you'd like

  1. Make a dictionary update before passing arguments to _render_task_template, so that task information can overwrite the general configuration ...

    OR ... keep the error because silently overwriting a rendered value might produce unexpected behavior.

    I would prefer the latter.

  2. Remove globals and document that the configuration is available in every task.
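The two options could be sketched as follows; the function name and signature are hypothetical, not pipeline's actual API:

```python
# Hypothetical sketch of the two options for combining the project
# configuration with task-specific variables before rendering.

def merge_for_rendering(config: dict, task_info: dict) -> dict:
    duplicates = sorted(config.keys() & task_info.keys())
    if duplicates:
        # Option 2 (preferred): keep the error instead of silently
        # overwriting a value that may already be rendered elsewhere.
        raise ValueError(f"Variables defined twice: {duplicates}")
    # Option 1 would instead let task information win: {**config, **task_info}
    return {**config, **task_info}
```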

Change the comment syntax in task definitions from {# ... #} to #.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

By default, comments in Jinja2 templates are written as {# ... #}, which is unusual for both yaml and Python files.

Describe the solution you'd like

Switch to #.

This will become easier once each set of templates has its own environment, because every environment can define its own comment syntax while pipeline keeps its own.
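For reference, Jinja2 already supports line-based comments via the line_comment_prefix option of its Environment; a minimal sketch:

```python
from jinja2 import Environment

# An environment whose templates treat everything after "#" on a line as a
# comment, instead of requiring {# ... #}.
env = Environment(line_comment_prefix="#")
template = env.from_string("value: {{ x }}  # stripped at render time")
print(template.render(x=1))  # the comment does not appear in the output
```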

.pipeline.yaml cannot be rendered.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

.pipeline.yaml itself cannot be rendered, which is unintuitive if you want to define custom paths such as

data_directory: {{ source_directory }}/data

Describe the solution you'd like

I would suggest that there are two passes to read the configuration.

  1. Read the configuration as a simple yaml file.
  2. Then, render the configuration by using the former information as variables.

This would solve issues like the following:

data_directory: {{ source_directory }}/data
  1. Read the yaml and set defaults if missing. This will add {"source_directory": "src"} to the config.
  2. Render the configuration again with the aforementioned information which will produce {"data_directory": "src/data"}.

The second step could be repeated until even more nested expressions like

data_directory: {{ source_directory }}/data
soep_directory: {{ data_directory }}/soep

are rendered, too.
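The repeated second step could be sketched as a small fixed-point loop; this is an illustration with a simplified regex-based renderer, not pipeline's implementation:

```python
import re

# Hypothetical sketch of the proposed multi-pass rendering: keep
# re-substituting {{ ... }} expressions with the values obtained so far
# until nothing changes any more.
def render_config(raw: dict, max_passes: int = 10) -> dict:
    pattern = re.compile(r"\{\{\s*(\w+)\s*\}\}")
    config = dict(raw)
    for _ in range(max_passes):
        changed = False
        for key, value in config.items():
            if isinstance(value, str):
                new = pattern.sub(
                    lambda m: str(config.get(m.group(1), m.group(0))), value
                )
                if new != value:
                    config[key] = new
                    changed = True
        if not changed:
            break
    return config
```

With the example from above, `{"source_directory": "src", "data_directory": "{{ source_directory }}/data", "soep_directory": "{{ data_directory }}/soep"}` resolves to `src/data` and `src/data/soep`.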

Allow tasks to persist.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

It is a common problem that some tasks in a project are very expensive to run, and you do not want to re-run them and overwrite their targets accidentally.

At the same time, pipeline keeps track of many changes (rendered template, dependencies, targets) which trigger a new execution. For example, formatting your project with black would trigger many re-runs.

Describe the solution you'd like

Add a key named persists: true to the task definition, which skips the task as long as all of its targets exist. Otherwise, the task is re-run.

A user could either clean the whole project or selectively delete the tasks' targets.
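The skipping rule could be sketched like this; "persists" and "targets" are treated as keys of an in-memory task definition, and the function name is an assumption:

```python
from pathlib import Path

# Hypothetical sketch: a task is skipped only if it opted in via
# persists: true and every one of its targets already exists on disk.
def should_skip(task: dict) -> bool:
    if not task.get("persists", False):
        return False
    targets = task.get("targets", [])
    return bool(targets) and all(Path(t).exists() for t in targets)
```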

Allow to execute the workflow within the project directory.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

Sometimes you find yourself inside a subdirectory of the project and want to build it. Currently, you have to go back to the project root first. Why not search upwards for a .pipeline.yaml instead?

Describe the solution you'd like
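The upward search could look like this minimal sketch; the function name and error message are assumptions, not pipeline's actual API:

```python
from pathlib import Path

def find_project_root(start: Path) -> Path:
    # Walk from the start directory up to the filesystem root and return
    # the first directory that contains a .pipeline.yaml.
    current = start.resolve()
    for directory in [current, *current.parents]:
        if (directory / ".pipeline.yaml").is_file():
            return directory
    raise FileNotFoundError(f"No .pipeline.yaml found above {start}.")
```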

Describe alternatives you've considered

Better identification of tasks.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

  • Tasks can only be defined in yaml files. Document this.
  • If a yaml file is not a task definition, handle it gracefully instead of crashing.
  • Rethink whether tasks should be defined in yaml files or in Python scripts.

Describe the solution you'd like
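Graceful handling of non-task yaml files could look like the following sketch; the "template" marker key and all names are assumptions, not pipeline's actual format:

```python
# Hypothetical sketch: a parsed yaml document contributes tasks only if it
# is a mapping whose entries look like task definitions (here: mappings
# with a "template" key); anything else is skipped instead of raising.
def collect_tasks(parsed_files: dict[str, object]) -> dict[str, dict]:
    tasks = {}
    for filename, content in parsed_files.items():
        if not isinstance(content, dict):
            continue  # e.g. a yaml file containing a plain list or scalar
        for name, definition in content.items():
            if isinstance(definition, dict) and "template" in definition:
                tasks[name] = definition
    return tasks
```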

Describe alternatives you've considered

Do not consider the task definition as a dependency of a task.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

Currently, the file in which a task is defined, some task.yaml, is considered a dependency of the task. If the file changes because other tasks are added to it, the unchanged task is re-run as well.

The idea behind this implementation was that you can pass variables to tasks which will then be used inside the template. If such a variable is changed, the task should also be re-run.

One of the former PRs added the rendered template to the task dependencies, which already captures changes to such variables. Thus, treating the task definition file as a dependency is completely unnecessary.

Describe the solution you'd like

Remove the task definition from the list of task dependencies.

Computing hashes takes time.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

I have a project with a lot of data, stored either in one big file or in multiple smaller files. Computing the hashes of all these files for every task that depends on them takes a lot of time.

To be precise, pipeline needs to compute the hashes of all files. It is not even possible to take a shortcut when one of the hashes does not match, because the hashes of all dependencies of the task must be updated anyway.

Describe the solution you'd like

  • Cache hashes so that requesting the hash of the same file again is free. This already works within one build process, but needs more testing and benchmarking. Across builds, the hashes are currently recomputed; a persistent cache as part of the database would be helpful.
  • Parallelize hashing.
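The caching idea could be sketched as follows, keyed by path, size, and modification time so an unchanged file is hashed only once; persisting this dictionary in the database would carry the cache across builds. All names are illustrative:

```python
import hashlib
from pathlib import Path

# Hypothetical in-memory hash cache. The (path, size, mtime) key is a
# cheap proxy for "file unchanged"; only a cache miss reads the file.
_hash_cache: dict[tuple, str] = {}

def cached_hash(path: Path) -> str:
    stat = path.stat()
    key = (str(path), stat.st_size, stat.st_mtime)
    if key not in _hash_cache:
        _hash_cache[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return _hash_cache[key]
```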

Describe alternatives you've considered

Implement a plug-in system for pipeline.

  • pipeline version used, if any: any
  • Python version, if any: any
  • Operating System: any

Is your feature request related to a problem? Please describe.

pipeline is currently a chimera of two components.

  1. A build system which supports executors and templates in different languages (Python, R).
  2. Templates in different languages which make it easy to run common tasks. These templates are heavily topic-dependent.

Describe the solution you'd like

To tear these two components apart, I propose a plug-in system similar to pytest's or Flask's which allows adding more functionality to pipeline. I see two kinds of plug-ins:

  • More executors (Julia, Jupyter Notebooks, etc.).
  • Templates which should be provided as an environment so that plug-ins can define their own syntax.
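A registry-based mechanism, loosely modeled on how pytest and Flask extensions hook into their cores, could look like this sketch; every name here is illustrative, not pipeline's actual API:

```python
from typing import Callable

# Hypothetical global registry that plug-ins fill at import time.
EXECUTORS: dict[str, Callable] = {}

def register_executor(language: str) -> Callable:
    """Decorator with which a plug-in registers an executor for a language."""
    def decorator(func: Callable) -> Callable:
        EXECUTORS[language] = func
        return func
    return decorator

@register_executor("julia")
def run_julia(script: str) -> str:
    # A real executor would invoke the Julia interpreter here.
    return f"running {script} with julia"
```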
