
pipeline's Issues

Project configuration and global variables.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

Currently, the complete project configuration (the rendered content of .pipeline.yaml) is available in every task template. This causes two issues:

  1. If a variable in the project configuration has the same name as one in the task definition, a TypeError is raised because _render_task_template receives the same keyword argument twice.
  2. The explicit mention of global variables does not make sense. You do not have to define a special dictionary; any variable created in the yaml is available anyway.

Describe the solution you'd like

  1. Make a dictionary update before passing arguments to _render_task_template, so that task information can overwrite the general configuration ...

    OR ... keep the error because silently overwriting a rendered value might produce unexpected behavior.

    I would prefer the latter.

  2. Remove globals and document that the configuration is available in every task.
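The two options could be sketched as follows; the function name and signature are hypothetical, not pipeline's actual API:

```python
# Hypothetical sketch of the two options for combining the project
# configuration with task-specific variables before rendering.

def merge_for_rendering(config: dict, task_info: dict) -> dict:
    duplicates = sorted(config.keys() & task_info.keys())
    if duplicates:
        # Option 2 (preferred): keep the error instead of silently
        # overwriting a value that may already be rendered elsewhere.
        raise ValueError(f"Variables defined twice: {duplicates}")
    # Option 1 would instead let task information win: {**config, **task_info}
    return {**config, **task_info}
```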

Change the comment syntax in task definitions from {# ... #} to #.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

By default, comments in Jinja2 templates are written as {# ... #}, which is unusual for both yaml and Python files.

Describe the solution you'd like

Switch to #.

This will become easier once each set of templates has its own environment, because every environment can define its own comment syntax while pipeline keeps its own.
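For reference, Jinja2 already supports line-based comments via the line_comment_prefix option of its Environment; a minimal sketch:

```python
from jinja2 import Environment

# An environment whose templates treat everything after "#" on a line as a
# comment, instead of requiring {# ... #}.
env = Environment(line_comment_prefix="#")
template = env.from_string("value: {{ x }}  # stripped at render time")
print(template.render(x=1))  # the comment does not appear in the output
```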

.pipeline.yaml cannot be rendered.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

.pipeline.yaml itself cannot be rendered, which is unintuitive if you want to define custom paths such as

data_directory: {{ source_directory }}/data

Describe the solution you'd like

I would suggest that there are two passes to read the configuration.

  1. Read the configuration as a simple yaml file.
  2. Then, render the configuration by using the former information as variables.

This would solve issues like the following:

data_directory: {{ source_directory }}/data
  1. Read the yaml and set defaults if missing. This will add {"source_directory": "src"} to the config.
  2. Render the configuration again with the aforementioned information which will produce {"data_directory": "src/data"}.

The second step could be repeated until even more nested expressions like

data_directory: {{ source_directory }}/data
soep_directory: {{ data_directory }}/soep

are rendered, too.
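The repeated second step could be sketched as a small fixed-point loop; this is an illustration with a simplified regex-based renderer, not pipeline's implementation:

```python
import re

# Hypothetical sketch of the proposed multi-pass rendering: keep
# re-substituting {{ ... }} expressions with the values obtained so far
# until nothing changes any more.
def render_config(raw: dict, max_passes: int = 10) -> dict:
    pattern = re.compile(r"\{\{\s*(\w+)\s*\}\}")
    config = dict(raw)
    for _ in range(max_passes):
        changed = False
        for key, value in config.items():
            if isinstance(value, str):
                new = pattern.sub(
                    lambda m: str(config.get(m.group(1), m.group(0))), value
                )
                if new != value:
                    config[key] = new
                    changed = True
        if not changed:
            break
    return config
```

With the example from above, `{"source_directory": "src", "data_directory": "{{ source_directory }}/data", "soep_directory": "{{ data_directory }}/soep"}` resolves to `src/data` and `src/data/soep`.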

Allow tasks to persist.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

It is a common problem that some tasks in a project are very expensive to run, and you do not want to re-run them and overwrite their targets accidentally.

At the same time, pipeline keeps track of many changes (rendered template, dependencies, targets) which trigger a new execution. For example, formatting your project with black would trigger many re-runs.

Describe the solution you'd like

Add a key named persists: true to the task definition, which skips the task as long as all of its targets exist. Otherwise, the task is re-run.

A user could either clean the whole project or selectively delete the tasks' targets.
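The skipping rule could be sketched like this; "persists" and "targets" are treated as keys of an in-memory task definition, and the function name is an assumption:

```python
from pathlib import Path

# Hypothetical sketch: a task is skipped only if it opted in via
# persists: true and every one of its targets already exists on disk.
def should_skip(task: dict) -> bool:
    if not task.get("persists", False):
        return False
    targets = task.get("targets", [])
    return bool(targets) and all(Path(t).exists() for t in targets)
```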

Allow to execute the workflow within the project directory.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

Sometimes you find yourself inside a subdirectory of the project and want to build it. Currently, you have to go back to the project root first. Why not search upwards for a .pipeline.yaml instead?

Describe the solution you'd like
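The upward search could look like this minimal sketch; the function name and error message are assumptions, not pipeline's actual API:

```python
from pathlib import Path

def find_project_root(start: Path) -> Path:
    # Walk from the start directory up to the filesystem root and return
    # the first directory that contains a .pipeline.yaml.
    current = start.resolve()
    for directory in [current, *current.parents]:
        if (directory / ".pipeline.yaml").is_file():
            return directory
    raise FileNotFoundError(f"No .pipeline.yaml found above {start}.")
```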

Describe alternatives you've considered

Better identification of tasks.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

  • Tasks can only be defined in yaml files. Document this.
  • If a yaml file is not a task definition, handle it gracefully instead of crashing.
  • Rethink whether tasks should be defined in yaml files or in Python scripts.

Describe the solution you'd like
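Graceful handling of non-task yaml files could look like the following sketch; the "template" marker key and all names are assumptions, not pipeline's actual format:

```python
# Hypothetical sketch: a parsed yaml document contributes tasks only if it
# is a mapping whose entries look like task definitions (here: mappings
# with a "template" key); anything else is skipped instead of raising.
def collect_tasks(parsed_files: dict[str, object]) -> dict[str, dict]:
    tasks = {}
    for filename, content in parsed_files.items():
        if not isinstance(content, dict):
            continue  # e.g. a yaml file containing a plain list or scalar
        for name, definition in content.items():
            if isinstance(definition, dict) and "template" in definition:
                tasks[name] = definition
    return tasks
```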

Describe alternatives you've considered

Do not consider the task definition as a dependency of a task.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

Currently, the file in which a task is defined, some task.yaml, is considered a dependency of the task. If the file changes because other tasks are added to it, the unchanged task is re-run as well.

The idea behind this implementation was that you can pass variables to tasks which will then be used inside the template. If such a variable is changed, the task should also be re-run.

One of the former PRs added the rendered template to the task dependencies, which already captures changes to such variables. Thus, treating the task definition file as a dependency is completely unnecessary.

Describe the solution you'd like

Remove the task definition from the list of task dependencies.

Computing hashes takes time.

  • pipeline version used, if any: 0.0.5
  • Python version, if any: any
  • Operating System: any

What would you like to enhance and why? Is it related to an issue/problem?

I have a project with a lot of data, stored either in one big file or in multiple smaller files. Computing the hashes of all these files for every task that depends on them takes a lot of time.

To be precise, pipeline needs to compute the hashes of all files. It is not even possible to take a shortcut when one of the hashes does not match, because the hashes of all dependencies of the task must be updated anyway.

Describe the solution you'd like

  • Cache hashes so that requesting the hash of the same file again is free. This already works within one build process, but needs more testing and benchmarking. Across builds, the hashes are currently recomputed; a persistent cache as part of the database would be helpful.
  • Parallelize hashing.
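The caching idea could be sketched as follows, keyed by path, size, and modification time so an unchanged file is hashed only once; persisting this dictionary in the database would carry the cache across builds. All names are illustrative:

```python
import hashlib
from pathlib import Path

# Hypothetical in-memory hash cache. The (path, size, mtime) key is a
# cheap proxy for "file unchanged"; only a cache miss reads the file.
_hash_cache: dict[tuple, str] = {}

def cached_hash(path: Path) -> str:
    stat = path.stat()
    key = (str(path), stat.st_size, stat.st_mtime)
    if key not in _hash_cache:
        _hash_cache[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return _hash_cache[key]
```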

Describe alternatives you've considered

Implement a plug-in system for pipeline.

  • pipeline version used, if any: any
  • Python version, if any: any
  • Operating System: any

Is your feature request related to a problem? Please describe.

pipeline is currently a chimera of two components.

  1. A build system which supports executors and templates in different languages (Python, R).
  2. Templates in different languages which make it easy to run common tasks. These templates are heavily topic-dependent.

Describe the solution you'd like

To tear these two components apart, I propose a plug-in system similar to pytest's or Flask's which allows adding more functionality to pipeline. I see two kinds of plug-ins:

  • More executors (Julia, Jupyter Notebooks, etc.).
  • Templates which should be provided as an environment so that plug-ins can define their own syntax.
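A registry-based mechanism, loosely modeled on how pytest and Flask extensions hook into their cores, could look like this sketch; every name here is illustrative, not pipeline's actual API:

```python
from typing import Callable

# Hypothetical global registry that plug-ins fill at import time.
EXECUTORS: dict[str, Callable] = {}

def register_executor(language: str) -> Callable:
    """Decorator with which a plug-in registers an executor for a language."""
    def decorator(func: Callable) -> Callable:
        EXECUTORS[language] = func
        return func
    return decorator

@register_executor("julia")
def run_julia(script: str) -> str:
    # A real executor would invoke the Julia interpreter here.
    return f"running {script} with julia"
```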
