WfCommons: A Framework for Enabling Scientific Workflow Research and Development

Home Page: https://wfcommons.org

License: GNU General Public License v3.0

scientific-workflows simulation reproducible-research workflow distributed-systems workflow-simulator scheduling-simulator hpc workflow-management-system workflow-generator

wfcommons's Introduction




This Python package provides a collection of tools for:

  • Analyzing instances of actual workflow executions;
  • Producing recipe structures for workflow generation;
  • Generating synthetic realistic workflow instances; and
  • Generating realistic workflow benchmark specifications.
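The instances these tools consume and produce are JSON documents. As a toy illustration (an invented two-task fragment, not the full WfFormat schema; the task and file names here are made up), the sketch below builds a minimal instance using the task fields found in WfCommons JSON files and checks that parent/child declarations are symmetric:

```python
# Two-task toy instance using WfFormat-style task fields
# (name/parents/children/files/link/size). Illustrative fragment
# only, not the full WfFormat schema.
instance = {
    "tasks": [
        {"name": "split_1", "parents": [], "children": ["filter_1"],
         "files": [{"link": "output", "name": "a.dat", "size": 100}]},
        {"name": "filter_1", "parents": ["split_1"], "children": [],
         "files": [{"link": "input", "name": "a.dat", "size": 100}]},
    ]
}

def check_parent_child_symmetry(tasks):
    """Return (parent, child) pairs declared on only one side."""
    by_name = {t["name"]: t for t in tasks}
    mismatched = []
    for task in tasks:
        for child in task["children"]:
            if task["name"] not in by_name[child]["parents"]:
                mismatched.append((task["name"], child))
    return mismatched

print(check_parent_child_symmetry(instance["tasks"]))  # -> []
```

Real instance files also carry fields such as type, runtime, and cores on each task, as the JSON fragments in the issues further down show.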


Installation

WfCommons is available on PyPI. WfCommons requires Python 3.8+ and has been tested on Linux and macOS.

Installation using pip

While pip can be used to install WfCommons, we suggest the following approach for reliable installation when many Python environments are available:

$ python3 -m pip install wfcommons

Retrieving the latest unstable version

If you want to use the latest unstable version of WfCommons, which contains brand-new features (but may also contain bugs, as stabilization work is still underway), you can install it directly from the GitHub repository.

Cloning from WfCommons's GitHub repository:

$ git clone https://github.com/wfcommons/wfcommons
$ cd wfcommons
$ pip install .

Optional Requirements

Graphviz

WfCommons uses pygraphviz for generating visualizations for the workflow task graph. If you want to enable this feature, you will have to install the graphviz package (version 2.16 or later). You can install graphviz easily on Linux with your favorite package manager, for example for Debian-based distributions:

sudo apt-get install graphviz libgraphviz-dev

and for RedHat-based distributions:

sudo yum install python-devel graphviz-devel

On macOS you can use the brew package manager:

brew install graphviz

Then you can install pygraphviz by running:

python3 -m pip install pygraphviz

pydot

WfCommons uses pydot for reading and writing DOT files. If you want to enable this feature, you will have to install the pydot package:

python3 -m pip install pydot

Get in Touch

The main channel to reach the WfCommons team is via the support email: [email protected].

Bug Report / Feature Request: our preferred channel to report a bug or request a feature is via
WfCommons's Github Issues Track.

Citing WfCommons

When citing WfCommons, please use the following paper. You should also consider reading it, as it provides a recent and general overview of the framework.

@article{wfcommons,
    title = {{WfCommons: A Framework for Enabling Scientific Workflow Research and Development}},
    author = {Coleman, Tain\~{a} and Casanova, Henri and Pottier, Lo\"{i}c and Kaushik, Manav and Deelman, Ewa and Ferreira da Silva, Rafael},
    journal = {Future Generation Computer Systems},
    volume = {128},
    pages = {16--27},
    doi = {10.1016/j.future.2021.09.043},
    year = {2022},
}

wfcommons's People

Contributors

ftschirpke, henricasanova, jaredraycoleman, john-dobbs, lpottier, rafaelfsilva, schastel-perso, tainagdcoleman, wrwilliams


wfcommons's Issues

Simplify PegasusLogsParser by removing the legacy flag

Is your feature request related to a problem? Please describe.
When using PegasusLogsParser, the user currently has to know whether the submit directory was generated with Pegasus 4.x or 5.x (5.x uses YAML, while 4.x uses a custom XML-based format).
We could improve PegasusLogsParser to automatically detect the submit directory version and act accordingly.

Describe the solution you'd like
Remove the legacy flag
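A minimal sketch of how that auto-detection could look, assuming (per this issue) that Pegasus 5.x submit directories contain a YAML braindump.yml while 4.x ones contain braindump.txt; the helper name detect_pegasus_version and the filename heuristic are illustrative assumptions, not the actual WfCommons implementation:

```python
from pathlib import Path
import tempfile

def detect_pegasus_version(submit_dir) -> str:
    """Guess the Pegasus major version from the submit directory layout.

    Heuristic (an assumption for illustration, based on this issue):
    5.x submit directories contain a YAML braindump.yml, while 4.x
    ones contain a braindump.txt.
    """
    d = Path(submit_dir)
    if (d / "braindump.yml").is_file():
        return "5.x"
    if (d / "braindump.txt").is_file():
        return "4.x"
    raise FileNotFoundError(f"no braindump file found in {submit_dir}")

# Demo on a temporary directory laid out like a 5.x submit directory:
demo = Path(tempfile.mkdtemp())
(demo / "braindump.yml").touch()
version = detect_pegasus_version(demo)
print(version)  # -> 5.x
```

With a check like this inside the parser, the legacy flag would indeed become unnecessary.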

Montage workflow: no transfer file between mImgtbl and mAdd jobs

Hi @rafaelfsilva

I am not sure if there is a bug in WorkflowHub or not. When I generate a Montage DAG
based on the structure presented here: https://pegasus.isi.edu/workflow_gallery/gallery/montage/index.php
there should be a link between the mImgtbl and mAdd jobs.
I can see in the generated JSON file that mAdd is listed as a child of mImgtbl;
however, the mImgtbl output file does not appear as an input file for mAdd:

...
 {
    "name": "mImgtbl_00000131",
    ...
    "children": [
        "mAdd_00000132"
    ],
    "files": [
        ...
        {
            "link": "output",
            "name": "509e9372-a8f4-4be5-bef5-6cfc5dcc34f9.tbl",
            "size": 2594
        }
    ]
}
...

i.e., there is no 509e9372-a8f4-4be5-bef5-6cfc5dcc34f9.tbl listed as an input file for the mAdd_00000132 job.
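One way to surface this kind of inconsistency is to scan an instance for parent output files that a dependent child does not list among its inputs. A standard-library sketch (find_dangling_outputs is a hypothetical helper, not part of WfCommons; it assumes the simple pattern described in this issue, where a child consumes its parent's outputs, whereas real workflows may route different outputs to different children):

```python
def find_dangling_outputs(tasks):
    """Return (parent, child, filename) triples where a parent's
    output file is not listed among the child's input files."""
    by_name = {t["name"]: t for t in tasks}
    dangling = []
    for task in tasks:
        outputs = {f["name"] for f in task["files"] if f["link"] == "output"}
        for child in task["children"]:
            inputs = {f["name"] for f in by_name[child]["files"]
                      if f["link"] == "input"}
            for name in outputs - inputs:
                dangling.append((task["name"], child, name))
    return dangling

# Toy reproduction of the symptom: the parent's output is missing
# from the child's inputs (file names here are invented).
tasks = [
    {"name": "mImgtbl_00000131", "children": ["mAdd_00000132"],
     "files": [{"link": "output", "name": "x.tbl", "size": 2594}]},
    {"name": "mAdd_00000132", "children": [],
     "files": [{"link": "input", "name": "other.fits", "size": 10}]},
]
print(find_dangling_outputs(tasks))
# -> [('mImgtbl_00000131', 'mAdd_00000132', 'x.tbl')]
```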

Bug when using PegasusLogsParser with a Pegasus submit directory <= 4.9

WfCommons Information

  • WfCommons version: master
  • Python Version: 3.9.10

Describe the bug
When using PegasusLogsParser with a Pegasus submit directory <= 4.9, I obtain the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/wfcommons/wfinstances/logs/pegasus.py", line 78, in build_workflow
    self._parse_braindump()
  File "/usr/local/lib/python3.9/site-packages/wfcommons/wfinstances/logs/pegasus.py", line 96, in _parse_braindump
    raise OSError(f'Unable to find braindump file: {braindump_file}')
OSError: Unable to find braindump file: /braindump.txt

Desktop (please complete the following information):

  • OS: macOS
  • Version: 12.2.1

Edges in the Montage and Soykbr workflows are not correct

WfCommons Information

  • WfCommons version: 0.7, master branch
  • Python Version: 3.8.10

Describe the bug
Some of the edges in Montage and Soykbr do not have the correct weight.

To Reproduce
The sum of the sizes of the "input" files of a child in the Montage or Soykbr workflows does not equal the sum of the sizes of the parent's "output" files.

Expected behavior
The sum of the input file sizes of the child and the sum of the output file sizes of the parent should be equal, because they describe the same edge.

Screenshots
Example of creating a montage workflow. (Similar to the one I am using in my program)

Desktop (please complete the following information):

  • OS: Windows 10 using wsl with Ubuntu 20.04

Additional context
Found this problem when trying to construct the critical path of the workflow.
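For critical-path construction, the weight of a parent-to-child edge is naturally the total size of the files the parent outputs and the child inputs. The sketch below (standard library only; edge_weights is a hypothetical helper, and the two-task demo with invented file names mirrors the symptom) computes these weights, where a zero weight on a declared edge is exactly the inconsistency reported here:

```python
def edge_weights(tasks):
    """Map (parent, child) -> total bytes transferred along the edge,
    i.e. the sizes of files that are an output of the parent and an
    input of the child. A zero weight on a declared edge signals the
    kind of inconsistency reported in this issue."""
    by_name = {t["name"]: t for t in tasks}
    weights = {}
    for task in tasks:
        out = {f["name"] for f in task["files"] if f["link"] == "output"}
        for child in task["children"]:
            shared = [f["size"] for f in by_name[child]["files"]
                      if f["link"] == "input" and f["name"] in out]
            weights[(task["name"], child)] = sum(shared)
    return weights

# Demo: the child's input does not match any parent output,
# so the declared edge carries zero weight.
tasks = [
    {"name": "mImgtbl", "children": ["mAdd"],
     "files": [{"link": "output", "name": "img.tbl", "size": 2594}]},
    {"name": "mAdd", "children": [],
     "files": [{"link": "input", "name": "region.hdr", "size": 304}]},
]
print(edge_weights(tasks))  # -> {('mImgtbl', 'mAdd'): 0}
```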

Naming issue for .yml files during Seismology/Montagev3 execution preventing PythonLogsParser from working out of the box

WorkflowHub Information

  • WorkflowHub version: 5.0
  • Python Version: 3.6.9

Describe the bug
Seismology and Montage v3 are not producing yml/txt files that can work with PythonLogsParser out of the box. Seismology produces a braindump.yml file, but that needs to be renamed to braindump.txt to work with legacy=True. Legacy shouldn't be required as it was run on version 5.0. Montage v3 produces both braindump.yml and montage-workflow.yml, the latter of which needs to be renamed to workflow.yml for the parser to run.

To Reproduce
After running either seismology or montagev3 workflows, you can attempt to create a json using the process at https://docs.workflowhub.org/en/latest/parsing_logs.html#pegasus-wms . The script with legacy=False needs a workflow.yml file, or if legacy=True it needs a braindump.txt file. Neither is created by default on execution of the workflow.

Expected behavior
Seismology and Montage producing workflow.yml files during execution.

Desktop (please complete the following information):
Ubuntu 18.04.04

Generate WfInstances from dot files

Request from Svetlana Kulagina:

This Recipe can be generated with wfchef from an executable file. What if I don't have an executable, but rather have a .dot description of the workflow? Can I somehow generate a recipe from it? Is there maybe a way to manually transform the .dot file into the recipe?
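A first step toward such a recipe would be extracting the task graph from the .dot file. The toy sketch below uses only the standard library and handles just simple `a -> b;` edge lines (an illustration only; a real converter should use a proper DOT parser such as pydot, and would still need task runtimes and file sizes to build a full recipe):

```python
import re
from collections import defaultdict

# Matches simple DOT edge statements like: a -> b;  or  "a" -> "b" [color=red];
EDGE_RE = re.compile(r'^\s*"?([\w.]+)"?\s*->\s*"?([\w.]+)"?\s*(\[[^\]]*\])?\s*;')

def dot_edges_to_dag(dot_text):
    """Extract parents/children maps from simple 'a -> b;' DOT lines.
    Toy parser: ignores subgraphs, node statements, and chained edges."""
    parents, children = defaultdict(list), defaultdict(list)
    for line in dot_text.splitlines():
        m = EDGE_RE.match(line)
        if m:
            src, dst = m.group(1), m.group(2)
            children[src].append(dst)
            parents[dst].append(src)
    return dict(parents), dict(children)

dot = """digraph wf {
  mProject1 -> mDiff1;
  mProject2 -> mDiff1;
  mDiff1 -> mConcatFit;
}"""
parents, children = dot_edges_to_dag(dot)
print(parents["mDiff1"])  # -> ['mProject1', 'mProject2']
```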

Generated output DAG JSON structure not similar to WorkflowHub

Hi,
Previously I used WorkflowHub to generate some real-world workflow applications; now it seems it has been merged into the wfcommons package.
I did some simple tests and the generated output JSON file is not clear.
Let me explain this with a simple example. For EpigenomicsRecipe, we have:

        "jobs": [
            {
                "name": "fastqSplit_00000001",
                "type": "compute",
                "runtime": 878.473,
                "parents": [],
                "children": [
                    "filterContams_00000002",
                    "filterContams_00000006",
                    "filterContams_00000010"
                ],
                "files": [
                    {
                        "link": "input",
                        "name": "06252281-89da-4385-b6cd-025b55f91d56.sfq",
                        "size": 57233202
                    },
                    {
                        "link": "output",
                        "name": "314ac45e-0b2b-447d-b7b3-e44806bcd60a.sfq",
                        "size": 12060453
                    },
                    {
                        "link": "output",
                        "name": "03c46ee5-7d81-48e8-b738-6a52a3f02044.sfq",
                        "size": 10733270
                    },
                    {
                        "link": "output",
                        "name": "3f8c84bf-2c61-4a30-a891-4aeda1de6fd2.sfq",
                        "size": 12346046
                    }
                ],
                "cores": 1
            },
            ...
            ...
            {
                "name": "filterContams_00000002",
                "type": "compute",
                "runtime": 12.196,
                "parents": [
                    "fastqSplit_00000001"
                ],
                "children": [
                    "sol2sanger_00000003"
                ],
                "files": [
                    {
                        "link": "input",
                        "name": "314ac45e-0b2b-447d-b7b3-e44806bcd60a.sfq",
                        "size": 12060453
                    },
                    {
                        "link": "output",
                        "name": "2b441ab8-e098-46d1-834f-dc11513ee8ec.sfq",
                        "size": 2747304
                    }
                ],
                "cores": 1
            },

As can be seen, the file 314ac45e-0b2b-447d-b7b3-e44806bcd60a.sfq is marked as an output file of task fastqSplit_00000001 and as an input file of task filterContams_00000002.

However, using wfcommons we don't get this structure. For example:

            {
                "name": "fastqSplit_00000021",
                "type": "compute",
                "runtime": 878.473,
                "parents": [],
                "children": [
                    "filterContams_00000022",
                    "filterContams_00000023",
                    "filterContams_00000024",
                    "filterContams_00000025",
                    "filterContams_00000026",
                    "filterContams_00000027",
                    "filterContams_00000028",
                    "filterContams_00000029",
                    "filterContams_00000030"
                ],
                "files": [
                    {
                        "link": "input",
                        "name": "a22a4e96-1955-4395-8049-aad709e7e2c0.sfq",
                        "size": 367561779
                    },
                    {
                        "link": "output",
                        "name": "5c162d9d-72b6-443b-982b-4c503cbafa0a.sfq",
                        "size": 11562952
                    }
                ],
                "cores": 1
            },
            ...
            ...
            {
                "name": "filterContams_00000022",
                "type": "compute",
                "runtime": 40.919,
                "parents": [
                    "fastqSplit_00000021"
                ],
                "children": [
                    "sol2sanger_00000004"
                ],
                "files": [
                    {
                        "link": "input",
                        "name": "f1fa831a-c037-4f0d-b468-87cecff9004d.sfq",
                        "size": 5489748
                    },
                    {
                        "link": "output",
                        "name": "4c0098e8-ea3b-435b-967d-cbe3d6a5c06e.sfq",
                        "size": 1360589
                    }
                ],
                "cores": 1
            },

As you can see, for task fastqSplit_00000021, its child filterContams_00000022 has the input file f1fa831a-c037-4f0d-b468-87cecff9004d.sfq, but that file is not listed as an output file of task fastqSplit_00000021.

Is this a bug @tainagdcoleman? I think that, as in WorkflowHub, each file name should be listed twice: once as an output file of the parent and once as an input file of the child.
