
pipelines's Introduction

Pipelines


The project experiments with ways to generate data processing pipelines. The aim is to generate some re-usable building blocks that can be piped together into more functional pipelines. Their primary initial use is as executors for the Squonk Computational Notebook (http://squonk.it), though it is expected that they will also be useful in other environments.

As well as being executable directly, they can also be executed in Docker containers (separately or as a single pipeline). Additionally they can be executed using Nextflow (http://nextflow.io) to allow running large jobs in HPC-like environments.

Currently the project contains some Python scripts that use RDKit (http://rdkit.org) to provide basic cheminformatics and computational chemistry functionality, though other tools will be coming soon, including some from the Java ecosystem.

  • See here for more info on the RDKit components.
  • See here for more info on running these in Nextflow.

Note: this is experimental, subject to change, and there are no guarantees that things work as expected! That said, it's already proved to be highly useful in the Squonk Computational Notebook, and if you are interested let us know, and join the fun.

The code is licensed under the Apache 2.0 license.

Pipeline Utils

In Jan 2018 some of the core functionality from this repository was broken out into the pipelines-utils repository. This included utility Python modules, as well as the creation of a test framework that makes it easier to create and test new modules. This change also makes it easier to create additional pipeline-like projects. See the Readme in the pipelines-utils repo for more details.

General principles

Modularity

Each component should be small but useful. Try to split complex tasks into reusable steps. Think about how the same steps could be used in other workflows. Allow parts of one component to be used in another component where appropriate, but avoid overuse. For example, see the use of functions from rdkit/conformers.py to generate conformers in o3dAlign.py.
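For illustration, a minimal sketch of this kind of reuse (the function name gen_conformers and its signature are hypothetical; see rdkit/conformers.py for the real helpers):

# Hypothetical shared helper, e.g. living in rdkit/conformers.py
from rdkit import Chem
from rdkit.Chem import AllChem

def gen_conformers(mol, num_confs=10):
    # Embed a number of 3D conformers on the molecule and return their IDs
    mol = Chem.AddHs(mol)
    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=num_confs)
    return mol, list(conf_ids)

# Another component (e.g. an alignment script such as o3dAlign.py) can then
# reuse the same helper instead of duplicating the logic:
# from conformers import gen_conformers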

Consistency

Consistent approach to how components function, regarding:

  1. Use as simple command line tools that can be piped together
  2. Inputs and outputs either as files or using STDIN and STDOUT
  3. Any info/logging written to STDERR to keep STDOUT free for output
  4. Consistent approach to command line arguments across components

Generally use consistent coding styles, e.g. PEP8 for Python.
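As a minimal illustration of points 2 and 3 in the list above, a component keeps STDOUT free for records and sends all diagnostics to STDERR:

import sys

def log(msg):
    # diagnostics go to STDERR so STDOUT stays clean for piping
    sys.stderr.write(msg + "\n")

for line in sys.stdin:
    record = line.rstrip("\n")
    log("processing " + record)
    sys.stdout.write(record + "\n")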

Input and output formats

We aim to provide consistent input and output formats to allow results to be passed between different implementations. Currently all implementations handle chemical structures, so SD files would typically be used as the lowest common denominator interchange format, but implementations should also try to support Squonk's JSON based Dataset formats, which potentially allow richer representations and can be used to describe data other than chemical structures. The utils.py module provides helper methods to handle IO.

Thin output

In addition, implementations are encouraged to support "thin" output formats where this is appropriate. A "thin" representation is a minimal representation containing only what is new or changed, and can significantly reduce the bandwidth used and avoid the need for the consumer to interpret values it does not need to understand. It is not always appropriate to support thin format output (e.g. when the structure is changed by the process).

In the case of SDF, the thin format involves using an empty molecule for the molecule block, together with all properties that were present in the input or were generated by the process (the empty molecule is used so that the SDF syntax remains valid).

In the case of Squonk JSON output, the thin output would be of type BasicObject (i.e. containing no structure information) and include all properties that were present in the input or were generated by the process.

Implicit in this is that some identifier (usually an SD file property, or the JSON UUID property) that is present in the input is included in the output so that the full results can be "reassembled" by the consumer of the output. The thin input would typically contain only that identifier plus any additional information required for execution of the process, e.g. the structure.
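To make this concrete, here is a rough sketch of writing a thin JSON record. The uuid and values keys follow the BasicObject layout described above, but real components should use the utils.py helpers rather than hand-rolling the format:

import json
import sys
import uuid

def write_thin_record(source_uuid, new_values, out=sys.stdout):
    record = {
        'uuid': source_uuid,    # identifier carried over from the input
        'values': new_values,   # only the properties generated by this step
    }
    out.write(json.dumps(record) + '\n')

write_thin_record(str(uuid.uuid4()), {'LogP': 1.23})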

For consistency implementations should try to honor these command line switches relating to input and output:

-i and --input: For specifying the location of the single input. If not specified then STDIN should be used. File names ending with .gz should be interpreted as gzipped files. Input on STDIN should not be gzipped.

-if and --informat: For specifying the input format where it cannot be inferred from the file name (e.g. when using STDIN). Values would be sdf or json.

-o and --output: For specifying the base name of the outputs (there could be multiple output files, each using the same base name but with a different file extension). If not specified then STDOUT should be used. Output file names ending with .gz should be compressed using gzip. Output on STDOUT should not be gzipped.

-of and --outformat: For specifying the output format where it cannot be inferred from the file name (e.g. when using STDOUT). Values would be sdf or json.

--meta: Write additional metadata and metrics (mostly relevant to Squonk's JSON format - see below). Default is not to write.

--thin: Write output in thin format (only present where this makes sense). Default is not to use thin format.
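A sketch of how a component might honour these switches using only the Python standard library (argparse and gzip); real components should prefer the helpers in utils.py:

import argparse
import gzip
import sys

parser = argparse.ArgumentParser(description='Example pipeline component')
parser.add_argument('-i', '--input', help='input file (default STDIN)')
parser.add_argument('-if', '--informat', choices=['sdf', 'json'], help='input format')
parser.add_argument('-o', '--output', help='base name for the outputs (default STDOUT)')
parser.add_argument('-of', '--outformat', choices=['sdf', 'json'], help='output format')
parser.add_argument('--meta', action='store_true', help='write additional metadata and metrics')
parser.add_argument('--thin', action='store_true', help='write output in thin format')
args = parser.parse_args()

# .gz file names are handled transparently; STDIN/STDOUT are never gzipped
if args.input:
    infile = gzip.open(args.input, 'rt') if args.input.endswith('.gz') else open(args.input)
else:
    infile = sys.stdin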

UUIDs

The JSON format for input and output makes heavy use of UUIDs that uniquely identify each structure. Generally speaking, if the structure is not changed (e.g. properties are just being added to input structures) then the existing UUID should be retained so that UUIDs in the output match those from the input. However if new structures are being generated (e.g. in reaction enumeration or conformer generation) then new UUIDs MUST be generated as there is no longer a one-to-one relationship between the input and output structures. Instead you probably want to store the UUID of the source structure(s) as a field (or fields) in the output to allow correlation of the outputs with the inputs (e.g. for conformer generation, output the source molecule UUID as a field so that each conformer identifies which source molecule it was derived from).

When not using JSON format the need to handle UUIDs does not necessarily apply (though if there is a field named 'uuid' in the input it will be respected accordingly). To accommodate this it is recommended to ALSO output the input molecule number (1-based index) as a field, independent of whether UUIDs are being handled, as a "poor man's" approach to correlating the outputs with the inputs.
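A minimal sketch of this policy for a step that generates new structures; the field names SourceMolUUID and SourceMolNum are illustrative rather than a defined convention:

import uuid

def new_record(source_record, molecule_number):
    return {
        'uuid': str(uuid.uuid4()),                       # new structure, so a new UUID
        'values': {
            'SourceMolUUID': source_record.get('uuid'),  # correlate back to the input structure
            'SourceMolNum': molecule_number,             # 1-based index, the "poor man's" correlation
        },
    }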

Filtering

When a service filters molecules, special attention is needed to ensure that the molecules are output in the same order as the input (obviously skipping structures that are filtered out). The service descriptor (.dsd.json) file also needs special care; for instance, take a look at the "thinDescriptors" section of src/pipelines/rdkit/screen.dsd.json.

When using multi-threaded execution this is especially important as results will usually not come back in exactly the same order as the input.
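One simple way to handle this, sketched here with an assumed passes_filter function, is to tag each input with its index and sort the surviving results before writing the output:

from concurrent.futures import ThreadPoolExecutor, as_completed

def passes_filter(mol):
    # placeholder for the real filtering logic
    return True

def run(mols):
    kept = []
    with ThreadPoolExecutor() as executor:
        # results complete in arbitrary order when multi-threaded
        futures = {executor.submit(passes_filter, mol): (idx, mol)
                   for idx, mol in enumerate(mols, start=1)}
        for future in as_completed(futures):
            idx, mol = futures[future]
            if future.result():
                kept.append((idx, mol))
    # restore the original input order before writing the output
    kept.sort(key=lambda pair: pair[0])
    return [mol for _, mol in kept]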

Metrics

To provide information about what happened you are strongly recommended to generate a metrics output file (e.g. output_metrics.txt). The contents of this file are fairly simple, each line having a

key=value

syntax. Keys beginning and ending with __ (two underscores) have magical meaning. All other keys are treated as metrics that are recorded against that execution. The magical keys that are currently recognised are:

  • __InputCount__: The total count of records (structures) that are processed
  • __OutputCount__: The count of output records
  • __ErrorCount__: The number of errors encountered

Here is a typical metrics file:

__InputCount__=360
__OutputCount__=22
PLI=360

It defines the input and output counts and specifies that 360 PLI 'units' should be recorded as being consumed during execution.

The purpose of the metrics is primarily to be able to charge for utilisation, but even if you are not charging (which is often the case) it is still good practice to record the utilisation.
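Writing such a file is straightforward; a minimal sketch reproducing the example above:

def write_metrics(path, input_count, output_count, metrics=None):
    with open(path, 'w') as f:
        f.write('__InputCount__=%d\n' % input_count)
        f.write('__OutputCount__=%d\n' % output_count)
        for key, value in (metrics or {}).items():
            f.write('%s=%s\n' % (key, value))

write_metrics('output_metrics.txt', 360, 22, {'PLI': 360})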

Metadata

Squonk's JSON format requires additional metadata to allow proper handling of the JSON. Writing detailed metadata is optional, but recommended. If it is not present then Squonk will fall back to a minimal representation of the metadata, but providing it directly allows additional information to be included.

At the very minimum Squonk needs to know the type of the dataset (e.g. MoleculeObject or BasicObject), but this should be handled for you automatically if you use the utils.default_open_output* methods. Better though to also specify metadata for the field types when you do this; see conformers.py for an example of how to do this.

Deployment to Squonk

The service descriptors need to be POSTed to the Squonk coreservices REST API.

Docker

A shell script can be used to deploy the pipelines to a running containerised Squonk deployment: -

$ ./post-service-descriptors.sh

OpenShift/OKD

The pipelines and service-descriptor container images are built using gradle in this project. They are deployed from the Squonk project using Ansible playbooks.

A discussion about the deployment of pipelines can be found in the Posting Squonk pipelines section of Squonk's OpenShift Ansible README.

Running tests

The test runner is in the pipelines-utils repo and tests are run from there. For full details consult that repo.

But as a quick start you should be able to run the tests in a conda environment like this:

Create a conda environment containing RDKit:

conda env create -f environment-rdkit-utils.yml

Now activate that environment:

conda activate pipelines-utils

Note: this environment includes pipelines-utils and pipelines-utils-rdkit from PyPI. If you need to use changes from these repos you will need to create a conda environment that does not contain these and instead set your PYTHONPATH environment variable to include the pipelines-utils and pipelines-utils-rdkit sources (adjusting /path/to/ to whatever is needed):

export PYTHONPATH=/path/to/pipelines-utils/src/python:/path/to/pipelines-utils-rdkit/src/python

Move into the pipelines-utils repo (this should be alongside pipelines and pipelines-utils-rdkit):

cd ../pipelines-utils

Run tests:

./gradlew runPipelineTester -Pptargs=-opipelines

Contact

For any questions contact:

Tim Dudgeon [email protected]

Alan Christie [email protected]

pipelines's People

Contributors

abradle, alanbchristie, kinow, tdudgeon, waztom


pipelines's Issues

Build-time tags for pipelines images and service descriptors

In OpenShift (and Docker), if an existing pipelines image has been deployed, re-running the poster image adds the modified or new service descriptors, but the underlying image is not necessarily re-pulled.

To solve the problem in OpenShift the imagePullPolicy could be set to Always but this would introduce significant execution delays, especially as pipeline image layers can be substantial.

Another idea is to use explicit tags on the service descriptor's image reference (the imageName property) and, more importantly, set these at build time. We could have a tag formed from a short form of the build date. If we expect to produce just one official copy of the pipeline each day the image tag could be 2018-11-14. We set that in the service descriptor imageName value as it's written to the poster image. Finally, we push the corresponding pipelines image using the same tag.

CMD changes for latest OpenShiftRunner

The runner does not set the container CMD any more. Instead it populates the container with an execute script (in the squonk work directory) and expects the container to run this automatically. This is achieved by relying on CMD being set to ./execute and the WORKDIR being set to the location of the execute script.

Change the container to...

  1. Provide a helpful (default) execute script in the home directory
  2. Ensure WORKDIR is set to the execute script's directory
  3. Ensure CMD is set to ./execute

Some issues with the pipelines

While wrapping some of the Python pipelines for Galaxy we discovered the following issues:

  • Two of the scripts (show_feats.py and constrained_conf_gen.py) don't have main() functions defined; this would be handy for defining entry points to the scripts, as in the conda recipe we wrote here: https://github.com/bioconda/bioconda-recipes/pull/22348/files#diff-a937f0d7f1f974a955b3aac5cbef8db1R16. The same issue would arise with a PyPI package, I expect.
  • The meta flag does not work in pbf_ev.py (maybe deliberate)
  • screen.py returns the number of screened molecules as the exit code. Maybe the intention is to pass this information to the next pipeline somehow, but surely there's a better way of doing it? It should return 0 as long as the script completes successfully.
  • Running gzip on the output of constrained_conf_gen.py gives an unexpected end of file error.

Documentation Update

A task to adjust documentation/scripts for early deployment experiments (scaleway and virtual-box)

Enhancement to rDock docking

  1. Incorporate active site generation and avoid the need to upload a zip file (inputs would be a receptor mol2 file, the reference ligand as a molfile, and the ligands to be docked as SDF)
  2. Allow poses to be filtered based on score relative to the reference ligand

ClusterFps threshold

Hi, thank you for making this repo available.
Can I ask why you set 1 - args.threshold instead of args.threshold?

clusters, dists, matrix = ClusterFps(fps, args.metric, 1.0 - args.threshold)

dists.extend([1-x for x in sims])

And why do you use 1 - x instead of x for the similarity score?

I think I got it: it is because the RDKit implementation works on distances rather than similarities, unlike the original implementation from '99.
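For anyone following along, a small sketch of that reasoning: RDKit's Butina clustering expects distances, so Tanimoto similarities are converted with 1 - sim and the similarity threshold likewise becomes a distance cutoff of 1 - threshold (molecules and threshold here are illustrative only):

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

mols = [Chem.MolFromSmiles(s) for s in ('c1ccccc1', 'c1ccccc1C', 'CCO')]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in mols]

dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend([1 - x for x in sims])   # similarity -> distance

threshold = 0.7                            # similarity threshold
clusters = Butina.ClusterData(dists, len(fps), 1.0 - threshold, isDistData=True)
print(clusters)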

Python tests

Define a good approach for testing the Python (RDKit) components. Currently there is a shell script that can be used to run the scripts, which provides very basic testing capabilities (basically just that the scripts run, not that they work correctly), but something better is needed.
We need to work out how best to structure the scripts so that they can be properly tested.

I tried creating a python unittest (rdkit/TestScreen.py), but it doesn't work yet and there may be better approaches.

Allow running of Nextflow pipelines that use containers that run as non-root users

Currently there are issues when running a Nextflow pipeline that uses docker images that run as a non-root user. This is because the workdir created by Nextflow that is used for execution of that process has permissions of drwxr-xr-x. and (in Squonk execution) is owned by root.root so the container cannot write to that directory.
To work around this it is possible to set docker run options for Nextflow, but currently these apply globally (but see nextflow-io/nextflow#415).
A better solution might be for there to be an option in Nextflow that allows the permissions on the work dir to be specified.

This problem is seen in the rdock.nsd.nf file which uses the informaticsmatters/rdock image that runs as the rdock (501) user.

Use of utils v2

Changes to support pipelines-utils (and pipelines-utils-rdkit) v2.

Refactor

Refactor to align with the other pipeline repos.

Improve test coverage

Using the pipeline tester we need to improve test coverage to ensure that problems seen when restructuring the pipelines utils are caught. Even if this is simply a test that exercises more lines of code - anything to improve Python line coverage.

Introduce pipeline versioning

We need to support multiple pipeline implementation images. At the moment we have an implementation image (latest) and a poster image (latest). We need to support versioning of the pipelines at the base image and pipelines implementation levels.

The proposal is to: -

  1. Put explicit image tags in each service descriptor
  2. Ensure that the pipeline base images (i.e. RDKit) contain a base version using fields in the image name.
  3. Produce pipelines images using tags that define the implementation version
  4. Post pipelines directly from source (rather than using a “poster” image)

An image reference in a service descriptor will therefore look like this:

informaticsmatters/rdkit_pipelines_2019-1:1

Where: -

  • informaticsmatters/rdkit_pipelines is the base image
  • 2019-1 is its base version (i.e. anything after the last “_” in the image name)
  • 1 is the pipelines implementation version

Base and implementation versions

The base image version changes whenever code affecting the base image changes.

The implementation version changes whenever a pipeline (or the utilities it uses) changes or when the base image changes. For example if pipeline image informaticsmatters/rdkit_pipelines_2019-1:1 is updated with a new informaticsmatters/rdkit_pipelines_2019-2 base image, the new pipeline image becomes informaticsmatters/rdkit_pipelines_2019-2:2.

Pipelines in a released image can refer to different implementation versions. For example, it’s perfectly reasonable to have service descriptors referring to many different implementation or base image versions.

Pipeline images, once released (to Docker Hub for example) must not be removed, as any previously released pipeline may already be in use at a customer site.

Upon release (from the master branch of a pipelines repository) the master branch must be tagged with the implementation version using the format rN where N is the release version (an integer).

Pipeline versions

Pipeline implementation version numbers increase but not necessarily in an unbroken sequence.

As pipelines often share a base image they must also share the implementation tag. It is quite possible that the implementation has not changed despite a number of new pipelines image releases. A pipeline may refer to an “old” image version for a considerable time. When the pipeline is eventually updated, the implementation tag it refers to may be significantly larger than its existing version. The implementation number used for a given pipeline service descriptor can therefore jump from 1 to 64 in a single release.

Posting pipelines

This was previously accomplished using a “poster” image. For this feature we will “post” pipelines to a Squonk deployment (via the existing REST service) directly from a check-out of a pipelines source repository - specifically from source cloned from a repository tag (e.g. r17).

Neutralise function

It would be useful to add the ability to neutralise molecules to complement the fragmentation process in rdkit/filter.py.

See the RDKit Cookbook for an example of how to do this.
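As a starting point, a sketch along the lines of the RDKit Cookbook's atom-based neutralisation (not a final design for the component):

from rdkit import Chem

def neutralize_atoms(mol):
    # Zero the formal charge on charged atoms that are not part of a
    # zwitterion-like pair, adjusting the hydrogen count to match.
    pattern = Chem.MolFromSmarts('[+1!h0!$([*]~[-1,-2,-3,-4]),-1!$([*]~[+1,+2,+3,+4])]')
    for match in mol.GetSubstructMatches(pattern):
        atom = mol.GetAtomWithIdx(match[0])
        chg = atom.GetFormalCharge()
        hcount = atom.GetTotalNumHs()
        atom.SetFormalCharge(0)
        atom.SetNumExplicitHs(hcount - chg)
        atom.UpdatePropertyCache()
    return mol

mol = neutralize_atoms(Chem.MolFromSmiles('CC(=O)[O-]'))
print(Chem.MolToSmiles(mol))   # CC(=O)O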

Screen using SSS

Create a component similar to the current rdkit/screen.py component, but that works using substructure search instead of similarity search.

Initially it could take just a single structure as the query (as SMILES or molfile) but it should be extended to allow multiple substructure queries to be specified as an SDF or SMARTS file. The output would be the structures matching any of the query structures. Thought will need to be given to how to record which query structure was matched when generating the results.

screen.py should also be extended to handle multiple query structures.
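A rough sketch of the proposed behaviour, using a single SMARTS query for simplicity (the real component would also accept molfile/SDF queries and record which query matched):

from rdkit import Chem

queries = [Chem.MolFromSmarts('c1ccccc1')]   # e.g. a benzene ring as the substructure query

def matches_any(mol):
    return any(mol.HasSubstructMatch(q) for q in queries)

mols = [Chem.MolFromSmiles(s) for s in ('c1ccccc1CC', 'CCO')]
hits = [m for m in mols if m is not None and matches_any(m)]
print(len(hits))   # 1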
