chemflow's Introduction

ChemFlow is a series of computational chemistry workflows designed to automate and simplify the drug discovery pipeline and scoring function benchmarking.

The workflows allow the user to spend more time thinking, i.e. running benchmarks or experiments, analyzing the data, and making decisions, rather than programming, testing, and debugging their own scripts.

It consists of BASH and PYTHON scripts that can be launched locally (serial or with GNU parallel) or on a compute cluster via PBS.

  • DockFlow : Docking and Virtual Screening
  • LigFlow : Computing AM1-BCC and/or RESP charges for the docked compounds to rescore
  • ScoreFlow : Rescoring using PLANTS, Vina, or MM(PB,GB)SA

Requirements for ChemFlow

We do not provide any of the licensed software used by ChemFlow. It is up to the user to acquire and install PLANTS, Vina, QVina, Smina, Amber, and any other software that might be added in future releases of ChemFlow.

PLANTS and SPORES are both available under a free academic license.

How to get started

Go to the docs/ folder and follow the installation instructions written in the file "installation.rst"

chemflow's People

Contributors

adriencerdan, cbouy, diegoenry, donadef, kgalentino, marionsisquellas, mcecchini75

chemflow's Issues

Review initial MOL2 creation (DockFlow, ScoreFlow, LigFlow)

Currently Tripos mol2 is the default input format for both receptor and ligand in DockFlow and ScoreFlow, and for compounds in LigFlow.

RDKit does not write .mol2 files and sticks to .sdf, the industry standard. That is not an issue for most modern software, but as long as we want to support Vina and SEED we need to properly convert the initial SMILES into MOL2.

Currently we do this conversion with Open Babel (arrrg), and it looks like it does not follow the SYBYL conventions all the time, producing incompatible atom types such as "S.O2" instead of "S.o2". These tiny inconsistencies may break some parts of the code.

We found this after frequently encountering PLANTS "Du" (dummy) atoms in both protein and ligand, which could severely compromise the docking outcome.

If these dummy atoms propagate into MM/GBSA rescoring or QM parametrization, those steps will fail to produce correct parameters.

The latest ChemBase (from CN+DrugBank, July 2018) was based on SDF->MOL2 conversion using Open Babel, so once again we will probably need to trash it. We put a quick fix inside Dock/ScoreFlow that uses antechamber to convert the files.

Finally, for ChemBase we can keep using .sdf and may migrate to the Amber "database" command.

Writing on the same file in parallel

So far this has not caused any problem with GNU parallel on 8 cores, but on mazinger the final results in the CSV tables (ranking.csv) are completely scrambled.
We need a proper, safe way to write the final CSV tables when computing in parallel, or else write them serially.
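One possible approach, sketched below in Python, is to let every worker write its own part file and merge the parts serially at the end, so that no two processes ever write to the same file. The function names and the "ranking.part<N>.csv" naming scheme are assumptions made for this illustration and are not part of ChemFlow.

```python
import csv
from pathlib import Path


def write_part(parts_dir, worker_id, rows):
    """Write one worker's rows to its own file (hypothetical naming scheme)."""
    part = Path(parts_dir) / f"ranking.part{worker_id}.csv"
    with part.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return part


def merge_parts(parts_dir, output, header):
    """Serially concatenate all part files into the final table."""
    output = Path(output)
    with output.open("w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        # Sorted so the merged table is reproducible between runs.
        for part in sorted(Path(parts_dir).glob("ranking.part*.csv")):
            with part.open(newline="") as handle:
                writer.writerows(csv.reader(handle))
    return output
```

The merge step runs once, after all workers have finished, so it needs no locking.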

Eliminate OPENBABEL

Open Babel is great but a big pain, as it may not follow the appropriate file formatting.

We must remove as many intermediate steps as possible.

Add example folder for DockFlow

Provide an example input containing at least one complex with a known solution for "redocking".

HIV-1 is not the best choice, but it can serve as the test case.

DockFlow has some possible scenarios that should be addressed by the notebooks.

  • Docking Protocol validation - Simple docking: with known results for one complex.
  • Docking Protocol validation - Cross docking: with known results for more than one complex.
  • Protocol validation - Simple docking: with known results for one complex.
    ** Docking Notebooks must be able to compare runs with different parameters.
  • Virtual Screening Protocol validation - with known results for some complexes, actives, inactives, DUD-E

Default evaluation metrics (please update)

  • Docking
    ** RMSD to reference structure (if available)
    ** RMSDFlow options to allow flexibility
  • Virtual Screening
    ** Area under the receiver operating characteristic curve (ROC)
    ** Area under the accumulation curve (AUAC)
    ** Average rank of actives (Fraction of actives among Top3,5,10... )
    ** Enrichment factor (EF)
    ** Robust initial enhancement (RIE) (Future)
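Two of the virtual screening metrics listed above can be sketched in plain Python as follows. The sketch assumes that lower scores are better (as for docking energies) and that labels are truthy for actives; the function names are illustrative only, and this is not how ReportFlow necessarily computes them.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (active, decoy) pairs in which the active has the better (lower) score.
    Ties count as half a win."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    decoys = [s for s, is_active in zip(scores, labels) if not is_active]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a < d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction):
    """EF = (actives in top fraction / size of top fraction)
            / (all actives / library size)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0])
    n_top = max(1, round(fraction * len(ranked)))
    hits_top = sum(1 for _, is_active in ranked[:n_top] if is_active)
    total_actives = sum(1 for is_active in labels if is_active)
    if total_actives == 0:
        raise ValueError("need at least one active")
    return (hits_top / n_top) / (total_actives / len(ranked))
```

The pairwise AUC is quadratic in the library size, which is fine for a tutorial-sized set; a rank-based formula would be preferable for large screens.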

Example of the expected output

ScoreFlow MMGBSA in parallel on local computer.

MMGBSA needs "ante-MMPBSA.py" to generate the input files.
This takes some time because it is a Python script, and it could easily be done in parallel.

In addition, when post-processing MD simulations of a SINGLE system, one may benefit from MMPBSA.py.MPI.

The solution could be the same as always: echo all commands, including "cd ${RUNDIR}/${LIGAND}" into a "mmpbsa.xargs" file then running them all according to the number of available CPUs.
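As a rough sketch of that idea, the command lines could be generated like this. The helper name is hypothetical, and the command itself is passed in as a string rather than spelled out, since the exact ante-MMPBSA.py arguments depend on the system.

```python
import shlex


def build_command_lines(rundir, ligands, command):
    """Return one shell line per ligand: change into the ligand folder,
    then run the given command there."""
    lines = []
    for ligand in ligands:
        # Quote the path so ligand names with spaces do not break the line.
        workdir = shlex.quote(f"{rundir}/{ligand}")
        lines.append(f"cd {workdir} && {command}")
    return lines
```

Writing the returned lines to "mmpbsa.xargs", one per line, gives a file whose entries are independent and can be dispatched according to the number of available CPUs.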

ReportFlow

Leave this open until we finish a decent description of features.

Also a how-to would be great.

Allow user to change parameters in MMGBSA.

Enable GBSA models igb=1,2,5,8
Implement PBSA (igb=RTFM)

Most GB/PB model parameters are auto-assigned during runtime (AMBER) for Minimization and MD simulations after "igb" is defined.

For MM(PB,GB)SA post-processing it is mandatory to modify the TOPOLOGY using "ante-MMPBSA.py".
Make sure to use the proper setup for each model.

DockFlow : Implement more docking software

Right now, DockFlow is not structured in a way that makes adding a new docking program easy.

  • Add more freedom to the user for PLANTS
  • Refactor code similarly to ScoreFlow
  • Implement Vina

Fix "rewrite ligands" in DockFlow.

DockFlow asks whether to "rewrite original ligands". If we say NO and the molecules are not in the LigFlow folder, it behaves weirdly.

Every time we run, it should check whether each ligand is already there, and otherwise just write it.
If it finds the ligand there, it should ask whether to rewrite that specific ligand, or offer --rewrite-all.
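The intended decision logic might look roughly like the sketch below. It assumes one "<name>.mol2" file per ligand inside the LigFlow folder, which is a guess about the layout; the function and parameter names are hypothetical.

```python
from pathlib import Path


def ligands_to_write(ligands, ligflow_dir, rewrite=(), rewrite_all=False):
    """Return the ligands that must be (re)written.

    A ligand is written when its file is missing, when the user asked to
    rewrite it specifically, or when rewrite_all is set.
    """
    rewrite = set(rewrite)
    selected = []
    for name in ligands:
        # Assumed layout: <ligflow_dir>/<name>.mol2
        exists = (Path(ligflow_dir) / f"{name}.mol2").is_file()
        if not exists or rewrite_all or name in rewrite:
            selected.append(name)
    return selected
```

With this logic, answering NO to the rewrite question never skips a ligand that is missing from the folder.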

--postdock selections.

Implement selections for DockFlow output using --postdock.

The actions are currently in "ChemFlow_tools.bash" and allow the user to select and output:

  • ALL compounds, ranked by energy + ALL docking poses ( --postdock default )
  • SOME compounds, ranked by energy + SOME docking poses

Re-implement ChemFlow config file, start with ScoreFlow.

I suggest starting with the AMBER parameters.

  1. ChemFlow --write-config to write a configuration file.

  2. Something like ChemFlow --config project_protocol_target_chemflow.config should be enough to run.

  3. Include ALL mandatory parameters, such as --receptor and --ligand, with short comments explaining why they are needed.

  4. After running a WorkFlow, make sure a new configuration file is written, with all used parameters, including defaults.
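A minimal sketch of such a round trip, using Python's standard configparser, is shown below. The section names, the default values, and the function names are all assumptions chosen for illustration; the real ChemFlow configuration format may differ.

```python
import configparser
import io

# Hypothetical defaults, used only for this illustration.
DEFAULTS = {"igb": "8", "saltcon": "0.150"}


def render_config(receptor, ligand, overrides=None):
    """Return the configuration text, including every default that was used."""
    params = dict(DEFAULTS)
    params.update({key: str(value) for key, value in (overrides or {}).items()})
    config = configparser.ConfigParser()
    config["input"] = {"receptor": receptor, "ligand": ligand}
    config["amber"] = params
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def load_config(text):
    """Parse configuration text and check the mandatory parameters."""
    config = configparser.ConfigParser()
    config.read_string(text)
    missing = [key for key in ("receptor", "ligand")
               if not config.get("input", key, fallback="")]
    if missing:
        raise ValueError(f"missing mandatory parameters: {', '.join(missing)}")
    return {section: dict(config[section]) for section in config.sections()}
```

Because the written file contains the defaults as well as the user's overrides, it documents exactly how a run was performed.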

Make sure all combinations in the GUI are covered!

Optimize file I/O.

File input/output is not an issue for small libraries with hundreds of files on local SSD disks.
But when we move to a cluster or supercomputer like mesocentre, it becomes an important bottleneck.

Not only reading and writing, but also our CHECKPOINTS are very I/O demanding. Note that we frequently check whether certain files exist before continuing.

Input dockflow

DockFlow should not read the .config file automatically.
DockFlow should raise an error if both a config file and arguments are passed, or combine them if possible.
DockFlow should exit with a fatal error if neither arguments nor a config file are provided.
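One way to read these rules is sketched below: combine both sources when they agree, and fail when they conflict or when nothing is given. The function name and the use of plain dictionaries are assumptions for this illustration.

```python
def resolve_inputs(cli_args=None, config_args=None):
    """Combine command-line and config-file parameters.

    Raises ValueError when nothing is provided, or when the two sources
    give different values for the same parameter.
    """
    if not cli_args and not config_args:
        raise ValueError("provide command-line arguments or a config file")
    merged = dict(config_args or {})
    for key, value in (cli_args or {}).items():
        if key in merged and merged[key] != value:
            raise ValueError(
                f"conflicting values for {key!r}: "
                f"{merged[key]!r} (config) vs {value!r} (command line)"
            )
        merged[key] = value
    return merged
```

A stricter variant would reject any mix of the two sources outright; the combining version is shown because the issue lists it as the preferred outcome where possible.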

Antechamber cannot be run in parallel in the same folder.

I just realized that antechamber cannot be executed concurrently in the same folder.

For each ligand it generates a number of temporary files with the same names.

A quick solution could be to run it from a temporary folder; example here:
dbarreto@hpc-login1:~/ChemFlow_paper/Benchmark_meso$ vim LigFlow_names.bash

cd $(mktemp -d ) ; antechamber -i ${RUNDIR}/${LIGAND}.mol2 ...
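The same idea can be sketched in Python: each call gets its own scratch directory, so temporary files with identical names never collide. The helper name is hypothetical, and the command is passed in as a list rather than hard-coded.

```python
import subprocess
import tempfile


def run_isolated(command):
    """Run a command inside a fresh temporary directory.

    The directory is deleted afterwards, so the command must read its
    input from, and write its results to, absolute paths elsewhere.
    """
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            command,
            cwd=workdir,          # scratch files land here, not in a shared folder
            capture_output=True,
            text=True,
            check=False,          # the caller inspects returncode
        )
```

Since every invocation has a private working directory, several of these calls can run at the same time from a process or thread pool.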

Flexibilize MM-GBSA calculation Part 1: Simulation.

There are multiple flavors of MM-(GB)SA; one should be able to access at least some of the parameters within ScoreFlow during simulation.

  1. Choice of simulation engine:
    AMBER serial: sander or pmemd ($$$)
    AMBER parallel: sander.MPI or pmemd.MPI ($$$)
    AMBER GPU: pmemd.cuda ($$$)
    GROMACS: gmx mdrun (free)
    GROMACS: gmx_mpi mdrun (free)

  2. Choice of solvation model
    implicit: igb=2,5,8 for GBSA, igb=1,3 for PBSA.
    explicit solvent: Water model and box-size, counter ions must be added.

  3. Simulation options.
    3.1) Minimization, number of steps (nsteps).
    3.2) Minimization (nsteps) + MD (nsteps)

MD must include a heating and equilibration protocol, followed by production.

[DockFlow] Flexibilize PostDock

During --postdock the user may apply selections; the final result will then be filtered accordingly.

Possibly add --output_mol2 and --output_csv options to write the .mol2 and .csv files after filtering.

  1. Number of poses / ligand. (it is always sorted ! so get the 1st one)
  2. A total Number of LIGANDS/POSES to output, ranked BY energy.
  3. ENERGY cut off. (default to nothing)
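The three filters above could be combined roughly as in the following sketch. It assumes poses are given as (ligand, pose_id, energy) tuples with lower energies being better; the function and parameter names are illustrative and not existing DockFlow options.

```python
def filter_poses(poses, poses_per_ligand=None, max_total=None, energy_cutoff=None):
    """Rank poses by energy and apply the optional filters.

    poses: iterable of (ligand, pose_id, energy) tuples.
    A filter set to None is not applied.
    """
    ranked = sorted(poses, key=lambda pose: pose[2])
    kept = []
    count_per_ligand = {}
    for ligand, pose_id, energy in ranked:
        if max_total is not None and len(kept) >= max_total:
            break
        if energy_cutoff is not None and energy > energy_cutoff:
            break  # the list is sorted, so every later pose is worse
        if (poses_per_ligand is not None
                and count_per_ligand.get(ligand, 0) >= poses_per_ligand):
            continue
        count_per_ligand[ligand] = count_per_ligand.get(ligand, 0) + 1
        kept.append((ligand, pose_id, energy))
    return kept
```

The returned list is already ranked, so it can be written directly to a CSV table or used to select molecules for a filtered MOL2 file.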

ChemFlow_tools used to do it using python ❤️

Write Description for AffinityFlow

AffinityFlow is the tool that performs MM(PB,GB)SA, and probably LIE, on MD trajectories.
It is a specialization of ScoreFlow, but we "market" it as a separate tool.

LigFlow/ScoreFlow search for ChemBase parametrized molecules

LigFlow/ScoreFlow should first search for a compound in ChemBase for parameters instead of recomputing all over again.

This is especially important for AM1-BCC, and even more so for RESP, both of which take a lot of time.

Another point here is to avoid recomputing the same parameters for the SAME molecule when a user wants to rescore multiple docking conformations with MMGBSA.

Review ScoreFlow

Go through the code to simplify as much as possible, without adding or removing functionality.
Probe for bugs, and missing features, plan enhancements.

Licence

As mentioned by Simone some time ago, in due time we will need to check with the University which licence should be used.

Flexibilize MMGBSA Part 2: Post-processing.

  1. Choice of implicit model:
    igb=2,5,8 for GBSA
    igb=1,3 for PBSA.

  2. Choice of interval:
    Default 10.

Input file for running GB2
&general
verbose=0,keep_files=0,interval=${INTERVAL}
/
&gb
igb=${IGB}, saltcon=0.150
/
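Filling in this template could be sketched as below. The function name is hypothetical, and the set of accepted igb values is taken from the GBSA list in this issue (2, 5, 8); it is exposed as a parameter because other issues mention different sets.

```python
def mmpbsa_input(igb, interval=10, saltcon=0.150, allowed_igb=(2, 5, 8)):
    """Return the text of the GB input file shown above,
    with the interval and igb placeholders filled in."""
    if igb not in allowed_igb:
        raise ValueError(f"igb={igb} is not one of {allowed_igb}")
    return (
        "&general\n"
        f"verbose=0,keep_files=0,interval={interval}\n"
        "/\n"
        "&gb\n"
        f"igb={igb}, saltcon={saltcon:.3f}\n"
        "/\n"
    )
```

Validating igb before writing the file gives the user an immediate, readable error instead of a failure later in the run.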

Write DockFlow tutorial & files - for a simple docking

The tutorial must provide a simple docking run and a validation of the protocol using DUD-E decoys.

1 - Brief DockFlow description
2 - What's the tutorial about: simple docking and validation of protocol using Scoring, ROC. (short description)

The notebook should be able to report false positives and false negatives, Top 3, 5, 10 lists, and the ROC curve.

Review DockFlow

Go through the code to simplify as much as possible, without adding or removing functionality.
Probe for bugs, and missing features, plan enhancements.

Write DockFlow tutorial & files - for a virtual screening

The tutorial must provide a virtual screening with multiple known ligands (maybe also including DUD-E decoys).

1 - Brief DockFlow VS mode description
2 - What the tutorial is about: we pretend we have no idea about the affinity of most molecules in the library.
3 - It would be useful to allow the user to include a list of "known binders"; this will basically be a repetition of the DUD-E runs. Then we conclude with ROC and ranking curves.

Review .config

The .config file is way too complicated.
I think we can put the details in the manual and create a tool to build this .config file.

The truth is that most users won't read the manual; they expect a "command -h" or a "configure" to give them, or prepare for them, all they need.

Minor improvements for Tools

  • ConfigFlow : command line version
  • RMSDFlow : sort poses by RMSD before plotting
  • splitmol : make a proper distribution of compounds per MOL2 file
  • ChemFlow : list the workflows / tools available in ChemFlow, with a short description and maybe show a pdf version of the doc
  • ReportFlow : list the different notebooks available, and based on the user's choice, a notebook is then copied to the run folder and automatically opened.

Extract MOL2 in python

Eventually we need to extract a list of molecules from a BIG mol2 file. Currently we do it with "ChemFlow_extract_mol2.f90", but since it is Fortran (arrg) code, it needs to be compiled.

Modernize this into a beautiful python script.
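A minimal sketch of such a script is given below. It relies on the Tripos convention that every molecule starts with an "@<TRIPOS>MOLECULE" record whose following line holds the molecule name; the function names are hypothetical. Reading line by line keeps memory use low for big files.

```python
MOLECULE_TAG = "@<TRIPOS>MOLECULE"


def iter_mol2_blocks(lines):
    """Yield one list of lines per molecule from a multi-molecule mol2 stream."""
    block = []
    for line in lines:
        if line.startswith(MOLECULE_TAG) and block:
            yield block
            block = []
        block.append(line)
    if block:
        yield block


def molecule_name(block):
    """Return the name stored on the line after the MOLECULE record, if any."""
    for index, line in enumerate(block[:-1]):
        if line.startswith(MOLECULE_TAG):
            return block[index + 1].strip()
    return None


def extract_molecules(lines, wanted):
    """Return the text of every molecule whose name is in `wanted`."""
    wanted = set(wanted)
    return ["".join(block) for block in iter_mol2_blocks(lines)
            if molecule_name(block) in wanted]
```

An open file object can be passed directly as `lines`, for example `extract_molecules(open("library.mol2"), ["lig2"])`, where "library.mol2" is a placeholder file name.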

SmilesToMol2

If SmilesToMol2 fails on an arbitrary SMILES string, it may not generate any output.
Ideally it should continue and report the errors.
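The "continue and report" behaviour could follow the pattern sketched below. The actual conversion is passed in as a function, because this sketch makes no assumption about how SmilesToMol2 performs it; all names are hypothetical.

```python
def convert_all(smiles_by_name, convert):
    """Convert every entry, collecting failures instead of stopping.

    smiles_by_name: mapping of compound name to SMILES string.
    convert: callable that returns the converted molecule or raises.
    Returns (converted, errors), both keyed by compound name.
    """
    converted = {}
    errors = {}
    for name, smiles in smiles_by_name.items():
        try:
            converted[name] = convert(smiles)
        except Exception as exc:  # report any failure and keep going
            errors[name] = f"{type(exc).__name__}: {exc}"
    return converted, errors
```

After the loop, the `errors` mapping can be printed or written to a log so that the user knows exactly which compounds were skipped and why.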

ChemFlow executable

TODO :
A simple tool to :

  • list the workflows / tools available in ChemFlow, with a short description
  • and maybe show a pdf version of the doc

Merge HGFlow into ChemFlow

HGFlow came from two separate projects. Paulina and I had different do-it-all scripts, but we rewrote them together into a single, more powerful one.

Naturally, the HGFlow example files will come from the SAMPL4, 5 and 6 challenges, and we should keep updating them.

In the NEAR future, HGFlow outputs will be part of the HostGuest database (HGBase), our website with ALL well-standardized free energy runs and results.

dockflow vina

Ligand preparation (PDBQT) does not work with versions of MGLTools other than 1.5.6.
Tested with 1.5.4 and 1.5.7.
Also, MGLTools is not listed among the required software.

ChemFlow header

Dear ChemFlow developers,
we should not forget to put a header (when running the code) indicating our affiliation (University of Strasbourg), link to the IFM website, relevant publications and the listing of contributors.
