ifmlab / chemflow

Computational Chemistry Workflows

Shell 39.72% Python 58.80% QMake 0.34% C++ 0.17% Fortran 0.54% CSS 0.44%

computational-chemistry computational-science docking scoring-functions workflow mmpbsa mmgbsa middleware

chemflow's Introduction

ChemFlow

ChemFlow is a series of computational chemistry workflows designed to automate and simplify the drug discovery pipeline and scoring function benchmarking.

The workflows allow the user to spend more time thinking, i.e. running benchmarks or experiments, analyzing the data, and making decisions, rather than programming/testing/debugging their own scripts.

It consists of Bash and Python scripts that can be launched locally (serial or with GNU parallel) or on a compute cluster via PBS.

DockFlow : Docking and Virtual Screening
LigFlow : Computing AM1-BCC and/or RESP charges for the docked compounds to rescore
ScoreFlow : Rescoring using PLANTS, Vina, or MM(PB,GB)SA

Requirements for ChemFlow

We do not provide any of the licensed software used by ChemFlow. It is up to the user to acquire and install PLANTS, Vina, QVina, Smina, Amber and any other software that might be added in future releases of ChemFlow.

PLANTS and SPORES are both available under a free academic license.

How to get started

Go to the docs/ folder and follow the installation instructions written in the file "installation.rst"

chemflow's Issues

Review initial MOL2 creation (DockFlow, ScoreFlow, LigFlow)

Currently Tripos mol2 is the default input format for both receptor and ligand in DockFlow and ScoreFlow, and for compounds in LigFlow.

RDKit does not write .mol2 files, sticking to .sdf, which is the industry standard. That's not an issue for most modern software, but as long as we want to support VINA and SEED we need to properly convert the initial SMILES into MOL2.

Currently we do this conversion using openbabel (arrrg), and it looks like it doesn't follow SYBYL all the time, producing incompatible atom types such as "S.O2" instead of "S.o2", tiny inconsistencies which may break some parts of the code.

We found this after frequently finding PLANTS "Du" (dummy) atoms in both protein and ligand, which could severely compromise the docking outcome.

Finally if these DUMMY atoms progress into MM/GBSA rescoring or QM parametrization, they'll fail to produce correct parameters.

The latest ChemBase (from CN+DrugBank, July 2018) was based on SDF->MOL2 conversion using OPENBABEL, so once again we will probably need to trash it. We put a quick fix inside Dock/ScoreFlow using antechamber to convert the files.

Finally, for ChemBase we can keep using .sdf and may migrate to amber "database" command.

Enable explicit Mini/MD in ScoreFlow

Writing on the same file in parallel

This didn't cause any problem with GNU parallel and 8 cores for now, but on mazinger the final results in the CSV tables (ranking.csv) are completely messed up.
We need to find a proper and secure way to write the final CSV tables for parallel computing, or do it in serial...
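One possible way to avoid concurrent writes, sketched here under assumptions: each worker writes its own part file, and a single serial step merges them into the final table. The function names, the `.part.csv` naming and the "score in the last column" layout are hypothetical, not taken from the ChemFlow code.

```python
import csv
from pathlib import Path


def write_part(part_dir, ligand, rows):
    """Write one worker's rows to its own file, so no two writers share a file."""
    part = Path(part_dir) / f"{ligand}.part.csv"
    tmp = part.with_suffix(".tmp")
    with open(tmp, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    tmp.replace(part)  # rename only once the file is complete
    return part


def merge_parts(part_dir, output, header):
    """Serial step: concatenate all part files into the final ranked table."""
    rows = []
    for part in sorted(Path(part_dir).glob("*.part.csv")):
        with open(part, newline="") as handle:
            rows.extend(csv.reader(handle))
    # Assumption: the score is the last column and lower is better.
    rows.sort(key=lambda row: float(row[-1]))
    with open(output, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Only `merge_parts` touches the final CSV, so the parallel part of the run never shares a file handle.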

Images for ChemFlow

Eliminate OPENBABEL

Babel is great but a big pain, as it may not follow the appropriate formatting of the files.

We must remove as many intermediate steps as possible.

Reimplement "ChemFlow/bin/bonding_shape.py" for VINA

Bonding shape outputs a radius or XYZ lengths that cover the full input.mol2 molecule, adding a radius to its dimensions.

Ideally we need these XYZ lengths to run our dockings with VINA properly.
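A minimal sketch of the XYZ-lengths calculation, assuming a single-molecule Tripos MOL2 input and a uniform padding added on each side. The function name and the default padding are hypothetical; this is not the existing bonding_shape.py.

```python
def bounding_box(mol2_text, padding=5.0):
    """Return (center, lengths) of a box enclosing all atoms plus padding on each side."""
    coords = []
    in_atoms = False
    for line in mol2_text.splitlines():
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
            continue
        fields = line.split()
        if in_atoms and len(fields) >= 6:
            # Columns: atom_id, atom_name, x, y, z, atom_type, ...
            coords.append(tuple(float(value) for value in fields[2:5]))
    if not coords:
        raise ValueError("no atoms found in the @<TRIPOS>ATOM section")
    mins = [min(axis) for axis in zip(*coords)]
    maxs = [max(axis) for axis in zip(*coords)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    lengths = tuple((hi - lo) + 2 * padding for lo, hi in zip(mins, maxs))
    return center, lengths
```

The two tuples map directly onto Vina's `--center_x/y/z` and `--size_x/y/z` options.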

Add example folder for DockFlow

Provide an example input, containing at least one complex with known solution for "redock"

HIV-1 is not the best example, but it can serve.

DockFlow has some possible scenarios that should be addressed by the notebooks.

Docking Protocol validation - Simple docking: with known results for one complex.
Docking Protocol validation - Cross docking: with known results for more than one complex.
Protocol validation - Simple docking: with known results for one complex.
** Docking Notebooks must be able to compare runs with different parameters.
Virtual Screening Protocol validation - with known results for some complexes, actives, inactives, DUD-E

Default evaluation metrics (please update)

Docking
** RMSD to reference structure (if available)
** RMSDFlow options to allow flexibility
Virtual Screening
** Area under the receiver operating characteristic curve (ROC)
** Area under the accumulation curve (AUAC)
** Average rank of actives (fraction of actives among the Top 3, 5, 10, ...)
** Enrichment factor (EF)
** Robust initial enhancement (RIE) (Future)
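As an illustration of two of the virtual screening metrics above, here is a small pure-Python sketch. It assumes the compounds are already sorted from best to worst score and that `ranked_labels` (a hypothetical name) holds 1 for actives and 0 for inactives; ties in score are not handled.

```python
def enrichment_factor(ranked_labels, fraction):
    """EF = (hit rate in the top fraction) / (hit rate in the whole library)."""
    total = len(ranked_labels)
    actives = sum(ranked_labels)
    top_n = max(1, round(total * fraction))
    hits = sum(ranked_labels[:top_n])
    return (hits / top_n) / (actives / total)


def roc_auc(ranked_labels):
    """ROC AUC as the fraction of (active, inactive) pairs where the active ranks first."""
    actives = sum(ranked_labels)
    inactives = len(ranked_labels) - actives
    inactives_seen = 0
    concordant = 0
    for label in ranked_labels:
        if label:
            # This active outranks every inactive not yet encountered.
            concordant += inactives - inactives_seen
        else:
            inactives_seen += 1
    return concordant / (actives * inactives)
```

Both functions expect at least one active and one inactive in the list; a notebook would normally rely on an established library for these metrics, so this only shows what is being computed.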

Example of the expected output

ScoreFlow MMGBSA in parallel on local computer.

MMGBSA needs "ante-MMPBSA.py" to generate the input files.
It takes some time because it's a Python script, and could easily be done in parallel.

In addition, when post-processing MD simulations of a SINGLE system, one may benefit from MMPBSA.py.MPI.

The solution could be the same as always: echo all commands, including "cd ${RUNDIR}/${LIGAND}" into a "mmpbsa.xargs" file then running them all according to the number of available CPUs.
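The same idea can be sketched in Python, assuming a POSIX shell and a command file with one command per line (such as the "mmpbsa.xargs" file described above). The function name is hypothetical.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command_file(path, workers=None):
    """Run each non-empty line of `path` as a shell command, `workers` at a time."""
    with open(path) as handle:
        commands = [line.strip() for line in handle if line.strip()]

    def run(command):
        # Each command gets its own shell, so a leading "cd ..." stays local to it.
        done = subprocess.run(command, shell=True, capture_output=True, text=True)
        return command, done.returncode

    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return list(pool.map(run, commands))
```

Threads are enough here because the work happens in the child processes; the returned exit codes make it easy to list the ligands that failed.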

ReportFlow

Leave this open until we finish a decent description of features.

Also a how-to would be great.

MERGE devel to master by the end of July.

We will close modifications to DockFlow & ScoreFlow by end of July.

We must preserve Tools, MDFlow and HGFlow originals and merge the rest.

Write ScoreFlow tutorial & files - for processing docking results (or mol2/pdb list)

ScoreFlow should work in method

Notebook should be able to compare Scoring functions, and report False Positives and False negatives, Top 3,5,10 lists and ROC.

Allow user to change parameters in MMGBSA.

Enable GBSA models igb=1,2,5,8
Implement PBSA (igb=RTFM)

Most GB/PB model parameters are auto-assigned during runtime (AMBER) for Minimization and MD simulations after "igb" is defined.

For MM(PB,GB)SA post-processing it is mandatory to modify the TOPOLOGY using "ante-MMPBSA.py".
Make sure to use the proper setup for each model.

DockFlow : Implement more docking softwares

Right now, DockFlow is not optimized to easily implement a new docking software.

Add more freedom to the user for PLANTS
Refactor code similarly to ScoreFlow
Implement Vina

Fix "rewrite ligands" in DockFlow.

DockFlow asks to "rewrite original ligands"; if we say NO and the molecules aren't in the LigFlow folder, it behaves weirdly.

Every time we run, it should verify whether each ligand is already there, or else just write it there.
If it finds the ligand there, it should ask to rewrite that specific ligand or offer --rewrite-all.

--postdock selections.

Implement selections for DockFlow output using --postdock.

The actions are currently in "ChemFlow_tools.bash" and allow the user to select and output:

ALL compounds, ranked by energy + ALL docking poses ( --postdock default )
SOME compounds, ranked by energy + SOME docking poses

Re-implement ChemFlow config file, start with ScoreFlow.

I suggest to start by AMBER parameters.

ChemFlow --write-config to write a configuration file.
Something like ChemFlow --config project_protocol_target_chemflow.config should be enough to run.
Include ALL mandatory parameters such as --receptor and --ligand, with short comments about why they are needed.
After running a WorkFlow, make sure a new configuration file is written, with all used parameters, including defaults.
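A rough sketch of the "write back every used parameter" idea, using Python's standard configparser. The section name, the parameter names and the default values are hypothetical placeholders, not ChemFlow's actual options.

```python
import configparser
import io

DEFAULTS = {"igb": "2", "interval": "10"}  # hypothetical default parameters


def write_config(user_params, section="ScoreFlow"):
    """Return config text holding every parameter actually used, defaults included."""
    for required in ("receptor", "ligand"):
        if required not in user_params:
            raise ValueError(f"missing mandatory parameter: {required}")
    parser = configparser.ConfigParser()
    # User values override the defaults; everything is stored as text.
    parser[section] = {**DEFAULTS, **{key: str(value) for key, value in user_params.items()}}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def read_config(text, section="ScoreFlow"):
    """Parse config text back into a plain dictionary of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[section])
```

Writing the merged dictionary rather than only the user's input means a run can be repeated from the config file alone, even if the defaults change later.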

Make sure all combinations in the GUI are covered!

Optimize file I/O.

File Input/Output is not an issue with small libraries with hundreds of files and local SSD disks.
But when we move to a cluster/supercomputer like mesocentre that becomes an important bottleneck.

Not only reading and writing, but our CHECKPOINTS are also very I/O demanding. Notice that we frequently check whether some files exist before continuing.

How are ligands protonated? [question]

What tool and parameters are you using?

Migrate HGFlow tutorial here (and update it)

We're writing HGFlow tutorials for AMBER and GROMACS so far.

Joel is working on a CP2K version.

Input dockflow

DockFlow should not automatically read the .config file.
DockFlow should raise an error if you pass both a config file and arguments, or combine them if possible.
DockFlow should exit with a fatal error if you provide neither arguments nor a config file.

Antechamber cannot be run in parallel in the same folder.

I just realized that antechamber cannot be executed concurrently in the same folder.

For each ligand it generates a number of temporary files with the same names.

Quick solution could be running from a temporary folder, example here:
dbarreto@hpc-login1:~/ChemFlow_paper/Benchmark_meso$ vim LigFlow_names.bash

cd $(mktemp -d ) ; antechamber -i ${RUNDIR}/${LIGAND}.mol2 ...
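The temporary-folder trick can also be sketched in Python. This is a generic helper under assumptions: `run_in_scratch`, `keep` and `destination` are hypothetical names, and the antechamber call in the comment is only an illustrative command line.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_in_scratch(command, keep=(), destination="."):
    """Run `command` in a fresh temporary directory; copy the `keep` files back."""
    destination = Path(destination).resolve()
    with tempfile.TemporaryDirectory() as scratch:
        done = subprocess.run(command, cwd=scratch, capture_output=True, text=True)
        for name in keep:
            produced = Path(scratch) / name
            if produced.exists():
                shutil.copy2(produced, destination / name)
    # The scratch directory and every other temporary file are gone at this point.
    return done


# Hypothetical call; the input path must be absolute because the command runs elsewhere:
# run_in_scratch(["antechamber", "-i", "/abs/path/ligand.mol2", "-fi", "mol2",
#                 "-o", "ligand_bcc.mol2", "-fo", "mol2", "-c", "bcc"],
#                keep=["ligand_bcc.mol2"], destination="/abs/path")
```

Because every call gets its own directory, identically named temporary files from parallel runs can no longer collide.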

Update ScoreFlow tutorial with theoretical background

What's a scoring function, how different can they be for docking, virtual screening and rescoring.
What else can we use (that actually works) to rank docking poses?

Flexibilize MM-GBSA calculation Part 1: Simulation.

There are multiple flavors of MM-(GB)SA; one should be able to access at least some of the parameters within ScoreFlow during simulation.

Choice of simulation engine:
AMBER serial: sander or pmemd ($$$)
AMBER parallel: sander.MPI or pmemd.MPI ($$$)
AMBER GPU: pmemd.cuda ($$$)
GROMACS: gmx mdrun (free)
GROMACS: gmx_mpi mdrun (free)
Choice of solvation model
implicit: igb=2,5,8 for GBSA, igb=1,3 for PBSA.
explicit solvent: Water model and box-size, counter ions must be added.
Simulation options.
3.1) Minimization, number of steps (nsteps).
3.2) Minimization (nsteps) + MD (nsteps)

MD must include a heating and equilibration protocol, followed by production

[DockFlow] Flexibilize PostDock

During --postdock a user may apply the selections below; the final result will be a filtered set.

Possibly add --output_mol2 and --output_csv options to write the .mol2 and .csv files after filtering.

Number of poses / ligand. (it is always sorted ! so get the 1st one)
A total Number of LIGANDS/POSES to output, ranked BY energy.
ENERGY cut off. (default to nothing)
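The three filters could be combined roughly as follows. This is a sketch assuming each docking result is a (ligand, pose, energy) tuple and that lower energy is better; the function and parameter names are hypothetical.

```python
def filter_poses(rows, poses_per_ligand=None, max_total=None, energy_cutoff=None):
    """Rank (ligand, pose, energy) rows by energy and apply the optional filters."""
    ranked = sorted(rows, key=lambda row: row[2])
    kept, seen = [], {}
    for ligand, pose, energy in ranked:
        if max_total is not None and len(kept) >= max_total:
            break
        if energy_cutoff is not None and energy > energy_cutoff:
            break  # rows are sorted, so every remaining pose is worse
        if poses_per_ligand is not None and seen.get(ligand, 0) >= poses_per_ligand:
            continue
        seen[ligand] = seen.get(ligand, 0) + 1
        kept.append((ligand, pose, energy))
    return kept
```

With every option left at `None` the function simply returns all poses ranked by energy, which matches the "ALL compounds + ALL poses" default.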

ChemFlow_tools used to do it using python ❤️

Write Description for AffinityFlow

AffinityFlow is the one to perform MM(PB,GB)SA and probably LIE on MD trajectories.
It is a specialization of ScoreFlow, but we "market" it as a separate tool.

LigFlow/ScoreFlow search for ChemBase parametrized molecules

LigFlow/ScoreFlow should first search for a compound in ChemBase for parameters instead of recomputing all over again.

This is especially important for am1-bcc and even more for RESP which take a lot of time.

Another benefit here is to prevent recomputing the same parameters for the SAME molecule, in case a user wants to MMGBSA rescore multiple docking conformations.

Review ScoreFlow

Go through the code to simplify as much as possible, without adding or removing functionality.
Probe for bugs, and missing features, plan enhancements.

Licence

As mentioned by Simone some time ago, in due time we will need to check with the University which licence should be used.

Flexibilize MMGBSA Part 2: Post-processing.

Choice of implicit model:
igb=2,5,8 for GBSA
igb=1,3 for PBSA.
Choice of interval:
Default 10.

Input file for running GB2
&general
verbose=0,keep_files=0,interval=${INTERVAL}
/
&gb
igb=${IGB}, saltcon=0.150
/
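A small sketch of how the template above could be filled in from Python instead of shell variables. The function name is hypothetical, and the accepted `igb` values are an assumption taken from the "Enable GBSA models igb=1,2,5,8" item in this tracker.

```python
GB_MODELS = {1, 2, 5, 8}  # assumption: the GBSA igb values listed in this tracker


def mmgbsa_input(igb=2, interval=10, saltcon=0.150):
    """Return the text of an MMPBSA.py input file for a GB calculation."""
    if igb not in GB_MODELS:
        raise ValueError(f"unsupported igb={igb}; expected one of {sorted(GB_MODELS)}")
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return (
        "Input file for running GB\n"
        "&general\n"
        f"  verbose=0, keep_files=0, interval={interval},\n"
        "/\n"
        "&gb\n"
        f"  igb={igb}, saltcon={saltcon:.3f},\n"
        "/\n"
    )
```

Validating the values before the file is written catches a bad option before a long batch of post-processing jobs is launched.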

ScoreFlow - Include SEED as scoring function

Write DockFlow tutorial & files - for a simple docking

A simple docking and validation of protocol using DUD-E as decoys must be provided in the tutorial

1 - Brief DockFlow description
2 - What's the tutorial about: simple docking and validation of protocol using Scoring, ROC. (short description)

Notebook should be able to report False Positives and False negatives, Top 3,5,10 lists and ROC.

Review DockFlow

Go through the code to simplify as much as possible, without adding or removing functionality.
Probe for bugs, and missing features, plan enhancements.

Update DockFlow tutorial with theoretical background.

What's docking, False Positive and False Negatives, ROC, ligand efficiency, enrichment, this kind of things.

Write DockFlow tutorial & files - for a virtual screening

A virtual screening with multiple known ligands (maybe also including DUD-E as decoys) must be provided in the tutorial.

1 - Brief DockFlow VS mode description
2 - What the tutorial is about, given that we pretend to have no idea about the affinity of most molecules in the library.
3 - It would be useful to allow the user to include a list of "known binders"; this will basically be a repetition of the DUD-E runs. Then we conclude with a ROC and ranking curve.

Review .config

The .config file is way too complicated.
I think we can put the details in the manual and create a tool to build this .config file.

The truth is that most users won't read the manual, expecting a "command -h" or a "configure" to give/prepare them all they need.

Minor improvements for Tools

ConfigFlow : command line version
RMSDFlow : sort poses by RMSD before plotting
splitmol : make a proper distribution of compounds per MOL2 file
ChemFlow : list the workflow's / tools available in ChemFlow, with a short description and maybe show a pdf version of the doc
ReportFlow : list the different notebooks available, and based on the user's choice, a notebook is then copied to the run folder and automatically opened.

Extract MOL2 in python

Eventually we need to extract a list of molecules from a BIG mol2 file. Currently we do it with "ChemFlow_extract_mol2.f90", but since it's Fortran (arrg) code, it needs to be compiled.

Modernize this into a beautiful python script.
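A possible starting point, sketched under the assumption that every molecule starts with an `@<TRIPOS>MOLECULE` tag followed by its name on the next line. The function names are hypothetical; the generator reads line by line, so a big file never has to fit in memory.

```python
def iter_mol2_records(lines):
    """Yield (name, record_text) for each molecule in an iterable of MOL2 lines."""
    record = []
    for line in lines:
        if line.startswith("@<TRIPOS>MOLECULE") and record:
            yield _name_of(record), "".join(record)
            record = []
        record.append(line)
    if record:
        yield _name_of(record), "".join(record)


def _name_of(record):
    # The molecule name is the first line after the @<TRIPOS>MOLECULE tag.
    for index, line in enumerate(record[:-1]):
        if line.startswith("@<TRIPOS>MOLECULE"):
            return record[index + 1].strip()
    return None


def extract_mol2(lines, wanted):
    """Return the records whose molecule name is in `wanted`, in file order."""
    wanted = set(wanted)
    return [text for name, text in iter_mol2_records(lines) if name in wanted]
```

An open file handle can be passed directly as `lines`, for example `extract_mol2(open("library.mol2"), names)` with a hypothetical file name.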

Write description for MDFlow

I must incorporate it into ChemFlow, but first I must adapt it to the .config format.

SmilesToMol2

If SmilesToMol2 fails on any single "smiles", it may not generate any output at all.
Ideally it should continue and report the errors.
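The "continue and report" behaviour could look roughly like this. The sketch is independent of the actual conversion: `convert` is a hypothetical callable standing in for whatever SmilesToMol2 does for one molecule.

```python
def convert_all(smiles_by_name, convert):
    """Apply `convert` to every SMILES; collect failures instead of stopping."""
    converted, failed = {}, {}
    for name, smiles in smiles_by_name.items():
        try:
            converted[name] = convert(smiles)
        except Exception as error:  # report any failure and move on to the next compound
            failed[name] = f"{type(error).__name__}: {error}"
    return converted, failed
```

The `failed` dictionary can then be written out as an error report at the end of the run, while all successful conversions are still saved.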

ChemFlow executable

TODO :
A simple tool to :

list the workflow's / tools available in ChemFlow, with a short description
and maybe show a pdf version of the doc

Merge HGFlow into ChemFlow

HGFlow came from two separate projects. Paulina and I had different do-it-all scripts, but we rewrote them together into a single, more powerful one.

Naturally, HGFlow example files will come from the SAMPL4, 5 and 6 challenges and we should keep updating them.

In the NEAR future HGFlow outputs will be part of the HostGuest database (HGBase), our website with ALL well-standardized free energy runs and results.

It's a known issue.
rdkit/rdkit#1617

Add example folder for ScoreFlow

Create a simple folder with docking poses of known results to run ScoreFlow.