Machine learning meets pKa

Prerequisites

The Python dependencies are:

  • Python >= 3.7
  • NumPy >= 1.18
  • Scikit-Learn >= 0.22
  • RDKit >= 2019.09.3
  • Pandas >= 0.25
  • XGBoost >= 0.90
  • JupyterLab >= 1.2
  • Matplotlib >= 3.1
  • Seaborn >= 0.9
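
As a rough way to verify an existing environment against the list above, the following sketch compares installed versions with the stated minimums. The helper names `version_tuple` and `check_dependencies` are made up for this illustration and are not part of the repository. It assumes each package exposes a `__version__` attribute, and it leaves out JupyterLab and the Python interpreter itself for brevity.

```python
import importlib
import re

# Minimum versions copied from the list above; keys are the import names
# (Scikit-Learn is imported as "sklearn").
MINIMUMS = {
    "numpy": "1.18",
    "sklearn": "0.22",
    "rdkit": "2019.09.3",
    "pandas": "0.25",
    "xgboost": "0.90",
    "matplotlib": "3.1",
    "seaborn": "0.9",
}


def version_tuple(version):
    """Turn '2019.09.3' into (2019, 9, 3); stop at the first part without digits."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_dependencies(minimums=MINIMUMS):
    """Return {import_name: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            problems[name] = "not installed"
            continue
        found = getattr(module, "__version__", None)
        if found is None:
            problems[name] = "version unknown"
        elif version_tuple(found) < version_tuple(minimum):
            problems[name] = f"found {found}, need >= {minimum}"
    return problems
```

Calling `check_dependencies()` inside the activated environment returns an empty dictionary when nothing is missing or outdated.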

ChemAxon Marvin[1] is required for the data preparation pipeline, but not for using the prediction model with the included Python script. By default, OpenEye QUACPAC/Tautomers[2] is used for tautomer and charge standardization. To use RDKit instead, pass the --no-openeye flag to the run_pipeline.sh script and to the train_model.py and predict_sdf.py scripts.

Of course, you also need the code from this repository folder.

Installing

First of all you need a working Miniconda/Anaconda installation. You can get Miniconda at https://conda.io/en/latest/miniconda.html.

Now you can create an environment named "ml_pka" with all needed dependencies and activate it with:

conda env create -f environment.yml
conda activate ml_pka

You can also create a new environment by yourself and install all dependencies without the environment.yml file:

conda create -n ml_pka -c conda-forge python=3.10
conda activate ml_pka
conda install -c conda-forge numpy pandas scikit-learn rdkit xgboost jupyterlab matplotlib seaborn

Usage

Preparation pipeline

To use the data preparation pipeline, you have to be in the repository folder and your conda environment has to be activated. Additionally, the Marvin command-line tool cxcalc and, unless you use the --no-openeye flag, the QUACPAC command-line tool tautomers have to be on your PATH.

The environment variable JAVA_HOME (pointing to the Java installation folder, which cxcalc needs) also has to be set, as does OE_LICENSE (containing the path to your OpenEye license file) if you use OpenEye.
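
The requirements above can be checked before starting the pipeline. The following is a minimal sketch under stated assumptions: `missing_requirements` is a hypothetical helper, not part of the repository, and the `environ` and `which` parameters exist only so the check can be tried without touching the real shell.

```python
import os
import shutil


def missing_requirements(use_openeye=True, environ=None, which=shutil.which):
    """List what the preparation pipeline would be missing in this shell.

    use_openeye=False mirrors a run with the --no-openeye flag.
    """
    environ = os.environ if environ is None else environ
    tools = ["cxcalc"]          # Marvin command-line tool
    variables = ["JAVA_HOME"]   # Java installation folder, needed by cxcalc
    if use_openeye:
        tools.append("tautomers")       # QUACPAC command-line tool
        variables.append("OE_LICENSE")  # path to the OpenEye license file

    missing = [f"{tool} not found on PATH" for tool in tools if which(tool) is None]
    missing += [f"{var} is not set" for var in variables if not environ.get(var)]
    return missing
```

An empty list from `missing_requirements()` means the tools and variables named above were found. It does not verify that the license file or the Java installation is valid.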

Once these requirements are met, you can display brief usage information with bash run_pipeline.sh -h. Example call:

bash run_pipeline.sh --train datasets/chembl25.sdf --test datasets/AvLiLuMoVe.sdf

Prediction tool

First of all, you have to be in the repository folder and your conda environment has to be activated. To use the prediction tool, you first have to retrain the machine learning model. To do so, call the training script; it trains the 5-fold cross-validated Random Forest model using 12 CPU cores by default. To adjust the number of cores, use the --num-processes parameter. To use the dataset that was prepared without QUACPAC/Tautomers, use the --no-openeye flag. Example call:

python train_model.py

If you used QUACPAC/Tautomers for dataset preparation, it also has to be available for the prediction tool, as described in the section above. Otherwise, you have to use the --no-openeye flag for the prediction tool as well.

Now you can call the Python script with an SDF file and an output path:

python predict_sdf.py my_test_file.sdf my_output_file.sdf
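
Because training and prediction must agree on whether QUACPAC/Tautomers is used, it can help to derive both calls from one setting. The sketch below only builds argument lists for the two scripts named above. The helper names are hypothetical, and two things are assumptions rather than documented behavior: that --num-processes takes the core count as its next argument, and that --no-openeye may be placed before the file arguments.

```python
import subprocess
import sys


def train_command(num_processes=None, use_openeye=True):
    """Build the argument list for train_model.py."""
    cmd = [sys.executable, "train_model.py"]
    if num_processes is not None:
        # Assumption: the flag is followed by the number of cores.
        cmd += ["--num-processes", str(num_processes)]
    if not use_openeye:
        cmd.append("--no-openeye")
    return cmd


def predict_command(input_sdf, output_sdf, use_openeye=True):
    """Build the argument list for predict_sdf.py."""
    cmd = [sys.executable, "predict_sdf.py"]
    if not use_openeye:
        # Assumption: the flag is accepted before the positional arguments.
        cmd.append("--no-openeye")
    cmd += [input_sdf, output_sdf]
    return cmd


def train_and_predict(input_sdf, output_sdf, num_processes=None, use_openeye=True):
    """Run both steps from the repository folder with one shared OpenEye setting."""
    subprocess.run(train_command(num_processes, use_openeye), check=True)
    subprocess.run(predict_command(input_sdf, output_sdf, use_openeye), check=True)
```

For example, `train_and_predict("my_test_file.sdf", "my_output_file.sdf", num_processes=4, use_openeye=False)` would train and predict without OpenEye on four cores, provided the assumptions above hold.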

NOTE: This model was built for monoprotic structures in a pH range of 2 to 12. If the model is used with multiprotic structures, the predicted values will probably not be correct.

Datasets

  1. AvLiLuMoVe.sdf - Manually combined literature pKa data[3]
  2. chembl25.sdf - Experimental pKa data extracted from ChEMBL25[4]
  3. datawarrior.sdf - pKa data shipped with DataWarrior[5]
  4. combined_training_datasets_unique.sdf - Preprocessed and combined data from datasets (2) and (3), used as training dataset and prepared with QUACPAC/Tautomers[2]
  5. combined_training_datasets_unique_no_oe.sdf - Preprocessed and combined data from datasets (2) and (3), prepared with RDKit MolVS instead of QUACPAC/Tautomers[2]
  6. AvLiLuMoVe_cleaned_mono_unique_notraindata.sdf - Preprocessed data from dataset (1), used as external test set
  7. novartis_cleaned_mono_unique_notraindata.sdf - Preprocessed data from an in-house dataset provided by Novartis[6], used as external test set

Authors

Marcel Baltruschat - GitHub, E-Mail
Paul Czodrowski - GitHub, E-Mail

License

This project and its software are licensed under the MIT License - see the LICENSE.md file for details.

The datasets are licensed under the CC BY 4.0 license - see the datasets/LICENSE.txt file for details.

Citation

If you use this code or the datasets in your research, please cite the following publications:

  • Code: DOI
  • Datasets: DOI

References

[1] Marvin 20.1.0, 2020, ChemAxon, http://www.chemaxon.com
[2] QUACPAC 2.0.2.2: OpenEye Scientific Software, Santa Fe, NM. http://www.eyesopen.com
[3] Settimo, L., Bellman, K. & Knegtel, R.M.A. Pharm Res (2014) 31: 1082. https://doi.org/10.1007/s11095-013-1232-z
[4] Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017) 'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954.
[5] Thomas Sander, Joel Freyss, Modest von Korff, Christian Rufener. DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis. J Chem Inf Model 2015, 55, 460-473, doi 10.1021/ci500588j
[6] Richard A. Lewis, Stephane Rodde, Novartis Pharma AG, Basel, Switzerland

machine-learning-meets-pka's Issues

ValueError: Input X contains NaN.

hi,

I tried to run the notebook, modeling.ipynb, and the previous cell units ran smoothly, but when it came to this section

est_cls = RandomForestRegressor
rf_params = dict(n_estimators=1000, n_jobs=est_jobs, verbose=verbose, random_state=seed)
name = 'RandomForest (n_estimators=1000)'

train_all_sets(est_cls, rf_params, name)

the error info:

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Here the input is made of null values, but I see that none of the previous values have null values inside them. Why is there a null value here, or am I setting the parameter wrong here?

many thanks for your help

best,

Sh-Y
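
A debugging sketch that may help narrow this down; it is not a diagnosis of the error above. It lists which rows and columns of a feature table contain NaN, so the affected descriptors or molecules can be inspected before the table reaches the estimator. The function names are hypothetical, and the table is assumed to be a 2-D sequence of numbers.

```python
import math


def find_nan_cells(rows):
    """Return (row_index, column_index) for every NaN in a 2-D table of numbers."""
    return [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if math.isnan(value)
    ]


def nan_summary(rows):
    """Collect the distinct row and column indices that contain at least one NaN."""
    cells = find_nan_cells(rows)
    return {
        "rows": sorted({i for i, _ in cells}),
        "columns": sorted({j for _, j in cells}),
    }
```

A 2-D NumPy array can be passed directly. For a pandas DataFrame, pass `df.to_numpy()` instead, since iterating over a DataFrame yields column labels rather than rows.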

make data available under CCZero?

MIT is a software license. For data, a data license is better (data copyright and software copyright laws are often different). May I ask you to consider making a citable Zenodo or Figshare archive of the data (novartis_cleaned_mono_unique_notraindata.sdf etc) under a CCZero license (which is quite like the MIT license but then for data)?

No pre-trained model

Hi,

I have setup the conda environment, but when I run:

python predict_sdf.py pka_ligands.sdf pka_ligands_pred.sdf
Loading SDF...
2 molecules loaded
Loading model...
Traceback (most recent call last):
  File "predict_sdf.py", line 41, in <module>
    with open('RF_CV_FMorgan3_pKa.pkl', 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'RF_CV_FMorgan3_pKa.pkl'

Would you be able to share the pretrained model?

I understand if it has confidential data, this might not be possible, but I thought I would ask. :)
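
The usage section above states that the model has to be retrained before the prediction tool can be used, which would explain the missing file. Below is a minimal sketch of a loader that fails with a hint instead of a bare traceback. `load_model` is a hypothetical helper, and the assumption that train_model.py writes this exact file name is inferred from the traceback, not confirmed.

```python
import pickle
from pathlib import Path

MODEL_FILE = "RF_CV_FMorgan3_pKa.pkl"  # file name taken from the traceback above


def load_model(path=MODEL_FILE):
    """Unpickle the model, or fail with a hint if the file does not exist yet."""
    model_path = Path(path)
    if not model_path.is_file():
        raise FileNotFoundError(
            f"{model_path} not found. Assumption: running 'python train_model.py' "
            "in the repository folder creates it."
        )
    with model_path.open("rb") as handle:
        # Only unpickle files you created yourself; pickle can execute code.
        return pickle.load(handle)
```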
