deepscreen's Introduction

DEEPScreen: Virtual Screening with Deep Convolutional Neural Networks Using Compound Images

  • Important notice: This is the new version of DEEPScreen, developed using the PyTorch framework. Please go to the master branch of this repository to reach the old version and the other information presented in the paper. We advise using this new version of DEEPScreen to train target-specific models. Please note that this version is planned to be improved further by adding more functionality.
  • DEEPScreen is a large-scale drug-target interaction (DTI) prediction system for early-stage drug discovery, using deep convolutional neural networks.
  • One of the main advantages of DEEPScreen is that it employs readily available 2-D structural representations of compounds at the input level instead of conventional descriptors, which display limited performance.
  • DEEPScreen learns complex features inherently from the 2-D representations, thus producing highly accurate predictions.
  • More information can be obtained from the DEEPScreen journal article.

Installation

DEEPScreen is a command-line prediction tool written in Python 3.7.1. DEEPScreen was developed and tested on macOS, but it should run on any Unix-like operating system. Please run the commands below to install the requirements for model training and testing. Dependencies are listed in the requirements.txt file, which is located under the bin directory.

conda create -n deepscreen_env python=3.7
source activate deepscreen_env
pip install -r requirements.txt

Descriptions of folders and files in the DEEPScreen repository

  • bin folder includes the source code of DEEPScreen.

  • training_files folder includes the files directly used in the training and testing of the system:

    • chembl27_preprocessed_filtered_bioactivity_dataset.tsv.zip: updated version of the preprocessed and filtered ChEMBL dataset, containing drug/compound-target interactions from the ChEMBL database (v27) after the application of multiple filtering operations to obtain a clean training set,

    • chembl27_training_target_list.txt: list of target ChEMBL ids,

    • target_training_datasets contains a folder (e.g. CHEMBL286) for each target where each target folder contains

      • a json file named train_val_test_dict.json which includes train/validation/test compound ids,
      • a folder named imgs which holds images of compounds.
    • chembl27_preprocessed_filtered_act_inact_comps_10.0_20.0_blast_comp_0.2.txt contains the active and inactive compound information for each target protein in ChEMBL, after the similarity-based negative training dataset enrichment process. In this file, there are two lines for each target, in the following format:

      CHEMBL286_act	CHEMBL1818056,CHEMBL2115367,CHEMBL344651,CHEMBL62054, ...
      CHEMBL286_inact	CHEMBL288434,CHEMBL584926,CHEMBL406111,CHEMBL151055, ...
      

      Each line holds the comma-separated list of active or inactive compounds (i.e., the second tab-separated column: CHEMBL1818056,C...) for the corresponding target (i.e., the first column: CHEMBL286_act),

    • chembl27_uniprot_mapping.txt contains the id mapping between UniProt accessions and ChEMBL ids for proteins, in tab-separated format (Target UniProt accession, Target ChEMBL id, Target protein name and Target type),

  • result_files folder contains results of various tests/analyses:

  • 2-D images of:

    • 409,311 ChEMBL compounds in the train/validation/test datasets of 812 target proteins of DEEPScreen can be downloaded from here
    • all compounds (~2M) in ChEMBL v27 can be downloaded from here
    • all drugs (~11K) in DrugBank v5.1.7 can be downloaded from here
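
The two-lines-per-target layout described above for chembl27_preprocessed_filtered_act_inact_comps_10.0_20.0_blast_comp_0.2.txt can be read with a few lines of Python. This is only a minimal sketch based on the format shown in this README; the function name parse_act_inact_lines is hypothetical and is not part of the DEEPScreen source code.

```python
from typing import Dict, Iterable, List


def parse_act_inact_lines(lines: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group compound ids by target id and by label ("act" or "inact").

    Hypothetical helper: it assumes each line looks like
    "<target id>_<act|inact><TAB><comma-separated compound ids>".
    """
    targets: Dict[str, Dict[str, List[str]]] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # First column is the key, second column is the compound list.
        key, _, compounds = line.partition("\t")
        # Split "CHEMBL286_act" into "CHEMBL286" and "act".
        target_id, _, label = key.rpartition("_")
        ids = [c.strip() for c in compounds.split(",") if c.strip()]
        targets.setdefault(target_id, {"act": [], "inact": []})[label] = ids
    return targets


# Shortened sample lines taken from the example in this README.
sample = [
    "CHEMBL286_act\tCHEMBL1818056,CHEMBL2115367,CHEMBL344651,CHEMBL62054",
    "CHEMBL286_inact\tCHEMBL288434,CHEMBL584926,CHEMBL406111,CHEMBL151055",
]
parsed = parse_act_inact_lines(sample)
print(len(parsed["CHEMBL286"]["act"]), len(parsed["CHEMBL286"]["inact"]))  # 4 4
```

To read the real file, pass an open file handle instead of the sample list, for example parse_act_inact_lines(open(path)), where path points to the unzipped text file.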

How to train DEEPScreen models and get performance results

  • Clone the Git Repository

  • Download the compressed file for the target that you want to train here

  • Locate the zipped target file under training_files/target_training_datasets and unzip it

  • Run the main_training.py script as shown below

Explanation of Parameters

  • --targetid: Target to be trained (default: CHEMBL286)

  • --model: CNN architecture to be used (default: CNNModel1)

  • --fc1: number of neurons in the first fully-connected layer (default: 512)

  • --fc2: number of neurons in the second fully-connected layer (default: 256)

  • --lr: learning rate (default: 0.001)

  • --bs: batch size (default: 32)

  • --dropout: dropout rate (default: 0.1)

  • --epoch: number of epochs (default: 200)

  • --en: the name of the experiment (default: my_experiment)

To perform training for a target (CHEMBL286 in the below example):

python main_training.py --targetid CHEMBL286 --model CNNModel1 --fc1 256 --fc2 128 --lr 0.01 --bs 64 --dropout 0.25 --epoch 100 --en my_chembl286_training
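
For readers who want to script or wrap the training step, the parameters and defaults listed above can be mirrored with argparse. This is a sketch of how such a command-line interface could be declared, built only from the parameter list in this README; it is not the actual contents of main_training.py, and the function name build_parser is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with their documented defaults (sketch only)."""
    parser = argparse.ArgumentParser(description="DEEPScreen training arguments (sketch)")
    parser.add_argument("--targetid", type=str, default="CHEMBL286", help="target to be trained")
    parser.add_argument("--model", type=str, default="CNNModel1", help="CNN architecture")
    parser.add_argument("--fc1", type=int, default=512, help="neurons in the first fully-connected layer")
    parser.add_argument("--fc2", type=int, default=256, help="neurons in the second fully-connected layer")
    parser.add_argument("--lr", type=float, default=0.001, help="learning rate")
    parser.add_argument("--bs", type=int, default=32, help="batch size")
    parser.add_argument("--dropout", type=float, default=0.1, help="dropout rate")
    parser.add_argument("--epoch", type=int, default=200, help="number of epochs")
    parser.add_argument("--en", type=str, default="my_experiment", help="experiment name")
    return parser


# Parse the same arguments as the example command above.
example = (
    "--targetid CHEMBL286 --model CNNModel1 --fc1 256 --fc2 128 "
    "--lr 0.01 --bs 64 --dropout 0.25 --epoch 100 --en my_chembl286_training"
)
args = build_parser().parse_args(example.split())
print(args.fc1, args.lr, args.en)  # 256 0.01 my_chembl286_training
```

Any option left out of the command falls back to the default given in the parameter list.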

Output of the scripts

main_training.py creates a folder named <experiment_name> (given as the --en argument) under the result_files/experiments folder. Two files are created under result_files/experiments/<experiment_name>:

  • best_val_test_predictions-<hyperparameters_separated_by_dash>-<experiment_name>.txt contains the predictions for the independent test dataset.
  • best_val_test_performance_results-<hyperparameters_separated_by_dash>-<experiment_name>.txt contains the best test performance results. Sample output files for the CHEMBL286 target are given under result_files/experiments/my_chembl286_training.

Article

If you use DEEPScreen please consider citing:

Rifaioglu, A. S., Nalbat, E., Atalay, V., Martin, M. J., Cetin-Atalay, R., & Doğan, T. (2020). DEEPScreen: high performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chemical Science, 11(9), 2531-2557.

License

DEEPScreen Copyright (C) 2020 CanSyL

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

deepscreen's People

Contributors

ahmetrifaioglu, dependabot[bot], tuncadogan

deepscreen's Issues

test_threshold problem and zip file problem

Hello dear tuncadogan,

deepscreen_models_hyperparameters_performance_results.tsv does not have a column called 'test threshold', which the program needs when predicting DTIs. Could you please tell me exactly what it means and how I can provide a valid value for it?

Some zip files are damaged and I cannot open them. How can I use them? (These files are needed when training the DEEPScreen system.)

Thank you very much.

Dataset in the Code

Hello there,

Thanks for sharing such a nice idea and the code. It is motivating!

Well, I am just beginning to reconstruct your code and have encountered an issue. Please correct me if I am wrong. According to the README, the file named 'chembl27_preprocessed_filtered_act_inact_comps_10.0_20.0_blast_comp_0.2.txt' should be the training dataset that you obtained by filtering the ChEMBL v23 data (about 15M data points), right?

So, I expected the number of data points in the file to be 769,935, matching the number in the paper, but I found 2,292,989 target-ligand pairs in the file, which is nearly three times larger. Did you update the file and augment the data, or do I have to do some data processing in order to get 769,935 pairs? I am a little confused.

I'd appreciate it if you could help me with this.

Thanks

Error for running

Hello, I'm interested in your work, so I tried to use the provided training model.
However, I got an error message at the last epoch.

Epoch :99
Training mode: True
Epoch 99 training loss: 3.122581034898758
There was a problem during training performance calculation!
Validation mode: True
There was a problem during validation performance calculation!
There was a problem during test performance calculation!
Traceback (most recent call last):
  File "./bin/main_training.py", line 69, in <module>
    args.dropout, args.epoch, args.en)
  File "/home/njgoo/Data1/program/DEEPScreen/bin/train_deepscreen.py", line 184, in train_validation_test_training
    best_val_test_result_fl.write("Test {}:\t{}\n".format(scr, best_test_performance_dict[scr]))
UnboundLocalError: local variable 'best_test_performance_dict' referenced before assignment

Thanks for your reply!
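
The UnboundLocalError in the traceback above is the error Python raises when a local variable is read before any code path has assigned it. The sketch below reproduces that general pattern under one assumption that has not been checked against train_deepscreen.py: the variable is assigned only when an epoch yields usable performance results, and no epoch did. The function names report_unsafe and report_safe are hypothetical.

```python
def report_unsafe(epoch_results):
    """Assign `best` only when an epoch produced a usable result."""
    for result in epoch_results:
        if result is not None:
            best = result
    # If every epoch failed, `best` was never assigned and this line raises.
    return best


def report_safe(epoch_results):
    """Same logic, but the name always exists and failure is reported clearly."""
    best = None  # initialise before the loop
    for result in epoch_results:
        if result is not None:
            best = result
    if best is None:
        raise RuntimeError("no epoch produced performance results")
    return best


try:
    report_unsafe([None, None])  # every epoch "failed"
except UnboundLocalError as err:
    print("unsafe:", type(err).__name__)  # unsafe: UnboundLocalError

try:
    report_safe([None, None])
except RuntimeError as err:
    print("safe:", err)  # safe: no epoch produced performance results
```

If this pattern applies, the final error is a symptom, and the earlier "There was a problem during ... performance calculation!" messages point to the underlying failure.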

How to define “bioactivity values” ?

“we constructed positive (active) and negative (inactive) training datasets as follows: for each target, compounds with bioactivity values ≤10 μM were selected as positive training samples...”
Could you please explain how "bioactivity values" are defined?
Looking forward to your reply!

ValueError: cannot reshape array of size 13797420 into shape (200,200,1)

I am attempting to reproduce the results in your paper and then train models on my own dataset, but several models failed to train, saying "ValueError: cannot reshape array"

Any idea on how to fix this??

Traceback (most recent call last): 
  File "trainDEEPScreenDUDE.py", line 226, in <module>
    trainModelTarget(model_name, trgt, optim, learning_rate, n_epoch, n_of_h1, n_of_h2, dropout_keep_rate, rotate,save_model)
  File "trainDEEPScreenDUDE.py", line 51, in trainModelTarget
    X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
ValueError: cannot reshape array of size 13797420 into shape (200,200,1)
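
The arithmetic behind this error can be checked directly: a reshape to (-1, 200, 200, 1) only succeeds when the total number of values is a multiple of 200 × 200 × 1. The sketch below uses the numbers from the error message; the helper name whole_images is hypothetical. A non-zero remainder would be consistent with at least one loaded image not having the expected size or channel count, although the traceback alone does not confirm the cause.

```python
IMG_SIZE = 200  # taken from the target shape in the error message


def whole_images(total_values: int, img_size: int, channels: int = 1):
    """Return (number of complete images, leftover values) for a flat array."""
    return divmod(total_values, img_size * img_size * channels)


count, leftover = whole_images(13797420, IMG_SIZE)
print(count, leftover)  # 344 37420

# The reshape can only work when nothing is left over.
print("reshape possible:", leftover == 0)  # reshape possible: False
```

Printing the shape of each image array before stacking them is one way to find the entries that differ from the expected size.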

Possible overfitting on the test set

Hello,

I was going over the code and noticed something strange in train_deepscreen.py. More specifically, I believe there is a problem in line 172.

The code basically checks for every training epoch the performance on the validation and test sets and keeps the epoch with the highest Matthews correlation coefficient. The final performance printed by the model is the best possible test set performance, which suggests that the model overfits the test set.

I am wondering about the rationale behind the choice, so I would appreciate it if you could share more info.

Best,
Dimitrios
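
The concern raised above can be illustrated with a small sketch that compares two ways of choosing the reported epoch. The per-epoch scores are invented for illustration and are not DEEPScreen results, and the function names are hypothetical; the sketch does not reproduce the code in train_deepscreen.py.

```python
# Hypothetical (validation MCC, test MCC) pairs, one per epoch.
history = [(0.41, 0.38), (0.55, 0.47), (0.52, 0.58), (0.60, 0.50)]


def select_by_validation(history):
    """Pick the epoch with the best validation MCC, then report its test MCC."""
    best_epoch = max(range(len(history)), key=lambda i: history[i][0])
    return best_epoch, history[best_epoch][1]


def select_by_test(history):
    """Pick the epoch with the best test MCC (the test set guides the choice)."""
    best_epoch = max(range(len(history)), key=lambda i: history[i][1])
    return best_epoch, history[best_epoch][1]


print(select_by_validation(history))  # (3, 0.5)
print(select_by_test(history))        # (2, 0.58)
```

In this toy example, selecting on the test score reports 0.58 while selecting on the validation score reports 0.50. The gap shows why a test score that was also used for selection tends to be optimistic.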

How do I use a pre-trained model to generate predictions?

I'm new to machine learning and am trying to use DEEPScreen to generate predictions for some new molecules. I want to use a pre-trained model and don't want to train it each time. How would you recommend I do it? I'm also unsure about how to read the input images. They're in a directory.

thank you!

Receiving "There was a problem during..." Error

Hi, I was trying out the steps listed in the README.md. But I realised the main_training.py is in the BIN folder. And if I enter the folder and run it, I get both the "There was a problem during..." error messages. Must I move the scripts out of the BIN folder for the command to work?

DEEPScreen Supporting Data for Output/results

DEEPScreen outputs results as active or inactive.
Is binding affinity data included in it? Also, will the accuracy of the results be lower if a 2D image is used rather than a 3D conformation image or SMILES?
Is there a way to run a virtual docking prediction as well, giving the binding affinity energy, the binding site and the size of the predicted binding site?

training model _ another species

Hi,
If I have drug-target interactions for other organisms (not human), is it possible to run the model training?
Or is it specific to human?

thank you!
