Code Monkey home page Code Monkey logo

reinvent-scaffold-decorator's Introduction

Implementation of the SMILES-based scaffold decorator used in "SMILES-based deep generative scaffold decorator for de-novo drug design"

This repository holds all the code used to create, train and sample a SMILES-based scaffold decorator described in SMILES-based deep generative scaffold decorator for de-novo drug design. Additionally, it contains the code for pre-processing the training set, as explained in the manuscript.

The scripts and folders are the following:

  1. Python files in the main folder are all scripts. Run them with -h for usage information.
  2. ./training_sets folder: The two molecular sets used in the manuscript, already filtered and the data augmented versions.
  3. ./trained_models folder: One already trained model for both the DRD2 and ChEMBL models.

Requirements

The repository includes a Conda environment.yml file with the required libraries to run all the scripts. In some scripts Spark 2.4.3 is required and by default should run in local mode without any issues. For more complex set-ups, please refer to the Spark documentation. All models were tested on Linux with both a Tesla V-100 and a Geforce 2070. It should work just fine with other Linux setups and a mid-high range GPU.

Install

A Conda environment.yml is supplied with all the required libraries.

$> git clone <repo url>
$> cd <repo folder>
$> conda env create -f environment.yml
$> conda activate reinvent-scaffold-decorator
(reinvent-scaffold-decorator) $> ...

From here the general usage applies.

General Usage

Several tools are supplied and are available as scripts in the main folder. Further information about the tool's arguments, please run it with -h. All output files are in tsv format (the separator is \t).

Preprocessing the training sets

Any arbitrary molecular set has to be pre-processed before being used as a training set for a decorator model. This process is done in two steps:

  1. Slice (slice_db.py): This script accepts as input a SMILES file and it exhaustively slices given a set of slicing rules (Hussain-Rea, RECAP), a maximum number of attachment points and a set of conditions (see conditions.json.example for more information). Rules can be easily extended and new sets of conditions added. This script can output a SMILES file with (scaffold, dec1;dec2;...) which can be used in the next step and/or a parquet file with additional information.
  2. Create randomized SMILES (create_randomized_smiles.py): From the SMILES output of the first file, several randomized SMILES representations of the training set must be generated. Depending whether a single-step or a multi-step decorator model is wanted, different files are generated.

Notice that both scripts use Spark.

Creating, training and sampling decorator models

This code enables training of both single-step and multi-step decorator models. The training process is exactly the same, only changing the training set pre-processing.

  1. Create Model (create_model.py): Creates a blank model file.
  2. Train Model (train_model.py): Trains the model with the specified parameters. Tensorboard data may be generated.
  3. Sample Model (sample_from_model.py): Samples an already trained model for a given number of decorations given a list of scaffolds. It can also retrieve the log-likelihood in the process. Notice that this is not the preferred way of sampling the model (see below).
  4. Calculate NLL (calculate_nlls.py): Requires as input a TSV file with scaffolds and decorations and outputs the same list with the NLL calculated for each one.

Exhaustively decorating scaffolds

A special script (sample_scaffolds.py) to exhaustively generate a large number of decorations is supplied. This scripts can be used with both single-step and multi-step models. Notice that this script requires both a GPU and Spark. It works the following way:

  1. A SMILES file with the scaffolds with the attachment points numbered ([*] -> [*:N], where N is a number from 0 to the number of attachment points.
  2. Generates at most r randomized SMILES of each scaffold (-r to change how many SMILES are generated at each round).
  3. Samples n times each randomized SMILES generated in the previous step (-nto change the value).
  4. Joins the scaffolds with the generated decorations and removes duplicates/invalids.
  5. In the case of the single-step, nothing more is necessary, but in the multi-step model, a loop starting at step 1 is repeated until everything is fully decorated.
  6. Everything is written down in a parquet file for further analysis.

CAUTION: Large n and rparameters should be used for the single-step decorator model (for instance r=2048 and n=4096). In the case of the multi-step model, very low values should be used instead (e.g. r=32 and n=64).

Usage examples

Create the DRD2 dataset as described in the manuscript.

(reinvent-scaffold-decorator) $> mkdir -p drd2_decorator/models
(reinvent-scaffold-decorator) $> ./slice_db.py -i training_sets/drd2.excapedb.smi.gz -u drd2_decorator/excape.drd2.hr.smi -s hr -f conditions.json.example
(reinvent-scaffold-decorator) $> ./create_randomized_smiles.py -i drd2_decorator/drd2.hr.smi -o drd2_decorator/training -n 50 -d multi

Train the DRD2 model using the training set created before.

(reinvent-scaffold-decorator) $> ./create_model.py -i drd2_decorator/training/001.smi -o drd2_decorator/models/model.empty -d 0.2
(reinvent-scaffold-decorator) $> ./train_model.py -i drd2_decorator/models/model.empty -o drd2_decorator/models/model.trained -s drd2_decorator/training -e 50 -b 64 

Sample one scaffold exhaustively.

(reinvent-scaffold-decorator) $> echo "[*:0]C1CCCCC1[*:1]" > scaffold.smi
(reinvent-scaffold-decorator) $> ./sample_scaffolds.py -m drd2_decorator/models/model.trained.50 -i scaffold.smi -o generated_molecules.parquet -r 32 -n 32 -d multi

Notice: To change it to a single-step model, the -d single option must be used in all cases where -d multi appears.

Bugs, errors, improvements, suggestions, etc.

The software was thoroughly tested, although bugs may appear. Don't hesitate to contact us if you find any, or even better, send a pull request or open an issue. For other inquiries, please send an email to [email protected] and we will be happy to answer you ๐Ÿ˜„.

reinvent-scaffold-decorator's People

Contributors

undeadpixel avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.