Implementation of the SMILES-based scaffold decorator used in "SMILES-based deep generative scaffold decorator for de-novo drug design"

This repository holds all the code used to create, train and sample a SMILES-based scaffold decorator described in SMILES-based deep generative scaffold decorator for de-novo drug design. Additionally, it contains the code for pre-processing the training set, as explained in the manuscript.

The scripts and folders are the following:

Python files in the main folder are all scripts. Run them with -h for usage information.
./training_sets folder: The two molecular sets used in the manuscript, already filtered and the data augmented versions.
./trained_models folder: One already trained model for both the DRD2 and ChEMBL models.

Requirements

The repository includes a Conda environment.yml file with the required libraries to run all the scripts. In some scripts Spark 2.4.3 is required and by default should run in local mode without any issues. For more complex set-ups, please refer to the Spark documentation. All models were tested on Linux with both a Tesla V-100 and a Geforce 2070. It should work just fine with other Linux setups and a mid-high range GPU.

Install

A Conda environment.yml is supplied with all the required libraries.

$> git clone <repo url>
$> cd <repo folder>
$> conda env create -f environment.yml
$> conda activate reinvent-scaffold-decorator
(reinvent-scaffold-decorator) $> ...

From here the general usage applies.

General Usage

Several tools are supplied and are available as scripts in the main folder. Further information about the tool's arguments, please run it with -h. All output files are in tsv format (the separator is \t).

Preprocessing the training sets

Any arbitrary molecular set has to be pre-processed before being used as a training set for a decorator model. This process is done in two steps:

Slice (slice_db.py): This script accepts as input a SMILES file and it exhaustively slices given a set of slicing rules (Hussain-Rea, RECAP), a maximum number of attachment points and a set of conditions (see conditions.json.example for more information). Rules can be easily extended and new sets of conditions added. This script can output a SMILES file with (scaffold, dec1;dec2;...) which can be used in the next step and/or a parquet file with additional information.
Create randomized SMILES (create_randomized_smiles.py): From the SMILES output of the first file, several randomized SMILES representations of the training set must be generated. Depending whether a single-step or a multi-step decorator model is wanted, different files are generated.

Notice that both scripts use Spark.

Creating, training and sampling decorator models

This code enables training of both single-step and multi-step decorator models. The training process is exactly the same, only changing the training set pre-processing.

Create Model (create_model.py): Creates a blank model file.
Train Model (train_model.py): Trains the model with the specified parameters. Tensorboard data may be generated.
Sample Model (sample_from_model.py): Samples an already trained model for a given number of decorations given a list of scaffolds. It can also retrieve the log-likelihood in the process. Notice that this is not the preferred way of sampling the model (see below).
Calculate NLL (calculate_nlls.py): Requires as input a TSV file with scaffolds and decorations and outputs the same list with the NLL calculated for each one.

Exhaustively decorating scaffolds

A special script (sample_scaffolds.py) to exhaustively generate a large number of decorations is supplied. This scripts can be used with both single-step and multi-step models. Notice that this script requires both a GPU and Spark. It works the following way:

A SMILES file with the scaffolds with the attachment points numbered ([*] -> [*:N], where N is a number from 0 to the number of attachment points.
Generates at most r randomized SMILES of each scaffold (-r to change how many SMILES are generated at each round).
Samples n times each randomized SMILES generated in the previous step (-nto change the value).
Joins the scaffolds with the generated decorations and removes duplicates/invalids.
In the case of the single-step, nothing more is necessary, but in the multi-step model, a loop starting at step 1 is repeated until everything is fully decorated.
Everything is written down in a parquet file for further analysis.

CAUTION: Large n and rparameters should be used for the single-step decorator model (for instance r=2048 and n=4096). In the case of the multi-step model, very low values should be used instead (e.g. r=32 and n=64).

Usage examples

Create the DRD2 dataset as described in the manuscript.

(reinvent-scaffold-decorator) $> mkdir -p drd2_decorator/models
(reinvent-scaffold-decorator) $> ./slice_db.py -i training_sets/drd2.excapedb.smi.gz -u drd2_decorator/excape.drd2.hr.smi -s hr -f conditions.json.example
(reinvent-scaffold-decorator) $> ./create_randomized_smiles.py -i drd2_decorator/drd2.hr.smi -o drd2_decorator/training -n 50 -d multi

Train the DRD2 model using the training set created before.

(reinvent-scaffold-decorator) $> ./create_model.py -i drd2_decorator/training/001.smi -o drd2_decorator/models/model.empty -d 0.2
(reinvent-scaffold-decorator) $> ./train_model.py -i drd2_decorator/models/model.empty -o drd2_decorator/models/model.trained -s drd2_decorator/training -e 50 -b 64

Sample one scaffold exhaustively.

(reinvent-scaffold-decorator) $> echo "[*:0]C1CCCCC1[*:1]" > scaffold.smi
(reinvent-scaffold-decorator) $> ./sample_scaffolds.py -m drd2_decorator/models/model.trained.50 -i scaffold.smi -o generated_molecules.parquet -r 32 -n 32 -d multi

Notice: To change it to a single-step model, the -d single option must be used in all cases where -d multi appears.

Bugs, errors, improvements, suggestions, etc.

The software was thoroughly tested, although bugs may appear. Don't hesitate to contact us if you find any, or even better, send a pull request or open an issue. For other inquiries, please send an email to [email protected] and we will be happy to answer you 😄.

roysh / reinvent-scaffold-decorator Goto Github PK

reinvent-scaffold-decorator's Introduction

Implementation of the SMILES-based scaffold decorator used in "SMILES-based deep generative scaffold decorator for de-novo drug design"

Requirements

Install

General Usage

Preprocessing the training sets

Creating, training and sampling decorator models

Exhaustively decorating scaffolds

Usage examples

Bugs, errors, improvements, suggestions, etc.

reinvent-scaffold-decorator's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent