GCT-Plus

About the Project

This project builds on GCT, a conditional variational autoencoder (CVAE) with a Transformer architecture that generates molecules conditioned on target properties. We extended GCT to also support structure-based generation conditioned on a Bemis-Murcko scaffold, and named the extended model GCT-Plus.

We trained GCT-Plus on approximately 1.58 million neutral molecules from the MOSES benchmarking platform, covering four tasks: unconditioned generation, property-based generation, structure-based generation, and property-structure-based generation. We also used GCT-Plus for molecular interpolation, which generates new molecules whose structures resemble those of two given starting molecules.
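
As an illustration of the interpolation idea, here is a minimal sketch that assumes interpolation is done linearly between the latent vectors of the two starting molecules. That is an assumption made for illustration, not a description of the repository's code; the function name `interpolate_latents` is hypothetical, and encoding molecules into latent vectors and decoding them back to SMILES are left out.

```python
from typing import List, Sequence


def interpolate_latents(
    z_start: Sequence[float], z_end: Sequence[float], n_steps: int
) -> List[List[float]]:
    """Return n_steps points evenly spaced on the straight line from z_start to z_end.

    Both endpoints are included. Linear interpolation is an assumption here;
    the actual scheme used by GCT-Plus may differ.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2 to include both endpoints")
    points = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # t runs from 0.0 (start) to 1.0 (end)
        points.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return points


if __name__ == "__main__":
    # Toy 2-dimensional latent vectors; real latent vectors come from the encoder.
    for z in interpolate_latents([0.0, 1.0], [1.0, 3.0], 5):
        print(z)
```

Each intermediate vector would then be passed to the decoder to obtain a candidate molecule between the two starting points.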

Getting Started

(1) Clone the repository:

git clone https://github.com/chaoting-sun/GCT-Plus.git

(2) Create an environment:

cd GCT-Plus
conda env create -n gct-plus -f ./env.yml # create a new environment named gct-plus
conda activate gct-plus

(3) Download the Models:

# 1. unconditioned GCT
gdown "https://drive.google.com/uc?id=1k8HxI-h3Z9ZfJM4HZMFfZEw8Rh8bMElf" -O ./Weights/vaetf/vaetf1.pt

# 2. property-based GCT
gdown "https://drive.google.com/uc?id=1D5g3TF3-eFB34SXpylERSa-6L1u_SR5d" -O ./Weights/pvaetf/pvaetf1.pt

# 3. structure-based GCT
gdown "https://drive.google.com/uc?id=1emVfSViCVWugPda1utYaIBenbRucH_j1" -O ./Weights/scavaetf/scavaetf1.pt

# 4. property-structure-based GCT

# selected properties: logP, tPSA, QED
gdown "https://drive.google.com/uc?id=10ojI90-Wrc0RTWUgOfAea6VjRk_GIPVH" -O ./Weights/pscavaetf/pscavaetf1.pt

# selected properties: logP, tPSA, SAS
gdown "https://drive.google.com/uc?id=1gA-woAsdYpUsDo_jQAO1n3Nf7WJS6g-D" -O ./Weights/pscavaetf/pscavaetf1_molgpt.pt

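# 5. property-based Transformer
# NOTE: this output path is the same as the one used for model 4 (logP, tPSA, QED),
# so running both commands overwrites the earlier file.
gdown "https://drive.google.com/uc?id=1ICK-p9p3WA4eOZfw0zPkPCP2LRks9hEg" -O ./Weights/pscavaetf/pscavaetf1.pt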

(4) Run Multiple Tasks

# unconditioned generation
Bashscript/infer/uc_sampling.sh

# property-based generation
Bashscript/infer/p_sampling.sh

# structure-based generation
Bashscript/infer/sca_sampling.sh

# property-structure-based generation
Bashscript/infer/psca_sampling.sh

# molecular interpolation
Bashscript/infer/mol_interpolation.sh

# visualize attention
Bashscript/infer/visualize_attention.sh

Implementation

(1) Preprocess the data

Bashscript/preprocess/preprocess.sh

(2) Re-train Models

# train a model for unconditioned generation
Bashscript/train/train_vaetf.sh

# train a model for property-based generation
Bashscript/train/train_pvaetf.sh

# train a model for structure-based generation
Bashscript/train/train_scavaetf.sh

# train a model for property-structure-based generation
Bashscript/train/train_pscavaetf.sh

(3) Model Selection

The best epoch can be selected automatically only for the unconditioned-generation model (vaetf):

Bashscript/infer/model_selection.sh

Explanation

(1) What is the difference between "train0.py" and "train1.py"?

One primary distinction between them lies in the batching method used during training. "train0.py" adopts the same approach as GCT, wherein SMILES with similar lengths are grouped together within each batch. Conversely, "train1.py" randomly assigns SMILES to batches. We observed that the second approach, used in GCT-Plus, contributes to a smoother latent space, which enhances molecular interpolation capabilities.

Furthermore, "train0.py" is limited to single-GPU training of the model, whereas "train1.py" supports parallel training across multiple GPUs.
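
The two batching strategies described above can be sketched as follows. This is an illustration only, not the code in "train0.py" or "train1.py"; the function names, the sort-then-slice bucketing, and the fixed random seed are all assumptions.

```python
import random
from typing import List


def length_bucketed_batches(smiles: List[str], batch_size: int) -> List[List[str]]:
    """Group SMILES of similar length: sort by length, then cut consecutive batches."""
    ordered = sorted(smiles, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def random_batches(smiles: List[str], batch_size: int, seed: int = 0) -> List[List[str]]:
    """Assign SMILES to batches at random: shuffle a copy, then cut consecutive batches."""
    shuffled = list(smiles)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


if __name__ == "__main__":
    data = ["C", "CC", "CCO", "c1ccccc1", "CC(=O)O", "CCN"]
    print(length_bucketed_batches(data, 2))  # each batch holds strings of similar length
    print(random_batches(data, 2))           # batch membership ignores string length
```

Length bucketing keeps padding within a batch small, while random assignment mixes short and long SMILES in the same batch.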

(2) Why does "model_selection.py" only support GCT-Plus for unconditioned generation?

We used the KL divergence metric, as defined in GuacaMol, to identify the optimal epoch. For each epoch, we computed a score (S) that measures how closely the set generated by GCT-Plus matches the reference set (the MOSES test set). A higher S score indicates that the model has learned the reference distribution better.

We expected the model's KL divergence to follow a concave curve over the epochs, and we selected the epoch with the highest S score. In testing, only GCT-Plus for unconditioned generation selected a reasonable epoch (37-38). The epochs chosen for the other tasks were implausibly early. For instance, the first epoch for GCT-Plus responsible for property-structure-based generation yielded the lowest S score.
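
The selection procedure described above can be sketched as follows. This is a simplified stand-in rather than the code in "model_selection.py" or GuacaMol itself: it handles only integer-valued descriptors via normalized counts, the descriptor names are hypothetical, and the aggregation as the mean of exp(-D_KL) over descriptors follows our reading of the GuacaMol definition.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def discrete_kl(reference: Sequence[int], generated: Sequence[int], eps: float = 1e-10) -> float:
    """D_KL(P_ref || Q_gen) over the union of observed values, with eps smoothing."""
    support = sorted(set(reference) | set(generated))
    ref_counts, gen_counts = Counter(reference), Counter(generated)
    p = [ref_counts[v] / len(reference) + eps for v in support]
    q = [gen_counts[v] / len(generated) + eps for v in support]
    p_total, q_total = sum(p), sum(q)  # renormalize after smoothing
    kl = sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )
    return max(kl, 0.0)  # guard against tiny negative values from rounding


def kl_score(reference: Dict[str, Sequence[int]], generated: Dict[str, Sequence[int]]) -> float:
    """Mean of exp(-D_KL) over descriptors; 1.0 means the distributions match."""
    partial = [math.exp(-discrete_kl(reference[name], generated[name])) for name in reference]
    return sum(partial) / len(partial)


def best_epoch(scores_by_epoch: Dict[int, float]) -> int:
    """Return the epoch whose S score is highest."""
    return max(scores_by_epoch, key=scores_by_epoch.get)


if __name__ == "__main__":
    # Hypothetical descriptor values for a reference set and one generated set.
    ref = {"num_rings": [0, 1, 1, 2], "num_atoms": [5, 6, 6, 7]}
    gen = {"num_rings": [0, 0, 0, 1], "num_atoms": [5, 5, 6, 7]}
    print(kl_score(ref, gen))
```

In practice the score would be computed once per saved checkpoint, and `best_epoch` applied to the resulting mapping from epoch to S.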
