MolBERT

This repository contains the implementation of MolBERT, a state-of-the-art representation learning method based on the modern language model BERT.

The details are described in "Molecular representation learning with language models and domain-relevant auxiliary tasks", presented at the Machine Learning for Molecules Workshop @ NeurIPS 2020.

Work done by Benedek Fabian, Thomas Edlich, Héléna Gaspar, Marwin Segler, Joshua Meyers, Marco Fiscato, Mohamed Ahmed

Installation

Create your conda environment first:

conda create -y -q -n molbert -c rdkit rdkit=2019.03.1.0 python=3.7.3

Then install the package by running the following commands from the cloned directory:

conda activate molbert
pip install -e . 

Run tests

To verify your installation, execute the tests:

python -m pytest . -p no:warnings

Load pretrained model

You can download the pretrained model here

After downloading the weights, you can follow scripts/featurize.py to load the model and use it as a featurizer (you just need to replace the path in the script).

Train model from scratch

You can use the GuacaMol dataset (see the Data section at the bottom):

python molbert/apps/smiles.py \
    --train_file data/guacamol_baselines/guacamol_v1_train.smiles \
    --valid_file data/guacamol_baselines/guacamol_v1_valid.smiles \
    --max_seq_length 128 \
    --batch_size 16 \
    --masked_lm 1 \
    --num_physchem_properties 200 \
    --is_same_smiles 0 \
    --permute 1 \
    --max_epochs 20 \
    --num_workers 8 \
    --val_check_interval 1

Add the --tiny flag to train a smaller model on a CPU, or the --fast_dev_run flag for testing purposes. For the full list of options, see molbert/apps/args.py and molbert/apps/smiles.py.

Finetune

Once you have trained a model, you can use the FinetuneSmilesMolbertApp class to finetune it on a task-specific training set.

For classification, set the mode to classification and the output_size to 2.

python molbert/apps/finetune.py \
    --train_file path/to/train.csv \
    --valid_file path/to/valid.csv \
    --test_file path/to/test.csv \
    --mode classification \
    --output_size 2 \
    --pretrained_model_path path/to/lightning_logs/version_0/checkpoints/last.ckpt \
    --label_column my_label_column

For regression, set the mode to regression and the output_size to 1.

python molbert/apps/finetune.py \
    --train_file path/to/train.csv \
    --valid_file path/to/valid.csv \
    --test_file path/to/test.csv \
    --mode regression \
    --output_size 1 \
    --pretrained_model_path path/to/lightning_logs/version_0/checkpoints/last.ckpt \
    --label_column pIC50
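
Both finetuning commands above expect separate train, validation, and test CSV files. If you start from a single CSV, the following is a minimal sketch of one way to produce the three files using only the Python standard library. The helper names (split_rows, write_csv, split_csv), the split fractions, and the seed are illustrative assumptions and are not part of the molbert package; the columns in the input file are copied through unchanged, so your label column (for example pIC50) is preserved.

```python
import csv
import random


def split_rows(rows, valid_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle rows deterministically and cut them into train/valid/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    n_test = int(len(shuffled) * test_fraction)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


def write_csv(path, fieldnames, rows):
    """Write rows (dicts) to path with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def split_csv(source_path, train_path, valid_path, test_path, **kwargs):
    """Split one CSV into three files; returns the (train, valid, test) sizes."""
    with open(source_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)
    train, valid, test = split_rows(rows, **kwargs)
    for path, subset in ((train_path, train), (valid_path, valid), (test_path, test)):
        write_csv(path, fieldnames, subset)
    return len(train), len(valid), len(test)
```

For example, split_csv("all.csv", "train.csv", "valid.csv", "test.csv") writes the three files that the --train_file, --valid_file, and --test_file options point to. A random split is only one possible choice; pick whichever splitting strategy suits your task.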

To reproduce the finetuning experiments, use scripts/run_qsar_test_molbert.py and scripts/run_finetuning.py. Both scripts rely on the Chembench repository and, optionally, the CDDD repository. Please follow the installation instructions described in their READMEs.

Data

Guacamol datasets

You can download pre-built datasets here:

md5 05ad85d871958a05c02ab51a4fde8530 training
md5 e53db4bff7dc4784123ae6df72e3b1f0 validation
md5 677b757ccec4809febd83850b43e1616 test
md5 7d45bc95c33c10cb96ef5e78c38ac0b6 all
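
The checksums above can be used to confirm that the downloads are intact. Below is a minimal sketch using Python's hashlib. The checksum values are copied from the list above. The train and valid file names follow the paths used in the training command, while the test and all file names are assumed by analogy and may differ from the actual downloads.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the list above.
# The "test" and "all" file names are assumptions; adjust them to your downloads.
EXPECTED_MD5 = {
    "guacamol_v1_train.smiles": "05ad85d871958a05c02ab51a4fde8530",
    "guacamol_v1_valid.smiles": "e53db4bff7dc4784123ae6df72e3b1f0",
    "guacamol_v1_test.smiles": "677b757ccec4809febd83850b43e1616",
    "guacamol_v1_all.smiles": "7d45bc95c33c10cb96ef5e78c38ac0b6",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results
```

Calling verify_downloads("data/guacamol_baselines") returns a dictionary of file names to booleans; any False entry indicates a missing or corrupted file.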
