
transformer-m's Introduction

One Transformer Can Understand Both 2D & 3D Molecular Data


This repository is the official implementation of “One Transformer Can Understand Both 2D & 3D Molecular Data”, based on the official implementation of Graphormer and Fairseq in PyTorch.

One Transformer Can Understand Both 2D & 3D Molecular Data

Shengjie Luo, Tianlang Chen*, Yixian Xu*, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

🔥 News

  • 2023.03.31: The fine-tuning code of QM9 has been released.
  • 2022.11.22: Congratulations! Transformer-M has been used by all Top-3 winners in PCQM4Mv2 Track, 2nd OGB Large-Scale Challenge, NeurIPS 2022!
    • 1st Place winner, Team WeLoveGraphs from GraphCore, code & report.
    • co-2nd Place winner, Team VisNet from Microsoft, code & report.
    • co-2nd Place winner, Team NVIDIA-PCQM4Mv2 from NVIDIA, code & report.
  • 2022.10.05: Codes and model checkpoints are released!

Overview

[Figure: overview of the Transformer-M architecture]

Transformer-M is a versatile and effective molecular model that can take molecular data in either 2D or 3D format as input and generate meaningful semantic representations. Using the standard Transformer as the backbone architecture, Transformer-M develops two separate channels to encode 2D and 3D structural information and incorporates them with the atom features in the network modules. When the input data is in a particular format, the corresponding channel is activated and the other is disabled. Empirical results show that Transformer-M achieves strong performance on 2D and 3D tasks simultaneously, which is a first step toward general-purpose molecular models in chemistry.
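To make the channel-switching idea concrete, here is a minimal, illustrative PyTorch sketch. It is not the repository's actual module; the class name, shapes, and parameters are made up for this example. It shows how a 2D channel and a 3D channel can each contribute an attention bias only when their respective inputs are present.

import torch
import torch.nn as nn

class TwoChannelBias(nn.Module):
    # Illustrative only: one embedding-based bias for 2D graph structure and one
    # projection-based bias for 3D geometry; a channel is used only when its input exists.
    def __init__(self, num_heads, num_spatial=512):
        super().__init__()
        self.spatial_bias = nn.Embedding(num_spatial, num_heads)  # 2D channel (e.g. shortest-path distances)
        self.dist_proj = nn.Linear(1, num_heads)                   # 3D channel (e.g. interatomic distances)

    def forward(self, spd=None, dist=None):
        # spd:  [B, N, N] integer shortest-path distances (2D input)
        # dist: [B, N, N] float interatomic distances (3D input)
        bias = 0.0
        if spd is not None:   # 2D channel active
            bias = bias + self.spatial_bias(spd).permute(0, 3, 1, 2)
        if dist is not None:  # 3D channel active
            bias = bias + self.dist_proj(dist.unsqueeze(-1)).permute(0, 3, 1, 2)
        return bias           # [B, num_heads, N, N], added to the attention logits

In the full model, the 3D channel additionally encodes interatomic distances with Gaussian basis kernels (the num_3d_bias_kernel setting used below); see the paper for details.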

Results on PCQM4Mv2, OGB Large-Scale Challenge

🚀Note: PCQM4Mv2 is also the benchmark dataset of the graph-level track of the 2nd OGB Large-Scale Challenge at the NeurIPS 2022 competition track. As non-participants, we open-source all code and model weights and sincerely welcome participants to use our model. We look forward to your feedback!

Installation

  • Clone this repository
git clone https://github.com/lsj2408/Transformer-M.git
  • Install the dependencies (Using Anaconda, tested with CUDA version 11.0)
cd ./Transformer-M
conda env create -f requirement.yaml
conda activate Transformer-M
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch_geometric==1.6.3
pip install torch_scatter==2.0.7
pip install torch_sparse==0.6.9
pip install azureml-defaults
pip install rdkit-pypi cython
python setup.py build_ext --inplace
python setup_cython.py build_ext --inplace
pip install -e .
pip install --upgrade protobuf==3.20.1
pip install --upgrade tensorboard==2.9.1
pip install --upgrade tensorboardX==2.5.1
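After installation, an optional sanity check can confirm that the pinned versions and the CUDA build are picked up. This snippet is only a suggestion and not part of the repository:

# Optional sanity check for the environment set up above.
import torch
import torch_geometric
import torch_scatter
import torch_sparse

print(torch.__version__)            # expected: 1.7.1+cu110
print(torch.cuda.is_available())    # should be True on a CUDA 11.0 machine
print(torch_geometric.__version__)  # expected: 1.6.3
print(torch_scatter.__version__)    # expected: 2.0.7
print(torch_sparse.__version__)     # expected: 0.6.9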

Checkpoints

Model    | File Size | Update Date  | Valid MAE on PCQM4Mv2 | Download Link
L12      | 189MB     | Oct 04, 2022 | 0.0785                | https://1drv.ms/u/s!AgZyC7AzHtDBdWUZttg6N2TsOxw?e=sUOhox
L18      | 270MB     | Oct 04, 2022 | 0.0772                | https://1drv.ms/u/s!AgZyC7AzHtDBdrY59-_mP38jsCg?e=URoyUK
L12_old  | 189MB     | Mar 31, 2023 | 0.0787                | https://1drv.ms/u/s!AgZyC7AzHtDBesDk9tZK1yvbtzE?e=5H91Zq
# create paths to checkpoints for evaluation

# download the above model weights (L12.pt, L18.pt) to ./
mkdir -p logs/L12
mkdir -p logs/L18
mv L12.pt logs/L12/
mv L18.pt logs/L18/

Datasets

  • Preprocessed data: download link

    # create paths to datasets for evaluation/training
    
    # download the above compressed datasets (pcqm4mv2-pos.zip) to ./
    unzip pcqm4mv2-pos.zip -d ./datasets
  • You can also directly execute the evaluation/training code to process data from scratch.

Evaluation

export data_path='./datasets/pcq-pos'                # path to data
export save_path='./logs/{folder_to_checkpoints}'    # path to checkpoints, e.g., ./logs/L12

export layers=12                                     # set layers=18 for 18-layer model
export hidden_size=768                               # dimension of hidden layers
export ffn_size=768                                  # dimension of feed-forward layers
export num_head=32                                   # number of attention heads
export num_3d_bias_kernel=128                        # number of Gaussian Basis kernels
export batch_size=256                                # batch size for a single gpu
export dataset_name="PCQM4M-LSC-V2-3D"
export add_3d="true"
bash evaluate.sh
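evaluate.sh reports the validation MAE directly. If you want to score a set of predictions yourself, the official OGB-LSC evaluator can be used; the snippet below is only an illustration, with placeholder arrays standing in for real predictions and labels:

# Scoring predictions with the official OGB-LSC evaluator (illustrative only;
# `y_pred` and `y_true` are placeholder arrays you would load yourself).
import numpy as np
from ogb.lsc import PCQM4Mv2Evaluator

y_true = np.array([5.2, 6.1, 4.8])   # ground-truth HOMO-LUMO gaps (eV)
y_pred = np.array([5.1, 6.3, 4.7])   # model predictions

evaluator = PCQM4Mv2Evaluator()
result = evaluator.eval({'y_pred': y_pred, 'y_true': y_true})
print(result['mae'])                  # mean absolute error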

Training

# L12. Valid MAE: 0.0785
export data_path='./datasets/pcq-pos'               # path to data
export save_path='./logs/'                          # path to logs

export lr=2e-4                                      # peak learning rate
export warmup_steps=150000                          # warmup steps
export total_steps=1500000                          # total steps
export layers=12                                    # set layers=18 for 18-layer model
export hidden_size=768                              # dimension of hidden layers
export ffn_size=768                                 # dimension of feed-forward layers
export num_head=32                                  # number of attention heads
export batch_size=256                               # batch size for a single gpu
export dropout=0.0
export act_dropout=0.1
export attn_dropout=0.1
export weight_decay=0.0
export droppath_prob=0.1                            # probability of stochastic depth
export noise_scale=0.2                              # noise scale
export mode_prob="0.2,0.2,0.6"                      # mode distribution for {2D+3D, 2D, 3D}
export dataset_name="PCQM4M-LSC-V2-3D"
export add_3d="true"
export num_3d_bias_kernel=128                       # number of Gaussian Basis kernels
bash train.sh

Our model is trained on 4 NVIDIA Tesla A100 GPUs (40GB). One epoch takes around 10 minutes.
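The mode_prob setting controls how often each input mode is presented during training. As a rough, illustrative sketch (not the repository's actual sampling code; the function and variable names are made up), the per-sample mode could be drawn as follows, with the unused channel's inputs masked out accordingly:

# Illustrative sketch of interpreting mode_prob during training (not the repo's code).
import torch

mode_prob = (0.2, 0.2, 0.6)   # probabilities for {2D+3D, 2D, 3D}, as in train.sh

def sample_mode(batch_size):
    # Draw a mode index per sample: 0 = 2D+3D, 1 = 2D only, 2 = 3D only.
    return torch.multinomial(torch.tensor(mode_prob), batch_size, replacement=True)

modes = sample_mode(8)
use_2d = modes != 2           # 2D channel active for modes 0 and 1
use_3d = modes != 1           # 3D channel active for modes 0 and 2
print(modes, use_2d, use_3d)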

Downstream Task: QM9

Download the checkpoint: L12-old.pt

export ckpt_path='./L12-old.pt'                # path to checkpoints
bash finetune_qm9.sh

Citation

If you find this work useful, please kindly cite the following papers:

@article{luo2022one,
  title={One Transformer Can Understand Both 2D \& 3D Molecular Data},
  author={Luo, Shengjie and Chen, Tianlang and Xu, Yixian and Zheng, Shuxin and Liu, Tie-Yan and Wang, Liwei and He, Di},
  journal={arXiv preprint arXiv:2210.01765},
  year={2022}
}

@inproceedings{
  ying2021do,
  title={Do Transformers Really Perform Badly for Graph Representation?},
  author={Chengxuan Ying and Tianle Cai and Shengjie Luo and Shuxin Zheng and Guolin Ke and Di He and Yanming Shen and Tie-Yan Liu},
  booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
  year={2021},
  url={https://openreview.net/forum?id=OeWooOxFwDa}
}

@article{shi2022benchmarking,
  title={Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets},
  author={Yu Shi and Shuxin Zheng and Guolin Ke and Yifei Shen and Jiacheng You and Jiyan He and Shengjie Luo and Chang Liu and Di He and Tie-Yan Liu},
  journal={arXiv preprint arXiv:2203.04810},
  year={2022},
  url={https://arxiv.org/abs/2203.04810}
}

Contact

Shengjie Luo ([email protected])

Sincerely appreciate your suggestions on our work!

License

This project is licensed under the terms of the MIT license. See LICENSE for additional details.


transformer-m's Issues

Difficulties setting up the environment to reproduce results

Hi,

Thank you for the code and the surrounding instructions!

I was trying to reproduce the results, but I'm having some difficulty getting the environment to work.

I installed CUDA and the other package versions as mentioned, but torch_scatter was erroring out with "'NoneType' object has no attribute 'origin'". After looking it up online, I uninstalled the recommended version and installed another one with pip install --no-index torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html (even though I have PyTorch 1.7.1). But now torch_sparse errors out with:

div(float a, Tensor b) -> (Tensor):
  Expected a value of type 'Tensor' for argument 'b' but instead found type 'int'.

  div(int a, Tensor b) -> (Tensor):
  Expected a value of type 'Tensor' for argument 'b' but instead found type 'int'.

The original call is:
  File "/nethome/yjakhotiya3/miniconda3/envs/Transformer-M/lib/python3.7/site-packages/torch_sparse/storage.py", line 316
        idx = self.sparse_size(1) * self.row() + self.col()

        row = torch.div(idx, num_cols, rounding_mode='floor')
              ~~~~~~~~~ <--- HERE
        col = idx % num_cols
        assert row.dtype == torch.long and col.dtype == torch.long

I also tried other machines but got unknown CUDA errors from torch.distributed (which could be due to an unrelated driver version mismatch).

Did you encounter any of these issues or do you have any advice on how to navigate them?

About Steps Settings

Thank you for releasing Transformer-M. Regarding the step settings warmup_steps and total_steps: do I need to divide them by the number of GPUs used? I observed during training that the number of steps per epoch decreases under multi-GPU data parallelism, but the program still counts steps as if running on a single card.

Details about PDBBind~

Thank you for your code! It's well written~
I have a few questions about the fine-tuning task on PDBbind. I sincerely look forward to your reply!
1. Inputs. Which protein features are used as input, and is the pocket (sub-sequence) or the full sequence used?
2. Model architecture. Are the protein data and ligand data fed into separate encoders or into the same encoder? If different encoders are used, what type of protein encoder is it, and how are the extracted protein and ligand features combined for the final prediction?
Thank you again for your clarifications!
By the way, may I ask when the fine-tuning code for PDBbind will be released? Thanks.

OSError: /home/miniconda3/envs/Transformer-M/lib/python3.7/site-packages/torch_sparse/_convert_cuda.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

Hi:
Thanks for sharing the code of your cool work.
The code is OK when I run the train.py in terminal of ubuntu 22.04. However, when I debug the code by VScode, an error occurred: OSError: /home/xx/miniconda3/envs/Transformer-M/lib/python3.7/site-packages/torch_sparse/_convert_cuda.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv.

/home/xx/miniconda3/envs/Transformer-M/lib/python3.7/site-packages/torch_sparse/_convert_cuda.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
File "/home/xx/Transformer-M-main/Transformer-M/data/dataset.py", line 3, in
from ogb.lsc import PCQM4Mv2Evaluator
File "/home/xx/Transformer-M-main/Transformer-M/tasks/graph_prediction.py", line 37, in
from ..data.dataset import (
File "/home/xx/Transformer-M-main/fairseq/tasks/init.py", line 119, in import_tasks
importlib.import_module(namespace + "." + task_name)
File "/home/xx/Transformer-M-main/fairseq/utils.py", line 511, in import_user_module
import_tasks(tasks_path, f"{module_name}.tasks")
File "/home/xx/Transformer-M-main/fairseq/options.py", line 237, in get_parser
utils.import_user_module(usr_args)
File "/home/xx/Transformer-M-main/fairseq/options.py", line 38, in get_training_parser
parser = get_parser("Trainer", default_task)
File "/home/xx/Transformer-M-main/fairseq_cli/train.py", line 493, in cli_main
parser = options.get_training_parser()
File "/home/xx/Transformer-M-main/train.py", line 14, in
cli_main()
OSError: /home/xx/miniconda3/envs/Transformer-M/lib/python3.7/site-packages/torch_sparse/_convert_cuda.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

How to fine-tune on PDBbind

Hi,

It's a very interesting model. Would you mind providing the code for preprocessing the data of PDBbind and fine-tuning?

Code associated with fine-tuning

Hello, authors! First of all, thank you for open-sourcing the code; it helps a lot with reproducing the paper, and the model's results are impressive. May I also ask roughly when the fine-tuning code will be uploaded?

Load checkpoint when fine-tuning QM9

Hi,
I ran into a problem when I tried to load 'L12-old.pt' in finetune_qm9.sh: the program reports that the checkpoint's structure does not match. How can I solve this problem?

How to encode proteins in the PDBbind task?

Very enlightening work. Congratulations on your great achievements in the OGB Challenge! I also noticed that you fine-tuned on the PDBbind dataset. How do you encode the protein information? Since proteins usually contain many more heavy atoms, do you use Transformer-M directly to encode the proteins?

Performance Issue: Slow read_csv() Function with pandas Version 1.3.4 for CSV Files

Issue Description:
Hello.
I have discovered a performance degradation in the read_csv function of pandas version 1.3.4 when handling CSV files with a large number of columns. This problem significantly increases the loading time from just a few seconds with the previous version 1.2.5 to several minutes, almost a 60x difference. I found some discussions on GitHub related to this issue, including #44106 and #44192.
I found that Transformer-M/data/wrapper.py, examples/MMPT/scripts/video_feature_extractor/videoreader.py, and examples/MMPT/mmpt/processors/dsprocessor.py use the affected API.

Steps to Reproduce:

I have created a small reproducible example to better illustrate this issue.

# v1.3.4
import os
import pandas
import numpy
import timeit

def generate_sample():
    if os.path.exists("test_small.csv.gz") == False:
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)

generate_sample()
timeit.timeit(load_csv_file, number=1)

# results
loaded dataframe shape: (5, 100000)
120.37690759263933
# v1.3.5
import os
import pandas
import numpy
import timeit

def generate_sample():
    if os.path.exists("test_small.csv.gz") == False:
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)


generate_sample()
timeit.timeit(load_csv_file, number=1)

# results
loaded dataframe shape: (5, 100000)
2.8567268839105964

Suggestion

I would recommend considering an upgrade to pandas >= 1.3.5 or exploring other solutions to optimize the performance of loading CSV files.
Any other workarounds or solutions would be greatly appreciated.
Thank you!

Question about inputting only 2D Data

Hi!

Thank you for introducing such an interesting model to us and sharing the code!

I'm trying to run the model on 2D structures only. Would you mind providing a script for training the model with only 2D structures (like PCQM4M-LSC-V2)?

I tried changing dataset_name and setting add_3d to false in the sample 3D training script in the README, but that doesn't work. Looking into the code, I found that in tasks/graph_prediction.py, in GraphPredictionTask.load_dataset, when BatchedDataDataset is constructed with dataset_version set to "2D" for PCQM4M-LSC-V2, I get an error in criterions/graph_predictions.py line 45: ori_pos = sample['net_input']['batched_data']['pos'], KeyError: 'pos'.

Thank you so much!

Errors when evaluating the model

I tried to evaluate the model using the L12 checkpoint, and an error occurred:
AttributeError: 'Namespace' object has no attribute 'load_qm9'

I added the parameter '--load-qm9' in evaluate.sh and ran 'bash evaluate.sh'. A new error is raised:
RuntimeError: Error(s) in loading state_dict for TransformerMModel:
Unexpected key(s) in state_dict: "encoder.molecule_encoder.atom_proc.q_proj.weight", "encoder.molecule_encoder.atom_proc.q_proj.bias", "encoder.molecule_encoder.atom_proc.k_proj.weight", "encoder.molecule_encoder.atom_proc.k_proj.bias", "encoder.molecule_encoder.atom_proc.v_proj.weight", "encoder.molecule_encoder.atom_proc.v_proj.bias", "encoder.molecule_encoder.atom_proc.force_proj1.weight", "encoder.molecule_encoder.atom_proc.force_proj1.bias", "encoder.molecule_encoder.atom_proc.force_proj2.weight", "encoder.molecule_encoder.atom_proc.force_proj2.bias", "encoder.molecule_encoder.atom_proc.force_proj3.weight", "encoder.molecule_encoder.atom_proc.force_proj3.bias".

unrecognized arguments error when using provided evaluation code

Hi, here is what confuses me. I tried to run the code provided in README.md:

[screenshot of the README evaluation commands]

but I got the following error:

evaluate.py: error: unrecognized arguments: --add-3d --num-3d-bias-kernel 128 --droppath-prob 0.1 --act-dropout 0.1 --mode-prob 0.2,0.2,0.6

It seems strange, could you help me with it?

Training on QM9

Hi,

Would it be possible to provide the commands for training a model on QM9 from scratch? This is mentioned in appendix B5 when investigating the effectiveness of pre-training.

Kind regards,

Rob

Tasks for fine-tuning on QM9

Hi, thank you for your code for fine-tuning on QM9. I have a small question about the correspondence between task_idx and the specific QM9 target.
Do you mean that the correspondence is as follows?
[two screenshots of the assumed task_idx-to-target mapping]
Thank you for your kindest reply.
