
[ICLR 2024] Domain-Agnostic Molecular Generation with Chemical Feedback

Home Page: https://huggingface.co/spaces/zjunlp/MolGen

License: MIT License

Python 98.02% Shell 1.98%
language-model molecular-generation pre-trained-language-models pre-trained-model molgen molecule molecular-optimization selfies targeted-molecular-generation pre-training generation multitask huggingface pytorch iclr2024

molgen's Introduction

⚗️ MolGen

Domain-Agnostic Molecular Generation with Chemical Feedback

📃 Paper • 🤗 Model • 🔬 Space


🔔 News

📕 Requirements

To run the code, you can configure dependencies by restoring our environment:

conda env create -f MolGen/environment.yml -n $Your_env_name$

and then:

conda activate $Your_env_name$

📚 Resource Download

You can download the pre-trained and fine-tuned models via Huggingface: MolGen-large and MolGen-large-opt.

Moreover, the dataset used for downstream tasks can be found here.
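As a quick sanity check that the download worked, the checkpoints on the Hub can be loaded through the standard transformers seq2seq API. The sketch below is illustrative only: it assumes the Hub ID zjunlp/MolGen-large and uses arbitrary decoding settings; the input is benzene written as a SELFIES string.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the pre-trained checkpoint from the Hugging Face Hub (assumed Hub ID)
tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-large")
model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/MolGen-large")

# MolGen works on SELFIES strings; this is benzene
sf_input = tokenizer("[C][=C][C][=C][C][=C][Ring1][=Branch1]", return_tensors="pt")

# Sample a few candidate molecules (illustrative decoding parameters)
outputs = model.generate(input_ids=sf_input["input_ids"],
                         attention_mask=sf_input["attention_mask"],
                         max_length=64,
                         do_sample=True,
                         top_k=30,
                         num_return_sequences=3)
for seq in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(seq.replace(" ", ""))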

The expected structure of files is:

moldata
├── checkpoint 
│   ├── molgen.pkl              # pre-trained model
│   ├── syn_qed_model.pkl       # fine-tuned model for QED optimization on synthetic data
│   ├── syn_plogp_model.pkl     # fine-tuned model for p-logP optimization on synthetic data
│   ├── np_qed_model.pkl        # fine-tuned model for QED optimization on natural product data
│   ├── np_plogp_model.pkl      # fine-tuned model for p-logP optimization on natural product data
├── finetune
│   ├── np_test.csv             # natural product test data
│   ├── np_train.csv            # natural product train data
│   ├── plogp_test.csv          # synthetic test data for p-logP optimization
│   ├── qed_test.csv            # synthetic test data for QED optimization
│   └── zinc250k.csv            # synthetic train data
├── generate                    # generate molecules
├── output                      # molecule candidates
└── vocab_list
    └── zinc.npy                # SELFIES alphabet
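If you want to confirm the files are in place, the SELFIES alphabet can be inspected directly; a minimal sketch, assuming zinc.npy stores the token list as a pickled NumPy array:

import numpy as np

# Load the SELFIES alphabet used by the tokenizer (assumed to be a pickled array)
vocab = np.load("moldata/vocab_list/zinc.npy", allow_pickle=True)
print(len(vocab), vocab[:10])  # alphabet size and the first few tokens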

🚀 How to run

  • Fine-tune

    • First, preprocess the finetuning dataset by generating candidate molecules using our pre-trained model. The preprocessed data will be stored in the folder output.
        cd MolGen
        bash preprocess.sh
    • Then utilize the self-feedback paradigm. The fine-tuned model will be stored in the folder checkpoint.
        bash finetune.sh
  • Generate

    To generate molecules, run this script. Please specify the checkpoint_path to determine whether to use the pre-trained model or the fine-tuned model.

    cd MolGen
    bash generate.sh
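Note that MolGen operates on SELFIES rather than SMILES, so every generated string decodes to a syntactically valid molecule. The round trip between the two representations looks like this (a minimal sketch using the selfies package):

import selfies as sf

smiles = "C1=CC=CC=C1"        # benzene as SMILES
encoded = sf.encoder(smiles)  # SMILES -> SELFIES, e.g. "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
decoded = sf.decoder(encoded) # SELFIES -> SMILES (an equivalent, possibly kekulized form)
print(encoded)
print(decoded)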

🥽 Experiments

We conduct experiments on well-known benchmarks to confirm MolGen's optimization capabilities, encompassing penalized logP, QED, and molecular docking properties. For detailed experimental settings and analysis, please refer to our paper.
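For reference, QED and the base term of penalized logP can be computed with RDKit; a minimal sketch (the full penalized-logP score additionally subtracts the synthetic-accessibility score and a long-cycle penalty, which require RDKit's contrib sascorer):

from rdkit import Chem
from rdkit.Chem import Descriptors, QED

mol = Chem.MolFromSmiles("CCO")  # ethanol as a toy example
print(QED.qed(mol))              # drug-likeness score in [0, 1]
print(Descriptors.MolLogP(mol))  # Crippen logP, the base term of penalized logP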

  • MolGen captures real-world molecular distributions

  • MolGen mitigates molecular hallucinations

Targeted molecule discovery


Constrained molecular optimization


Citation

If you use or extend our work, please cite the paper as follows:

@inproceedings{fang2023domain,
  author       = {Yin Fang and
                  Ningyu Zhang and
                  Zhuo Chen and
                  Xiaohui Fan and
                  Huajun Chen},
  title        = {Domain-Agnostic Molecular Generation with Chemical Feedback},
  booktitle    = {{ICLR}},
  publisher    = {OpenReview.net},
  year         = {2024},
  url          = {https://openreview.net/pdf?id=9rPyHyjfwP}
}


molgen's People

Contributors

hackerchenzhuo, zju-fangyin, zxlzr


molgen's Issues

About preprocess

Hello, and great job!
Before running the finetuning process, you mentioned generating the candidate datasets with the pretrained model.
Is the pretrained model located at the path moldata/checkpoint/molgen.pkl?
I have downloaded the Hugging Face model but didn't find molgen.pkl.
moldata
├── checkpoint
│   ├── molgen.pkl   # pre-trained model
Thank you!

Question about "loss_output" of "finetune.py"

Great job! I'm curious: the sizes of "src_id" and "candidate_id" appear to be "[bs, num_input, max_len]" and "[bs, num_cand, max_len]", where num_cand = num_samples*num_input. However, when their second dimensions differ, the program doesn't seem to make sense.

By the way, "input_ids" and "decoder_input_ids" don't seem to be used in finetune.py.

Hope for your reply!

About the Vocabulary

I would like to ask: are the vocabularies used for the natural product dataset and the synthetic dataset the same, i.e., in the second stage of pre-training mentioned in the paper? Natural products are much longer than synthetic molecules. Also, what sequence length did you set for the encoder?

How to reproduce the impressive scores shown in Table 1?

Hello, I'm very interested in your work and wanted to reproduce the leading test scores in Table 1. However, when I generate molecules with the provided pre-trained model (link1) and evaluate them, the results are not close to those in the table.

Generation and evaluation procedure

I used the first 10K molecules of the MOSES test set as input molecules and generated molecules with the generate.sh script:

deepspeed --include localhost:0,1,2,3 generate_ds.py --dist 1 \
                                            --gpu 4 \
                                            --batch_size 5  \
                                            --exp_name generate \
                                            --exp_id qed \
                                            --return_num 20    \
                                            --max_len 100   \
                                            --min_len 20    \
                                            --top_k 30  \
                                            --top_p 1   \
                                            --beam 1300  \
                                            --process 'generate'  \
                                            --generate_mode 'topk'  \
                                            --checkpoint_path '../moldata/checkpoint/molgen.pkl' \
                                            --input_path '../moldata/finetune/test_first_10K.csv'  \
                                            --generate_path '../moldata/generate/optimize_f10K.csv' \
                                            --property 'qed' \
                                            --deepspeed \
                                            --deepspeed_config generate_config.json

The generated molecules were then evaluated with MOSES's metric functions:

import pandas as pd
import moses.metrics as mcs
file_path = "./MolGen/moldata/generate/optimize_f10K.csv"
data = pd.read_csv(file_path, dtype=str)
smiles = data['candidate_smiles']
metrics = mcs.get_all_metrics(smiles, device="cuda:0")

I expected results like those in Table 1, but the actual results were as follows:

{'FCD/Test': 40.90308785709429,
 'FCD/TestSF': 42.028217393026495,
 'Filters': 0.02705060207140637,
 'Frag/Test': 0.28600698030425464,
 'Frag/TestSF': 0.27391998139090723,
 'IntDiv': 0.8993964364626843,
 'IntDiv2': 0.8804226826180745,
 'Novelty': 1.0,
 'QED': 0.6386171938374946,
 'SA': 3.5723966047348865,
 'SNN/Test': 0.17555409343519618,
 'SNN/TestSF': 0.1655550900856655,
 'Scaf/Test': 0.0,
 'Scaf/TestSF': 0.0,
 'logP': 2.028207020853999,
 'unique@1000': 0.875,
 'unique@10000': 0.8301,
 'valid': 0.99739,
 'weight': 123.78018379991674}

As you can see, many of the scores are far from the expected results. Adjusting parameters such as return_num and max_len did not help.

I hope you can point out what is wrong with my evaluation procedure and describe the steps needed to reproduce the Table 1 results. I am new to research and this is my first attempt at reproducing a recent paper, so please forgive any simple mistakes. Many thanks, and best wishes for your future research!

Redundant imports

Hi, I see that you import numpy, warnings, and selfies multiple times in generate_ds.py.


About loading the checkpoint file for generative task

Hi,

After I tried preprocessing and fine-tuning, I found that two checkpoint files were generated, one is called 'mp_rank_00_model_states.pt' and the other one is called 'zero_pp_rank_0_mp_rank_00_optim_states.pt'. It seems that both of them could be used for the following generative task as the checkpoint_path, and give me similar generative results. It would be highly appreciated if more details or explanations could be given.

Many thanks in advance!

Song

Finetuning

Hi,

Fascinating project! I was just wondering, how should I format the data for finetuning purposes? More specifically, what columns do I need? The CandidateDataset class asks for 'input' and 'candidates', but the files under moldata/finetune/ do not have these columns. If I have a custom property that I'd like to predict, what should I name it?

Thanks a lot!
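For illustration, if CandidateDataset really does read 'input' and 'candidates' columns, a custom fine-tuning file might look like the hypothetical sketch below (column names are taken from the question above, not from the released data files; the candidate lists would come from the preprocessing step):

import pandas as pd

# Hypothetical layout: one source SELFIES per row plus its candidate set
df = pd.DataFrame({
    "input": ["[C][=C][C][=C][C][=C][Ring1][=Branch1]"],
    "candidates": [["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]"]],
})
df.to_csv("moldata/finetune/custom_train.csv", index=False)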

About the molecule generation with optimized chemical properties

Hi,

I recently came across your work "Molecular Language Model as Multi-task Generator" and found it very interesting. I have a question regarding molecule generation with optimized chemical properties.

I noticed that in your code, the QED and PlogP features are calculated using RDKit, which relies on SMILES strings as input. However, I was wondering if it is possible to use other chemical features that do not rely on SMILES strings to achieve the generative task.

For example, if I have a dataset with SMILES and their chemical properties (which are experimental data), would it be possible to use your model to achieve the generative task?

I am very interested in exploring this idea further and would appreciate any insights or suggestions you may have. Thank you so much for your time.

Best regards,
Song

Pretrain and finetune datasets

Thanks very much for your great code and paper.
While working with the project, I ran into questions about the usage of the different datasets, the inputs/outputs, and the time the process takes.
On the datasets: it is clear that 100M molecules from ZINC15 were selected for pre-training, but there are several dataset files in the "finetune" directory that confuse me. To run the whole process, which should be used for pre-training and which for fine-tuning in the multi-task setting?
On the time: I ran 250,000 molecules on our device (3 V100s, 16 GB) and it took 2-3 days to complete the whole process. You mentioned 6 V100s were used for training, so how long did it take to pre-train your model on the 100M ZINC15 dataset, as a personal reference?
Finally, about the .sh files in your directory: do the default parameters in them need to be changed, including the dataset paths? I found that some paths and file names don't exist in the project.
Thanks for noticing this :)

Question about ”generation of two-dimensional molecules“

Hi, thanks for your great work.

In your paper, you mentioned that this work is focused on two-dimensional molecules. From my understanding, two-dimensional molecules refer to molecules represented using a graph data structure. I would like to know the reason behind using this term.

Details on Mol-Gen 7b

Hi Authors,

Congratulations on your outstanding work. MolGen is an invaluable contribution to the field of molecule generation.

I am particularly interested in understanding the training intricacies of the recently released LM Mol-Gen 7b. Specifically, I have a few questions:

  • Is it trained from scratch or initialized from Meta's Llama-7b?
  • Is it trained solely for molecule generation or trained jointly with other objectives?
  • How large is the training corpus?

Your insights would greatly enrich my understanding of the model's architecture and training strategy. I look forward to your response.

About the generation process

Hello, and great job!
1. When generating the 10K molecules in Table 1, Table 2, or Table 3, should we input some molecules? Are they from ZINC250K or MOSES?
2. MolGen can generate better molecules when given inputs, so the generation process is actually an optimization process, am I right?
Thank you very much!
