rose-stl-lab / limo Goto Github PK

generative model for drug discovery

Python 100.00%

limo's Introduction

LIMO: Latent Inceptionism for Targeted Molecule Generation

This is the repository for our 2022 ICML paper "LIMO: Latent Inceptionism for Targeted Molecule Generation" by Peter Eckmann, Kunyang Sun, Bo Zhao, Mudong Feng, Michael K. Gilson, and Rose Yu.

Website

You can run LIMO online at https://www.limo-aimd.com/.

Installation

Please ensure that RDKit and Open Babel are installed. The following Python packages are also required (these can also be installed with pip install -r requirements.txt):

torch
pytorch-lightning
selfies
scipy
tqdm

Code was tested with Python 3.9, but will likely work on any version of Python 3.

Generating molecules

Call python generate_molecules.py to generate molecules with desired properties. Run --help to see the full list of supported properties and other parameters. The default setting is to perform multi-objective binding affinity maximiziation with the filtering step, but other properties can be optimized by specifying --prop (see "Training the property predictor" to optimize your own properties). Model files for penalized logP, binding affinity to ESR1 and ACAA1, QED, and SA are provided in the property_models folder. The default binding affinity model is ESR1 (which is binding_affinity.pt), but to optimize binding to another protein one must make sure the binding_affinity.pt file contains the model for that correct protein. For example, for 2IIK binding optimization one must remove/rename the original binding_affinity.pt file and rename 2iik_binding_affinity.pt to binding_affinity.pt. A trained VAE model is provided that was trained on the ZINC250K dataset, but any SMILES dataset can be used for training (see "Training the VAE").

To optimize molecules for binding affinity, an AutoDock-GPU executable must be compiled, and pointed to with the --autodock_executable flag when generating molecules, training the property predictor, or fine-tuning. A --protein_file must also be specified, but files for ESR1 (1err/1err.maps.fld) and ACAA1 (2iik/2iik.maps.fld) are already supplied. To generate your own protein files, see Steps 1-2 in the AutoDock4 User Guide. The AutoDock-GPU Wiki may also be helpful.

Fine-tuning

Run python fine-tune "<smiles>" to fine-tune a molecule for increased binding affinity. For each iteration, it will print out the binding affinity and SMILES of the best molecule. Run with 2>/dev/null to silence RDKit warnings.

Training the property predictor

Train the property predictor with python train_property_predictor.py --prop {logp, penalized_logp, qed, sa, binding_affinity}. Run with --help for a full list of options. When training for binding affinity, you must also provide a --protein_file, with the default being 1err/1err.maps.fld. For properties other than binding affinity, we suggest training with --num_mols 100000 for greater prediction accuracy. The printed r value should be >0.5 for most properties (except binding affinity, which is typically a lower value because it is a more difficult task), so running multiple times until you reach a suitable r value or training with more molecules is recommended.

Training the VAE

Before training, you must run python preprocess_data.py --smiles <file>.smi to get data ready. You can train with your own SMILES file, or use the provided zinc250k.smi, which has SMILES for the ZINC250k dataset. Then call python train_vae.py, which will save the trained model as vae.pt.

Cite

@article{eckmann2022limo,
   title={LIMO: Latent Inceptionism for Targeted Molecule Generation},
   author={Eckmann, Peter and Sun, Kunyang and Zhao, Bo and Feng, Mudong and Gilson, Michael K and Yu, Rose},
   booktitle={International Conference on Machine Learning},
   organization={PMLR},
   Year = {2022}
}

limo's People

Contributors

Stargazers

Watchers

Forkers

linboqiao mars-wei zzlsms mathcom rnaimehaom zju-fangyin bowen-gao ziq-li mkukielka3 mikelee-dev changzhijiang shreshthtuli hritikbharucha keiour

limo's Issues

About substructures

Hello~
In LIMO's paper, you mentioned that

We chose two molecules from ZINC250k to act as starting molecules and defined the substructures of these starting molecules to be fixed

i、How to define the substructures in string representation(SMILES or SELFIES), these string representations are continuous, and the substructure will locate in different places, so i am confused how to get the substructure.
ii、I know Recap can get some substructures, if it's the same way LIMO uses? Or are they from some websites producing important substructures?
iii、How to estimate whether the generated molecule does contain the whole substructure? I guess we can convert the molecule to MorganFingerPrint and use the function HasSubstructMatch in RDKit.
Thank you!
^ v ^

_pickle.UnpicklingError: invalid load key, 'v'.

python generate_molecules.py
Traceback (most recent call last):
File "/home/pharmapp/DD-2022/LIMO/generate_molecules.py", line 15, in
vae.load_state_dict(torch.load('vae.pt'))
File "/home/pharmapp/anaconda3/envs/env0/lib/python3.10/site-packages/torch/serialization.py", line 713, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/pharmapp/anaconda3/envs/env0/lib/python3.10/site-packages/torch/serialization.py", line 920, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

Output of docking results

Hi, perhaps I am missing something here, but in line 214 of utils.py:

ps.append(subprocess.Popen(f'{autodock} -M {protein_file} -s 0 -B ligands/{device}/ligand*.pdbqt -N ../../outs/ -D {device + 1}', shell=True, stdout=subprocess.DEVNULL))

It seems that the output .dlg files for the docking results are being saved at: ../../outs/, and I have 2 questions:

Is the expected behavior supposed to be that AutoDock writes all of the docking results for each ligand file processed here in the batch list in the output folder specified by -N? because in the AutoDock documentation, -N refers to a single file output location. So won't these .dlg files get overwritten with each run of AutoDock?
later, in line 223 of utils.py: for file in tqdm(os.listdir('outs'), desc='extracting binding values'): this seems to be reading from ./outs instead of ../../outs/

I could be missing something here, but in my case none of the .dlg files are generated in either ./outs or ../../outs/

About property computation

Hello, and great job! I have a question regarding property computation.
For instance, in Table 1, how do we compute the PENALIZED LOGP and QED for the generated molecules using different methods?
Do we employ RDKit or the trained property prediction model gθ?
If we use the property prediction model gθ, how do other methods compute their properties?
Do they calculate the PENALIZED LOGP and QED using their own respective property prediction models?(if so, how about the fairity)
Thank you very much!

Smiles for Substructure-constrained logP

There are 2 molecules shown in task 4.5. Could you please provide smiles for original molecules and substructures for both of them?

Docking Scores for ACAA1 and ESR1

Hi Peter,

Would you happen to have docking scores saved for these 2 receptors? I would like to retrain the property predictors for binding affinity without re-computing docking scores. Perhaps a csv file formatted like, (smiles, docking score)

Thanks!

About Inference

The first stage is the training of the VAE.
The second stage is the training of the property predictor,
1、Should the third stage be viewed as the inference process?
2、Does LIMO generate molecules in this manner?
a. Get 1 input molecule from ZINC250k or MOSES, LIMO gets the lantent ( z )
b. LIMO does 10 times optimization of ( z )
c. LIMO uses the optimized ( z ) to generate optimized molecules

Thank you very much!

Problem with binding_affinity optimization

I am very interested in your work on LIMO and I am currently trying to reproduce your experiments with your code. However I have a problem when the property to optimize is the binding_affinity for ESR1 the three best values are always equal to 1.0

Hyperparameter Optimization for Property Predictors

Hi!

I was wondering if the default hyperparameters specified in the repository are the ones you determined after hyperparameter optimization. In the LIMO paper, it is written that hyperparameter optimization was performed for the property predictors predicting penalized logp, and reused for all other property predictors. I would like to double check because the paper also mentions the property predictors all have 3, linear 1000 dimensional layers, while in the repository there are 2, 1000 dimensional layers.

To match the publication, I have added 1 more 1000 dimensional linear layer and I retrained the property predictor for predicting penalized logp on 100,000 molecules, using the current default hyperparameters specified in the repository, and achieved an R-value of 0.58.

Basically, I would like work on improving the property predictors so I just wanted to double check if this is the expected performance for the current baseline model, before trying any further hyperparameter or model architecture changes.

About Reverse Optimization

Hello, excellent work! I have a question regarding the reverse optimization process. ^v^

In Section 3.3, you mentioned freezing the weights of f_dec and g_theta, while keeping z trainable, where z is a vector. My understanding is that backpropagation typically operates on the weights of neural networks. However, in this scenario, both the weights of the decoder and the property prediction model are frozen. Does this imply that we actually fine-tune the weights of the encoder to generate the desired z?

Alternatively, if there's anything I haven't understood clearly about backpropagation? That it can directly modify the vector z based on the loss between the predicted property and the ground truth property.

Thank you very much!

Generating new binding_affinity.pt files

Hi. In the README, you write:

The default binding affinity model is ESR1 (which is binding_affinity.pt), but to optimize binding to another protein one must make sure the binding_affinity.pt file contains the model for that correct protein.

Can you describe a bit more how you generate the file binding_affinity.pt for a new protein? Thanks!

Script to reproduce Table 1&2

Hi,

Thanks for the great contribution!

May I ask if the code to reproduce Table 1 and Table 2 is released?

Thanks!

The code runs very slowly after this point “subprocess.run(f'{autodock} -M {protein_file} -s 0 -L ligands/0/ligand0.pdbqt -N outs/ligand0', shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)”

I see nvidia-smi is GPU-Util:0%,but Memory-Usage is used. I have AutoDock-GPU installed and it compiles fine.Please give me some pointers, thanks.

Generate qed.pt file or binding_affinity.pt file

Hi! When I try to generate qed.pt or binding_affinity.pt file by the script train_property_predictor.py , I have a problem like that:

Epoch 4: 100%|█████████████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 172.03it/s, loss=nan, v_num=3]
property predictor trained, correlation of r = nan

I'm not sure whether this is my *.maps.fld problem, I have followed the instruction of Rigid Receptor PDBQT Files – the “Grid” Menu , when I tried to replace my.maps.fld file with 1err, I still have this problem)

Similarity-constrained Penalized logP Maximization

Hi,
I would like to reproduce your experiment on constrained optimization of plogp, but I could not find the corresponding code in your repo (Table 3). Is it possible to get it?
Thank in advance for your answer.

About the sampling process

Hello, Peter!
Me again~
Here you mentioned:

A. Experiment description and baselines
A.1. Tasks Random generation of molecules: Generate random molecules by sampling from the latent space

How to sample from the latent space?
Is it like sampling random noise from the normal distribution?
Thanks!

About the Substructure-constrained logP Extremization task

Dear Peter,

I'm back again. Could you please send me the two molecules (SMILES or SELFIES) and the selected substructure for TASK 4.5: Substructure-constrained logP Extremization? Thank you!

Best regards,
Fan

Substructure Optimisation: Mols don't maintain substructure

Hi !
I tried to run the optimisation procedure with the code you provided in this issue on the first molecule and its substructure as you provided here. However, none of the molecules seem to have maintained the substructure. I will attach the exact code I am running. Am I overlooking something?

Thanks in advance !

def create_mask(smile, substructure):
  orig_z = smiles_to_z([smile], vae)

  orig_x = torch.exp(vae.decode(orig_z))
  substruct = Chem.MolFromSmiles(substructure)
  selfies = list(sf.split_selfies(sf.encoder(smile)))
  mask = torch.zeros_like(orig_x)
  for i in range(len(selfies)):
    for j in range(len(dm.dataset.idx_to_symbol)):

      changed = selfies.copy()

      changed[i] = dm.dataset.idx_to_symbol[j]
      m = Chem.MolFromSmiles(sf.decoder(''.join(changed)))
      if not m.HasSubstructMatch(substruct):

	  mask[0][i * len(dm.dataset.idx_to_symbol) + j] = 1
  return mask, orig_z, orig_x


mask, orig_z, orig_x = create_mask(smile='CCCC1=NN(C)C(NC(CN2C(NC3(C2=O)CCC(CC3)C)=O)=O)=C1', substructure='O=CNC1=CC=NN1C')


z = orig_z.clone().detach().requires_grad_(True)
optimizer = torch.optim.Adam([z], lr=0.1)
smiles = []
logps = []
for epoch in tqdm(range(50000)): # 50000
	optimizer.zero_grad()
	x = torch.exp(vae.decode(z))
	loss = model(x) + 1000 * torch.sum(((x - orig_x.clone().detach()) * mask) ** 2)
	loss.backward()
	optimizer.step()
	if epoch % 1000 == 0:
		# x, logp = get_logp(z)
		# logps.append(logp.item())
		smiles.append(one_hot_to_smiles(x))

for s in smiles:
    m = Chem.MolFromSmiles(s)
    substruct = Chem.MolFromSmiles('O=CNC1=CC=NN1C') 
    print(m.HasSubstructMatch(substruct))

PyTorch Lightning Version

Hi!
Would it be possible for you to share which version of PyTorch Lightning you used?

It seems that a lot of functionality in PyTorch Lightning has been deprecated as of version 2.0, which is causing some breaking.
I have been able to solve some changes by searching for how to handle the deprecated funtionalities, but it would be easier to reproduce the code if the PyTorch Lightning version number was known

For example in line 46 of train_property_predictory.py: trainer = pl.Trainer(gpus=1, max_epochs=5, logger=pl.loggers.CSVLogger('logs'), enable_checkpointing=False, auto_lr_find=True) The attribute gpus does not exist in the Trainer class anymore for PyTorch version 2.0

gpus can be replaced by adding accelerator='gpu' and devices=1
There are some more cases, so it would be helpful to just use the same package version for better reproducibility of the code

Thanks for your help!

About the metric

Hello, i'm back again~

Take logp targetting as an example：

I generated molecules with a total count of N, where N1 are the molecules that meets the condition logp in (-2.5, -2).
M are the molecules in N that are duplicates of the training set, and N2 are the molecules that meet the criterion in (N-M).

When calculating the proportion of molecules falling within the desired range for the logP targeting task, which of the following ratios should be used?

N1/N
N2/(N-M)
N2/N

When calculating the similarity between generated molecules, which set should be used for this calculation?

Thank you!

Why is the data presented in the code I run different from what is presented in the paper. Could you help me take a look

parser.add_argument('--prop', choices=['logp', 'penalized_logp', 'qed', 'sa', 'binding_affinity', 'multi_objective_binding_affinity'], default='qed')
parser.add_argument('--num_mols', type=int, default=100000)
parser.add_argument('--top_k', type=int, default=3)
parser.add_argument('--sa_cutoff', type=float, default=5.5)
parser.add_argument('--qed_cutoff', type=float, default=0.4)
parser.add_argument('--optim_steps', type=int, default=22)#原来是10
parser.add_argument('--autodock_executable', type=str, default='AutoDock-GPU/bin/autodock_gpu_128wi')#这个必须有
parser.add_argument('--protein_file', type=str, default='1err/1err.maps.fld')#这个必须指定，还可以换成2iik/2iik.maps.fld
args = parser.parse_args()
These are the hyperparameter I use, vae.pt and qed.pt are provided by you

Substructure-constrained logP Extremization & logP Targeting

Hi, you mentioned Substructure-constrained logP Extremization & logP Targeting in your paper, but I don't seem to see either part in the code. Could you please provide more details about how to realise them? Thanks

table1&2

python generate_molecules.py --prop penalized_logp
generating molecules: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:38<00:00, 9.82s/it]
16.639477 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC1CC1
16.639477 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC1CC1
17.468054 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC=CCC

python generate_molecules.py --prop qed
generating molecules: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:35<00:00, 9.54s/it]
calculating QED: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:14<00:00, 671.45it/s]
0.8635939 CC=CC1=CC1NC(=O)C2CCCCCCCC3=CC=CC=C23
0.8734434 CC=CC1=CC1NC(=O)C2CCCC=CC=CC3=CC=CC=C23
0.88593656 CC=CC1=CC1NC(=O)C2CC=CC=CC=CC3=CC=CC=C23
Hello. This is the result of my run. I did not change the hyperparameter, but directly used the provided code to get the hyperparameter. I didn't train my own VAE model, but instead used the provided VAE.PT and the provided QED.PT. Can you help me check which step I made a mistake.