molml / moleculeace
A tool for evaluating the predictive performance of machine learning models on activity cliff compounds
License: MIT License
I've been unable to run the example. It doesn't seem possible to directly reproduce the environment you used, and I get an exception when I run your code in an environment I created with the following commands:
conda create -n moleculeACE python=3.8
conda activate moleculeACE
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install tensorflow
conda install pyg -c pyg
pip install transformers
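In case it helps with diagnosis, here is a small stdlib-only snippet to report which versions of the relevant packages ended up installed (each may also simply be absent):

```python
import importlib

# Print the installed version of each package relevant to this issue;
# packages that cannot be imported are reported as missing.
for pkg in ("torch", "torch_geometric", "tensorflow", "transformers"):
    try:
        mod = importlib.import_module(pkg)
        print(pkg, getattr(mod, "__version__", "unknown"))
    except ImportError:
        print(pkg, "not installed")
```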
When I try to run the README example, I get an exception on
model.train(data.x_train, data.y_train)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[2], line 1
----> 1 model.train(data.x_train, data.y_train)
File ~/software/MoleculeACE/MoleculeACE/models/utils.py:82, in GNN.train(self, x_train, y_train, x_val, y_val, early_stopping_patience, epochs, print_every_n)
78 break
80 # As long as the model is still improving, continue training
81 else:
---> 82 loss = self._one_epoch(train_loader)
83 self.train_losses.append(loss)
85 val_loss = 0
File ~/software/MoleculeACE/MoleculeACE/models/utils.py:119, in GNN._one_epoch(self, train_loader)
116 self.optimizer.zero_grad()
118 # Forward pass
--> 119 y_hat = self.model(batch.x.float(), batch.edge_index, batch.edge_attr.float(), batch.batch)
121 # Calculating the loss and gradients
122 loss = self.loss_fn(squeeze_if_needed(y_hat), squeeze_if_needed(batch.y))
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/software/MoleculeACE/MoleculeACE/models/mpnn.py:104, in MPNNmodel.forward(self, x, edge_index, edge_attr, batch)
101 node_feats = node_feats.squeeze(0)
103 # perform global pooling using a multiset transformer to get graph-wise hidden embeddings
--> 104 out = self.transformer(node_feats, batch, edge_index)
106 # Apply a fully connected layer.
107 for k in range(len(self.fc)):
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch_geometric/nn/aggr/base.py:131, in Aggregation.__call__(self, x, index, ptr, dim_size, dim, **kwargs)
126 if index.numel() > 0 and dim_size <= int(index.max()):
127 raise ValueError(f"Encountered invalid 'dim_size' (got "
128 f"'{dim_size}' but expected "
129 f">= '{int(index.max()) + 1}')")
--> 131 return super().__call__(x, index, ptr, dim_size, dim, **kwargs)
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch_geometric/nn/aggr/gmt.py:245, in GraphMultisetTransformer.forward(self, x, index, ptr, dim_size, dim, edge_index)
243 for i, (name, pool) in enumerate(zip(self.pool_sequences, self.pools)):
244 graph = (x, edge_index, index) if name == 'GMPool_G' else None
--> 245 batch_x = pool(batch_x, graph, mask)
246 mask = None
248 return self.lin2(batch_x.squeeze(1))
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch_geometric/nn/aggr/gmt.py:133, in PMA.forward(self, x, graph, mask)
127 def forward(
128 self,
129 x: Tensor,
130 graph: Optional[Tuple[Tensor, Tensor, Tensor]] = None,
131 mask: Optional[Tensor] = None,
132 ) -> Tensor:
--> 133 return self.mab(self.S.repeat(x.size(0), 1, 1), x, graph, mask)
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch_geometric/nn/aggr/gmt.py:59, in MAB.forward(self, Q, K, graph, mask)
57 if graph is not None:
58 x, edge_index, batch = graph
---> 59 K, V = self.layer_k(x, edge_index), self.layer_v(x, edge_index)
60 K, _ = to_dense_batch(K, batch)
61 V, _ = to_dense_batch(V, batch)
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch_geometric/nn/conv/gcn_conv.py:198, in GCNConv.forward(self, x, edge_index, edge_weight)
195 x = self.lin(x)
197 # propagate_type: (x: Tensor, edge_weight: OptTensor)
--> 198 out = self.propagate(edge_index, x=x, edge_weight=edge_weight,
199 size=None)
201 if self.bias is not None:
202 out = out + self.bias
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch_geometric/nn/conv/message_passing.py:392, in MessagePassing.propagate(self, edge_index, size, **kwargs)
389 if res is not None:
390 edge_index, size, kwargs = res
--> 392 size = self.__check_input__(edge_index, size)
394 # Run "fused" message and aggregation (if applicable).
395 if is_sparse(edge_index) and self.fuse and not self.explain:
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch_geometric/nn/conv/message_passing.py:216, in MessagePassing.__check_input__(self, edge_index, size)
213 the_size[1] = size[1]
214 return the_size
--> 216 raise ValueError(
217 ('`MessagePassing.propagate` only supports integer tensors of '
218 'shape `[2, num_messages]`, `torch_sparse.SparseTensor` or '
219 '`torch.sparse.Tensor` for argument `edge_index`.'))
ValueError: `MessagePassing.propagate` only supports integer tensors of shape `[2, num_messages]`, `torch_sparse.SparseTensor` or `torch.sparse.Tensor` for argument `edge_index`.
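For what it's worth, the error message says edge_index must be an integer tensor of shape [2, num_messages] (or a sparse tensor). A stdlib-only sketch of that shape requirement (my own illustration, not PyG's actual code):

```python
def check_edge_index(edge_index):
    # Loosely mirrors the failing check: edge_index must be a
    # 2 x num_messages structure of integers (source row, target row).
    if len(edge_index) != 2 or any(
            not isinstance(i, int) for row in edge_index for i in row):
        raise ValueError(
            "edge_index must be an integer structure of shape [2, num_messages]")
    return True

check_edge_index([[0, 1, 1, 2], [1, 0, 2, 1]])  # passes: two rows of ints
```

So whatever the GraphMultisetTransformer forwards as edge_index in this environment apparently no longer satisfies that contract.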
Hi, it is very helpful that you provided the relevant datasets.
However, there is one thing I am concerned about: does your benchmark dataset include explicit relationships between cliff molecules? In other words, can we know exactly which pairs of molecules have similar structures but significantly different properties?
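To illustrate what I mean, here is a hypothetical helper (the thresholds are illustrative, not necessarily the paper's) that would flag such a pair given a structural similarity score and two potencies in log units:

```python
def is_cliff_pair(similarity, pki_a, pki_b,
                  sim_threshold=0.9, delta_threshold=1.0):
    # Hypothetical definition: structurally similar (similarity at or above
    # a threshold) yet at least delta_threshold log units apart in potency.
    return similarity >= sim_threshold and abs(pki_a - pki_b) >= delta_threshold

print(is_cliff_pair(0.95, 6.0, 8.5))  # True: similar structures, 2.5 log units apart
print(is_cliff_pair(0.40, 6.0, 8.5))  # False: the structures are not similar
```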
Thanks,
Hi, thanks for sharing the code.
However, as far as I can tell, you only split the data into training and test sets and omit a validation split.
It is important to have both a training and a validation set; otherwise, you have no way of knowing when to stop training or which model is best. I don't believe cross-validation alone avoids the drawback of a missing validation split.
Could you please comment on this?
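To illustrate the point, a minimal stdlib sketch of carving a validation set out of the data (a hypothetical helper, not the repo's API; a real split for this benchmark would also need to respect structural similarity):

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    # Simple random three-way split: shuffle, then slice off test and
    # validation portions, keeping the remainder for training.
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    return items[n_test + n_val:], items[n_test:n_test + n_val], items[:n_test]

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```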
Hey, I really appreciate your work - thank you very much for sharing the code and the data.
I found an inconsistency that I couldn't wrap my head around, and would like to ask you to clarify directly:
When looking at the data here:
https://github.com/molML/MoleculeACE/blob/main/MoleculeACE/Data/benchmark_data/CHEMBL2147_Ki.csv
the file has a column called "exp_mean [nM]", and a "y" column which should be the -log10(exp_mean), according to visual inspection and to what you wrote in the paper: "The mean Ki or EC50 value for each molecule was computed and subsequently converted into pEC50/pKi values (as the negative logarithm of molar concentrations)"
However, there is an issue: SMILES with the same "exp_mean" value (e.g. 100 nM) have "y" values that are either positive or negative (e.g. 2 or -2 in the example below), and I haven't found any way to make sense of this.
smiles | exp_mean [nM] | y |
---|---|---|
Cc1cncc(-c2cc3c(-c4cccc(N5CCNCC5)n4)n[nH]c3cn2)n1 | 100 | 2 |
Cc1ccc(F)c(-c2nc(C(=O)Nc3cnn(C)c3N3CCCC@@HCC3)c(N)s2)c1F | 100 | 2 |
Cn1ncc(NC(=O)c2nc(-c3ccccc3F)sc2N)c1N1CCC@HCC(F)(F)C1 | 100 | 2 |
Nc1sc(-c2c(F)cccc2F)nc1C(=O)Nc1cnn(C2CC2)c1N1CCC@HCC(F)(F)C1 | 100 | 2 |
C=C(C)c1ccc(-c2n[nH]c3cnc(-c4cccnc4)cc23)nc1N1CCCC@HC1 | 100 | 2 |
C#Cc1ccc(-c2n[nH]c3cnc(-c4cccnc4)cc23)nc1N1CCCC@HC1 | 100 | 2 |
Cn1ncc(NC(=O)c2nc(-c3ccc(C(F)(F)F)cc3F)sc2N)c1[C@@h]1CCC@@HC@@HCO1 | 100 | 2 |
CO[C@H]1COC@HCC[C@H]1N | 100 | 2 |
Cn1ncc(NC(=O)c2csc(-c3c(F)cc(C4(F)COC4)cc3F)n2)c1[C@@h]1CCC@@HC@HCO1 | 100 | 2 |
Nc1sc(-c2c(F)cccc2F)nc1C(=O)Nc1cnccc1N1CCCC@HC1 | 100 | 2 |
CN1CCC(N(C)c2ccc3nnc(-c4cccc(C(F)(F)F)c4)n3n2)CC1 | 100 | -2 |
Cn1c2ccccc2c2c3c(c4c5ccccc5n(CCC#N)c4c21)CNC3=O | 100 | -2 |
c1ccc(CNc2cc(-c3c[nH]c4ncccc34)ncn2)cc1 | 100 | -2 |
CSc1ccc2nc3c(c(Cl)c2c1)CCNC3=O | 100 | -2 |
Cc1n[nH]c2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 | 100 | -2 |
O=c1[nH]c2sc3c(c2c2nc(-c4ccccc4)nn12)CCCC3 | 100 | -2 |
O=C1NC(=O)C(c2c[nH]c3ccccc23)=C1c1nc(N2CCNCC2)nc2ccccc12 | 100 | -2 |
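For reference, the conversion described in the paper, pKi = -log10(Ki in molar units), can be checked directly; assuming 1 nM = 1e-9 M, 100 nM should give 7, whereas taking the logarithm of the raw nM value gives exactly the 2 / -2 seen above:

```python
import math

def pki_from_nm(ki_nm):
    # pKi = -log10(Ki expressed in mol/L); 1 nM = 1e-9 M.
    return -math.log10(ki_nm * 1e-9)

print(round(pki_from_nm(100), 6))  # 7.0
print(-math.log10(100))            # -2.0 (-log10 of the raw nM value)
print(math.log10(100))             # 2.0 (log10 of the raw nM value)
```

So neither 2 nor -2 matches the molar conversion, which is why the column confuses me; perhaps I am misreading it.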
Could you please clarify the origin of this inconsistency?
Thank you!
Hi,
I was trying to run the first "Getting started" example in README.md and I ran into a problem executing the line
model = algorithm(hyperparameters)
It looks like the hyperparameters are not compatible with the model constructor.
Thanks in advance for your help!
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 algorithm(hyperparameters)
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/MoleculeACE/models/mpnn.py:28, in MPNN.__init__(self, node_in_feats, node_hidden, edge_in_feats, edge_hidden, message_steps, dropout, transformer_heads, transformer_hidden, seed, fc_hidden, n_fc_layers, lr, epochs, *args, **kwargs)
22 def __init__(self, node_in_feats: int = 37, node_hidden: int = 64, edge_in_feats: int = 6,
23 edge_hidden: int = 128, message_steps: int = 3, dropout: float = 0.2,
24 transformer_heads: int = 8, transformer_hidden: int = 128, seed: int = RANDOM_SEED,
25 fc_hidden: int = 64, n_fc_layers: int = 1, lr: float = 0.0005, epochs: int = 300, *args, **kwargs):
26 super().__init__()
---> 28 self.model = MPNNmodel(node_in_feats=node_in_feats, node_hidden=node_hidden, edge_in_feats=edge_in_feats,
29 edge_hidden=edge_hidden, message_steps=message_steps, dropout=dropout,
30 transformer_heads=transformer_heads, transformer_hidden=transformer_hidden, seed=seed,
31 fc_hidden=fc_hidden, n_fc_layers=n_fc_layers)
33 self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
34 self.loss_fn = torch.nn.MSELoss()
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/MoleculeACE/models/mpnn.py:62, in MPNNmodel.__init__(self, node_in_feats, node_hidden, edge_in_feats, edge_hidden, message_steps, dropout, transformer_heads, transformer_hidden, seed, fc_hidden, n_fc_layers, *args, **kwargs)
59 self.node_in_feats = node_in_feats
61 # Layer to project node features to hidden features
---> 62 self.project_node_feats = Sequential(Linear(node_in_feats, node_hidden), ReLU())
64 # The 'learnable message function'
65 edge_network = Sequential(Linear(edge_in_feats, edge_hidden), ReLU(),
66 Linear(edge_hidden, node_hidden * node_hidden))
File ~/anaconda3/envs/moleculeACE/lib/python3.8/site-packages/torch/nn/modules/linear.py:96, in Linear.__init__(self, in_features, out_features, bias, device, dtype)
94 self.in_features = in_features
95 self.out_features = out_features
---> 96 self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
97 if bias:
98 self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
TypeError: empty(): argument 'size' must be tuple of SymInts, but found element of type dict at pos 2
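My guess is that the dict is being passed positionally, so the whole thing binds to node_in_feats and eventually ends up inside torch.empty. A minimal reproduction with a hypothetical stand-in function (not the actual MPNN signature):

```python
def mpnn(node_in_feats=37, node_hidden=64, **kwargs):
    # Stand-in for a constructor; just returns what it received.
    return node_in_feats, node_hidden

hyperparameters = {"node_hidden": 128}
print(mpnn(hyperparameters))    # the whole dict binds to node_in_feats
print(mpnn(**hyperparameters))  # keyword unpacking: (37, 128)
```

If the example is supposed to read algorithm(**hyperparameters), that would explain the TypeError; otherwise the expected hyperparameter format may have changed.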
In GNN, if early_stopping_patience is left at its default of None, utils.py line 66
if patience is not None and patience >= early_stopping_patience:
will raise
TypeError: '>=' not supported between instances of 'int' and 'NoneType'
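A guarded comparison would avoid this; a sketch of the condition (not a patch against the actual file):

```python
def should_stop(patience, early_stopping_patience):
    # Only compare when early stopping is enabled; with the default of
    # None the check is skipped entirely.
    return (early_stopping_patience is not None
            and patience is not None
            and patience >= early_stopping_patience)

print(should_stop(10, None))  # False: early stopping disabled
print(should_stop(10, 5))     # True: patience exhausted
```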
Small thing: the documentation suggests "git clone https://github.com/derekvantilborg/MoleculeACE", but the repository has moved to molML now.
MoleculeACE/MoleculeACE/benchmark/cliffs.py
Line 120 in 024ef21
m[i, j] = 1 - (levenshtein(smiles[i], smiles[j]) / max(len(smiles[i]), len(smiles[j])))
Hi
I want to ask whether there is an issue with this line of code. Why is a sigmoid function applied after obtaining the atomic mass features?
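For context, the quoted cliffs.py line computes a normalized string similarity: one minus the Levenshtein distance divided by the length of the longer SMILES. A self-contained sketch of that computation (my own implementation, not the repo's):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def smiles_similarity(s1, s2):
    # Mirrors cliffs.py line 120: 1 - normalized edit distance.
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

print(round(smiles_similarity("CCO", "CCN"), 3))  # 0.667
```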