dimenet's People

Contributors

gasteigerjo


dimenet's Issues

How can we train dimenet on MD17? Any tutorial?

Hi klicperajo, good day! I would like to reproduce your results on MD17, but I can't find where in the code you obtain the derivatives of the energies to compute the forces. Could you give me some instructions? Thank you very much.

Reason for the linear weight twice

Thank you for your work.

I noticed that in DimeNet++ and GemNet the basis functions are linearly weighted twice in the model rather than just once.

Is there a reason for this, given that the product of two matrices can be represented by a single matrix?

I did notice the dimension reduction between the two matrix multiplications. This lowers the number of parameters, but are there other reasons?
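
The parameter saving from the intermediate dimension reduction is easy to quantify. A quick back-of-the-envelope sketch (the widths below are made up for illustration, not DimeNet's actual sizes):

```python
# Hypothetical layer widths, chosen only to illustrate the point.
d_in, d_out = 128, 128   # input/output feature widths (assumed)
k = 8                    # bottleneck width of the intermediate projection

params_single = d_in * d_out             # one dense weight matrix
params_factored = d_in * k + k * d_out   # two matrices with a bottleneck

print(params_single)    # 16384
print(params_factored)  # 2048
```

With a narrow bottleneck the factored form uses far fewer parameters, even though mathematically the composition is still a single linear map of (at most) rank k.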

Detail property values in QM9

Where did you obtain the QM9 data?

For example, in qm9_eV.npz, the first entry corresponds to CH4 in QM9, and its U0, U, H, G values are -17.172180532853258, -17.286822102173918, -17.389653929960044, -16.151916825255373 (in eV?), respectively.

In QM9, the U0, U, H, G values are -1101.487799054311, -1101.4097567985575, -1101.384042038555, -1102.0229653876108 in eV, respectively, so neither the eV values nor the Hartree values match the qm9_eV.npz data.

Or was some preprocessing applied?
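
One possible explanation (an assumption on my part, not confirmed here) is that the stored values are atomization energies: isolated-atom reference energies are subtracted from the raw QM9 totals, and the result is converted from Hartree to eV. A sketch of that preprocessing, with made-up reference energies:

```python
HARTREE_TO_EV = 27.211386  # CODATA conversion factor

def atomization_energy_ev(total_ha, atom_refs_ha, symbols):
    """Subtract per-atom reference energies (in Hartree), then convert to eV."""
    ref = sum(atom_refs_ha[s] for s in symbols)
    return (total_ha - ref) * HARTREE_TO_EV

# Illustrative numbers only -- NOT the actual QM9 isolated-atom energies.
refs = {"H": -0.5, "C": -37.8}
e = atomization_energy_ev(-40.5, refs, ["C", "H", "H", "H", "H"])
print(e)  # roughly -19.05 eV for these made-up inputs
```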

KeyError: 'ceil'

model = DimeNet(emb_size=emb_size, num_blocks=num_blocks, num_bilinear=num_bilinear,
                num_spherical=num_spherical, num_radial=num_radial,
                cutoff=cutoff, envelope_exponent=envelope_exponent,
                num_before_skip=num_before_skip, num_after_skip=num_after_skip,
                num_dense_output=num_dense_output, num_targets=len(targets),
                activation=swish)

When I try to run this, I get the following error:


KeyError Traceback (most recent call last)

in ()
4 num_before_skip=num_before_skip, num_after_skip=num_after_skip,
5 num_dense_output=num_dense_output, num_targets=len(targets),
----> 6 activation=swish)

4 frames

/content/drive/My Drive/dimenet/dimenet/model/dimenet.py in __init__(self, emb_size, num_blocks, num_bilinear, num_spherical, num_radial, cutoff, envelope_exponent, num_before_skip, num_after_skip, num_dense_output, num_targets, activation, name, **kwargs)
55 num_radial, cutoff=cutoff, envelope_exponent=envelope_exponent)
56 self.sbf_layer = SphericalBasisLayer(
---> 57 num_spherical, num_radial, cutoff=cutoff, envelope_exponent=envelope_exponent)
58
59 # Embedding and first output block

/content/drive/My Drive/dimenet/dimenet/model/layers/spherical_basis_layer.py in __init__(self, num_spherical, num_radial, cutoff, envelope_exponent, name, **kwargs)
30 for i in range(num_spherical):
31 if i == 0:
---> 32 first_sph = sym.lambdify([theta], self.sph_harm_formulas[i][0], 'tensorflow')(0)
33 self.sph_funcs.append(lambda tensor: tf.zeros_like(tensor) + first_sph)
34 else:

/usr/local/lib/python3.6/dist-packages/sympy/utilities/lambdify.py in lambdify(args, expr, modules, printer, use_imps, dummify)
377 namespace = {}
378 for m in namespaces[::-1]:
--> 379 buf = _get_namespace(m)
380 namespace.update(buf)
381

/usr/local/lib/python3.6/dist-packages/sympy/utilities/lambdify.py in _get_namespace(m)
469 """
470 if isinstance(m, str):
--> 471 _import(m)
472 return MODULES[m][0]
473 elif isinstance(m, dict):

/usr/local/lib/python3.6/dist-packages/sympy/utilities/lambdify.py in _import(module, reload)
162 # Add translated names to namespace
163 for sympyname, translation in translations.items():
--> 164 namespace[sympyname] = namespace[translation]
165
166 # For computing the modulus of a sympy expression we use the builtin abs

KeyError: 'ceil'

I ran this on Colab; to resolve the error I needed to update SciPy to version 1.5.

Incorporating DimeNet++ into LAMMPS

Hi, I'm working on creating an interface between DimeNet and LAMMPS to carry out molecular dynamics simulations, and it seems that I need to convert the DimeNet code into a C++ version, which is troubling me. I want to know whether you have done similar work on this topic, or could you provide some guidance?
Thanks a lot!

Training time

Thank you for the great work.
How long did it take to train a single-task model? And what early-stopping value did you choose?

MD17 training problem

In your paper, you tested the model on the harder task of using only 1000 training samples. How did you select those 1000 samples: by random sampling, by taking the first 1000 configurations in the timeline, or by some other sampling method? Also, how many samples did you use for validation and testing? I'd be glad if you could help.

Periodic DimeNet

Hi, very nice work on this! :)

I've been exploring the ML/deep learning landscape to find some inspiration for cool ideas that would be nice to play with during my PhD in materials science. I've seen lots of implementations of deep learning for molecules, but not so much for periodic structures such as crystals.

I would like to know if you have given any thought to how periodic boundary conditions could work in a GNN, and specifically in DimeNet. Maybe you have already implemented it and I have failed to find it (in that case, excuse me). I have some intuition about it, but I would like to know your thoughts, if it's not too much to ask.

From what I understood in your paper, the information about the atoms/nodes positions is only "stored" at the bonds/edges, encoded as the angles and bond lengths. Is this right? If so, my intuition is that, given a periodic system like this one:

[figure: sketch of a periodic system with numbered atoms and bonds]

you can say that, at the left border, atom 1 is effectively connected to atom 4 through a connection in the direction of bond 8 in this drawing. Then, in my naive view, this should fully account for the periodicity of the system, because atom 4 contains the information about the rest of the structure, and a kind of loop is created there.

I'd like to know if you think this makes sense; if not, I would appreciate it if you could share the reasons why it won't work.

Thanks in advance!
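
The cross-border connection described above is essentially the minimum-image convention that periodic neighbor lists use. A minimal numpy sketch for a cubic cell (cell size and coordinates are made up):

```python
import numpy as np

# Minimum-image convention in a cubic cell: the edge between two atoms goes
# through the nearest periodic image, which is how graphs for crystals are
# commonly built. Cell size and positions here are arbitrary.
L = 5.0                            # cubic cell edge length
pos = np.array([[0.2, 0.0, 0.0],   # atom near the left border
                [4.8, 0.0, 0.0]])  # atom near the right border

d = pos[1] - pos[0]
d -= L * np.round(d / L)           # wrap displacement into [-L/2, L/2)
dist = np.linalg.norm(d)
print(dist)  # ~0.4, not 4.6: the atoms are neighbors across the boundary
```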

GPU training problem

Would you mind posting something more concrete about the environment configuration? I ran into problems when trying to train DimeNet++ on a GPU; it seems the versions of the dependencies conflict with each other.

By the way, I am wondering whether setting the extensive option to True is reasonable when training on HOMO/LUMO/gap. It seems that mean aggregation should match the physics for these three properties, since they are intensive rather than extensive.
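
The distinction raised here, sum (extensive) versus mean (intensive) readout, can be illustrated with made-up per-atom contributions:

```python
import numpy as np

# Per-atom contributions for a small and a large molecule (made-up values).
small = np.array([0.5, 0.5])
large = np.array([0.5] * 10)

# Extensive readout (sum) scales with atom count -- sensible for total energy.
# Intensive readout (mean) does not -- arguably better for HOMO/LUMO/gap.
print(small.sum(), large.sum())    # 1.0 vs 5.0
print(small.mean(), large.mean())  # 0.5 vs 0.5
```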

Questions w.r.t MD17

Hi, this is very nice work. However, I have some questions about the results on the MD17 dataset that I couldn't find in the papers. First, what is the cutoff radius for this dataset? Second, is the energy benchmark the MAE per molecule or the MAE per atom? And is the force benchmark the MAE per molecule, per atom, or per atom per component? If the results are MAE per molecule, then when the framework is applied to a supercell of a crystal, the MAE may become very large, since the number of atoms is very large. This confuses me. Thank you very much.

How can I extract the final layer (vector) of the pre-trained DimeNet?

I would like to use the final layer of the pre-trained DimeNet as the input vector for other prediction tasks; in other words, this is a transfer learning proposed in the following paper.

https://pubs.acs.org/doi/abs/10.1021/acs.jpca.0c06231

This approach is very useful, but the above authors did not use DimeNet. I think DimeNet would be more effective for such a transfer learning task.

How can I extract the final layer of the pre-trained DimeNet? Specifically, given the following molecule data as input,

F 0.015 0.06 -0.02
C -0.02 1.39 0.01
F 1.24 1.84 -0.02
F -0.64 1.82 -1.08
C -0.70 1.86 1.22
C -1.26 2.25 2.20
H -1.76 2.59 3.08

how can I obtain the final layer for this molecule from your code?
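
Independent of this repository's exact API (which is not reproduced here), extracting such a representation generally means running the model up to, but not including, the final output projection. A framework-agnostic toy sketch with hypothetical stand-in layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the network's blocks; in the real model these
# would be the embedding/interaction blocks and the final output dense layer.
def interaction_blocks(x):
    return np.tanh(x)            # placeholder for the message-passing stack

W_out = rng.normal(size=(4, 1))  # final projection to the target property

def forward(x, return_embedding=False):
    h = interaction_blocks(x)    # molecule-level representation
    if return_embedding:
        return h                 # <- the vector to feed a downstream model
    return h @ W_out             # property prediction

x = rng.normal(size=(1, 4))
emb = forward(x, return_embedding=True)
print(emb.shape)  # (1, 4)
```

The same pattern applies in any framework: expose the activation just before the output block instead of the final prediction.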

RBF / SBF Known Issue

Hello, I saw the following mentioned in the known issues:

The radial basis functions in the interaction block actually use d_kj and not d_ji. The best way to fix this is by just using d_ji instead of d_kj in the SBF and leaving the RBF unchanged (DimeNet and DimeNet++).

From the model diagram, I can understand the first sentence. However, I don't completely grasp the suggested fix (and why interaction_pp_block.py:59 is highlighted). Would you be able to explain the suggested fix a little further?

Thanks!

QM9

Dear authors,

I followed config.yaml and ran the code, but I only got a test logMAE of -4.2913 on QM9.

Is there a config that reproduces the logMAE reported in the paper?

Question about the Calculation of Angles

I found that the calculations of the angles used in the directional message passing are slightly different from the ones mentioned in your paper. Here the angles are calculated between R1 and R2, which are the angles between m_ji and m_ki:

https://github.com/klicperajo/dimenet/blob/bf725c33755cd6fb87661fe03956b5fb30889742/dimenet/model/dimenet.py#L81-L92

However, the angles defined in the paper are between m_ji and m_kj. I'm wondering which is the better way, or do they perhaps perform similarly in representing the directional information?
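
To make the difference between the two definitions concrete, here is a toy geometry (positions and vector conventions are my own assumptions, not necessarily the repo's):

```python
import numpy as np

def angle(u, v):
    """Angle between two direction vectors via the normalized dot product."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Made-up positions for atoms i, j, k to contrast the two definitions.
x_i = np.array([0.0, 0.0, 0.0])
x_j = np.array([1.0, 0.0, 0.0])
x_k = np.array([1.0, 1.0, 0.0])

a_ji_kj = angle(x_i - x_j, x_j - x_k)  # paper: angle between m_ji and m_kj
a_ji_ki = angle(x_i - x_j, x_i - x_k)  # code (as read above): m_ji vs m_ki
print(a_ji_kj, a_ji_ki)  # differ for this geometry (pi/2 vs pi/4)
```

The two definitions generally give different values, so they are not interchangeable even if both encode some directional information.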

Would love to see a matbench submission for DimeNet++

Note that I'm unaffiliated

See https://matbench.materialsproject.org/

Matbench is an ImageNet for materials science; a curated set of 13 supervised, pre-cleaned, ready-to-use ML tasks for benchmarking and fair comparison. The tasks span a wide domain of inorganic materials science applications including electronic, thermodynamic, mechanical, and thermal properties among crystals, 2D materials, disordered metals, and more.

The Matbench python package provides everything needed to use Matbench with your ML algorithm in ~10 lines of code or less.

Are loss values scaled?

While running DimeNet++ with the pretrained weights, I noticed that the MAE for certain targets seem to be scaled, for example:

  • The reported test MAE (in the paper) for homo is 24.6, but the MAE reported by predict.ipynb is 0.0246.
  • The reported test MAE (in the paper) for zpve is 1.21, but the MAE reported by predict.ipynb is 0.00121.

Am I correct in thinking that there is some scaling going on, and could you point me to where it occurs? (i.e. if it was done during featurization, reporting, or somewhere else) Thanks!
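
A factor-of-1000 pattern like this is consistent with a meV-versus-eV unit difference (my reading of the numbers above, not confirmed by the authors):

```python
# Paper values are commonly reported in meV; the notebook may report eV.
paper_mae = {"homo": 24.6, "zpve": 1.21}          # values quoted above
notebook_mae = {"homo": 0.0246, "zpve": 0.00121}  # predict.ipynb values

for target in paper_mae:
    ratio = paper_mae[target] / notebook_mae[target]
    print(target, round(ratio))  # 1000 for both targets
```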

Prediction of H2 using the pre-trained DimeNet++ model

I trained the DimeNet++ model on the QM9 dataset and then used the pre-trained model to predict the hydrogen molecule H2, the simplest molecule, as a benchmark. However, I obtained a poor result for the atomization energy of H2. My setup may be wrong, so please try it and report your result.

Because the atomic distance (i.e. bond length) of H2 is 0.74 Å, the 3D structure of the hydrogen molecule can be written as

atom x y z
H 0.0 0.0 0.0
H 0.74 0.0 0.0

and the atomization energy is 4.54 eV (see https://wiki.fysik.dtu.dk/gpaw/dev/tutorials/H2/atomization.html). The prediction by the pre-trained DimeNet++ model, however, was about 9.79 eV, an error of 9.79 - 4.54 = 5.25 eV = 121 kcal/mol. This very poor result for the simplest molecule suggests something is wrong, because the DimeNet++ model learned and predicted the atomization energies of molecules in the QM9 dataset with an MAE of less than 0.01 eV = 0.23 kcal/mol.
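
For reference, the error arithmetic quoted above, using the standard eV to kcal/mol conversion:

```python
EV_TO_KCAL_PER_MOL = 23.0605  # standard conversion factor

predicted_ev = 9.79   # pre-trained DimeNet++ prediction reported above
reference_ev = 4.54   # GPAW reference atomization energy of H2

error_ev = predicted_ev - reference_ev
print(error_ev)                       # ~5.25 eV
print(error_ev * EV_TO_KCAL_PER_MOL)  # ~121 kcal/mol
```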

Probably the main reason is that the QM9 dataset does not include diatomic molecules such as H2, N2, and O2. But even if a machine learning model achieves a low MAE on the QM9 dataset, if its error for the simplest hydrogen molecule H2 is over 100 kcal/mol, can we really say that the model has captured the molecular energy?

Reimplementation in graph mode using TF1

Hi, I am trying to implement the model in graph mode using TensorFlow 1, and I am facing a problem that seems to stem from the model's dependence on eager mode for calculating certain features. The __getitem__ method of the data container receives the existing features as a batch and processes them to generate other fields of variable shape. In graph mode this is not feasible, since the data has to have the same shape for each data point. Do you think this is possible in graph mode? If so, how would you suggest implementing it?

Creation of Custom Dataset-OMDB

Dear All,

First of all, thank you for the effort to create this awesome model, Dimenet.

I have been working with SchNetPack to predict band gaps using the OMDB (Organic Materials Database) dataset. It is quite similar to QM9: an xyz file with coordinates, a csv file with band gaps, and so on. You can find the dataset below:

https://omdb.mathub.io/dataset

I am working with big molecules (consisting of around 140-150 atoms); that is why I am not able to use QM9, because my model does not give good results on the small molecules in QM9. The OMDB average is 82 atoms per molecule.

I wonder how I can create a custom dataset from OMDB to use with DimeNet? Do you have any plans to implement other datasets? I saw the qm9_eV.npz file in the "data" folder, but I am not sure how you created it.

I appreciate your help.
Mirac

Off-by-one error in QM9 data?

Hi @gasteigerjo, I'm looking at the file https://github.com/klicperajo/dimenet/blob/master/data/qm9_eV.npz and trying to recreate it from the original raw QM9 data. When I look at the list of uncharacterized molecules, i.e. the ones that failed to converge in DFT, I think there might be an off-by-one error. The QM9 dataset's molecule labels start at 1, and the first uncharacterized label is 58. For example, the U0 value in dsgdb9nsd_000058.xyz is -242.19573 Ha, and after subtracting the isolated-atom energies and converting to eV you get -34.008354871077934 eV. This matches the 58th entry (index 57) in the npz files you uploaded: -34.008354871077934. But that is exactly the entry that should have been excluded, because it is one of the unconverged molecules. This leads me to believe there might be an off-by-one error in your data-creation process, which would likely be repeated for all 3054 unconverged molecules. If not, maybe you could let me know where my thinking goes wrong?
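
If a 1-based exclusion list is applied to 0-based array indices, exactly the shift described here appears. A toy illustration (dataset size and list are made up):

```python
# QM9 molecule labels are 1-based (dsgdb9nsd_000058 -> label 58), while a
# 0-based array stores that molecule at index 57. Mixing the two conventions
# shifts every exclusion by one.
uncharacterized_labels = [58]  # first entry of the excluded list

n = 60                         # toy dataset size
correct = [i for i in range(n) if i + 1 not in uncharacterized_labels]
off_by_one = [i for i in range(n) if i not in uncharacterized_labels]

print(57 in correct)      # False: molecule 58 correctly excluded
print(57 in off_by_one)   # True: the unconverged molecule slips through
```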

Request for a pretrained model

Hi @klicperajo

As you mentioned, the model takes around 20 days to train on a single GPU (a 1080 Ti). It would be nice if you could share the pre-trained model so that one can simply restore the checkpoint and reproduce the results; otherwise, 20 days of training is a long time.

Thanks :)
