
ocp's Introduction

fairchem by FAIR Chemistry


fairchem is FAIR Chemistry's centralized repository of all its data, models, demos, and application efforts in materials science and quantum chemistry.

Documentation

If you are looking for Open-Catalyst-Project/ocp, it can now be found at fairchem.core. Visit its corresponding documentation here.

Contents

The repository is organized into several directories to help you find what you are looking for.

Installation

Packages can be installed in your environment by the following:

pip install -e packages/fairchem-{fairchem-package-name}

fairchem.core requires you to first create your environment.
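
For example, to install the core package in editable mode (fairchem-core is the sub-package corresponding to fairchem.core; substitute whichever sub-package you need):

pip install -e packages/fairchem-core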

Quick Start

Pretrained models can be used directly with ASE through our OCPCalculator interface:

from ase.build import fcc100, add_adsorbate, molecule
from ase.optimize import LBFGS
from fairchem.core import OCPCalculator

# Set up your system as an ASE atoms object
slab = fcc100('Cu', (3, 3, 3), vacuum=8)
adsorbate = molecule("CO")
add_adsorbate(slab, adsorbate, 2.0, 'bridge')

calc = OCPCalculator(
    model_name="EquiformerV2-31M-S2EF-OC20-All+MD",
    local_cache="pretrained_models",
    cpu=False,
)
slab.calc = calc

# Set up an LBFGS dynamics object and relax to fmax = 0.05 eV/Å (at most 100 steps)
dyn = LBFGS(slab)
dyn.run(fmax=0.05, steps=100)
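
Once the relaxation finishes, results are available through the standard ASE interface; this is plain ASE usage rather than anything fairchem-specific:

energy = slab.get_potential_energy()  # relaxed potential energy in eV
forces = slab.get_forces()            # per-atom forces in eV/Å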

If you are interested in training your own models or fine-tuning on your datasets, visit the documentation for more details and examples.

Why a single repository?

Since many of our projects rely heavily on one another, a single repository makes it much easier to test and ensure consistency across them. It should also simplify installation for users who want to integrate several of these efforts in one place.

LICENSE

fairchem is available under an MIT License.

ocp's People

Contributors

aarongarrison, abhshkdz, adeeshkolluru, anuroopsriram, brookwander, clz55, dependabot[bot], emsunshine, gasteigerjo, ianbenlolo, janiceblue, jmusiel, joshes, junwoony, kruskallin, ktran9891, lbluque, misko, mshuaibii, nianhant, nimashoghi, r-barnes, rayg1234, sgbaird, sidgoyal78, txie-93, weihua916, wood-b, zulissi, zulissimeta


ocp's Issues

installation of ocp-models

Hello OCP team,

I was interested in using your OCP models with my own dataset.
However, I have an issue at the installation step: the ocp-models environment doesn't activate (conda activate ocp-models). I followed the installation procedure you provided here: https://github.com/Open-Catalyst-Project/ocp. I am trying to install on a CPU machine, so I downloaded your env.common.yml and env.cpu.yml. The following steps seemed to work:

"conda-merge env.common.yml env.cpu.yml > env.yml"
"conda env create -f env.yml"

However, it seems that the conda environment for ocp-models was never created.
Is there something I missed? I would really appreciate it if you could help with my installation issue.

Include some estimate of run-time in Leaderboard

This looks like a great project and I am excited to see how it develops.

In the dataset paper, Table 2 doesn't give an estimate of speed. The speed-accuracy trade-off is perhaps the most critical aspect of an S2EF model, and I was wondering whether the leaderboard is going to take this into account somehow. This benchmark paper, doi:10.1021/acs.jpca.9b08723 (with which I am not associated), shows the trade-off clearly, and a leaderboard exposing it would be very informative.

Can this be added to the project?
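
Until the leaderboard reports it, a rough per-structure inference time can be measured directly. A minimal sketch using the OCPCalculator from the Quick Start above (the model name and cache path are simply the ones shown there; timings will of course depend on hardware and system size):

import time
from ase.build import fcc100
from fairchem.core import OCPCalculator

slab = fcc100("Cu", (3, 3, 3), vacuum=8)
slab.calc = OCPCalculator(
    model_name="EquiformerV2-31M-S2EF-OC20-All+MD",
    local_cache="pretrained_models",
    cpu=True,
)

slab.get_potential_energy()  # warm-up call (model load, graph construction)
n, start = 10, time.perf_counter()
for _ in range(n):
    slab.rattle(stdev=1e-4)  # perturb positions so each call is a fresh evaluation
    slab.get_potential_energy()
print(f"{(time.perf_counter() - start) / n * 1e3:.1f} ms per single-point")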

Error with Schnet notebook

Hi,

I've tried to run the demo notebook for training SchNet, https://github.com/Open-Catalyst-Project/ocp/blob/master/docs/source/tutorials/train_s2ef_example.ipynb, and encountered an error when running cell 11, trainer.train():

It says it cannot pickle an Environment object. For reference, I followed the instructions to install the repo for CPU-only machines.

Thanks
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_29512/4032920361.py in
----> 1 trainer.train()

c:\users\ocp\ocp-master\ocpmodels\trainers\forces_trainer.py in train(self, disable_eval_tqdm)
401 self.train_sampler.set_epoch(epoch_int)
402 skip_steps = self.step % len(self.train_loader)
--> 403 train_loader_iter = iter(self.train_loader)
404
405 for i in range(skip_steps, len(self.train_loader)):

~\miniconda3\envs\ocp-models\lib\site-packages\torch\utils\data\dataloader.py in iter(self)
353 return self._iterator
354 else:
--> 355 return self._get_iterator()
356
357 @Property

~\miniconda3\envs\ocp-models\lib\site-packages\torch\utils\data\dataloader.py in _get_iterator(self)
299 else:
300 self.check_worker_number_rationality()
--> 301 return _MultiProcessingDataLoaderIter(self)
302
303 @Property

~\miniconda3\envs\ocp-models\lib\site-packages\torch\utils\data\dataloader.py in init(self, loader)
912 # before it starts, and del tries to join but will get:
913 # AssertionError: can only join a started process.
--> 914 w.start()
915 self._index_queues.append(index_queue)
916 self._workers.append(w)

~\miniconda3\envs\ocp-models\lib\multiprocessing\process.py in start(self)
119 'daemonic processes are not allowed to have children'
120 _cleanup()
--> 121 self._popen = self._Popen(self)
122 self._sentinel = self._popen.sentinel
123 # Avoid a refcycle if the target function holds an indirect

~\miniconda3\envs\ocp-models\lib\multiprocessing\context.py in _Popen(process_obj)
222 @staticmethod
223 def _Popen(process_obj):
--> 224 return _default_context.get_context().Process._Popen(process_obj)
225
226 class DefaultContext(BaseContext):

~\miniconda3\envs\ocp-models\lib\multiprocessing\context.py in _Popen(process_obj)
325 def _Popen(process_obj):
326 from .popen_spawn_win32 import Popen
--> 327 return Popen(process_obj)
328
329 class SpawnContext(BaseContext):

~\miniconda3\envs\ocp-models\lib\multiprocessing\popen_spawn_win32.py in init(self, process_obj)
91 try:
92 reduction.dump(prep_data, to_child)
---> 93 reduction.dump(process_obj, to_child)
94 finally:
95 set_spawning_popen(None)

~\miniconda3\envs\ocp-models\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)
61
62 #

TypeError: cannot pickle 'Environment' object
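
Not an official fix, but the trace shows Windows spawning DataLoader workers (Windows uses spawn, not fork), which requires pickling the dataset, and the LMDB Environment handle inside it is not picklable. A common workaround is to disable worker processes in the optim section of the config:

optim:
  num_workers: 0  # no worker processes, so nothing needs to pickle the LMDB Environment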

How to use the S2EF model as an ASE calculator with CUDA (GPU)

Hello! I am starting to use the Open-Catalyst-Project neural network potentials, and they are brilliant.

I learned how from https://github.com/Open-Catalyst-Project/ocp/blob/master/tutorials/OCP_Tutorial.ipynb,
and found it working well with a GPU, RTX 2080 / 3090 (CUDA).
However, I would like to use the OCP NNP combined with ASE and CUDA,
while so far I could only use the NNP with ASE on the CPU,
maybe because there is no CUDA-related setting in ocpmodels/common/relaxation/ase_utils.py.

So, if it is not too much trouble, may I ask a favor: could ase_utils.py be modified to support CUDA?

Best regards,
Teruo
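
For reference, the newer interface shown in the Quick Start above exposes a cpu flag on the calculator, so GPU inference through ASE looks like this (a sketch; the model name is the one from the Quick Start):

from fairchem.core import OCPCalculator

# cpu=False runs the underlying model on CUDA when a GPU is available
calc = OCPCalculator(
    model_name="EquiformerV2-31M-S2EF-OC20-All+MD",
    local_cache="pretrained_models",
    cpu=False,
)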

if specified, OCP test sets cause training to break

So we have a record of this bug: OCP test sets don't have targets available, so running validate() on them breaks the trainers. An example with the energy_trainer is shown below.

device 0:   0%| | 0/63 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 115, in <module>
    main(config)
  File "main.py", line 52, in main
    trainer.train()
  File "/home/jovyan/projects/ocp/ocpmodels/trainers/energy_trainer.py", line 270, in train
    self.validate(split="test", epoch=epoch)
  File "/home/jovyan/projects/ocp/ocpmodels/trainers/base_trainer.py", line 502, in validate
    loss = self._compute_loss(out, batch)
  File "/home/jovyan/projects/ocp/ocpmodels/trainers/energy_trainer.py", line 284, in _compute_loss
    [batch.y_relaxed.to(self.device) for batch in batch_list], dim=0
  File "/home/jovyan/projects/ocp/ocpmodels/trainers/energy_trainer.py", line 284, in <listcomp>
    [batch.y_relaxed.to(self.device) for batch in batch_list], dim=0
AttributeError: 'Batch' object has no attribute 'y_relaxed'

@abhshkdz we discussed not having validate() called on the test sets during training, and instead saving predictions with predict. Do we still want to move in that direction? I'll have some cycles to fix this tomorrow if no one else is already working on it. @anuroopsriram
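
A minimal sketch of the guard being discussed, assuming the trainers can check a batch list before computing losses (the helper name is hypothetical; the attribute name comes from the trace above):

def has_targets(batch_list):
    # OCP test LMDBs ship without labels, so y_relaxed is absent on their batches
    return all(hasattr(batch, "y_relaxed") for batch in batch_list)

validate() could then skip loss/metric bookkeeping (or fall back to predict) whenever this returns False.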

Cannot create an environment

Hi,

I am trying to create the ocp environment but got the following error.

Preparing transaction: done
Verifying transaction: done
Executing transaction: / By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html

done
Installing pip dependencies: / Ran pip subprocess with arguments:
['/home/boris/anaconda3/envs/ocp-models/bin/python', '-m', 'pip', 'install', '-U', '-r', '/home/boris/ocp/condaenv.wlfdfjny.requirements.txt']
Pip subprocess output:
Looking in links: https://pytorch-geometric.com/whl/torch-1.8.0+cu102.html
Collecting git+https://github.com/rusty1s/pytorch_geometric.git@4ea63d3 (from -r /home/boris/ocp/condaenv.wlfdfjny.requirements.txt (line 4))
  Cloning https://github.com/rusty1s/pytorch_geometric.git (to revision 4ea63d3) to /tmp/pip-req-build-lh_oslmq
  Resolved https://github.com/rusty1s/pytorch_geometric.git to commit 4ea63d3
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: Pillow in /home/boris/anaconda3/envs/ocp-models/lib/python3.8/site-packages (from -r /home/boris/ocp/condaenv.wlfdfjny.requirements.txt (line 2)) (8.4.0)
Collecting Pillow
  Using cached Pillow-9.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
Collecting demjson
  Using cached demjson-2.2.4.tar.gz (131 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'

Pip subprocess error:
  Running command git clone --filter=blob:none --quiet https://github.com/rusty1s/pytorch_geometric.git /tmp/pip-req-build-lh_oslmq
  WARNING: Did not find branch or tag '4ea63d3', assuming revision or ref.
  Running command git checkout -q 4ea63d3
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      error in demjson setup command: use_2to3 is invalid.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

failed

CondaEnvException: Pip failed

I didn't have any problems creating the environment before.
Could you please give me some advice?

Using distributed data parallel on single machine with multi-GPU

Hi, this issue is in reference to the conversation in #174 about performing inference/prediction on a single server, but using multiple GPUs to parallelize. I am attempting to run on an Ubuntu server with 2x 1080Ti and running into the following error:

$ python -u -m torch.distributed.launch --nproc_per_node=2 main.py --mode predict --config-yml configs/s2ef/200k/schnet/schnet.yml --checkpoint pretrained/schnet_200k.pt --num-gpu 2 --distributed

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
### Loading dataset: trajectory_lmdb
amp: false
cmd:
  checkpoint_dir: ./checkpoints/2021-01-13-11-14-08
  commit: 899de1c
  identifier: ''
  logs_dir: ./logs/tensorboard/2021-01-13-11-14-08
  print_every: 10
  results_dir: ./results/2021-01-13-11-14-08
  seed: 0
  timestamp: 2021-01-13-11-14-08
dataset:
  grad_target_mean: 0.0
  grad_target_std: 2.887317180633545
  normalize_labels: true
  src: data/s2ef/200k/train/
  target_mean: -0.7554450631141663
  target_std: 2.887317180633545
gpus: 2
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 1024
  num_filters: 256
  num_gaussians: 200
  num_interactions: 3
  use_pbc: true
optim:
  batch_size: 32
  eval_batch_size: 8
  force_coefficient: 100
  lr_gamma: 0.1
  lr_initial: 0.0005
  lr_milestones:
  - 5
  - 8
  - 10
  max_epochs: 30
  num_workers: 64
  warmup_epochs: 3
  warmup_factor: 0.2
task:
  dataset: trajectory_lmdb
  description: Regressing to energies and forces for DFT trajectories from OCP
  eval_on_free_atoms: true
  grad_input: atomic forces
  labels:
  - potential energy
  metric: mae
  train_on_free_atoms: true
  type: regression
test_dataset:
  src: data/s2ef/all/val_id/
val_dataset:
  src: data/s2ef/all/val_id/

### Loading dataset: trajectory_lmdb
### Loading model: schnet
### Loaded SchNet with 5704193 parameters.
NOTE: model gradient logging to tensorboard not yet supported.
### Loading checkpoint from: pretrained/schnet_200k.pt
### Loading checkpoint from: pretrained/schnet_200k.pt
### Predicting on test.
device 0:   0%| | 0/62500 [00:13<?, ?it/s]
device 1:   0%| | 0/62500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 140, in <module>
    Runner()(config)
  File "main.py", line 80, in __call__
    disable_tqdm=False,
  File "/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/forces_trainer.py", line 283, in predict
    out = self._forward(batch_list)
  File "/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/forces_trainer.py", line 454, in _forward
    out_energy, out_forces = self.model(batch_list)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/work/chaitanya/ocp-models/ocpmodels/common/data_parallel.py", line 50, in forward
    return self.module(batch_list[0].to(f"cuda:{self.device_ids[0]}"))
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/work/chaitanya/ocp-models/ocpmodels/models/schnet.py", line 116, in forward
    h = h + interaction(h, edge_index, edge_weight, edge_attr)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch_geometric/nn/models/schnet.py", line 301, in forward
    x = self.conv(x, edge_index, edge_weight, edge_attr)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch_geometric/nn/models/schnet.py", line 327, in forward
    x = self.propagate(edge_index, x=x, W=W)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 236, in propagate
    out = self.message(**msg_kwargs)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch_geometric/nn/models/schnet.py", line 332, in message
    return x_j * W
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.92 GiB total capacity; 242.76 MiB already allocated; 10.31 MiB free; 272.00 MiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/chaitanya/miniconda3/envs/ocp-models/bin/python', '-u', 'main.py', '--local_rank=1', '--mode', 'predict', '--config-yml', 'configs/s2ef/200k/schnet/schnet.yml', '--checkpoint', 'pretrained/schnet_200k.pt', '--num-gpu', '2', '--distributed']' returned non-zero exit status 1.

Running the code creates three processes: two on GPU0, one on GPU1. However, it crashes before any computation takes place.

I am able to run prediction on a single GPU, but I would like to leverage multiple GPUs to speed things up (e.g. tqdm estimates prediction on val_id will take about an hour).
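
The two-processes-on-GPU0 pattern usually means the second rank also creates a CUDA context on device 0 before moving tensors to its own device. A generic PyTorch-side sketch (not ocp-specific) that pins each launched process to its --local_rank before any CUDA work happens:

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args, _ = parser.parse_known_args()

# Bind this process to its own GPU first, so no stray context lands on cuda:0
torch.cuda.set_device(args.local_rank)
device = torch.device(f"cuda:{args.local_rank}")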

CUDA illegal memory access error when running relaxations

When running relaxations, it will sometimes run into a CUDA illegal memory access error. This seems to be stochastic, as it doesn't consistently happen on the same relaxation.

Example command where I encountered the error: python main.py --mode run-relaxations --config-yml configs/s2ef/all/dimenet_plus_plus/dpp_forceonly.yml --checkpoint /checkpoint/abhshkdz/ocp_baselines_run/checkpoints/2020-11-08-18-31-31-dpp1.8M_forceonly_all_restart_ep4.5/checkpoint.pt --distributed --num-gpus 8 --num-nodes 8 --slurm-partition learnfair --submit

Stack trace:

[omitted: submitit methods]
  File "main.py", line 91, in __call__
    trainer.run_relaxations()
  File "/private/home/janlan/code/ocp-models/ocpmodels/trainers/forces_trainer.py", line 655, in run_relaxations
    transform=None,
  File "/private/home/janlan/code/ocp-models/ocpmodels/common/relaxation/ml_relaxation.py", line 57, in ml_relax
    relaxed_batch = optimizer.run(fmax=fmax, steps=steps)
  File "/private/home/janlan/code/ocp-models/ocpmodels/common/relaxation/optimizers/lbfgs_torch.py", line 90, in run
    r0, f0, e0 = self.step(iteration, r0, f0, H0, rho, s, y)
  File "/private/home/janlan/code/ocp-models/ocpmodels/common/relaxation/optimizers/lbfgs_torch.py", line 123, in step
    e, f = self.get_forces()
  File "/private/home/janlan/code/ocp-models/ocpmodels/common/relaxation/optimizers/lbfgs_torch.py", line 52, in get_forces
    energy, forces = self.model.get_forces(self.atoms, apply_constraint)
  File "/private/home/janlan/code/ocp-models/ocpmodels/common/relaxation/optimizers/lbfgs_torch.py", line 161, in get_forces
    predictions = self.model.predict(atoms, per_image=False)
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/janlan/code/ocp-models/ocpmodels/trainers/forces_trainer.py", line 284, in predict
    out = self._forward(batch_list)
  File "/private/home/janlan/code/ocp-models/ocpmodels/trainers/forces_trainer.py", line 485, in _forward
    out_energy, out_forces = self.model(batch_list)
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/private/home/janlan/code/ocp-models/ocpmodels/common/data_parallel.py", line 50, in forward
    return self.module(batch_list[0].to(f"cuda:{self.device_ids[0]}"))
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/private/home/janlan/code/ocp-models/ocpmodels/models/dimenet_plus_plus.py", line 453, in forward
    energy = self._forward(data)
  File "/private/home/janlan/code/ocp-models/ocpmodels/common/utils.py", line 90, in cls_method
    return f(self, *args, **kwargs)
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/janlan/code/ocp-models/ocpmodels/models/dimenet_plus_plus.py", line 411, in _forward
    edge_index, num_nodes=data.atomic_numbers.size(0)
  File "/private/home/janlan/code/ocp-models/ocpmodels/models/dimenet_plus_plus.py", line 311, in triplets
    adj_t_row = adj_t[row]
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch_sparse/tensor.py", line 467, in __getitem__
    out = out.index_select(dim, item)
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch_sparse/index_select.py", line 98, in <lambda>
    SparseTensor.index_select = lambda self, dim, idx: index_select(self, dim, idx)
  File "/private/home/sidgoyal/.conda/envs/ocp-models/lib/python3.6/site-packages/torch_sparse/index_select.py", line 24, in index_select
    device=col.device).repeat_interleave(rowcount)
RuntimeError: CUDA error: an illegal memory access was encountered
NCCL error in: /opt/conda/conda-bld/pytorch_1595629416375/work/torch/lib/c10d/../c10d/NCCLUtils.hpp:69, unhandled cuda error, NCCL version 2.4.8

cc @abhshkdz

Loading npz dictionaries using `allow_pickle=True` poses security risk

We save out forces (in S2EF) and positions (in IS2RS) as numpy object arrays, which requires np.load(..., allow_pickle=True) when loading. This is unsafe, since it can allow users to execute malicious code (on the evaluation server, for example).

https://github.com/Open-Catalyst-Project/ocp/blob/master/ocpmodels/trainers/forces_trainer.py#L329
https://github.com/Open-Catalyst-Project/ocp/blob/master/ocpmodels/trainers/forces_trainer.py#L688
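
One pickle-free alternative (a sketch, not necessarily the fix the project adopted) is to flatten the ragged per-system arrays and store the chunk lengths alongside, so the files load with the default allow_pickle=False:

import numpy as np

forces = [np.random.rand(n, 3) for n in (5, 8, 3)]  # ragged per-system forces (dummy data)
natoms = np.array([len(f) for f in forces])

np.savez("forces.npz", forces=np.concatenate(forces), natoms=natoms)

data = np.load("forces.npz")  # allow_pickle defaults to False
per_system = np.split(data["forces"], np.cumsum(data["natoms"])[:-1])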

Error Loading pre-trained checkpoints

Hi! It seems that updates to the codebase have made the previously released pre-trained checkpoints incompatible with the current models. Here, I tried loading a pre-trained SchNet downloaded for S2EF from MODELS.md, following the instructions in TRAIN.md:

$ python main.py --mode predict --config-yml configs/s2ef/200k/schnet/schnet.yml --checkpoint pretrained/schnet_200k.pt 
amp: false
cmd:
  checkpoint_dir: ./checkpoints/2021-01-08-12-16-00
  identifier: ''
  logs_dir: ./logs/tensorboard/2021-01-08-12-16-00
  print_every: 10
  results_dir: ./results/2021-01-08-12-16-00
  seed: 0
  timestamp: 2021-01-08-12-16-00
dataset:
  grad_target_mean: 0.0
  grad_target_std: 2.887317180633545
  normalize_labels: true
  src: data/s2ef/200k/train/
  target_mean: -0.7554450631141663
  target_std: 2.887317180633545
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 1024
  num_filters: 256
  num_gaussians: 200
  num_interactions: 3
  use_pbc: true
optim:
  batch_size: 32
  eval_batch_size: 16
  force_coefficient: 100
  lr_gamma: 0.1
  lr_initial: 0.0005
  lr_milestones:
  - 5
  - 8
  - 10
  max_epochs: 30
  num_workers: 64
  warmup_epochs: 3
  warmup_factor: 0.2
task:
  dataset: trajectory_lmdb
  description: Regressing to energies and forces for DFT trajectories from OCP
  eval_on_free_atoms: true
  grad_input: atomic forces
  labels:
  - potential energy
  metric: mae
  train_on_free_atoms: true
  type: regression
test_dataset:
  src: data/s2ef/all/val_id/
val_dataset:
  src: data/s2ef/all/val_id/

### Loading dataset: trajectory_lmdb
### Loading model: schnet
### Loaded SchNet with 5704193 parameters.
NOTE: model gradient logging to tensorboard not yet supported.
### Loading checkpoint from: pretrained/schnet_200k.pt
Traceback (most recent call last):
  File "main.py", line 140, in <module>
    Runner()(config)
  File "main.py", line 59, in __call__
    trainer.load_pretrained(config["checkpoint"])
  File "/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/base_trainer.py", line 352, in load_pretrained
    self.model.load_state_dict(checkpoint["state_dict"])
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for OCPDataParallel:
        Missing key(s) in state_dict: "module.atomic_mass", "module.embedding.weight", "module.distance_expansion.offset", "module.interactions.0.mlp.0.weight", "module.interactions.0.mlp.0.bias", "module.interactions.0.mlp.2.weight", "module.interactions.0.mlp.2.bias", "module.interactions.0.conv.lin1.weight", "module.interactions.0.conv.lin2.weight", "module.interactions.0.conv.lin2.bias", "module.interactions.0.conv.nn.0.weight", "module.interactions.0.conv.nn.0.bias", "module.interactions.0.conv.nn.2.weight", "module.interactions.0.conv.nn.2.bias", "module.interactions.0.lin.weight", "module.interactions.0.lin.bias", "module.interactions.1.mlp.0.weight", "module.interactions.1.mlp.0.bias", "module.interactions.1.mlp.2.weight", "module.interactions.1.mlp.2.bias", "module.interactions.1.conv.lin1.weight", "module.interactions.1.conv.lin2.weight", "module.interactions.1.conv.lin2.bias", "module.interactions.1.conv.nn.0.weight", "module.interactions.1.conv.nn.0.bias", "module.interactions.1.conv.nn.2.weight", "module.interactions.1.conv.nn.2.bias", "module.interactions.1.lin.weight", "module.interactions.1.lin.bias", "module.interactions.2.mlp.0.weight", "module.interactions.2.mlp.0.bias", "module.interactions.2.mlp.2.weight", "module.interactions.2.mlp.2.bias", "module.interactions.2.conv.lin1.weight", "module.interactions.2.conv.lin2.weight", "module.interactions.2.conv.lin2.bias", "module.interactions.2.conv.nn.0.weight", "module.interactions.2.conv.nn.0.bias", "module.interactions.2.conv.nn.2.weight", "module.interactions.2.conv.nn.2.bias", "module.interactions.2.lin.weight", "module.interactions.2.lin.bias", "module.lin1.weight", "module.lin1.bias", "module.lin2.weight", "module.lin2.bias". 
        Unexpected key(s) in state_dict: "module.module.atomic_mass", "module.module.embedding.weight", "module.module.distance_expansion.offset", "module.module.interactions.0.mlp.0.weight", "module.module.interactions.0.mlp.0.bias", "module.module.interactions.0.mlp.2.weight", "module.module.interactions.0.mlp.2.bias", "module.module.interactions.0.conv.lin1.weight", "module.module.interactions.0.conv.lin2.weight", "module.module.interactions.0.conv.lin2.bias", "module.module.interactions.0.conv.nn.0.weight", "module.module.interactions.0.conv.nn.0.bias", "module.module.interactions.0.conv.nn.2.weight", "module.module.interactions.0.conv.nn.2.bias", "module.module.interactions.0.lin.weight", "module.module.interactions.0.lin.bias", "module.module.interactions.1.mlp.0.weight", "module.module.interactions.1.mlp.0.bias", "module.module.interactions.1.mlp.2.weight", "module.module.interactions.1.mlp.2.bias", "module.module.interactions.1.conv.lin1.weight", "module.module.interactions.1.conv.lin2.weight", "module.module.interactions.1.conv.lin2.bias", "module.module.interactions.1.conv.nn.0.weight", "module.module.interactions.1.conv.nn.0.bias", "module.module.interactions.1.conv.nn.2.weight", "module.module.interactions.1.conv.nn.2.bias", "module.module.interactions.1.lin.weight", "module.module.interactions.1.lin.bias", "module.module.interactions.2.mlp.0.weight", "module.module.interactions.2.mlp.0.bias", "module.module.interactions.2.mlp.2.weight", "module.module.interactions.2.mlp.2.bias", "module.module.interactions.2.conv.lin1.weight", "module.module.interactions.2.conv.lin2.weight", "module.module.interactions.2.conv.lin2.bias", "module.module.interactions.2.conv.nn.0.weight", "module.module.interactions.2.conv.nn.0.bias", "module.module.interactions.2.conv.nn.2.weight", "module.module.interactions.2.conv.nn.2.bias", "module.module.interactions.2.lin.weight", "module.module.interactions.2.lin.bias", "module.module.lin1.weight", "module.module.lin1.bias", "module.module.lin2.weight", "module.module.lin2.bias".

Any idea for implementation of GemNet-Q

Hi! Thanks for your great implementation of GemNet-T. I wonder if you are planning an implementation of GemNet-Q. I have actually tried to implement GemNet-Q starting from GemNet-T, but I found that the number of quadruplets in OCP is very large, e.g., I can obtain 18,000,000+ quadruplets from 390,000+ triplets (for one batch), which is difficult to train on a single card. Any ideas about the implementation?

GemNet not saving checkpoint for s2ef task

Hi,

When I use GemNet for the s2ef task, the program creates the checkpoint folder, but no checkpoint file is saved even after training is done. For SchNet, on the other hand, the checkpoint files are saved properly.

Thanks,

Mingjie

Missing keyword 'epoch' in validate method

Hi! When I run the script

python main.py --mode train --config-yml configs/is2re/10k/schnet/schnet.yml

it throws the following exception:

Traceback (most recent call last):
  File "main.py", line 126, in <module>
    Runner()(config)
  File "main.py", line 66, in __call__
    self.task.run()
  File "/home/grads/k/kruskallin/ocp/ocpmodels/tasks/task.py", line 35, in run
    self.trainer.train(
  File "/home/grads/k/kruskallin/ocp/ocpmodels/trainers/energy_trainer.py", line 332, in train
    val_metrics = self.validate(
  File "/home/grads/k/kruskallin/anaconda3/envs/ocp-models/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
TypeError: validate() got an unexpected keyword argument 'epoch'

I tracked it through the files, and it shows that the validate method in base_trainer.py is missing the keyword 'epoch'.

Usage of EMA

Hi,
How do I use EMA in the new version?
I have set ema_decay: 0.999 in the config.
During the validation step in training, which weights are used?
How do I make predictions using the EMA weights?
Does it work for both checkpoint.pt and best_checkpoint.pt?

Thanks

`run_relaxations` computes predictions on some systems multiple times

ML-driven relaxation on the test split saves out predictions on 99,904 systems in total, when it should have been 99,835. Some systems were repeated (the sid field that gets saved shows those repeats). Complete breakdown below.

Prediction count:

  • ID: 24960
  • OOD_ADS: 24960
  • OOD_CAT: 24992
  • OOD_BOTH: 24992

Just looking at these numbers, it seems to have something to do with the world size and the distributed sampler (I was running on 4 nodes with 8 GPUs each); see the note after the breakdown below.

Actual split sizes, verified against LMDBs:

  • ID: 24951
  • OOD_ADS: 24931
  • OOD_CAT: 24967
  • OOD_BOTH: 24986
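
The padded counts are consistent with a DistributedSampler rounding each split up to a multiple of the world size (32 here: 24960 = 780 × 32, 24992 = 781 × 32). Until the sampler is fixed, the duplicates can be dropped after the fact by keeping the first prediction per sid; a sketch, assuming the saved predictions are a dict of parallel lists keyed by "ids":

import numpy as np

# preds: dict of parallel lists, one entry per saved prediction (shape is an assumption)
ids = np.asarray(preds["ids"])
_, first = np.unique(ids, return_index=True)  # index of first occurrence of each sid
keep = np.sort(first)                         # preserve the original ordering
preds = {key: [vals[i] for i in keep] for key, vals in preds.items()}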

Learning Rate Forcenet

Dear OCP Team,

I would like to better understand the learning rate scheduler when using ForceNet.
When I execute my training code, the output tells me that the learning rate is slowly increased, from 1.00e-4 to 4.56e-4, over 50 epochs. Why does this happen, and can I switch it off? Please find my code attached.
I would also like to better understand the learning rate setup as given in the forcenet configuration file at
ocp/blob/main/configs/s2ef/200k/forcenet/fn_forceonly.yml
There it says:
*** Important note ***
The total number of gpus used for this run was 8.
If the global batch size (num_gpus * batch_size) is modified
the lr_milestones and warmup_steps need to be adjusted accordingly.
[...]
lr_gamma: 0.1
lr_milestones: # steps at which lr_initial <- lr_initial * lr_gamma
- 15625
- 25000
- 31250
warmup_steps: 9375
warmup_factor: 0.2

How is a "step" defined that determines when the learning rate is reduced? How does the warmup work?

I also want to mention that ForceNet did not work immediately after installing ocp from the conda yaml files (ocp/tree/main/tests/models/test_forcenet.py failed). I solved this by upgrading to PyTorch 1.10 (the conda yaml file requests PyTorch 1.9) and reinstalling the pytorch-geometric components accordingly.

Many thanks in advance and best
Thorren
input_reaxFF_water_rescaled_1000_v2.extxyz.txt
learn_forcenet.py.txt

md5 checksum for s2ef_train_20M.tar

Hi

I just downloaded s2ef_train_2M.tar and s2ef_train_20M.tar and got the following MD5 checksums:

953474cb93f0b08cdc523399f03f7c36 s2ef_train_2M.tar
863bc983245ffc0285305a1850e19cf7  s2ef_train_20M.tar

So it seems that the checksum values in DATASET.md need to be swapped.
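
For anyone re-checking, a small hashlib sketch that computes the same digests:

import hashlib

def md5sum(path, chunk=1 << 20):
    # Stream the file in 1 MiB blocks so large tars don't need to fit in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

print(md5sum("s2ef_train_2M.tar"))   # 953474cb93f0b08cdc523399f03f7c36 as reported above
print(md5sum("s2ef_train_20M.tar"))  # 863bc983245ffc0285305a1850e19cf7 as reported above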

received 0 items of ancdata / Pin memory thread exited unexpectedly when running the 10k is2re schnet config

When I try to train on the simplest is2re dataset using SchNet, with this command line: python main.py --mode train --config-yml configs/is2re/10k/schnet/schnet.yml

I get the following errors :/

amp: false
cmd:
  checkpoint_dir: ./checkpoints/2020-10-20-15-51-25
  identifier: ''
  logs_dir: ./logs/tensorboard/2020-10-20-15-51-25
  print_every: 10
  results_dir: ./results/2020-10-20-15-51-25
  seed: 0
  timestamp: 2020-10-20-15-51-25
dataset:
  normalize_labels: true
  src: /data/ocp/is2re/10k/train/data.lmdb
  target_mean: -1.525913953781128
  target_std: 2.279365062713623
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 256
  num_filters: 128
  num_gaussians: 100
  num_interactions: 3
  regress_forces: false
  use_pbc: true
optim:
  batch_size: 64
  eval_batch_size: 32
  lr_gamma: 0.1
  lr_initial: 0.005
  lr_milestones:
  - 10
  - 15
  - 20
  max_epochs: 30
  num_workers: 32
  warmup_epochs: 3
  warmup_factor: 0.2
task:
  dataset: single_point_lmdb
  description: Relaxed state energy prediction from initial structure.
  labels:
  - relaxed energy
  metric: mae
  type: regression
val_dataset:
  src: /data/ocp/is2re/all/val_id/data.lmdb

### Loading dataset: single_point_lmdb
### Loading model: schnet
### Loaded SchNet with 541697 parameters.
NOTE: model gradient logging to tensorboard not yet supported.
energy_mae: 134.6836, energy_mse: 36312.4453, energy_within_threshold: 0.0000, loss: 59.0882, epoch: 0.0064
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

Traceback (most recent call last):
  File "main.py", line 115, in <module>
    main(config)
  File "main.py", line 52, in main
    trainer.train()
  File "/home/anis/OneDrive/Projects/3rdPartyLibs/ocp/ocpmodels/trainers/energy_trainer.py", line 236, in train
    for i, batch in enumerate(self.train_loader):
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/home/anis/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 936, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
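
"received 0 items of ancdata" is a generic PyTorch DataLoader symptom of exhausting file descriptors when workers share tensors over sockets. Two commonly suggested workarounds (generic PyTorch, not ocp-specific) are raising the open-file limit (ulimit -n) or switching the tensor sharing strategy before training starts:

import torch.multiprocessing

# Share tensors through the filesystem instead of file descriptors
torch.multiprocessing.set_sharing_strategy("file_system")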

Possible Custom Model/Registry Bug

Hello,

To help myself better understand the codebase, I tried copying the CGCNN.py model to a new CGCNN_throwaway.py model and registering a "cgcnn_throwaway" key in the registry. I then attempted to use this copy of the CGCNN model by referencing "cgcnn_throwaway" from a custom .yml file. I am doing this with the intent of adding my own custom models to OCP later.

Because the code is exactly the same between the base CGCNN.py and CGCNN_throwaway.py, with the exception of changing the class name from class CGCNN(BaseModel) to class CGCNN_Throwaway(BaseModel) and @registry.register_model("cgcnn") to @registry.register_model("cgcnn_throwaway"), I don't think the potential bug is in the construction of the CGCNN model. I can run the base CGCNN.py model that comes with the codebase with no issue.

This raises a confusing error, which motivated my belief that this could be a bug. I might just lack understanding about how to do this; if so, I would love some clarification. Thanks!

Error:

### Loading dataset: trajectory_lmdb
### Loading model: cgcnn_throwaway
### Loaded CGCNN_Throwaway with 245889 parameters.
NOTE: model gradient logging to tensorboard not yet supported.
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

Traceback (most recent call last):
  File "main.py", line 144, in <module>
    Runner()(config)
  File "main.py", line 69, in __call__
    trainer.train()
  File "/media/cameron/DATA/ocp/ocpmodels/trainers/forces_trainer.py", line 381, in train
    batch = next(train_loader_iter)
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/home/cameron/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1029, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
Attached are the CGCNN_throwaway.py, the corresponding cgcnn_throwaway.yml, base.yml, and the terminal printout.

Miscellaneous information: the database referenced in the .yml files was split from a 200k database downloaded using download_data.py. Edges were pre-computed. I also tried running python cgcnn_throwaway.py directly, to put the duplicate CGCNN model in the registry before training; same error.

Possible_Registry_Bug.zip

Unable to run demo notebook for SchNet

Hi! I was running the demo notebook for training the basic SchNet model after downloading the 200k S2EF training set, and I encountered the following error on running cell [9]:

amp: false
cmd:
  checkpoint_dir: ./checkpoints/2021-01-06-13-13-36-SchNet-example
  identifier: SchNet-example
  logs_dir: ./logs/tensorboard/2021-01-06-13-13-36-SchNet-example
  print_every: 10
  results_dir: ./results/2021-01-06-13-13-36-SchNet-example
  seed: 0
  timestamp: 2021-01-06-13-13-36-SchNet-example
dataset:
  normalize_labels: false
  src: /mnt/work/chaitanya/ocp-models/data/s2ef/200k/train
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 1024
  num_filters: 256
  num_gaussians: 200
  num_interactions: 3
optim:
  batch_size: 4
  eval_batch_size: 8
  force_coefficient: 100
  lr_gamma: 0.1
  lr_initial: 0.0001
  lr_milestones:
  - 15
  - 20
  max_epochs: 1
  num_workers: 8
  warmup_epochs: 10
  warmup_factor: 0.2
task:
  dataset: trajectory_lmdb
  description: Regressing to energies and forces for DFT trajectories from OCP
  eval_on_free_atoms: true
  grad_input: atomic forces
  labels:
  - potential energy
  metric: mae
  train_on_free_atoms: true
  type: regression
test_dataset:
  src: /mnt/work/chaitanya/ocp-models/data/s2ef/200k/train
val_dataset:
  src: /mnt/work/chaitanya/ocp-models/data/s2ef/200k/train

### Loading dataset: trajectory_lmdb
### Loading model: schnet
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-a7641ee43cda> in <module>
     12     logger="tensorboard", # logger of choice (tensorboard and wandb supported)
     13     local_rank=0,
---> 14     amp=False, # use PyTorch Automatic Mixed Precision (faster training and less memory usage)
     15 )

/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/forces_trainer.py in __init__(self, task, model, dataset, optimizer, identifier, run_dir, is_debug, is_vis, print_every, seed, logger, local_rank, amp, cpu)
     93             amp=amp,
     94             cpu=cpu,
---> 95             name="s2ef",
     96         )
     97 

/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/base_trainer.py in __init__(self, task, model, dataset, optimizer, identifier, run_dir, is_debug, is_vis, print_every, seed, logger, local_rank, amp, cpu, name)
    119         if distutils.is_master():
    120             print(yaml.dump(self.config, default_flow_style=False))
--> 121         self.load()
    122 
    123         self.evaluator = Evaluator(task=name)

/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/base_trainer.py in load(self)
    127         self.load_logger()
    128         self.load_task()
--> 129         self.load_model()
    130         self.load_criterion()
    131         self.load_optimizer()

/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/base_trainer.py in load_model(self)
    309             bond_feat_dim,
    310             self.num_targets,
--> 311             **self.config["model_attributes"],
    312         ).to(self.device)
    313 

TypeError: 'NoneType' object is not callable

I haven't edited anything in the notebook apart from the dataset paths, so I was wondering how I might fix this. The error seems related to the model configuration, as the model itself fails to load.
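
A "'NoneType' object is not callable" at the model-construction line usually means the registry lookup for the configured model name returned None, e.g. because the model module was never imported and so its register_model decorator never ran. A quick check, assuming the registry helpers have the names I recall from the ocp codebase (please verify against ocpmodels/common/):

from ocpmodels.common.registry import registry
from ocpmodels.common.utils import setup_imports

setup_imports()  # imports the model modules so their decorators register them
print(registry.get_model_class("schnet"))  # should print the model class, not None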

Energy information in test LMDBs for the IS2RE tasks

Hello,

I downloaded the datasets for the IS2RE tasks from this link: https://dl.fbaipublicfiles.com/opencatalystproject/data/is2res_train_val_test_lmdbs.tar.gz

I can access the y_init, y_relaxed, and pos_relaxed attributes in the data objects from the train/val LMDBs. However, the test LMDBs do not contain these attributes. Meanwhile, adsorbates important for CO2 reduction, like *CO, are contained in the test LMDBs, and their adsorption energies would be helpful for further research. Could I access these attributes for the test LMDBs?

Thank you in advance!

torch package issue

Hi,

I tried to run the code on our IBM Power system and have an issue with the torch_sparse package. Could you please give me some advice?

Singularity> python main.py --mode train --config-yml configs/s2ef/200k/base.yml 
Traceback (most recent call last):
  File "main.py", line 18, in <module>
    from ocpmodels.common.utils import (
  File "/root/ocp/ocpmodels/common/utils.py", line 27, in <module>
    from torch_geometric.utils import remove_self_loops
  File "/opt/miniconda/lib/python3.7/site-packages/torch_geometric/__init__.py", line 5, in <module>
    import torch_geometric.data
  File "/opt/miniconda/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/opt/miniconda/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/opt/miniconda/lib/python3.7/site-packages/torch_sparse/__init__.py", line 15, in <module>
    f'{library}_{suffix}', [osp.dirname(__file__)]).origin)
AttributeError: 'NoneType' object has no attribute 'origin'

and my conda env list is

# packages in environment at /opt/miniconda:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
absl-py                   0.10.0                   pypi_0    pypi
alabaster                 0.7.12                     py_0    conda-forge
appdirs                   1.4.4                      py_0  
argon2-cffi               20.1.0           py37h7b6447c_1  
ase                       3.21.1             pyhd8ed1ab_0    conda-forge
asn1crypto                1.3.0                    py37_1  
async_generator           1.10             py37h28b3542_0  
attrs                     20.2.0                     py_0  
babel                     2.9.1              pyh44b312d_0    conda-forge
backcall                  0.2.0                      py_0  
basemap                   1.2.0            py37h705c2d8_0  
black                     21.5b0             pyhd8ed1ab_0    conda-forge
blas                      1.0                    openblas  
bleach                    3.2.1                      py_0  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2020.12.5            h1084571_0    conda-forge
cachetools                4.1.1                    pypi_0    pypi
certifi                   2020.12.5        py37h35e4cab_1    conda-forge
cffi                      1.14.0           py37h2e261b9_0  
cfgv                      3.3.0                    pypi_0    pypi
chardet                   3.0.4                 py37_1003  
click                     8.0.1            py37h35e4cab_0    conda-forge
cloudpickle               1.6.0                    pypi_0    pypi
cmake                     3.14.0               h52cb24c_0  
colorama                  0.4.4              pyh9f0ad1d_0    conda-forge
conda                     4.10.1           py37h35e4cab_0    conda-forge
conda-merge               0.1.5                      py_0    conda-forge
conda-package-handling    1.6.0            py37h7b6447c_0  
configparser              5.0.2                    pypi_0    pypi
cryptography              2.8              py37h1ba5d50_0  
cudatoolkit               11.0.221             h6bb024c_0  
cycler                    0.10.0                   py37_0  
cython                    0.29.21          py37h29c3540_0  
dataclasses               0.6                      pypi_0    pypi
decorator                 4.4.2                      py_0  
dedent                    0.5                      pypi_0    pypi
defusedxml                0.6.0                      py_0  
demjson                   2.2.4              pyh9f0ad1d_0    conda-forge
distlib                   0.3.1                    py37_0  
docker-pycreds            0.4.0                    pypi_0    pypi
docutils                  0.16                     pypi_0    pypi
entrypoints               0.3                      py37_0  
expat                     2.2.10               he6710b0_2  
filelock                  3.0.12                     py_0  
flask                     1.1.2              pyh9f0ad1d_0    conda-forge
freetype                  2.10.4               h5ab3b9f_0  
fsspec                    0.8.4                    pypi_0    pypi
future                    0.18.2                   py37_1  
geos                      3.6.2                hf484d3e_2  
gitdb                     4.0.7                    pypi_0    pypi
gitpython                 3.1.17                   pypi_0    pypi
gmp                       6.1.2                h7f7056e_2  
google-auth               1.22.1                   pypi_0    pypi
google-auth-oauthlib      0.4.1                    pypi_0    pypi
googledrivedownloader     0.4                      pypi_0    pypi
grpcio                    1.33.1                   pypi_0    pypi
h5py                      2.10.0           py37h7918eee_0  
hdf5                      1.10.4               hb1b8bf9_0  
hypothesis                5.37.4                     py_0  
identify                  2.2.5                    pypi_0    pypi
idna                      2.8                      py37_0  
imagesize                 1.2.0                      py_0    conda-forge
importlib-metadata        2.0.0                      py_1  
importlib_metadata        2.0.0                         1  
iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
ipykernel                 5.3.4            py37h5ca1d4c_0  
ipython                   7.18.1           py37h5ca1d4c_0  
ipython_genutils          0.2.0              pyhd3eb1b0_1  
isodate                   0.6.0                    pypi_0    pypi
itsdangerous              2.0.1              pyhd8ed1ab_0    conda-forge
jedi                      0.17.2           py37h6ffa863_1  
jinja2                    2.11.2                     py_0  
joblib                    0.17.0                     py_0  
jpeg                      9b                   hcb7ba68_2  
json5                     0.9.5                      py_0  
jsonschema                3.2.0                      py_2  
jupyter_client            6.1.7                      py_0  
jupyter_core              4.6.1                    py37_0  
jupyterlab                2.2.6                      py_0  
jupyterlab_pygments       0.1.2                      py_0  
jupyterlab_server         1.2.0                      py_0  
kiwisolver                1.2.0            py37hfd86e86_0  
krb5                      1.18.2               h597af5e_0  
lcms2                     2.11                 h396b838_0  
ld_impl_linux-ppc64le     2.33.1               h0f24833_7  
libcurl                   7.71.1               h20c2e04_1  
libedit                   3.1.20191231         h14c3975_1  
libffi                    3.2.1             hf484d3e_1007  
libgcc-ng                 8.2.0                h822a55f_1  
libgfortran-ng            7.3.0                h822a55f_1  
libopenblas               0.3.10               h5a2b251_0  
libpng                    1.6.37               hbc83047_0  
libsodium                 1.0.18               h7b6447c_0  
libssh2                   1.9.0                h1ba5d50_1  
libstdcxx-ng              8.2.0                h822a55f_1  
libtiff                   4.1.0                h2733197_0  
llvmlite                  0.36.0           py37hf484d3e_0    numba
lmdb                      0.9.29               h29c3540_0  
markdown                  3.3.2                    pypi_0    pypi
markupsafe                1.1.1            py37h7b6447c_0  
matplotlib                3.3.2                h6ffa863_0  
matplotlib-base           3.3.2            py37h76a9e4f_0  
mistune                   0.8.4            py37h7b6447c_0  
monty                     2021.5.9                 pypi_0    pypi
more-itertools            8.8.0              pyhd8ed1ab_0    conda-forge
mpi                       1.0                     openmpi  
mpmath                    1.2.1                    pypi_0    pypi
mypy_extensions           0.4.3            py37h35e4cab_3    conda-forge
nbclient                  0.5.1                      py_0  
nbconvert                 6.0.7                    py37_0  
nbformat                  5.0.8                      py_0  
nbsphinx                  0.8.5              pyhd8ed1ab_1    conda-forge
ncurses                   6.2                  he6710b0_1  
nest-asyncio              1.4.1                      py_0  
networkx                  2.5.1                    pypi_0    pypi
ninja                     1.10.1           py37hfd86e86_0  
nodeenv                   1.6.0                    pypi_0    pypi
notebook                  6.1.4                    py37_0  
numba                     0.53.1                   pypi_0    pypi
numpy                     1.20.3                   pypi_0    pypi
numpy-base                1.19.1           py37h75fe3a5_0  
oauthlib                  3.1.0                    pypi_0    pypi
olefile                   0.46                     py37_0  
openmpi                   4.0.2                hb1b8bf9_1  
openssl                   1.1.1g               h6eb9509_0    conda-forge
packaging                 20.4                       py_0  
palettable                3.3.0                    pypi_0    pypi
pandas                    1.1.3            py37hdf5156a_0  
pandoc                    2.0.0.1                       1  
pandocfilters             1.4.2                    py37_1  
parso                     0.7.0                      py_0  
pathspec                  0.8.1              pyhd3deb0d_0    conda-forge
pathtools                 0.1.2                    pypi_0    pypi
pexpect                   4.8.0              pyhd3eb1b0_3  
pickleshare               0.7.5           pyhd3eb1b0_1003  
pillow                    8.2.0            py37h3f95422_0  
pip                       21.1.2             pyhd8ed1ab_0    conda-forge
plotly                    4.14.3                   pypi_0    pypi
pluggy                    0.13.1           py37h35e4cab_4    conda-forge
pre-commit                2.13.0                   pypi_0    pypi
proj4                     5.2.0                he6710b0_1  
prometheus_client         0.8.0                      py_0  
promise                   2.3                      pypi_0    pypi
prompt-toolkit            3.0.8                      py_0  
protobuf                  3.13.0                   pypi_0    pypi
psutil                    5.8.0                    pypi_0    pypi
ptyprocess                0.6.0              pyhd3eb1b0_2  
py                        1.10.0             pyhd3deb0d_0    conda-forge
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pybind11                  2.5.0            py37hfd86e86_0  
pycosat                   0.6.3            py37h7b6447c_1  
pycparser                 2.19                     py37_0  
pygments                  2.7.1                      py_0  
pymatgen                  2022.0.8                 pypi_0    pypi
pyopenssl                 19.1.0                     py_1  
pyparsing                 2.4.7                      py_0  
pyproj                    1.9.6            py37h14380d9_0  
pyrsistent                0.17.3           py37h7b6447c_0  
pyshp                     2.1.3              pyh44b312d_0    conda-forge
pysocks                   1.7.1                    py37_1  
pytest                    6.2.4            py37h35e4cab_0    conda-forge
python                    3.7.6                h4134adf_2  
python-dateutil           2.8.1                      py_0  
python-louvain            0.15                     pypi_0    pypi
python_abi                3.7                     1_cp37m    conda-forge
pytorch-lightning         1.1.3                    pypi_0    pypi
pytz                      2020.1                     py_0  
pyyaml                    5.3.1            py37h7b6447c_0  
pyzmq                     19.0.2           py37he6710b0_1  
rdflib                    5.0.0                    pypi_0    pypi
readline                  7.0                  h7b6447c_5  
regex                     2021.4.4         py37h140841e_0  
requests                  2.22.0                   py37_1  
requests-oauthlib         1.3.0                    pypi_0    pypi
retrying                  1.3.3                    pypi_0    pypi
rhash                     1.4.0                h1ba5d50_0  
rsa                       4.6                      pypi_0    pypi
ruamel_yaml               0.15.87          py37h7b6447c_0  
scikit-learn              0.23.2           py37h0573a6f_0  
scipy                     1.5.2            py37habc2bb6_0  
send2trash                1.5.0              pyhd3eb1b0_1  
sentencepiece             0.1.94                   pypi_0    pypi
sentry-sdk                1.1.0                    pypi_0    pypi
setuptools                50.3.0           py37h7557452_1  
shortuuid                 1.0.1                    pypi_0    pypi
six                       1.14.0                   py37_0  
smmap                     4.0.0                    pypi_0    pypi
snowballstemmer           2.1.0              pyhd8ed1ab_0    conda-forge
sortedcontainers          2.2.2                      py_0  
spglib                    1.16.1                   pypi_0    pypi
sphinx                    4.0.2              pyh6c4a22f_0    conda-forge
sphinx-rtd-theme          0.5.2                    pypi_0    pypi
sphinxcontrib-applehelp   1.0.2                      py_0    conda-forge
sphinxcontrib-devhelp     1.0.2                      py_0    conda-forge
sphinxcontrib-htmlhelp    2.0.0              pyhd8ed1ab_0    conda-forge
sphinxcontrib-jsmath      1.0.1                      py_0    conda-forge
sphinxcontrib-qthelp      1.0.3                      py_0    conda-forge
sphinxcontrib-serializinghtml 1.1.5              pyhd8ed1ab_0    conda-forge
sqlite                    3.31.1               h7b6447c_0  
submitit                  1.3.3                    pypi_0    pypi
subprocess32              3.5.4                    pypi_0    pypi
sympy                     1.8                      pypi_0    pypi
tabulate                  0.8.9                    pypi_0    pypi
tensorboard               2.3.0                    pypi_0    pypi
tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
terminado                 0.9.1                    py37_0  
testpath                  0.4.4                      py_0  
threadpoolctl             2.1.0              pyh5ca1d4c_0  
tk                        8.6.8                hbc83047_0  
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
torch                     1.7.0a0                  pypi_0    pypi
torch-cluster             1.5.9                    pypi_0    pypi
torch-geometric           1.7.0                    pypi_0    pypi
torch-scatter             2.0.6                    pypi_0    pypi
torch-sparse              0.6.9                    pypi_0    pypi
torch-spline-conv         1.2.1                    pypi_0    pypi
torchaudio                0.7.0a0+a853dff          pypi_0    pypi
torchtext                 0.6.0                    pypi_0    pypi
torchvision               0.8.0a0+2f40a48          pypi_0    pypi
tornado                   6.0.4            py37h7b6447c_1  
tqdm                      4.60.0             pyhd8ed1ab_0    conda-forge
traitlets                 5.0.5                      py_0  
typed-ast                 1.4.2            py37h140841e_1  
typing                    3.7.4.3                  py37_0  
typing_extensions         3.7.4.3                    py_0  
uncertainties             3.1.5                    pypi_0    pypi
urllib3                   1.25.8                   py37_0  
virtualenv                20.0.35                  py37_0  
wandb                     0.10.30                  pypi_0    pypi
wcwidth                   0.2.5                      py_0  
webencodings              0.5.1                    py37_1  
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.34.2                   py37_0  
xz                        5.2.4                h14c3975_4  
yaml                      0.1.7                h1bed415_2  
zeromq                    4.3.3                he6710b0_3  
zipp                      3.3.1                      py_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.3.7                h0b5b093_0

Loading Dataset Into Trainer

Dear OCP Team,

I tried to load a dataset into an OCP ForcesTrainer and I got the following error message:
File ...\ocpmodels\trainers\base_trainer.py:280, in BaseTrainer.load_datasets(self)
278 print(self.config["task"]["dataset"])#!#
279 print(self.config["dataset"])#!#
--> 280 self.train_dataset = registry.get_dataset_class(
281 self.config["task"]["dataset"]
282 )(self.config["dataset"])
283 self.train_sampler = self.get_sampler(
284 self.train_dataset,
285 self.config["optim"]["batch_size"],
286 shuffle=True,
287 )
288 self.train_loader = self.get_dataloader(
289 self.train_dataset,
290 self.train_sampler,
291 )

TypeError: 'NoneType' object is not callable

Note that the line numbers may deviate slightly, because I inserted some print commands above. It seems to me that at this line of code it is not yet decided whether self.train_dataset should be obtained from self.config["task"]["dataset"] or from self.config["dataset"]: the code calls the output of registry.get_dataset_class(self.config["task"]["dataset"]) as a function with self.config["dataset"] as its argument, and that output is apparently None. The relevant code to reproduce the error message is attached; I would appreciate any advice on how to fix the issue.

Many thanks and best
Thorren

input_reaxFF_water_10.extxyz.txt
create_lmdb_dataset.py.txt
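
For anyone debugging the same error, here is a minimal sketch of the lookup that fails; the config dict is a stand-in for the trainer's merged config, and whether setup_imports() is needed depends on how the trainer is instantiated:

from ocpmodels.common.registry import registry
from ocpmodels.common.utils import setup_imports

setup_imports()  # populates the registry; before this, lookups can return None

config = {
    "task": {"dataset": "trajectory_lmdb"},        # must be a registered name
    "dataset": {"src": "data/s2ef/200k/train/"},
}
dataset_cls = registry.get_dataset_class(config["task"]["dataset"])
if dataset_cls is None:
    raise ValueError(f"no dataset registered under {config['task']['dataset']!r}")
train_dataset = dataset_cls(config["dataset"])

If the lookup returns None in your setup, the traceback is explained: calling None with self.config["dataset"] is exactly the "'NoneType' object is not callable" error.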

Is it possible to specify energy-only with CGCNN?

The paper seems to have energy-only and forces-only results for SchNet and DimeNet++. Is there an easy way to specify this for CGCNN? Setting trainer: energy in base.yml gives an error:

Unknown option: -C
usage: git [--version] [--help] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]
submitit ERROR (2021-04-06 17:49:15,714) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/ocp-models/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/ocp-models/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/ocp-models/lib/python3.6/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/ocp-models/lib/python3.6/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/ocp-models/lib/python3.6/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/ocp-models/lib/python3.6/site-packages/submitit/core/submission.py", line 53, in process_job
    result = delayed.result()
  File "/uufs/chpc.utah.edu/common/home/u1326059/software/pkg/miniconda3/envs/ocp-models/lib/python3.6/site-packages/submitit/core/utils.py", line 128, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "main.py", line 55, in __call__
    cpu=config.get("cpu", False),
  File "/uufs/chpc.utah.edu/common/home/sparks-thermoelectric/ocp/ocpmodels/trainers/energy_trainer.py", line 90, in __init__
    name="is2re",
  File "/uufs/chpc.utah.edu/common/home/sparks-thermoelectric/ocp/ocpmodels/trainers/base_trainer.py", line 146, in __init__
    self.load()
  File "/uufs/chpc.utah.edu/common/home/sparks-thermoelectric/ocp/ocpmodels/trainers/base_trainer.py", line 153, in load
    self.load_task()
  File "/uufs/chpc.utah.edu/common/home/sparks-thermoelectric/ocp/ocpmodels/trainers/energy_trainer.py", line 96, in load_task
    ), "EnergyTrainer requires single_point_lmdb dataset"
AssertionError: EnergyTrainer requires single_point_lmdb dataset
srun: error: notch086: task 0: Exited with exit code 1

Not sure if that was the right approach or if there's a flag somewhere I'm missing.
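
The assertion in the traceback says the EnergyTrainer only accepts IS2RE-style single_point_lmdb data, so pointing it at S2EF trajectory_lmdb data will always fail. One possible workaround (a sketch, not an official recipe; whether the optim block lives in this file or in the included base.yml depends on the config layout) is to keep the default forces trainer but give the force loss zero weight:

import yaml

# Hypothetical energy-only variant of the CGCNN S2EF config: keep the
# ForcesTrainer and trajectory_lmdb dataset, but zero out the force term
# so only energies drive training.
with open("configs/s2ef/200k/cgcnn/cgcnn.yml") as f:
    cfg = yaml.safe_load(f)
cfg.setdefault("optim", {})["force_coefficient"] = 0
with open("configs/s2ef/200k/cgcnn/cgcnn_energy_only.yml", "w") as f:
    yaml.safe_dump(cfg, f)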

Example test set prediction using Schnet

I'm currently trying to understand your baseline metrics in further detail. To understand precisely how you evaluate the various S2EF metrics, I was wondering if you could supply the test-set predictions for SchNet. I am in the process of creating these by following https://github.com/Open-Catalyst-Project/ocp/blob/master/TRAIN.md; however, I would appreciate knowing with certainty that I obtained the same results for one of the test datasets.

Thank you for making such a large DFT database publicly available!

Application of ocp for the molecular adsorption on oxide surfaces

Hi.
Thank you for your exciting work. I am very interested in the Open-Catalyst-Project (ocp). I was told that the neural network potentials created by ocp are general purpose and can be used for calculations beyond catalyst applications. This would be wonderful if true!

If possible, I would like to try applying ocp to molecular and dissociative adsorption on oxide surfaces, and also to the growth of layered materials on oxide surfaces. However, in the ocp training data I found only metals and metallic alloys, no oxides.

Could you please give me some suggestions on the application scope of ocp? Thank you very much.

torch.distributed not supported on Windows, but throws error for non-distributed training

Hi,

Can't seem to train:

runfile('main.py','--mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml')
Traceback (most recent call last):

  File "main.py", line 15, in <module>
    from ocpmodels.common import distutils

  File "C:\Users\sterg\Documents\GitHub\sparks-baird\ocp\ocpmodels\common\distutils.py", line 98, in <module>
    def broadcast(tensor, src, group=dist.group.WORLD, async_op=False):

AttributeError: module 'torch.distributed' has no attribute 'group'

Possibly because torch.distributed isn't supported on Windows(?).

There may have been some progress with Windows support; however, it seems like the command I used, main.py --mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml, shouldn't require torch.distributed, correct? Is there something basic I'm missing here?

Sterling
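
For reference, a defensive sketch of how the failing module-level default could be written so that merely importing distutils works on platforms without a distributed backend (this is an assumption about a possible fix, not the repository's actual code):

import torch.distributed as dist

# dist.group.WORLD only exists when the distributed package is available,
# so resolve the default guardedly instead of unconditionally at import time.
_DEFAULT_GROUP = dist.group.WORLD if dist.is_available() else None

def broadcast(tensor, src, group=_DEFAULT_GROUP, async_op=False):
    # Single-process fallback: with no initialized process group there is
    # nothing to broadcast, so return the tensor unchanged.
    if not dist.is_available() or not dist.is_initialized():
        return tensor
    return dist.broadcast(tensor, src, group=group, async_op=async_op)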

Dataset downloader with user-specified path

Hi all,

I am wondering if it would be helpful to add --data-path as an optional command-line argument to the download data script, so that users can decide where they want to store the data.

Best regards
Thorsten Kurth
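
Something along these lines, perhaps (a sketch of the proposed flag; the default value and how download_data.py currently names its arguments are assumptions):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--data-path",
    type=str,
    default="data/",  # hypothetical default matching the current layout
    help="Directory into which datasets are downloaded and extracted",
)
args = parser.parse_args()
# All download/extract/preprocess steps would then write under
# args.data_path instead of a hard-coded location.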

Question about --get-edges flag

Hello,

Does the 3-5x slowdown mentioned for the --get-edges flag refer to the execution of download_data.py itself, or does it mean that model training is slower?

main.py produces TypeError: 'NoneType' object is not subscriptable

Hi,

I've sorted through many issues involving use of Windows, then use of Windows Subsystem for Linux, etc., and it's been a pretty long process, but I think I've come to something I can't debug very well.

I followed the instructions in the primary README.md for environment activation/installation via conda, and downloaded/uncompressed/pre-processed the data via scripts/download_data.py for the 200k-size training data. (I also had to download a validation set, in my case val_id, to prevent some errors with the following line, even though it is supposedly in train mode.)

(ocp-models) root@DESKTOP-<>:/mnt/c/Users/sterg/Documents/GitHub/sparks-baird/ocp# python3 main.py --mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml

This produced the following output:

amp: false
cmd:
  checkpoint_dir: ./checkpoints/2021-02-25-22-28-32
  commit: 5f16b64
  identifier: ''
  logs_dir: ./logs/tensorboard/2021-02-25-22-28-32
  print_every: 10
  results_dir: ./results/2021-02-25-22-28-32
  seed: 0
  timestamp: 2021-02-25-22-28-32
dataset:
  grad_target_mean: 0.0
  grad_target_std: 2.887317180633545
  normalize_labels: true
  src: data/s2ef/200k/train/
  target_mean: -0.7554450631141663
  target_std: 2.887317180633545
gpus: 0
logger: tensorboard
model: cgcnn
model_attributes:
  atom_embedding_size: 128
  cutoff: 6.0
  fc_feat_size: 128
  num_fc_layers: 3
  num_gaussians: 100
  num_graph_conv_layers: 2
  use_pbc: true
optim:
  batch_size: 32
  eval_batch_size: 24
  force_coefficient: 10
  lr_gamma: 0.1
  lr_initial: 0.0005
  lr_milestones:
  - 15
  - 20
  max_epochs: 50
  num_workers: 64
  warmup_epochs: 2
  warmup_factor: 0.2
task:
  dataset: trajectory_lmdb
  description: Regressing to energies and forces for DFT trajectories from OCP
  eval_on_free_atoms: true
  grad_input: atomic forces
  labels:
  - potential energy
  metric: mae
  train_on_free_atoms: true
  type: regression
val_dataset:
  src: data/s2ef/all/val_id/

### Loading dataset: trajectory_lmdb
### Loading model: cgcnn
### Loaded CGCNN with 245889 parameters.
NOTE: model gradient logging to tensorboard not yet supported.
Traceback (most recent call last):
  File "main.py", line 140, in <module>
    Runner()(config)
  File "main.py", line 69, in __call__
    trainer.train()
  File "/mnt/c/Users/sterg/Documents/GitHub/sparks-baird/ocp/ocpmodels/trainers/forces_trainer.py", line 365, in train
    batch = next(train_loader_iter)
  File "/root/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/ocp-models/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/mnt/c/Users/sterg/Documents/GitHub/sparks-baird/ocp/ocpmodels/common/data_parallel.py", line 77, in __call__
    batch = data_list_collater(data_list, otf_graph=self.otf_graph)
  File "/mnt/c/Users/sterg/Documents/GitHub/sparks-baird/ocp/ocpmodels/datasets/trajectory_lmdb.py", line 145, in data_list_collater
    n_index = data.edge_index[1, :]
TypeError: 'NoneType' object is not subscriptable

As for the data files, I have:

(ocp-models) root@DESKTOP-<>:/mnt/c/Users/sterg/Documents/GitHub/sparks-baird/ocp/data/s2ef/200k/train# ls -l

-rwxrwxrwx 1 sterg sterg 21990232554 Feb 25 14:10 data.0000.lmdb
-rwxrwxrwx 1 sterg sterg        8192 Feb 25 13:02 data.0000.lmdb-lock
-rwxrwxrwx 1 sterg sterg 21990232554 Feb 25 14:10 data.0001.lmdb
-rwxrwxrwx 1 sterg sterg        8192 Feb 25 13:02 data.0001.lmdb-lock
-rwxrwxrwx 1 sterg sterg 21990232554 Feb 25 14:10 data.0002.lmdb
-rwxrwxrwx 1 sterg sterg        8192 Feb 25 13:02 data.0002.lmdb-lock
-rwxrwxrwx 1 sterg sterg 21990232554 Feb 25 14:10 data.0003.lmdb
-rwxrwxrwx 1 sterg sterg        8192 Feb 25 13:02 data.0003.lmdb-lock
-rwxrwxrwx 1 sterg sterg 21990232554 Feb 25 14:09 data.0004.lmdb
-rwxrwxrwx 1 sterg sterg        8192 Feb 25 13:20 data.0004.lmdb-lock
-rwxrwxrwx 1 sterg sterg 21990232554 Feb 25 14:09 data.0005.lmdb
-rwxrwxrwx 1 sterg sterg        8192 Feb 25 13:20 data.0005.lmdb-lock
-rwxrwxrwx 1 sterg sterg      812398 Feb 25 14:10 data_log.0000.txt
-rwxrwxrwx 1 sterg sterg      812401 Feb 25 14:10 data_log.0001.txt
-rwxrwxrwx 1 sterg sterg      812565 Feb 25 14:10 data_log.0002.txt
-rwxrwxrwx 1 sterg sterg      812503 Feb 25 14:10 data_log.0003.txt
-rwxrwxrwx 1 sterg sterg      696314 Feb 25 14:10 data_log.0004.txt
-rwxrwxrwx 1 sterg sterg      696434 Feb 25 14:10 data_log.0005.txt

(ocp-models) root@DESKTOP-<>:/mnt/c/Users/sterg/Documents/GitHub/sparks-baird/ocp/data/s2ef/all/val_id# ls -l

-rwxrwxrwx 1 sterg sterg 1285824512 Feb 25 21:53 data.0000.lmdb
-rwxrwxrwx 1 sterg sterg       8192 Feb 25 21:53 data.0000.lmdb-lock
-rwxrwxrwx 1 sterg sterg 1285353472 Feb 25 21:54 data.0001.lmdb
-rwxrwxrwx 1 sterg sterg       8192 Feb 25 21:54 data.0001.lmdb-lock
-rwxrwxrwx 1 sterg sterg 1248014336 Feb 25 21:53 data.0002.lmdb
-rwxrwxrwx 1 sterg sterg       8192 Feb 25 21:53 data.0002.lmdb-lock
-rwxrwxrwx 1 sterg sterg 1247002624 Feb 25 21:53 data.0003.lmdb
-rwxrwxrwx 1 sterg sterg       8192 Feb 25 21:53 data.0003.lmdb-lock
-rwxrwxrwx 1 sterg sterg 1248022528 Feb 25 21:53 data.0004.lmdb
-rwxrwxrwx 1 sterg sterg       8192 Feb 25 21:53 data.0004.lmdb-lock
-rwxrwxrwx 1 sterg sterg 1246285824 Feb 25 21:53 data.0005.lmdb
-rwxrwxrwx 1 sterg sterg       8192 Feb 25 21:53 data.0005.lmdb-lock
-rwxrwxrwx 1 sterg sterg    3776224 Feb 25 21:54 data_log.0000.txt
-rwxrwxrwx 1 sterg sterg    3775833 Feb 25 21:54 data_log.0001.txt
-rwxrwxrwx 1 sterg sterg    3664915 Feb 25 21:54 data_log.0002.txt
-rwxrwxrwx 1 sterg sterg    3662536 Feb 25 21:54 data_log.0003.txt
-rwxrwxrwx 1 sterg sterg    3664953 Feb 25 21:54 data_log.0004.txt
-rwxrwxrwx 1 sterg sterg    3665512 Feb 25 21:54 data_log.0005.txt

I was only able to get this far using the command line, so I don't have the (relative) ease of debugging I had before with Spyder. I gather that data.edge_index is probably None, but that's about as far as I understand. Any suggestions for being able to run the full training process?

Thanks!

Sterling
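
To make the failure mode concrete, here is a paraphrased sketch of the collate path that raises (not the verbatim source): it indexes data.edge_index, which is None whenever the LMDBs were written without precomputed edges and the collater is not told to build graphs on the fly.

import torch
from torch_geometric.data import Batch

# Paraphrased sketch of ocpmodels' data_list_collater.
def data_list_collater(data_list, otf_graph=False):
    batch = Batch.from_data_list(data_list)
    if not otf_graph:  # expects edges precomputed in the LMDB
        n_neighbors = []
        for data in data_list:
            n_index = data.edge_index[1, :]  # None here -> the TypeError above
            n_neighbors.append(n_index.shape[0])
        batch.neighbors = torch.tensor(n_neighbors)
    return batch

So the two plausible fixes are regenerating the dataset with download_data.py's --get-edges option, or enabling on-the-fly graph construction (otf_graph) in the model config, if the chosen model supports it.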

Sharing models with the Hugging Face Hub

Hi OCP team!

I see you currently distribute your model checkpoints through multiple links to your hosted server. Would you be interested in sharing the models in the Hugging Face Hub? (we can even set up an OCP organization, or alternatively use the Facebook organization)

The Hub offers free hosting of over 20K models, and it would make your work more accessible and visible to the rest of the ML community. Some of the benefits of sharing your models would be:

  • versioning
  • commit history and diffs
  • repos provide useful metadata about their tasks, languages, metrics, etc
  • we could add a widget for users to try the model directly in the browser

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. This step-by-step guide explains the process, in case you're interested. Please let us know if you would like to proceed and if you have any questions.

Happy to hear your thoughts,
Omar and the Hugging Face team

cc @LysandreJik @lewtun @Rocketknight1

S2EF baseline .npz file availability

Would it be possible for a S2EF .npz file (CGCNN 200k or something else) to be made available for download? Just trying to get some real energies and forces to play with while I'm sorting out compute cluster / WSL2 GPU usage issues.
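
In the meantime, reading such a file once you have one is straightforward; the key names below are an assumption based on how .npz prediction archives are commonly written, so check preds.files first:

import numpy as np

preds = np.load("s2ef_predictions.npz")  # placeholder file name
print(preds.files)            # e.g. ['ids', 'energy', 'forces', 'chunk_idx']
energies = preds["energy"]    # one predicted energy per frame
forces = preds["forces"]      # per-atom force components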

LMDB Dataset Creation

Dear OCP Team,

I am trying to set up a custom S2EF LMDB database following the tutorial at tutorials/lmdb_dataset_creation.ipynb; however, the resulting dataset contains only empty data objects. Can you give any advice on how to fix the problem? I attach the relevant files. The test dataset contains only forces (no energies), as it is intended for use with ForceNet.

Many thanks in advance and best
Thorren

installed_packages_versions.txt
input_reaxFF_water_10.extxyz.txt
create_lmdb_dataset.py.txt

Leaderboard

Hi OCP team,

Thanks for the new datasets and codebase. I was wondering when the Leaderboard is planned to be released?

(To add on, is the project currently at a stable stage or can the codebase be expected to undergo breaking changes before the public leaderboard is released?)

System ID missing from is2res train lmdb

I downloaded the is2re lmdbs and noticed that the training split seems to be missing the system IDs. I was able to read the system ID from the validation and test splits, so I'm guessing this was not intentional?

Here is the Python script I wrote to read the first entry of the lmdb file for the train and validation splits, with the resulting output below:

import lmdb
import pickle

my_lmdb = "is2res_train_val_test_lmdbs/data/is2re/all/train/data.lmdb"
lmdb_env = lmdb.open(my_lmdb, subdir=False, readonly=True, lock=False, readahead=False, meminit=False)
with lmdb_env.begin() as txn:
    cursor = txn.cursor()
    cursor.iternext()

    myobject = cursor.get('0'.encode())
    data = pickle.loads(myobject)
    print(data)


my_lmdb = "is2res_train_val_test_lmdbs/data/is2re/all/val_id/data.lmdb"
lmdb_env = lmdb.open(my_lmdb, subdir=False, readonly=True, lock=False, readahead=False, meminit=False)
with lmdb_env.begin() as txn:
    cursor = txn.cursor()
    cursor.iternext()

    myobject = cursor.get('0'.encode())
    data = pickle.loads(myobject)
    print(data)

Output:

(/work/westgroup/opencatalyst/conda-ocp) [harris.se@login-00 opencatalyst]$ python missing_sid.py
Data(atomic_numbers=[86], cell=[1, 3, 3], cell_offsets=[2964, 3], distances=[2964], edge_index=[2, 2964], fixed=[86], force=[86, 3], natoms=86, pos=[86, 3], pos_relaxed=[86, 3], tags=[86], y_init=6.282500615000004, y_relaxed=-0.025550085000020317)
Data(atomic_numbers=[84], cell=[1, 3, 3], cell_offsets=[3623, 3], edge_index=[2, 3623], fixed=[84], force=[84, 3], natoms=84, pos=[84, 3], pos_relaxed=[84, 3], sid=1700380, tags=[84], y_init=8.067705849999982, y_relaxed=-0.40190949000003684)

From the output above, the validation split has a system ID but the training one does not. Is there a way for me to get the system ID for the training split so I can easily look up the catalyst and adsorbate?
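
As a quick check of how widespread this is, one can loop over the whole split and count entries without a sid (a sketch; it assumes the keys are the stringified indices 0..N-1, as in the snippet above, and guards against any extra metadata keys):

import lmdb
import pickle

env = lmdb.open("is2res_train_val_test_lmdbs/data/is2re/all/train/data.lmdb",
                subdir=False, readonly=True, lock=False)
with env.begin() as txn:
    n = txn.stat()["entries"]
    missing = 0
    for i in range(n):
        raw = txn.get(str(i).encode())
        if raw is None:
            continue  # skip any non-index keys counted in the stat
        data = pickle.loads(raw)
        if not hasattr(data, "sid"):
            missing += 1
print(f"{missing} of {n} entries have no sid")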

Unable to run single gpu script

Hi! When I directly run the script

python main.py --mode train --config-yml configs/is2re/10k/schnet/schnet.yml

it throws the following exception:

Traceback (most recent call last):
  File "main.py", line 126, in <module>
    Runner()(config)
  File "main.py", line 43, in __call__
    self.trainer = registry.get_trainer_class(
  File "/home/grads/k/kruskallin/ocp/ocpmodels/trainers/forces_trainer.py", line 89, in __init__
    super().__init__(
  File "/home/grads/k/kruskallin/ocp/ocpmodels/trainers/base_trainer.py", line 191, in __init__
    self.load()
  File "/home/grads/k/kruskallin/ocp/ocpmodels/trainers/base_trainer.py", line 197, in load
    self.load_logger()
  File "/home/grads/k/kruskallin/ocp/ocpmodels/trainers/base_trainer.py", line 223, in load_logger
    self.logger = registry.get_logger_class(self.config["logger"])(
TypeError: 'NoneType' object is not callable

where registry.get_logger_class(self.config["logger"]) is None. However, when I add the '--debug' flag, it runs properly. Could you please give me a hint as to where I have gone wrong?
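
A small diagnostic along these lines may help narrow it down (a sketch; "tensorboard" and "wandb" are the logger names I would expect to be registered, and the registry is only populated after setup_imports() or an equivalent import):

from ocpmodels.common.registry import registry
from ocpmodels.common.utils import setup_imports

setup_imports()  # registers trainers/models/loggers; lookups return None before this
logger_cls = registry.get_logger_class("tensorboard")  # or "wandb"
# --debug works because the trainer skips logger creation entirely in debug
# mode; outside debug mode this lookup must succeed.
assert logger_cls is not None, "no logger registered under that name"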

How to set the scaling factor file of GemNet-dT

Hello OCP team,

Thanks for the amazing project!

I'm currently trying to reproduce the results using GemNet-dT model. I'm wondering how the scaling factors (gemnet-dT.json) are generated.

Thanks for your time and patience!
Bowen

Identifying catalyst and adsorbate

Hi! I was interested in identifying the catalyst and adsorbate atoms within the PyG Batch/Data object:

Batch(atomic_numbers=[172], batch=[172], cell=[2, 3, 3], cell_offsets=[7584, 3], distances=[7584], edge_index=[2, 7584], fixed=[172], force=[172, 3], natoms=[2], neighbors=[2], pos=[172, 3], pos_relaxed=[172, 3], tags=[172], y_init=[2], y_relaxed=[2])

Intuitively, I understand that all the adsorbate atoms are packed at the end of the Data object attribute lists, e.g. the atomic numbers 6, 1, 1, 1, and 8:

> data.atomic_numbers[data.batch==0]
tensor([20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20.,
        20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20., 20.,
        20., 20., 20., 20., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28.,
        28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28.,
        28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28.,
        28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28.,
        28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28., 28.,  6.,  1.,
         1.,  1.,  8.], device='cuda:0')

> data.fixed[data.batch==0]
tensor([1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1.,
        1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 0.,
        1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
        0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0., 1.,
        1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 0.,
        1., 1., 0., 1., 1., 1., 0., 0., 0., 0., 0.], device='cuda:0')

> data.tags[data.batch==0]
tensor([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        2, 2, 2, 2, 2], device='cuda:0')

Am I correct in my understanding that the 'tags' attribute is roughly the reverse of the 'fixed' attribute, but additionally identifies the adsorbate atoms with tag = 2?
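
For what it's worth, the OC20 convention documented with the dataset is: tag 0 = fixed sub-surface slab atoms, tag 1 = free surface atoms, tag 2 = adsorbate atoms, so tags carry strictly more information than fixed. A small self-contained masking sketch (a toy system mirroring the batch printed above):

import torch
from torch_geometric.data import Data

# Toy system following the OC20 tag convention described above.
data = Data(
    atomic_numbers=torch.tensor([20., 20., 28., 28., 6., 1., 1., 1., 8.]),
    tags=torch.tensor([0, 0, 1, 1, 2, 2, 2, 2, 2]),
)
adsorbate = data.atomic_numbers[data.tags == 2]
surface = data.atomic_numbers[data.tags == 1]
print(adsorbate)  # tensor([6., 1., 1., 1., 8.])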

Implementation of pretrained models on my own dataset

Hello! There is an issue with applying the pretrained models to your own structures (the S2EF task). Currently we have a set of .xyz files of structures. Is there any way to create a suitable dataset for running the models?
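
A minimal sketch of the conversion path, following tutorials/lmdb_dataset_creation.ipynb (the file name is a placeholder, and which r_* flags you need depends on whether your .xyz files carry energies and forces):

from ase.io import read
from ocpmodels.preprocessing import AtomsToGraphs

frames = read("my_structures.xyz", index=":")  # read all frames in the file
a2g = AtomsToGraphs(
    max_neigh=50,
    radius=6,
    r_energy=False,   # set True only if your .xyz files contain energies
    r_forces=False,   # likewise for forces
    r_fixed=True,
    r_distances=False,
    r_edges=True,
)
data_objects = a2g.convert_all(frames)
# Write data_objects into an LMDB exactly as in the tutorial, then point the
# config's dataset src at that directory.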
