ocp's People

Contributors

aarongarrison, abhshkdz, adeeshkolluru, anuroopsriram, aschneidman, brookwander, calebho, clz55, d-stoll, dan1elherbst, deviparikh, ericmusa, gasteigerjo, ianbenlolo, jmusiel, joshes, junwoony, kruskallin, ktran9891, mshuaibii, nianhant, nimashoghi, sgbaird, sidgoyal78, txie-93, weihua916, wood-b

ocp's Issues

Enabling DeepSpeed with fp16 crashes

I tried running the S2EF task with the CGCNN model under DeepSpeed (using my latest commit on the deepspeed branch). Using the code as is (i.e. the plain DeepSpeed trainer without any optimizations enabled) works.
However, using the following DeepSpeed config file:

{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": false
}

in which the fp16 optimization is enabled, and running the job on a single GPU for now as follows:

(ocp-models) [dherbst@kanon ocp]$ python -u -m torch.distributed.launch --nproc_per_node=1 main.py --distributed --num-gpus 1 --mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml --deepspeed-mode deepspeed-optimizer --deepspeed-config configs/s2ef/200k/cgcnn/ds_config.json

results in the following error:

Traceback (most recent call last):
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
    Runner()(config)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
    self.task.run()
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 56, in run
    raise e
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 330, in train
    out = self._forward(batch)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 432, in _forward
    out_energy, out_forces = self.model(batch_list)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1616, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/common/data_parallel.py", line 59, in forward
    return self.module(batch_list[0].to(f"cuda:{self.device_ids[0]}"))
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/models/cgcnn.py", line 165, in forward
    energy = self._forward(data)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/common/utils.py", line 121, in cls_method
    return f(self, *args, **kwargs)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/models/cgcnn.py", line 154, in _forward
    mol_feats = self._convolve(data)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/models/cgcnn.py", line 185, in _convolve
    node_feats = self.embedding_fc(data.x)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: expected scalar type Half but found Float
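
My best guess is that DeepSpeed's fp16 mode casts the model weights to half precision while the float32 tensors inside the PyG batch (e.g. data.x) are left untouched, so the linear layer sees mismatched dtypes. A minimal workaround sketch, just to illustrate the idea (the helper and the attribute list below are hypothetical, not code from the ocp repo), would be to cast the batch's float tensors before the forward call:

import torch

# Hypothetical helper, not part of ocp: cast the float32 attributes of each
# PyG data object to half precision so they match DeepSpeed's fp16 weights.
def cast_batch_to_fp16(batch_list):
    for data in batch_list:
        for key in ("x", "pos", "edge_attr"):  # extend with other float attrs
            value = getattr(data, key, None)
            if torch.is_tensor(value) and value.is_floating_point():
                setattr(data, key, value.half())
    return batch_list

# e.g. in the trainer, before the model call:
# out_energy, out_forces = self.model(cast_batch_to_fp16(batch_list))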

ZeRO stage 3 errors

During training with ZeRO stage 3 enabled in the DeepSpeed config, the following warning and error occur:

[WARNING] [stage3.py:106:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch_geometric.data.batch.DataBatch'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Traceback (most recent call last):
  File "/home/dstoll/ocp/main.py", line 126, in <module>
    Runner()(config)
  File "/home/dstoll/ocp/main.py", line 66, in __call__
    self.task.run()
  File "/home/dstoll/ocp/ocpmodels/tasks/task.py", line 56, in run
    raise e
  File "/home/dstoll/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dstoll/ocp/ocpmodels/trainers/forces_trainer.py", line 329, in train
    self._backward(loss)
  File "/home/dstoll/ocp/ocpmodels/trainers/base_trainer.py", line 716, in _backward
    self.model.backward(loss)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1726, in backward
    self.optimizer.backward(loss)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2536, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: The expanded size of the tensor (256) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [73085, 256].  Tensor sizes: [0]
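
The size-0 tensor in the error looks consistent with ZeRO-3 partitioning: if the hooks cannot see the tensors hidden inside the DataBatch object (as the warning above says), the parameter all-gather presumably never fires and the weights reaching the backward pass are still empty shards. A small diagnostic sketch to check this (it assumes the ds_shape/ds_status attributes that DeepSpeed attaches to ZeRO-3 partitioned parameters; the function name is made up):

# Hypothetical diagnostic, not ocp code: list parameters that are still
# partitioned (local numel == 0) at the point where training fails.
def report_partitioned_params(module):
    for name, param in module.named_parameters():
        if hasattr(param, "ds_shape") and param.numel() == 0:
            print(f"{name}: full shape {tuple(param.ds_shape)}, "
                  f"local shape {tuple(param.shape)}, status {param.ds_status}")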

Enabling DeepSpeed with ZeRO crashes

I tried running the S2EF task with the CGCNN model under DeepSpeed (using my latest commit on the deepspeed branch). Using the code as is (i.e. the plain DeepSpeed trainer without any optimizations enabled) works.
However, using the following DeepSpeed config file:

{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005
    }
  },
  "fp16": {
    "enabled": false
  },
  "zero_optimization": true
}

in which the ZeRO optimization is enabled, and running the job on a single GPU for now as follows:

(ocp-models) [dherbst@kanon ocp]$ python -u -m torch.distributed.launch --nproc_per_node=1 main.py --distributed --num-gpus 1 --mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml --deepspeed-mode deepspeed-optimizer --deepspeed-config configs/s2ef/200k/cgcnn/ds_config.json

results in the following error:

Traceback (most recent call last):
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
    Runner()(config)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
    self.task.run()
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
    self._backward(loss)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 741, in _backward
    self.optimizer.step()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1660, in step
    self.check_overflow()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1919, in check_overflow
    self._check_overflow(partition_gradients)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1820, in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1839, in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1832, in has_overflow_partitioned_grads_serial
    for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0
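
I am not sure about the root cause yet, but the empty averaged_gradients dict suggests the ZeRO optimizer never received reduced gradients before step() was called, which can happen when the wrapped optimizer is stepped directly instead of going through the DeepSpeed engine. For reference, a minimal sketch of the backward/step pairing described in the DeepSpeed docs (the toy model and loss are only for illustration, and this has to run under a distributed launcher just like the command above):

import deepspeed
import torch
import torch.nn as nn

# Toy model only for illustration; the config path is the ds_config.json above.
model = nn.Linear(16, 1)
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="configs/s2ef/200k/cgcnn/ds_config.json",
)

x = torch.randn(32, 16, device=engine.device)
loss = engine(x).pow(2).mean()
engine.backward(loss)  # the engine scales the loss and reduces/averages grads
engine.step()          # the engine, not optimizer.step(), drives the update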
