
Comments (7)

ilyes319 commented on August 28, 2024

Hey,

I am tagging @wcwitt, who will be able to help in more detail.

from mace.

ilyes319 commented on August 28, 2024

Hey,

Looking at the trace, it seems that you are trying to run a model compiled on a GPU on the CPU on Archer2. Could you check that the model you are loading is saved on the CPU? To do so, you could do:

  model = torch.load(path)
  model_cpu = model.to("cpu")
  torch.save(model_cpu, path_cpu)

and then load the model_cpu through your setup.
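For what it's worth, a quick way to confirm the re-saved model really is on the CPU is to reload it and inspect its parameters. A minimal sketch, using a stand-in `nn.Linear` rather than a real MACE model, with hypothetical file names:

```python
import torch
import torch.nn as nn

# Stand-in for the real model; any nn.Module behaves the same way here
model = nn.Linear(4, 4).to("cpu")
torch.save(model, "model_cpu.pt")

# weights_only=False is needed on newer PyTorch to unpickle a full Module
reloaded = torch.load("model_cpu.pt", map_location="cpu", weights_only=False)
assert all(p.device.type == "cpu" for p in reloaded.parameters())
```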


wcwitt commented on August 28, 2024

That was my first guess too, based on the CUDA in the trace. Try @ilyes319's suggestion and we can go from there.

@zakmachachi, just a warning: if you have access to a decent GPU, I predict you will prefer using that for MD. The CPU build of LAMMPS can't really compete (yet) in performance for most use cases. Feel free to email me at [email protected] if you want to discuss any details you'd rather not post here.


zakmachachi commented on August 28, 2024

Hey, thanks for the swift reply both!

So I used this script to move the model from GPU to CPU:

import torch

# Load the GPU-trained model onto the CPU and re-save it
model = torch.load(
    path,
    map_location=torch.device("cpu"),
)
model_cpu = model.to("cpu")
torch.save(model_cpu, path_cpu)

And it worked! But now I get the following error:

LAMMPS (22 Dec 2022)
terminate called after throwing an instance of 'c10::Error'
  what():  PytorchStreamReader failed locating file constants.pkl: file not found
Exception raised from valid at ../caffe2/serialize/inline_container.cc:177 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x2af16ceb856e in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x2af16ce82f18 in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 0x8e (0x2af175c3cc4e in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamReader::getRecordID(std::string const&) + 0x46 (0x2af175c3cdd6 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::getRecord(std::string const&) + 0x45 (0x2af175c3ce85 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::readArchiveAndTensors(std::string const&, std::string const&, std::string const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::string const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) + 0xa5 (0x2af176cfe5e5 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4d20507 (0x2af176ce9507 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x4d2324b (0x2af176cec24b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #8: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x3a2 (0x2af176cedc82 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #9: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>) + 0x7b (0x2af176cee39b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #10: torch::jit::load(std::string const&, c10::optional<c10::Device>) + 0xa5 (0x2af176cee475 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #11: /work/e89/e89/zem/lammps/build/lmp() [0x7bc928]
frame #12: /work/e89/e89/zem/lammps/build/lmp() [0x31a2dd]
frame #13: /work/e89/e89/zem/lammps/build/lmp() [0x3117c4]
frame #14: /work/e89/e89/zem/lammps/build/lmp() [0x31ceaf]
frame #15: /work/e89/e89/zem/lammps/build/lmp() [0x2fe6ad]
frame #16: __libc_start_main + 0xea (0x2af19118c34a in /lib64/libc.so.6)
frame #17: /work/e89/e89/zem/lammps/build/lmp() [0x2fe5ca]
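For reference, the `constants.pkl: file not found` failure typically means the file handed to LAMMPS is a plain `torch.save` pickle archive rather than a TorchScript archive: the C++ side loads models via `torch::jit::load`, which only understands archives written by a jit-compiled model's `.save()`. A minimal sketch reproducing the distinction, using a stand-in `nn.Linear` and temporary paths:

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(2, 2)
workdir = tempfile.mkdtemp()
eager_path = os.path.join(workdir, "eager.pt")
scripted_path = os.path.join(workdir, "scripted.pt")

# torch.save writes a pickle-based archive with no constants.pkl inside
torch.save(model, eager_path)

# a TorchScript .save() writes the archive layout torch::jit::load expects
torch.jit.script(model).save(scripted_path)

torch.jit.load(scripted_path)  # loads fine
try:
    torch.jit.load(eager_path)  # fails: archive has no TorchScript payload
except RuntimeError:
    pass
```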

Some more info:

  • I trained the model on a GPU on a different cluster.
  • I compiled the model on a GPU on that same cluster (to get the .pt file), converted it to CPU there using the script above, copied it over to Archer2, and got the error above.
  • I then tried the same process, except converting to CPU on Archer2 this time, and still got the same error.
  • So I went back to the cluster, took the .model file, and serialized it on Archer2 using the following script:
from e3nn.util import jit
import sys
import torch
from mace.calculators import LAMMPS_MACE

model_path = sys.argv[1]  # takes model name as command-line input
model = torch.load(model_path, map_location=torch.device("cpu"))
lammps_model = LAMMPS_MACE(model)
lammps_model_compiled = jit.compile(lammps_model)
lammps_model_compiled.save(model_path + "-lammps.pt")

And got the following error from LAMMPS:

LAMMPS (22 Dec 2022)
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed because of errno 2 on fopen: , file path: /work/e89/e89/zem/MACE_C_Potential/C_MACE_GAP-17_CPU.pt
Exception raised from RAIIFile at ../caffe2/serialize/file_adapter.cc:21 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x2ac9493e756e in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x2ac9493b1f18 in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #2: caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::string const&) + 0x124 (0x2ac952170634 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::FileAdapter::FileAdapter(std::string const&) + 0x2e (0x2ac95217068e in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x5a (0x2ac95216eada in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x2a5 (0x2ac95321cb85 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #6: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>) + 0x7b (0x2ac95321d39b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #7: torch::jit::load(std::string const&, c10::optional<c10::Device>) + 0xa5 (0x2ac95321d475 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #8: /work/e89/e89/zem/lammps/build/lmp() [0x7bc928]
frame #9: /work/e89/e89/zem/lammps/build/lmp() [0x31a2dd]
frame #10: /work/e89/e89/zem/lammps/build/lmp() [0x3117c4]
frame #11: /work/e89/e89/zem/lammps/build/lmp() [0x31ceaf]
frame #12: /work/e89/e89/zem/lammps/build/lmp() [0x2fe6ad]
frame #13: __libc_start_main + 0xea (0x2ac96d6bb34a in /lib64/libc.so.6)
frame #14: /work/e89/e89/zem/lammps/build/lmp() [0x2fe5ca]

srun: error: nid001643: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=3227130.0
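For reference, `errno 2` from `fopen` is ENOENT, i.e. LAMMPS never found a file at that path at all; note that the script above saves to `model_path + "-lammps.pt"`, so the filename in the LAMMPS input may simply be stale. A minimal pre-flight check (the path below is copied from the error message, used illustratively):

```python
import os

import torch

# Hypothetical path, copied from the LAMMPS error message above
path = "/work/e89/e89/zem/MACE_C_Potential/C_MACE_GAP-17_CPU.pt"

if not os.path.isfile(path):
    # This is exactly the errno 2 / ENOENT case LAMMPS reports
    print(f"missing: {path}")
else:
    # If this succeeds, the file is a valid TorchScript archive on CPU
    torch.jit.load(path, map_location="cpu")
    print(f"ok: {path} loads as TorchScript")
```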

I guess the obvious thing here is to train and compile on Archer2, but I chose the other cluster because it has some fancy RTX cards that were not running out of memory during training. Archer2 sadly has no GPUs, so training there is an issue.

@wcwitt Have you set up GPU MACE runs for LAMMPS? I think this could be an interesting approach, as this swapping between CPU and GPU compilations is a bit messy on my side!


ilyes319 commented on August 28, 2024

@wcwitt @zakmachachi Can we close this issue? Is there a fix somewhere?


wcwitt commented on August 28, 2024

We've been emailing about it in combination with some other things. Let's leave it open for a bit longer and I'll post once it's ready to close.


ilyes319 commented on August 28, 2024

Sure, thank you!

