
Comments (7)

ilyes319 commented on August 28, 2024

Hey,

I am tagging @wcwitt, who will be able to help in more detail.

from mace.

ilyes319 commented on August 28, 2024

Hey,

Looking at the trace, it seems that you are trying to run a model compiled on a GPU on the CPU on Archer2. Could you check that the model you are loading is saved on the CPU? To do so, you could do:

  model = torch.load(path)
  model_cpu = model.to("cpu")
  torch.save(model_cpu, path_cpu)

and then load the model_cpu through your setup.
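For what it's worth, a quick way to confirm the re-saved model really is on the CPU is to reload it and inspect its parameters. A minimal sketch, using a stand-in `nn.Linear` rather than a real MACE model, with hypothetical file names:

```python
import torch
import torch.nn as nn

# Stand-in for the real model; any nn.Module behaves the same way here
model = nn.Linear(4, 4).to("cpu")
torch.save(model, "model_cpu.pt")

# weights_only=False is needed on newer PyTorch to unpickle a full Module
reloaded = torch.load("model_cpu.pt", map_location="cpu", weights_only=False)
assert all(p.device.type == "cpu" for p in reloaded.parameters())
```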


wcwitt commented on August 28, 2024

That was my first guess too, based on the CUDA in the trace. Try @ilyes319's suggestion and we can go from there.

@zakmachachi, just a warning: if you have access to a decent GPU, I predict you will prefer using that for MD. The CPU build of LAMMPS can't really compete (yet) in performance for most use cases. Feel free to email me at [email protected] if you want to discuss any details you'd rather not post here.


zakmachachi commented on August 28, 2024

Hey, thanks for the swift reply both!

So I used this script to move the model from GPU to CPU:

import torch

# Load the GPU-trained model onto the CPU and re-save it
model = torch.load(
    path,
    map_location=torch.device("cpu"),
)
model_cpu = model.to("cpu")
torch.save(model_cpu, path_cpu)

And it worked! But now I get the following error:

LAMMPS (22 Dec 2022)
terminate called after throwing an instance of 'c10::Error'
  what():  PytorchStreamReader failed locating file constants.pkl: file not found
Exception raised from valid at ../caffe2/serialize/inline_container.cc:177 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x2af16ceb856e in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x2af16ce82f18 in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 0x8e (0x2af175c3cc4e in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamReader::getRecordID(std::string const&) + 0x46 (0x2af175c3cdd6 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::getRecord(std::string const&) + 0x45 (0x2af175c3ce85 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::readArchiveAndTensors(std::string const&, std::string const&, std::string const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::string const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) + 0xa5 (0x2af176cfe5e5 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4d20507 (0x2af176ce9507 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x4d2324b (0x2af176cec24b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #8: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x3a2 (0x2af176cedc82 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #9: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>) + 0x7b (0x2af176cee39b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #10: torch::jit::load(std::string const&, c10::optional<c10::Device>) + 0xa5 (0x2af176cee475 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #11: /work/e89/e89/zem/lammps/build/lmp() [0x7bc928]
frame #12: /work/e89/e89/zem/lammps/build/lmp() [0x31a2dd]
frame #13: /work/e89/e89/zem/lammps/build/lmp() [0x3117c4]
frame #14: /work/e89/e89/zem/lammps/build/lmp() [0x31ceaf]
frame #15: /work/e89/e89/zem/lammps/build/lmp() [0x2fe6ad]
frame #16: __libc_start_main + 0xea (0x2af19118c34a in /lib64/libc.so.6)
frame #17: /work/e89/e89/zem/lammps/build/lmp() [0x2fe5ca]
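For reference, the `constants.pkl: file not found` failure typically means the file handed to LAMMPS is a plain `torch.save` pickle archive rather than a TorchScript archive: the C++ side loads models via `torch::jit::load`, which only understands archives written by a jit-compiled model's `.save()`. A minimal sketch reproducing the distinction, using a stand-in `nn.Linear` and temporary paths:

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(2, 2)
workdir = tempfile.mkdtemp()
eager_path = os.path.join(workdir, "eager.pt")
scripted_path = os.path.join(workdir, "scripted.pt")

# torch.save writes a pickle-based archive with no constants.pkl inside
torch.save(model, eager_path)

# a TorchScript .save() writes the archive layout torch::jit::load expects
torch.jit.script(model).save(scripted_path)

torch.jit.load(scripted_path)  # loads fine
try:
    torch.jit.load(eager_path)  # fails: archive has no TorchScript payload
except RuntimeError:
    pass
```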

Some more info:

  • I trained the model on a GPU on a different cluster.
  • I compiled the model on a GPU on that same cluster (to get the .pt file), converted it to CPU there using the script above, copied it over to Archer2, and got the error above.
  • I then tried the same process, except converting to CPU on Archer2 this time, and still got the same error.
  • So I went back to the cluster, took the .model file, and serialized it on Archer2 using the following script:
from e3nn.util import jit
import sys
import torch
from mace.calculators import LAMMPS_MACE

model_path = sys.argv[1]  # takes model name as command-line input
model = torch.load(model_path, map_location=torch.device("cpu"))
lammps_model = LAMMPS_MACE(model)
lammps_model_compiled = jit.compile(lammps_model)
lammps_model_compiled.save(model_path + "-lammps.pt")

And got the following error from LAMMPS:

LAMMPS (22 Dec 2022)
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed because of errno 2 on fopen: , file path: /work/e89/e89/zem/MACE_C_Potential/C_MACE_GAP-17_CPU.pt
Exception raised from RAIIFile at ../caffe2/serialize/file_adapter.cc:21 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x2ac9493e756e in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x2ac9493b1f18 in /work/e89/e89/zem/libtorch/lib/libc10.so)
frame #2: caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::string const&) + 0x124 (0x2ac952170634 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::FileAdapter::FileAdapter(std::string const&) + 0x2e (0x2ac95217068e in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x5a (0x2ac95216eada in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x2a5 (0x2ac95321cb85 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #6: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::string const&, c10::optional<c10::Device>) + 0x7b (0x2ac95321d39b in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #7: torch::jit::load(std::string const&, c10::optional<c10::Device>) + 0xa5 (0x2ac95321d475 in /work/e89/e89/zem/libtorch/lib/libtorch_cpu.so)
frame #8: /work/e89/e89/zem/lammps/build/lmp() [0x7bc928]
frame #9: /work/e89/e89/zem/lammps/build/lmp() [0x31a2dd]
frame #10: /work/e89/e89/zem/lammps/build/lmp() [0x3117c4]
frame #11: /work/e89/e89/zem/lammps/build/lmp() [0x31ceaf]
frame #12: /work/e89/e89/zem/lammps/build/lmp() [0x2fe6ad]
frame #13: __libc_start_main + 0xea (0x2ac96d6bb34a in /lib64/libc.so.6)
frame #14: /work/e89/e89/zem/lammps/build/lmp() [0x2fe5ca]

srun: error: nid001643: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=3227130.0
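For reference, `errno 2` from `fopen` is ENOENT, i.e. LAMMPS never found a file at that path at all; note that the script above saves to `model_path + "-lammps.pt"`, so the filename in the LAMMPS input may simply be stale. A minimal pre-flight check (the path below is copied from the error message, used illustratively):

```python
import os

import torch

# Hypothetical path, copied from the LAMMPS error message above
path = "/work/e89/e89/zem/MACE_C_Potential/C_MACE_GAP-17_CPU.pt"

if not os.path.isfile(path):
    # This is exactly the errno 2 / ENOENT case LAMMPS reports
    print(f"missing: {path}")
else:
    # If this succeeds, the file is a valid TorchScript archive on CPU
    torch.jit.load(path, map_location="cpu")
    print(f"ok: {path} loads as TorchScript")
```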

I guess the obvious thing here is to train and compile on Archer2, but I chose the other cluster because it has some fancy RTX cards that were not running out of memory during training. Archer2 sadly has no GPUs, so training there is an issue.

@wcwitt Have you set up GPU MACE runs for LAMMPS? I think this could be an interesting approach, as this swapping between CPU and GPU compilations is a bit messy on my side!


ilyes319 commented on August 28, 2024

@wcwitt @zakmachachi Can we close this issue? Is there a fix somewhere?


wcwitt commented on August 28, 2024

We've been emailing about it in combination with some other things. Let's leave it open for a bit longer and I'll post once it's ready to close.


ilyes319 commented on August 28, 2024

Sure, thank you!

