Comments (22)

davkovacs commented on August 28, 2024

@peastman we have significantly improved the support for large datasets. Please have a look at the multi-GPU branch. It no longer creates huge datasets during preprocessing, and the memory requirements should also be even smaller. For me, a preprocessed SPICE is now less than 8 GB. If you need any help, let me know!

ilyes319 commented on August 28, 2024

Hi @JonathanSchmidt1,

Thank you for your message! I think there are several sides to this:

  • Enable parsing of the statistics to avoid running out of memory
  • Add multi-processing to the dataloader, for faster and more memory-efficient loading and pre-processing (a minimal sketch follows this list)
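
For the second point, a rough sketch of what multi-process loading could look like with a standard torch_geometric DataLoader (the toy dataset, batch size, and worker count here are just placeholders):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# Toy stand-in dataset: a plain list of graph objects works as a dataset.
train_dataset = [Data(pos=torch.rand(5, 3), energy=torch.rand(1)) for _ in range(100)]

# num_workers > 0 moves parsing and batch collation into background worker
# processes, so the main process only holds a few prefetched batches at a time.
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

for batch in train_loader:
    pass  # training step would go here
```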

Could you give an estimate of the size of the dataset you want to load?

JonathanSchmidt1 commented on August 28, 2024

Hi,
the solutions sound good. At the moment we are thinking about ~35M structures, with maybe 12 atoms on average, and the corresponding forces and stresses.
best,
Jonathan

peastman commented on August 28, 2024

I just ran into the same problem. I'm trying to fit a MACE model to the SPICE dataset, which is moderately large but not huge. About 1.1 million conformations, an average of around 40 atoms per molecule, with energies and forces. Converted to xyz format it's 3.9 GB. When it tried to load the dataset, it filled up all the memory in my computer (32 GB), then the computer hung and had to be turned off with the power switch.

In my work with TorchMD-Net, I developed an HDF5-based dataset format for dealing with this problem. It transparently pages data into memory as needed. Would something similar be helpful here?
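
Roughly, the idea looks like this (a minimal sketch using h5py and a plain torch Dataset, not the actual TorchMD-Net code; the field names are illustrative):

```python
import h5py
import torch
from torch.utils.data import Dataset


class LazyHDF5Dataset(Dataset):
    """Reads one conformation at a time from an HDF5 file instead of loading
    the whole dataset into RAM."""

    def __init__(self, path):
        self.path = path
        self._file = None  # opened lazily, once per dataloader worker
        with h5py.File(path, "r") as f:
            self._len = f["energies"].shape[0]

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        return {
            "positions": torch.from_numpy(self._file["positions"][idx]),
            "forces": torch.from_numpy(self._file["forces"][idx]),
            "energy": torch.tensor(self._file["energies"][idx]),
        }
```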

gabor1 commented on August 28, 2024

The question is whether we should push multi-GPU, multi-node training, assuming that large datasets take a long time to train anyway.

peastman commented on August 28, 2024

Multi-GPU is nice if you happen to have them, but it isn't necessary. A single GPU can train models on large datasets. It just takes longer.

In my case it never got as far as using the GPU. It ran out of host memory first.

ilyes319 commented on August 28, 2024

Hi @peastman!

So you did not get any error message but just a crash? I guess HDF5 is a good option for that. In your code, is it interfaced with a torch dataloader?

peastman commented on August 28, 2024

> So you did not get any error message but just a crash?

It went deeply into swap space and started thrashing, which caused the computer to become unresponsive.

> In your code, is it interfaced with a torch dataloader?

It subclasses torch_geometric.data.Dataset.

I created a reduced version of the dataset with less than 10% of the data, so I could analyze where exactly the memory was being used. I noted the total memory used by the process at various points in run_train.py.

Immediately before the call to get_dataset_from_xyz(): 300 MB
Immediately after it returns: 2.9 GB
After creating train_loader and valid_loader: 6.1 GB
After creating the model: 8.2 GB
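
(These figures are the process's resident set size; anyone wanting to reproduce them can check it at each point with something like the snippet below, where psutil is just one convenient option.)

```python
import os
import psutil


def rss_gb() -> float:
    """Resident set size of the current process, in GB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1e9


print(f"RSS: {rss_gb():.2f} GB")
```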

It looks like there are multiple bottlenecks in memory use. Loading the raw data with get_dataset_from_xyz() would take over 30 GB for the full dataset. Then when it creates the AtomicData objects, that more than doubles the memory use. Addressing this would involve two steps:

  1. Load the raw data in a way that doesn't require everything to be in memory at once.
  2. Create the AtomicData objects as they're needed and then discard them again, rather than building them all in advance (see the sketch after this list).
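
A rough sketch of the second step, assuming the raw configurations are kept as lightweight objects and using a hypothetical config_to_atomic_data() callable that stands in for MACE's actual conversion:

```python
from torch_geometric.data import Dataset


class OnTheFlyAtomicDataset(Dataset):
    """Holds only lightweight raw configurations and builds the much larger
    graph objects per item, so they can be garbage-collected after each batch."""

    def __init__(self, configs, config_to_atomic_data):
        super().__init__()
        self.configs = configs  # raw positions / species / labels
        self.config_to_atomic_data = config_to_atomic_data  # hypothetical converter

    def len(self):
        return len(self.configs)

    def get(self, idx):
        # Built on demand and never cached, so memory use stays roughly constant.
        return self.config_to_atomic_data(self.configs[idx])
```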

davkovacs commented on August 28, 2024

@peastman Have you also considered using LMDB instead of HDF5? I would like to implement a new data loader that can efficiently deal with arbitrarily large datasets, and was wondering if, in your experience, HDF5 is better/faster. I am currently leaning towards LMDB, but have not benchmarked them exhaustively yet.

peastman commented on August 28, 2024

I'd never heard of LMDB before. It looks like something very different. LMDB is a database system, while HDF5 is just a file format that's designed to allow efficient read access. It's possible a full database would have advantages, but it's going to be a lot harder for users. There's good support for HDF5 in just about every language, and it takes minimal code to build a file. Users are also much more likely to be familiar with it already.

As for performance, HDF5 is working great for me. I have no trouble handling large datasets and I can get GPU utilization close to 100%.
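
To give a sense of the "minimal code" point above, building a usable file with h5py is only a few lines (the layout below is just an illustration, not the schema TorchMD-Net actually uses):

```python
import h5py
import numpy as np

# Illustrative layout: one group per molecule, arrays covering its conformations.
with h5py.File("dataset.hdf5", "w") as f:
    mol = f.create_group("molecule_0")
    mol.create_dataset("atomic_numbers", data=np.array([8, 1, 1]))       # O, H, H
    mol.create_dataset("positions", data=np.random.rand(10, 3, 3))       # (conf, atom, xyz)
    mol.create_dataset("energies", data=np.random.rand(10))
    mol.create_dataset("forces", data=np.random.rand(10, 3, 3))
```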

davkovacs commented on August 28, 2024

@peastman Could you perhaps try this pull request? I had a go at implementing the on-the-fly data loading and statistics parsing. Let me know if you have any questions; instructions are in the README.

#73

peastman commented on August 28, 2024

It moves the problem to a new place without fixing it. When I run the preprocess_data.py script I get the same behavior as before. The memory used by the process gradually increases until it fills up all available memory. Then the computer hangs and has to be shut down with the power button.

davkovacs commented on August 28, 2024

Sorry, I was under the assumption that there are hundreds of GB of CPU RAM available and that it is only the GPU RAM that is limited. So I moved the whole preprocessing to the CPU, and the GPU should read the preprocessed data from the HDF5 file one batch at a time.

I can try to implement a modified low-memory version of the preprocessing that processes one config at a time and writes it to disk before going to the next.
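
A rough sketch of that idea, using ASE's iread to stream one frame at a time (the file names, HDF5 layout, and property keys are illustrative only):

```python
import h5py
from ase.io import iread  # yields one frame at a time instead of reading the whole file

# Stream the xyz file and write each configuration to HDF5 before reading the
# next one, so only a single configuration is ever held in memory.
with h5py.File("preprocessed.hdf5", "w") as out:
    for i, atoms in enumerate(iread("train.xyz", format="extxyz")):
        grp = out.create_group(f"config_{i}")
        grp.create_dataset("atomic_numbers", data=atoms.get_atomic_numbers())
        grp.create_dataset("positions", data=atoms.get_positions())
        if "energy" in atoms.info:          # key depends on how the xyz was written
            grp.create_dataset("energy", data=atoms.info["energy"])
        if "forces" in atoms.arrays:
            grp.create_dataset("forces", data=atoms.arrays["forces"])
```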

peastman commented on August 28, 2024

My computer only has 32 GB of RAM.

davkovacs commented on August 28, 2024

I have attempted a fix, and I really hope it will work. Please use this branch and see the example in the README:

https://github.com/davkovacs/mace/tree/on_the_fly_dataloading

peastman commented on August 28, 2024

Thanks! It's running now. The resident size climbed to about 12 GB and then stopped increasing. I'll let you know what happens.

peastman commented on August 28, 2024

Success! I started from a dataset that's about 1 GB in the HDF5 format used by TorchMD-Net. Converting it to xyz format increased it to just under 4 GB. preprocess_data.py ran for over 3.5 hours and produced a pair of files that totaled about 56 GB(!). But training now seems to be working. It's been running for about 12 hours, and the loss is gradually decreasing.

What exactly is the loss function? And is there any way I can tell what epoch it's on?

davkovacs commented on August 28, 2024

Great to hear!
It should have created a logs directory which contains a file that logs the validation loss (every 2 epochs by default). There is another folder called results which contains a file that logs the loss for each batch.

For the precise form of the loss function, see Appendix 5.1 of the paper:
https://arxiv.org/pdf/2206.07697.pdf
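
Roughly speaking (the appendix gives the exact weights and normalisation), it is a weighted sum of energy and force mean squared errors, of this form:

```latex
\mathcal{L} =
  \frac{\lambda_E}{B} \sum_{b=1}^{B} \left( \frac{\hat{E}_b - E_b}{N_b} \right)^{2}
+ \frac{\lambda_F}{3B} \sum_{b=1}^{B} \frac{1}{N_b}
  \sum_{i=1}^{N_b} \sum_{\alpha=1}^{3} \left( \hat{F}_{i\alpha,b} - F_{i\alpha,b} \right)^{2}
```

where B is the number of configurations in a batch, N_b the number of atoms in configuration b, and lambda_E, lambda_F the energy and force weights.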

peastman commented on August 28, 2024

I guess that means it hasn't completed two epochs yet. I'll keep watching.

JonathanSchmidt1 commented on August 28, 2024

That's great to hear! Is the multi-GPU branch operational already?

davkovacs commented on August 28, 2024

We still have some debugging to do for multi-GPU training, but it works for training on a single GPU.

peastman commented on August 28, 2024

It works great, thanks! I have a training run going right now. I'm also looking forward to multi-GPU support.
