Hi all, I am trying to build a MMP-DB with 10M compounds. But I got

Memory error with mmpdb fragment for large dataset

rdkit commented on September 23, 2024

Comments (4)

KramerChristian commented on September 23, 2024

Dear Cheng,

I suppose that the error you observe comes from running out of memory. 10M compounds is a very large dataset for mmpdb. Could you try to run the mmpdb fragmentation with just the first 10K compounds and check whether you get the same error?

Overall, mmpdb is not designed for datasets this large. The largest set for which I have successfully created a DB using standard mmpdb was roughly 1M compounds (using 256 GB of RAM). However, at that DB size, queries within the DB take quite a long time.

If you really want to fragment 10M compounds, you could just cut your input .smi file into smaller chunks, distribute the fragmentation for each of the chunks on a cluster, and cat the results back together (after removing the header).
Indexing in the current version of mmpdb is not parallelized, and your workflow for 10M compounds will likely fail here due to running out of memory.
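
The chunking step suggested above could be sketched roughly as follows. This is only an illustration, not part of mmpdb: the function name `split_smiles_file`, the chunk file naming, and the default chunk size are hypothetical, and it assumes the input .smi file has one record per line and no header line.

```python
from pathlib import Path


def split_smiles_file(src, out_dir, lines_per_chunk=1_000_000):
    """Split a SMILES file into numbered chunk files and return their paths.

    Assumes one record per line and no header line in the input
    (hypothetical helper, not an mmpdb function).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    with open(src) as infile:
        for i, line in enumerate(infile):
            # Start a new chunk file every `lines_per_chunk` records.
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"chunk_{len(chunk_paths):04d}.smi"
                chunk_paths.append(path)
                out = open(path, "w")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk could then be fragmented as a separate cluster job, for example with something like `mmpdb fragment chunk_0000.smi -o chunk_0000.fragments`, before the per-chunk outputs are combined as described above.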

The indexing step can be parallelized, but this is very involved and requires a rewrite of the code.

I hope this clarifies the situation. Please let me know whether fragmentation works for you with smaller datasets.

Bests,
Christian

chengthefang commented on September 23, 2024

Hi Christian,

Thank you for your prompt response. Yes, I ran the mmpdb command on 1M compounds, and everything worked fine.

So I think I can split 10M compounds into 10 chunks and fragment each one, and combine them together. That's a good way. Thanks!

Regarding the indexing step, I am planning to apply some constraints to reduce the size, as you suggested in an old post (#6): for example, setting --max-radius to 3 and --min-heavies-per-const-frag to 3, and turning on the --smallest-transformation-only flag. Are there any other suggestions for controlling the size of the generated DB without losing meaningful transforms?

Thanks,
Cheng

KramerChristian commented on September 23, 2024

Dear Cheng,

I do not want to be demotivating, and I do not know how much memory you have available, but I have strong doubts that you will be able to index a DB with 10M compounds, even with the most restrictive settings. If you want to reduce the DB size, you can also restrict the fragment size to very small fragments. However, you will always have a tradeoff between creating a manageable DB size and losing interesting pairs.

Bests,
Christian

chengthefang commented on September 23, 2024

Dear Christian,

Thank you for your comments. I agree that there is always a tradeoff between the size of DB and the pair information. I will do some tests to see how it works.

Thanks,
Cheng
