
Comments (4)

KramerChristian commented on September 23, 2024

Dear Cheng,

I suppose that the error you observe comes from running out of memory. 10M compounds is a very large dataset for mmpdb. Could you try to run the mmpdb fragmentation with just the first 10K compounds and check whether you get the same error?

Overall, mmpdb is not designed to run with such large datasets. The largest set for which I successfully created a DB using standard mmpdb was roughly in the range of 1M compounds (using 256 GB of RAM). However, with that DB size, queries within the DB take quite a long time.

If you really want to fragment 10M compounds, you could just cut your input .smi file into smaller chunks, distribute the fragmentation for each of the chunks on a cluster, and cat the results back together (after removing the header).
Indexing in the current version of mmpdb is not parallelized, and your workflow for 10M compounds will likely fail here due to running out of memory.
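
The chunk-and-concatenate idea for the fragmentation step could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the function names, the chunk file naming, and the `header_lines` parameter are hypothetical, the input .smi file is assumed to have one compound per line and no header, and the fragment output is assumed to be a plain text file whose first lines form the header. How many header lines there are needs to be checked against the actual output of your mmpdb version.

```python
from itertools import islice
from pathlib import Path


def split_smiles_file(src, out_dir, chunk_size):
    """Split a one-compound-per-line .smi file into chunk_size-line files.

    Assumes the input has no header line (hypothetical helper, not part of mmpdb).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(src) as handle:
        while True:
            # Read the next chunk_size lines without loading the whole file.
            lines = list(islice(handle, chunk_size))
            if not lines:
                break
            path = out_dir / f"chunk_{len(chunk_paths):04d}.smi"
            path.write_text("".join(lines))
            chunk_paths.append(path)
    return chunk_paths


def concat_skipping_headers(parts, dest, header_lines):
    """Concatenate text files, keeping the header of the first part only.

    header_lines is an assumption: verify it against the real fragment files.
    """
    with open(dest, "w") as out:
        for index, part in enumerate(parts):
            with open(part) as handle:
                for line_no, line in enumerate(handle):
                    if index > 0 and line_no < header_lines:
                        continue  # drop the repeated header
                    if not line.endswith("\n"):
                        line += "\n"  # keep parts from running together
                    out.write(line)
```

Each chunk file would then be fragmented as a separate cluster job, and the per-chunk outputs passed to `concat_skipping_headers` in order. Note that this only helps with fragmentation; the memory limit of the indexing step is unchanged.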

The indexing step can be parallelized, but this is very involved and requires a rewrite of the code.

I hope this clarifies the situation. Please let me know whether fragmentation works for you with smaller datasets.

Bests,
Christian

from mmpdb.

chengthefang commented on September 23, 2024

Hi Christian,

Thank you for your prompt response. Yes, I ran the mmpdb command for 1M compounds, and everything worked fine.

So I think I can split the 10M compounds into 10 chunks, fragment each one, and then combine the results. That's a good approach. Thanks!

Regarding the indexing step, I am planning to add some constraints to reduce the size, as you suggested in an old post (#6): for example, setting --max-radius to 3 and --min-heavies-per-const-frag to 3, and turning on the --smallest-transformation-only flag. Are there any other suggestions for controlling the size of the generated DB without losing meaningful transforms?
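
For reference, the settings above could be assembled into an index command along these lines. This is a hedged sketch: the helper name and the file names are hypothetical, and the positional fragments argument and the `-o` output option are assumptions that should be confirmed with `mmpdb index --help` for the installed version. Only the three restriction flags come from this thread.

```python
import shlex


def build_index_command(fragments_path, output_path, max_radius=3,
                        min_heavies_per_const_frag=3,
                        smallest_transformation_only=True):
    """Build (but do not run) an mmpdb index command as an argument list."""
    cmd = [
        "mmpdb", "index", str(fragments_path),
        "-o", str(output_path),  # assumed output option, check --help
        "--max-radius", str(max_radius),
        "--min-heavies-per-const-frag", str(min_heavies_per_const_frag),
    ]
    if smallest_transformation_only:
        cmd.append("--smallest-transformation-only")
    return cmd


# Hypothetical file names, printed as a shell-ready string for inspection.
print(shlex.join(build_index_command("all.fragments", "all.mmpdb")))
```

The list can be handed to `subprocess.run` once the options have been verified.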

Thanks,
Cheng


KramerChristian commented on September 23, 2024

Dear Cheng,

I do not want to be discouraging, and I do not know how much memory you have available, but I have strong doubts that you will be able to index a DB with 10M compounds, even with the most restrictive settings. If you want to reduce the DB size, you can also restrict the fragment size to very small fragments. However, you will always have a tradeoff between creating a manageable DB size and losing interesting pairs.

Bests,
Christian


chengthefang commented on September 23, 2024

Dear Christian,

Thank you for your comments. I agree that there is always a tradeoff between the size of the DB and the pair information it retains. I will run some tests to see how it works.

Thanks,
Cheng
