Comments (4)
Dear Cheng,
I suppose that the error you observe comes from running out of memory. 10M compounds is a very large dataset for mmpdb. Could you try to run the mmpdb fragmentation with just the first 10K compounds and check whether you get the same error?
Overall, mmpdb is not made to run with such large datasets. The largest set for which I successfully created a DB using standard mmpdb was roughly in the range of 1M compounds (using 256 GB of RAM). However, with that DB size, queries within the DB take quite a long time.
If you really want to fragment 10M compounds, you could just cut your input .smi file into smaller chunks, distribute the fragmentation for each of the chunks on a cluster, and cat the results back together (after removing the header).
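The split-and-merge idea above could be sketched in Python roughly as follows. This is only an illustration, not part of mmpdb: the helper names `split_smiles_file` and `concatenate_outputs` are made up here, the input .smi file is assumed to have no header line of its own, and the number of header lines in each fragmentation output (`header_lines`) is an assumption you would need to check against the files your mmpdb version actually writes.

```python
from itertools import islice
from pathlib import Path


def split_smiles_file(src, out_dir, chunk_size):
    """Write consecutive chunks of at most `chunk_size` lines from `src`.

    Hypothetical helper: assumes one compound per line and no header line.
    Returns the chunk paths in input order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with src.open() as handle:
        while True:
            # Read the next block of lines without loading the whole file.
            lines = list(islice(handle, chunk_size))
            if not lines:
                break
            name = f"{src.stem}.{len(chunk_paths):04d}{src.suffix}"
            path = out_dir / name
            path.write_text("".join(lines))
            chunk_paths.append(path)
    return chunk_paths


def concatenate_outputs(parts, dest, header_lines):
    """Concatenate `parts` into `dest`, keeping only the first part's header.

    Hypothetical helper: `header_lines` is the number of leading header
    lines in every part (an assumption; verify it for your output files).
    """
    with open(dest, "w") as out:
        for index, part in enumerate(parts):
            with open(part) as handle:
                if index > 0:
                    # Skip the repeated header of every part after the first.
                    for _ in islice(handle, header_lines):
                        pass
                out.writelines(handle)
    return dest
```

Each chunk written by `split_smiles_file` would then be fragmented as a separate cluster job, and the per-chunk outputs passed to `concatenate_outputs` in the same order.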
Indexing in the current version of mmpdb is not parallelized, and your workflow for 10M compounds will likely fail here due to running out of memory.
The indexing step can be parallelized, but this is very involved and requires a rewrite of the code.
I hope this clarifies the situation. Please let me know whether fragmentation works for you with smaller datasets.
Bests,
Christian
from mmpdb.
Hi Christian,
Thank you for your prompt response. Yes, I ran the mmpdb command for 1M compounds. Everything works fine.
So I think I can split 10M compounds into 10 chunks and fragment each one, and combine them together. That's a good way. Thanks!
Regarding the indexing step, I am planning to add some constraints to reduce the size, as you suggested in an old post (#6): for example, setting --max-radius 3 and --min-heavies-per-const-frag 3, and turning on the --smallest-transformation-only flag. Are there any other suggestions for controlling the size of the generated DB without losing meaningful transforms?
Thanks,
Cheng
Dear Cheng,
I do not want to be demotivating, and I do not know how much memory you have available, but I have strong doubts that you will be able to index a DB with 10M compounds, even with the most restrictive settings. If you want to reduce the DB size, you can also restrict the fragment size to very small fragments. However, you will always have a tradeoff between creating a manageable DB size and losing interesting pairs.
Bests,
Christian
Dear Christian,
Thank you for your comments. I agree that there is always a tradeoff between the size of DB and the pair information. I will do some tests to see how it works.
Thanks,
Cheng