mmpdb's Issues

Obtain list of matched pairs with common core from an ID.

Hi, I've built my MMP database for a set of compounds but am struggling to generate the output I would like.

My use case is a pretty typical one: finding changes that lead to large property changes in compounds obtained from patents.

c1ccccc1O X1
c1ccccc1OC X2
c1ccccc1N X3
c1ccccc1OC1CC1 X4
c1cc(Cl)ccc1O X5

What I'm hoping to do is generate a list of matched pairs for a given processed compound, e.g. X1:

c1ccccc1* X1 *O X2 *OC
c1ccccc1* X1 *O X3 *N
c1ccccc1* X1 *O X4 *OC1CC1
etc.

Is this possible?
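
Roughly, this is the query I have in mind, expressed against the mmpdb SQLite file (just a sketch; the table and column names below are my guesses at the schema and may well need adjusting):

    import sqlite3

    # Sketch only: list every pair that involves compound X1, showing the shared
    # constant (core) and the from/to fragments. Table/column names are guesses.
    db = sqlite3.connect("my_compounds.mmpdb")
    query = """
    SELECT DISTINCT cs.smiles, c1.public_id, rs_from.smiles, c2.public_id, rs_to.smiles
      FROM pair p
      JOIN compound c1         ON c1.id = p.compound1_id
      JOIN compound c2         ON c2.id = p.compound2_id
      JOIN constant_smiles cs  ON cs.id = p.constant_id
      JOIN rule_environment re ON re.id = p.rule_environment_id
      JOIN rule r              ON r.id = re.rule_id
      JOIN rule_smiles rs_from ON rs_from.id = r.from_smiles_id
      JOIN rule_smiles rs_to   ON rs_to.id = r.to_smiles_id
     WHERE c1.public_id = ?
    """
    for row in db.execute(query, ("X1",)):
        print(*row)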

Thanks
Mike

mmpdb transform behaves unexpectedly

The transform rules in mmpdblib appear to miss some obvious cases.

A test case with the following structures:

OC(c(cccc1)c1O)=O	 mol1
CCCCCCCC(c(cc1)cc(C(O)=O)c1O)=O	mol2
CCCCCC(c(cc1)cc(C(O)=O)c1O)=O	mol3

with some properties:

ID	prop
mol1	0.0
mol2	1.0
mol3	1.5

I performed the fragmentation, indexing and property loading as instructed.

python -m mmpdblib fragment test_struct.tsv --max-rotatable-bonds 20 --num-cuts 3 -o test.fragments
python -m mmpdblib index test.fragments -o test.mmpdb
python -m mmpdblib loadprops --properties test_prop.tsv test.mmpdb

The indexed pairs make sense.

However, when I run:

python -m mmpdblib transform --smiles 'OC(c(cccc1)c1O)=O' test.mmpdb --explain

I noticed that I cannot get mol2 or mol3, even though the rules mol1->mol2 and mol1->mol3 are included in the index step. Did I miss something here? Thank you for your help.

Here's the explanation output:

WARNING: APSW not installed. Falling back to Python's sqlite3 module.
Processing fragment Fragmentation(1, 'N', 7, '1', '*c1ccccc1O', '0', 3, '1', '*C(=O)O', 'O=CO')
  variable '*c1ccccc1O' not found as SMILES '[*:1]c1ccccc1O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 3, '1', '*C(=O)O', '0', 7, '1', '*c1ccccc1O', 'Oc1ccccc1')
  variable '*C(=O)O' not found as SMILES '[*:1]C(=O)O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(2, 'N', 6, '11', '*c1ccccc1*', '01', 4, '12', '*C(=O)O.*O', None)
  variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
  variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 1, '1', '*O', '0', 9, '1', '*c1ccccc1C(=O)O', 'O=C(O)c1ccccc1')
  variable '*O' not found as SMILES '[*:1]O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 9, '1', '*c1ccccc1C(=O)O', '0', 1, '1', '*O', 'O')
  variable '*c1ccccc1C(=O)O' not found as SMILES '[*:1]c1ccccc1C(=O)O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(2, 'N', 6, '11', '*c1ccccc1*', '01', 4, '12', '*O.*C(=O)O', None)
  variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
  variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
  No matching rule SMILES found. Skipping fragment.
== Product SMILES in database: 0 ==
ID      SMILES  prop_from_smiles        prop_to_smiles  prop_radius     prop_fingerprint        prop_rule_environment_id        prop_count      prop_avg        prop_std        prop_kurtosis   prop_skewness   prop_min        prop_q1 prop_median     prop_q3 prop_max        prop_paired_t   prop_p_value

Open question: chemical formula equivalent to mmpdb

This isn't so much a bug report as an open question: is there any software like mmpdb that works strictly off chemical formulas (as opposed to chemical structures)? I am working with mass spectrometry data and could very much use such functionality. Thank you in advance.

Support for CXSMILES

Currently mmpdb seems to support only SMILES, but RDKit can natively handle CXSMILES. Is it possible to extend mmpdb to support CXSMILES?

For our work, we were initially using CXSMILES with a comma delimiter. Using a csvwriter we enclosed the CXSMILES in double quotes so that a csvreader would know not to split on the commas inside the quotes. But as it turns out, mmpdb just uses Python's split() method, which does not ignore commas inside quotes, so this doesn't work for CXSMILES.
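
For reference, Python's standard csv module handles the quoting we were relying on; a minimal sketch of the behaviour we expected (the record below is only an illustration):

    import csv

    # A CXSMILES record whose extension block contains a comma, protected by quotes.
    line = '"CC(N)C(=O)O |a:1,3|",mol001'

    # str.split(",") breaks the CXSMILES apart at the embedded comma...
    print(line.split(","))

    # ...while csv.reader respects the double quotes and keeps it as one field.
    print(next(csv.reader([line])))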

How to build mmpdb with large data set

Dear developer,

I would like to ask how to build an mmpdb with a large data set.
I tried to build an mmpdb with ChEMBL 28 data. First, I made chunk files from over 1 million SMILES taken from ChEMBL.
Then I made fragment files from the chunk data and merged them into one file.
Finally I ran the mmpdb index command on the merged fragment data, but the process was killed due to lack of memory.
Is there any way to build an mmpdb from such a large set of fragments?
My environment has 32 GB of RAM.
Any advice or suggestions are greatly appreciated.
Thanks,

Taka

multiple smarts in --cut-smarts

I have a large database of 500K compounds and I am interested in finding only a few transforms.
Ideally I would like to give the transform in the form of a SMIRKS.
I understand that it might be easier to ask for a different fragmentation pattern and perform indexing on it.
I can translate the SMIRKS into SMARTS specifying the specific bonds.
For the tool to be useful I would like to be able to provide more than one SMARTS to the --cut-smarts option.
It would also be excellent if an option like --cache allowed using an existing fragmentation file and enhancing it with additional cut patterns.
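
As a partial workaround, two cut patterns can sometimes be folded into a single SMARTS with an atom-level OR, but that quickly becomes unwieldy; a small RDKit sketch of what I mean (the patterns are only illustrative, not real mmpdb cut-SMARTS):

    from rdkit import Chem

    # Two separate cut patterns...
    p1 = Chem.MolFromSmarts("[c]!@[NX3]")
    p2 = Chem.MolFromSmarts("[c]!@[OX2]")
    # ...folded into one pattern using atom-level OR, since --cut-smarts
    # currently accepts only a single SMARTS.
    merged = Chem.MolFromSmarts("[c]!@[NX3,OX2]")

    mol = Chem.MolFromSmiles("COc1ccccc1N(C)C")
    print(mol.GetSubstructMatches(p1), mol.GetSubstructMatches(p2))
    print(mol.GetSubstructMatches(merged))  # matches the union of the two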
Thanks.
marco

Possible SQL injection vulnerability

Hi there,

I work at a contract research organisation (CRO) that has recently been interested in implementing mmpdb as a part of a drug discovery pipeline. A step of this implementation involves checking for potential security issues before it can be installed internally.

This check failed due to two areas in the code where SQL injection vulnerabilities appear. For context, SQL injection is a technique in which a call normally used to execute SQL queries can be abused by a malicious user to perform unintended actions, like exposing sensitive/confidential information in a database or installing malware.

Our IT team flagged two areas of the code where the vulnerabilities appear. Here is the first (in mmpdblib/peewee.py):

    def execute_sql(self, sql, params=None, require_commit=True):
        logger.debug((sql, params))
        with self.exception_wrapper():
            cursor = self.get_cursor()
            try:
                cursor.execute(sql, params or ())
            except Exception as exc:
                if self.get_autocommit() and self.autorollback:
                    self.rollback()
                if self.sql_error_handler(exc, sql, params, require_commit):
                    raise
            else:
                if require_commit and self.get_autocommit():
                    self.commit()
        return cursor

The second is in the same location, so I assume their method of flagging vulnerabilities has picked this same SQL injection problem twice.

Would you be interested in addressing this particular vulnerability? If not, this isn't an immediate problem, as we can address it internally and submit a PR with the fix. Let me know how you'd like to proceed with this, if at all.

Thanks

Incorrect atom mappings for the new generated molecules via transformation

Hi developers,

I have been trying out mmpdb's "transform" function. I found that for 2 and 3 cuts, the newly generated SMILES do not respect the atom mappings in the transform rule.

For example,

Original SMILES                 New SMILES                       from_smiles                  to_smiles
CC(C)c1nc(C(=O)NCc2ccccc2)no1   CC(C)CNC(=O)N1CCC(c2ccccc2)C1    [*:1]CNC(=O)c1noc([*:2])n1   [*:1]CNC(=O)N1CCC([*:2])C1
CC(C)c1nc(C(=O)NCc2ccccc2)no1   CC(C)CNC(=O)c1cc(-c2ccccc2)no1   [*:1]CNC(=O)c1noc([*:2])n1   [*:1]CNC(=O)c1cc([*:2])no1

I expect the transformed linkers (i.e. "to_smiles") to connect to the two unchanged fragments at the same attachment points (*1 & *2) as the old linkers (i.e. "from_smiles"). However, the newly generated molecules (i.e. "New SMILES") have the transformed linker flipped over. In other words, the atom mappings from "from_smiles" to "to_smiles" are correct, but the attachment is wrong in the newly generated whole molecule.

Would you mind taking a look at this issue?

Thanks,
Cheng

Add property information to CSV export

At this point the CSV output generated with mmpdb index --out csv [...] does not contain property information even if you specify a property file with --properties.

If a property file is explicitly given, it would be nice if information such as the property values of compounds 1 and 2, and the property change for the transformation, were included in the resulting CSV file.

How to know the number of pairs & rules from generated .mmpdb file?

Hi all,

I would like to know if there is a function to show the number of pairs & rules from an already generated .mmpdb file. Previously, when I generated the .mmpdb database, the number of pairs and rules was shown on screen, but I would like to review this information without regenerating the database.

I'd appreciate it if anyone can help me figure out a way.

Thanks,
Cheng
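
In case it is useful, this is the kind of thing I'm after, reading the counts straight from the SQLite file (table names are my guesses at the schema; there may also be an mmpdb list command that prints these summary counts, if I remember the docs correctly):

    import sqlite3

    # Sketch: count the rows in the main tables of an existing .mmpdb file.
    # Table names are assumptions; check with ".schema" in the sqlite3 shell.
    db = sqlite3.connect("test_data.mmpdb")
    for table in ("compound", "rule", "rule_environment", "pair"):
        (n,) = db.execute("SELECT count(*) FROM %s" % table).fetchone()
        print(table, n)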

Out of memory when indexing a large fragment file

When I try to execute the mmpdb index command on a large fragments file (>5M structures), Python runs out of memory even on a very big Linux node with >700 GB of RAM.

Can anything be done to process such big databases?

Thanks.

Error when using "--out mmpa"

I'm not sure what "--out mmpa" does, but it gives the following error for the GitHub version:

$ mmpdb index myfile.fragments -o myfile.mmpa --out mmpa
...
  File "mmpdb/mmpdblib/index_writers.py", line 106, in add_environment_fingerprint_parent
    self._W("FINGERPRINT\t%d\t%s\n" % (fp_idx, environment_fingerprint, parent_idx))
TypeError: not all arguments converted during string formatting
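
Looking at the quoted line, the format string has two conversion specifiers but three arguments are supplied; presumably the intended line is something like (an untested guess):

    self._W("FINGERPRINT\t%d\t%s\t%s\n" % (fp_idx, environment_fingerprint, parent_idx))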

cannot fix the AttributeError: module '__main__' has no attribute '__spec__'

Dear developer,

I use VS Code with Miniconda3-4.5.4 (Python 3.6) on Windows.
When I run the command line shown in the "Fragment structures" section of README.md:

mmpdb fragment test_data.smi -o test_data.fragments

I get the error message:

Traceback (most recent call last):
  File "C:/Users/User/miniconda3/envs/mmpdb/Scripts/mmpdb", line 4, in <module>
    __import__('pkg_resources').run_script('mmpdb==2.3.dev1', 'mmpdb')
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\site-packages\pkg_resources\__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\site-packages\pkg_resources\__init__.py", line 1448, in run_script
    exec(code, namespace, namespace)
  File "c:\users\user\miniconda3\envs\mmpdb\lib\site-packages\mmpdb-2.3.dev1-py3.6.egg\EGG-INFO\scripts\mmpdb", line 11, in <module>
    commandline.main()
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\site-packages\mmpdb-2.3.dev1-py3.6.egg\mmpdblib\commandline.py", line 1054, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\site-packages\mmpdb-2.3.dev1-py3.6.egg\mmpdblib\commandline.py", line 181, in fragment_command
    do_fragment.fragment_command(parser, args)
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\site-packages\mmpdb-2.3.dev1-py3.6.egg\mmpdblib\do_fragment.py", line 567, in fragment_command
    pool = create_pool(args.num_jobs)
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\site-packages\mmpdb-2.3.dev1-py3.6.egg\mmpdblib\do_fragment.py", line 396, in create_pool
    pool = multiprocessing.Pool(num_jobs, init_worker)
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\User\miniconda3\envs\mmpdb\lib\multiprocessing\spawn.py", line 172, in get_preparation_data
    main_mod_name = getattr(main_module.__spec__, "name", None)
AttributeError: module '__main__' has no attribute '__spec__'

I have looked into lots of suggested fixes for this error but nothing changed.
It would be great if someone could help me fix it.

Thanks.

Symmetric variable fragments are not replaced in both possible directions

When running the transform query, the enumerator only creates one out of the several possible compounds if a fragment to be replaced is symmetric.

Here is a simplified example:
Running a transform query with this input compound

C1CCC1CCN2CC2

yields (among others) this fragmentation:

constant: C1CCC1*.C2CN2*
variable: CC

Note that the variable linker is symmetric.
The MMP database now contains transformations like

CC >> C(C)C

which should produce these two compounds:

C1CCC1C(C)CN2CC2
C1CCC1CC(C)N2CC2

depending on which way round the transformation is applied. However, it only produces one of them. (The other one is produced too, but based on a different rule with far fewer pairs.)

Automatically get MMPs for a given data set

Hi authors,

I thought this tool could automatically find the MMPs in a group of molecules.

For example, if mmpdb is given an SDF, CSV or SMILES file, it could generate a resulting file which contains all the MMPs from the given file.

However, when I read the paper, it seems that the user needs to provide user-defined cutting patterns (the "constants" part in the paper).

Is mmpdb an interactive MMP generation tool?

Best,

PK

Can we specify the environmental radius when generating the mmpdb?

Hi all,

I am aware that we can adjust the max-radius parameter to set the maximum environment radius to be indexed in the MMPDB. But I wonder if there is a way to index the database at only one specific radius. For example, could we generate the mmpdb at radius = 3 only?

Thanks!
Cheng

Use importlib.resources instead of __file__

mmpdb uses __file__ to get the *.sql files. This prevents mmpdb from being installed as a wheel/zipfile.

I've switched to importlib.resources, which is the modern way to get resources like this.

The importlib.resources module was added in Python 3.7, which means this change drops Python 3.6 support!

This should not be a problem. Python 3.6 came out nearly 5 years ago, and its support period ends 2021-12, which is next month.

If it is a problem, then there are a couple of solutions: 1) use pkg_resources, or 2) use the importlib_resources back-port.

The mapping from (package_name, resource_name) -> content is in the setup.cfg:

[options.package_data]
mmpdblib = schema.sql, create_index.sql, drop_index.sql, fragment_schema.sql

and loaded like this:

_schema_template = importlib.resources.read_text("mmpdblib", "schema.sql")

AttributeError: module '__main__' has no attribute '__spec__'

Hello!

I tried to run the command mmpdb fragment tests/chembl_test.smi -o tests/chembl_test.fragments with my own data, which looks like this (if it matters):

c1cn(-c2ccc3c(-c4cc5cc(CN6CCCCC6)ccc5[nH]4)n[nH]c3c2)nn1
CN(C)C(=O)c1ccc2c(-c3cc4cc(CN5CCOCC5)ccc4[nH]3)n[nH]c2c1
c1cnn(-c2ccc3c(-c4cc5cc(CN6CCCCC6)ccc5[nH]4)n[nH]c3c2)c1
c1cc2[nH]c(-c3n[nH]c4cc(-c5cn[nH]c5)ccc34)cc2cc1CN1CCOCC1
c1ncc(-c2cnc(Nc3cc(N4CCNCC4)ccn3)s2)cn1

and then I got AttributeError: module '__main__' has no attribute '__spec__' with the full traceback:

Traceback (most recent call last):
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/bin/mmpdb", line 4, in <module>
    __import__('pkg_resources').run_script('mmpdb==2.3.dev1', 'mmpdb')
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/site-packages/pkg_resources/__init__.py", line 672, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1472, in run_script
    exec(code, namespace, namespace)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/site-packages/mmpdb-2.3.dev1-py3.9.egg/EGG-INFO/scripts/mmpdb", line 11, in <module>
    commandline.main()
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/site-packages/mmpdb-2.3.dev1-py3.9.egg/mmpdblib/commandline.py", line 1054, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/site-packages/mmpdb-2.3.dev1-py3.9.egg/mmpdblib/commandline.py", line 181, in fragment_command
    do_fragment.fragment_command(parser, args)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/site-packages/mmpdb-2.3.dev1-py3.9.egg/mmpdblib/do_fragment.py", line 567, in fragment_command
    pool = create_pool(args.num_jobs)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/site-packages/mmpdb-2.3.dev1-py3.9.egg/mmpdblib/do_fragment.py", line 396, in create_pool
    pool = multiprocessing.Pool(num_jobs, init_worker)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/Users/alisagorislav/opt/anaconda3/envs/mmpdb/lib/python3.9/multiprocessing/spawn.py", line 183, in get_preparation_data
    main_mod_name = getattr(main_module.__spec__, "name", None)
AttributeError: module '__main__' has no attribute '__spec__'

I used macOS Monterey 12.1

Remove vendored use of peewee

mmpdb uses peewee as an adapter for different back-end databases.

I originally included a vendored version of peewee for ease of installation. I've removed that and am instead using an installation dependency on the "peewee" package.

That is, I removed peewee.py and playhouse/ and configured setup.cfg to have an installation dependency on peewee >= 3.0.

It turns out the peewee API changed from 2.x to 3.x, which occurred in 2018. A plus side of vendoring is that mmpdb was isolated from this change, so we didn't have to worry about it until now. :)

I've updated mmpdb to work with the new peewee API.

ujson is not appreciably faster than json

In Python 2.7 the built-in json module was significantly slower than the third-party ujson and cjson modules at parsing the fragment file. If one of the latter two is not found, mmpdb prints a warning message to suggest installing one of those two modules, then falls back to using the json module.

It appears that Python 3.6's json module, while still slower than cjson, is no longer sufficiently slower as to warrant having that warning message. In one test, json took 2m09s while cjson took 2m04s.

Need to re-run the timing tests with Python 3.5, 3.6, and 3.7. If the warning is no longer needed, then show it only for Python 2.7.
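
A rough sketch for re-running the comparison (not the original benchmark; the real fragment reader does more per line than just decoding, and the file name here is hypothetical):

    import json
    import time

    def time_decoder(loads, path):
        """Rough wall-clock timing of a JSON decoder over a fragments file."""
        t0 = time.time()
        with open(path) as f:
            for line in f:
                try:
                    loads(line)
                except ValueError:
                    pass  # header/option lines that are not plain JSON
        return time.time() - t0

    if __name__ == "__main__":
        path = "test.fragments"  # hypothetical file name
        print("json :", time_decoder(json.loads, path))
        try:
            import ujson
            print("ujson:", time_decoder(ujson.loads, path))
        except ImportError:
            print("ujson not installed")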

How to get the chemical information for each rule environment with a given rule?

Hi all,

I built an mmpdb using the example "test_data.smi" shown on the GitHub page. Then I ran a query to fetch all the rules from the schema as follows (please correct me if I did something wrong with the query):

c = cursor.execute(
            "SELECT rule_environment.rule_id, from_smiles.smiles, from_smiles.num_heavies, to_smiles.smiles, to_smiles.num_heavies, "
            "          rule_environment.radius, "
            "          rule_environment_statistics.id, property_name_id, count, avg, std, kurtosis, skewness, min, q1, median, q3, max, paired_t, p_value "
            "  FROM rule, rule_environment, rule_environment_statistics, "
            "          rule_smiles as from_smiles, rule_smiles as to_smiles "
            " WHERE rule_environment.id = rule_environment_id "
            "   AND rule_environment_statistics.rule_environment_id = rule_environment_id "
            "   AND rule_environment.rule_id = rule.id "
            "   AND rule.from_smiles_id = from_smiles.id "
            "   AND rule.to_smiles_id = to_smiles.id ")

After that, I took a look at all the rules. I found it is difficult to understand the rule environments for the same rule.

For example, below, for rule id 0 there are 11 environments with different rule_environment ids. But how could I get the local chemical information for each environment? That would help me understand the difference between those environments.

rule_id from_smiles from_smiles_nHeavies to_smiles to_smiles_nHeavies environ_radius rule_environ_id prop_id count avg
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 0 1 0 2 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 1 3 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 2 5 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 3 7 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 4 9 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 5 11 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 1 214 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 2 216 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 3 218 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 4 220 0 1 1
0 [*:1]c1ccccc1N 7 [*:1]c1ccccc1O 7 5 222 0 1 1
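
One guess at how to pull the stored environment fingerprint for each of those rule environments (the environment_fingerprint table and its column names are assumptions on my part; check with .schema first, since the column may be called fingerprint or env_fp depending on the version):

    import sqlite3

    # Guessed join from rule_environment to its environment fingerprint.
    db = sqlite3.connect("test_data.mmpdb")
    query = """
    SELECT rule_environment.id, rule_environment.radius,
           environment_fingerprint.fingerprint
      FROM rule_environment
      JOIN environment_fingerprint
        ON environment_fingerprint.id = rule_environment.environment_fingerprint_id
     WHERE rule_environment.rule_id = 0
    """
    for row in db.execute(query):
        print(row)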

Thanks,
Jen

Stop cutting aliphatic C-Halogen bonds

Most of the cut-SMARTS currently cut bonds between aliphatic carbon and halogens. This may not be desirable, since it leads to CF3 and OCF3 groups being split up. These splits may not be interesting for users.

The idea is to create new cut-SMARTS that do not cut CF, CF2, CF3, OCF3, and generally C[Halogen] bonds.
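
A minimal RDKit sketch of the intended change — excluding halogens as the atom on the far side of the cut bond (illustrative only; the real mmpdb cut-SMARTS are considerably more elaborate):

    from rdkit import Chem

    # Candidate cut pattern: acyclic single bond from a neutral carbon to any
    # heavy atom except F/Cl/Br/I (illustrative, not the actual mmpdb default).
    no_halogen_cut = Chem.MolFromSmarts("[#6+0]!@!=!#[!#0;!#1;!F;!Cl;!Br;!I]")

    ocf3 = Chem.MolFromSmiles("COC(F)(F)F")  # methyl trifluoromethyl ether
    # Only the two C-O bonds match; the three C-F bonds are never cut,
    # so the CF3 group itself stays intact.
    print(ocf3.GetSubstructMatches(no_halogen_cut))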

sqlite3.OperationalError: database or disk is full when indexing

Dear all,

I came across a SQLite3 error when indexing the fragments. See below:

WARNING: Neither ujson nor cjson installed. Falling back to Python's slower built-in json decoder. Building index ...
Failed to execute the following SQL: CREATE INDEX pair_rule_environment_id on pair (rule_environment_id);
Traceback (most recent call last):
  File "/mmpdb/mmpdb", line 11, in <module>
    commandline.main()
  File "/mmpdb/mmpdblib/commandline.py", line 1054, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "/mmpdb/mmpdblib/commandline.py", line 393, in index_command
    do_index.index_command(parser, args)
  File "/mmpdb/mmpdblib/do_index.py", line 205, in index_command
    pair_writer.end(reporter)
  File "mmpdb/mmpdblib/index_algorithm.py", line 1199, in end
    self.backend.end(reporter)
  File "/mmpdb/mmpdblib/index_writers.py", line 228, in end
    schema.create_index(self.conn)
  File "/mmpdb/mmpdblib/schema.py", line 133, in create_index
    _execute_sql(c, get_create_index_sql())
  File "/mmpdb/mmpdblib/schema.py", line 119, in _execute_sql
    c.execute(statement)

**sqlite3.OperationalError: database or disk is full**

But I checked my disk and confirmed there was plenty of space available (1 TB). Any comments or suggestions? Would it help if I switched to APSW instead of sqlite3?

Thanks,
Cheng

Enable Specification of exchange Fragment in mmpdb transform

When mmpdb transform is called, the algorithm currently fragments the input molecule and searches for replacements for all fragments in the DB. This can then be filtered by the substructure filter and others, but fundamentally all fragments are searched. A huge speedup can potentially be gained if the user could specify the fragment she wants to exchange in the query.

This requires a check of whether the fragment as specified exists at all, potentially some fragment cleanup (e.g. specification of the attachment atoms), and then a filter to the specific fragment after fragmentation of the input compound.

no_regioisomer filter could be useful

Double- and triple-cuts can produce regioisomers, where the constant parts are just swapped. Examples are these transformations:

Double cut: [*:1]CC1(CC1)[*:2] >> [*:1]C1(CC1)C[*:2]
Triple cut: [*:1]c1cc([*:2])c([*:3])cc1 >> [*:1]c1cc([*:3])c([*:2])cc1

It may be useful not to store these transformations in order to reduce database size, particularly for triple cuts. If implemented, it would be good if this filter could be set separately for double and triple cuts.

Error of "sqlite3.OperationalError: database is locked"

Hi all,

I was trying to use mmpdb version 3 for fragmentation. However, I came across an error when running it on Linux:

The command I ran is mmpdb fragment test_data.smi -o test_data.fragdb

The error is:
Failed to execute the following SQL:
-- Version 3.0 switched to a SQLite database to store the fragments.
-- Earlier versions used JSON-Lines.
-- The SQLite database improves I/O time, reduces memory use, and
-- simplifies the development of fragment analysis tools.

-- NOTE: There is configuration information in three files!
-- 1) fragment_types.py -- the data types
-- 2) fragment_schema.sql -- (this file) defines the SQL schema
-- 3) fragment_db.py -- defines the mapping from SQL to the data types

CREATE TABLE options (
id INTEGER NOT NULL,
version INTEGER,
cut_smarts VARCHAR(1000),
max_heavies INTEGER,
max_rotatable_bonds INTEGER,
method VARCHAR(20),
num_cuts INTEGER,
rotatable_smarts VARCHAR(1000),
salt_remover VARCHAR(200),
min_heavies_per_const_frag INTEGER,
min_heavies_total_const_frag INTEGER,
max_up_enumerations INTEGER,
PRIMARY KEY (id)
);
Traceback (most recent call last):
  File "/miniconda3/envs/mmpdb31/bin/mmpdb", line 8, in <module>
    sys.exit(main())
  File "/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/.local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/miniconda3/envs/mmpdb31/lib/python3.9/site-packages/mmpdblib/cli/fragment_click.py", line 215, in make_fragment_options_wrapper
    return command(**kwargs)
  File "/miniconda3/envs/mmpdb31/lib/python3.9/site-packages/mmpdblib/cli/smi_utils.py", line 98, in make_input_options_wrapper
    return command(**kwargs)
  File "/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/miniconda3/envs/mmpdb31/lib/python3.9/site-packages/mmpdblib/cli/fragment.py", line 256, in fragment
    with fragment_db.open_fragment_writer(
  File "/miniconda3/envs/mmpdb31/lib/python3.9/site-packages/mmpdblib/fragment_db.py", line 372, in open_fragment_writer
    init_fragdb(c, options)
  File "/miniconda3/envs/mmpdb31/lib/python3.9/site-packages/mmpdblib/fragment_db.py", line 92, in init_fragdb
    schema._execute_sql(c, get_schema_template())
  File "/miniconda3/envs/mmpdb31/lib/python3.9/site-packages/mmpdblib/schema.py", line 129, in _execute_sql
    c.execute(statement)
sqlite3.OperationalError: database is locked

It probably has nothing to do with the MMPDB-v3 program since it is running fine on my Mac. If anyone has some advice/suggestions on how to solve it, it would be highly appreciated.

Thanks,
Cheng

Turning on --property flag leading to a smaller number of transformed structures

Hi all,

I recently found some unexpected outcomes when using "mmpdb transform" with and without the property flag. The mmpdb database was generated from the ChEMBL database with calculated LogP as the property.

When I used the "--no-properties" flag, I got 4632 transformed structures.
mmpdb transform chembl.mmpdb --smiles "XXXXXX" --min-pairs 5 --min-variable-size 0 --max-variable-size 20 --no-properties -o results_noprop.csv &

However, when I turned on the "--property LogP" flag, I got 591 transformed structures.
mmpdb transform chembl.mmpdb --smiles "XXXXXX" --min-pairs 5 --min-variable-size 0 --max-variable-size 20 --property LogP -o results_prop.csv &

I would expect the run with "--property LogP" to generate the same number of compounds, just with more output information.

Any thoughts on that?

Thanks!
Cheng

Failure with mmpdb fragment for some specific smiles

Hi all,

I am using mmpdb fragment to parse a subset of the SureChEMBL database, and I found that mmpdb fragment fails for some specific SMILES. I wonder if some error handling could be added to deal with such unfavorable structures.

Here is the example of test.smi.

C[C@]12CCC3c4c5cc(O)cc4[C@@]4(CC[C@@]1(C4)C3CC5)[C@@H]2O SCHEMBL9251776
Oc1ccccc1 phenol
Oc1ccccc1O catechol
Oc1ccccc1N 2-aminophenol
Oc1ccccc1Cl 2-chlorophenol
Nc1ccccc1N o-phenylenediamine
Nc1cc(O)ccc1N amidol
Oc1cc(O)ccc1O hydroxyquinol
Nc1ccccc1 phenylamine
C1CCCC1N cyclopentanol

I ran "python mmpdb/mmpdb fragment test.smi -o test_data.fragments". It fails while parsing the first SMILES and does not skip it to continue. The error is shown below:

Failure: file 'test.smi', line 1, record #1: first line starts 'C[C@]12CCC3c4c5cc(O)cc4[C@@]4(CC[C@@]1(C ...'
Traceback (most recent call last):
  File "mmpdb/mmpdb", line 11, in <module>
    commandline.main()
  File "/mmpdb/mmpdblib/commandline.py", line 1054, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "/mmpdb/mmpdblib/commandline.py", line 181, in fragment_command
    do_fragment.fragment_command(parser, args)
  File "/mmpdb/mmpdblib/do_fragment.py", line 581, in fragment_command
    writer.write_records(records)
  File "/mmpdb/mmpdblib/fragment_io.py", line 404, in write_records
    for rec in fragment_records:
  File "/mmpdb/mmpdblib/do_fragment.py", line 475, in make_fragment_records
    fragments = result.get()
  File "anaconda2/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
ValueError: need more than 1 value to unpack

Appreciate any suggestions or ideas.

Thanks,
Cheng

Unexpected generated molecule with double cuts MMPDB

Hi!
I am getting an unexpected generated molecule when using double cuts. The transformation, which in this case is the linker, is attached the wrong way around (details below). I would really appreciate your help :)

Using the following two molecules to create an MMPDB:
c1ccc(cc1)c2ccc3c(c2)c(c(cc3NCc4cn5cc(cnc5n4)F)c6ccccc6)c7ccccc7 test1
c1ccc(cc1)c2ccc3c(c2)c(c(cc3NCc4[nH]c5c(n4)ccc(c5F)F)c6ccccc6)c7ccccc7 test2

to then generate new molecules from the molecule to improve:
c2ccc3c(c2)c(c(cc3NCc4[nH]c5c(n4)ccc(c5F)F)c6ccccc6)c7ccccc7

the following molecule is proposed:
Fc1cn2cc(CNc3cc(-c4ccccc4)c(-c4ccccc4)c4ccccc34)cnc2n1

while the expected generated molecule would be:
c2ccc3c(c2)c(c(cc3NCc4[nH]c5c(n4)ccc(c5F)F)c6ccccc6)c7ccccc7

Please let me know if any other information is needed to better understand my issue.
Thanks a lot,
Alice

--min-heavies-per-const-frag 3 option loses some transformations

When using the --min-heavies-per-const-frag 3 option during the fragmentation stage, I noticed that I am losing the following transformation:
[*:1]O[*:2] to [*:1]C([*:2])N
in which one of the R groups is a simple methyl. Is it possible to somehow recover this transformation by playing with any option during the indexing step?

Can I get a table of all rules, as well as the number of pairs, and statistics for each rule?

Hi,

I wonder if there is a way to obtain the information for all rules, as well as the number of pairs and the statistics for each rule, from a built (mmpdb) database. A simple output table I expect would look like:

        from_smiles (SMIRKS)   to_smiles (SMIRKS)   # of pairs   mean   std
rule1   *****                  *****                ...          ...    ...
rule2   *****                  *****                ...          ...    ...
.....

I am quite interested in presenting the rules from a database in a way similar to Tables 1-5 & Figure 5 in your publication "J. Med. Chem. 2018, 61, 3277-3292". Would you mind giving me some hints on how to achieve that with the mmpdb code?

Thanks,
Cheng
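
A sketch of the sort of query I had in mind, in case it helps clarify the request (table and column names are my reading of the schema and may be wrong; the statistics probably need to be restricted to one radius to avoid counting each rule several times):

    import sqlite3

    # Guessed schema: one summary row per rule, using the radius-0 environment.
    db = sqlite3.connect("test_data.mmpdb")
    query = """
    SELECT frm.smiles, dst.smiles, stats.count, stats.avg, stats.std
      FROM rule
      JOIN rule_smiles frm ON frm.id = rule.from_smiles_id
      JOIN rule_smiles dst ON dst.id = rule.to_smiles_id
      JOIN rule_environment re ON re.rule_id = rule.id AND re.radius = 0
      JOIN rule_environment_statistics stats ON stats.rule_environment_id = re.id
     ORDER BY stats.count DESC
    """
    for from_smiles, to_smiles, n_pairs, avg, std in db.execute(query):
        print(from_smiles, to_smiles, n_pairs, avg, std)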

--smallest-transformation-only doesn't work for some transformations

The "--smallest-transformation-only" option doesn't produce the desired result for some transformations. It seems that there is a conflict between the --smallest-transformation-only option (used during indexing) and the --min-heavies-per-const-frag option (used during fragmentation).

For example, if one considers the transformation [*:1]C(=O)Nc1ccccc1>>[*:1]C(=O)Nc1cccnc1, it is clearly reducible to [*:1]c1ccccc1>>[*:1]c1cccnc1.

However, if someone sets --min-heavies-per-const-frag to 9 (during the fragmentation step), then the output is [*:1]C(=O)Nc1ccccc1>>[*:1]C(=O)Nc1cccnc1 and not [*:1]c1ccccc1>>[*:1]c1cccnc1.

This is possibly because the number of heavy atoms in the fragment [*:1]C(=O)Nc1ccccc1 (or [*:1]C(=O)Nc1cccnc1) is <= min-heavies-per-const-frag, so no further fragmentation is possible for these fragments and the transformation is not reduced.

store fragments in a SQLite db instead of JSON-Lines

Resolved issues:

  1. Default fragment output filename?
    • If no output is specified, create a name based on structure filename
    • If there is no structure filename (reading from stdin) then use 'input.fragdb'
  2. Alternative text output?
    • No. SQL seems more usable.
  3. Normalize the fragment SMILES?
    • No. Doesn't save memory.
  4. What indices are needed?
    • Need fragmentation -> record id for indexing
    • Otherwise, none. While unindexed tasks are ~2-3x slower, indices take up space most won't need.
    • Could have a user-level command to add additional indices if needed.
  5. Completely remove support for the old 'fragments' format?
    • Yes. Makes the code simpler.
    • Will require people to re-fragment everything.

Currently the fragment file stores the fragmentations in JSON-Lines format. After an initial header (with version and option lines) comes a sequence of "RECORD" or "IGNORE" lines, each structured as a JSON list. For RECORD lines there are 10 fields, with the last being a list of fragmentations.

I propose switching the fragment command to save to a SQLite database (proposed extension: ".fragdb"). Analysis in SQL is so much easier than writing one-off Python programs.

I have a pull request implementing this proposal.

Pros and Cons

I see several advantages of using SQLite instead of a flat(ish) file:

  1. simplify the fragment file I/O by using SQL instead of Python
  2. potentially speed up loading the data I/O step because of less data validation in Python and no need to go through JSON
  3. don't need to load all of the data into memory for re-fragmentation
  4. more easily enable analysis and filtering of the fragments

To clarify the last point, consider a tool to examine the most common fragments, or to select only a few constants. This sort of tool could be written as a SQL command rather than read the data into memory followed by some Python-based data processing.

The disadvantages I can think of are:

  1. Requires people to regenerate their fragmentations in the new format
    • unless a converter is developed?
    • which I don't think is needed
  2. The SQLite file is about the same size as the uncompressed fragments file, though about 10-15x larger than the gzip'ed fragments
    • indexing on the constant and variable parts roughly doubles the size

Maybe there are other downsides?

Default fragment filename

One issue is how to handle the default fragment file. Currently mmpdb fragment will write to stdout if no -o option is given. This does not work with a SQLite output file.

I could:

  1. require a -o file
  2. use a default name for that case (eg, input.fragdb)
  3. synthesize a name based on the fragment file

I have decided that if you do not specify -o then the default output is the structure filename, with possible extension and .gz removed and replaced with .fragdb. If the structure file is AB.smi.gz then the default output fragment database name is AB.fragdb. If the structure file is CD.smi then the default output name is CD.fragdb.

This decision was guided by the need to distribute fragmentations across a cluster (rather than the simple Python-based multiprocess fragmentation now). In that case, your process will be something like:

o Split the input SMILES files into N parts

  • convert 'input.smi' -> 'input.part001.smi', 'input.part002.smi', ...
  • perhaps via a mmpdb smi_split input.smi command?

o Fragment each SMILES file (using a fake cluster queue submission)

  • qsub --exec 'mmpdb fragment {} --cache prev_job.fragdb' --files *.smi
  • could process each of the SMILES files to 'input.part001.fragdb', 'input.part002.fragdb', ...

o Merge the fragments back into a single file:

  • mmpdb fragdb_merge input.part*.fragdb -o input.fragdb

If the mmpdb fragment step used the input SMILES filename to influence how the default output name is determined (in this case, from input.part001.smi to input.part001.fragdb) then there wouldn't need to be a filename manipulation step here.

Alternative text output

Currently there is the option to display the fragments in "fraginfo" format. This was an undocumented option to display the text in a more human-readable format. It does not appear to be used, as the code clearly says "Fix this for Python 3". I suspect it can be removed without a problem.

Still, perhaps there is a reason to have a way to view the fragmentations in a more human readable format? For example:

mmpdb fragment_dump whatever.fragdb
mmpdb fragment_dump --id ABC123 whatever.fragdb

However, it just doesn't seem useful. It's so much easier to do the query in SQL.

normalize the fragment SMILES?

Should the SMILES strings in the fragdb database be normalized? (That is, all 23,256 occurrences of *C.*C would be normalized to an integer id in a new smiles table, and the fragmentation SMILES stored by id reference, rather than storing the actual string.)

I used the ChEMBL_CYP3A4_hERG.smi test set from the original mmpdb development, with 20267 SMILES strings. Using a denormalized data set (constant and fragment SMILES are stored as-is), the resulting sizes are:

% ls -lh ChEMBL_CYP3A4_hERG.frag*
-rw-r--r--  1 dalke  admin   139M Oct 12 13:30 ChEMBL_CYP3A4_hERG.fragdb
-rw-r--r--  1 dalke  admin   146M Oct 12 12:41 ChEMBL_CYP3A4_hERG.fragments

This shows that "fragdb" is slightly more compact than "fragments".

On the other hand, gzip -9 produces a ChEMBL_CYP3A4_hERG.fragments.gz which is 1/13th the size, at 11 MB (153151552/11656549 = 13.1).

A SQL query suggests I can save about 50MB by normalizing the duplicate fragment SMILES, which is about 40% of the file size.

sqlite> select sum(length(s) * N), sum(length(s) * (N-1)) FROM (select s, count(*) as N FROM (SELECT constant_smiles AS s FROM fragmentation UNION ALL SELECT variable_smiles AS s FROM fragmentation) group by s);
84360180|32669074

On the other hand, that estimate doesn't fully include the normalization table, nor does it include the indices which may be needed for possible analyzes.

(The constant_with_H_smiles and record's input_smiles and normalized_smiles have few duplicate values so should not be normalized.)

I changed the code to normalize the fragments in a 'smiles' table and regenerated the data set. The new one is listed second, as "hERG2":

% ls -lh ChEMBL_CYP3A4_hERG.fragdb ChEMBL_CYP3A4_hERG2.fragdb
-rw-r--r--  1 dalke  admin   139M Oct 12 13:30 ChEMBL_CYP3A4_hERG.fragdb
-rw-r--r--  1 dalke  admin   189M Oct 12 16:26 ChEMBL_CYP3A4_hERG2.fragdb

The resulting size is larger because it contains the SMILES normalization table, and the indexing needed for the UNIQUE constraint. It is still roughly the same size as the uncompressed fragments file, though quite larger than the gzip-compressed fragments.

What indices are needed?

There must be an index mapping each fragmentation to its record. I tried a version without that index and mmpdb index was obviously slower, even for a test set of only 1,000 SMILES.

The database should be indexed to support the types of analyses which might be done on the fragment data. At present I don't know what these are likely to be. Some likely ones are:

  • Merge multiple fragment data sets into one (eg, using 20 machines to fragment structures in parallel)
  • Determine the distribution of constants and/or variable fragments?
  • Generate a subset containing only specified constants.
  • Partition the dataset into N groups, based on their constants.

I modified the two datasets to index them by the constant and variable parts. For the de-normalized hERG.fragdb I indexed the constant and variable SMILES strings. For the normalized hERG2.fragdb I indexed the constant and variable SMILES ids.

-rw-r--r--    1 dalke  admin   241M Oct 12 16:55 ChEMBL_CYP3A4_hERG_with_idx.fragdb
-rw-r--r--    1 dalke  admin   235M Oct 12 17:04 ChEMBL_CYP3A4_hERG2_with_idx.fragdb

This nearly doubles the database size. This also shows that normalization doesn't affect the database size if both the constants and variables need to be indexed.

It's hard to judge if this increase in size is useful without tests on the types of analyses to do, so I used the above "likely ones".

merge multiple data sets into one

This is trivial with the unnormalized version: merge the two sets of tables, and update the ids so they don't overlap. The normalized version is a bit more complex as the normalization tables must also be merged.
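
A minimal sketch of the id-shifting step for the unnormalized case (the table and column names here are illustrative, not the final schema):

    import sqlite3

    # Merge other.fragdb into main.fragdb by shifting incoming record ids past
    # the current maximum. Illustrative schema: record(id, input_smiles) and
    # fragmentation(record_id, constant_smiles, variable_smiles).
    db = sqlite3.connect("main.fragdb")
    db.execute("ATTACH DATABASE 'other.fragdb' AS other")
    (offset,) = db.execute("SELECT coalesce(max(id), 0) FROM record").fetchone()

    db.execute(
        "INSERT INTO record (id, input_smiles) "
        "SELECT id + ?, input_smiles FROM other.record", (offset,))
    db.execute(
        "INSERT INTO fragmentation (record_id, constant_smiles, variable_smiles) "
        "SELECT record_id + ?, constant_smiles, variable_smiles FROM other.fragmentation",
        (offset,))
    db.commit()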

Determine the distribution of constants

The following (in the unnormalized data set) prints the distribution of constant SMILES, ordered by count:

select count(*), constant_smiles from fragmentation GROUP BY constant_smiles ORDER BY count(*) DESC

Given the fragmentations from 20K molecules, this takes a bit over 2 seconds on the unindexed, unnormalized data set.

If the SMILES strings are indexed, but still unnormalized, it takes a bit over 1 second.

If the SMILES strings are normalized and indexed it's 0.66 seconds. That's about 3x faster.

Given that this will likely not be common, I suggest staying with unnormalized strings, and no index. Perhaps there can be an mmpdb fragdb_index command to add indices if people want a 2x speedup.

Bear in mind that currently any analysis must do a linear search of the fragments file, decoding the JSON-Lines and building data structures in memory. This probably takes much longer, though I haven't tested it.

Generate a subset of fragments given the constants

There are a couple of ways I could think of to select a set of fragments: 1) specify the SMILES strings, and 2) request only those in the top-P%, or top N, or those with at least M occurrences.

In both cases, I think the right solution is to make a temporary table containing the selected fragments, then use that to select the subset of fragments and of records containing those fragments. For example, the following selects all records with a fragmentation with a constant SMILES is one of the 10,000 most common.

attach database ":memory:" as tmp;

CREATE TABLE tmp.selected_fragmentation (
    fragment_id INTEGER 
);

INSERT INTO tmp.selected_fragmentation
    SELECT id 
    FROM fragmentation
    WHERE constant_smiles IN (
       SELECT constant_smiles
         FROM fragmentation
   GROUP BY constant_smiles
   ORDER BY count(*)
          DESC
         LIMIT 10000);

analyze tmp;

select record.id from record, fragmentation where record.id = fragmentation.record_id AND fragmentation.id in (select fragment_id from tmp.selected_fragmentation);

The :memory: + index + analyze approach is slightly faster than doing a straight search on the unindexed database. Note the above only filters the records, while a full export will also need to export the fragments, which requires another search. That's why I think the temporary index is worthwhile.

Partition based on constants

It looks like this can be done with a similar mechanism - create a table, randomize the order, split into M parts, and save to distinct output databases.

Drop support for the old 'fragments' format

The current pull request drops support for the old 'fragments' format if the filename ends with .fragments or .fragments.gz. This simplifies the relevant code by quite a bit, compared to a version which must support both I/O formats.

The feedback from Mahendra is that this isn't an issue.

A middle-ground solution would be to support the old 'fragments' format only in the --cache option, which would let people upgrade to the new system without having to re-fragment everything. I don't think that's needed, as re-fragmentation, while slow, is doable.

How to get environment smiles?

Dear developers,
Thanks for developing this nice tool.
I would like to know how to get the environment SMILES. When an mmpdb is created, there is an environment_fingerprint table with id and env_fp columns. I would like to get the SMILES corresponding to each fingerprint.
Any advice or suggestions are greatly appreciated.
Thanks,

mmpdb transform --min-constant-size N - AttributeError: 'Fragmentation' object has no attribute 'num_variable_heavies'

On Windows, the --min-variable-size and --min-constant-size options of the transform command aren't working. The error says that, at line 909 in mmpdblib/analysis_algorithms.py, a Fragmentation object has no attribute 'num_variable_heavies'.

I'm on Windows 10 using the Anaconda Prompt; mmpdb otherwise works fine.

I tried the same command on Linux and it worked, but I think it's relevant to report the issue on Windows.

(base) C:\Users\me\Desktop\mmpdb-2.1>python mmpdb transform --smiles CC=CC=CC=O --min-variable-size 5 my_data.mmpdb
Traceback (most recent call last):
  File "mmpdb", line 10, in <module>
    commandline.main()
  File "C:\Users\me\Desktop\mmpdb-2.1\mmpdblib\commandline.py", line 988, in main
    parsed_args.command(parsed_args.subparser, parsed_args)
  File "C:\Users\me\Desktop\mmpdb-2.1\mmpdblib\commandline.py", line 678, in transform_command
    do_analysis.transform_command(parser, args)
  File "C:\Users\me\Desktop\mmpdb-2.1\mmpdblib\do_analysis.py", line 116, in transform_command
    explain = explain,
  File "C:\Users\me\Desktop\mmpdb-2.1\mmpdblib\analysis_algorithms.py", line 764, in transform
    cursor=cursor, explain=explain)
  File "C:\Users\me\Desktop\mmpdb-2.1\mmpdblib\analysis_algorithms.py", line 909, in make_transform
    if min_variable_size and frag.num_variable_heavies < min_variable_size:
AttributeError: 'Fragmentation' object has no attribute 'num_variable_heavies'
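
For what it's worth, my guess is that the Fragmentation attribute is actually named variable_num_heavies rather than num_variable_heavies, in which case a possible one-line fix around line 909 would be (untested):

    if min_variable_size and frag.variable_num_heavies < min_variable_size: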

No statistics information in the output when using transform without property argument

Hi all,

I am using the MMP transform without property information.

For example,
python mmpdb transform test_data.mmpdb --smiles 'c1cccnc1O' --no-properties

Output
ID SMILES
1 Clc1ccccn1
2 Nc1ccccn1
3 c1ccncc1

In the output I only got ID and SMILES. But I would like to print other useful information, including 'from_smiles', 'to_smiles', 'radius', 'rule_environment_id', and 'count'. Those statistics could give us some confidence in the transform, and I think they should be independent of the properties. I am curious if there is a way to output those statistics with the generated molecules when working with an mmpdb that has no property information.

Thanks, Cheng

To increase the max_rotatable_bonds?

I would like to do fragmentation on a dataset of 145 peptides. When I tried it with mmpdb it says "too many rotatable bonds". Is there any way to increase the number of rotatable bonds permitted?
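
If I read the other issues here correctly, the fragment step accepts a --max-rotatable-bonds option; would something like this be the right knob (file name and value are just examples)?

    mmpdb fragment peptides.smi --max-rotatable-bonds 40 -o peptides.fragments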

Memory error with mmpdb fragment for large dataset

Hi all,

I am trying to build a MMP-DB with 10M compounds. But I got an error at the first step of fragmentation.

The command I used is as follows:
python mmpdb fragment first10M.smi --num-jobs 8 -o first10M.fragments.gz

The error I got is:
Traceback (most recent call last):
  File "/home/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/anaconda2/lib/python2.7/multiprocessing/pool.py", line 328, in _handle_workers
    pool._maintain_pool()
  File "/home/anaconda2/lib/python2.7/multiprocessing/pool.py", line 232, in _maintain_pool
    self._repopulate_pool()
  File "/home/anaconda2/lib/python2.7/multiprocessing/pool.py", line 225, in _repopulate_pool
    w.start()
  File "/home/anaconda2/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/home/anaconda2/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Does anybody have comments or suggestions on that? Also, can I run the command on distributed nodes on the cluster?

PS: I also have similar concerns about the second step, indexing, since it usually takes longer and uses more memory than fragmentation. Can I run the indexing command in parallel or on a distributed cluster?

Thanks,
Cheng

organize command-line processing and move to click

Currently, command-line processing is a mess.

There's commandline.py, which is ~1,000 lines of argparse configuration for all of the commands.

Each command dispatches to a function in one of the 6 do_*.py files. For example, the help commands are in do_help.py and the analysis commands ("predict" and "transform") are in do_analysis.py.

There's a growing consensus to organize the command-line components into their own subpackage named cli (for "command-line interface"). I propose doing this.

argparse -> click

Further, I propose switching from argparse to click. These are both packages to simplify working with command-line processing.

The "click" package is now (I don't know what it was like 6 years ago) a mature and full-featured package. It uses a different model than argparse, so this will require extensive rewriting. One clear advantage of click is the built-in support for testing: with argparse I needed my own functions to, for example, capture stdout and stderr so I could verify they contain the right information, while click's CliRunner.invoke does that for me.
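
A small sketch of the kind of test this enables (assuming the new cli subpackage exposes a click group named main; the import path is illustrative):

    from click.testing import CliRunner

    from mmpdblib.cli import main  # assumed entry point of the new cli subpackage

    def test_top_level_help():
        runner = CliRunner()
        # invoke() runs the command and captures stdout/stderr for us.
        result = runner.invoke(main, ["--help"])
        assert result.exit_code == 0
        assert "fragment" in result.output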

6 years ago I chose argparse because I knew it best, and because it's part of the Python standard library. I don't like having external dependencies if I can avoid it.

My preference is in large part because when I started with Python I always had to install packages manually. Modern Python packages can specify which dependencies they need, and modern package installers can install them automatically if needed.

This means I'll also be updating the mmpdb package configuration to use the more modern conventions.

Which "cut-smarts" patterns (i.e. fragmentation parameters) are used by default for Transform function?

Hi All,

I am wondering which "cut-smarts" patterns are used by default in the transform function. That is, how does the program fragment a given molecule when applying a built MMPDB to do the transform? I couldn't find the parameter that controls the fragmentation method in the "transform" function.

Ideally, the transform function will fragment a molecule in the same way the MMPDB was built, but I am not sure if that is the case.

Thank you!
Cheng
