Comments (5)
Andrew commented on this request on the RDKit-discuss mailing list as:
"""
I took a look at the code. It expects that there is only a single SMARTS, so there's no way to get what you want.
The SMARTS handling code only touches <50 lines of code. It does not seem that hard to have it take multiple --cut-smarts, apply each of the cuts, find the unique union of those cuts, and work with them.
Could you add that as an issue in the mmpdb tracker?
It is in principle possible to merge two fragment files together and index the result. However, it would be difficult to use the indexed database for analysis purposes, because any input/query structure would use the single SMARTS pattern defined in the database.
"""
from mmpdb.
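The "apply each of the cuts, find the unique union" idea from the quoted message can be sketched as follows. This is a minimal illustration using RDKit directly, not mmpdb's actual fragmentation code; the function name `find_cut_bonds` and the two example patterns are assumptions made for the sketch (they are neither mmpdb's default pattern nor the RECAP rules).

```python
from rdkit import Chem


def find_cut_bonds(mol, cut_smarts_list):
    """Return the sorted, de-duplicated (atom_i, atom_j) pairs matched by any pattern.

    Hypothetical helper: each SMARTS must describe exactly two atoms joined by one bond.
    """
    cuts = set()
    for smarts in cut_smarts_list:
        pattern = Chem.MolFromSmarts(smarts)
        if pattern is None:
            raise ValueError(f"cannot parse SMARTS: {smarts!r}")
        if pattern.GetNumAtoms() != 2 or pattern.GetNumBonds() != 1:
            raise ValueError(f"expected two atoms and one bond: {smarts!r}")
        for i, j in mol.GetSubstructMatches(pattern):
            # Normalize the atom order so the same bond found by two patterns is stored once.
            cuts.add((min(i, j), max(i, j)))
    return sorted(cuts)


mol = Chem.MolFromSmiles("CC(=O)NCc1ccccc1")
# Illustrative patterns only: an amide C-N bond and any acyclic N-C single bond.
patterns = ["[C;$(C=O)]-!@[#7]", "[#7]-!@[#6]"]
print(find_cut_bonds(mol, patterns))
```

Both patterns match the amide C-N bond, but it appears only once in the result because the atom pairs are normalized before being added to the set.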
At the RDKit UGM Hackathon 2019, this question came up again. Participants wanted to use the RECAP rules for cutting. Creating a single SMARTS to match all 11 rules might theoretically be possible, but would result in an extremely complicated string which would then be hard to debug and modify. Extending RDKit such that a list of SMARTS is supported appears to be the preferred long-term solution.
from mmpdb.
There are 12 rules, not 11:
>>> from rdkit.Chem import Recap
>>> len(Recap.reactionDefs)
12
>>> for rxn in Recap.reactionDefs:
...     print(rxn)
...
[#7;+0;D2,D3:1]!@C(!@=O)!@[#7;+0;D2,D3:2]>>*[#7:1].[#7:2]*
[C;!$(C([#7])[#7]):1](=!@[O:2])!@[#7;+0;!D1:3]>>*[C:1]=[O:2].*[#7:3]
[C:1](=!@[O:2])!@[O;+0:3]>>*[C:1]=[O:2].[O:3]*
[N;!D1;+0;!$(N-C=[#7,#8,#15,#16])](-!@[*:1])-!@[*:2]>>*[*:1].[*:2]*
[#7;R;D3;+0:1]-!@[*:2]>>*[#7:1].[*:2]*
[#6:1]-!@[O;+0]-!@[#6:2]>>[#6:1]*.*[#6:2]
[C:1]=!@[C:2]>>[C:1]*.*[C:2]
[n;+0:1]-!@[C:2]>>[n:1]*.[C:2]*
[O:3]=[C:4]-@[N;+0:1]-!@[C:2]>>[O:3]=[C:4]-[N:1]*.[C:2]*
[c:1]-!@[c:2]>>[c:1]*.*[c:2]
[n;+0:1]-!@[c:2]>>[n:1]*.*[c:2]
[#7;+0;D2,D3:1]-!@[S:2](=[O:3])=[O:4]>>[#7:1]*.*[S:2](=[O:3])=[O:4]
How do people want to specify the cut with these? Is the cut match defined with the product side of the reaction, and the reactant side ignored?
Some of those SMARTS use more than two atoms. The first makes a cut between :1 and :2 while the second makes a cut between :2 and :3. That means that if the reaction side is ignored (e.g., if the cut is always made between :1 and :2) then there will be problems.
It could do a more in-depth analysis of the transform to detect if there is a labeled pair on the product side which is not a labeled pair in the reactant side, and use that for the cut.
But that's overkill if people really just want --cut-smarts RECAP as an option, since that list could be hard-coded using only the product side SMARTS, and only with :1 and :2.
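The more in-depth analysis mentioned above might look something like the following sketch. It assumes that a cut is a pair of mapped atoms which are directly bonded on the left-hand side of the `>>` and end up in different dot-separated fragments on the right-hand side. The function name `cut_pairs` is hypothetical, and the simple string splitting assumes the rule has exactly one `>>` and no `.` inside the right-hand fragments, which holds for the definitions listed above.

```python
from rdkit import Chem


def cut_pairs(reaction_smarts):
    """Return sorted (label_a, label_b) pairs that are bonded before '>>' and split after it."""
    left, right = reaction_smarts.split(">>")
    left_pattern = Chem.MolFromSmarts(left)

    # Map each atom label to the index of the right-hand fragment that contains it.
    fragment_of = {}
    for frag_idx, frag_smarts in enumerate(right.split(".")):
        frag = Chem.MolFromSmarts(frag_smarts)
        for atom in frag.GetAtoms():
            label = atom.GetAtomMapNum()
            if label:
                fragment_of[label] = frag_idx

    pairs = []
    for bond in left_pattern.GetBonds():
        a = bond.GetBeginAtom().GetAtomMapNum()
        b = bond.GetEndAtom().GetAtomMapNum()
        # Only labeled atoms that land in different fragments define a cut.
        if a in fragment_of and b in fragment_of and fragment_of[a] != fragment_of[b]:
            pairs.append((min(a, b), max(a, b)))
    return sorted(pairs)


# Three of the definitions printed above, copied verbatim.
urea = "[#7;+0;D2,D3:1]!@C(!@=O)!@[#7;+0;D2,D3:2]>>*[#7:1].[#7:2]*"
amide = "[C;!$(C([#7])[#7]):1](=!@[O:2])!@[#7;+0;!D1:3]>>*[C:1]=[O:2].*[#7:3]"
olefin = "[C:1]=!@[C:2]>>[C:1]*.*[C:2]"

for name, rule in [("urea", urea), ("amide", amide), ("olefin", olefin)]:
    print(name, cut_pairs(rule))
```

Under this assumption the urea rule yields no pair at all, because its two labeled nitrogens are not directly bonded to each other, so rules of that shape would still need manual attention.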
from mmpdb.
I'm thinking of supporting it as --cut-smarts RECAP, and having --cut-smarts support multiple SMARTS patterns, where either the SMARTS pattern defines two atoms and a single bond, or the SMARTS pattern contains atoms labeled :1 and :2 where the cut occurs between them, which must match a single bond.
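On the command-line side, a repeatable option plus a named alias could be sketched like this. It is only an illustration with the standard library's argparse, not mmpdb's actual option handling. The `NAMED_CUT_SMARTS` table and `expand_cut_smarts` helper are hypothetical, and the two patterns stored under RECAP are just the left-hand sides of two of the rules listed earlier, used as a stand-in for whatever the final hard-coded list would be.

```python
import argparse

# Hypothetical alias table. A real RECAP entry would hold the full, reviewed list.
NAMED_CUT_SMARTS = {
    "RECAP": ["[c:1]-!@[c:2]", "[n;+0:1]-!@[c:2]"],
}


def expand_cut_smarts(values):
    """Replace known aliases by their patterns and drop duplicates, keeping order."""
    expanded = []
    for value in values:
        for smarts in NAMED_CUT_SMARTS.get(value, [value]):
            if smarts not in expanded:
                expanded.append(smarts)
    return expanded


parser = argparse.ArgumentParser(prog="fragment-sketch")
parser.add_argument(
    "--cut-smarts",
    action="append",  # each occurrence of the option adds one more entry
    default=[],
    metavar="SMARTS",
    help="cut pattern or alias; may be given more than once",
)

args = parser.parse_args(["--cut-smarts", "RECAP", "--cut-smarts", "[#6:1]-!@[O:2]"])
print(expand_cut_smarts(args.cut_smarts))
```

Keeping the alias expansion separate from the argument parsing means the same helper could be reused wherever a stored list of patterns is read back.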
Looking at the RECAP rules, there are several places where I see problems.
- Pattern 1 (urea): [#7;+0;D2,D3:1]!@C(!@=O)!@[#7;+0;D2,D3:2]>>*[#7:1].[#7:2]*
Given NC(=O)N this removes the C(=O) to give N.N. Should the SMARTS be [#7;+0;D2,D3:1]!@[C:2](!@=O)!@[#7;+0;D2,D3], which will match and cut both of the N-C bonds?
- Cuts on "any" bond
The existing code only allows cuts on single bonds. The RECAP patterns use !@ to match any non-ring bonds. I want to change them to -!@ to enforce that it must match a single bond.
Note that [#7]=!@C(=O)!A[#7] matches nothing in ChEMBL. However, pattern 2 (amide), [C;!$(C([#7])[#7]):1](=!@[O:2])!@[#7;+0;!D1:3], does match. More specifically, if I replace the !@ between :2 and :3 with =!@ then I get matches like:
% obgrep '[C;\!$(C([#7])[#7]):1](=\!@[O:2])=\!@[#7;+0;\!D1:3]' ~/databases/chembl_23.rdkit.smi
O=C=NC1CCCCC1 CHEMBL26886
CCCCN=C=O CHEMBL27104
CCCC(N=C=O)C(=O)OC CHEMBL65298
COC(=O)C(CCSC)N=C=O CHEMBL67787
CC(C)c1cccc(C(C)C)c1N=C=O CHEMBL109470
CCc1cccc(CC)c1N=C=O CHEMBL111198
[C-]#[N+][C@@]1(C)CC[C@@H]2[C@@H](C)C[C@H]3C[C@@H](C)[C@@](C)(N=C=O)[C@H]4CC[C@H]1[C@@H]2[C@H]34 CHEMBL169156
CC(=O)O[C@H]1CC[C@@]2(C)[C@@H](CC[C@]3(C)[C@@H]2CC=C2[C@@H]4[C@@H](C)[C@H](C)CC[C@]4(C)CC[C@@]32C)[C@@]1(C)N=C=O CHEMBL235436
O=C=NCCc1ccccc1 CHEMBL2074871
CC(=O)O[C@@H]1CC[C@@]2(C)[C@@H](CC[C@]3(C)[C@@H]2CC=C2[C@@H]4[C@@H](C)[C@H](C)CC[C@]4(C)CC[C@@]32C)[C@@]1(C)N=C=O CHEMBL237112
O=C=Nc1cccc2ccccc12 CHEMBL2074791
...
There are far more matches with -!@.
It looks like in those few cases where the non-ring bond type is not specified, it's okay for me to say it's a single bond, without changing the intent of matching an amide.
It also seems like the RECAP definition in RDKit is wrong, in that it is not supposed to match a double bond there.
- Match with explicit double bond
The pattern [C:1]=!@[C:2]>>[C:1]*.*[C:2] explicitly matches a double bond which is a non-ring bond. The underlying code says this is to handle olefins, so it really does want to match a double bond.
mmpdb cannot handle this case. Should I drop it?
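The change from !@ to -!@ discussed above could be applied at the text level, roughly as in this sketch. It is a regular-expression heuristic, not a SMARTS parser: it assumes that a `!@` with no bond symbol directly before or after it is an untyped non-ring bond, and it would misfire on a `!@` used inside atom brackets. The helper name `require_single_bond` is made up for the example.

```python
import re

# "!@" that is not already combined with an explicit bond symbol on either side.
_UNTYPED_NONRING = re.compile(r"(?<![-=#:~])!@(?![-=#:~])")


def require_single_bond(smarts):
    """Rewrite untyped non-ring bonds ('!@') as single non-ring bonds ('-!@')."""
    return _UNTYPED_NONRING.sub("-!@", smarts)


amide = "[C;!$(C([#7])[#7]):1](=!@[O:2])!@[#7;+0;!D1:3]"
urea = "[#7;+0;D2,D3:1]!@C(!@=O)!@[#7;+0;D2,D3:2]"
olefin = "[C:1]=!@[C:2]"

print(require_single_bond(amide))
print(require_single_bond(urea))
print(require_single_bond(olefin))
```

The carbonyl bonds written as =!@ or !@= are left alone, and the olefin pattern comes back unchanged, so it would still need a separate decision about whether to drop it.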
from mmpdb.
Going back to acquaregia's request, can you give an example of the SMIRKS you are interested in?
I can see two steps that might be affected: 1) limit the fragment to just a few SMARTS patterns, and 2) limit the indexing to just a few SMIRKS patterns.
I would like to see some of the SMIRKS to get a better feel for how to handle this. For example, if all of the SMIRKS were transforms of R-groups to R-groups, where the R-groups could be defined as SMILES fragments with a single attachment point denoted *, then those SMILES could be merged into a single recursive SMARTS.
Otherwise, if multiple distinct SMARTS are needed, then the mmpdb file formats need to change somehow in order to store them. There could be multiple entries, one per definition, or they could be space/tab separated.
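The two storage options could be compared with a small sketch like the one below. The record layout and the `cut_smarts` key are assumptions for illustration and are not taken from the actual mmpdb file formats. The whitespace-separated variant relies on the fact that SMARTS strings contain no whitespace.

```python
import json

cut_smarts = ["[c:1]-!@[c:2]", "[n;+0:1]-!@[c:2]"]

# Option A: one entry per definition.
record_a = json.dumps({"cut_smarts": cut_smarts})

# Option B: a single tab-separated string.
record_b = json.dumps({"cut_smarts": "\t".join(cut_smarts)})


def read_cut_smarts(line):
    """Return the list of patterns from either hypothetical encoding."""
    value = json.loads(line)["cut_smarts"]
    if isinstance(value, str):
        # Covers option B and also an older record holding a single pattern.
        return value.split()
    return list(value)


print(read_cut_smarts(record_a))
print(read_cut_smarts(record_b))
```

One point in favor of the separated-string form is that a reader written this way would also accept a record that stores only a single SMARTS string.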
from mmpdb.