
lychi's Introduction

LyChI (Layered Chemical Identifier)

This directory contains code for BARD/GSRS structure standardizer and hash-code generator.

To build, simply type:

bash make.sh

You will need to have Maven installed. The make.sh script simply adds the dependencies and calls mvn clean package.

This builds two jar files (one with dependencies and one without). They are placed in the target directory and look like:

target/lychi-0.5.1.jar  
target/lychi-0.5.1-jar-with-dependencies.jar

The self-contained jar file can be invoked directly. For example:

java -jar target/lychi-0.5.1-jar-with-dependencies.jar tests/standardizer_case1.smi

lychi's People

Contributors

caodac, dependabot[bot], olegursu, tylerperyea

lychi's Issues

Isotope Perception

Compounds that differ only by an isotope of one atom sometimes share no layers of their hashes. I believe this is a bug, but it may be by design.

Example 1

Iothalamic acid

Consider the various isotopically enriched forms of iothalamic acid:
isotope

None of the above share any layers of the lychi hash, as you can see from the output:

CNC(=O)C1=C([131I])C(NC(C)=O)=C(I)C(C(O)=O)=C1I MONO_SUB    9294TS2JN-NS2GK3VWU6-N69MTFGPKFU-N6U76FGFF34C
CNC(=O)C1=C(I)C(C(O)=O)=C([131I])C(NC(C)=O)=C1I MONO_SUB    MJ4GHJUB3-37XZ9WZ6P5-35BMVBQL8G6-3567XZBU869G
CNC(=O)C1=C([131I])C(C(O)=O)=C(I)C(NC(C)=O)=C1I MONO_SUB    F5QJZ4FYL-LL2D31KA4X-LXCYK4978LC-LXCJLCYPHSJZ
CNC(=O)C1=C([131I])C(C(O)=O)=C([131I])C(NC(C)=O)=C1I    DI_SUB  5SCL4MFSZ-Z71PN9SJPP-ZPWTG3CD1SR-ZPRG7AZ57V5Y
CNC(=O)C1=C([131I])C(C(O)=O)=C(I)C(NC(C)=O)=C1[131I]    DI_SUB  58CNGBXRJ-JB98H1963G-JGJP5R392U1-JG1XQX6DK7M1
CNC(=O)C1=C([131I])C(NC(C)=O)=C([131I])C(C(O)=O)=C1I    DI_SUB  PS852WNDH-HFJ4X5AUVA-HATLUZGA8KG-HAGHQ57QT37R
CNC(=O)C1=C([131I])C(C(O)=O)=C([131I])C(NC(C)=O)=C1[131I]   TRI_SUB YDGUTZDXP-PYVYW99MKH-PHL5R2UNMCM-PHMWU77U2LYV
CNC(=O)C1=C(I)C(C(O)=O)=C(I)C(NC(C)=O)=C1I  NO_SUB  D1DBNGVNG-G9T7D2UU8L-GLA8MR5PGYK-GLKRLMDQ31TX

Ideally, these would all be the same up to the very last level of the hash.
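
To make the comparison concrete, here is a minimal sketch that splits each hash from the listing above into its dash-separated layers and counts how many leading layers it shares with the non-enriched (NO_SUB) form. The helper name `shared_layers` is my own and is not part of LyChI; the labels and hashes are copied from the output above.

```python
# (label, hash) pairs copied from the LyChI output listed above.
RESULTS = [
    ("MONO_SUB", "9294TS2JN-NS2GK3VWU6-N69MTFGPKFU-N6U76FGFF34C"),
    ("MONO_SUB", "MJ4GHJUB3-37XZ9WZ6P5-35BMVBQL8G6-3567XZBU869G"),
    ("MONO_SUB", "F5QJZ4FYL-LL2D31KA4X-LXCYK4978LC-LXCJLCYPHSJZ"),
    ("DI_SUB", "5SCL4MFSZ-Z71PN9SJPP-ZPWTG3CD1SR-ZPRG7AZ57V5Y"),
    ("DI_SUB", "58CNGBXRJ-JB98H1963G-JGJP5R392U1-JG1XQX6DK7M1"),
    ("DI_SUB", "PS852WNDH-HFJ4X5AUVA-HATLUZGA8KG-HAGHQ57QT37R"),
    ("TRI_SUB", "YDGUTZDXP-PYVYW99MKH-PHL5R2UNMCM-PHMWU77U2LYV"),
    ("NO_SUB", "D1DBNGVNG-G9T7D2UU8L-GLA8MR5PGYK-GLKRLMDQ31TX"),
]


def shared_layers(hash_a, hash_b):
    """Count the leading dash-separated layers two hashes have in common."""
    count = 0
    for layer_a, layer_b in zip(hash_a.split("-"), hash_b.split("-")):
        if layer_a != layer_b:
            break
        count += 1
    return count


reference = RESULTS[-1][1]  # the non-enriched form
for label, lychi in RESULTS[:-1]:
    print(label, shared_layers(lychi, reference))
```

If isotopes were confined to the last layer, every enriched form would report three shared layers against the reference.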

imidazole not tautomers

Hi Trung

In the example below

example

LyChI considers the structures above to be tautomers (same hash key) with the -keto-enol option off. However, the number of hydrogens is not the same, so they should perhaps get different hash keys.

Non-stereo Encoding Problem with Explicit Hydrogens

In certain cases, explicit hydrogens seem to cause trouble for the atom-labelling layer of the hash. In these cases, the SMILES generated by the standardizer produces a different hash than the input molfile itself.

Consider the following poorly laid-out structure:
encodeprob
[molfile below]

Generating the hash directly from this molfile gives the following Std_SMILES:

[H][C@@]12CC3=C(C(O)=C(OC)C(C)=C3)[C@@]([H])(N1C)[C@@]4([H])N([C@H]2O)[C@@]5([H])COC(=O)[C@]8(CS[C@]4([H])C6=C5C7=C(OCO7)C(C)=C6OC(C)=O)NCCC9=C8C=C(OC)C(O)=C9

And this hash:

DCLRH149F-FGAV2BD6PA-FA8DSLTXL4L-FALJX635AFC5

However, when that same smiles is fed into the standardizer, I get:

DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUZ42LBF8VB

If the explicit hydrogens are removed entirely:
encodeprob2

The output hash is now compatible with the smiles.

DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCU1SY5C8458

Molfile for explicit hydrogen version:


  Ketcher 12201304332D 1   1.00000     0.00000     0

 59 67  0     1  0            999 V2000
   -2.2321   -1.8660    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321   -1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981   -0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4641    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4641    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.3301    2.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981    2.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981    3.5000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4641    4.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660    2.5000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4740    1.2647    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9071   -0.4750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -1.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660   -2.5000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8561   -2.3746    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.4488   -3.1947    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5544   -3.0234    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.3132   -1.2000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.5741   -1.3179    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.8632    0.2250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0294    0.9234    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.9488    1.1197    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8811    1.3246    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5981    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4244    1.4848    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.6097    2.0768    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7419    1.9858    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7927    2.9165    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.4641    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.9301    0.3000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4641   -1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2072   -1.6691    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.8005   -2.5827    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8060   -2.4781    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.5981   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321   -1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.6506   -0.2222    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.4172    0.1894    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.4966    1.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.8342    1.6954    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.0136    2.6792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.2512    3.3264    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4306    4.3102    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.3096    2.9897    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5473    3.6370    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.7266    4.6207    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.1303    2.0060    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8676    1.3838    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  1     0  0
  2  3  1  0     0  0
  3  4  1  0     0  0
  4  5  1  0     0  0
  5  6  2  0     0  0
  6  7  1  0     0  0
  6  8  1  0     0  0
  8  9  1  0     0  0
  9 10  1  0     0  0
  8 11  2  0     0  0
 11 12  1  0     0  0
 11 13  1  0     0  0
  4 13  2  0     0  0
 13 14  1  0     0  0
 14 15  1  1     0  0
 14 16  1  0     0  0
  2 16  1  0     0  0
 16 17  1  0     0  0
 14 18  1  0     0  0
 18 19  1  1     0  0
 18 20  1  0     0  0
 20 21  1  0     0  0
  2 21  1  0     0  0
 21 22  1  1     0  0
 20 23  1  0     0  0
 23 24  1  1     0  0
 23 25  1  0     0  0
 25 26  1  0     0  0
 26 27  1  0     0  0
 27 28  2  0     0  0
 29 27  1  0     0  0
 29 30  1  6     0  0
 30 31  1  0     0  0
 31 32  1  0     0  0
 18 32  1  0     0  0
 32 33  1  1     0  0
 32 34  1  0     0  0
 34 35  1  0     0  0
 35 36  1  0     0  0
 36 37  1  0     0  0
 37 38  1  0     0  0
 37 39  2  0     0  0
 35 40  2  0     0  0
 40 41  1  0     0  0
 40 42  1  0     0  0
 42 43  1  0     0  0
 43 44  1  0     0  0
 44 45  1  0     0  0
 45 46  1  0     0  0
 42 46  2  0     0  0
 46 47  1  0     0  0
 23 47  1  0     0  0
 34 47  2  0     0  0
 29 48  1  0     0  0
 48 49  1  0     0  0
 49 50  1  0     0  0
 50 51  1  0     0  0
 51 52  1  0     0  0
 52 53  2  0     0  0
 53 54  1  0     0  0
 53 55  1  0     0  0
 55 56  1  0     0  0
 56 57  1  0     0  0
 55 58  2  0     0  0
 58 59  1  0     0  0
 29 59  1  0     0  0
 51 59  2  0     0  0
M  END

Cannot find output

When I run the test:
java -jar dist/lychi-all-v0.1.jar tests/standardizer_case1.smi
I am not able to find the output; it only prints results on the command line.

Round-trip Disagreement

Some structures (usually with several fused rings that contain stereo annotations) don't return the same hash after a round-trip. I'm not sure why this happens.

Example:

[H][C@@]12[C@@H]3SC[C@]4(NCCC5=C4C=C(OC)C(O)=C5)C(=O)OC[C@H](N1[C@@H](O)[C@@H]6CC7=C([C@H]2N6C)C(O)=C(OC)C(C)=C7)C8=C9OCOC9=C(C)C(OC(C)=O)=C38

Yields:

DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUZ42LBF8VB

But, if the output file is fed through again, I get:

DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUZ3C1UCNTD

Each new loop seems to agree with the last hash. This may be due to parity conflict resolution, which seems to be done arbitrarily. If there is ambiguity/conflict, it would probably be better to err on the side of no annotation. However, I think this example does contain enough information to work.
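
A round-trip check like the one described can be sketched as follows. This is only an illustration: `standardize` is a hypothetical callable that takes a structure and returns the standardized structure plus its hash, and `fake_standardize` is a stand-in that mimics the behaviour reported above rather than a real call into LyChI.

```python
def round_trip(structure, standardize, max_rounds=10):
    """Feed the standardizer's output back in until the hash stops changing.

    `standardize` is a hypothetical callable: structure -> (std_structure, hash).
    Returns the list of hashes seen, one per pass.
    """
    hashes = []
    for _ in range(max_rounds):
        structure, lychi = standardize(structure)
        hashes.append(lychi)
        if len(hashes) >= 2 and hashes[-1] == hashes[-2]:
            break
    return hashes


def is_round_trip_stable(hashes):
    """Stable means the first pass already produced the final hash."""
    return len(set(hashes)) == 1


def fake_standardize(structure):
    # Stand-in reproducing the two hashes reported above; not the real LyChI call.
    if structure == "input":
        return "pass1", "DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUZ42LBF8VB"
    return "pass2", "DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUZ3C1UCNTD"


hashes = round_trip("input", fake_standardize)
print(hashes, is_round_trip_stable(hashes))
```

Run over the test set with the real standardizer plugged in, a loop like this would flag every structure whose hash changes between the first and second pass.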

Similarly, this happens with the following (theoretically equivalent) molfile:


  Symyx   02191314562D 1   1.00000     0.00000     0

 55 63  0     1  0            999 V2000
    5.2579  -10.3796    0.0000 N   0  0  3  0  0  0           0  0  0
    5.2643  -11.3463    0.0000 C   0  0  2  0  0  0           0  0  0
    6.9059  -11.3355    0.0000 C   0  0  0  0  0  0           0  0  0
    6.8995  -10.3688    0.0000 C   0  0  0  0  0  0           0  0  0
    3.7293  -11.1063    0.0000 N   0  0  3  0  0  0           0  0  0
    4.4671  -12.0064    0.0000 C   0  0  2  0  0  0           0  0  0
    6.3356   -7.6267    0.0000 C   0  0  1  0  0  0           0  0  0
    3.2045  -12.4015    0.0000 C   0  0  0  0  0  0           0  0  0
    4.4093   -9.9560    0.0000 C   0  0  2  0  0  0           0  0  0
    6.0714  -11.7993    0.0000 C   0  0  1  0  0  0           0  0  0
    3.4158  -10.1363    0.0000 C   0  0  2  0  0  0           0  0  0
    6.0592   -9.9452    0.0000 C   0  0  2  0  0  0           0  0  0
    7.7339  -11.7884    0.0000 C   0  0  0  0  0  0           0  0  0
    7.7217   -9.9342    0.0000 C   0  0  0  0  0  0           0  0  0
    3.2108  -13.3557    0.0000 C   0  0  0  0  0  0           0  0  0
    7.0154   -7.0763    0.0000 C   0  0  0  0  0  0           0  0  0
    8.5767  -11.3245    0.0000 C   0  0  0  0  0  0           0  0  0
    8.5703  -10.3578    0.0000 C   0  0  0  0  0  0           0  0  0
    2.3766  -11.9653    0.0000 C   0  0  0  0  0  0           0  0  0
    5.4048   -8.0161    0.0000 C   0  0  0  0  0  0           0  0  0
    2.3695  -10.8862    0.0000 C   0  0  0  0  0  0           0  0  0
    2.4719  -13.7855    0.0000 C   0  0  0  0  0  0           0  0  0
    7.1803   -8.7253    0.0000 C   0  0  0  0  0  0           0  0  0
    7.7977   -7.5504    0.0000 C   0  0  0  0  0  0           0  0  0
    6.1550   -8.6695    0.0000 O   0  0  0  0  0  0           0  0  0
    7.0097   -6.2138    0.0000 C   0  0  0  0  0  0           0  0  0
    5.7261   -9.3390    0.0000 C   0  0  0  0  0  0           0  0  0
    5.5820   -7.0857    0.0000 N   0  0  0  0  0  0           0  0  0
    1.6316  -13.3661    0.0000 C   0  0  0  0  0  0           0  0  0
    7.7492  -12.8549    0.0000 O   0  0  0  0  0  0           0  0  0
    1.6254  -12.4119    0.0000 C   0  0  0  0  0  0           0  0  0
    8.5362   -7.0663    0.0000 C   0  0  0  0  0  0           0  0  0
    7.9612   -8.9785    0.0000 O   0  0  0  0  0  0           0  0  0
    7.7858   -5.7420    0.0000 C   0  0  0  0  0  0           0  0  0
    8.5302   -6.1580    0.0000 C   0  0  0  0  0  0           0  0  0
    9.4749   -9.7810    0.0000 O   0  0  0  0  0  0           0  0  0
    8.6620  -13.5281    0.0000 C   0  0  0  0  0  0           0  0  0
    8.9890   -8.7634    0.0000 C   0  0  0  0  0  0           0  0  0
    4.5516   -8.1550    0.0000 O   0  0  0  0  0  0           0  0  0
    8.6687  -14.5489    0.0000 O   0  0  0  0  0  0           0  0  0
    4.0768  -13.8833    0.0000 O   0  0  0  0  0  0           0  0  0
    3.1995  -11.6390    0.0000 C   0  0  0  0  0  0           0  0  0
    2.4776  -14.6439    0.0000 O   0  0  0  0  0  0           0  0  0
    5.5763   -6.2233    0.0000 C   0  0  0  0  0  0           0  0  0
    9.4200   -5.8813    0.0000 O   0  0  0  0  0  0           0  0  0
    9.6512  -11.8924    0.0000 C   0  0  0  0  0  0           0  0  0
    9.0177   -7.4215    0.0000 O   0  0  0  0  0  0           0  0  0
    6.2861   -5.8061    0.0000 C   0  0  0  0  0  0           0  0  0
    0.9270  -13.9374    0.0000 C   0  0  0  0  0  0           0  0  0
    9.5507  -13.0639    0.0000 C   0  0  0  0  0  0           0  0  0
    1.6559  -15.1576    0.0000 C   0  0  0  0  0  0           0  0  0
    9.7288   -7.2168    0.0000 C   0  0  0  0  0  0           0  0  0
    6.0443  -12.7495    0.0000 S   0  0  0  0  0  0           0  0  0
    5.2569  -12.1255    0.0000 H   0  0  0  0  0  0           0  0  0
    4.4042   -9.0125    0.0000 O   0  0  0  0  0  0           0  0  0
  2  1  1  0     0  0
  3  4  2  0     0  0
  4 12  1  0     0  0
 11  5  1  6     0  0
  6  2  1  0     0  0
  7 20  1  6     0  0
  8  6  1  0     0  0
  9  1  1  0     0  0
 10  2  1  0     0  0
 11  9  1  0     0  0
 12  1  1  0     0  0
 13  3  1  0     0  0
 14  4  1  0     0  0
 15  8  2  0     0  0
 16  7  1  0     0  0
 17 18  1  0     0  0
 18 14  2  0     0  0
 19 21  1  0     0  0
 20 25  1  0     0  0
 21 11  1  0     0  0
 22 15  1  0     0  0
 24 16  2  0     0  0
 25 27  1  0     0  0
 26 16  1  0     0  0
 12 27  1  1     0  0
 28  7  1  0     0  0
 29 31  1  0     0  0
 30 13  1  0     0  0
 31 19  2  0     0  0
 32 24  1  0     0  0
 33 14  1  0     0  0
 34 26  2  0     0  0
 35 34  1  0     0  0
 36 18  1  0     0  0
 37 30  1  0     0  0
 38 33  1  0     0  0
 39 20  2  0     0  0
 40 37  2  0     0  0
 41 15  1  0     0  0
 42  5  1  0     0  0
 43 22  1  0     0  0
 44 28  1  0     0  0
 45 35  1  0     0  0
 46 17  1  0     0  0
 47 32  1  0     0  0
 48 44  1  0     0  0
 49 29  1  0     0  0
 50 37  1  0     0  0
 51 43  1  0     0  0
 52 47  1  0     0  0
 10 53  1  1     0  0
  2 54  1  6     0  0
 10  3  1  0     0  0
  6  5  1  6     0  0
 19  8  1  0     0  0
  7 23  1  1     0  0
 38 36  1  0     0  0
 17 13  2  0     0  0
 29 22  2  0     0  0
 48 26  1  0     0  0
 35 32  2  0     0  0
 53 23  1  0     0  0
  9 55  1  6     0  0
M  END

Which gets:
java -jar lychi-all-v0.1.jar test.mol

DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUS96TNY5ZD

java -jar lychi-all-v0.1.jar test.mol | java -jar lychi-all-v0.1.jar

DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUZ3C1UCNTD

E/Z perception on tautomers

In certain cases, unspecified E/Z information is encoded as known (or known E/Z information is lost) based on tautomer generation.

Example 1

Consider the following two structures, which have the same smiles, but are drawn differently (molfiles at the bottom).

Compare:

CN1C(=O)/C(=N\NC(N)=S)C2=CC=CC=C12

cistrans2

2ATKPHXN6-63AZUWLKFU-6U9JBVHA63M-6UM6PRK6J3GX

vs

CN1C(=O)/C(=N\NC(N)=S)C2=CC=CC=C12

cistrans

2ATKPHXN6-63AZUWLKFU-6U9JBVHA63M-6UMV4H3F7CST

Notice that while the SMILES representations are exactly the same, the structures still get different hashes based on their initial coordinates. This happens because the canonical tautomer has a different E/Z bond location than the one drawn above:

CN1C(=O)/C(=N\NC(N)=S)C2=CC=CC=C12

cistrans3

After selecting the preferred tautomer, E/Z is apparently recalculated based on the original atom coordinates. This leads two apparently identical structures to have different hashes.

The resolution to this problem isn't trivial, and is more a shortcoming of valence bond theory than of the encoding in general. This will require a bit of research, and an expert should be consulted. My intuition is that any cis/trans designation should be allowed if (and only if) both involved bonded atoms remain in an sp2 hybridized state across all tautomers (therefore the atoms and their substituents should remain coplanar).

If this is accurate, there is an unfortunate corollary: the preferred tautomer in the above example is either wrong, or should capture cis/trans information about the exocyclic bond, even though it is not explicitly a double bond.
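
The proposed rule can be written down as a small predicate. This is a sketch of the intuition only, under the assumption that each tautomer is available as a mapping from atom index to a hybridization label; the data model, the function name `ez_label_allowed` and the toy values below are all hypothetical and do not come from LyChI or from the structures above.

```python
from typing import Dict, List, Tuple

# Hypothetical data model: one dict per tautomer, atom index -> "sp", "sp2" or "sp3".
Tautomer = Dict[int, str]


def ez_label_allowed(bond: Tuple[int, int], tautomers: List[Tautomer]) -> bool:
    """Allow a cis/trans label only if both bonded atoms are sp2 in every tautomer."""
    a, b = bond
    return bool(tautomers) and all(
        t.get(a) == "sp2" and t.get(b) == "sp2" for t in tautomers
    )


# Toy example: atoms 5 and 6 stay sp2 in both tautomers, atom 7 does not.
toy_tautomers = [
    {5: "sp2", 6: "sp2", 7: "sp2"},
    {5: "sp2", 6: "sp2", 7: "sp3"},
]
print(ez_label_allowed((5, 6), toy_tautomers))
print(ez_label_allowed((6, 7), toy_tautomers))
```

Under this rule the label is attached to the pair of atoms rather than to a formal double bond, which is what the corollary above requires.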

The molfiles for the above structures are posted here for convenience:


  Ketcher 12191320432D 1   1.00000     0.00000     0

 16 17  0     0  0            999 V2000
    0.4048    3.7213    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0739    2.9781    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.0684    3.0827    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5684    3.9487    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.4752    2.1691    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4534    1.9612    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.1225    2.7044    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1006    2.4964    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4096    1.5454    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.7697    3.2396    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  2  3  1  0     0  0
  3  4  2  0     0  0
  3  5  1  0     0  0
  5  6  2  0     0  0
  6  7  1  0     0  0
  7  8  1  0     0  0
  8  9  1  0     0  0
  8 10  2  0     0  0
  5 11  1  0     0  0
 11 12  2  0     0  0
 12 13  1  0     0  0
 13 14  2  0     0  0
 14 15  1  0     0  0
 15 16  2  0     0  0
 16  2  1  0     0  0
 16 11  1  0     0  0
M  END

  Ketcher 12191320472D 1   1.00000     0.00000     0

 16 17  0     0  0            999 V2000
    8.1745    3.7213    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.8436    2.9781    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    9.8381    3.0827    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.3381    3.9487    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   10.2449    2.1691    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.2231    1.9612    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.8922    2.7044    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.4775    3.6036    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.1079    4.5454    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   10.7180    3.7039    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    9.5018    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6357    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.7697    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.7697    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6357    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.5018    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  2  3  1  0     0  0
  3  4  2  0     0  0
  3  5  1  0     0  0
  5  6  2  0     0  0
  6  7  1  0     0  0
  7  8  1  0     0  0
  8  9  1  0     0  0
  8 10  2  0     0  0
  5 11  1  0     0  0
 11 12  2  0     0  0
 12  2  1  0     0  0
 12 13  1  0     0  0
 13 14  2  0     0  0
 14 15  1  0     0  0
 15 16  2  0     0  0
 16 11  1  0     0  0
M  END

Non-Tetrahedral Stereochemistry

There are some non-tetrahedral stereochemistry annotations that are ignored in standardization. I wouldn't call these bugs, as they're somewhat obscure, and generally, have poor support in existing drawing/encoding software. However, I think they're important to note and worthwhile to investigate. I have included sd files for each case with its enantiomer in the tests folder.

Allene-Like Stereochemistry

JChem Smiles : Not supported
Daylight Smiles : Supported
InChI: Supported (via molfile)

Example: Mycomycin

allenelike

This is a special case, often described as tetrahedral stereo stretched out across two consecutive double bonds (allene). The de facto standard for drawing this configuration is to use dash/wedge for one side of the allene, and a cis/trans-like configuration on the other side. InChI respects this convention and will generate 2 different keys if I invert the dashes and wedges. Daylight SMILES also allows this to be encoded (according to their website), but most tools I use either break or ignore their published rules.

Daylight smiles:

OC(C/C=C/C=C\C([H])=[C@]=C([H])C#CC#C)=O

Molfile:


  Ketcher 12201302422D 1   1.00000     0.00000     0

 17 16  0     0  0            999 V2000
   -0.5000   -0.8660    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5000    0.8660    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000   -0.8660    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5000   -0.8660    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0000   -1.7321    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0000   -1.7321    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5000   -0.8660    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.5000   -0.8660    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.5000   -0.8660    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.5000    0.8660    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.0000    1.7321    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.5000    2.5981    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    7.0000   -1.7320    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  2  3  2  0     0  0
  2  4  1  0     0  0
  4  5  1  0     0  0
  5  6  2  0     0  0
  6  7  1  0     0  0
  7  8  2  0     0  0
  8  9  1  0     0  0
  9 10  2  0     0  0
 10 11  2  0     0  0
 11 12  1  6     0  0
 12 13  3  0     0  0
 13 14  1  0     0  0
 14 15  3  0     0  0
  9 16  1  0     0  0
 11 17  1  1     0  0
M  END

Square-Planar Stereochemistry

JChem Smiles : Not supported
Daylight Smiles : Supported
InChI: Not supported

Example: Cisplatin
cisplat2
The left is cisplatin, the right is transplatin. They are distinct molecules that behave very differently in the clinic, and yet they are rarely treated as distinct by toolkits and registration systems. The square-planar stereochemistry is very straightforward to draw. However, 1-D encoding and graph invariant annotations are undersupported for this class. This is one of the simpler extensions into inorganic chemistry that could be accomplished, and still, unfortunately, requires a bit of groundwork. Daylight's website claims to support this, but, again, I haven't found a tool that accepts their encoding.

Daylight Smiles:

(N)[Pt@@SP1+2](N)([Cl-])[Cl-]

Molfile:


  -OEChem-12201301332D

  5  4  0     0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 Pt  0  2  0  0  0  0  0  0  0  0  0  0
    1.0308   -1.0897    0.0000 Cl  0  5  0  0  0  0  0  0  0  0  0  0
    1.0897    1.0308    0.0000 Cl  0  5  0  0  0  0  0  0  0  0  0  0
   -1.0308    1.0897    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0897   -1.0308    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  1  5  1  0  0  0  0
M  CHG  3   1   2   2  -1   3  -1
M  END

Restricted Rotation Axial Stereochemistry

JChem Smiles : Not supported
Daylight Smiles : Not supported
InChI: Not supported

Example: R-GOSSYPOL
gossypol

This happens in the special case where two phenyl rings are connected via a single bond, and both have sufficiently sized ortho substituents to restrict free rotation. Conceptually, this is similar to allene-like stereochemistry, in that the "stereo center" occurs across an axis rather than at a specific atom. However, I have found no 1D encoding of this form, and most molfile representations will try to overuse wedge bonds or non-standard "thicker" bonds to emphasize 3 dimensionality (much like with morphine). From my view, a single wedge/dash inside one of the aromatic rings in the molfile is ugly, but sufficient for annotation. If anyone is aware of accepted standards on drawing / encoding this, please let me know. I'd love to learn of a simple smiles extension that would encode this.

Molfile:


  Ketcher 12201301552D 1   1.00000     0.00000     0

 22 25  0     0  0            999 V2000
    0.8660    1.5000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981   -0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321   -2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660   -0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660   -0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -2.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321   -2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5981   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5981   -0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4641    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4641    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5981    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  2  3  2  0     0  0
  3  4  1  0     0  0
  4  5  2  0     0  0
  5  6  1  0     0  0
  6  7  2  0     0  0
  7  8  1  0     0  0
  8  9  2  0     0  0
  9 10  1  0     0  0
 10  5  1  0     0  0
 10 11  2  0     0  0
 11  2  1  0     0  0
 11 12  1  0     0  0
 12 13  2  0     0  0
 13 14  1  0     0  0
 13 15  1  0     0  0
 15 16  2  0     0  0
 16 17  1  0     0  0
 17 18  2  0     0  0
 18 19  1  0     0  0
 19 20  2  0     0  0
 20 21  1  0     0  0
 21 22  2  0     0  0
 12 22  1  1     0  0
 22 17  1  0     0  0
M  END

lychi output

Hi all,

I have a question regarding the output, for example:

echo "C=CNCCC1=CC=CC=C1.Cl" | java -jar dist/lychi-all-v0.1.jar

produces 2 hash keys:

CC=NCCC1=CC=CC=C1 null N4765661Y-YGFUN3DUM8-Y8HGNSQ89JY-Y8Y5C3KXJ1U2
C\C=N\CCC1=CC=CC=C1 null N4765661Y-YGFUN3DUM8-Y8HGNSQ89JY-Y8Y6J8LJGS98

Which one is supposed to be the canonical one?

stereo issues

There are issues when multiple stereocenters are next to each other. See doxorubicin and carubicin in the tests.

Meaningless Stereo is sometimes honored

Some meaningless stereo annotations (wedge/dash bonds) produce different hashes than non-annotated bonds.

Example 1

Compare:

C[C@H]1OC(C)O[C@@H](C)O1

stereononsense

WDF2GBCFX-X5KQLPPFPK-XK7RRPGCCM2-XK25W2RXGM3Z

vs

CC1OC(C)OC(C)O1

stereononsense2

WDF2GBCFX-X5KQLPPFPK-XK7RRPGCCM2-XK23BSZ142DG

In this example, there are 3 stereo centers that could be annotated. However, out of the 8 absolute permutations, only 2 are actually unique:
stereononsenseexplained

You'll notice that of all 8 possibilities, only 2 are non-degenerate. And in both cases, it must be the case that at least two adjacent methyl groups are on the same side of the ring. So the information provided by the first structure is self-evident.
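
The count of 2 can be checked by brute force. The sketch below treats the ring as an idealized planar hexagon with three equivalent substituted positions, which is my simplifying assumption: the 120-degree rotations permute the positions cyclically, and turning the ring over about an in-plane axis swaps two positions while exchanging wedge and dash everywhere.

```python
from itertools import product

# Proper symmetry operations of the idealized ring, as index permutations.
ROTATIONS = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]  # ring stays face up
FLIPS = [(0, 2, 1), (2, 1, 0), (1, 0, 2)]      # ring turned over


def canonical(config):
    """Return the smallest assignment equivalent to `config` (0 = dash, 1 = wedge)."""
    images = [tuple(config[i] for i in perm) for perm in ROTATIONS]
    # Turning the ring over also exchanges wedge and dash on every position.
    images += [tuple(1 - config[i] for i in perm) for perm in FLIPS]
    return min(images)


configs = list(product((0, 1), repeat=3))
classes = sorted({canonical(c) for c in configs})
print(len(configs), classes)
```

The two surviving classes are the all-same-side arrangement and the two-and-one arrangement, so a wedge/dash pattern that only says "two neighbours are on the same side" carries no information.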

The InChI algorithm does handle this specific case (possibly by accident), but it does not handle the general issue, as explained in example 2.

Example 2

Compare:

[C@H](C)1CCC(C)CC1

stereononsense3

T75RBW5S8-8D9T563A7Y-8YC8NQXD9W5-8Y5APDLVJ782

vs

C(C)1CCC(C)CC1

stereononsense4

T75RBW5S8-8D9T563A7Y-8YC8NQXD9W5-8Y5VPCVHUV1Z

Again, these should be equivalent, but currently generate different hashes. For reasons I can't imagine, the two above also generate different InChIs. This is especially odd, considering it's a much simpler case of the general problem explained in example 1.

Isotope Perception: Nonspecific site

Certain compounds are isotopically enriched with a specific isotope, but at non-specific atoms. It would be helpful for these to be understood in their most commonly drawn format.

Example 1

Iothalamic acid I-131
CasRN: 770645-97-1
ChemID link:
http://chem.sis.nlm.nih.gov/chemidplus/rn/770645-97-1

This structure is how it's shown in ChemID, and represents a typical strategy for capturing isotope enrichment:

Enriched:

radio
HASH:

<NONE>

Non-enriched:

nonisotope

HASH:

D1DBNGVNG-G9T7D2UU8L-GLXN2UBNFSF-GLF3JS6KB6T7

The first fails in standardization, as it contains query / pseudo atoms. However, it is common enough that it should probably be handled. It would ideally produce the same hash as if it were missing the isotopic enrichment, with a distinct 4th level hash meaning "mixed isotopic enrichment".

Proposed HASH:

D1DBNGVNG-G9T7D2UU8L-GLXN2UBNFSF-<DISTINCT>

Enriched Molfile:

4YR0DGU31K
  Symyx   12171317512D 1   1.00000     0.00000     0

 22 21  0     0  0            999 V2000
    9.7243   -3.7903    0.0000 *   0  0  0  0  0  0           0  0  0
   10.1951   -3.7903    0.0000 I   4  0  0  0  0  0           0  0  0
    4.3451   -6.0362    0.0000 C   0  0  0  0  0  0           0  0  0
    5.0076   -5.6653    0.0000 N   0  0  0  0  0  0           0  0  0
    5.0076   -4.9111    0.0000 C   0  0  0  0  0  0           0  0  0
    4.3451   -4.5111    0.0000 O   0  0  0  0  0  0           0  0  0
    5.6826   -4.5195    0.0000 C   0  0  0  0  0  0           0  0  0
    6.3409   -4.9112    0.0000 C   0  0  0  0  0  0           0  0  0
    6.9743   -4.5195    0.0000 C   0  0  0  0  0  0           0  0  0
    7.6493   -4.9112    0.0000 N   0  0  0  0  0  0           0  0  0
    8.3076   -4.5236    0.0000 C   0  0  0  0  0  0           0  0  0
    8.3076   -3.7903    0.0000 C   0  0  0  0  0  0           0  0  0
    8.9743   -4.9112    0.0000 O   0  0  0  0  0  0           0  0  0
    6.9743   -3.7695    0.0000 C   0  0  0  0  0  0           0  0  0
    7.6493   -3.3945    0.0000 I   0  0  0  0  0  0           0  0  0
    6.3409   -3.3861    0.0000 C   0  0  0  0  0  0           0  0  0
    6.3409   -2.6237    0.0000 C   0  0  0  0  0  0           0  0  0
    6.9951   -2.2528    0.0000 O   0  0  0  0  0  0           0  0  0
    5.6826   -2.2403    0.0000 O   0  0  0  0  0  0           0  0  0
    5.6826   -3.7695    0.0000 C   0  0  0  0  0  0           0  0  0
    5.0076   -3.3778    0.0000 I   0  0  0  0  0  0           0  0  0
    6.3409   -5.6153    0.0000 I   0  0  0  0  0  0           0  0  0
  1  2  1  0     0  0
  3  4  1  0     0  0
  4  5  1  0     0  0
  6  5  2  0     0  0
  5  7  1  0     0  0
  8  7  2  0     0  0
  9  8  1  0     0  0
 10  9  1  0     0  0
 11 10  1  0     0  0
 12 11  1  0     0  0
 13 11  2  0     0  0
  9 14  2  0     0  0
 15 14  1  0     0  0
 14 16  1  0     0  0
 17 16  1  0     0  0
 18 17  2  0     0  0
 19 17  1  0     0  0
 20 16  2  0     0  0
 21 20  1  0     0  0
  7 20  1  0     0  0
 22  8  1  0     0  0
M  ISO  1   2 131
M  END

2 different structures are perceived as same

The structures for IDs NCGC00249389 and NCGC00386324 appear fairly similar, but there is a difference in the position of a heteroatom within a ring.
Nevertheless, the 2 compounds have the same LyChI L1-3.

This fix would be useful for the NSRS project.

Benzocyclobutadiene resonance forms

Two resonance forms of benzocyclobutadiene are assigned different hash keys.

example

Benzocyclobutadiene itself is unstable, but there are stable substituted derivatives.

normalizing phosphorus group

Need to add additional rule to handle [P+][O-] vs P=O; e.g., OO[C@@H]1CCO[P@+]([O-])(N1)N(CCCl)CCCl and OO[C@@H]1CCO[P@](=O)(N1)N(CCCl)CCCl
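
To illustrate what such a rule would do, here is a deliberately naive sketch that rewrites the charge-separated form to the neutral one directly on the SMILES string. It is not how the standardizer is implemented: a real rule would operate on the molecular graph, and this regular expression only covers the exact pattern `[P+]([O-])` with an optional chirality mark. The function name `neutralize_phosphoryl` is hypothetical.

```python
import re

# Matches [P+]([O-]), [P@+]([O-]) and [P@@+]([O-]); the chirality mark is captured.
CHARGE_SEPARATED_PO = re.compile(r"\[P(@{0,2})\+\]\(\[O-\]\)")


def neutralize_phosphoryl(smiles: str) -> str:
    """Rewrite [P+]([O-]) as P(=O), keeping the oxygen in the same neighbour slot."""
    return CHARGE_SEPARATED_PO.sub(r"[P\1](=O)", smiles)


charged = "OO[C@@H]1CCO[P@+]([O-])(N1)N(CCCl)CCCl"
print(neutralize_phosphoryl(charged))
```

Because the oxygen stays in the same position in the neighbour list, the chirality mark on phosphorus can be carried over unchanged in this particular pattern.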
