rapodaca / dialect


Documenting a subset of the SMILES language.

License: MIT License

Shell 0.12% TeX 99.88%

dialect's Issues

Dead link?

Hi there, I came via your blog, but the link back near the top of the README seems dead? Should it be this one?
https://depth-first.com/articles/2021/09/22/beyond-smiles/

Table of Bond attributes

Attribute Name	Description	Type
kind	The kind of bond.	Enumeration
target	The index of the target atom.	Integer

Bond Kind	Bond Order	State
Elided	1	None
Single	1	None
Double	2	None
Triple	3	None
Up	1	Up
Down	1	Down
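The two bond tables above could be captured in a small lookup, sketched here in Python. The names `BondKind`, `BondInfo`, `BOND_TABLE`, `Bond`, and `bond_order` are hypothetical and chosen only for illustration; the orders and states are copied from the table.

```python
from enum import Enum
from typing import NamedTuple, Optional


class BondKind(Enum):
    ELIDED = "Elided"
    SINGLE = "Single"
    DOUBLE = "Double"
    TRIPLE = "Triple"
    UP = "Up"
    DOWN = "Down"


class BondInfo(NamedTuple):
    order: int
    state: Optional[str]  # None, "Up", or "Down"


# Bond order and state for each kind, as listed in the table.
BOND_TABLE = {
    BondKind.ELIDED: BondInfo(1, None),
    BondKind.SINGLE: BondInfo(1, None),
    BondKind.DOUBLE: BondInfo(2, None),
    BondKind.TRIPLE: BondInfo(3, None),
    BondKind.UP: BondInfo(1, "Up"),
    BondKind.DOWN: BondInfo(1, "Down"),
}


class Bond(NamedTuple):
    kind: BondKind
    target: int  # index of the target atom


def bond_order(bond: Bond) -> int:
    """Look up the bond order implied by the bond's kind."""
    return BOND_TABLE[bond.kind].order
```

Keeping order and state in one table means that a change to the set of bond kinds touches a single place.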

Add examples to Syntax section

The section currently lacks examples. It would be very helpful to include some to illustrate both acceptable and unacceptable encoding.

Attribute Name	Description	Type	Nullable	Meaning if null	Constraints	Default Value
index	A unique identifier	unsigned integer	no	N/A	-	N/A
isotope	The sum of proton and neutron counts.	unsigned integer	yes	Natural abundance	0 <= value < 1000	null
element	The atom's element.	enumeration	yes	unknown element	An IUPAC-approved element symbol.	null
virtual_hydrogens	The number of virtual hydrogens	unsigned integer	yes	zero virtual hydrogens	0 <= value < 10	0
algorithmic_hydrogens	If true, hydrogens are counted algorithmically.	boolean	no	-	-	true
configuration	Configurational descriptor	enumeration	yes	No configuration	tetracoordinate	null
extension	Application-specific data	integer	yes	No extension	0 <= value < 1000	null
selected	Whether the atom is selected.	boolean	No	-	element must be one of `C`, `N`, `O`, `P`, or `S`.	-
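As a rough illustration, the attribute table could map onto a validated record like the one below. This is only a sketch: the class name `Atom`, the set `SELECTABLE`, and the default of `False` for `selected` are assumptions (the table gives no default for `selected`), `virtual_hydrogens` is modeled as a plain integer because null means zero, and the element symbol is not checked against the IUPAC list.

```python
from dataclasses import dataclass
from typing import Optional

# Elements eligible for selection, per the "selected" row of the table.
SELECTABLE = {"C", "N", "O", "P", "S"}


@dataclass(frozen=True)
class Atom:
    index: int
    isotope: Optional[int] = None        # None: natural abundance
    element: Optional[str] = None        # None: unknown element
    virtual_hydrogens: int = 0
    algorithmic_hydrogens: bool = True
    configuration: Optional[str] = None  # None: no configuration
    extension: Optional[int] = None      # None: no extension
    selected: bool = False               # assumed default

    def __post_init__(self):
        # Enforce the numeric constraints listed in the table.
        if self.index < 0:
            raise ValueError("index must be unsigned")
        if self.isotope is not None and not 0 <= self.isotope < 1000:
            raise ValueError("isotope out of range")
        if not 0 <= self.virtual_hydrogens < 10:
            raise ValueError("virtual_hydrogens out of range")
        if self.extension is not None and not 0 <= self.extension < 1000:
            raise ValueError("extension out of range")
        if self.selected and self.element not in SELECTABLE:
            raise ValueError("element is not eligible for selection")
```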

Remove At and Ts from shortcuts

"At" was excluded from the "organic subset." Dropping At and Ts aligns Dialect with the original intent.

Adding elements past Lr to <element> breaks backward-compatibility, but is more in line with expectation than new elements in the organic subset.

Among other corrections, this means removing At and Ts from the valence table.

Also:

The next atomic production rule, <shortcut> is a non-terminal comprised of the terminal element symbols: "B"; "C"; "N"; "O"; "P"; "S"; "F"; "Cl"; "Br"; "I"; "At"; "Ts"; "P"; and "S".

Delocalization Subgraph Section

VB model makes graphs unequal even though they should be treated as equal
- DIME
the DS solution
- encodes alternating single-double bond pattern as a perfect matching
  - maximum, maximal, perfect matching
a node-induced subgraph over VB graph
- unlabeled
- possibly empty
- encodes and therefore guarantees a perfect matching
  - corollary: every atom will be assigned a double bond
- only node membership is specified
  - edge membership is deduced
composition
- atoms
  - C, N, O, P, S, unknown
- selection
  - adding an atom to the DS
  - "selected atom", "selected atoms"
  - "induced bond", "induced bonds"
bonds
- by definition (node-induced) both terminals are members
literature refers to "aromaticity"
- not using that term because of its (imprecise) meaning in chemistry
algorithmic fill/empty
- every molecule expressed with a DS can also be expressed without it
- in other words, 1:1 translation
- fill algorithm
- empty algorithm
  - aka "kekulization"
examples
- invent a notation for node/edge membership (color?, weight?)
- graphics only
error states
- no perfect matching possible
- to re-iterate: DS is just another way to express single/double bond layout
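The "empty algorithm" and the "no perfect matching possible" error state in the outline above can be illustrated with a small brute-force search. This is a sketch under stated assumptions, not the manuscript's algorithm: the function name `kekulize` is hypothetical, atoms are plain integer indices, bonds are index pairs without self-loops, and the backtracking search is exponential in the worst case (a production reader would use a polynomial matching algorithm such as Blossom).

```python
def kekulize(selected, bonds):
    """Return a perfect matching over the node-induced subgraph, or None.

    selected: set of atom indices in the delocalization subgraph
    bonds: iterable of (atom index, atom index) pairs for the whole graph
    """
    # Only node membership is specified; edge membership is deduced.
    neighbors = {atom: set() for atom in selected}
    for a, b in bonds:
        if a in selected and b in selected:
            neighbors[a].add(b)
            neighbors[b].add(a)

    def solve(unmatched):
        if not unmatched:
            return []
        atom = min(unmatched)
        # Try pairing the lowest unmatched atom with each free neighbor.
        for other in sorted(neighbors[atom] & unmatched):
            rest = solve(unmatched - {atom, other})
            if rest is not None:
                return [(atom, other)] + rest
        return None  # this atom cannot receive a double bond

    return solve(frozenset(selected))
```

For a fully selected six-membered ring the search returns three double bonds. For a fully selected five-membered ring it returns `None`, which corresponds to the error state listed above.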

Partial parity bonds mean that the dialect molecular graph is directed. Make sure that all statements about graph type are consistent with this point. For example, "Third, edges are undirected." It would be better to break the directed/undirected discussion out into a separate paragraph, noting that the reason for directed bonds will become clear.

Subset of SMILES

The ms makes several references to "dialect" in the linguistic sense, and this is of course the code name for the language. But the goal is not to make yet another dialect. The goal is to fully define, for the first time, a language that functions as a subset of SMILES-as-practiced. No extensions. No pet nice-to-haves. Just a subset, to the extent that's possible without internal inconsistencies.

Parts to improve:

Title
Line 37. The introduction leads to this point, so this paragraph must concisely spell out the aim. It doesn't quite hit the mark. It uses the "dialect" idea, for one.
Discussion section. Line 497 brings in SMILES. This is a good place to re-introduce the previous two papers on SMILES. Compare what's in them (and not in them) with Dialect.

Cite projects using SMILES

These occur in the 2nd paragraph of the introduction.

Explicitly enumerate the forms of stereochemistry that can't be represented

The paper enumerates several kinds of stereochemistry that can't be represented by Dialect. Unless they are explicitly disallowed, there could be a temptation to bend the rules. It might be worth discussing some of the more obvious cases and re-iterate the support for extensions.

Rules for the Nomenclature of Organic Chemistry, Section E: Stereochemistry

Bond elision

Need a paragraph or section talking about elided bonds. Bond elision is mentioned by the Delocalization Subgraph section.

No hex digits in `<extension>`

The current support of hex digits breaks compatibility with most SMILES implementations for a minimal, uncertain payoff at best. An <extension> is up to four decimal digits, interpreted as a single integer, with leading zeros allowed.
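A reader for this rule might look like the following sketch. The function name `parse_extension` is hypothetical, and requiring at least one digit is an assumption (the text only gives the upper limit of four).

```python
def parse_extension(digits: str) -> int:
    """Interpret one to four decimal digits as a single integer."""
    if not 1 <= len(digits) <= 4:
        raise ValueError("extension must have one to four digits")
    if any(ch not in "0123456789" for ch in digits):
        raise ValueError("extension must contain decimal digits only")
    # Leading zeros are allowed and do not change the value.
    return int(digits)
```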

Narrow the set of selectable atoms

Default valences are required to make an element eligible for selection. In other words, only eligible atoms can be selected.

Otherwise, pruning is impossible because the free hydrogen count cannot be determined. You effectively make up your own valence model. This is the main reason to exclude things like [te].

Note that for the same reason [as] and [se] should not be valid, either.

Reference Implementation Section

Those services needed by a reference implementation. One may or may not be ready in time for publication. Purr in its current form will definitely need revision to be compatible with Dialect. The extent isn't yet clear, though.

high-level description
- parser
- typesafe data structures
- errors
  - syntactical
  - semantic
  - cursor to error
    - not always possible
      - e.g., no perfect matching
subsequent publication

Implicit Hydrogens Section

Previously: "Hydrogen Suppression"

Comprehensively describe, modulo selection (aka "aromaticity"), the rules for computing implicit hydrogen count.

virtual hydrogens
- described in previous section
- unambiguous, but verbose
implicit hydrogens
- replace attribute with computed value
- computation without corruption requires a detailed protocol
target valence
- the number of hydrogens the free atom can bind to
  - one or more values, depending on atom
target valence table
- star (*) always has target = 0
- defines target valences for atoms supporting implicit hydrogen
- for example
  - a free carbon atom can bind four hydrogens
  - a free nitrogen atom can bind three or five hydrogens
atoms with elements not on the table are not eligible
- they must use virtual or graph hydrogens
works with graph hydrogens
- hydrogen acts like any other atom
example computations
- simple cases
  - C
  - N
  - CC
  - N(C)(C)(C)(C)
- weird/counterintuitive cases
either implicit hydrogens or virtual hydrogens, never both
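The example computations listed above can be worked through with a short sketch. Two things here are assumptions rather than statements from the outline: the rule of picking the smallest target valence that is at least the bond order sum (the convention commonly used for SMILES), and the names `TARGET_VALENCES` and `implicit_hydrogens`. The table holds only the entries the outline mentions.

```python
# Target valences named in the outline: star is always 0,
# carbon binds four hydrogens, nitrogen three or five.
TARGET_VALENCES = {"*": (0,), "C": (4,), "N": (3, 5)}


def implicit_hydrogens(element: str, bond_order_sum: int) -> int:
    """Compute the implicit hydrogen count for one atom."""
    targets = TARGET_VALENCES.get(element)
    if targets is None:
        # Not on the table: virtual or graph hydrogens must be used.
        raise ValueError(f"{element} does not support implicit hydrogens")
    for target in targets:
        if target >= bond_order_sum:
            return target - bond_order_sum
    return 0  # bond order sum exceeds every target valence
```

Under these assumptions, `C` gets four hydrogens, `N` gets three, each carbon in `CC` gets three, and the nitrogen in `N(C)(C)(C)(C)` gets one because its bond order sum of four selects the target valence of five. That last result is an instance of the counterintuitive cases the outline mentions.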

Empty molecular graph

Grammar supports the empty string, so the molecular graph can be empty. At least one statement is inconsistent:

"A Dialect molecule consists of a graph with at least one node and zero or more edges."

All such occurrences should be corrected to allow empty graphs.

Build system

It should be possible to build a preprint-ready manuscript in PDF format. The system should use Markdown as the source for accessibility.

Some ideas:

Remove quadruple bond

Extremely rare. Should it appear, it's very likely SMILES can't accurately serialize it. Remove quad bond from grammar and ms.

Conclusion Section

Address the idea that "there isn't anything new here."

a high and low level description of a SMILES dialect
supports most of SMILES as currently practiced
formalization > invention
the foundation for a reference implementation

Polymer molecules

One of the most interesting incomplete features in Daylight SMILES/OpenSMILES, to me, is the polymer extension. Currently, SMILES is mostly used for nonpolymer molecules, but this ignores a large set of chemical compounds.

I have explained the problem in timvdm/OpenSMILES#8 and IUPAC/IUPAC_SMILES_plus#9.

Thanks for an interesting and really much needed effort!

Add pitfalls to Reading section

A collection of tricky cases and non-obvious conclusions around reading Dialect.

C1C.C1 is allowed
- simple text processing (e.g., splitting on . with a regex) will generally fail
C/C, C\C are not valid
- one terminal of a directional bond must be a double bond terminal
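The first pitfall can be demonstrated with a few lines of Python. This sketch assumes single-digit closures and no bracket atoms (digits inside brackets would mean something else), and the helper name `unmatched_closures` is hypothetical.

```python
def unmatched_closures(fragment: str) -> set:
    """Return the closure digits left open in a fragment."""
    open_digits = set()
    for ch in fragment:
        if ch in "0123456789":
            # A digit opens a closure the first time and closes it the next.
            open_digits ^= {ch}
    return open_digits


# Splitting on "." separates the two halves of closure 1.
fragments = "C1C.C1".split(".")
```

Here `fragments` is `["C1C", "C1"]`, and each fragment has closure `1` left open even though the full string is balanced. The disconnection sits inside a cycle, so the components cannot be found by text splitting alone.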

Discussion Section

Scope, limitations, observations. Maybe some more context.

benefits
- compact
  - examples with byte counts
    - compare with molfile
  - use in REPLs
    - Jupyter
    - hand-codable
- easy to learn
- designed for lossless (de)serialization
  - cf. InChI
- handles most of organic chemistry
tradeoffs
- non-representable entities
  - non-VB examples
    - organometallics, homotropylium cation, dative bonding
  - non-tetrahedral stereochemistry
    - e.g., TB, OH, lone-pair tetrahedral
  - conformational restriction other than (E)/(Z) double bonds
- PPBs are error-prone
compatibility
- SMILES itself is not well-defined
- gather public documentation on syntax/semantics (just the docs)
  - Daylight, OpenSMILES, CDK, OpenBabel, RDKit, Jchem, OE, etc.
- in case of conflict among tools, choose ease of implementation
- unsupported
  - some selections (e.g. [se])
  - the "aromatic" bond (:)
  - extreme charges (< -9, > 9)
  - arbitrary element symbols
extensions
- extension field
  - can be used to encode application-specific information as integer
    - limited range (0-9999)
    - can be used together with metadata
- versioning
  - breaks compatibility, but maybe metadata format
- metaformats - maybe
  - in-line vs out-of-line
  - leverage extension
- expanding range of selectable atoms
- additional configuration classes (OH, TB, etc)
- canonicalization
  - "preferred format"
  - atomic numbering
outlook
- detailed spec opens new paths
- reference implementation
  - to be reported
- validation suites
  - improve data quality by detecting syntax/semantics differences
- writing better implementations
  - reading/writing sections
- performance benchmarks
  - apples/apples comparisons using the same protocols
  - faster processing
- standardization efforts
  - more detailed, structured source material to draw from
  - select, rather than develop elements of standard
- better line notations
  - clearly-delineated scope and limitations
  - lots of room for improvement
  - may or may not happen through dialect extensions

Conformation Section

Describe what conformation means and how it's used. It will be helpful to develop a visual notation system, but it's not clear what it should be.

conformation
- restricted rotation about a bond
- limited to double bonds
partial parity bonds (PPB)
- values: Up; Down
- must be adjacent to a double bond
- must be paired on opposite sides of double bond
interpretation
1. view double bond from top of double bond plane
2. for each neighbor bond, decide position (Up/Down within plane) relative to its double bond terminal
3. invert parity if bond viewed in direction of high-to-low atom index
defined by at least two PPB's on opposite double bond terminals
examples
- trans-butene
- others...
error states
- there are many possible, and they must be reported
- "up" "up", "down" "down"
- only setting one side
  - propene
  - exception: adjacent to another PPB system
unrepresentable
- cyclooctatetraene

Selection performance considerations

This discussion, if present, probably goes in Writing.

The efficiency of decoding selected atoms is set by the Blossom algorithm. However, it's not clear what sets the upper bound on encoding selected atoms when the selection is based on something other than perfect matching. Examples:

Improve performance of aromaticity detection for large molecules

Writing Section

depth-first traversal over molecular graph
cycle closure
branch
examples

Please don't forget about metals

Nice idea - but please don't forget about metals. Your VB graph part will likely fail for most of the transition metals as well as some main group elements such as boron. Valence bond theory does not work for the highly delocalized bonding situations you often encounter there, such as 3c2e bonds in diborane B2H6, iron nitrosyl complexes, [Fe-S] clusters, ...
Also, limiting the hydrogen count to 0..9 will exclude some compounds, e.g. [Zr(BH4)4] as well as the Hf analogue, which have 12 hydrogens coordinated to the metal center, 3 from each of the four borohydrides (the 4th H points away from the metal due to its tetrahedral structure):
https://link.springer.com/content/pdf/10.1007/BF00962359.pdf
Same issue with implicit hydrogens: "a free nitrogen atom can bind three or five hydrogens" - and what about molecules as simple as [NH4]+???
Bond order in metal complexes can go up to 6 due to involvement of d orbitals:
https://en.wikipedia.org/wiki/Sextuple_bond
Want more input???

Atom index

The atomic index attribute needs its own discussion. This may require an entire section, but could probably be tucked into Constitution. Either way, Conformation and Configuration will reference it, and so the index attribute should be introduced before these sections.

Stereodescriptors only valid on tetracoordinate atoms.

Discussion of stereo descriptors should indicate that it is an error to assign one to anything other than a tetracoordinate atom.

Reading/Writing and Partial Parity Bonds

PPBs are complicated and error-prone. Both Reading and Writing sections should discuss the issues and offer ways to work around them.

One approach: an intermediate descriptor that replaces PPBs with dedicated syn/anti double bonds. Similar in concept to the @ notation proposed as an OpenSMILES extension.

FAIR

FAIR principles mesh well with the goals of complete language specifications. The paper below contains some points that should be worked into the introduction.

FAIR chemical structures in the Journal of Cheminformatics

Findability
Accessibility
Interoperability
Reusability

Syntax Section

LL(1) formal grammar is the centerpiece. It could be a challenge to strike the right balance between background info on that and describing the syntax itself.

UTF-8 string
- "dstrings" ?
encodes a depth-first traversal
- branches
- closures (C1CC1), "closure digit" = rnum (blech)
- disconnections (C.C)
  - allows multiple components
formal grammar
- LL(1)
- listing
  - changes from before
    - --, ++ charge disallowed
    - only @ and @@ configuration
    - closure, disconnection
data types
- Integer(n)
  - -n -> +n, inclusive
- PositiveInteger(n)
  - 0 -> n, inclusive
- Boolean
- AtomicSymbol
  - list published by IUPAC
- LowercaseSymbol
- Star
- Configuration
- None
atoms
- bare
  - "organic subset"
- bracketed [ ... ]
  - virtual hydrogens
    - default values
      - isotope: None
      - hcount: 0
      - charge: 0
      - configuration: none
      - extension: none
- lowercase
  - eligibility
  - may occur inside or outside brackets
bonds
- single -, double =, triple #, quadruple $
- elided
  - none ``, :
  - always two-electron
- up /, down \
  - two-electron
closures
- closure digits: 1..99
- balancing not easily expressible in syntax
  - it could be done, but with a lot of effort
    - explain
- re-use digits
- interaction with up/down bonds
- errors
  - loops (C11)
  - up/down mismatch
disconnection .
- "no bond"
  - not "zero order bond", but no bond
  - explicitly allows zob extension (link)
- to enable disconnected components
  - other uses
    - example
- may occur within cycle
  - examples
- may not occur immediately before ring junction
using the formal grammar
- reading
  - recursive descent parser
  - scanner-driven parser development
  - parser generator
  - parentheses vs. ring closure digits
- writing
  - no counterpart to scanner-driven parser development
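The closure notes in this outline (digit re-use, the `C11` loop error, and balancing that the grammar does not express) suggest a semantic pass along the lines of the sketch below. It is deliberately simplified: every letter is treated as one atom, closures are single digits, bond symbols and brackets are ignored, and the function name `closure_bonds` is hypothetical.

```python
def closure_bonds(text: str) -> list:
    """Pair closure digits and return bonds as (atom index, atom index)."""
    open_closures = {}  # digit -> index of the atom that opened it
    bonds = []
    current = -1        # index of the most recently read atom
    for position, ch in enumerate(text):
        if ch.isalpha():
            current += 1
        elif ch in "0123456789":
            if current < 0:
                raise ValueError(f"closure before any atom at {position}")
            if ch in open_closures:
                partner = open_closures.pop(ch)
                if partner == current:
                    raise ValueError(f"loop at index {position}")
                bonds.append((partner, current))
            else:
                open_closures[ch] = current
    if open_closures:
        raise ValueError("unbalanced closure digit")
    return bonds
```

Because a digit is removed from `open_closures` as soon as it is paired, the same digit can be re-used later in the string, as in `C1CC1C1CC1`.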

Configuration Section

Anything other than four-coordinate tetrahedral configuration is not supported.

It might be necessary to introduce atom index as a property in the Constitution section, for use here and in Conformation.

relative arrangement of neighbors in space
template-based system
- Clockwise or Counterclockwise
atomic property
model
- sight along edge with lowest index, toward central atom
  - determine the order of edge indexes
  - clockwise or counterclockwise
  - virtual hydrogen counts as edge with lowest index
limitations
- cannot represent sulfoxide configuration
operations
- swaps and their effect on parity
  - virtual hydrogen <-> neighbor hydrogen
  - two neighbors
examples
- cyclic, acyclic
- allene, cumulene
- include errors
  - degree != 4
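The "swaps and their effect on parity" item can be made concrete: each exchange of two neighbors inverts the descriptor, so only the parity of the reordering matters. The sketch below assumes the `@`/`@@` descriptors named in the Syntax outline; the function names and the use of integer labels for neighbors are hypothetical.

```python
def permutation_is_odd(old_order, new_order) -> bool:
    """Return True if new_order is an odd permutation of old_order."""
    positions = {neighbor: i for i, neighbor in enumerate(old_order)}
    sequence = [positions[neighbor] for neighbor in new_order]
    inversions = sum(
        1
        for i in range(len(sequence))
        for j in range(i + 1, len(sequence))
        if sequence[i] > sequence[j]
    )
    return inversions % 2 == 1


def reorder_descriptor(descriptor, old_order, new_order) -> str:
    """Return the descriptor that preserves configuration after reordering."""
    if descriptor not in ("@", "@@"):
        raise ValueError("unknown descriptor")
    if len(set(old_order)) != 4 or sorted(old_order) != sorted(new_order):
        raise ValueError("expected the same four distinct neighbors")
    if permutation_is_odd(old_order, new_order):
        return "@@" if descriptor == "@" else "@"
    return descriptor
```

A single swap flips the descriptor and two swaps restore it. The degree check mirrors the "degree != 4" error case in the outline.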

Reorganize Semantics

The section is not as clear as it could be because it lacks a focus on data structures. Address this by introducing the Atom and Bond data structures first, then discussing attributes and their interactions later. A table of attributes for Atom and Bond would be very handy for reference.

Sulfoxide stereochemistry

A suggestion:

C[S@](=O)Cl means C[S@"LP"](O)Cl - that is, the lone pair, oxygen and chlorine appear anticlockwise from the carbon.

It's a possibly useful idea, but only provided that all issues can be resolved. If they can't then Stereochemistry should explicitly disallow the interpretation.

Constitution should enumerate, define atom/bond properties

These properties and their definitions will be used in later sections.

Add pitfalls to Writing section

A collection of tricky cases and non-obvious conclusions about writing Dialect.

bond split by cut
- the bond itself
  - double-encoding is NOT required
    - can lead to bond compatibility errors that MUST be reported by reader
  - leave it out
    - doesn't matter which side
- configuration
  - bond index, not atom index, sets configuration
  - examples

Reading Section

overview
- transform string into data structures
- current character and next direct state transitions
- parent atom
  - attached through bond to child atom
  - branches and cuts mean that this connection may not be immediate
high-level state to be managed
- the current atom
- the current bond
- the current branch
- open cuts
- the order of attachment of bonds (especially with cuts)
parsing
- top-down manual
  - also manage current and next character
  - Scanner
- parser generator
  - code from grammar
    - EBNF grammar in SI
- tradeoffs
errors (the place to enumerate all possible reading errors?)
- assume all input invalid or even malicious
- syntax
  - unexpected character
  - unexpected end-of-line
- semantic
  - unbalanced cut
  - mismatched bonds in cut
  - no DS perfect matching
  - disallowed PPB
- reporting
  - zero-based start/end index of error
  - error kind/description
non-errors
- physically impossible isotopes, valences, or charges
parse, don't validate

Discuss fully selected pyrrole

This is a constant point of confusion in SMILES. Clarifying it deals with this specific concern and illustrates the purpose and function of the DS.

Start by interpreting the meaning of the string c1cccn1. Run through the pruning algorithm (#41), which leaves all atoms unpruned. Finish with the error resulting from the lack of a perfect matching.

This probably best follows the discussion on pruning (#41).

Pruning Section

An atom requires an unpaired electron to become a member of the DS. Because atoms without unpaired electrons are often added, a pruning procedure can be used.

the removal of atoms from the DS
supported in Dialect strings for backward-compatibility
- writers output SMILES with gratuitous selected atoms
deselect atoms that are incapable of forming double bonds
- algorithm
examples

Grammar should allow empty string

Many formats support the empty molecule, including V2000 and V3000 molfile. If Dialect does not, this could create problems.

The empty string could be supported with something as simple as:

<string>     ::= <sequence> | ε
<sequence>   ::= <atom> ( <union> | <branch> | <split> )*
<union>      ::= <bond>? ( <cut> | <sequence> )
<branch>     ::= "(" ( "." | <bond> )? <sequence> ")"
<split>      ::= "." <sequence>
...
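To show that the ε alternative fits naturally into a one-token-lookahead reader, here is a recursive-descent recognizer for the productions listed above. The elided rules are not filled in. Instead, the sketch uses loudly simplified stand-ins: `<atom>` is a single letter from `ATOMS`, `<bond>` is one character from `BONDS`, and `<cut>` is a single digit. The names `Recognizer` and `accepts` are hypothetical.

```python
ATOMS = set("BCNOPSFI")     # assumption: single-letter atoms only
BONDS = set("-=#/\\")       # assumption: stand-in for <bond>
DIGITS = set("0123456789")  # assumption: <cut> is one digit


class Recognizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def fail(self, expected):
        raise SyntaxError(f"expected {expected} at index {self.pos}")

    def string(self):
        # <string> ::= <sequence> | ε
        if self.peek() is not None:
            self.sequence()
        if self.peek() is not None:
            self.fail("end of input")

    def sequence(self):
        # <sequence> ::= <atom> ( <union> | <branch> | <split> )*
        if self.peek() not in ATOMS:
            self.fail("atom")
        self.pos += 1
        while self.peek() is not None and self.peek() != ")":
            if self.peek() == "(":
                self.branch()
            elif self.peek() == ".":
                self.pos += 1  # <split> ::= "." <sequence>
                self.sequence()
            else:
                self.union()

    def union(self):
        # <union> ::= <bond>? ( <cut> | <sequence> )
        if self.peek() in BONDS:
            self.pos += 1
        if self.peek() in DIGITS:
            self.pos += 1
        else:
            self.sequence()

    def branch(self):
        # <branch> ::= "(" ( "." | <bond> )? <sequence> ")"
        self.pos += 1
        if self.peek() == "." or self.peek() in BONDS:
            self.pos += 1
        self.sequence()
        if self.peek() != ")":
            self.fail('")"')
        self.pos += 1


def accepts(text):
    try:
        Recognizer(text).string()
        return True
    except SyntaxError:
        return False
```

The recognizer only checks syntax. Closure balancing and bond compatibility are semantic checks that would run afterward.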

Remove notes file

The idea was always to move these to issues. It makes little sense to clutter up the repo with random pieces of information.

Pruning section should present necessary conditions

An atom must be pruned if its subvalence is zero. If the atom's charge is non-zero, the default valences for the isoelectronic element are used to compute subvalence. If there are no such target valences (e.g., [c+2]), an error must be generated. Writers must not write such atoms.
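The isoelectronic lookup described here can be sketched as follows. This reads "isoelectronic element" as the element whose atomic number equals the atom's atomic number minus its charge, which is an interpretation rather than a quote. The valence entries for B and O are assumed conventional values, the C and N entries come from the Implicit Hydrogens notes, and all names are hypothetical. Subvalence itself is left out because its definition is not given in this issue.

```python
ATOMIC_NUMBERS = {"B": 5, "C": 6, "N": 7, "O": 8}

# Default valences keyed by atomic number (B and O values are assumed).
DEFAULT_VALENCES = {5: (3,), 6: (4,), 7: (3, 5), 8: (2,)}


def default_valences(element: str, charge: int) -> tuple:
    """Return the default valences to use for a possibly charged atom."""
    # A charged atom borrows the valences of its isoelectronic element.
    isoelectronic = ATOMIC_NUMBERS[element] - charge
    try:
        return DEFAULT_VALENCES[isoelectronic]
    except KeyError:
        raise ValueError(
            f"no default valences for {element} with charge {charge:+d}"
        ) from None
```

With this table, a nitrogen cation borrows carbon's valence of four, while a carbon atom with charge +2 maps to atomic number 4, which has no entry and therefore produces the required error.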

Introduce subvalence

The Pruning section (#10) needs to reference the algorithm for subvalence. Unfortunately, that is currently equated to the virtual hydrogen algorithm. This makes it very hard to talk about pruning and the full implicit hydrogen algorithm.

This can be resolved with the following changes:

move Delocalization Subgraph after Constitution
Rename Implicit Hydrogens to Subvalence
make appropriate changes to Subvalence. This will probably mean saving some material and creating new material.
Add Implicit Hydrogens immediately after Subvalence. This section should present the full algorithm for implicit hydrogen calculation, accounting for atom selection.

Build system for Railroad diagrams

RR diagrams should be built from source by build.sh script. Later they can be linked into the document.

Semantics of stereocenters with undefined configuration

How should a stereocenter with undefined configuration be interpreted?

No SMILES or OpenSMILES documentation explains this point. If it is not addressed, implementations will need to invent their own rules, which can cause data loss.

Options:

1. Either. One or the other descriptor is present, but which one is unknown.
2. Mixture. A mixture of configurations of unknown ratio is present.
3. (1) or (2)

For its part, V2000 uses this interpretation: "It could be either of two stereoisomers, or a mixture of the two." In other words, (3).

Given the large number of V2K<->SMILES conversions being performed, and given no clear advantage to any option, (3) makes the most sense.

Bonds added to DS

The ms states that bonds must be elided to be inducted into the DS. It also states that the bond order must be one or two.

Following the policy of greatest restriction, the ms should be unified to omit any non-elided bond from the DS.

Reorganize Syntax

The current organization is confusing because it's organized solely around syntax with little consideration of semantics. It should bring in more semantics and organize around major data structure components.

Subheadings could address this:

Atom
- bracket
- shortcut
- selected shortcut
- star
Bond
- types
- overview of using bonds
Chains, Branches, and Cycles
- how each of item types affect the way in which parent connects to child

Reversed Partial Parity Bond

Conformation describes a convention for working with partial parity bonds (PPBs). It assumes that bonds are encoded left-to-right. In other words, target always succeeds source.

This works under the assumption that the graph has no cycles. If one exists, then the edge that bites back can be forward (target succeeds), reverse (target precedes), or both. This case leads to incorrect selection of neighbor quadrant.

Whatever solution is chosen must account for the apparent dual state of the central PPB in hexadiene (C/C=C/C=C/C).

Options:

reverse assignment in the event that target precedes source
ignore target-precedes PPBs
- but it may be the only one carrying state

Virtual hydrogens attribute allowed values

The hydrogens attribute, if present, may assume integer values ranging from zero to nine.

The grammar says 1-9. Change the above statement to match the grammar.

Additional SMILES Dialects

SMILES dialects not currently covered by the introduction go here.

Jmol SMILES and Jmol SMARTS: specifications and applications
- superset, so doesn't address any of the problems in OpenSMILES
- oligonucleotide with 100 BP
- proposal: C%(123)
