dialect's People

Contributors

rapodaca

dialect's Issues

Table of Bond attributes

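| Attribute Name | Description | Type |
| --- | --- | --- |
| kind | The kind of bond. | Enumeration |
| target | The index of the target atom. | Integer |

| Bond Kind | Bond Order | State |
| --- | --- | --- |
| Elided | 1 | None |
| Single | 1 | None |
| Double | 2 | None |
| Triple | 3 | None |
| Up | 1 | Up |
| Down | 1 | Down |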
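
The two tables can be mirrored directly in code. The following is a minimal Python sketch, not part of the manuscript; the names `BondKind`, `ORDER_AND_STATE`, and `Bond` are hypothetical. Kinds are kept distinct from their (order, state) pairs because Elided and Single share the same pair.

```python
from dataclasses import dataclass
from enum import Enum


class BondKind(Enum):
    """The six bond kinds listed in the table (names are illustrative)."""
    ELIDED = "elided"
    SINGLE = "single"
    DOUBLE = "double"
    TRIPLE = "triple"
    UP = "up"
    DOWN = "down"


# (bond order, state) for each kind, copied from the table.
ORDER_AND_STATE = {
    BondKind.ELIDED: (1, None),
    BondKind.SINGLE: (1, None),
    BondKind.DOUBLE: (2, None),
    BondKind.TRIPLE: (3, None),
    BondKind.UP: (1, "Up"),
    BondKind.DOWN: (1, "Down"),
}


@dataclass(frozen=True)
class Bond:
    kind: BondKind
    target: int  # index of the target atom

    @property
    def order(self):
        return ORDER_AND_STATE[self.kind][0]

    @property
    def state(self):
        return ORDER_AND_STATE[self.kind][1]
```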

Add examples to Syntax section

The section currently lacks examples. It would be very helpful to include some to illustrate both acceptable and unacceptable encoding.

Table of Atom attributes

A table of Atom attributes should be present. It should use exactly the same terms used elsewhere in the ms.

| Attribute Name | Description | Type | Nullable | Meaning if null | Constraints | Default Value |
| --- | --- | --- | --- | --- | --- | --- |
| index | A unique identifier | unsigned integer | no | N/A | - | N/A |
| isotope | The sum of proton and neutron counts. | unsigned integer | yes | Natural abundance | 0 <= value < 1000 | null |
| element | The atom's element. | enumeration | yes | unknown element | An IUPAC-approved element symbol. | null |
| virtual_hydrogens | The number of virtual hydrogens | unsigned integer | yes | zero virtual hydrogens | 0 <= value < 10 | 0 |
| algorithmic_hydrogens | If true hydrogens are counted algorithmically. | boolean | no | - | - | true |
| configuration | Configurational descriptor | enumeration | yes | No configuration | tetracoordinate | null |
| extension | Application-specific data | integer | yes | No extension | 0 <= value < 1000 | null |
| selected | Whether the atom is selected. | boolean | no | - | element must be one of C, N, O, P, or S. | - |

... and so on.
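
To show how such a table could translate into a typesafe data structure, here is a rough Python sketch. It is not taken from the ms: the class name `Atom`, the `SELECTABLE` set, and the `False` default for `selected` are assumptions (the table gives no default for that attribute), and element symbols are not checked against the IUPAC list.

```python
from dataclasses import dataclass
from typing import Optional

SELECTABLE = {"C", "N", "O", "P", "S"}  # from the "selected" constraint


@dataclass
class Atom:
    index: int
    isotope: Optional[int] = None        # null means natural abundance
    element: Optional[str] = None        # null means unknown element
    virtual_hydrogens: int = 0
    algorithmic_hydrogens: bool = True
    configuration: Optional[str] = None  # null means no configuration
    extension: Optional[int] = None      # null means no extension
    selected: bool = False               # assumption: default not in table

    def __post_init__(self):
        # Enforce the range constraints listed in the table.
        if self.index < 0:
            raise ValueError("index must be unsigned")
        if self.isotope is not None and not 0 <= self.isotope < 1000:
            raise ValueError("isotope out of range")
        if not 0 <= self.virtual_hydrogens < 10:
            raise ValueError("virtual_hydrogens out of range")
        if self.extension is not None and not 0 <= self.extension < 1000:
            raise ValueError("extension out of range")
        if self.selected and self.element not in SELECTABLE:
            raise ValueError("element is not eligible for selection")
```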

Remove At and Ts from shortcuts

"At" was excluded from the "organic subset." Dropping At and Ts aligns Dialect with the original intent.

Adding elements past Lr to <element> breaks backward-compatibility, but is more in line with expectation than new elements in the organic subset.

Among other corrections, this means removing At and Ts from the valence table.

Also:

The next atomic production rule, <shortcut> is a non-terminal comprised of the terminal element symbols: "B"; "C"; "N"; "O"; "P"; "S"; "F"; "Cl"; "Br"; "I"; "At"; "Ts"; "P"; and "S".

Delocalization Subgraph Section

  • VB model makes graphs unequal even though they should be treated as equal
    • DIME
  • the DS solution
    • encodes alternating single-double bond pattern as a perfect matching
      • maximum, maximal, perfect matching
  • a node-induced subgraph over VB graph
    • unlabeled
    • possibly empty
    • encodes and therefore guarantees a perfect matching
      • corollary: every atom will be assigned a double bond
    • only node membership is specified
      • edge membership is deduced
  • composition
    • atoms
      • C, N, O, P, S, unknown
    • selection
      • adding an atom to the DS
      • "selected atom", "selected atoms"
      • "induced bond", "induced bonds"
  • bonds
    • by definition (node-induced) both terminals are members
  • literature refers to "aromaticity"
    • not using that term because of its (imprecise) meaning in chemistry
  • algorithmic fill/empty
    • every molecule expressed with a DS can also be expressed without it
    • in other words, 1:1 translation
    • fill algorithm
    • empty algorithm
      • aka "kekulization"
  • examples
    • invent a notation for node/edge membership (color?, weight?)
    • graphics only
  • error states
    • no perfect matching possible
    • to re-iterate: DS is just another way to express single/double bond layout
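
The "no perfect matching possible" error state can be illustrated with a small search over the node-induced subgraph. This is only a sketch under stated assumptions: the function names are hypothetical, atoms are plain integer indexes, and the backtracking search is exponential in the worst case. A real implementation would more likely use a polynomial-time matching algorithm.

```python
def induced_edges(selected, bonds):
    """Bonds of the node-induced subgraph: both terminals are selected."""
    return [(a, b) for a, b in bonds if a in selected and b in selected]


def perfect_matching(selected, bonds):
    """Return matched atom pairs covering every selected atom, or None."""
    neighbors = {atom: set() for atom in selected}
    for a, b in induced_edges(selected, bonds):
        neighbors[a].add(b)
        neighbors[b].add(a)

    def solve(unmatched):
        if not unmatched:
            return []
        atom = min(unmatched)
        # Try pairing the lowest unmatched atom with each free neighbor.
        for partner in sorted(neighbors[atom] & unmatched):
            rest = solve(unmatched - {atom, partner})
            if rest is not None:
                return [(atom, partner)] + rest
        return None  # this atom cannot be assigned a double bond

    return solve(frozenset(selected))
```

For a fully selected six-membered ring the search finds three alternating pairs. For a fully selected five-membered ring it returns `None`, which corresponds to the error state above. An empty DS trivially succeeds.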

Directed bonds

Partial parity bonds mean that the dialect molecular graph is directed. Make sure that all statements about graph type are consistent with this point. For example, "Third, edges are undirected." It would be better to break the directed/undirected discussion out into a separate paragraph, noting that the reason for directed bonds will become clear.

Subset of SMILES

The ms makes several references to "dialect" in the linguistic sense, and this is of course the code name for the language. But the goal is not to make yet another dialect. The goal is to fully define, for the first time, a language that functions as a subset of SMILES-as-practiced. No extensions. No pet nice-to-haves. Just a subset, to the extent that's possible without internal inconsistencies.

Parts to improve:

  1. Title
  2. Line 37. The introduction leads to this point, so this paragraph must concisely spell out the aim. It doesn't quite hit the mark. It uses the "dialect" idea, for one.
  3. Discussion section. Line 497 brings in SMILES. This is a good place to re-introduce the previous two papers on SMILES. Compare what's in them (and not in them) with Dialect.

Bond elision

Need a paragraph or section talking about elided bonds. Bond elision is mentioned by the Delocalization Subgraph section.

No hex digits in `<extension>`

The current support of hex digits breaks compatibility with most SMILES implementations for a minimal, uncertain payoff at best. An <extension> is up to four decimal digits, interpreted as a single integer, with leading zeros allowed.
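
As a quick illustration of the proposed rule, a reader could accept an extension like this. The sketch is hypothetical (the function name `parse_extension` is not from the ms) and only covers the digits themselves, not the surrounding atom syntax.

```python
import re

# One to four ASCII decimal digits; leading zeros are allowed.
EXTENSION = re.compile(r"[0-9]{1,4}")


def parse_extension(text):
    """Interpret the digits of an <extension> as a single integer."""
    if not EXTENSION.fullmatch(text):
        raise ValueError(f"invalid extension: {text!r}")
    return int(text)
```

With this rule, `0042` reads as the integer 42, while `1f` and `12345` are rejected.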

Narrow the set of selectable atoms

Default valences are required to make an element eligible for selection. In other words, only atoms of elements with default valences can be selected.

Otherwise, pruning is impossible because free hydrogen count can not be determined. You effectively make up your own valence model. This is the main reason to exclude things like [te].

Note that for the same reason [as] and [se] should not be valid, either.

Reference Implementation Section

Those services needed by a reference implementation. One may or may not be ready in time for publication. Purr in its current form will definitely need revision to be compatible with Dialect. The extent isn't yet clear, though.

  • high-level description
    • parser
    • typesafe data structures
    • errors
      • syntactical
      • semantic
      • cursor to error
        • not always possible
          • e.g., no perfect matching
  • subsequent publication

Implicit Hydrogens Section

Previously: "Hydrogen Suppression"

Comprehensively describe, modulo selection (aka "aromaticity"), the rules for computing implicit hydrogen count.

  • virtual hydrogens
    • described in previous section
    • unambiguous, but verbose
  • implicit hydrogens
    • replace attribute with computed value
    • computation without corruption requires a detailed protocol
  • target valence
    • the number of hydrogens the free atom can bind to
      • one or more values, depending on atom
  • target valence table
    • star (*) always has target = 0
    • defines target valences for atoms supporting implicit hydrogen
    • for example
      • a free carbon atom can bind four hydrogens
      • a free nitrogen atom can bind three or five hydrogens
  • atoms with elements not on the table are not eligible
    • they must use virtual or graph hydrogens
  • works with graph hydrogens
    • hydrogen acts like any other atom
  • example computations
    • simple cases
      • C
      • N
      • CC
      • N(C)(C)(C)(C)
    • weird/counterintuitive cases
  • either implicit hydrogens or virtual hydrogens, never both
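
The example computations could be sketched as follows. This is an illustration, not the protocol itself. It assumes the common rule "use the smallest target valence that is at least the sum of bond orders, and zero hydrogens if none qualifies", and it only includes the table entries mentioned above (star, carbon, nitrogen). The names `TARGET_VALENCES` and `implicit_hydrogens` are hypothetical.

```python
# Target valences named in the outline above; a full table would have more.
TARGET_VALENCES = {"*": (0,), "C": (4,), "N": (3, 5)}


def implicit_hydrogens(element, bond_order_sum):
    """Implicit hydrogen count for an atom with the given sum of bond orders."""
    targets = TARGET_VALENCES.get(element)
    if targets is None:
        # Not on the table: must use virtual or graph hydrogens instead.
        raise ValueError(f"{element} does not support implicit hydrogens")
    for target in targets:  # targets are listed in ascending order
        if target >= bond_order_sum:
            return target - bond_order_sum
    return 0  # assumption: valence beyond every target means no hydrogens
```

Under these assumptions, C gets four hydrogens, N gets three, and each carbon in CC gets three. The nitrogen in N(C)(C)(C)(C) has a bond order sum of four, so the next target valence (five) applies and it gets one implicit hydrogen.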

Empty molecular graph

Grammar supports the empty string, so the molecular graph can be empty. At least one statement is inconsistent:

"A Dialect molecule consists of a graph with at least one node and zero or more edges."

All such occurrences should be corrected to allow empty graphs.

Remove quadruple bond

Extremely rare. Should it appear, it's very likely SMILES can't accurately serialize it. Remove quad bond from grammar and ms.

Conclusion Section

Address the idea that "there isn't anything new here."

  • a high and low level description of a SMILES dialect
  • supports most of SMILES as currently practiced
  • formalization > invention
  • the foundation for a reference implementation

Polymer molecules

One of the most interesting incomplete features in Daylight SMILES/OpenSMILES, to me, is the polymer extension. Currently, SMILES are mostly used for nonpolymer molecules, but this ignores a large set of chemical compounds.

I have explained the problem in timvdm/OpenSMILES#8 and IUPAC/IUPAC_SMILES_plus#9.

Thanks for an interesting and really much needed effort!

Add pitfalls to Reading section

A collection of tricky cases and non-obvious conclusions around reading Dialect.

  • C1C.C1 is allowed
    • simple text processing (e.g. regex on . will fail generally)
  • C/C, C\C are not valid
    • one terminal of a directional bond must be a double bond terminal
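
The first pitfall is easy to demonstrate. The sketch below assumes a simplified input with no bracket atoms and single-digit closures only; `unbalanced_closure_digits` is a hypothetical helper, not part of any reader.

```python
import re


def unbalanced_closure_digits(fragment):
    """Closure digits that occur an odd number of times in the fragment."""
    counts = {}
    for digit in re.findall(r"[0-9]", fragment):
        counts[digit] = counts.get(digit, 0) + 1
    return sorted(digit for digit, count in counts.items() if count % 2)


# Naive splitting on "." separates the two halves of closure 1.
fragments = "C1C.C1".split(".")
for fragment in fragments:
    print(fragment, unbalanced_closure_digits(fragment))
```

Each fragment ends up with a dangling closure digit even though the whole string is balanced, so components can only be determined after the closures have been resolved.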

Discussion Section

Scope, limitations, observations. Maybe some more context.

  • benefits
    • compact
      • examples with byte counts
        • compare with molfile
      • use in REPLs
        • Jupyter
        • hand-codable
    • easy to learn
    • designed for lossless (de)serialization
      • cf. InChI
    • handles most of organic chemistry
  • tradeoffs
    • non-representable entities
      • non-VB examples
        • organometallics, homotropylium cation, dative bonding
      • non-tetrahedral stereochemistry
        • e.g., TB, OH, lone-pair tetrahedral
      • conformational restriction other than (E)/(Z) double bonds
    • PPBs are error-prone
  • compatibility
    • SMILES itself is not well-defined
    • gather public documentation on syntax/semantics (just the docs)
      • Daylight, OpenSMILES, CDK, OpenBabel, RDKit, Jchem, OE, etc.
    • in case of conflict among tools, choose ease of implementation
    • unsupported
      • some selections (e.g. [se])
      • the "aromatic" bond (:)
      • extreme charges (< -9, > 9)
      • arbitrary element symbols
  • extensions
    • extension field
      • can be used to encode application-specific information as integer
        • limited range (0-9999)
        • can be used together with metadata
    • versioning
      • breaks compatibility, but maybe metadata format
    • metaformats - maybe
      • in-line vs out-of-line
      • leverage extension
    • expanding range of selectable atoms
    • additional configuration classes (OH, TB, etc)
    • canonicalization
      • "preferred format"
      • atomic numbering
  • outlook
    • detailed spec opens new paths
    • reference implementation
      • to be reported
    • validation suites
      • improve data quality by detecting syntax/semantics differences
    • writing better implementations
      • reading/writing sections
    • performance benchmarks
      • apples/apples comparisons using the same protocols
      • faster processing
    • standardization efforts
      • more detailed, structured source material to draw from
      • select, rather than develop elements of standard
    • better line notations
      • clearly-delineated scope and limitations
      • lots of room for improvement
      • may or may not happen through dialect extensions

Conformation Section

Describe what conformation means and how it's used. It will be helpful to develop a visual notation system, but it's not clear what it should be.

  • conformation
    • restricted rotation about a bond
    • limited to double bonds
  • partial parity bonds (PPB)
    • values: Up; Down
    • must be adjacent to a double bond
    • must be paired on opposite sides of double bond
  • interpretation
    1. view double bond from top of double bond plane
    2. for each neighbor bond, decide position (Up/Down within plane) relative to its double bond terminal
    3. invert parity if bond viewed in direction of high-to-low atom index
  • defined by at least two PPBs on opposite double bond terminals
  • examples
    • trans-butene
    • others...
  • error states
    • there are many possible, and they must be reported
    • "up" "up", "down" "down"
    • only setting one side
      • propene
      • exception: adjacent to another PPB system
  • unrepresentable
    • cyclooctatetraene

Writing Section

  • depth-first traversal over molecular graph
  • cycle closure
  • branch
  • examples
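
A depth-first writer could be sketched roughly as below. It is deliberately limited and makes several assumptions that are not from the ms: atoms are bare symbols, every bond is elided (single), there are at most nine closures, and digits are never re-used. The function name `write` and its input shape are hypothetical.

```python
def write(symbols, bonds):
    """Serialize bare atoms and single bonds (index pairs) as a string."""
    adjacency = [[] for _ in symbols]
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    visited = set()
    children = [[] for _ in symbols]  # spanning-tree children per atom
    digits = [[] for _ in symbols]    # closure digits per atom
    cuts = set()                      # bonds written as cycle closures

    def explore(atom, parent):
        visited.add(atom)
        for neighbor in adjacency[atom]:
            if neighbor == parent:
                continue
            bond = frozenset((atom, neighbor))
            if neighbor not in visited:
                children[atom].append(neighbor)
                explore(neighbor, atom)
            elif bond not in cuts:
                # Bond back to an already-visited atom: write as a closure.
                cuts.add(bond)
                if len(cuts) > 9:
                    raise ValueError("sketch supports at most nine closures")
                digits[atom].append(len(cuts))
                digits[neighbor].append(len(cuts))

    def emit(atom):
        out = symbols[atom] + "".join(str(d) for d in digits[atom])
        for child in children[atom][:-1]:
            out += "(" + emit(child) + ")"  # all but the last child branch
        if children[atom]:
            out += emit(children[atom][-1])
        return out

    parts = []
    for root in range(len(symbols)):
        if root not in visited:
            explore(root, None)
            parts.append(emit(root))
    return ".".join(parts)  # one component per traversal root
```

For example, a three-membered carbon ring is written as `C1CC1` and a carbon with three carbon neighbors as `CC(C)C`.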

Please don't forget about metals

Nice idea, but please don't forget about metals. Your VB graph part will likely fail for most of the transition metals as well as some main group elements such as boron. Valence bond theory does not work for the highly delocalized bonding situations you often encounter there, such as 3c-2e bonds in diborane B2H6, iron nitrosyl complexes, [Fe-S] clusters, ...
Also, limiting the hydrogen count to 0..9 will exclude some compounds, e.g. [Zr(BH4)4] as well as the Hf analogue, which have 12 hydrogens coordinated to the metal center, 3 from each of the four borohydrides (the 4th H points away from the metal due to its tetrahedral structure):
https://link.springer.com/content/pdf/10.1007/BF00962359.pdf
Same issue with implicit hydrogens: "a free nitrogen atom can bind three or five hydrogens" - and what about molecules as simple as [NH4]+???
Bond order in metal complexes can go up to 6 due to involvement of d orbitals:
https://en.wikipedia.org/wiki/Sextuple_bond
Want more input???

Atom index

The atomic index attribute needs its own discussion. This may require an entire section, but could probably be tucked into Constitution. Either way, Conformation and Configuration will reference it, and so the index attribute should be introduced before these sections.

Reading/Writing and Partial Parity Bonds

PPBs are complicated and error-prone. Both Reading and Writing sections should discuss the issues and offer ways to work around them.

One approach: an intermediate descriptor that replaces PPBs with dedicated syn/anti double bonds. Similar in concept to the @ notation proposed as an OpenSMILES extension.

Syntax Section

LL(1) formal grammar is the centerpiece. It could be a challenge to strike the right balance between background info on that and describing the syntax itself.

  • UTF-8 string
    • "dstrings" ?
  • encodes a depth-first traversal
    • branches
    • closures (C1CC1), "closure digit" = rnum (blech)
    • disconnections (C.C)
      • allows multiple components
  • formal grammar
    • LL(1)
    • listing
      • changes from before
        • --, ++ charge disallowed
        • only @ and @@ configuration
        • closure, disconnection
  • data types
    • Integer(n)
      • -n -> +n, inclusive
    • PositiveInteger(n)
      • 0 -> n, inclusive
    • Boolean
    • AtomicSymbol
      • list published by IUPAC
    • LowercaseSymbol
    • Star
    • Configuration
    • None
  • atoms
    • bare
      • "organic subset"
    • bracketed [ ... ]
      • virtual hydrogens
        • default values
          • isotope: None
          • hcount: 0
          • charge: 0
          • configuration: none
          • extension: none
    • lowercase
      • eligibility
      • may occur inside or outside brackets
  • bonds
    • single -, double =, triple #, quadruple $
    • elided
      • none ``, :
      • always two-electron
    • up /, down \
      • two-electron
  • closures
    • closure digits: 1..99
    • balancing not easily expressible in syntax
      • it could be done, but with a lot of effort
        • explain
    • re-use digits
    • interaction with up/down bonds
    • errors
      • loops (C11)
      • up/down mismatch
  • disconnection .
    • "no bond"
      • not "zero order bond", but no bond
      • explicitly allows zob extension (link)
    • to enable disconnected components
      • other uses
        • example
    • may occur within cycle
      • examples
    • may not occur immediately before ring junction
  • using the formal grammar
    • reading
      • recursive descent parser
      • scanner-driven parser development
      • parser generator
      • parentheses vs. ring closure digits
    • writing
      • no counterpart to scanner-driven parser development
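
Because closure balancing is hard to express in the grammar, it is usually handled as reader state. The sketch below illustrates digit re-use and the loop and unbalanced errors from the closures outline. It assumes a heavily simplified input: one letter per atom, single-digit closures, no brackets, and no bond symbols. The name `pair_closures` is hypothetical.

```python
def pair_closures(text):
    """Return (opening atom, closing atom) index pairs for each closure."""
    open_closures = {}  # digit -> index of the atom that opened it
    pairs = []
    atom = -1           # index of the most recently read atom
    for position, char in enumerate(text):
        if char in "0123456789":
            if atom < 0:
                raise ValueError(f"closure before first atom at {position}")
            if char in open_closures:
                opener = open_closures.pop(char)  # digit is free for re-use
                if opener == atom:
                    raise ValueError(f"loop at {position}")
                pairs.append((opener, atom))
            else:
                open_closures[char] = atom
        elif char.isalpha():
            atom += 1
        elif char not in "().":
            raise ValueError(f"unexpected character at {position}")
    if open_closures:
        raise ValueError("unbalanced closure")
    return pairs
```

Here `C1CC1C1CC1` re-uses digit 1 for two separate cycles, `C11` is reported as a loop, and `C1CC` is reported as unbalanced.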

Configuration Section

Anything other than four-coordinate tetrahedral configuration is not supported.

It might be necessary to introduce atom index as a property in the Constitution section, for use here and in Conformation.

  • relative arrangement of neighbors in space
  • template-based system
    • Clockwise or Counterclockwise
  • atomic property
  • model
    • sight along edge with lowest index, toward central atom
      • determine the order of edge indexes
      • clockwise or counterclockwise
      • virtual hydrogen counts as edge with lowest index
  • limitations
    • can not represent sulfoxide configuration
  • operations
    • swaps and their effect on parity
      • virtual hydrogen <-> neighbor hydrogen
      • two neighbors
  • examples
    • cyclic, acyclic
    • allene, cumulene
    • include errors
      • degree != 4
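
The effect of swaps on parity can be sketched with a permutation check: exchanging two neighbors inverts the descriptor, so an odd number of swaps flips it and an even number leaves it unchanged. This is an illustrative sketch only; the function names and the plain string descriptors are assumptions.

```python
def is_odd_permutation(original, reordered):
    """True if reordering the neighbors takes an odd number of swaps."""
    position = {neighbor: i for i, neighbor in enumerate(original)}
    sequence = [position[neighbor] for neighbor in reordered]
    inversions = sum(
        1
        for i in range(len(sequence))
        for j in range(i + 1, len(sequence))
        if sequence[i] > sequence[j]
    )
    return inversions % 2 == 1


def reorder_descriptor(descriptor, original, reordered):
    """Descriptor that preserves the configuration under the new order."""
    if is_odd_permutation(original, reordered):
        if descriptor == "Clockwise":
            return "Counterclockwise"
        return "Clockwise"
    return descriptor
```

Swapping two neighbors, as in `[1, 2, 3, 4]` to `[2, 1, 3, 4]`, flips the descriptor. Two swaps, or a rotation of three neighbors, leave it unchanged.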

Reorganize Semantics

The section is not as clear as it could be because it lacks a focus on data structures. Address this by introducing the Atom and Bond data structures first, then discussing attributes and their interactions later. A table of attributes for Atom and Bond would be very handy for reference.

Sulfoxide stereochemistry

A suggestion:

C[S@](=O)Cl means C[S@"LP"](O)Cl - that is, the lone pair, oxygen and chlorine appear anticlockwise from the carbon.

It's a possibly useful idea, but only provided that all issues can be resolved. If they can't then Stereochemistry should explicitly disallow the interpretation.

Add pitfalls to Writing section

A collection of tricky cases and non-obvious conclusions about writing Dialect.

  • bond split by cut
    • the bond itself
      • double-encoding is NOT required
        • can lead to bond compatibility errors that MUST be reported by reader
      • leave it out
        • doesn't matter which side
    • configuration
      • bond index, not atom index, sets configuration
      • examples
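
On the reader side, the bond compatibility check for a cut could look like the sketch below. It assumes each side of the cut carries either a bond symbol or nothing, and it does not handle Up/Down bonds, whose interaction with cuts needs separate treatment. The name `reconcile_cut_bond` is hypothetical.

```python
def reconcile_cut_bond(opening, closing):
    """Combine the bond symbols written on each side of a cut.

    Each argument is a bond symbol such as "=" or None if left out.
    Returns the symbol to use, or None if both sides were left out.
    """
    if opening is None:
        return closing
    if closing is None or opening == closing:
        return opening
    raise ValueError(f"incompatible bonds at cut: {opening!r} vs {closing!r}")
```

With this check, `C=1CCCCC1`, `C1CCCCC=1`, and `C=1CCCCC=1` all yield a double bond across the cut, while `C=1CCCCC#1` is an error.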

Reading Section

  • overview
    • transform string into data structures
    • current character and next direct state transitions
    • parent atom
      • attached through bond to child atom
      • branches and cuts mean that this connection may not be immediate
  • high-level state to be managed
    • the current atom
    • the current bond
    • the current branch
    • open cuts
    • the order of attachment of bonds (especially with cuts)
  • parsing
    • top-down manual
      • also manage current and next character
      • Scanner
    • parser generator
      • code from grammar
        • EBNF grammar in SI
    • tradeoffs
  • errors (the place to enumerate all possible reading errors?)
    • assume all input invalid or even malicious
    • syntax
      • unexpected character
      • unexpected end-of-line
    • semantic
      • unbalanced cut
      • mismatched bonds in cut
      • no DS perfect matching
      • disallowed PPB
    • reporting
      • zero-based start/end index of error
      • error kind/description
  • non-errors
    • physically impossible isotopes, valences, or charges
  • parse, don't validate
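
Error reporting with a kind and a zero-based range might be sketched like this. The sketch covers only the two syntax errors, uses a tiny stand-in alphabet rather than the real grammar, and assumes a half-open range (start inclusive, end exclusive). `ReadError` and `scan` are hypothetical names.

```python
class ReadError(Exception):
    """A reading error with a kind and a zero-based, half-open range."""

    def __init__(self, kind, start, end):
        super().__init__(f"{kind} at {start}..{end}")
        self.kind = kind
        self.start = start
        self.end = end


ALPHABET = set("CNO()=#-.0123456789")  # stand-in, not the full grammar
DANGLING = set("(=#-.")                # characters that need a successor


def scan(text):
    """Return the characters of text, or raise ReadError."""
    for index, char in enumerate(text):
        if char not in ALPHABET:
            raise ReadError("unexpected character", index, index + 1)
    if text and text[-1] in DANGLING:
        raise ReadError("unexpected end-of-line", len(text), len(text))
    return list(text)
```

A caller can use `start` and `end` to place a cursor under the offending input. Semantic errors such as a missing perfect matching may have no single position to point at.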

Discuss fully selected pyrrole

This is a constant point of confusion in SMILES. Clarifying it deals with this specific concern and illustrates the purpose and function of the DS.

Start by interpreting the meaning of the string c1cccn1. Run through the pruning algorithm (#41), which leaves all atoms unpruned. Finish with the error resulting from the lack of a perfect matching.

This probably best follows the discussion on pruning (#41).

Pruning Section

An atom requires an unpaired electron to become a member of the DS. Because atoms without unpaired electrons are often added, a pruning procedure can be used.

  • the removal of atoms from the DS
  • supported in Dialect strings for backward-compatibility
    • writers output SMILES with gratuitous selected atoms
  • deselect atoms that are incapable of forming double bonds
    • algorithm
  • examples

Grammar should allow empty string

Many formats support the empty molecule, including V2000 and V3000 molfile. If Dialect does not, this could create problems.

The empty string could be supported with something as simple as:

<string>     ::= <sequence> | ε
<sequence>   ::= <atom> ( <union> | <branch> | <split> )*
<union>      ::= <bond>? ( <cut> | <sequence> )
<branch>     ::= "(" ( "." | <bond> )? <sequence> ")"
<split>      ::= "." <sequence>
...
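
To check that the ε alternative behaves as intended, the five productions above can be turned into a small recursive descent recognizer. The elided productions are not reproduced here. As placeholders only, this sketch assumes `<atom>` is one of C, N, O, `<bond>` is one of -, =, #, and `<cut>` is a single digit. The class and function names are hypothetical.

```python
ATOMS = set("CNO")         # placeholder for <atom>
BONDS = set("-=#")         # placeholder for <bond>
CUTS = set("0123456789")   # placeholder for <cut>


class Recognizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def take(self, allowed):
        char = self.peek()
        if char is not None and char in allowed:
            self.pos += 1
            return True
        return False

    def string(self):  # <string> ::= <sequence> | ε
        if self.peek() is None:
            return True
        return self.sequence() and self.peek() is None

    def sequence(self):  # <atom> ( <union> | <branch> | <split> )*
        if not self.take(ATOMS):
            return False
        while self.peek() not in (None, ")"):
            if self.peek() == "(":
                ok = self.branch()
            elif self.take("."):  # <split> ::= "." <sequence>
                ok = self.sequence()
            else:
                ok = self.union()
            if not ok:
                return False
        return True

    def union(self):  # <bond>? ( <cut> | <sequence> )
        self.take(BONDS)
        return self.take(CUTS) or self.sequence()

    def branch(self):  # "(" ( "." | <bond> )? <sequence> ")"
        self.pos += 1  # consume "("
        if not self.take("."):
            self.take(BONDS)
        return self.sequence() and self.take(")")


def accepts(text):
    return Recognizer(text).string()
```

Under these placeholder assumptions, `accepts("")` is true, as are `C1CC1`, `C(C)C`, and `C.C`, while `C(`, `(C)`, and `C=` are rejected.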

Remove notes file

The idea was always to move these to issues. It makes little sense to clutter up the repo with random pieces of information.

Pruning section should present necessary conditions

An atom must be pruned if its subvalence is zero. If the atom's charge is non-zero, the default valences for the isoelectronic element are used to compute subvalence. If there are no such target valences (e.g., [c+2]), an error must be generated. Writers must not write such atoms.
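
A sketch of these conditions follows. Several things in it are assumptions rather than statements from the ms: subvalence is taken as the smallest default valence that is at least the valence in use, minus that valence. The default valences follow the familiar OpenSMILES organic-subset values and may differ from the Dialect table. All names are hypothetical.

```python
ATOMIC_NUMBERS = {"B": 5, "C": 6, "N": 7, "O": 8, "P": 15, "S": 16}
SYMBOLS = {number: symbol for symbol, number in ATOMIC_NUMBERS.items()}

# Assumption: OpenSMILES-style default valences, not the Dialect table.
DEFAULT_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5), "S": (2, 4, 6),
}


def subvalence(element, charge, valence_in_use):
    """Remaining valence, using the isoelectronic element if charged."""
    proxy = SYMBOLS.get(ATOMIC_NUMBERS[element] - charge)
    if proxy is None:
        raise ValueError(f"no default valences for {element} with charge {charge}")
    for valence in DEFAULT_VALENCES[proxy]:
        if valence >= valence_in_use:
            return valence - valence_in_use
    return 0


def must_prune(element, charge, valence_in_use):
    """An atom must be pruned if its subvalence is zero."""
    return subvalence(element, charge, valence_in_use) == 0
```

Here a nitrogen cation is treated like carbon, so with three units of valence in use it keeps a subvalence of one. A carbon with charge +2 maps to an element without default valences and raises an error, as in the [c+2] example.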

Introduce subvalence

The Pruning section (#10) needs to reference the algorithm for subvalence. Unfortunately, that is currently equated to the virtual hydrogen algorithm. This makes it very hard to talk about pruning and the full implicit hydrogen algorithm.

This can be resolved with the following changes:

  1. move Delocalization Subgraph after Constitution
  2. Rename Implicit Hydrogens to Subvalence
  3. make appropriate changes to Subvalence. This will probably mean saving some material and creating new material.
  4. Add Implicit Hydrogens immediately after Subvalence. This section should present the full algorithm for implicit hydrogen calculation, accounting for atom selection.

Semantics of stereocenters with undefined configuration

How should a stereocenter with undefined configuration be interpreted?

No SMILES or OpenSMILES documentation explains this point. If it is not addressed, implementations will need to invent their own rules, which can cause data loss.

Options:

  1. Either. One or the other descriptor is present, but which one is unknown.
  2. Mixture. A mixture of configurations of unknown ratio is present.
  3. (1) or (2)

For its part, V2000 uses this interpretation: "It could be either of two stereoisomers, or a mixture of the two." In other words, (3).

Given the large number of V2K<->SMILES conversions being performed, and given no clear advantage to any option, (3) makes the most sense.

See also the discussion here.

Bonds added to DS

The ms states that bonds must be elided to be inducted into the DS. It also states that the bond order must be one or two.

Following the policy of greatest restriction, the ms should be unified to omit any non-elided bond from the DS.

Reorganize Syntax

The current organization is confusing because it's organized solely around syntax with little consideration of semantics. It should bring in more semantics and organize around major data structure components.

Subheadings could address this:

  • Atom
    • bracket
    • shortcut
    • selected shortcut
    • star
  • Bond
    • types
    • overview of using bonds
  • Chains, Branches, and Cycles
    • how each of item types affect the way in which parent connects to child

Reversed Partial Parity Bond

Conformation describes a convention for working with partial parity bonds (PPBs). It assumes that bonds are encoded left-to-right. In other words, target always succeeds source.

This works under the assumption that the graph has no cycles. If one exists, then the edge that bites back can be forward (target succeeds), reverse (target precedes), or both. This case leads to incorrect selection of neighbor quadrant.

Whatever solution is chosen must account for the apparent dual state of the central PPB in hexadiene (C/C=C/C=C/C).

Options:

  • reverse assignment in the event that target precedes source
  • ignore target-precedes PPBs
    • but it may be the only one carrying state
