rapodaca / dialect
Documenting a subset of the SMILES language.
License: MIT License
Hi there, I came via your blog, but the link back near the top of the README seems dead? Should it be this one?
https://depth-first.com/articles/2021/09/22/beyond-smiles/
Attribute Name | Description | Type |
---|---|---|
kind | The kind of bond. | Enumeration |
target | The index of the target atom. | Integer |
Bond Kind | Bond Order | State |
---|---|---|
Elided | 1 | None |
Single | 1 | None |
Double | 2 | None |
Triple | 3 | None |
Up | 1 | Up |
Down | 1 | Down |
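A minimal Python sketch of the two Bond tables above may help as a reference. The names `BondKind`, `Bond`, `bond_order`, and `bond_state` are hypothetical and not taken from the ms; only the kinds, orders, and states come from the tables. A lookup table is used instead of enum values because Elided and Single share the same order and state, and Python would otherwise treat them as aliases.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class BondKind(Enum):
    """The six bond kinds listed in the table (names are illustrative)."""
    ELIDED = auto()
    SINGLE = auto()
    DOUBLE = auto()
    TRIPLE = auto()
    UP = auto()
    DOWN = auto()


# (bond order, state) for each kind, copied from the table.
_PROPERTIES = {
    BondKind.ELIDED: (1, None),
    BondKind.SINGLE: (1, None),
    BondKind.DOUBLE: (2, None),
    BondKind.TRIPLE: (3, None),
    BondKind.UP: (1, "Up"),
    BondKind.DOWN: (1, "Down"),
}


@dataclass(frozen=True)
class Bond:
    """A bond with the two attributes from the attribute table."""
    kind: BondKind
    target: int  # index of the target atom


def bond_order(kind: BondKind) -> int:
    return _PROPERTIES[kind][0]


def bond_state(kind: BondKind) -> Optional[str]:
    return _PROPERTIES[kind][1]
```

Keeping Elided and Single as distinct members preserves the difference between a bond that was written and one that was left out, even though both have order 1.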
The section currently lacks examples. It would be very helpful to include some to illustrate both acceptable and unacceptable encoding.
A table of Atom attributes should be present. It should use exactly the same terms used elsewhere in the ms.
Attribute Name | Description | Type | Nullable | Meaning if null | Constraints | Default Value |
---|---|---|---|---|---|---|
index | A unique identifier | unsigned integer | no | N/A | - | N/A |
isotope | The sum of proton and neutron counts. | unsigned integer | yes | natural abundance | 0 <= value < 1000 | null |
element | The atom's element. | enumeration | yes | unknown element | An IUPAC-approved element symbol. | null |
virtual_hydrogens | The number of virtual hydrogens | unsigned integer | yes | zero virtual hydrogens | 0 <= value < 10 | 0 |
algorithmic_hydrogens | If true hydrogens are counted algorithmically. | boolean | no | - | - | true |
configuration | Configurational descriptor | enumeration | yes | no configuration | tetracoordinate | null |
extension | Application-specific data | integer | yes | no extension | 0 <= value < 1000 | null |
selected | Whether the atom is selected. | boolean | no | - | element must be one of C, N, O, P, or S. | - |
... and so on.
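As a rough illustration, the Atom attribute table could map onto a data structure like the one below. This is only a sketch: the class name `Atom`, the helper `_check_range`, the use of `None` for null, and the `False` default for `selected` (the table gives no default) are all assumptions. The ranges and the C, N, O, P, S restriction are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional

# Elements that may be selected, per the "selected" row of the table.
SELECTABLE = {"C", "N", "O", "P", "S"}


def _check_range(name: str, value: Optional[int], upper: int) -> None:
    """Raise if a non-null value falls outside 0 <= value < upper."""
    if value is not None and not 0 <= value < upper:
        raise ValueError(f"{name} must satisfy 0 <= value < {upper}")


@dataclass
class Atom:
    index: int
    isotope: Optional[int] = None
    element: Optional[str] = None
    virtual_hydrogens: Optional[int] = 0
    algorithmic_hydrogens: bool = True
    configuration: Optional[str] = None
    extension: Optional[int] = None
    selected: bool = False  # assumed default; the table leaves it open

    def __post_init__(self) -> None:
        if self.index < 0:
            raise ValueError("index must be an unsigned integer")
        _check_range("isotope", self.isotope, 1000)
        _check_range("virtual_hydrogens", self.virtual_hydrogens, 10)
        _check_range("extension", self.extension, 1000)
        if self.selected and self.element not in SELECTABLE:
            raise ValueError("only C, N, O, P, or S atoms can be selected")
```

For example, `Atom(index=0, element="C", selected=True)` is accepted, while `Atom(index=0, element="Te", selected=True)` raises `ValueError`.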
"At" was excluded from the "organic subset." Dropping At and Ts aligns Dialect with the original intent.
Adding elements past Lr to <element> breaks backward-compatibility, but is more in line with expectation than new elements in the organic subset.
Among other corrections, this means removing At and Ts from the valence table.
Also:
The next atomic production rule, <shortcut>, is a non-terminal comprised of the terminal element symbols: "B"; "C"; "N"; "O"; "P"; "S"; "F"; "Cl"; "Br"; "I"; "At"; "Ts"; "P"; and "S".
Partial parity bonds mean that the dialect molecular graph is directed. Make sure that all statements about graph type are consistent with this point. For example, "Third, edges are undirected." It would be better to break the directed/undirected discussion out into a separate paragraph, noting that the reason for directed bonds will become clear.
The ms makes several references to "dialect" in the linguistic sense, and this is of course the code name for the language. But the goal is not to make yet another dialect. The goal is to fully define, for the first time, a language that functions as a subset of SMILES-as-practiced. No extensions. No pet nice-to-haves. Just a subset, to the extent that's possible without internal inconsistencies.
Parts to improve:
These occur in the 2nd paragraph of the introduction.
The paper enumerates several kinds of stereochemistry that can't be represented by Dialect. Unless they are explicitly disallowed, there could be a temptation to bend the rules. It might be worth discussing some of the more obvious cases and re-iterate the support for extensions.
Need a paragraph or section talking about elided bonds. Bond elision is mentioned by the Delocalization Subgraph section.
The current support of hex digits breaks compatibility with most SMILES implementations for a minimal, uncertain payoff at best. An <extension> is up to four decimal digits, interpreted as a single integer, with leading zeros allowed.
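The proposed rule could be sketched in Python as follows. The function name `parse_extension` is hypothetical, and requiring at least one digit is an assumption, since "up to four" does not say whether an empty extension is meaningful.

```python
import re

# One to four ASCII decimal digits; hex digits are rejected.
_EXTENSION = re.compile(r"[0-9]{1,4}")


def parse_extension(text: str) -> int:
    """Interpret the digits as a single integer, allowing leading zeros."""
    if not _EXTENSION.fullmatch(text):
        raise ValueError(f"invalid extension: {text!r}")
    return int(text)
```

Under this sketch, `parse_extension("0042")` yields the integer 42, while `"1F"` and `"12345"` are rejected.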
Default valences are required to make an element eligible for selection. In other words, only eligible atoms can be selected.
Otherwise, pruning is impossible because free hydrogen count can not be determined. You effectively make up your own valence model. This is the main reason to exclude things like [te].
Note that for the same reason [as] and [se] should not be valid, either.
Those services needed by a reference implementation. One may or may not be ready in time for publication. Purr in its current form will definitely need revision to be compatible with Dialect. The extent isn't yet clear, though.
Previously: "Hydrogen Suppression"
Comprehensively describe, modulo selection (aka "aromaticity"), the rules for computing implicit hydrogen count.
Grammar supports the empty string, so the molecular graph can be empty. At least one statement is inconsistent:
"A Dialect molecule consists of a graph with at least one node and zero or more edges."
All such occurrences should be corrected to allow empty graphs.
It should be possible to build a preprint-ready manuscript in PDF format. The system should use Markdown as the source for accessibility.
Some ideas:
Extremely rare. Should it appear, it's very likely SMILES can't accurately serialize it. Remove quad bond from grammar and ms.
Address the idea that "there isn't anything new here."
To me, one of the most interesting incomplete features in Daylight SMILES/OpenSMILES is the polymer extension. Currently, SMILES are mostly used for nonpolymer molecules, but this ignores a large set of chemical compounds.
I have explained the problem in timvdm/OpenSMILES#8 and IUPAC/IUPAC_SMILES_plus#9.
Thanks for an interesting and really much needed effort!
A collection of tricky cases and non-obvious conclusions around reading Dialect.
C1C.C1 is allowed
. will fail generally)
C/C, C\C are not valid
Scope, limitations, observations. Maybe some more context.
Describe what conformation means and how it's used. It will be helpful to develop a visual notation system, but it's not clear what it should be.
This discussion, if present, probably goes in Writing.
The efficiency of decoding selected atoms is set by the Blossom algorithm. However, it's not clear what sets the upper bound on encoding selected atoms when the selection is based on something other than perfect matching. Examples:
Nice idea, but please don't forget about metals: your VB graph part will likely fail for most of the transition metals as well as some main group elements such as boron. Valence bond theory does not work for the highly delocalized bonding situations you often encounter there, such as the 3c2e bonds in diborane B2H6, iron nitrosyl complexes, [Fe-S] clusters, ...
Also, limiting the hydrogen count to 0..9 will exclude some compounds, e.g. [Zr(BH4)4] as well as the Hf analogue, which have 12 hydrogens coordinated to the metal center, 3 from each of the four borohydrides (the 4th H points away from the metal due to its tetrahedral structure):
https://link.springer.com/content/pdf/10.1007/BF00962359.pdf
Same issue with implicit hydrogens: "a free nitrogen atom can bind three or five hydrogens" - and what about molecules as simple as [NH4]+???
Bond order in metal complexes can go up to 6 due to involvement of d orbitals:
https://en.wikipedia.org/wiki/Sextuple_bond
Want more input???
The atomic index attribute needs its own discussion. This may require an entire section, but could probably be tucked into Constitution. Either way, Conformation and Configuration will reference it, and so the index attribute should be introduced before these sections.
Discussion of stereo descriptors should indicate that it is an error to assign one to anything other than a tetracoordinate atom.
PPBs are complicated and error-prone. Both Reading and Writing sections should discuss the issues and offer ways to work around them.
One approach: an intermediate descriptor that replaces PPBs with dedicated syn/anti double bonds. Similar in concept to the @ notation proposed as an OpenSMILES extension.
FAIR principles mesh well with the goals of complete language specifications. The paper below contains some points that should be worked into the introduction.
Findability
Accessibility
Interoperability
Reusability
LL(1) formal grammar is the centerpiece. It could be a challenge to strike the right balance between background info on that and describing the syntax itself.
--, ++ charge disallowed
@ and @@ configuration
[ ... ]
single -, double =, triple #
quadruple $
:
up /, down \
C11).
Anything other than four-coordinate tetrahedral configuration is not supported.
It might be necessary to introduce atom index as a property in the Constitution section, for use here and in Conformation.
The section is not as clear as it could be because it lacks a focus on data structures. Address this by introducing the Atom and Bond data structures first, then discussing attributes and their interactions later. A table of attributes for Atom and Bond would be very handy for reference.
A suggestion:
C[S@](=O)Cl means C[S@"LP"](O)Cl - that is, the lone pair, oxygen and chlorine appear anticlockwise from the carbon.
It's a possibly useful idea, but only provided that all issues can be resolved. If they can't then Stereochemistry should explicitly disallow the interpretation.
These properties and their definitions will be used in later sections.
A collection of tricky cases and non-obvious conclusions about writing Dialect.
This is a constant point of confusion in SMILES. Clarifying it deals with this specific concern and illustrates the purpose and function of the DS.
Start by interpreting the meaning of the string c1cccn1. Run through the pruning algorithm (#41), which leaves all atoms unpruned. Finish with the error resulting from the lack of a perfect matching.
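The final step can be illustrated with a short Python sketch. It assumes, as stated above, that all five atoms of c1cccn1 survive pruning, and it numbers them 0 to 4 in string order. The function `has_perfect_matching` is a hypothetical helper: a brute-force search standing in for a real matching algorithm, which is adequate only for small subgraphs.

```python
def has_perfect_matching(nodes, edges):
    """Return True if every node can be paired with a distinct neighbor."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    def match(unmatched):
        if not unmatched:
            return True
        # The smallest unmatched node must pair with some unmatched neighbor.
        first = min(unmatched)
        rest = unmatched - {first}
        return any(
            match(rest - {partner}) for partner in adjacency[first] & rest
        )

    return match(frozenset(nodes))


# c1cccn1: a five-membered ring, closed by the ring bond between atoms 4 and 0.
five_ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
# c1ccccc1: a six-membered ring for comparison.
six_ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

print(has_perfect_matching(range(5), five_ring))  # False
print(has_perfect_matching(range(6), six_ring))   # True
```

With an odd number of atoms in the subgraph, at least one atom is always left unpaired, which is why the five-membered ring fails while the six-membered ring succeeds.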
This probably best follows the discussion on pruning (#41).
An atom requires an unpaired electron to become a member of the DS. Because atoms without unpaired electrons are often added, a pruning procedure can be used.
Many formats support the empty molecule, including V2000 and V3000 molfile. If Dialect does not, this could create problems.
The empty string could be supported with something as simple as:
<string> ::= <sequence> | ε
<sequence> ::= <atom> ( <union> | <branch> | <split> )*
<union> ::= <bond>? ( <cut> | <sequence> )
<branch> ::= "(" ( "." | <bond> )? <sequence> ")"
<split> ::= "." <sequence>
...
The idea was always to move these to issues. It makes little sense to clutter up the repo with random pieces of information.
An atom must be pruned if its subvalence is zero. If the atom's charge is non-zero, the default valences for the isoelectronic element are used to compute subvalence. If there are no such target valences (e.g., [c+2]), an error must be generated. Writers must not write such atoms.
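A hedged Python sketch of this rule follows. Several things here are assumptions rather than definitions from the ms: subvalence is taken to be the gap between an atom's current valence (bond order sum plus hydrogens) and the smallest target valence that is not below it; the default valences are the familiar organic-subset values for B, C, N, O, P, and S; and the isoelectronic element is found by subtracting the charge from the atomic number. The names `subvalence` and `must_prune` are hypothetical.

```python
# Assumed default valences, keyed by atomic number (B, C, N, O, P, S).
DEFAULT_VALENCES = {5: [3], 6: [4], 7: [3, 5], 8: [2], 15: [3, 5], 16: [2, 4, 6]}
ATOMIC_NUMBER = {"B": 5, "C": 6, "N": 7, "O": 8, "P": 15, "S": 16}


def subvalence(element: str, charge: int, valence: int) -> int:
    """Unused valence of an atom under the assumptions stated above."""
    # A charged atom borrows the valences of its isoelectronic element,
    # e.g. N+ uses the carbon valences and C- the nitrogen valences.
    isoelectronic = ATOMIC_NUMBER[element] - charge
    targets = DEFAULT_VALENCES.get(isoelectronic)
    if targets is None:
        raise ValueError(f"no target valences for {element} with charge {charge:+d}")
    for target in targets:
        if target >= valence:
            return target - valence
    return 0  # valence already exceeds every target


def must_prune(element: str, charge: int, valence: int) -> bool:
    return subvalence(element, charge, valence) == 0
```

Under these assumptions, a neutral nitrogen with valence 2 has subvalence 1 and stays, one with valence 3 is pruned, and the [c+2] case raises an error because no default valences exist for its isoelectronic element.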
The Pruning section (#10) needs to reference the algorithm for subvalence. Unfortunately, that is currently equated to the virtual hydrogen algorithm. This makes it very hard to talk about pruning and the full implicit hydrogen algorithm.
This can be resolved with the following changes:
RR diagrams should be built from source by the build.sh script. Later they can be linked into the document.
How should a stereocenter with undefined configuration be interpreted?
No SMILES or OpenSMILES documentation explains this point. If it is not addressed, implementations will need to invent their own rules, which can cause data loss.
Options:
For its part, V2000 uses this interpretation: "It could be either of two stereoisomers, or a mixture of the two." In other words, (3).
Given the large number of V2K<->SMILES conversions being performed, and given no clear advantage to any option, (3) makes the most sense.
See also the discussion here.
The ms states that bonds must be elided to be inducted into the DS. It also states that the bond order must be one or two.
Following the policy of greatest restriction, the ms should be unified to omit any non-elided bond from the DS.
The current organization is confusing because it's organized solely around syntax with little consideration of semantics. It should bring in more semantics and organize around major data structure components.
Subheadings could address this:
Conformation describes a convention for working with partial parity bonds (PPBs). It assumes that bonds are encoded left-to-right. In other words, target always succeeds source.
This works under the assumption that the graph has no cycles. If one exists, then the edge that bites back can be forward (target succeeds), reverse (target precedes), or both. This case leads to incorrect selection of neighbor quadrant.
Whatever solution is chosen must account for the apparent dual state of the central PPB in hexadiene (C/C=C/C=C/C).
Options:
The hydrogens attribute, if present, may assume integer values ranging from zero to nine.
The grammar says 1-9. Change the above statement to match the grammar.
SMILES dialects not currently covered by the introduction go here.
C%(123)