rdkit / knime-rdkit

The RDKit nodes for the KNIME Analytics Platform

Java 99.56% HTML 0.19% Python 0.22% CSS 0.01% Batchfile 0.01%

knime-rdkit's Introduction

RDKit

RDKit is a collection of cheminformatics and machine-learning software written in C++ and Python.

BSD license - a business friendly license for open source
Core data structures and algorithms in C++
Python 3.x wrapper generated using Boost.Python
Java and C# wrappers generated with SWIG
2D and 3D molecular operations
Descriptor and Fingerprint generation for machine learning
Molecular database cartridge for PostgreSQL supporting substructure and similarity searches as well as many descriptor calculators
Cheminformatics nodes for KNIME
Contrib folder with useful community-contributed software harnessing the power of the RDKit

Community

Code

GitHub code and bug tracker

Web presence

Materials from user group meetings

Documentation

Available on the RDKit page and in the Docs folder on GitHub

Installation

Installation instructions are available in Docs/Book/Install.md.

Binary distributions, anaconda, homebrew

binaries for conda python or, if you are using the conda-forge stack, the RDKit is also available from conda-forge.
RPMs for Red Hat Enterprise Linux, CentOS, and Fedora. Contributed by Gianluca Sforna.
debs for Ubuntu and other Debian-derived Linux distros. Contributed by the Debichem team.
homebrew formula for building on the Mac. Contributed by Eddie Cao.
recipes for building using the excellent conda package manager. Contributed by Riccardo Vianello.
APKs for Alpine Linux. Contributed by da Verona.
Wheels at PyPI for all major platforms and Python versions. Contributed by Christopher Kuenneth.

Projects using RDKit

ROBERT - Automated Machine Learning Protocols
AQME - Automated Quantum Mechanical Environment
chemprop - message passing neural networks for molecular property prediction
RMG - Reaction Mechanism Generator
RDMC - Reaction Data and Molecular Conformers - package for dealing with reactions, molecules, conformers, mainly in 3D
pychemprojections - python library for visualizing various 2D projections of molecules.
pychemovality - python library for estimating the ovality of molecules.
ChEMBL Structure Pipeline - ChEMBL protocols used to standardise and salt strip molecules.
FPSim2 - Simple package for fast molecular similarity searches.
Datamol (docs, repo) - A Python library to intuitively manipulate molecules.
Scopy (docs, paper) - an integrated negative design Python library for desirable HTS/VS database design
stk (docs, paper) - a Python library for building, manipulating, analyzing and automatic design of molecules.
gpusimilarity - A Cuda/Thrust implementation of fingerprint similarity searching
Samson Connect - Software for adaptive modeling and simulation of nanosystems
mol_frame - Chemical Structure Handling for Dask and Pandas DataFrames
RDKit.js - The official JavaScript release of RDKit
DeepChem - python library for deep learning for chemistry
mmpdb - Matched molecular pair database generation and analysis
CheTo (paper)- Chemical topic modeling
OCEAN (paper)- Optimized cross reactivity estimation
ChEMBL Beaker - standalone web server wrapper for RDKit and OSRA
ZINC - Free database of commercially-available compounds for virtual screening
sdf_viewer.py - an interactive SDF viewer
sdf2ppt - Reads an SDFile and displays molecules as image grid in powerpoint/openoffice presentation.
MolGears - A cheminformatics tool for bioactive molecules
PYPL - Simple cartridge that lets you call Python scripts from Oracle PL/SQL.
shape-it-rdkit - Gaussian molecular overlap code shape-it (from silicos it) ported to RDKit backend
WONKA - Tool for analysis and interrogation of protein-ligand crystal structures
OOMMPPAA - Tool for directed synthesis and data analysis based on protein-ligand crystal structures
OCEAN - web-tool for target-prediction of chemical structures which uses ChEMBL as datasource
chemfp - very fast fingerprint searching
rdkit_ipynb_tools - RDKit Tools for the IPython Notebook
Vernalis KNIME nodes
Erlwood KNIME nodes
AZOrange

License

Code released under the BSD license.

knime-rdkit's People

Contributors

manuelschwarze webbres monsterero gab1one chenlinkong jibsn shunsunsun chaubold yupliu siryoku steve252 ptosco steffen-fissler chrisjorg

knime-rdkit's Issues

SVG from renderer should not include svg namespace

This doesn't seem to be necessary, and the svg: prefix causes rendering problems in the JavaScript table view.

OptimizeGeometry node: use ETKDG when generating missing conformers

When rendering make a call to prepareMolForDrawing()

RDKit From Molecule node can't render molecule with wedged bond

Dear developer,
Thanks for developing such useful nodes for drug discovery.
I found that the RDKit From Molecule node can't render a chiral molecule with wedged bonds.
I think it would be better to render such molecules with wedged bonds.

Thanks,
Taka

Log4j Security Vulnerability

The RDKit nodes plugin makes use of the OPSIN library, which has a dependency on log4j 2.14.1 in our current RDKit nodes version. It is somewhat hidden, because we built the OPSIN library into a single JAR file that bundles all dependencies. I raised an issue in the OPSIN project yesterday, and Dan fixed it immediately by updating to log4j 2.15.1. We should get that update into the RDKit nodes ASAP for the nightly build, and should also consider releasing it for KNIME 4.3, 4.4 and 4.5. @greglandrum, I will require your code review and approval.

OptimizeGeometry node: generate warnings when no coords are present

non-canonical smiles from fragmenter node

https://www.knime.com/forum/rdkit/rdkit-fragmenter-node-query

Update link to FPS format documentation

This is a better link:
https://jcheminf.springeropen.com/articles/10.1186/1758-2946-5-S1-P36

Option for useChirality for Morgan Fingerprint

Could this option which is available in the toolkit be shown to users in the RDKit Fingerprint Node?

Support chirality in the substructure searching nodes

Here's the forum thread:
https://forum.knime.com/t/rdkit-molecule-substructure-filter-ignores-chirality/15283

RDKit Molecule Extractor Node Cannot Handle Single Molecule

The RDKit Molecule Extractor Node splits a set of molecules into individual structures, but this only works if the SDF or SMILES contains two or more structures. If there's only one structure in the SDF or SMILES the RDKit Molecule Extractor generates no output structure at all resulting for instance in a blank table. This is a bug. It should in that case output the only structure that was found.

This was reported in the RDKit forum: http://tech.knime.org/forum/rdkit/rdkit-molecule-extractor-returns-empty-field-if-only-one-structure-present
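The expected behavior for SDF input can be sketched as follows. This is a minimal illustration in plain Python, not the node's actual implementation (which is Java and uses the RDKit parsers); the function name `split_sdf_records` is hypothetical. It assumes records are terminated by a line containing only `$$$$` and shows that a single record, or a trailing record without a terminator, should still be returned.

```python
def split_sdf_records(sdf_text):
    """Split SD file text into individual records.

    Assumption: each record ends with a line containing only '$$$$'.
    A file holding just one record, or a last record that lacks the
    terminator, is still returned instead of being dropped.
    """
    records = []
    current = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # End of a record: keep it only if it has any content.
            if any(part.strip() for part in current):
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that has no '$$$$' terminator.
    if any(part.strip() for part in current):
        records.append("\n".join(current))
    return records
```

With this logic an input containing exactly one structure yields a list with one entry rather than an empty result.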

BRICS/RECAP decomposition node

A node that uses either the BRICS or RECAP rules to fragment a molecule.
We'll need to add RECAP definitions to the RDKit first

An optional input port could be a table with a list of SMARTS patterns defining bonds to use for fragmentation. These could then be used to construct a call to fragmentOnBonds(), assuming that it is usable from Java.

support radius 0 fingerprints

Add NumChiralCenters descriptor

Might as well add NumSpecifiedChiralCenters and NumUnspecifiedChiralCenters at the same time.

This would be best done at the C++ level and then just wrapped, but it's possible to do from Java too in case the C++ code "doesn't get finished" quickly enough, sample code here: https://www.knime.org/blog/using-custom-data-types-with-the-java-snippet-node

Allow reaction sanitization

The backend supports it (need to verify that this is true in Java), so we should allow it (as an option) in the reaction nodes too.

Functional Group Filter: SMARTS pattern issues / bugs

There seem to be issues with the halogen patterns. The first is one I observed some time ago: CF3 does not seem to be filtered out. Example:

CC12C3C11CCC2OC3(C)CC1C(F)(F)F

(All examples are from GDB-17)

If I set F to < 3, I expect these molecules to be filtered out.

The second problem is related to bromine, but probably affects all other halogens as well. In this specific setting I have halogens set to 0 and non-fluorine halogens also set to 0. Yes, that's redundant (configurable subnode), but it may affect the issue.

molecules passing this filter:

BrC1=NC2=C(NS1(=O)=O)SC(=C2)C#C

BrC1=NC=C2N3C(=CS2(=O)=O)C=COC=C13

Many more from GDB-17

So there seems to be an issue with the SMARTS pattern. I'm especially confused by the complexity of the top halogen pattern. Couldn't that just be [F,Cl,Br,I]? Or what am I missing?
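As a cross-check of what the filter is expected to see, the halogen counts of the reported examples can be obtained with a crude string-level count. This is only a sketch: it is not SMARTS matching and not what the node does. The helper `count_halogens` is hypothetical and assumes plain organic-subset SMILES without bracket atoms (so symbols such as [Fe] or [In] would be miscounted).

```python
import re

# Halogen symbols as they appear in organic-subset SMILES.
_HALOGEN = re.compile(r"Cl|Br|F|I")


def count_halogens(smiles):
    """Count F, Cl, Br and I symbols in a SMILES string (text-level only)."""
    counts = {"F": 0, "Cl": 0, "Br": 0, "I": 0}
    for symbol in _HALOGEN.findall(smiles):
        counts[symbol] += 1
    return counts


# Examples reported above.
cf3_example = count_halogens("CC12C3C11CCC2OC3(C)CC1C(F)(F)F")
bromo_example = count_halogens("BrC1=NC2=C(NS1(=O)=O)SC(=C2)C#C")
```

The first example contains three fluorine atoms and the second one bromine atom, so a filter set to F < 3 or halogens = 0, respectively, would be expected to reject them.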

Reaction nodes: Allow specifying which column to use to construct output row ids

Also a good idea to have an option to bring the reactant row ids over into the output table.

MCS node breaking up rings

I've noticed some (I think) erroneous behaviour in the MCS node. In short it will, under certain circumstances, break open rings even when told not to.

I've attached a zipped up KNIME workflow to demonstrate the problem. It shows how mining a fuzzy MCS (in terms of element type) from a set of molecules results in a SMARTS pattern containing aromatic bonds that no longer describe a ring. This is the SMARTS I'm getting:

[#6]-[#6,#7]-[#7,#6]1-[#6]-[#6]-[#7](-[#6]-[#6]-1)-[#6]1:[#6,#7]:[#6](:[#6]:[#6]:[#6]:[#6]):[#7,#6]:[#6]:[#7]:1

The input SMILES strings are:

FC(F)(F)C1=CC=CC=C1C1=CC=C2C(=C1)N=CN=C2N1CCN(CC1)C(=O)C=C
OCC#CC(=O)N1CCN(CC1)C1=C2C=C(Cl)C(=CC2=NC=N1)C1=CC=C(F)C=C1F
ClC1=CC=CC=C1C1=C2N=C(N=CC2=CC=C1)N1CCN(CC1)C(=O)C=C
ClC1=CC=C2C(=C1)N=CN=C2N1CCN(C(C1)C#N)C(=O)C=C
C=CC(=O)N1CCN(CC1C#N)C1=C2C=CC=CC2=NC=N1
ClC1=CC=C2C(=C1)N=CN=C2N1CCN(C(C1)C#N)C(=O)C=C
C=CC(=O)N1CCN(CC1)C1=C2C=CC(=CC2=NC=N1)C1=CC=CC=C1
ClC1=C(C=C2N=CN=C(N3CCC(CC3)NC(=O)C=C)C2=C1)C1=CC=CC=C1

The node configuration is as follows:

Threshold: 1.0
Ring matches ring only: checked
Complete rings only: checked
Match valences: unchecked
Atom comparisons: "Compare Any"
Bond comparisons: "Compare Order"
Timeout: 300

The node doesn't time out, but I've seen this behaviour more often when it does.

I would think this behaviour is a bug and is definitely undesired for me, since I'm using this node to identify scaffolds. Instead of breaking bonds, I would have expected the node to produce a SMARTS pattern that just doesn't include the atoms from the broken ring.

mcs_rings.zip

Switch to using new R-Group Decomposition code

This probably means deprecating the current node since the options are now different.

The constructor for the data structure with the parameters for the C++ RGD code, which includes the default values, is here:
https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/RGroupDecomposition/RGroupDecomp.h#L66

From a quick skim of the generated code, it looks like all of this is exposed to Java in a sensible way.

fix link to RDKit book in reaction nodes

it's currently to http://rdkit.svn.sourceforge.net/viewvc/rdkit/trunk/Docs/Book/RDKit_Book.pdf
but should be to http://rdkit.org/docs/RDKit_Book.html#reaction-smarts

Add support for parsing PDB cells

The relevant C++ call is: RWMol.MolFromPDBBlock

New 3D descriptor Node

Trying to understand the RDKit Aromatizer node and the comparison to 0 to determine success

I'm trying to understand an existing RDKit-based KNIME workflow and, having used RDKit in Python previously, I can't understand the RDKit Aromatizer node. It calls .setAromaticity(temp), where temp is an RDKit molecule, and then compares the result to 0 to determine success. What is happening here? SetAromaticity in the C++ API, which the Java KNIME code is wrapping, returns void, and in Python it returns None. What does the comparison to 0 do here?

Line I'm referencing:

knime-rdkit/org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/aromatize/RDKitAromatizeNodeModel.java

Line 242 in d760417

if (RDKFuncs.setAromaticity(temp) != 0) { // Success

RDKit 2D Depiction option missing for formats in preferred render settings

I can set RDKit 2D depictions as preferred renderer for smiles and sdf but not for molfile or inchi.

For molfile columns in the table view I can switch to rdkit.
For inchi columns rdkit is not an option also in table view.

Can both of the formats get rdkit as preferred renderer?
(inchi being less important, but sometimes it can be good to have a structure vs a string)

Test workflows are failing because they were created with nightly builds

There's a new "feature" in KNIME 3.5 that prevents loading of workflows created with a nightly build. While the issue is being fixed on KNIME we should simply remove the flag from the workflows.

Add support for processing the new HELM cells

Output Column Naming and Append vs replace

This is most likely very subjective, but I always "struggle" with the configuration of the RDKit nodes, namely the output column name of the changed molecule. The cases where I do NOT want to simply replace the existing column are rare. So I end up having to check "Remove Source Column" and then change the output column name back to the original name. This still leaves the issue that the column order needs to be fixed afterwards. It's just cumbersome and not very neat/clean.

I much prefer the way the indigo nodes handle this with the Append column check box.

I simply need to uncheck the box and it's done (ideally it should be unchecked by default). It also actually replaces the column and keeps column order intact.

The suggestion hence is to do it like the Indigo nodes, but with "append column" unchecked by default. This would in my opinion greatly increase the ease of use of the RDKit nodes for common use cases.
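The requested behavior can be sketched on a plain list of column names. This is an illustration only; `place_output_column` and its parameters are hypothetical and do not correspond to any KNIME or RDKit nodes API. With `append=False` (the proposed default) the result takes the position of the source column, so the column order stays intact.

```python
def place_output_column(columns, source, output_name=None, append=False):
    """Return the output column order for a node that transforms `source`.

    append=False: the result replaces `source` in place (order preserved).
    append=True:  the result is added as a new last column.
    """
    if source not in columns:
        raise ValueError(f"unknown source column: {source!r}")
    name = output_name or source
    if append:
        if name in columns:
            raise ValueError(f"column already exists: {name!r}")
        return list(columns) + [name]
    if name != source and name in columns:
        raise ValueError(f"column already exists: {name!r}")
    result = list(columns)
    result[result.index(source)] = name  # same position as the source column
    return result
```

For example, `place_output_column(["ID", "Molecule", "Activity"], "Molecule")` keeps the three columns in their original order, while passing `append=True` together with a new name adds a fourth column at the end.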

BRICS molecule builder node

This would really be a fragment-based molecule builder node, but the first rule set would be the BRICS rules.

The required C++ functionality for this may not yet be present; I need to look into that.

Fingerprint bit highlighting node

Idea is to be able to provide a rendering of a molecule with the atoms involved in setting a fingerprint bit highlighted.

Example of how the highlights work for Morgan bits from Python:

http://rdkit.blogspot.ch/2016/02/morgan-fingerprint-bit-statistics.html

I need to provide sample code for the other fingerprint types that this works with

Add RDKit renderer for SMILES, MOL, and SDF cells

This would make numerous tasks easier.
Could also think about adding it for SMARTS cells

Provide better error messages when RDKit nodes fail completely or by row

Steve Roughley, Vernalis, has suggested the following interesting idea as it was successfully implemented already in the Vernalis nodes:

In the spirit of OSS/sharing etc I thought that this latest commit to the Vernalis code might be of interest:

vernalis/vernalis-knime-nodes@ebc1d9f

In particular this file - https://github.com/vernalis/vernalis-knime-nodes/blob/master/com.vernalis.knime.chem.core/src/com/vernalis/knime/chem/rdkit/RDKitRuntimeExceptionHandler.java

Basically this provides a mechanism for handling RDKit exceptions for either the new (4.1) or older versions of the RDKit Types plugin. Typical usage:

    try {
        // Do something throwing e.g. GenericRDKitException or MolSanitizeException…
    } catch (MolSanitizeException e) {
        // Just re-throw, but message is available via standard #getMessage() call
        throw new RDKitRuntimeExceptionHandler(e);
    } catch (GenericRDKitException e) {
        // Just log the message…
        getLogger().warn(new RDKitRuntimeExceptionHandler(e).getMessage());
    }

Hope that’s of use…
Steve

I think this is a very good idea, but it would require some amount of refactoring, as we would have to go through all the RDKit nodes code and rework the error handling and automated tests - that would be a long journey.

Steve had another idea: maybe we can make #getMessage() work in the wrappers in RDKit – that would solve the problem once and for all. Then we can simply rely on having a meaningful message in all exceptions coming from RDKit.

If I understand it correctly, that would mean improving the SWIG mechanism that creates the wrapper classes... - It will be more on the RDKit side, not so much on the RDKit Nodes side.

Pre-condition Violation Number of atom mismatch Violation occurred on line 564 in file Code\GraphMol\ROMol.cpp Failed Expression: conf->getNumAtoms() == this->getNumAtoms() RDKIT: 2017.09.1 BOOST: 1_56

Hi,

When using the Python Script node in KNIME after the RDKit R-Group Decomposition node, I am getting the following error message.

ERROR Python Script 0:54 Execute failed: Pre-condition Violation
Number of atom mismatch
Violation occurred on line 564 in file Code\GraphMol\ROMol.cpp
Failed Expression: conf->getNumAtoms() == this->getNumAtoms()
RDKIT: 2017.09.1
BOOST: 1_56

The python script only contains the following lines of code

from rdkit import Chem
from rdkit.Chem import AllChem
import pandas as pd

mols = input_table_1['R1']
output_table_1 = input_table_1.copy()

where R1 is generated by the RDKit R-Group Decomposition node.

If I feed the structures prior to R-group decomposition directly to the Python script, it works just fine, indicating that the R-Group Decomposition node output is likely the source of this error.

RDKit R-Group Decomposition node configuration is as follows:

Knime: 4.5.0
Python: 2.7.18 installed as part of conda environment
OS: Windows 10

Reaction nodes: support "chain reactions"

This would involve adding two new options:

bool allowChainReactions=false
unsigned maxChainReactionProducts=10

This would only work for reactions that have a single product.

If allowChainReactions is true, and the product of the reaction is a possible reactant (as either reactant 1 or reactant 2 for the two-component reaction), then the reaction is repeated with the product as a reactant. This repeats recursively until the product is no longer a possible reactant or maxChainReactionProducts is reached.
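The control flow described above can be sketched as a loop. This is a minimal illustration, not the RDKit or KNIME implementation: `run_reaction`, `react` and `can_be_reactant` are hypothetical names, molecules are stood in for by strings, and the choice to keep the original partner reactant when the product re-enters the reaction is an assumption the proposal does not spell out.

```python
def run_reaction(react, can_be_reactant, reactant1, reactant2,
                 allow_chain_reactions=False, max_chain_reaction_products=10):
    """Run a single-product, two-component reaction, optionally chaining.

    react(a, b) returns the single product or None if nothing reacts.
    can_be_reactant(mol, slot) tells whether mol fits reactant slot 0 or 1.
    """
    products = []
    current = (reactant1, reactant2)
    while len(products) < max_chain_reaction_products:
        product = react(*current)
        if product is None:
            break
        products.append(product)
        if not allow_chain_reactions:
            break
        # Assumption: the product replaces the slot it matches and the
        # other reactant is reused unchanged.
        if can_be_reactant(product, 0):
            current = (product, current[1])
        elif can_be_reactant(product, 1):
            current = (current[0], product)
        else:
            break  # product is no longer a possible reactant
    return products


# Toy stand-ins: "molecules" are strings, the reaction concatenates them,
# and only strings shorter than 4 characters can act as reactant 1.
def toy_react(a, b):
    return a + b


def toy_can_be_reactant(mol, slot):
    return slot == 0 and len(mol) < 4


chained = run_reaction(toy_react, toy_can_be_reactant, "A", "B",
                       allow_chain_reactions=True)
# chained == ["AB", "ABB", "ABBB"]
```

An iterative loop is used here instead of recursion so that the `max_chain_reaction_products` limit is checked in one place.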

Add Conformers Node: multi-threading

A great enhancement would be to make this node multi-threaded. Currently it only uses 1 thread. With an 8-core CPU it could hence be close to 8 times faster.
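The general pattern, distributing independent rows over a pool of workers while keeping the output order, can be sketched as follows. This is a generic Python illustration and not how a KNIME node would implement it; `process_rows_in_parallel` and the `work` callback are hypothetical, and in CPython a thread pool only gives a real speed-up when the per-row work runs in native code that releases the interpreter lock.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_rows_in_parallel(rows, work, max_workers=None):
    """Apply `work` to every row on a thread pool.

    Results are returned in the same order as the input rows, which is
    what a table-based node needs to keep its output aligned.
    """
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, rows))
```

Here `work` would stand for the per-molecule conformer generation; any callable taking one row can be passed in.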

NumHeavyAtoms in Descriptor Calculation node wrong - Edge Case?

There seems to be an edge case in the calculation of the number of heavy atoms when using attachment points [*] in SMILES and hydrogens in brackets. The "table" below contains the SMILES, NumHeavyAtoms, and MQN12, which is also a heavy atom count from the Descriptor Calculation node. Here they differ: MQN12 is right while NumHeavyAtoms is wrong:

SMILES | NumHeavyAtoms | MQN12
[H] | 1 | 0
[H][*] | 2 | 0
[H]C#CC[*] | 4 | 3
C[*] | 2 | 1
CC[*] | 3 | 2
OCC[*] | 4 | 3

So it seems anything in brackets is counted as heavy atom even if it is a hydrogen or attachment point / any atom (the attachment point as [*] is copied like this from ChemDraw). At the same time MQN12 gets it right in all cases.

Not sure this is the right place to post as it probably is an issue within RDKit itself and not the KNIME node?
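The counts the reporter expects can be reproduced with a simple rule: an atom is heavy if its atomic number is greater than 1, which excludes both hydrogen (1) and the attachment point [*] (atomic number 0). The sketch below applies that rule to atomic numbers written out by hand for each SMILES; it does not call RDKit and makes no claim about how either descriptor is computed internally.

```python
# Atomic numbers written out by hand for each SMILES in the table above.
# 0 stands for the attachment point / dummy atom [*], 1 for hydrogen.
EXAMPLES = {
    "[H]": [1],
    "[H][*]": [1, 0],
    "[H]C#CC[*]": [1, 6, 6, 6, 0],
    "C[*]": [6, 0],
    "CC[*]": [6, 6, 0],
    "OCC[*]": [8, 6, 6, 0],
}


def count_heavy_atoms(atomic_numbers):
    """Count atoms that are neither hydrogen nor a dummy atom."""
    return sum(1 for z in atomic_numbers if z > 1)


expected = {smiles: count_heavy_atoms(zs) for smiles, zs in EXAMPLES.items()}
```

Under this rule the results match the MQN12 column of the table (0, 0, 3, 1, 2, 3).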

FilterCatalog node

Provides additional functionality relative to standard substructure filtering.

Python examples:
http://rdkit.blogspot.ch/2016/04/changes-in-201603-release-filtercatalog.html

support enhanced stereo in the nodes which do substructure searching

We should add an option for this like we have one for using chirality.

AdjustQueryProperties() node to prepare molecules for substructure queries

Here's a blog post describing the functionality:
http://rdkit.blogspot.ch/2016/07/tuning-substructure-queries-ii.html

RDKit Rooted Fingerprints Have Wrong Results in KNIME

We have discovered that rooted fingerprints in KNIME are different from those in Python and pure Java. The reason is a bug in how the atom list for the rooting feature is created before it is passed to the RDKit native code.

Paolo Tosco's Java code created it like this, as example with atom index 17:

UInt_Vect atomList = new UInt_Vect(1);
atomList.set(0, atom);

This is correct. It results in [ 17 ]

I had created it many years ago like this:

atomList = new UInt_Vect(1);
atomList.add(atom);

This is incorrect. It results in [ 0, 17 ], where "0" could also be something that is not well-defined.
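The difference between the two snippets can be mimicked with a plain list. This sketch assumes, consistent with the results reported above, that the one-argument `UInt_Vect` constructor creates a vector that already holds that many elements, and that `add` appends a further element; the helper `sized_vector` is a hypothetical stand-in, not part of any RDKit API.

```python
def sized_vector(n):
    """Stand-in for a vector constructed with an initial size of n elements."""
    return [0] * n


atom = 17

# Correct: overwrite the single pre-allocated slot.
correct = sized_vector(1)
correct[0] = atom        # -> [17]

# Buggy: the pre-allocated slot stays and the atom is appended after it.
buggy = sized_vector(1)
buggy.append(atom)       # -> [0, 17]

# Another way to avoid the bug: start from an empty vector and append.
also_correct = sized_vector(0)
also_correct.append(atom)  # -> [17]
```

In this stand-in the extra element is always 0; in the native vector it is whatever the pre-allocated slot happens to contain, as noted above.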

In addition, other buggy code related to the same feature was discovered in the InputDataObject class, which also processes such UInt_Vect objects and has the same bug.

Consequences of this bug:

InputDataInfo.getRDKitIntegerVector(DataRow row) – Was always double the size that it should be, and the first half was filled with 0 (or undefined).

RDKit Highlighting Node – Should have always led to coloring of the atom at index 0, but for some reason this did not happen, hence no problem that a user would actually see.
RDKit Highlighting Atoms Node (deprecated) – Influence on SVG creation, but no problem that a user would actually see.
InputDataInfo.getRDKitUIntegerVector(DataRow row) – Was always double the size that it should be, and the first half was filled with 0 (or undefined).

RDKit Fingerprint Nodes – Rooted fingerprints (Morgan, FeatMorgan, AtomPair, Torsion, RDKit, Layered) always included the atom index 0 (or undefined).
AbstractFingerprintNodeModel.$AbstractRDKitCellFactory.process(…) – When a single atom was selected for rooting, the atom at index 0 was always used in addition to the selected one, so 2 atoms were used instead of 1. Every fingerprint calculated with rooted atoms therefore currently has wrong results: more bits are set in the fingerprint than there would be without the wrongly added atom index.

Tracked at Novartis as KNIME-1794

Change the error message from failed DLL load

Instead of suggesting that the user un-install and re-install the nodes, it should provide the link to the VS2010 redistributables so that they can fix the problem.

Knime RDKit nodes "RDKit Canon Smiles" and "RDKit to InChI" are crashing Knime 4.4.4

Discussed in rdkit/rdkit#5227

Originally posted by ddsneo4j April 22, 2022
Hi,

unfortunately, the RDKit nodes "RDKit Canon Smiles" and "RDKit to InChI" are crashing KNIME 4.4.4 - see the attached KNIME workflow and input structure. Could this bug please be fixed, i.e. KNIME should not be crashed by these nodes because of a structure for which canonical SMILES and InChI keys cannot be created?

Thanks for your effort in advance.

Best regards,
Dan
RdKit.zip

New Node: Generating an SVG column as rendering

This node would function like the highlighting node, but without the highlighting. The new node should be streamable.

Optionally, we could provide PNG rendering as well from it (need to figure out how to do this) based on a given size.

Add MolStandardize node

Using the new C++ code.

What fingerprint settings does the RDKit Molecule Substructure Filter use?

I'm trying to figure out what fingerprint settings the RDKit Molecule Substructure Filter node uses when it pre-calculates fingerprints. I looked at the documentation of the node on KNIME Hub, but there's nothing that specifies the fingerprint settings. Does it just use the RDKit defaults, i.e. the following?

The default set of parameters used by the fingerprinter is:
- minimum path size: 1 bond
- maximum path size: 7 bonds
- fingerprint size: 2048 bits
- number of bits set per hash: 2
- minimum fingerprint size: 64 bits
- target on-bit density 0.0

add chirality info for fingerprints to RDKit Interactive table

SMILES Canonicalization With Very Long SMILES Values Crashed KNIME

A colleague found that KNIME crashes because of an RDKit crash caused by very long SMILES values passed into the RDKit Canonicalize SMILES node. The SMILES in which this occurred link together Cyclooctine. The correct behavior would be to generate an error for such a SMILES if the atom count is too large, but it should of course not crash.

Rendering defaults and render options

I like having RDKit as a rendering option in the preferred renderers. However, the render settings used by default, which I can't seem to change, are pretty bad in my opinion. There is too much margin in the vertical direction, making structures way too small. Example:

The structure could be rendered a lot bigger and get rid of the huge amount of white space. This should be solved in general by changing the defaults.
(Indigo has the opposite problem of too small margins)

On top of this, exposing as many rendering options as possible would also be great, especially those regarding the sizing of the structures, bond thickness and label size. Right now none of these can be controlled like they can with Marvin and, to a lesser degree, with Indigo.

Remove the warnings associated with a non-English locale

I believe these should no longer be necessary and they are irritating to users.