
knime-rdkit's Introduction

RDKit


RDKit is a collection of cheminformatics and machine-learning software written in C++ and Python.

  • BSD license - a business friendly license for open source
  • Core data structures and algorithms in C++
  • Python 3.x wrapper generated using Boost.Python
  • Java and C# wrappers generated with SWIG
  • 2D and 3D molecular operations
  • Descriptor and Fingerprint generation for machine learning
  • Molecular database cartridge for PostgreSQL supporting substructure and similarity searches as well as many descriptor calculators
  • Cheminformatics nodes for KNIME
  • Contrib folder with useful community-contributed software harnessing the power of the RDKit

Community

Code

Web presence

Materials from user group meetings

Documentation

Available on the RDKit page and in the Docs folder on GitHub

Installation

Installation instructions are available in Docs/Book/Install.md.

Binary distributions, anaconda, homebrew

  • binaries for conda Python or, if you are using the conda-forge stack, the RDKit is also available from conda-forge.
  • RPMs for Red Hat Enterprise Linux, CentOS, and Fedora. Contributed by Gianluca Sforna.
  • debs for Ubuntu and other Debian-derived Linux distros. Contributed by the Debichem team.
  • homebrew formula for building on the Mac. Contributed by Eddie Cao.
  • recipes for building using the excellent conda package manager. Contributed by Riccardo Vianello.
  • APKs for Alpine Linux. Contributed by da Verona.
  • Wheels at PyPI for all major platforms and Python versions. Contributed by Christopher Kuenneth.

Projects using RDKit

  • ROBERT - Automated Machine Learning Protocols
  • AQME - Automated Quantum Mechanical Environment
  • chemprop - message passing neural networks for molecular property prediction
  • RMG - Reaction Mechanism Generator
  • RDMC - Reaction Data and Molecular Conformers - package for dealing with reactions, molecules, conformers, mainly in 3D
  • pychemprojections - python library for visualizing various 2D projections of molecules.
  • pychemovality - python library for estimating the ovality of molecules.
  • ChEMBL Structure Pipeline - ChEMBL protocols used to standardise and salt strip molecules.
  • FPSim2 - Simple package for fast molecular similarity searches.
  • Datamol (docs, repo) - A Python library to intuitively manipulate molecules.
  • Scopy (docs, paper) - an integrated negative design Python library for desirable HTS/VS database design
  • stk (docs, paper) - a Python library for building, manipulating, analyzing and automatic design of molecules.
  • gpusimilarity - A Cuda/Thrust implementation of fingerprint similarity searching
  • Samson Connect - Software for adaptive modeling and simulation of nanosystems
  • mol_frame - Chemical Structure Handling for Dask and Pandas DataFrames
  • RDKit.js - The official JavaScript release of RDKit
  • DeepChem - python library for deep learning for chemistry
  • mmpdb - Matched molecular pair database generation and analysis
  • CheTo (paper)- Chemical topic modeling
  • OCEAN (paper)- Optimized cross reactivity estimation
  • ChEMBL Beaker - standalone web server wrapper for RDKit and OSRA
  • ZINC - Free database of commercially-available compounds for virtual screening
  • sdf_viewer.py - an interactive SDF viewer
  • sdf2ppt - Reads an SDFile and displays molecules as image grid in powerpoint/openoffice presentation.
  • MolGears - A cheminformatics tool for bioactive molecules
  • PYPL - Simple cartridge that lets you call Python scripts from Oracle PL/SQL.
  • shape-it-rdkit - Gaussian molecular overlap code shape-it (from silicos it) ported to RDKit backend
  • WONKA - Tool for analysis and interrogation of protein-ligand crystal structures
  • OOMMPPAA - Tool for directed synthesis and data analysis based on protein-ligand crystal structures
  • OCEAN - web-tool for target-prediction of chemical structures which uses ChEMBL as datasource
  • chemfp - very fast fingerprint searching
  • rdkit_ipynb_tools - RDKit Tools for the IPython Notebook
  • Vernalis KNIME nodes
  • Erlwood KNIME nodes
  • AZOrange

License

Code released under the BSD license.

knime-rdkit's People

Contributors

bernd-wiswedel, chaubold, gab1one, greglandrum, manuelschwarze, mrberthold, peterohl, ptosco, thorsten-meinl-knime


knime-rdkit's Issues

Log4j Security Vulnerability

The RDKit nodes plugin makes use of the OPSIN library, which has a dependency on log4j 2.14.1 in our current RDKit nodes version. The dependency is somewhat hidden, because we built the OPSIN library into a single JAR file that bundles all dependencies. I raised an issue in the OPSIN project yesterday, and Dan fixed it immediately by updating to log4j 2.15.1. We should get that update into the RDKit nodes ASAP for the nightly build, and should also consider releasing it to KNIME 4.3, 4.4 and 4.5. @greglandrum, I will require your code review and approval.

RDKit Molecule Extractor Node Cannot Handle Single Molecule

The RDKit Molecule Extractor Node splits a set of molecules into individual structures, but this only works if the SDF or SMILES contains two or more structures. If there's only one structure in the SDF or SMILES the RDKit Molecule Extractor generates no output structure at all resulting for instance in a blank table. This is a bug. It should in that case output the only structure that was found.
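
To make the expected behavior concrete, here is a minimal Python sketch of record splitting that also returns a result for a single structure. It is an illustration only, not the node's implementation: the function name split_sdf_records is hypothetical, and it assumes SDF input whose records end with a line containing only "$$$$".

```python
def split_sdf_records(sdf_text):
    """Split SDF text into individual records on the '$$$$' delimiter line.

    Hypothetical helper for illustration; not part of the RDKit nodes.
    """
    records = []
    current = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # End of a record: keep it only if it has some content.
            if any(ln.strip() for ln in current):
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # A trailing record without a closing '$$$$' (for example a single
    # mol block) is still returned rather than silently dropped.
    if any(ln.strip() for ln in current):
        records.append("\n".join(current))
    return records
```

With this approach an input holding exactly one structure yields a list with one record, and an empty input yields an empty list.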

This was reported in the RDKit forum: http://tech.knime.org/forum/rdkit/rdkit-molecule-extractor-returns-empty-field-if-only-one-structure-present

BRICS/RECAP decomposition node

A node that uses either the BRICS or RECAP rules to fragment a molecule.
We'll need to add RECAP definitions to the RDKit first

An optional input port could be a table with a list of SMARTS patterns defining bonds to use for fragmentation. These could then be used to construct a call to fragmentOnBonds(), assuming that that is useable from Java.

Allow reaction sanitization

The backend supports it (need to verify that this is true in Java), so we should allow it (as an option) in the reaction nodes too.

Functional Group Filter: SMARTS pattern issues / bugs

There seem to be issues with the halogen patterns. I observed the first one some time ago: it seems CF3 is not filtered out. Example:

CC12C3C11CCC2OC3(C)CC1C(F)(F)F

(All examples are from GDB-17)

If I set F to < 3, I expect these molecules to be filtered out.

The second problem is related to bromine but would probably also affect all other halogens. In this specific setting I have halogens set to 0 and non-fluorine halogens also set to 0. Yes, that is redundant (configurable subnode), but it may affect the issue.

molecules passing this filter:

BrC1=NC2=C(NS1(=O)=O)SC(=C2)C#C

BrC1=NC=C2N3C(=CS2(=O)=O)C=COC=C13

Many more from GDB-17

So there seems to be an issue with the SMARTS pattern. I'm especially confused by the complexity of the top halogen pattern. Couldn't that just be [F,Cl,Br,I]? Or what am I missing?

MCS node breaking up rings

I've noticed some (I think) erroneous behaviour in the MCS node. In short it will, under certain circumstances, break open rings even when told not to.

I've attached a zipped-up KNIME workflow to demonstrate the problem. It shows how mining a fuzzy MCS (in terms of element type) from a set of molecules results in a SMARTS pattern containing aromatic bonds that no longer describe a ring. This is the SMARTS I'm getting:

[#6]-[#6,#7]-[#7,#6]1-[#6]-[#6]-[#7](-[#6]-[#6]-1)-[#6]1:[#6,#7]:[#6](:[#6]:[#6]:[#6]:[#6]):[#7,#6]:[#6]:[#7]:1

The input SMILES strings are:

FC(F)(F)C1=CC=CC=C1C1=CC=C2C(=C1)N=CN=C2N1CCN(CC1)C(=O)C=C
OCC#CC(=O)N1CCN(CC1)C1=C2C=C(Cl)C(=CC2=NC=N1)C1=CC=C(F)C=C1F
ClC1=CC=CC=C1C1=C2N=C(N=CC2=CC=C1)N1CCN(CC1)C(=O)C=C
ClC1=CC=C2C(=C1)N=CN=C2N1CCN(C(C1)C#N)C(=O)C=C
C=CC(=O)N1CCN(CC1C#N)C1=C2C=CC=CC2=NC=N1
ClC1=CC=C2C(=C1)N=CN=C2N1CCN(C(C1)C#N)C(=O)C=C
C=CC(=O)N1CCN(CC1)C1=C2C=CC(=CC2=NC=N1)C1=CC=CC=C1
ClC1=C(C=C2N=CN=C(N3CCC(CC3)NC(=O)C=C)C2=C1)C1=CC=CC=C1

The node configuration is as follows:

Threshold: 1.0
Ring matches ring only: checked
Complete rings only: checked
Match valences: unchecked
Atom comparisons: "Compare Any"
Bond comparisons: "Compare Order"
Timeout: 300

The node doesn't time out, but I've seen this behaviour more often when it does.

I would think this behaviour is a bug and is definitely undesired for me, since I'm using this node to identify scaffolds. Instead of breaking bonds, I would have expected the node to produce a SMARTS pattern that just doesn't include the atoms from the broken ring.

mcs_rings.zip

fix link to RDKit book in reaction nodes

it's currently to http://rdkit.svn.sourceforge.net/viewvc/rdkit/trunk/Docs/Book/RDKit_Book.pdf
but should be to http://rdkit.org/docs/RDKit_Book.html#reaction-smarts

Trying to understand the RDKit Aromatizer node and the comparison to 0 to determine success

I'm trying to understand an existing RDKit-based KNIME workflow. Having used RDKit in Python previously, I can't understand the RDKit Aromatizer node. It calls .setAromaticity(temp), where temp is an RDKit molecule, and then compares the result to 0 to determine success, but what is happening here? SetAromaticity in the C++ API, which the Java KNIME code wraps, returns void, and in Python it returns None. What does the comparison to 0 do here?

Line I'm referencing:

RDKit 2D Depiction option missing for formats in preferred render settings

I can set RDKit 2D depictions as preferred renderer for smiles and sdf but not for molfile or inchi.

For molfile columns in the table view I can switch to rdkit.
For inchi columns rdkit is not an option also in table view.

Can both of the formats get rdkit as preferred renderer?
(inchi being less important, but sometimes it can be good to have a structure instead of a string)

Output Column Naming and Append vs replace

This is most likely very subjective, but I always "struggle" with the configuration of the RDKit nodes, namely the output column name of the changed molecule. The actual cases where I do NOT want to simply replace the existing column are rare. So I end up having to check "Remove Source Column" and then change the output column name back to the original name. This still leaves the issue that the column order needs to be fixed afterwards. It's just cumbersome and not very neat/clean.

I much prefer the way the indigo nodes handle this with the Append column check box.


I simply need to uncheck the box and it's done (ideally it should be unchecked by default). It also actually replaces the column and keeps column order intact.

The suggestion hence is to do it like the indigo nodes but have "append column" unchecked by default. This would, in my opinion, greatly increase the ease of use of the RDKit nodes for common use cases.

BRICS molecule builder node

This would really be a fragment-based molecule builder node, but the first rule set would be the BRICS rules.

The required C++ functionality for this may not yet be present; I'm going to need to look into that.

Provide better error messages when RDKit nodes fail completely or by row

Steve Roughley, Vernalis, has suggested the following interesting idea as it was successfully implemented already in the Vernalis nodes:

In the spirit of OSS/sharing etc I thought that this latest commit to the Vernalis code might be of interest:

vernalis/vernalis-knime-nodes@ebc1d9f

In particular this file - https://github.com/vernalis/vernalis-knime-nodes/blob/master/com.vernalis.knime.chem.core/src/com/vernalis/knime/chem/rdkit/RDKitRuntimeExceptionHandler.java

Basically this provides a mechanism for handling RDKit exceptions for either the new (4.1) or older versions of the RDKit Types plugin. Typical usage:

    try {
        // Do something throwing e.g. GenericRDKitException or MolSanitizeException…
    } catch (MolSanitizeException e) {
        // Just re-throw, but message is available via standard #getMessage() call
        throw new RDKitRuntimeExceptionHandler(e);
    } catch (GenericRDKitException e) {
        // Just log the message…
        getLogger().warn(new RDKitRuntimeExceptionHandler(e).getMessage());
    }

Hope that’s of use…
Steve

I think this is a very good idea, but would require some amount of refactoring as we have to go through all RDKit nodes code and rework the error handling and automated tests - that would be a looong journey.

Steve had another idea: maybe we can make #getMessage() work in the wrappers in RDKit, which would solve the problem once and for all. Then we can simply rely on having a meaningful message in all exceptions coming from RDKit.

If I understand it correctly, that would mean to improve the SWIG mechanism that creates the wrapper classes... - It will be more on the RDKit side, not so much on the RDKit Nodes side.

Pre-condition Violation Number of atom mismatch Violation occurred on line 564 in file Code\GraphMol\ROMol.cpp Failed Expression: conf->getNumAtoms() == this->getNumAtoms() RDKIT: 2017.09.1 BOOST: 1_56

Hi,

When using Python Script node in Knime after the RDKit R-group Decomposition node, I am getting the following error message.

ERROR Python Script 0:54 Execute failed: Pre-condition Violation
Number of atom mismatch
Violation occurred on line 564 in file Code\GraphMol\ROMol.cpp
Failed Expression: conf->getNumAtoms() == this->getNumAtoms()
RDKIT: 2017.09.1
BOOST: 1_56

The python script only contains the following lines of code

from rdkit import Chem
from rdkit.Chem import AllChem
import pandas as pd

mols = input_table_1['R1']
output_table_1 = input_table_1.copy()

where R1 is generated by the RDkit R-group decomposition node.

If I feed the structures prior to R-group decomposition directly to the python script then it works just fine indicating that the Rgroup decomposition node output is likely the source of this error.

The RDKit R-Group Decomposition node configuration was provided as two screenshots in the original report.

Knime: 4.5.0
Python: 2.7.18 installed as part of conda environment
OS: Windows 10



Reaction nodes: support "chain reactions"

This would involve adding two new options:

  • bool allowChainReactions=false
  • unsigned maxChainReactionProducts=10

This would only work for reactions that have a single product.

If allowChainReactions is true, and the product of the reaction is a possible reactant (as either reactant 1 or reactant 2 for the two-component reaction), then the reaction is repeated with the product as a reactant. This repeats recursively until the product is no longer a possible reactant or maxChainReactionProducts is reached.
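
The proposed loop can be sketched in a few lines of Python. This is only an illustration of the control flow under stated assumptions: run_chain_reaction and toy_react are hypothetical names, the reaction is modeled as a callable that returns the single product or None when its input is not a possible reactant, and the recursion described above is written as an equivalent iteration.

```python
from typing import Callable, List, Optional


def run_chain_reaction(
    start: str,
    react: Callable[[str], Optional[str]],
    allow_chain_reactions: bool = False,
    max_chain_reaction_products: int = 10,
) -> List[str]:
    """Apply a single-product reaction, optionally feeding each product back in.

    `react` is a hypothetical stand-in for a reaction: it returns the product,
    or None if the input cannot act as a reactant.
    """
    products: List[str] = []
    current = start
    while len(products) < max_chain_reaction_products:
        product = react(current)
        if product is None:
            break  # the current molecule is no longer a possible reactant
        products.append(product)
        if not allow_chain_reactions:
            break  # default behavior: a single application only
        current = product
    return products


def toy_react(s: str) -> Optional[str]:
    """Toy 'reaction' on strings: turn the first 'A' into 'B', if there is one."""
    return s.replace("A", "B", 1) if "A" in s else None
```

With the toy reaction, run_chain_reaction("AAA", toy_react, allow_chain_reactions=True) keeps going until no "A" is left, while the default call stops after the first product. Whether maxChainReactionProducts should count products or reaction steps is a design choice the proposal leaves open; the sketch counts products.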

Add Conformers Node: multi-threading

A great enhancement would be to make this node multi-threaded. Currently it only uses one thread. With an 8-core CPU it could hence be nearly 8 times faster.

NumHeavyAtoms in Descriptor Calculation node wrong - Edge Case?

There seems to be an edge case in the calculation of the number of heavy atoms when a SMILES contains attachment points [*] and hydrogens in brackets. The table below lists the SMILES, NumHeavyAtoms, and MQN12 (which is also a heavy atom count), both taken from the Descriptor Calculation node. The two values differ: MQN12 is right while NumHeavyAtoms is wrong.

SMILES | NumHeavyAtoms | MQN12
[H] | 1 | 0
[H][*] | 2 | 0
[H]C#CC[*] | 4 | 3
C[*] | 2 | 1
CC[*] | 3 | 2
OCC[*] | 4 | 3

So it seems anything in brackets is counted as a heavy atom even if it is a hydrogen or an attachment point / any atom (the attachment point as [*] is copied like this from ChemDraw). At the same time MQN12 gets it right in all cases.
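
One reading that reproduces both columns is sketched below in plain Python, without calling RDKit. The atomic-number lists are an assumption about which atoms remain in the parsed molecule (a hydrogen bonded to carbon is merged into it, while an isolated hydrogen or one bonded only to [*] is kept, and [*] has atomic number 0). Under that assumption, counting every remaining atom gives the reported NumHeavyAtoms values, and counting only atoms with atomic number above 1 gives the MQN12 values. The function names are hypothetical.

```python
# Assumed atomic numbers of the atoms left in each parsed molecule
# (0 = attachment point [*], 1 = hydrogen). These lists are an assumption
# chosen to be consistent with the values reported in the issue.
ASSUMED_ATOMS = {
    "[H]": [1],
    "[H][*]": [1, 0],
    "[H]C#CC[*]": [6, 6, 6, 0],
    "C[*]": [6, 0],
    "CC[*]": [6, 6, 0],
    "OCC[*]": [8, 6, 6, 0],
}


def count_all_atoms(atomic_numbers):
    """Count every atom in the graph, including hydrogens and dummy atoms."""
    return len(atomic_numbers)


def count_heavy_atoms(atomic_numbers):
    """Count only atoms heavier than hydrogen (atomic number > 1)."""
    return sum(1 for z in atomic_numbers if z > 1)


for smiles, atoms in ASSUMED_ATOMS.items():
    print(smiles, count_all_atoms(atoms), count_heavy_atoms(atoms))
```

If this reading is right, the fix would be to filter on atomic number rather than to count all atoms, but that would need to be confirmed against the node's source.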

Not sure this is the right place to post as it probably is an issue within RDKit itself and not the KNIME node?

RDKit Rooted Fingerprints Have Wrong Results in KNIME

We have discovered that rooted fingerprints in KNIME differ from those calculated in Python and pure Java. The reason we found is a bug in how the atom list for the rooting feature is created before it is passed to the RDKit native code.

Paolo Tosco's Java code created it like this, as example with atom index 17:

UInt_Vect atomList = new UInt_Vect(1);
atomList.set(0, atom);

This is correct. It results in [ 17 ]

I had created it many years ago like this:

atomList = new UInt_Vect(1);
atomList.add(atom);

This is incorrect. It results in [ 0, 17 ], where "0" could also be something that is not well-defined.
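
The difference between the two snippets can be mimicked with a plain Python list. This is an analogy, not the Java wrapper itself: it assumes, as the results above imply, that new UInt_Vect(1) creates a vector that already holds one element, so set(0, …) overwrites that element while add(…) appends a second one. The function names are hypothetical.

```python
def sized_then_set(atom):
    """Mimic: new UInt_Vect(1) followed by atomList.set(0, atom)."""
    vec = [0] * 1   # the vector starts with one (zero-valued) element
    vec[0] = atom   # overwrite that element
    return vec


def sized_then_add(atom):
    """Mimic: new UInt_Vect(1) followed by atomList.add(atom)."""
    vec = [0] * 1     # the vector starts with one (zero-valued) element
    vec.append(atom)  # append a second element; the first one stays
    return vec


print(sized_then_set(17))  # [17]
print(sized_then_add(17))  # [0, 17]
```

In the native vector the pre-existing element is not guaranteed to be 0, which matches the "not well-defined" remark above.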

In addition to this code, other buggy code also related to the same feature was discovered in InputDataObject class, which processes again such UInt_Vect objects and has the same bug.

Consequences of this bug:

InputDataInfo.getRDKitIntegerVector(DataRow row): the result was always double the size that it should be, and the first half was filled with 0 (or undefined values).

RDKit Highlighting Node: this should always have led to the atom at index 0 being colored, but for some reason that did not happen, so there is no problem that a user would actually see.
RDKit Highlighting Atoms Node (deprecated): influence on SVG creation, but no problem that a user would actually see.
InputDataInfo.getRDKitUIntegerVector(DataRow row): the result was always double the size that it should be, and the first half was filled with 0 (or undefined values).

RDKit Fingerprint Nodes: rooted fingerprints (Morgan, FeatMorgan, AtomPair, Torsion, RDKit, Layered) always included the atom index 0 (or an undefined value).
AbstractFingerprintNodeModel.$AbstractRDKitCellFactory.process(…): when a single atom was selected for rooting, it always used the atom at index 0 as well as the selected one, so it used 2 atoms instead of a single one. Every fingerprint calculated with rooted atoms therefore currently has wrong results: more bits are set in the fingerprint than there would be without the wrongly added atom index.

Tracked at Novartis as KNIME-1794

Knime RDKit nodes "RDKit Canon Smiles" and "RDKit to InChI" are crashing Knime 4.4.4

Discussed in rdkit/rdkit#5227

Originally posted by ddsneo4j April 22, 2022
Hi,

unfortunately, the RDKit nodes "RDKit Canon Smiles" and "RDKit to InChI" are crashing Knime 4.4.4 - see attached Knime workflow and input structure. Could this bug please be fixed, i.e. Knime should not be crashed by these nodes because of a structure where canonical smiles and InChI keys cannot be created for?

Thanks for your effort in advance.

Best regards,
Dan
RdKit.zip

New Node: Generating an SVG column as rendering

This node would function like the highlighting node, but without the highlighting. The new node should be streamable.

Optionally, we could provide PNG rendering as well from it (need to figure out how to do this) based on a given size.

What fingerprint settings does the RDKit Molecule Substructure Filter use?

I'm trying to figure out what fingerprint settings the RDKit Molecule Substructure Filter node uses when it pre-calculates fingerprints. I looked at the documentation of the node on KNIME Hub, but nothing there specifies the fingerprint settings. Does it just use the RDKit defaults, i.e. the following?

The default set of parameters used by the fingerprinter is:

  • minimum path size: 1 bond
  • maximum path size: 7 bonds
  • fingerprint size: 2048 bits
  • number of bits set per hash: 2
  • minimum fingerprint size: 64 bits
  • target on-bit density: 0.0

SMILES Canonicalization With Very Long SMILES Values Crashed KNIME

A colleague found that KNIME crashes because of an RDKit crash that is caused by very long SMILES values passed into the RDKit Canonicalize SMILES node. The SMILES in which this occurred link together Cyclooctine. The correct behavior would be to generate an error for such a SMILES if the atom count is too large, but it should of course not crash.

Rendering defaults and render options

I like having rdkit as a rendering option in the preferred renderers. However, the render settings used by default, which I can't seem to change, are pretty bad in my opinion. There is too much of a margin in the vertical space, making structures way too small.

The structure could be rendered a lot bigger and get rid of the huge amount of white space. This should be solved in general by changing the defaults.
(Indigo has the opposite problem of too small margins)

On top of this, exposing as many rendering options as possible would also be great, especially those regarding the sizing of the structures, bond thickness and label size. Right now none of these can be controlled the way they can with Marvin and, to a lesser degree, with indigo.

