Code Monkey home page Code Monkey logo

uclcheminformatics / scaffoldgraph Goto Github PK

View Code? Open in Web Editor NEW
158.0 9.0 38.0 800 KB

ScaffoldGraph is an open-source cheminformatics library, built using RDKit and NetworkX, for the generation and analysis of scaffold networks and scaffold trees.

Home Page: https://doi.org/10.1093/bioinformatics/btaa219

License: MIT License

Shell 0.19% Python 99.81%
rdkit networkx scaffold-networks scaffold-trees cheminformatics murcko fragments molecules scaffolds hiers

scaffoldgraph's Introduction

Conda Anaconda Release Build Status Contributing License: MIT DOI

⌬ ScaffoldGraph ⌬

ScaffoldGraph is an open-source cheminformatics library, built using RDKit and NetworkX, for the generation and analysis of scaffold networks and scaffold trees.

Features | Installation | Quick-start | Examples | Contributing | References | Citation

Features

  • Scaffold Network generation (Varin, 2011)
    • Explore scaffold-space through the iterative removal of available rings, generating all possible sub-scaffolds for a set of input molecules. The output is a directed acyclic graph of molecular scaffolds
  • HierS Network Generation (Wilkens, 2005)
    • Explore scaffold-space through the iterative removal of available rings, generating all possible sub-scaffolds without dissecting fused ring-systems
  • Scaffold Tree generation (Schuffenhauer, 2007)
    • Explore scaffold-space through the iterative removal of the least-characteristic ring from a molecular scaffold. The output is a tree of molecular scaffolds
  • Murcko Fragment generation (Bemis, 1996)
    • Generate a set of murcko fragments for a molecule through the iterative removal of available rings.
  • Compound Set Enrichment (Varin, 2010, 2011)
    • Identify active chemical series from primary screening data

Comparison to existing software

  • Scaffold Network Generator (SNG) (Matlock 2013)
  • Scaffold Hunter (SH) (Wetzel, 2009)
  • Scaffold Tree Generator (STG) (SH CLI predecessor)
SG SNG SH STG
Computes Scaffold Networks X X - -
Computes HierS Networks X - - -
Computes Scaffold Trees X X X X
Command Line Interface X X - X
Graphical Interface - * - X -
Accessible Library X - - -
Results can be computed in parallel X X - -
Benchmark for 150,000 molecules ** 15m 25s 27m 6s - -
Limit on input molecules N/A *** 10,000,000 200,000 **** 10,000,000

* While ScaffoldGraph has no explicit GUI, it contains functions for interactive scaffoldgraph visualization.

** Tests performed on an Intel Core i7-6700 @ 3.4 GHz with 32GB of RAM, without parallel processing. I could not find the code for STG and do not intend to search for it, SNG report that both itself and SH are both faster in the benchmark test.

*** Limited by available memory

**** Graphical interface has an upper limit of 2,000 scaffolds


Installation

  • ScaffoldGraph currently supports Python 3.6 and above.

Install with conda (recommended)

conda config --add channels conda-forge
conda install -c uclcheminformatics scaffoldgraph

Install with pip

# Basic installation.
pip install scaffoldgraph

# Install with ipycytoscape.
pip install scaffoldgraph[vis]

# Install with rdkit-pypi (Linux, MacOS).
pip install scaffoldgraph[rdkit]

# Install with all optional packages. 
pip install scaffoldgraph[rdkit, vis]

Warning: rdkit cannot be installed with pip, so must be installed through other means

Update (17/06/21): rdkit can now be installed through the rdkit-pypi wheels for Linux and MacOS, and can be installed alongside ScaffoldGraph optionally (see above instructions).

Update (16/11/21): Jupyter lab users may also need to follow the extra installation instructions here / here when using the ipycytoscape visualisation utility.


Quick Start

CLI usage

The ScaffoldGraph CLI is almost analogous to SNG consisting of a two step process (Generate --> Aggregate).

ScaffoldGraph can be invoked from the command-line using the following command:

$ scaffoldgraph <command> <input-file> <options>

Where "command" is one of: tree, network, hiers, aggregate or select.

  • Generating Scaffold Networks/Trees

    The first step of the process is to generate an intermediate scaffold graph. The generation commands are: network, hiers and tree

    For example, if a user would like to generate a network from two files:

    $ ls
    file_1.sdf  file_2.sdf

    They would first use the commands:

    $ scaffoldgraph network file_1.sdf file_1.tmp
    $ scaffoldgraph network file_2.sdf file_2.tmp

    Further options:

    --max-rings, -m : ignore molecules with # rings > N (default: 10)
    --flatten-isotopes -i : remove specific isotopes
    --keep-largest-fragment -f : only process the largest disconnected fragment
    --discharge-and-deradicalize -d : remove charges and radicals from scaffolds 
    
  • Aggregating Scaffold Graphs

    The second step of the process is aggregating the temporary files into a combined graph representation.

    $ scaffoldgraph aggregate file_1.tmp file_2.tmp file.tsv

    The final network is now available in 'file.tsv'. Output formats are explained below.

    Further options:

    --map-mols, -m  <file>   : generate a file mapping molecule IDs to scaffold IDs 
    --map-annotations <file> : generate a file mapping scaffold IDs to annotations
    --sdf                    : write the output as an SDF file
    
  • Selecting Subsets

    ScaffoldGraph allows a user to select a subset of a scaffold network or tree using a molecule-based query, i.e. selecting only scaffolds for molecules of interest.

    This command can only be performed on an aggregated graph (Not SDF).

    $ scaffoldgraph select <graph input-file> <input molecules> <output-file> <options>

    Options:

    <graph input-file>   : A TSV graph constructed using the aggregate command
    <input molecules>    : Input query file (SDF, SMILES)
    <output-file>        : Write results to specified file
    --sdf                : Write the output as an SDF file
    
  • Input Formats

    ScaffoldGraphs CLI utility supports input files in the SMILES and SDF formats. Other file formats can be converted using OpenBabel.

    • Smiles Format:

    ScaffoldGraph expects a delimited file where the first column defines a SMILES string, followed by a molecule identifier. If an identifier is not specified the program will use a hash of the molecule as an identifier.

    Example SMILES file:

    CCN1CCc2c(C1)sc(NC(=O)Nc3ccc(Cl)cc3)c2C#N   CHEMBL4116520
    CC(N1CC(C1)Oc2ccc(Cl)cc2)C3=Nc4c(cnn4C5CCOCC5)C(=O)N3   CHEMBL3990718
    CN(C\C=C\c1ccc(cc1)C(F)(F)F)Cc2coc3ccccc23  CHEMBL4116665
    N=C1N(C(=Nc2ccccc12)c3ccccc3)c4ccc5OCOc5c4  CHEMBL4116261
    ...
    
    • SDF Format:

    ScaffoldGraph expects an SDF file, where the molecule identifier is specified in the title line. If the title line is blank, then a hash of the molecule will be used as an identifier.

    Note: selecting subsets of a graph will not be possible if a name is not supplied

  • Output Formats

    • TSV Format (default)

    The generate commands (network, hiers, tree) produce an intermediate tsv containing 4 columns:

    1. Number of rings (hierarchy)
    2. Scaffold SMILES
    3. Sub-scaffold SMILES
    4. Molecule ID(s) (top-level scaffolds (Murcko))

    The aggregate command produces a tsv containing 4 columns

    1. Scaffold ID
    2. Number of rings (hierarchy)
    3. Scaffold SMILES
    4. Sub-scaffold IDs
    • SDF Format

    An SDF file can be produced by the aggregate and select commands. This SDF is formatted according to the SDF specification with added property fields:

    1. TITLE field = scaffold ID
    2. SUBSCAFFOLDS field = list of sub-scaffold IDs
    3. HIERARCHY field = number of rings
    4. SMILES field = scaffold canonical SMILES

Library usage

ScaffoldGraph makes it simple to construct a graph using the library API. The resultant graphs follow the same API as a NetworkX DiGraph.

Some example notebooks can be found in the 'examples' directory.

import scaffoldgraph as sg

# construct a scaffold network from an SDF file
network = sg.ScaffoldNetwork.from_sdf('my_sdf_file.sdf')

# construct a scaffold tree from a SMILES file
tree = sg.ScaffoldTree.from_smiles('my_smiles_file.smi')

# construct a scaffold tree from a pandas dataframe
import pandas as pd
df = pd.read_csv('activity_data.csv')
network = sg.ScaffoldTree.from_dataframe(
    df, smiles_column='Smiles', name_column='MolID',
    data_columns=['pIC50', 'MolWt'], progress=True,
)

Advanced Usage

  • Multi-processing

    It is simple to construct a graph from multiple input source in parallel, using the concurrent.futures module and the sg.utils.aggregate function.

    from concurrent.futures import ProcessPoolExecutor
    from functools import partial
    import scaffoldgraph as sg
    import os
        
    directory = './data'
    sdf_files = [f for f in os.listdir(directory) if f.endswith('.sdf')]
        
    func = partial(sg.ScaffoldNetwork.from_sdf, ring_cutoff=10)
          
    graphs = []
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = executor.map(func, sdf_files)
        for future in futures:
            graphs.append(future)
          
    network = sg.utils.aggregate(graphs)
  • Creating custom scaffold prioritisation rules

    If required a user can define their own rules for prioritizing scaffolds during scaffold tree construction. Rules can be defined by subclassing one of four rule classes:

    BaseScaffoldFilterRule, ScaffoldFilterRule, ScaffoldMinFilterRule or ScaffoldMaxFilterRule

    When subclassing a name property must be defined and either a condition, get_property or filter function. Examples are shown below:

    import scaffoldgraph as sg
    from scaffoldgraph.prioritization import *
      
    """
    Scaffold filter rule (must implement name and condition)
    The filter will retain all scaffolds which return a True condition
    """
    
    class CustomRule01(ScaffoldFilterRule):
        """Do not remove rings with >= 12 atoms if there are smaller rings to remove"""
    
        def condition(self, child, parent):
            removed_ring = child.rings[parent.removed_ring_idx]
            return removed_ring.size < 12
              
        @property
        def name(self):
            return 'custom rule 01'
            
    """
    Scaffold min/max filter rule (must implement name and get_property)
    The filter will retain all scaffolds with the min/max property value
    """
      
    class CustomRule02(ScaffoldMinFilterRule):
        """Smaller rings are removed first"""
      
        def get_property(self, child, parent):
            return child.rings[parent.removed_ring_idx].size
              
        @property
        def name(self):
            return 'custom rule 02'
          
        
    """
    Scaffold base filter rule (must implement name and filter)
    The filter method must return a list of filtered parent scaffolds
    This rule is used when a more complex rule is required, this example
    defines a tiebreaker rule. Only one scaffold must be left at the end
    of all filter rules in a rule set
    """
      
    class CustomRule03(BaseScaffoldFilterRule):
        """Tie-breaker rule (alphabetical)"""
      
        def filter(self, child, parents):
            return [sorted(parents, key=lambda p: p.smiles)[0]]
      
        @property
        def name(self):
            return 'custom rule 03'  

    Custom rules can subsequently be added to a rule set and supplied to the scaffold tree constructor:

    ruleset = ScaffoldRuleSet(name='custom rules')
    ruleset.add_rule(CustomRule01())
    ruleset.add_rule(CustomRule02())
    ruleset.add_rule(CustomRule03())
     
    graph = sg.ScaffoldTree.from_sdf('my_sdf_file.sdf', prioritization_rules=ruleset)

Contributing

Contributions to ScaffoldGraph will most likely fall into the following categories:

  1. Implementing a new Feature:
    • New Features that fit into the scope of this package will be accepted. If you are unsure about the idea/design/implementation, feel free to post an issue.
  2. Fixing a Bug:
    • Bug fixes are welcomed, please send a Pull Request each time a bug is encountered. When sending a Pull Request please provide a clear description of the encountered bug. If unsure feel free to post an issue

Please send Pull Requests to: http://github.com/UCLCheminformatics/ScaffoldGraph

Testing

ScaffoldGraphs testing is located under test/. Run all tests using:

$ python setup.py test

or run an individual test: pytest --no-cov tests/core

When contributing new features please include appropriate test files

Continuous Integration

ScaffoldGraph uses Travis CI for continuous integration


References

  • Bemis, G. W. and Murcko, M. A. (1996). The properties of known drugs. 1. molecular frameworks. Journal of Medicinal Chemistry, 39(15), 2887–2893.
  • Matlock, M., Zaretzki, J., Swamidass, J. S. (2013). Scaffold network generator: a tool for mining molecular structures. Bioinformatics, 29(20), 2655-2656
  • Schuffenhauer, A., Ertl, P., Roggo, S., Wetzel, S., Koch, M. A., and Waldmann, H. (2007). The scaffold tree visualization of the scaffold universe by hierarchical scaffold classification. Journal of Chemical Information and Modeling, 47(1), 47–58. PMID: 17238248.
  • Varin, T., Schuffenhauer, A., Ertl, P., and Renner, S. (2011). Mining for bioactive scaffolds with scaffold networks: Improved compound set enrichment from primary screening data. Journal of Chemical Information and Modeling, 51(7), 1528–1538.
  • Varin, T., Gubler, H., Parker, C., Zhang, J., Raman, P., Ertl, P. and Schuffenhauer, A. (2010) Compound Set Enrichment: A Novel Approach to Analysis of Primary HTS Data. Journal of Chemical Information and Modeling, 50(12), 2067-2078.
  • Wetzel, S., Klein, K., Renner, S., Rennerauh, D., Oprea, T. I., Mutzel, P., and Waldmann, H. (2009). Interactive exploration of chemical space with scaffold hunter. Nat Chem Biol, 1875(8), 581–583.
  • Wilkens, J., Janes, J. and Su, A. (2005). HierS:  Hierarchical Scaffold Clustering Using Topological Chemical Graphs. Journal of Medicinal Chemistry, 48(9), 3182-3193.

Citation

If you use this software in your own work please cite our paper, and the respective papers of the methods used.

@article{10.1093/bioinformatics/btaa219,
    author = {Scott, Oliver B and Chan, A W Edith},
    title = "{ScaffoldGraph: an open-source library for the generation and analysis of molecular scaffold networks and scaffold trees}",
    journal = {Bioinformatics},
    year = {2020},
    month = {03},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa219},
    url = {https://doi.org/10.1093/bioinformatics/btaa219},
    note = {btaa219}
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaa219/32984904/btaa219.pdf},
}

scaffoldgraph's People

Contributors

oliverbscott avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

scaffoldgraph's Issues

[Q] Aromatic bonds removal

❓ Questions & Help

Hi!
Thank you for this really useful and great module! But there is a problem.
When generating scaffold tree, aromatic bonds are removed. Therefore, it is rather complicated to work with the structures returned. Is there a way how to solve this problem?

Potential bugs in dealing with bridged rings

When dealing with bridged rings using ScaffoldNetwork, the output is different from a similar structure with an aliphatic ring. Some nodes are just missing from the network.

A question about visualizing scaffold networks[Q]

❓ Questions & Help

When I using jupyter lab run relation codes to create and visualize a scaffold network, it could not draw the picture like follows:
Snipaste_2021-11-16_15-32-28
Could give me some advices to solve this problem? Thank you!

[Q] SMILES from frags

How can i get smiles from frags in basic function:

scaffold = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(mol)
frags = sg.get_next_murcko_fragments(scaffold)
Draw.MolsToGridImage(frags)

[Q] hierarchy representations

❓ Questions & Help

Hi,

Thanks for this so useful library...I have a doubt, when you are building a hierarchy representation, each bar represent the times of one scaffold is repeated in a dataset??? For example in your example: https://github.com/UCLCheminformatics/ScaffoldGraph/blob/master/examples/scaffold_graphs.ipynb for a network representation. How I could the molecule for earch hierarchy class? I want Know the scaffolds most repeated in the dataset.

Thanks!

Pablo

about the visualization utils

❓ Questions & Help

Hello~
When I have installed the ipycytoscape module today and updated scaffoldgraph, it seems I still have problem in importing cytoscape from the vis uilt. The error proved to be ' no module named scaffoldgraph.vis.'. I have checked packages in my envs and found there seems truly no vis module. Is there anything wrong with my implementation?
(Maybe I'll choose to use ipycytoscape directly.

[Q]Can not read SMILES file

this command "scaffoldgraph network biomol.smi biomol.tmp " give an error as shown below.

scaffold-graph: 15:21:29 19315 INFO: Generating Scaffold Network...
scaffold-graph: 15:21:29 19315 CRITICAL: Unknown error: Python argument types in
SmilesMolSupplier.init(SmilesMolSupplier, str, int, int, int, bool, bool)
did not match C++ signature:
init(_object*)
init(_object*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > data, std::__cxx11::basic_string<char, std::char_traits, std::allocator > delimiter=' ', int smilesColumn=0, int nameColumn=1, bool titleLine=True, bool sanitize=True)
scaffold-graph: 15:21:29 19315 INFO: Exiting scaffoldgraph...

But we can read "biomol.smi" using "Chem.SmilesMolSupplier".How to solve this problem?

[Q] Question from SNG author.

❓ Questions & Help

Glad to see our work taken up in a better library. Are you taking contributions tot his project?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.