Chemoinformatics_Tools

  1. Visualization of the prediction basis of machine learning QSAR models
  2. Mordred descriptor calculation


1 Visualization of the prediction basis of fingerprint-based machine learning models

The aim of this code is to improve the interpretability of QSAR (Quantitative Structure-Activity Relationship) models that apply machine learning to fingerprints. With this code, you can map colors onto a chemical structure according to each substructure's contribution to the prediction, which makes it possible to interpret which substructures contribute to the predicted pharmacological, physicochemical, or toxicological activity. Here we show a simple implementation using RandomForest, a typical algorithm built on CART (Classification and Regression Tree). Note that the same interpretation algorithm can in principle be applied to other machine learning algorithms, e.g., LightGBM, XGBoost, and neural networks (for which feature importance can be obtained with Permutation Importance). The interpretation algorithm is therefore general purpose.
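For models without built-in feature importances, such as neural networks, Permutation Importance can supply them. The following is a minimal plain-Python sketch of the idea, not code from this repository: `permutation_importance`, `predict`, `n_repeats`, and `seed` are hypothetical names, and accuracy is assumed as the score. In practice a library routine such as scikit-learn's `sklearn.inspection.permutation_importance` would normally be used instead.

```python
import random


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled.

    predict: hypothetical callable mapping one feature row (list) to a label
    X: list of feature rows (lists), e.g. fingerprint bits
    y: list of true labels
    """
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model ignores gets an importance of zero, because shuffling it never changes a prediction.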

In this implementation, colors are assigned to a chemical structure based on weights calculated from the feature importances obtained from the machine learning model. The weights are obtained by the following simple calculation.

  • Obtain the feature importance of each fingerprint bit.
  • Average the feature importance over the number of atoms that belong to the substructure corresponding to the bit.
  • Assign the averaged feature importance as the weight that represents the contribution of that substructure.

The above process calculates the weight and assigns a color to the chemical structure based on the weight value.
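The calculation above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `atom_weights`, `weight_to_rgb`, `importances`, and `bit_atoms` are hypothetical names, the mapping from each fingerprint bit to the atoms of its substructure is assumed to be available already (how it is obtained depends on the fingerprint), weights from several bits are assumed to add up on shared atoms, and the white-to-red color scale is an arbitrary choice.

```python
def atom_weights(importances, bit_atoms, n_atoms):
    """Spread each bit's feature importance over the atoms of its substructure.

    importances: dict mapping bit index -> feature importance from the model
    bit_atoms:   dict mapping bit index -> list of atom-index collections,
                 one collection per occurrence of the substructure (assumed input)
    n_atoms:     number of atoms in the molecule
    """
    weights = [0.0] * n_atoms
    for bit, occurrences in bit_atoms.items():
        importance = importances.get(bit, 0.0)
        for atoms in occurrences:
            atoms = set(atoms)
            if not atoms:
                continue
            # Average the importance over the atoms in the substructure.
            share = importance / len(atoms)
            for atom in atoms:
                weights[atom] += share  # assumption: contributions are summed
    return weights


def weight_to_rgb(weight, max_weight):
    """Map a weight to an RGB tuple from white (low) to red (high)."""
    if max_weight <= 0:
        return (1.0, 1.0, 1.0)
    scaled = min(max(weight / max_weight, 0.0), 1.0)
    return (1.0, 1.0 - scaled, 1.0 - scaled)


# Toy example: bit 5 covers atoms 0-2, bit 9 covers atoms 2-3.
importances = {5: 0.6, 9: 0.2}
bit_atoms = {5: [(0, 1, 2)], 9: [(2, 3)]}
weights = atom_weights(importances, bit_atoms, n_atoms=4)
colors = [weight_to_rgb(w, max(weights)) for w in weights]
```

In the toy example, atom 2 belongs to both substructures, so it receives the largest weight and the most saturated color.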

[Figure: chem table]

Sample data

The sample data are the mutagenicity data set published by Hansen et al. in 2009.

2 Mordred Calculator

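Directory name

descriptors

Overview

A program that calculates Mordred descriptors from an SDF file of compounds with 3D structures and writes them to a CSV file.

Features

Mordred descriptors are used as the descriptors.
Both 2D and 3D descriptors are calculated.
Note: an upgrade that also supports calculating fingerprints and similar features is planned.
Note: RDKit fails to load some compounds. Descriptor calculation is skipped for such compounds, and they are omitted from the output rows.
Note: the number of output compounds can therefore differ from the number of input compounds, so refer to the index and SMILES in the output CSV file.

Operation has been confirmed in the following environment.

Environment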

  • Miniconda3
  • Python 3.7.4

Packages

  • RDKit

$ conda install -c conda-forge rdkit

  • Mordred

$ conda install -c mordred-descriptor mordred

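Usage

Run Descriptor_ver1.0.1.py from the command line.
Set the arguments as follows.

  • sdf_path: path of the SDF file to read
  • csv_path: path of the CSV file to save
  • type of descriptors to calculate (2D or 3D)
    Note: when set to 3D, both 2D and 3D descriptors are calculated.

Example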

$ python Descriptor_ver1.0.1.py test.sdf test.csv 3D
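For orientation, a script with this command-line interface might look roughly like the sketch below. It is an illustration under assumptions, not the contents of Descriptor_ver1.0.1.py: the function names, the argument name `descriptor_type`, and the `SMILES` column layout are hypothetical. It relies on `Chem.SDMolSupplier` yielding `None` for records RDKit cannot load and on the `ignore_3D` option of Mordred's `Calculator`.

```python
import argparse


def parse_args(argv):
    """Parse the three positional arguments described above."""
    parser = argparse.ArgumentParser(
        description="Calculate Mordred descriptors from an SDF file")
    parser.add_argument("sdf_path", help="path of the SDF file to read")
    parser.add_argument("csv_path", help="path of the CSV file to save")
    parser.add_argument("descriptor_type", choices=["2D", "3D"],
                        help="3D calculates both 2D and 3D descriptors")
    return parser.parse_args(argv)


def compute_descriptors(sdf_path, csv_path, descriptor_type):
    # Imported lazily so argument parsing works without RDKit and Mordred.
    from rdkit import Chem
    from mordred import Calculator, descriptors

    # Skip records that fail to load, but keep the original record index
    # so each output row can be traced back to the input file.
    supplier = Chem.SDMolSupplier(sdf_path)
    kept = [(i, mol) for i, mol in enumerate(supplier) if mol is not None]

    # ignore_3D=True restricts the calculation to 2D descriptors.
    calc = Calculator(descriptors, ignore_3D=(descriptor_type == "2D"))
    table = calc.pandas([mol for _, mol in kept])

    # Hypothetical output layout: input record index plus a SMILES column.
    table.insert(0, "SMILES", [Chem.MolToSmiles(mol) for _, mol in kept])
    table.index = [i for i, _ in kept]
    table.to_csv(csv_path)
```

Calling `compute_descriptors(**vars(parse_args(["test.sdf", "test.csv", "3D"])))` would then mirror the example command above.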
