molucn's Introduction

Introduction

Supporting code and analyses for Explaining compound activity predictions with a substructure-aware loss for graph neural networks

Getting Started

Installation process

We recommend the conda Python package manager. A GPU is not strictly required to run the models and feature attribution methods reported here, but it is strongly encouraged. The code has only been tested under Linux. Create a new environment from the provided environment.yaml file:

conda env create --file=environment.yaml
conda activate molucn

Prerequisites and data structure

To reproduce the results presented in the paper, download the following compressed data tarball (~26 GB when uncompressed):

wget -O data.tar.gz https://figshare.com/ndownloader/files/37624043
tar -xf data.tar.gz

As in the original benchmark study, the data is organized into subdirectories named by PDB identifier, each containing the activity data for one target considered in the benchmark. Each subdirectory has the following structure:

(molucn):~/molucn/data/1D3G-BRE$ tree
.
├── 1D3G-BRE_heterodata_list.pt
├── 1D3G-BRE_seed_1337_info.txt
├── 1D3G-BRE_seed_1337_stats.csv
├── 1D3G-BRE_seed_1337_test.pt
└── 1D3G-BRE_seed_1337_train.pt

An explanation for each file in the subdirectories is provided below:

  • 1D3G-BRE_heterodata_list.pt: dataset with all pairs of ligands, saved as torch_geometric.data.HeteroData objects. Each object holds the SMILES, ground-truth colorings, ligand activities, the molecule structures as torch_geometric.data.Data objects, and the MCS boolean lists at different thresholds.
  • 1D3G-BRE_seed_1337_info.txt: text file with information on the congeneric series: number of different ligands/compounds, number of pairs, and number of training and testing pairs after (1) splitting the compounds into training and testing sets, (2) keeping pairs with no overlap, and (3) rebalancing the training and testing pairs to an 80%/20% ratio.
  • 1D3G-BRE_seed_1337_stats.csv: summarizes the previous .txt file into a .csv file to facilitate information extraction.
  • 1D3G-BRE_seed_1337_test.pt and 1D3G-BRE_seed_1337_train.pt: contain the test and train pairs, respectively, saved as torch_geometric.data.HeteroData objects obtained after the preprocessing and rebalancing pipelines.
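The three preprocessing steps listed for the info file can be illustrated with a small sketch. This is not the repository's implementation: the function name split_pairs, the compound-level test fraction, and the choice to rebalance by randomly subsampling the larger side are all assumptions made for illustration.

```python
import random
from itertools import combinations


def split_pairs(compounds, test_fraction=0.2, train_per_test=4, seed=1337):
    """Hypothetical sketch: split compounds, pair them, rebalance to 80/20."""
    rng = random.Random(seed)
    shuffled = list(compounds)
    rng.shuffle(shuffled)

    # 1. Split the compounds (not the pairs) into training and test sets.
    #    The compound-level fraction is an assumption of this sketch.
    n_test = max(2, round(test_fraction * len(shuffled)))
    test_cpds, train_cpds = shuffled[:n_test], shuffled[n_test:]

    # 2. Keep only pairs whose two members fall in the same set, so that
    #    no compound is shared between training and test pairs.
    train_pairs = list(combinations(train_cpds, 2))
    test_pairs = list(combinations(test_cpds, 2))

    # 3. Rebalance by subsampling the larger side so that there are at most
    #    train_per_test training pairs per test pair (4:1 is 80%/20%).
    max_train = train_per_test * len(test_pairs)
    max_test = len(train_pairs) // train_per_test
    if len(train_pairs) > max_train:
        train_pairs = rng.sample(train_pairs, max_train)
    elif len(test_pairs) > max_test:
        test_pairs = rng.sample(test_pairs, max_test)
    return train_pairs, test_pairs
```

Pairing within each compound set, rather than splitting a list of pairs directly, is what guarantees that no test compound was seen during training.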

All the .pt files can be read with the Python dill module.
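A minimal loading sketch follows. The helper name load_pairs is hypothetical, and the pickle fallback exists only so the snippet runs where dill is not installed; the real files should be read with dill, inside the molucn environment so that the torch_geometric classes can be resolved during deserialization.

```python
import pickle

try:
    import dill  # third-party module named above as the reader for .pt files
except ImportError:
    # Fallback so this sketch still runs without dill installed.
    # Files written with dill may not load correctly this way.
    dill = pickle


def load_pairs(path):
    """Deserialize one of the .pt files and return the stored object."""
    with open(path, "rb") as handle:
        return dill.load(handle)


# Example call, assuming the tarball was extracted into ./data:
# train_pairs = load_pairs("data/1D3G-BRE/1D3G-BRE_seed_1337_train.pt")
```

As with any pickle-based format, only load files from a source you trust, since deserialization can execute arbitrary code.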

Build and Test

Given a specific target protein and a feature attribution method, the main.py file trains a GNN model and generates node colorings with the selected explainability method for each of the 3 losses proposed in the study: MSE, MSE+AC and MSE+UCN.

The trained GNN models and their logs (metrics) will be saved under the models/ and logs/ subdirectories in the root directory of the repo. The atom colorings produced by the feature attribution methods are saved in colors/. Metrics measuring the performance of the different feature attribution techniques will be saved under results/.

Example: Test on the data from 1 protein target

To train the GNN model and run a feature attribution method for one target protein (e.g., 1D3G-BRE), run:

python molucn/main.py --target "1D3G-BRE" --explainer {"diff" | "gradinput" | "ig" | "cam" | "gradcam"}

For the random forest and masking baseline:

python molucn/main_rf.py --target "1D3G-BRE"

Test on all 350 protein targets

To reproduce the results for the 350 protein targets:

  • GNN-based methods:
bash main.sh {diff|gradinput|ig|cam|gradcam}
  • RF masking:
bash main_rf.sh
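For running only a subset of targets, the per-target command can also be assembled from Python. The sketch below is not the content of main.sh; it assumes that each subdirectory of data/ corresponds to one target and reuses only the --target and --explainer flags shown earlier. The helper names list_targets and build_command are hypothetical.

```python
import sys
from pathlib import Path

# Explainer choices accepted by main.py, as listed in the example command.
EXPLAINERS = ("diff", "gradinput", "ig", "cam", "gradcam")


def list_targets(data_dir="data"):
    """Return target names, assuming one subdirectory per target."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def build_command(target, explainer):
    """Build the argument list for one main.py run."""
    if explainer not in EXPLAINERS:
        raise ValueError(f"unknown explainer: {explainer!r}")
    return [sys.executable, "molucn/main.py",
            "--target", target, "--explainer", explainer]


# Example use from the repository root:
# import subprocess
# for target in list_targets()[:5]:
#     subprocess.run(build_command(target, "gradcam"), check=True)
```

Passing the command as a list avoids shell quoting issues, and validating the explainer up front fails fast before any training starts.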

Citation

If you find this work or parts thereof useful, please consider citing:

@article{amara2022substructure,
  title={A substructure-aware loss for feature attribution in drug discovery},
  author={Amara, Kenza and Rodriguez-Perez, Raquel and Luna, Jos{\'e} Jim{\'e}nez},
  year={2022}
}

molucn's Issues

Write tests

Currently there are no tests for the code. Ideally, we should have both unit and integration tests, run via pytest on GitHub workflows.

These should live under a tests/ folder under the root directory of the project.

Organize analysis code

Currently everything under molucn/analyses consists of PyTorch scripts that seem to have been exported from Jupyter notebooks. These need to be properly organized, cleaned, and documented for publication release.

Use a `utils.py` file to define paths relative to the root

data_ori_path = "data/selected_processed_data"

Avoid hardcoding paths like you do here. A somewhat simple strategy is to have a utils.py or a paths.py file in the root folder of the project where you can define things like:

import os

ROOT_PATH = os.path.dirname(os.path.realpath(__file__))
DATA_PATH = os.path.join(ROOT_PATH, "data")
RESULT_PATH = os.path.join(ROOT_PATH, "results")
...

Then import these variables across the project to load/save data as needed.
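As an illustration of how such a constant might be consumed, the sketch below builds the per-target file paths from the naming scheme shown in the data directory listing. The helper target_files is hypothetical, and DATA_PATH is redefined locally only to keep the snippet self-contained; in the project it would be imported from the shared module.

```python
import os

# Stand-in for the DATA_PATH constant suggested above (assumption: the
# data folder sits in the current working directory).
DATA_PATH = os.path.join(os.getcwd(), "data")


def target_files(target, seed=1337, data_path=DATA_PATH):
    """Return the expected file paths for one target's subdirectory."""
    folder = os.path.join(data_path, target)
    prefix = f"{target}_seed_{seed}"
    return {
        "pairs": os.path.join(folder, f"{target}_heterodata_list.pt"),
        "info": os.path.join(folder, f"{prefix}_info.txt"),
        "stats": os.path.join(folder, f"{prefix}_stats.csv"),
        "train": os.path.join(folder, f"{prefix}_train.pt"),
        "test": os.path.join(folder, f"{prefix}_test.pt"),
    }
```

Keeping the naming scheme in one function means a change to the seed or folder layout only needs to be made in a single place.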
