OncoNetExplainer: Explainable Predictions of Cancer Types Based on Gene Expression Data

Code and supplementary materials for our paper "Explainable Prediction of Cancer Types Based on Gene Expression Data", in Proc. of the 19th Annual IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2019), to be held in Athens, Greece.

Methods

In this paper, we collect genomics data for 9,074 cancer patients covering 33 different cancer types from The Cancer Genome Atlas (TCGA) and train CNN and VGG16 networks, whose predictions we explain using gradient-guided class activation maps (Grad-CAM and Grad-CAM++) and layer-wise relevance propagation (LRP). The following figure shows the workflow of our approach:

We then identify the most significant biomarkers and rank the top genes across different cancer types based on mean absolute impact. Both models predict the different cancer types with high confidence, correctly in at least 94% of the cases.

To provide a comparison with baselines, we further identify the top genes for each cancer type and cancer-specific driver genes using gradient boosted trees and SHapley Additive exPlanations (SHAP), which are further validated against the annotations from TumorPortal.

Data collection

Gene expression data for 33 different tumor types, covering 9,074 samples, were downloaded from The Cancer Genome Atlas (TCGA) portal. See here for more details about the data.

Data availability

The preprocessed data can be downloaded from here using the password '123' (without quotes).

Quick instructions for using GradCAM and ranking important biomarkers

A quick example on a small dataset can be run as follows:

  • $ cd GradCAM_FI
  • $ python3 load_data.py (make sure that the data is in CSV format in the 'data' folder)
  • $ python3 model.py
  • $ python3 grad_cam.py

Examples of explanation using LRP

Example predictions with decision visualizations for all the methods, in particular for LRP, can be found in a notebook.

Examples of explanation using CNN and SHAP

We trained CNN and VGG16 networks and explained their predictions using guided-gradient class activation maps (GradCAM). Further, we generated heat maps for all the classes based on GradCAM to identify the most significant biomarkers, and computed the feature importance in terms of mean absolute impact (MAI) to rank the top genes across cancer types.
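To make the heat-map step more concrete, here is a minimal sketch of the core Grad-CAM combination rule: each feature-map channel is weighted by the average of its gradients, the weighted channels are summed, and a ReLU keeps only positive evidence. The function name `grad_cam` and the toy numbers are illustrative assumptions, not code from this repository; in practice the activations and gradients would come from the trained network.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one Grad-CAM heat map.

    feature_maps[k][i][j]: activation of channel k at position (i, j).
    gradients[k][i][j]: gradient of the class score w.r.t. that activation.
    """
    num_channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])

    # Channel weight: global average of the gradients of that channel.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(num_channels)
    ]

    # Weighted sum over channels, then ReLU to keep positive evidence only.
    return [
        [
            max(0.0, sum(weights[k] * feature_maps[k][i][j]
                         for k in range(num_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy input (made-up values): 2 channels of size 2 x 2.
toy_maps = [[[0.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
toy_grads = [[[0.5, 0.5], [0.5, 0.5]], [[-0.5, -0.5], [-0.5, -0.5]]]
toy_heat_map = grad_cam(toy_maps, toy_grads)
print(toy_heat_map)
```

Positions with larger values in the resulting map are the ones that contributed most to the class score; the negative combination at the top-left position is clipped to zero by the ReLU.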

The following figures show the generated heat-map examples for selected cancer types. Each column represents the result from one fold. Rows represent the heat maps of the BRCA, KIRC, COAD, LUAD, and PRAD cancer types (from top to bottom):

We found that the KRTAP1-1, INPP5K, GAS8, MC1R, POLR2A, BET1P1, NAT2, PSD3, KAT6A, and INTS10 genes are common across cancer types, with the protein-coding gene INTS10 having the highest feature importance of 0.6. The following 25 are the top genes/biomarkers across 5 cancer types (sorted by mean absolute impact (MAI) value):

The following figure shows common driver genes across 33 cancer types:

A Python notebook will be added soon to show these steps more transparently.
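Until that notebook is available, the ranking step can be sketched as follows. This assumes that the mean absolute impact of a gene is the mean of the absolute values of its per-sample importance scores; the exact definition used in the paper is not spelled out in this README. The function name `rank_by_mean_absolute_impact` and the scores are hypothetical; only the gene names are taken from the list above.

```python
def rank_by_mean_absolute_impact(scores_by_gene, top_k=None):
    """Rank genes by the mean of the absolute values of their scores."""
    mai = {
        gene: sum(abs(value) for value in values) / len(values)
        for gene, values in scores_by_gene.items()
    }
    # Highest mean absolute impact first.
    ranked = sorted(mai.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Made-up per-sample importance scores for three genes.
toy_scores = {
    "INTS10": [0.5, -0.75, 0.25],
    "KAT6A": [0.25, 0.25, -0.25],
    "NAT2": [-0.5, 0.25, 0.25],
}
top_genes = rank_by_mean_absolute_impact(toy_scores, top_k=2)
for gene, impact in top_genes:
    print(gene, round(impact, 3))
```

Taking absolute values before averaging means that a gene with strong positive and strong negative contributions is still ranked as important rather than cancelling out.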

Examples of explanation using SHAP and Gradient Boosted Trees

Refer to the Python notebook for an idea of how to use SHAP and gradient boosted trees to generate feature importance values and explanations of the predictions.

First, we process the data (see the Python notebook). Then we train a GBT model. Finally, a SHAP explainer is used to provide explanations. The following figure shows the contributions of clinical features, with features pushing the prediction higher in red and those pushing the prediction lower in blue:

The following figure shows the contributions of clinical features, with features ordered on the y-axis in descending order of their MAI (each dot represents the SHAP value for a specific feature):

As seen, features pushing the prediction higher are shown in red, i.e. they indicate how much the probability of the target class is increased; NACA2, LOC442454, and C19orf6 are the most significant features. Features pushing the prediction lower are shown in blue; ASTN2 and PCDHGC3 are the least significant.
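To illustrate what a positive (red) or negative (blue) contribution means, the sketch below computes exact Shapley values for a tiny, hypothetical scoring function by averaging each feature's marginal contribution over all feature orderings. This brute-force approach is only feasible for a handful of features and is not how the notebook computes explanations for gradient boosted trees; `toy_model`, the instance, and the baseline are assumptions made for illustration.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values relative to a baseline input.

    Averages the marginal contribution of each feature over every
    ordering in which features are switched from baseline to instance.
    """
    num_features = len(instance)
    totals = [0.0] * num_features
    orderings = list(permutations(range(num_features)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for idx in order:
            current[idx] = instance[idx]
            value = predict(current)
            totals[idx] += value - previous
            previous = value
    return [total / len(orderings) for total in totals]


def toy_model(x):
    # Hypothetical score with one interaction term between x[0] and x[2].
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[2]


phi = shapley_values(toy_model, instance=[1.0, 1.0, 1.0],
                     baseline=[0.0, 0.0, 0.0])
for idx, contribution in enumerate(phi):
    direction = "higher (red)" if contribution > 0 else "lower (blue)"
    print(f"feature {idx}: {contribution:+.2f} pushes the prediction {direction}")
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that SHAP plots rely on.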

Citation request

If you use the code of this repository in your research, please consider citing the following paper:

@inproceedings{karim2019XAI,
    title={OncoNetExplainer: Explainable Prediction of Cancer Types Based on Gene Expression Data},
    author={Karim, Md Rezaul and Cochez, Michael and Beyan, Oya and Decker, Stefan and Lange, Christoph},
    booktitle={The 19th annual IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2019)},
    year={2019}
}

Contributing

For any questions, feel free to open an issue or contact us at [email protected]

onconetexplainer's Issues

Data download link failed

Hello,
The data download links are broken. Please provide the data set so that your work can be reproduced.

Thank you very much.

Data Set Link Broken

The dataset linked is not working. Could you give more information about the dataset used? Or documentation on preprocessing and where the data is sourced?

Can't Reproduce the Same Accuracy (GradCAM_F1/model.py)

Dear Sir,

Thanks for this wonderful repository! I have been trying to reproduce the results for the GradCAM method. As mentioned in the repo, we should run python3 load_data.py (for preprocessing) and then python3 model.py (for training and validation accuracy). Unfortunately, I got quite low accuracies when the model was trained for 50 epochs:
Epoch 46/50
20/20 [==============================] - 27s 1s/step - loss: 2.2263 - acc: 0.2986 - val_loss: 2.2967 - val_acc: 0.3482

Please let us know the possible reasons, as I would like to work on this method.

Secondly, I am unable to find the file "data/pancan_genes.csv" used in load_data.py.
