
feature-importance

Permutation feature importance

The ability to determine the primary features that drive model predictions can facilitate understanding of the inner mechanisms of learning algorithms and encourage adoption of AI-based tools by end users and healthcare practitioners. Certain algorithms inherently carry information about the significance of individual features for a given ML problem. Most existing algorithms, however, including neural networks, do not provide such information out of the box.

We developed a feature-shuffling heuristic that determines feature importance in structured data. First, a model is trained and a baseline prediction score is recorded. Feature importance is then determined by a series of predictions, executed iteratively with the trained model: in each iteration, a single feature is selected and all sample values within that feature are shuffled, while the other features remain unchanged. The importance of the shuffled feature is evaluated by the change in prediction score relative to the baseline. Important features exhibit a degradation in prediction score, while redundant or irrelevant features result in little to no decrease, or even an improvement. The output of the heuristic is a ranked list of features with the corresponding decrease in prediction score.

This approach is model-agnostic (i.e., it does not depend on the choice of ML model), so it can be applied to a variety of existing algorithms that do not inherently prioritize features, without compromising the interpretability or performance of the model. We demonstrate the algorithm by training a fully connected neural network on the TCGA (The Cancer Genome Atlas) dataset for cancer type classification. TCGA contains RNA-Seq expression data for 33 tissue cancer types, with over 11,000 samples.
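The shuffling heuristic described above can be sketched in a few lines. The snippet below is a minimal illustration, not the repository's actual implementation: it stands in a scikit-learn random forest and a synthetic dataset for the paper's neural network and TCGA data, and the function and variable names are our own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Rank features by the drop in score after shuffling each column."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))   # baseline prediction score
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])        # shuffle one feature; others unchanged
            scores.append(metric(y, model.predict(X_perm)))
        drops[j] = baseline - np.mean(scores)  # large drop => important feature
    return baseline, drops

# Toy demo on synthetic data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
baseline, drops = permutation_importance(model, X, y, accuracy_score)
ranking = np.argsort(drops)[::-1]            # most important features first
```

Averaging over several shuffles (`n_repeats`) reduces the variance of the estimate, since a single permutation can be unluckily close to the original ordering.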
Using TCGA, we created a balanced dataset with 10 classes (400 samples per class): the 9 most prevalent cancer types in TCGA, plus one class containing a mix of the remaining cancer types (labeled "others"). Below is a heatmap of the most predictive genes across all cancer types; darker colors indicate higher importance. Several highlights in this map are supported by the literature, e.g., SOX2 in LGG, GATA3 in BRCA, and PAX8 in THCA.
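The class-balancing step can be sketched with pandas. The table below is synthetic, and the `cancer_type` column name is an assumption for illustration, not the actual TCGA schema.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TCGA sample table (assumption: one row per sample,
# with a 'cancer_type' label column).
rng = np.random.default_rng(0)
weights = np.linspace(15, 1, 15)
df = pd.DataFrame({"cancer_type": rng.choice([f"type_{i}" for i in range(15)],
                                             size=8000, p=weights / weights.sum())})

# Collapse to the 9 most prevalent types plus an "others" catch-all class.
top9 = df["cancer_type"].value_counts().index[:9]
df["label"] = np.where(df["cancer_type"].isin(top9), df["cancer_type"], "others")

# Draw 400 samples per class (with replacement only if a class falls short).
balanced = pd.concat(g.sample(400, replace=len(g) < 400, random_state=0)
                     for _, g in df.groupby("label"))
```

The result has 10 classes of 400 samples each, mirroring the balanced design described above.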

Contributors

adpartin
