
feature-importance

Permutation feature importance

The ability to determine the primary features that drive model predictions can facilitate understanding of the inner mechanisms of learning algorithms and encourage adoption of AI-based tools by end users and healthcare practitioners. Certain algorithms inherently carry information about the significance of individual features for a given ML problem. Most existing algorithms, however, including neural networks, do not provide such information out of the box.

We developed a feature-shuffling heuristic that determines feature importance in structured data. First, a model is trained and a baseline prediction score is recorded. Feature importance is then determined by a series of predictions, executed iteratively with the trained model: in each iteration, a single feature is selected and all sample values within that feature are shuffled, while the other features remain unchanged. The importance of the shuffled feature is evaluated by the change in prediction score relative to the baseline. Important features exhibit a degradation in prediction score, while redundant or irrelevant features result in little to no decrease, or even an improvement. The output of the heuristic is a ranked list of features with the corresponding decrease in prediction score.

This approach is model-agnostic (i.e., it does not depend on the choice of ML model), so it can be applied to a variety of existing algorithms that do not inherently prioritize features, without compromising the interpretability or performance of the model. We demonstrate the algorithm by training a fully connected neural network on the TCGA (The Cancer Genome Atlas) dataset for cancer type classification. TCGA contains RNA-Seq expression data for 33 tissue cancer types, with over 11,000 samples.
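The shuffling heuristic described above can be sketched in a few lines. The snippet below is a minimal illustration, not the repository's actual implementation: it stands in a scikit-learn random forest and a synthetic dataset for the paper's neural network and TCGA data, and the function and variable names are our own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Rank features by the drop in score after shuffling each column."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))   # baseline prediction score
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])        # shuffle one feature; others unchanged
            scores.append(metric(y, model.predict(X_perm)))
        drops[j] = baseline - np.mean(scores)  # large drop => important feature
    return baseline, drops

# Toy demo on synthetic data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
baseline, drops = permutation_importance(model, X, y, accuracy_score)
ranking = np.argsort(drops)[::-1]            # most important features first
```

Averaging over several shuffles (`n_repeats`) reduces the variance of the estimate, since a single permutation can be unluckily close to the original ordering.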
Using TCGA, we created a balanced dataset with 10 classes (400 samples per class): the 9 most prevalent cancer types in TCGA, plus one class containing a mix of the remaining cancer types (labeled "others"). Below is a heatmap of the most predictive genes across all cancer types; darker colors indicate higher importance. Several highlights in this map are supported by the literature, e.g., SOX2 in LGG, GATA3 in BRCA, and PAX8 in THCA.
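The class-balancing step can be sketched with pandas. The table below is synthetic, and the `cancer_type` column name is an assumption for illustration, not the actual TCGA schema.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TCGA sample table (assumption: one row per sample,
# with a 'cancer_type' label column).
rng = np.random.default_rng(0)
weights = np.linspace(15, 1, 15)
df = pd.DataFrame({"cancer_type": rng.choice([f"type_{i}" for i in range(15)],
                                             size=8000, p=weights / weights.sum())})

# Collapse to the 9 most prevalent types plus an "others" catch-all class.
top9 = df["cancer_type"].value_counts().index[:9]
df["label"] = np.where(df["cancer_type"].isin(top9), df["cancer_type"], "others")

# Draw 400 samples per class (with replacement only if a class falls short).
balanced = pd.concat(g.sample(400, replace=len(g) < 400, random_state=0)
                     for _, g in df.groupby("label"))
```

The result has 10 classes of 400 samples each, mirroring the balanced design described above.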

Contributors

adpartin
