
TAX CREdiT: TAXonomic ClassifieR Evaluation Tool

A standardized and extensible evaluation framework for taxonomic classifiers

To view static versions of the reports, start here.

Environment

This repository contains Python 3 code and Jupyter notebooks, but some taxonomy assignment methods (e.g., the QIIME-1 legacy methods) may require different Python or software versions. Hence, we use parallel conda environments to support comparison of myriad methods within a single framework.

The first step is to install conda, then install QIIME 2 following the instructions provided here.

An example of how to load different environments to support other methods can be seen in the QIIME-1 taxonomy assignment notebook.

Setup and install

The library code and Jupyter notebooks are then installed as follows:

git clone https://github.com/gregcaporaso/tax-credit.git
cd tax-credit/
pip install .

Finally, download and unzip the reference databases:

wget https://unite.ut.ee/sh_files/sh_qiime_release_20.11.2016.zip
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
unzip sh_qiime_release_20.11.2016.zip
tar -xzf gg_13_8_otus.tar.gz

Equipment

The analyses included here can all be run on a standard, modern laptop, provided you don't mind waiting a few hours on the most memory-intensive step (taxonomy classification of millions of sequences). With the exception of the q2-feature-classifier naive-bayes* classifier sweeps, which were run on a high-performance cluster, all analyses presented in tax-credit were run in a single day on a MacBook Pro with the following specifications:

OS: OS X 10.11.6 "El Capitan"
Processor: 2.3 GHz Intel Core i7
Memory: 8 GB 1600 MHz DDR3

If you intend to perform extensive parameter sweeps on a classifier (e.g., several hundred or more parameter combinations), you may want to consider running these analyses using cluster resources, if available.

Using the Jupyter Notebooks included in this repository

To view and interact with the Jupyter notebooks, change into the tax-credit/ipynb directory and launch Jupyter from the terminal with the command:

jupyter notebook index.ipynb

The notebooks menu should open in your browser. From the main index, you can follow the menus to browse different analyses, or use File --> Open from the notebook toolbar to access the full file tree.

Citing

A publication is on its way! For now, if you use any of the data or code included in this repository, please cite https://github.com/caporaso-lab/tax-credit.

Contributors

benkaehler, ebolyen, gregcaporaso, jairideout, nbokulich, zellett


Issues

implement test for classification of "novel" taxa

How do taxonomy classifiers perform when they encounter a query sequence that is not represented in the reference database? To what degree do they "overclassify"?

For this test, "novel taxa" consist of query sequences randomly drawn from a reference database (source). Taxonomy assignment is then performed using a modified reference database (ref), which consists of the source minus the novel taxa AND all seqs with matching taxonomy annotations at the taxonomic level (L) being tested (species, genus, family, etc.). Also remove from ref any taxa that do not have near neighbors at level L (e.g., other species in the same genus).
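A minimal sketch of this construction, assuming taxonomies are stored as lists of ranks and reading the near-neighbor requirement as an eligibility filter on candidate novel taxa (the function name and data layout are illustrative, not the tax-credit implementation):

import random

def build_novel_taxa_ref(ref_taxa, level, n_novel=1, seed=42):
    """Illustrative 'novel taxa' construction.

    ref_taxa: dict mapping sequence ID -> taxonomy, where a taxonomy is a
        list of ranks, e.g. ['k__Fungi', ..., 'g__Boletus', 's__edulis'].
    level: index of the rank being withheld (e.g. 6 for species).
    Returns (query_ids, modified_ref): the novel query sequences and a
    reference lacking every sequence that shares their annotation at `level`.
    """
    random.seed(seed)
    # Only taxa with a near neighbor at `level` (another taxon sharing the
    # same parent at level - 1) are eligible to serve as novel taxa.
    eligible = []
    for seq_id, taxonomy in ref_taxa.items():
        parent = tuple(taxonomy[:level])
        has_neighbor = any(
            tuple(other[:level]) == parent and other[level] != taxonomy[level]
            for other in ref_taxa.values())
        if has_neighbor:
            eligible.append(seq_id)
    query_ids = random.sample(eligible, n_novel)
    # Drop the novel sequences AND every sequence annotated to the same
    # taxon at `level`.
    novel_labels = {tuple(ref_taxa[q][:level + 1]) for q in query_ids}
    modified_ref = {
        seq_id: taxonomy for seq_id, taxonomy in ref_taxa.items()
        if tuple(taxonomy[:level + 1]) not in novel_labels}
    return query_ids, modified_ref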

Following this method:
Match: assignment == L - 1 (e.g., a novel species is assigned the correct genus)
Overclassification: assignment == L (e.g., correct genus, but assigned to a near neighbor at level L)
Misclassification: incorrect assignment at L - 1 (e.g., wrong genus-level assignment)

One question: is it worth also defining underclassification, i.e., assignment < L - 1 (e.g., correct family but no genus)? My gut feeling is NO, since this will complicate matters and we are left asking at which level this becomes irrelevant (e.g., if species X is assigned to the correct phylum but the wrong class, is this still underclassification, and does it matter?). Unlike overclassification, I also question whether this would yield a meaningful interpretation.
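For concreteness, the scoring rules above might look like the following sketch, assuming the expected and observed taxonomies are lists of ranks and `level` indexes the withheld rank (illustrative only):

def score_novel_assignment(expected, observed, level):
    """Score one 'novel' query per the rules above
    (match / overclassification / misclassification)."""
    if observed[:level] == expected[:level]:
        if len(observed) > level and observed[level]:
            # Correct through level - 1 but assigns some taxon at the tested
            # level; since the true taxon is absent from ref, this is
            # necessarily a near-neighbor call.
            return 'overclassification'
        # Correct at level - 1 with no deeper assignment.
        return 'match'
    # Wrong at (or above) level - 1. A correct but shallower assignment (the
    # 'underclassification' case discussed above) is not distinguished here.
    return 'misclassification'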

Select and evaluate several alternative scikit-learn classifiers

scikit-learn offers a range of classifiers across several categories.

A reasonable overview of what's available and appropriate is here.

I would suggest that we try at least BernoulliNB and SVCs, plus at least one of decision trees, random forests, nearest neighbors, ridge regression, or MLPClassifier.

Should also revisit feature selection. SelectPercentile doesn't seem to help for MultinomialNB, but it may work elsewhere and other feature selection techniques exist.

Also try TfidfTransformer for feature extraction.

All of this should be possible using the general fit-classifier method in feature-classifier, but may require some fixes to the code.
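For illustration, one such alternative could be prototyped directly in scikit-learn along the following lines; the hashed 8-mer features, the TF-IDF step, and the BernoulliNB choice are assumptions made for this sketch, not the fit-classifier defaults:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Toy reference data: sequences and hypothetical taxonomy labels.
train_seqs = ['ACGGGAGGCAGCAGTGGGGAATATTG', 'TTGACGGGGGCCCGCACAAGCGGTGG']
train_taxa = ['k__Bacteria; p__Firmicutes', 'k__Bacteria; p__Proteobacteria']

pipeline = Pipeline([
    # Represent each sequence by hashed counts of its 8-mers.
    ('kmers', HashingVectorizer(analyzer='char', ngram_range=(8, 8),
                                alternate_sign=False)),
    # Optionally reweight k-mer features.
    ('tfidf', TfidfTransformer()),
    # Swap in BernoulliNB, LinearSVC, RandomForestClassifier, etc. here.
    ('classifier', BernoulliNB()),
])

pipeline.fit(train_seqs, train_taxa)
print(pipeline.predict(['ACGGGAGGCAGCAGTGGGGAATATTG']))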

Do we need to be more careful about prior (or marginal) distributions?

I am concerned about classifier training priors. This came up in my conversation with Steve.

When we are training our classifier, either implicitly or explicitly we bias our classifier to be more likely to predict a taxon according to some prior distribution. This is natural and desirable in some machine learning contexts: if a classifier is uncertain about a prediction then it is going to do better if it guesses according to the unconditional distribution of classes. Some classifiers set the priors to be uniform, some set them according to the distribution of classes in the training sample.

In the normal machine learning context this is uncontroversial because we usually train and validate our classifier on samples that are presumably drawn from the same population.

In our case things are different. We are training our classifiers on a reference set, then testing them on samples that in some cases have contrived and unnatural distributions of classes.

So in my mind there are two questions:

  1. Is our training prior appropriate? Does the reference sample represent some sort of global prior that we would expect for taxa, or is the distribution of classes in the reference data set an artefact of the historical forces by which the reference set has been accumulated?
  2. Are our test sample distributions appropriate for benchmarks? Should we be tuning our classifiers to do well on data sets with distributions of taxa that are unrealistic?
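As a concrete handle on the first question, this is roughly how the two prior choices look for a scikit-learn naive Bayes classifier (illustrative only; not a description of how the classifiers are currently configured in tax-credit):

from sklearn.naive_bayes import MultinomialNB

# Empirical prior: class priors are learned from the training data, so taxa
# that are over-represented in the reference database get a head start.
empirical_prior_nb = MultinomialNB(fit_prior=True)

# Uniform prior: every taxon is a priori equally likely, regardless of how
# often it appears in the reference.
uniform_prior_nb = MultinomialNB(fit_prior=False)

# An explicit prior (e.g. one estimated from an external source) could also
# be supplied via the class_prior argument.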
