
TAX CREdiT: TAXonomic ClassifieR Evaluation Tool

A standardized and extensible evaluation framework for taxonomic classifiers

To view static versions of the reports, start here.

Environment

This repository contains Python 3 code and Jupyter notebooks, but some taxonomy assignment methods (e.g., the QIIME-1 legacy methods) may require different Python or software versions. Hence, we use parallel conda environments to support comparison of myriad methods within a single framework.

The first step is to install conda, then install QIIME 2 following the instructions provided here.

An example of how to load different environments to support other methods can be seen in the QIIME-1 taxonomy assignment notebook.

Setup and install

The library code and Jupyter notebooks are then installed as follows:

git clone https://github.com/gregcaporaso/tax-credit.git
cd tax-credit/
pip install .

Finally, download and unzip the reference databases:

wget https://unite.ut.ee/sh_files/sh_qiime_release_20.11.2016.zip
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
unzip sh_qiime_release_20.11.2016.zip
tar -xzf gg_13_8_otus.tar.gz

Equipment

The analyses included here can all be run on a standard, modern laptop, provided you don't mind waiting a few hours on the most memory-intensive step (taxonomy classification of millions of sequences). With the exception of the q2-feature-classifier naive-bayes* classifier sweeps, which were run on a high-performance cluster, all analyses presented in tax-credit were run in a single day on a MacBook Pro with the following specifications:

OS: OS X 10.11.6 "El Capitan"
Processor: 2.3 GHz Intel Core i7
Memory: 8 GB 1600 MHz DDR3

If you intend to perform extensive parameter sweeps on a classifier (e.g., several hundred or more parameter combinations), you may want to consider running these analyses using cluster resources, if available.

Using the Jupyter Notebooks included in this repository

To view and interact with the Jupyter notebooks, change into the tax-credit/ipynb directory and launch Jupyter from the terminal with the command:

jupyter notebook index.ipynb

The notebooks menu should open in your browser. From the main index, you can follow the menus to browse different analyses, or use File --> Open from the notebook toolbar to access the full file tree.

Citing

A publication is on its way! For now, if you use any of the data or code included in this repository, please cite https://github.com/caporaso-lab/tax-credit.

Contributors

benkaehler, ebolyen, gregcaporaso, jairideout, nbokulich, zellett


Issues

implement test for classification of "novel" taxa

How do taxonomy classifiers perform when they encounter a query sequence that is not represented in the reference database? To what degree do they "overclassify"?

For this test, "novel taxa" consist of query sequences randomly drawn from a reference database (source). Taxonomy assignment is then performed using a modified reference database (ref), which consists of the source minus the novel taxa AND all seqs with matching taxonomy annotations at the taxonomic level (L) being tested (species, genus, family, etc.). Also remove from ref any taxa that do not have near neighbors at level L (e.g., other species in the same genus).
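A minimal sketch of this construction, assuming taxonomies are stored as lists of ranks and reading the near-neighbor requirement as an eligibility filter on candidate novel taxa (the function name and data layout are illustrative, not the tax-credit implementation):

import random

def build_novel_taxa_ref(ref_taxa, level, n_novel=1, seed=42):
    """Illustrative 'novel taxa' construction.

    ref_taxa: dict mapping sequence ID -> taxonomy, where a taxonomy is a
        list of ranks, e.g. ['k__Fungi', ..., 'g__Boletus', 's__edulis'].
    level: index of the rank being withheld (e.g. 6 for species).
    Returns (query_ids, modified_ref): the novel query sequences and a
    reference lacking every sequence that shares their annotation at `level`.
    """
    random.seed(seed)
    # Only taxa with a near neighbor at `level` (another taxon sharing the
    # same parent at level - 1) are eligible to serve as novel taxa.
    eligible = []
    for seq_id, taxonomy in ref_taxa.items():
        parent = tuple(taxonomy[:level])
        has_neighbor = any(
            tuple(other[:level]) == parent and other[level] != taxonomy[level]
            for other in ref_taxa.values())
        if has_neighbor:
            eligible.append(seq_id)
    query_ids = random.sample(eligible, n_novel)
    # Drop the novel sequences AND every sequence annotated to the same
    # taxon at `level`.
    novel_labels = {tuple(ref_taxa[q][:level + 1]) for q in query_ids}
    modified_ref = {
        seq_id: taxonomy for seq_id, taxonomy in ref_taxa.items()
        if tuple(taxonomy[:level + 1]) not in novel_labels}
    return query_ids, modified_ref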

Following this method:
Match: assignment == L - 1 (e.g., a novel species is assigned the correct genus)
Overclassification: assignment == L (e.g., correct genus, but assigned to a near neighbor at level L)
Misclassification: incorrect assignment at L - 1 (e.g., wrong genus-level assignment)

One question: is it worth also defining underclassification, i.e., assignment < L - 1 (e.g., correct family but no genus)? My gut feeling is NO, since this will complicate matters and we are left asking at which level this becomes irrelevant (e.g., if species X is assigned to the correct phylum but the wrong class, is this still underclassification, and does it matter?). Unlike overclassification, I also question whether this would yield a meaningful interpretation.
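For concreteness, the scoring rules above might look like the following sketch, assuming the expected and observed taxonomies are lists of ranks and `level` indexes the withheld rank (illustrative only):

def score_novel_assignment(expected, observed, level):
    """Score one 'novel' query per the rules above
    (match / overclassification / misclassification)."""
    if observed[:level] == expected[:level]:
        if len(observed) > level and observed[level]:
            # Correct through level - 1 but assigns some taxon at the tested
            # level; since the true taxon is absent from ref, this is
            # necessarily a near-neighbor call.
            return 'overclassification'
        # Correct at level - 1 with no deeper assignment.
        return 'match'
    # Wrong at (or above) level - 1. A correct but shallower assignment (the
    # 'underclassification' case discussed above) is not distinguished here.
    return 'misclassification'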

Select and evaluate several alternative scikit-learn classifiers

scikit-learn offers a range of classifiers across several categories.

A reasonable overview of what's available and appropriate is here.

I would suggest that we try at least BernoulliNB and SVCs, plus at least one of decision trees, random forests, nearest neighbors, ridge regression, or MLPClassifier.

Should also revisit feature selection. SelectPercentile doesn't seem to help for MultinomialNB, but it may work elsewhere and other feature selection techniques exist.

Also try TfidfTransformer for feature extraction.

All of this should be possible using the general fit-classifier method in feature-classifier, but may require some fixes to the code.
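For illustration, one such alternative could be prototyped directly in scikit-learn along the following lines; the hashed 8-mer features, the TF-IDF step, and the BernoulliNB choice are assumptions made for this sketch, not the fit-classifier defaults:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Toy reference data: sequences and hypothetical taxonomy labels.
train_seqs = ['ACGGGAGGCAGCAGTGGGGAATATTG', 'TTGACGGGGGCCCGCACAAGCGGTGG']
train_taxa = ['k__Bacteria; p__Firmicutes', 'k__Bacteria; p__Proteobacteria']

pipeline = Pipeline([
    # Represent each sequence by hashed counts of its 8-mers.
    ('kmers', HashingVectorizer(analyzer='char', ngram_range=(8, 8),
                                alternate_sign=False)),
    # Optionally reweight k-mer features.
    ('tfidf', TfidfTransformer()),
    # Swap in BernoulliNB, LinearSVC, RandomForestClassifier, etc. here.
    ('classifier', BernoulliNB()),
])

pipeline.fit(train_seqs, train_taxa)
print(pipeline.predict(['ACGGGAGGCAGCAGTGGGGAATATTG']))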

Do we need to be more careful about prior (or marginal) distributions?

I am concerned about classifier training priors. This came up in my conversation with Steve.

When we are training our classifier, either implicitly or explicitly we bias our classifier to be more likely to predict a taxon according to some prior distribution. This is natural and desirable in some machine learning contexts: if a classifier is uncertain about a prediction then it is going to do better if it guesses according to the unconditional distribution of classes. Some classifiers set the priors to be uniform, some set them according to the distribution of classes in the training sample.

In the normal machine learning context this is uncontroversial because we usually train and validate our classifier on samples that are presumably drawn from the same population.

In our case things are different. We are training our classifiers on a reference set, then testing them on samples that in some cases have contrived and unnatural distributions of classes.

So in my mind there are two questions:

  1. Is our training prior appropriate? Does the reference sample represent some sort of global prior that we would expect for taxa, or is the distribution of classes in the reference data set an artefact of the historical forces by which the reference set has been accumulated?
  2. Are our test sample distributions appropriate for benchmarks? Should we be tuning our classifiers to do well on data sets with distributions of taxa that are unrealistic?
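As a concrete handle on the first question, this is roughly how the two prior choices look for a scikit-learn naive Bayes classifier (illustrative only; not a description of how the classifiers are currently configured in tax-credit):

from sklearn.naive_bayes import MultinomialNB

# Empirical prior: class priors are learned from the training data, so taxa
# that are over-represented in the reference database get a head start.
empirical_prior_nb = MultinomialNB(fit_prior=True)

# Uniform prior: every taxon is a priori equally likely, regardless of how
# often it appears in the reference.
uniform_prior_nb = MultinomialNB(fit_prior=False)

# An explicit prior (e.g. one estimated from an external source) could also
# be supplied via the class_prior argument.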
