
qp-shogun's Introduction

Shogun Qiita Plugin


Qiita (canonically pronounced cheetah) is an analysis environment for microbiome (and other "comparative -omics") datasets.

This package includes the shogun functionality for Qiita.

qp-shogun's People

Contributors

antgonza, charles-cowart, jdereus, josenavas, qiyunzhu, semarpetrus, smruthi98, tanaes


qp-shogun's Issues

Add multiple trimmed fastq file types to fastq artifact

After trimming, we can conceivably have four fastq file types derived from a single original read pair:

  1. Trimmed forward reads with surviving reverse mate pairs
  2. Trimmed reverse reads with surviving forward mate pairs
  3. Trimmed forward reads without surviving reverse mate pairs
  4. Trimmed reverse reads without surviving forward mate pairs

Each of these read types would be useful to track independently for downstream applications. For instance, in assembly or read mapping, types 1 and 2 above will frequently need to be provided as a set, while types 3 and 4 might optionally be concatenated and provided as independent single-ended reads.

Another scenario: when I'm running HUMAnN2, I typically ignore reverse reads altogether and analyze only type 1 and type 3 forward reads after trimming.
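
To make that bookkeeping concrete, here is a minimal sketch in Python; the TrimmedReadType enum and the select_reads helper are hypothetical names for illustration, not existing plugin code:

```python
from enum import Enum


class TrimmedReadType(Enum):
    """The four fastq file types that can result from trimming a read pair."""
    PAIRED_FWD = 'trimmed forward, mate survived'      # type 1
    PAIRED_REV = 'trimmed reverse, mate survived'      # type 2
    UNPAIRED_FWD = 'trimmed forward, mate discarded'   # type 3
    UNPAIRED_REV = 'trimmed reverse, mate discarded'   # type 4


def select_reads(files, use_case):
    """Pick the file types a downstream tool needs.

    `files` maps TrimmedReadType -> filepath; `use_case` labels the
    downstream application. Both are hypothetical.
    """
    if use_case == 'assembly':
        # assemblers usually want the surviving pairs together, plus
        # the orphans concatenated as single-ended reads
        return ([files[TrimmedReadType.PAIRED_FWD],
                 files[TrimmedReadType.PAIRED_REV]],
                [files[TrimmedReadType.UNPAIRED_FWD],
                 files[TrimmedReadType.UNPAIRED_REV]])
    if use_case == 'humann2-forward-only':
        # the HUMAnN2 case above: forward reads only (types 1 and 3)
        return ([files[TrimmedReadType.PAIRED_FWD],
                 files[TrimmedReadType.UNPAIRED_FWD]], [])
    raise ValueError('unknown use case: %s' % use_case)
```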

Add kneaddata functionality to plugin

The KneadData plugin should be able to:

  • Accept forward and reverse raw reads
  • Screen using trimmomatic for quality and adapter content
  • Screen against genome databases to remove host genomic contamination (or PhiX or similar)
  • Output trimmed and decontaminated sequences + FastQC reports
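
For reference, a minimal sketch of a single KneadData invocation covering those steps; the flag names below are from the KneadData CLI as I recall it and should be verified against the installed version:

```python
import subprocess


def run_kneaddata(fwd_fp, rev_fp, ref_db, out_dir):
    """Quality/adapter trim with Trimmomatic, decontaminate against a
    reference genome, and produce FastQC reports (flags assumed)."""
    cmd = ['kneaddata',
           '--input', fwd_fp,          # forward raw reads
           '--input', rev_fp,          # reverse raw reads
           '--reference-db', ref_db,   # e.g. a human genome Bowtie2 index
           '--output', out_dir,
           '--run-fastqc-start',       # FastQC on the raw input
           '--run-fastqc-end']         # FastQC on the cleaned output
    subprocess.run(cmd, check=True)
```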

Improvements to KneadData plugin

  • Add parameter defaults for Nextera and Truseq-compatible adapters
  • Add human reference genome download and reference choices
  • Add mouse reference genome download and reference choices
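
A sketch of what those defaults and choices might look like; the Trimmomatic ILLUMINACLIP strings use the adapter files that ship with Trimmomatic (NexteraPE-PE.fa, TruSeq3-PE.fa), while the dict layout and database labels are placeholders:

```python
# Hypothetical default parameter sets keyed by library prep.
KNEADDATA_DEFAULTS = {
    'Nextera': {
        'trimmomatic-options': (
            'ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 '
            'SLIDINGWINDOW:4:15 MINLEN:36'),
    },
    'TruSeq': {
        'trimmomatic-options': (
            'ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 '
            'SLIDINGWINDOW:4:15 MINLEN:36'),
    },
}

# Reference databases the user could choose from; the plugin would
# download and index these on first use.
REFERENCE_CHOICES = {
    'human': 'GRCh38 Bowtie2 index',
    'mouse': 'GRCm38 Bowtie2 index',
}
```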

Last little issues to finish preliminary shogun plugin

  • remove humann2 command
  • add biom convert function (which also adds shogun functional taxonomies; see the sketch after this list)
  • remove levels from selectable options
  • change to output in output dir instead of temp dir
  • test other aligners (UTREE and BURST)
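
For the biom convert item, a minimal sketch using the biom-format package (load_table, biom_open, and Table.to_hdf5 are its real APIs; the function name and the assumption that shogun emits a classic TSV table are mine):

```python
from biom import load_table
from biom.util import biom_open


def shogun_tsv_to_biom(tsv_fp, biom_fp):
    """Convert a shogun taxon-by-sample TSV into an HDF5 BIOM table."""
    table = load_table(tsv_fp)          # parses classic TSV or BIOM input
    with biom_open(biom_fp, 'w') as f:
        table.to_hdf5(f, 'qp-shogun')   # second arg is the generated-by string
    return biom_fp
```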

Add humann2_split_stratified_table to humann2

We need to add an extra step to the humann2 workflow, humann2_split_stratified_table, so we generate stratified tables. However, the current version doesn't support BIOM, so we will add it once that support is available.
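
Something like the following, once BIOM support lands; the --input/--output flags match the humann2 utility scripts as documented, but should be double-checked:

```python
import subprocess


def split_stratified(table_fp, out_dir):
    """Split a humann2 output table into stratified and unstratified
    tables written into out_dir (flags assumed from the humann2 docs)."""
    subprocess.run(['humann2_split_stratified_table',
                    '--input', table_fp,
                    '--output', out_dir],
                   check=True)
```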

RFC: Create containerized hdf5 fastq object / artifact

I think it would be great to have an object / file in Qiita that we can use to keep track of the various fastq permutations. In particular, once we start doing cleaning and read trimming, we start having to keep track of a bunch of files for each sample:

```python
from os.path import join

from qiita_client import ArtifactInfo

ainfo = []
for run_prefix, sample, f_fp, r_fp in samples:
    sam_out_dir = join(out_dir, run_prefix)

    if r_fp:
        # paired-end input: four cleaned files per sample
        ainfo += [
            ArtifactInfo('clean paired R1', 'per_sample_FASTQ',
                         [(join(sam_out_dir,
                                '%s_paired_1.fastq' % run_prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean paired R2', 'per_sample_FASTQ',
                         [(join(sam_out_dir,
                                '%s_paired_2.fastq' % run_prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean unpaired R1', 'per_sample_FASTQ',
                         [(join(sam_out_dir,
                                '%s_unmatched_1.fastq' % run_prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean unpaired R2', 'per_sample_FASTQ',
                         [(join(sam_out_dir,
                                '%s_unmatched_2.fastq' % run_prefix),
                           'per_sample_FASTQ')])]
    else:
        # single-end input: one cleaned file per sample
        ainfo += [
            ArtifactInfo('cleaned reads', 'per_sample_FASTQ',
                         [(join(sam_out_dir, '%s.fastq' % run_prefix),
                           'per_sample_FASTQ')])]
```

This isn't such a big deal when working on a single plugin, but negotiating between tools can be confusing. For example, what if I ran Trimmomatic on paired reads previously, and now have four fastq files? If I'm working on a downstream plugin, how can I be sure what kind of input I'm getting? In my own work, as long as I have lots of good-quality data, I'll often just ignore unpaired reads. But with shallow sequencing and poorer overall quality (e.g. high-throughput Nextera, often), those unpaired reads sometimes become important.

I'm imagining that having a consistent (possibly per-sample) compressed file format to organize these fastqs in a dependable way would really simplify development. Instead of designing each Qiita plugin to accept various input files and move them around in artifacts, each plugin would just need to call a method that generates the serialized file format the particular tool needs.

I spoke a bit with @josenavas and @wasade about this, and I can see that it might be possible to represent something like this entirely within Qiita using a reworked artifact type that did a better job of keeping track of the various permutations of fastqs. However, it would also be super handy to have an HDF5 representation for use in other contexts as well (I would love to have this in my snakemake workflows! :) ).

One possible structure @wasade and I came up with:

  • root
    • index
      • index seq: [n: ...]
    • corrected index
      • index seq: [n: ...]
    • read headers
      • header text: [n: ...]
    • rev
      • sequences: [n: ...]
      • quals: [n: ...]
    • fwd
      • sequences: [n: ...]
      • quals: [n: ...]

The index / fwd / rev reads would all be stored in the same order, with unpaired fwd / rev reads being stored as null strings (and maybe some sort of masking bit to indicate pairing). Pulling appropriately matched forward / reverse / paired serialized fastq format files out could then be done temporarily on a per-tool basis.
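
As a strawman, here is how that layout could be written with h5py; the group and dataset names follow the outline above, and the 'paired mask' dataset is one way to realize the masking-bit idea:

```python
import h5py
import numpy as np


def write_fastq_h5(path, index_seqs, headers, fwd, rev):
    """Write one sample's reads in the proposed layout.

    index_seqs and headers are equal-length lists of strings; fwd and
    rev are equal-length lists of (sequence, quality) string tuples,
    with empty strings where one orientation did not survive trimming.
    The 'corrected index' group from the outline is omitted for brevity.
    """
    str_dt = h5py.string_dtype()  # variable-length UTF-8 strings
    with h5py.File(path, 'w') as f:
        f.create_dataset('index/index seq', data=index_seqs, dtype=str_dt)
        f.create_dataset('read headers/header text',
                         data=headers, dtype=str_dt)
        for name, reads in (('fwd', fwd), ('rev', rev)):
            grp = f.create_group(name)
            grp.create_dataset('sequences',
                               data=[s for s, _ in reads], dtype=str_dt)
            grp.create_dataset('quals',
                               data=[q for _, q in reads], dtype=str_dt)
        # the masking-bit idea: True where both orientations survived
        paired = np.array([s != '' and r != ''
                           for (s, _), (r, _) in zip(fwd, rev)])
        f.create_dataset('paired mask', data=paired)
```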

Thoughts?

Correct call to humann2 with PE reads

Right now the plugin makes two calls, one with the forward reads and one with the reverse reads. However, this generates a downstream error. After discussing with @tanaes and @antgonza, the solution is to add a new parameter ('include-reads', or something better) of type choice with two acceptable values: 'all reads' and 'forward only'.

If 'all reads', concatenate all four files from the KneadData output and run humann2. If 'forward only', concatenate the two forward files from KneadData and run humann2.
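
A sketch of the concatenation step; the kd_files keys are hypothetical labels for the four KneadData outputs:

```python
import shutil


def prepare_humann2_input(kd_files, include_reads, out_fp):
    """Concatenate KneadData outputs per the 'include-reads' choice.

    kd_files is a dict with keys 'paired_1', 'paired_2', 'unmatched_1',
    'unmatched_2' mapping to KneadData output filepaths (names assumed).
    """
    if include_reads == 'all reads':
        selected = [kd_files['paired_1'], kd_files['paired_2'],
                    kd_files['unmatched_1'], kd_files['unmatched_2']]
    elif include_reads == 'forward only':
        selected = [kd_files['paired_1'], kd_files['unmatched_1']]
    else:
        raise ValueError('unknown include-reads value: %s' % include_reads)
    with open(out_fp, 'wb') as out:
        for fp in selected:
            with open(fp, 'rb') as f:
                shutil.copyfileobj(f, out)
    return out_fp  # single fastq handed to humann2
```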

Correct final files for KneadData

Currently the plugin is sending back the trimmed files (see here), but those are not the correct ones because they haven't been filtered against the human genome. See here for the actual correct filenames.
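
A sketch of how the plugin might pick up the host-filtered outputs instead of the trimmed-only intermediates; the filename patterns are assumptions based on KneadData's usual *_kneaddata_paired_* / *_kneaddata_unmatched_* naming and should be checked against real output:

```python
from glob import glob
from os.path import join


def final_kneaddata_files(out_dir, prefix):
    """Return the decontaminated fastqs, not the *.trimmed.* intermediates
    (patterns assumed from KneadData's output naming)."""
    return sorted(
        glob(join(out_dir, '%s*kneaddata_paired_*.fastq' % prefix)) +
        glob(join(out_dir, '%s*kneaddata_unmatched_*.fastq' % prefix)))
```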
