
qp-shogun's Introduction

Shogun Qiita Plugin


Qiita (canonically pronounced cheetah) is an analysis environment for microbiome (and other "comparative -omics") datasets.

This package includes the shogun functionality for Qiita.

qp-shogun's People

Contributors

antgonza, charles-cowart, jdereus, josenavas, qiyunzhu, semarpetrus, smruthi98, tanaes


qp-shogun's Issues

Add multiple trimmed fastq file types to fastq artifact

After trimming, we can conceivably have four fastq file types derived from a single original read pair:

  1. Trimmed forward reads with surviving reverse mate pairs
  2. Trimmed reverse reads with surviving forward mate pairs
  3. Trimmed forward reads without surviving reverse mate pairs
  4. Trimmed reverse reads without surviving forward mate pairs

Each of these read types would be useful to track independently for downstream applications. For instance, in assembly or read mapping, types 1 and 2 above will frequently need to be provided as a set, while types 3 and 4 might optionally be concatenated and provided as independent single-ended reads.

Another scenario: when I'm running HUMAnN2, I typically ignore reverse reads altogether and analyze only type 1 and type 3 forward reads after trimming.
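
To make that bookkeeping concrete, here is a minimal sketch in Python; the TrimmedReadType enum and the select_reads helper are hypothetical names for illustration, not existing plugin code:

```python
from enum import Enum


class TrimmedReadType(Enum):
    """The four fastq file types that can result from trimming a read pair."""
    PAIRED_FWD = 'trimmed forward, mate survived'      # type 1
    PAIRED_REV = 'trimmed reverse, mate survived'      # type 2
    UNPAIRED_FWD = 'trimmed forward, mate discarded'   # type 3
    UNPAIRED_REV = 'trimmed reverse, mate discarded'   # type 4


def select_reads(files, use_case):
    """Pick the file types a downstream tool needs.

    `files` maps TrimmedReadType -> filepath; `use_case` labels the
    downstream application. Both are hypothetical.
    """
    if use_case == 'assembly':
        # assemblers usually want the surviving pairs together, plus
        # the orphans concatenated as single-ended reads
        return ([files[TrimmedReadType.PAIRED_FWD],
                 files[TrimmedReadType.PAIRED_REV]],
                [files[TrimmedReadType.UNPAIRED_FWD],
                 files[TrimmedReadType.UNPAIRED_REV]])
    if use_case == 'humann2-forward-only':
        # the HUMAnN2 case above: forward reads only (types 1 and 3)
        return ([files[TrimmedReadType.PAIRED_FWD],
                 files[TrimmedReadType.UNPAIRED_FWD]], [])
    raise ValueError('unknown use case: %s' % use_case)
```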

Add kneaddata functionality to plugin

The KneadData plugin should be able to:

  • Accept forward and reverse raw reads
  • Screen using trimmomatic for quality and adapter content
  • Screen against genome databases to remove host genomic contamination (or PhiX or similar)
  • Output trimmed and decontaminated sequences + FastQC reports
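
For reference, a minimal sketch of a single KneadData invocation covering those steps; the flag names below are from the KneadData CLI as I recall it and should be verified against the installed version:

```python
import subprocess


def run_kneaddata(fwd_fp, rev_fp, ref_db, out_dir):
    """Quality/adapter trim with Trimmomatic, decontaminate against a
    reference genome, and produce FastQC reports (flags assumed)."""
    cmd = ['kneaddata',
           '--input', fwd_fp,          # forward raw reads
           '--input', rev_fp,          # reverse raw reads
           '--reference-db', ref_db,   # e.g. a human genome Bowtie2 index
           '--output', out_dir,
           '--run-fastqc-start',       # FastQC on the raw input
           '--run-fastqc-end']         # FastQC on the cleaned output
    subprocess.run(cmd, check=True)
```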

Improvements to KneadData plugin

  • Add parameter defaults for Nextera and Truseq-compatible adapters
  • Add human reference genome download and reference choices
  • Add mouse reference genome download and reference choices
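
A sketch of what those defaults and choices might look like; the Trimmomatic ILLUMINACLIP strings use the adapter files that ship with Trimmomatic (NexteraPE-PE.fa, TruSeq3-PE.fa), while the dict layout and database labels are placeholders:

```python
# Hypothetical default parameter sets keyed by library prep.
KNEADDATA_DEFAULTS = {
    'Nextera': {
        'trimmomatic-options': (
            'ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 '
            'SLIDINGWINDOW:4:15 MINLEN:36'),
    },
    'TruSeq': {
        'trimmomatic-options': (
            'ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 '
            'SLIDINGWINDOW:4:15 MINLEN:36'),
    },
}

# Reference databases the user could choose from; the plugin would
# download and index these on first use.
REFERENCE_CHOICES = {
    'human': 'GRCh38 Bowtie2 index',
    'mouse': 'GRCm38 Bowtie2 index',
}
```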

Last little issues to finish preliminary shogun plugin

  • remove humann2 command
  • add biom convert function (which also adds shogun functional taxonomies; see the sketch after this list)
  • remove levels from selectable options
  • change to output in output dir instead of temp dir
  • test other aligners (UTREE and BURST)
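
For the biom convert item, a minimal sketch using the biom-format package (load_table, biom_open, and Table.to_hdf5 are its real APIs; the function name and the assumption that shogun emits a classic TSV table are mine):

```python
from biom import load_table
from biom.util import biom_open


def shogun_tsv_to_biom(tsv_fp, biom_fp):
    """Convert a shogun taxon-by-sample TSV into an HDF5 BIOM table."""
    table = load_table(tsv_fp)          # parses classic TSV or BIOM input
    with biom_open(biom_fp, 'w') as f:
        table.to_hdf5(f, 'qp-shogun')   # second arg is the generated-by string
    return biom_fp
```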

Add humann2_split_stratified_table to humann2

We need to add an extra step to the humann2 workflow, humann2_split_stratified_table, so we generate stratified tables. However, the current version doesn't support BIOM, so we will add it once that support is available.
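
Something like the following, once BIOM support lands; the --input/--output flags match the humann2 utility scripts as documented, but should be double-checked:

```python
import subprocess


def split_stratified(table_fp, out_dir):
    """Split a humann2 output table into stratified and unstratified
    tables written into out_dir (flags assumed from the humann2 docs)."""
    subprocess.run(['humann2_split_stratified_table',
                    '--input', table_fp,
                    '--output', out_dir],
                   check=True)
```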

RFC: Create containerized hdf5 fastq object / artifact

I think it would be great to have an object / file in Qiita that we can use to keep track of the various fastq permutations. In particular, once we start doing cleaning and read trimming, we start having to keep track of a bunch of files for each sample:

```python
from os.path import join

from qiita_client import ArtifactInfo

ainfo = []
for run_prefix, sample, f_fp, r_fp in samples:
    sam_out_dir = join(out_dir, run_prefix)

    if r_fp:
        # paired-end input: four cleaned files per sample
        ainfo += [
            ArtifactInfo('clean paired R1', 'per_sample_FASTQ',
                         [(join(sam_out_dir,
                                '%s_paired_1.fastq' % run_prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean paired R2', 'per_sample_FASTQ',
                         [(join(sam_out_dir,
                                '%s_paired_2.fastq' % run_prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean unpaired R1', 'per_sample_FASTQ',
                         [(join(sam_out_dir,
                                '%s_unmatched_1.fastq' % run_prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean unpaired R2', 'per_sample_FASTQ',
                         [(join(sam_out_dir,
                                '%s_unmatched_2.fastq' % run_prefix),
                           'per_sample_FASTQ')])]
    else:
        # single-end input: one cleaned file per sample
        ainfo += [
            ArtifactInfo('cleaned reads', 'per_sample_FASTQ',
                         [(join(sam_out_dir, '%s.fastq' % run_prefix),
                           'per_sample_FASTQ')])]
```

This isn't such a big deal when working on a single plugin, but negotiating between tools can be confusing. For example, what if I ran Trimmomatic on paired reads previously, and now have four fastq files? If I'm working on a downstream plugin, how can I be sure what kind of input I'm getting? In my own work, as long as I have lots of good-quality data, I'll often just ignore unpaired reads. But with shallow sequencing and poorer overall quality (e.g. high-throughput Nextera, often), those unpaired reads sometimes become important.

I'm imagining that having a consistent (possibly per-sample) compressed file format to organize these fastqs in a dependable way would really simplify development. Instead of designing each Qiita plugin to accept various input files and move them around in artifacts, each plugin would just need to call a method that generates the serialized file format the particular tool needs.

I spoke a bit with @josenavas and @wasade about this, and I can see that it might be possible to represent something like this entirely within Qiita using a reworked artifact type that did a better job of keeping track of the various permutations of fastqs. However, it would also be super handy to have an HDF5 representation for use in other contexts as well (I would love to have this in my snakemake workflows! :) ).

One possible structure @wasade and I came up with:

  • root
    • index
      • index seq: [n: ...]
    • corrected index
      • index seq: [n: ...]
    • read headers
      • header text: [n: ...]
    • rev
      • sequences: [n: ...]
      • quals: [n: ...]
    • fwd
      • sequences: [n: ...]
      • quals: [n: ...]

The index / fwd / rev reads would all be stored in the same order, with unpaired fwd / rev reads being stored as null strings (and maybe some sort of masking bit to indicate pairing). Pulling appropriately matched forward / reverse / paired serialized fastq format files out could then be done temporarily on a per-tool basis.
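
As a strawman, here is how that layout could be written with h5py; the group and dataset names follow the outline above, and the 'paired mask' dataset is one way to realize the masking-bit idea:

```python
import h5py
import numpy as np


def write_fastq_h5(path, index_seqs, headers, fwd, rev):
    """Write one sample's reads in the proposed layout.

    index_seqs and headers are equal-length lists of strings; fwd and
    rev are equal-length lists of (sequence, quality) string tuples,
    with empty strings where one orientation did not survive trimming.
    The 'corrected index' group from the outline is omitted for brevity.
    """
    str_dt = h5py.string_dtype()  # variable-length UTF-8 strings
    with h5py.File(path, 'w') as f:
        f.create_dataset('index/index seq', data=index_seqs, dtype=str_dt)
        f.create_dataset('read headers/header text',
                         data=headers, dtype=str_dt)
        for name, reads in (('fwd', fwd), ('rev', rev)):
            grp = f.create_group(name)
            grp.create_dataset('sequences',
                               data=[s for s, _ in reads], dtype=str_dt)
            grp.create_dataset('quals',
                               data=[q for _, q in reads], dtype=str_dt)
        # the masking-bit idea: True where both orientations survived
        paired = np.array([s != '' and r != ''
                           for (s, _), (r, _) in zip(fwd, rev)])
        f.create_dataset('paired mask', data=paired)
```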

Thoughts?

Correct call to humann2 with PE reads

Right now the plugin makes two calls, one with the forward reads and one with the reverse reads. However, this generates a downstream error. After discussing with @tanaes and @antgonza, the solution is to add a new parameter ('include-reads', or something better) of type choice with two acceptable values: 'all reads' and 'forward only'.

If 'all reads', concatenate all four files from the KneadData output and run humann2. If 'forward only', concatenate the two forward files from KneadData and run humann2.
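
A sketch of the concatenation step; the kd_files keys are hypothetical labels for the four KneadData outputs:

```python
import shutil


def prepare_humann2_input(kd_files, include_reads, out_fp):
    """Concatenate KneadData outputs per the 'include-reads' choice.

    kd_files is a dict with keys 'paired_1', 'paired_2', 'unmatched_1',
    'unmatched_2' mapping to KneadData output filepaths (names assumed).
    """
    if include_reads == 'all reads':
        selected = [kd_files['paired_1'], kd_files['paired_2'],
                    kd_files['unmatched_1'], kd_files['unmatched_2']]
    elif include_reads == 'forward only':
        selected = [kd_files['paired_1'], kd_files['unmatched_1']]
    else:
        raise ValueError('unknown include-reads value: %s' % include_reads)
    with open(out_fp, 'wb') as out:
        for fp in selected:
            with open(fp, 'rb') as f:
                shutil.copyfileobj(f, out)
    return out_fp  # single fastq handed to humann2
```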

Correct final files for KneadData

Currently the plugin is sending back the trimmed files (see here), but those are not the correct ones because they haven't been filtered against the human genome. See here for the actual correct filenames.
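
A sketch of how the plugin might pick up the host-filtered outputs instead of the trimmed-only intermediates; the filename patterns are assumptions based on KneadData's usual *_kneaddata_paired_* / *_kneaddata_unmatched_* naming and should be checked against real output:

```python
from glob import glob
from os.path import join


def final_kneaddata_files(out_dir, prefix):
    """Return the decontaminated fastqs, not the *.trimmed.* intermediates
    (patterns assumed from KneadData's output naming)."""
    return sorted(
        glob(join(out_dir, '%s*kneaddata_paired_*.fastq' % prefix)) +
        glob(join(out_dir, '%s*kneaddata_unmatched_*.fastq' % prefix)))
```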
