Qiita (canonically pronounced cheetah) is an analysis environment for microbiome (and other "comparative -omics") datasets.
This package includes the shogun functionality for Qiita.
License: BSD 3-Clause "New" or "Revised" License
After trimming, we can conceivably have four fastq file types derived from a single original read pair:
Each of these read types would be useful to track independently for downstream applications -- for instance, in assembly or read mapping, type 1 and 2 above will frequently need to be provided in a set, while type 3 and 4 might optionally be concatenated and provided as independent single-ended reads.
Another scenario: when I'm running HUMAnN2, I typically ignore reverse reads altogether and only analyze the Type 1 and Type 3 forward reads after trimming.
KneadData plugin should be able to:
It will simplify setup for new devs.
We need to add an extra step to humann2, humann2_split_stratified_table, so that we generate stratified tables. However, the current version doesn't support biom, so we will add the step once that support is available.
I think it would be great to have an object / file in Qiita that we can use to keep track of the various fastq permutations. In particular, once we start doing cleaning and read trimming, we start having to keep track of a bunch of files for each sample:
```python
from os.path import join

from qiita_client import ArtifactInfo

# `samples`, `out_dir` and `prefix` come from the enclosing function (not shown)
ainfo = []
for run_prefix, sample, f_fp, r_fp in samples:
    sam_out_dir = join(out_dir, run_prefix)
    if r_fp:
        # paired-end input: up to four output files per sample
        ainfo += [
            ArtifactInfo('clean paired R1', 'per_sample_FASTQ',
                         [(join(sam_out_dir, '%s_paired_1.fastq' % prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean paired R2', 'per_sample_FASTQ',
                         [(join(sam_out_dir, '%s_paired_2.fastq' % prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean unpaired R1', 'per_sample_FASTQ',
                         [(join(sam_out_dir, '%s_unmatched_1.fastq' % prefix),
                           'per_sample_FASTQ')]),
            ArtifactInfo('clean unpaired R2', 'per_sample_FASTQ',
                         [(join(sam_out_dir, '%s_unmatched_2.fastq' % prefix),
                           'per_sample_FASTQ')])]
    else:
        # single-end input: one cleaned file per sample
        ainfo += [
            ArtifactInfo('cleaned reads', 'per_sample_FASTQ',
                         [(join(sam_out_dir, '%s.fastq' % prefix),
                           'per_sample_FASTQ')])]
```
This isn't such a big deal when working on a single plugin, but negotiating between tools can be confusing. E.g. what if I ran trimmomatic on paired reads previously, and now I have four fastq files? If I'm working on a downstream plugin, how can I be sure what kind of input I'm getting? In my own work, as long as I have lots of good quality data, I'll often just ignore unpaired reads. But with shallow sequencing and poorer overall quality (e.g. high throughput Nextera, often), sometimes those unpaired reads become important.
I'm imagining that having a consistent (possibly per-sample) compressed file format to organize these fastqs in a dependable way would really simplify development. Instead of planning your Qiita plugin to accept various input files and moving those around in artifacts, each plugin would just need to reference a method to generate the kind of serialized file format necessary for the particular tool.
I spoke a bit with @josenavas and @wasade about this, and I can see that it might be possible to represent something like this entirely within Qiita using a reworked artifact type that does a better job of keeping track of the various permutations of fastqs. However, it would also be super handy to have an HDF5 representation for use in other contexts as well (I would love to have this in my snakemake workflows! :) ).
One possible structure @wasade and I came up with:
The index / fwd / rev reads would all be stored in the same order, with unpaired fwd / rev reads being stored as null strings (and maybe some sort of masking bit to indicate pairing). Pulling appropriately matched forward / reverse / paired serialized fastq format files out could then be done temporarily on a per-tool basis.
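To make the idea concrete, here is a minimal in-memory sketch of that layout, not an HDF5 implementation. The class name `SampleReads` and its methods are hypothetical, and quality strings are left out to keep it short. It only illustrates same-order storage, empty strings for missing mates, and a pairing mask.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SampleReads:
    """Parallel lists: position i in every list describes the same read pair.

    Hypothetical container, for illustration only.
    """
    index: List[str] = field(default_factory=list)
    fwd: List[str] = field(default_factory=list)
    rev: List[str] = field(default_factory=list)
    paired: List[bool] = field(default_factory=list)  # the "masking bit"

    def add(self, read_id: str, fwd: str = "", rev: str = "") -> None:
        # A missing mate is stored as an empty string so the order is kept.
        self.index.append(read_id)
        self.fwd.append(fwd)
        self.rev.append(rev)
        self.paired.append(bool(fwd) and bool(rev))

    def paired_reads(self) -> List[Tuple[str, str, str]]:
        # Only records where both mates survived trimming.
        rows = zip(self.index, self.fwd, self.rev, self.paired)
        return [(i, f, r) for i, f, r, p in rows if p]

    def forward_reads(self, include_unpaired: bool = True) -> List[Tuple[str, str]]:
        # Forward reads, optionally including those that lost their mate.
        rows = zip(self.index, self.fwd, self.paired)
        return [(i, f) for i, f, p in rows if f and (p or include_unpaired)]
```

A tool that needs matched pairs would call `paired_reads()`, while a forward-only analysis would call `forward_reads()`, so each plugin pulls out just the view it needs.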
Thoughts?
Right now it is doing 2 calls, one with the FWD reads and one with the REV reads. However, this generates a downstream error. After discussing with @tanaes and @antgonza, the solution would be to add a new parameter ('include-reads' or something better) of type choice with 2 acceptable values: 'all reads' and 'forward only'.
If 'all reads', it concatenates all 4 files from the KneadData output and runs humann2. If 'forward only', it concatenates the 2 forward files from the KneadData output and runs humann2.
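The selection logic could look roughly like the sketch below. The function names `kneaddata_files_for_humann2` and `concatenate` are hypothetical, and the file suffixes are assumed to match the `_paired_N` / `_unmatched_N` names used in the artifact snippet above. The humann2 call itself is left out.

```python
import shutil
from os.path import join

# Assumed mapping from the proposed parameter values to KneadData outputs.
SUFFIXES = {
    'all reads': ['paired_1', 'paired_2', 'unmatched_1', 'unmatched_2'],
    'forward only': ['paired_1', 'unmatched_1'],
}


def kneaddata_files_for_humann2(sam_out_dir, prefix, include_reads='all reads'):
    """Return the KneadData output paths to feed to humann2 (hypothetical)."""
    if include_reads not in SUFFIXES:
        raise ValueError('Unknown include-reads value: %r' % include_reads)
    return [join(sam_out_dir, '%s_%s.fastq' % (prefix, suffix))
            for suffix in SUFFIXES[include_reads]]


def concatenate(in_fps, out_fp):
    """Concatenate the input files, in order, into a single output file."""
    with open(out_fp, 'wb') as out:
        for fp in in_fps:
            with open(fp, 'rb') as fh:
                shutil.copyfileobj(fh, out)
```

With this shape, humann2 is invoked once per sample on the concatenated file rather than once per read direction.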
It would be great if readfq had unit tests. Alternatively, it may be easier to replace it and use the parser from scikit-bio.
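As a rough idea of what such tests might cover, here is a sketch written against a stand-in parser. `parse_fastq` is hypothetical and is not the plugin's readfq. It assumes strict 4-line records and yields `(name, sequence, quality)` tuples, so the assertions would need adapting to whatever readfq actually returns.

```python
import io
import unittest


def parse_fastq(handle):
    """Stand-in 4-line FASTQ parser (hypothetical, for illustration only)."""
    while True:
        header = handle.readline().rstrip('\n')
        if not header:
            return  # end of file
        seq = handle.readline().rstrip('\n')
        plus = handle.readline().rstrip('\n')
        qual = handle.readline().rstrip('\n')
        if not header.startswith('@') or not plus.startswith('+'):
            raise ValueError('Malformed FASTQ record: %r' % header)
        if len(seq) != len(qual):
            raise ValueError('Sequence/quality length mismatch: %r' % header)
        yield header[1:], seq, qual


class ParseFastqTests(unittest.TestCase):
    def test_two_records(self):
        data = io.StringIO('@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n')
        self.assertEqual(list(parse_fastq(data)),
                         [('r1', 'ACGT', 'IIII'), ('r2', 'GG', 'II')])

    def test_empty_input(self):
        self.assertEqual(list(parse_fastq(io.StringIO(''))), [])

    def test_length_mismatch(self):
        with self.assertRaises(ValueError):
            list(parse_fastq(io.StringIO('@r1\nACGT\n+\nII\n')))
```

Saved as a module, the tests can be run with `python -m unittest`. The same three cases (well-formed input, empty input, malformed record) would apply equally to a scikit-bio based replacement.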
The current version only outputs ribosomal and non-ribosomal reads per sample (format: per_sample_fastq). A log or summary file could be added to indicate the proportion of reads aligned to each database.