Code Monkey home page Code Monkey logo

qiita-files's People

Watchers

 avatar  avatar  avatar  avatar  avatar

qiita-files's Issues

HDF5-based Fastq container

Problem:

Shotgun metagenomic analysis create a large number of partially-redundant Fastq files which occupy a lot of disk space.

As an example, a typical workflow looks something like this, per sample:

  • Initial Fastq.gz (1.1 Gb)
  • QC/trimmed Fastq.gz (1.0 Gb)
  • Host-filtered Fastq.gz (0.9 Gb)

So for a typical workflow, the initial data is likely to be ~triplicated on disk. Any additional comparison of parameter values, for example different trimming operations, would require additional duplication of data.

Compounding the problem, different programs use different 'flavors' of Fastq format.

Trimmomatic outputs (up to) four files per trimming run:

  • R1 (surviving pairs, forward)
  • R2 (surviving pairs, reverse)
  • U1 (surviving forward with no surviving reverse mate)
  • U2 (surviving reverse with no surviving forward mate)

Other trimming programs will output interleaved forward/reverse pairs, either discarding pairs where one mate was filtered or inserting empty sequence.

Analysis programs also sometimes require either interleaved format or non-interleaved format. Additionally, some programs can and some cannot accept gzipped input files. In a given analysis, it might be necessary to take a single trimmed, host-filtered Fastq file and reprocess it several different ways on disk to satisfy different downstream programs.

Proposed solution:

An HDF5-based Fastq container format that can read multiple flavors of Fastq sequence file, store them in a compressed but accessible structure, and efficiently write temporary Fastq sequences files of various flavors for use by downstream tools.

Desired characteristics

  1. Compressed data structure
  2. Simple, fast command-line methods to read and write various Fastq flavors
  3. Efficient read-name indexing for selective retrieval of reads (e.g. for filtering from a BAM alignment)
  4. Compact, additive representation of trimming or filtering steps
  5. Ability to scale up or down container to include multiple additional files (e.g. multiple lanes from a sample, or multiple samples in a run)

I'll address each of these briefly.

1. Compressed data structure

We should get this for free with HDF5

2. Simple, fast methods for writing Fastq

Because these would be a purely intermediate format, we're trading compute time necessary to generate the appropriate Fastq type for storage. Thus we'll likely have to write a complete Fastq file every time we want to interact with the object: generating a new file should be fast as possible.

3. Efficient read-name indexing

If the unique read names can be indexed in an accessible way, it will allow selective retrieval of data subsets in cases where you only want to look at a portion of the reads.

4. Compact, additive trimming info

Rather than storing entirely new Fastq files for new trimming steps, it would be great to just store a lightweight mask of the trimmed positions -- this mask could be added to the original sequence container to produce the intermediate, trimmed Fastq for analysis.

It might also be useful if these could be additive -- for example, I might want to trim a Fastq for adapters before host filtering, but then be able to propagate the host trimming information back with reference to the original sequences.

Because trimming programs don't necessarily yield rich per-sequence information about trimming outcomes, there will need to be some simple aligner (or similar) that can calculate the trim positions by comparing the original and trimmed sequence.

5. Container scaling

Frequently, a single biological sample will get sequenced on multiple lanes, generating several starting Fastq files. It would be nice to be able to treat these as a single file while maintaining the run information.

Proposed HDF5 file structure

A basic representation would store a single Fastq file, with each read set off the instrument stored as an N x M array (number of reads by read length) for fast streaming writes. The different datasets (fwd, rev, i1 and i2) will be in the same order. That is, row 1 belongs to the same "read".

./fwd/
	<seq : N x M array of char>
	<qual : N x M array of int>
./rev/
	<seq : N x M array of char>
	<qual : N x M array of int>
./i1/
	<seq : N x M array of char>
	<qual : N x M array of int>
./i2/
	<seq : N x M array of char>
	<qual : N x M array of int>
./read names/
	/<names : groups of sorted(?) read names>/
		<idx : index of (original) position of corresponding seq or qual>

Trimming results could be stored as a separate HDF5 referencing the original data structure:

/trim_result1/
	/<link to dataset>/
		[idx, fwd_start, fwd_end, rev_start, rev_end]

Optionally, the trimming results could also be added to the original HDF5 file as a new group to store all the information in a single file, which will simplify data sharing.

Normalize demux API

The current API for the demux mainly accepts an h5py.File as parameter, which puts the burden of opening the hdf5 file to the consumer of the API. I think it will make more sense to actually pass in the filepath of the hdf5 file and handle the file opening/closing internally. This will allow us to parallelize more stuff internally, as we did for the to_per_sample_files function (which is the only exposed API function that allows a filepath rather than a h5py.File object.

@wasade, @antgonza Any objections to this?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.