Compare speed of FASTQ vs uBAM for various applications
License: MIT License
- Sample- and read-level metadata
- Single file for both single- and paired-end reads
- Different query names for reads 1 and 2: store the suffix(es) in tags
- FASTQ will always be superior unless BAM allows different compression schemes
- More compression options for FASTQ
Naive/serial decompression speed
Parallel decompression/parsing/processing
- Simple task (compute nucleotide composition)
- Add random sleep to simulate more complex task
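The simple benchmark task above could look roughly like the following sketch. It assumes plain four-line FASTQ records; the names `read_fastq`, `nucleotide_composition`, and `max_sleep` are hypothetical and not taken from the repository.

```python
import io
import random
import time
from collections import Counter
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ text handle."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        name, seq, _, qual = (line.rstrip("\n") for line in record)
        yield name[1:], seq, qual


def nucleotide_composition(records, max_sleep=0.0):
    """Count bases over all records; optionally sleep to mimic a heavier task."""
    counts = Counter()
    for _, seq, _ in records:
        counts.update(seq)
        if max_sleep > 0:
            # Assumed stand-in for "more complex" per-record work.
            time.sleep(random.uniform(0, max_sleep))
    return counts


fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n")
composition = nucleotide_composition(read_fastq(fastq))
```

Passing a small positive `max_sleep` shifts the balance from decompression-bound to compute-bound, which is where the parallel variants should start to matter.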
Multi-threading (FASTQ and uBAM)
- Share single file handle across threads
- Synchronize on file to read batch
- Record padding for deferred parsing
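A minimal sketch of the shared-handle approach, assuming gzip-compressed FASTQ and Python's standard `threading` module. The lock is held only while raw lines are read; parsing happens afterwards in each thread. `BATCH_SIZE` and the function names are hypothetical, and record padding is not modeled here.

```python
import gzip
import io
import threading
from collections import Counter
from itertools import islice

BATCH_SIZE = 2  # records per batch; hypothetical tuning knob


def read_batch(handle, lock, batch_size=BATCH_SIZE):
    """Read up to batch_size raw four-line records while holding the lock."""
    with lock:
        return list(islice(handle, 4 * batch_size))


def worker(handle, lock, totals, totals_lock):
    local = Counter()
    while True:
        lines = read_batch(handle, lock)
        if not lines:
            break
        # Deferred parsing: only the read was synchronized; parse outside the lock.
        for seq in lines[1::4]:
            local.update(seq.strip())
    with totals_lock:
        totals.update(local)


def threaded_composition(handle, n_threads=4):
    lock, totals_lock = threading.Lock(), threading.Lock()
    totals = Counter()
    threads = [
        threading.Thread(target=worker, args=(handle, lock, totals, totals_lock))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


raw = gzip.compress(b"@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n@r3\nTTTT\n+\nIIII\n")
with gzip.open(io.BytesIO(raw), "rt") as fh:
    totals = threaded_composition(fh)
```

In CPython the per-record counting still runs under the global interpreter lock, so a sketch like this mainly helps when the work inside the loop releases the lock (for example decompression or I/O).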
Multi-processing (FASTQ and uBAM)
- Main process reads file and adds batches to queue
- Sub-processes pull batches off queue and process
- Record padding for deferred parsing (main process vs subprocess)
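The reader/worker layout above can be sketched with the standard `multiprocessing` module. This is an illustration under stated assumptions, not the repository's implementation: it uses the POSIX-only "fork" start method to stay short, parses in the sub-processes rather than the main process, and all names are hypothetical.

```python
import io
import multiprocessing as mp
from collections import Counter
from itertools import islice

# Assumption: "fork" (POSIX only) keeps the sketch short. With "spawn", the
# driver code below would need an `if __name__ == "__main__":` guard.
ctx = mp.get_context("fork")
SENTINEL = None


def consume(batches, results):
    """Sub-process: pull raw batches off the queue, parse them, count bases."""
    counts = Counter()
    while True:
        batch = batches.get()
        if batch is SENTINEL:
            break
        for seq in batch[1::4]:
            counts.update(seq.strip())
    results.put(counts)


def parallel_composition(handle, n_workers=2, batch_size=2):
    batches = ctx.Queue(maxsize=8)  # bounded, so the reader cannot run far ahead
    results = ctx.Queue()
    workers = [
        ctx.Process(target=consume, args=(batches, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    # Main process: read raw lines only; parsing is deferred to the workers.
    while True:
        batch = list(islice(handle, 4 * batch_size))
        if not batch:
            break
        batches.put(batch)
    for _ in workers:
        batches.put(SENTINEL)  # one stop marker per worker
    total = Counter()
    for _ in workers:
        total.update(results.get())
    for w in workers:
        w.join()
    return total


fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n")
mp_totals = parallel_composition(fastq)
```

Every batch is pickled on its way through the queue, so batch size trades queue overhead against load balancing; that cost is one of the things the benchmark would need to measure.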
Unsynchronized parallel processing of bgzf blocks (uBAM only)
- Main process adds block offsets and lengths to queue
- Threads/subprocesses pull block coordinates off the queue, read, and process
- Use of index vs main process reading block offsets at runtime
- Memory mapping
- Limit read-ahead to keep memory to fixed size
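The block-level scheme above relies on each BGZF block recording its own size in the gzip extra field (the `BC` subfield), so block coordinates can be found without decompressing anything. The sketch below scans those headers and inflates blocks independently in a thread pool. `make_block` is a hypothetical helper that exists only to build test data, and the sketch stops at raw payloads: BAM records can straddle block boundaries, which a real implementation would have to stitch together.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_block(payload):
    """Hypothetical helper: wrap payload in a single BGZF block (test data only)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    deflated = comp.compress(payload) + comp.flush()
    bsize = 18 + len(deflated) + 8 - 1  # total block size minus one
    header = struct.pack(
        "<4BI2BH2BHH", 31, 139, 8, 4, 0, 0, 255, 6, 66, 67, 2, bsize
    )
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + deflated + trailer


def scan_blocks(data):
    """Yield (offset, length) per block by reading BSIZE from each header."""
    offset = 0
    while offset < len(data):
        xlen = struct.unpack_from("<H", data, offset + 10)[0]
        extra = data[offset + 12 : offset + 12 + xlen]
        pos = 0
        while pos < xlen:
            si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
            if (si1, si2) == (66, 67):  # the 'B', 'C' subfield
                bsize = struct.unpack_from("<H", extra, pos + 4)[0]
                break
            pos += 4 + slen
        else:
            raise ValueError("no BC subfield; not a BGZF block")
        yield offset, bsize + 1
        offset += bsize + 1


def inflate(data, offset, length):
    """Decompress one block; wbits=31 tells zlib to expect a gzip wrapper."""
    return zlib.decompress(data[offset : offset + length], wbits=31)


data = make_block(b"hello ") + make_block(b"world")
coords = list(scan_blocks(data))
with ThreadPoolExecutor(max_workers=2) as pool:
    payloads = list(pool.map(lambda c: inflate(data, *c), coords))
```

Because `scan_blocks` only slices and unpacks, the same code should work on an `mmap` object in place of `data`, and the list of coordinates could equally come from a precomputed index instead of a runtime scan.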
- BGZIP FASTQ
- New custom format
- Column-oriented
- Different compression scheme for each column
- Decompress columns in parallel
- Don't decompress unused columns
- Single file
- Eliminate redundant storage of query name
- Separate columns for reads 1 and 2 (self-describing)
- Optimized storage of long reads
- Easy interop w/ Arrow
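To make the column-oriented idea concrete, here is a toy sketch using only standard-library codecs. The codec assigned to each column is an arbitrary assumption for illustration (the notes do not say which scheme suits which column), and `encode_columns` and `decode_columns` are hypothetical names.

```python
import bz2
import lzma
import zlib

# Hypothetical codec assignment: (compress, decompress) per column.
CODECS = {
    "name": (zlib.compress, zlib.decompress),
    "sequence": (lzma.compress, lzma.decompress),
    "quality": (bz2.compress, bz2.decompress),
}


def encode_columns(records):
    """Split (name, sequence, quality) records into independently compressed columns."""
    columns = dict(zip(CODECS, zip(*records)))
    return {
        col: CODECS[col][0]("\n".join(values).encode("ascii"))
        for col, values in columns.items()
    }


def decode_columns(encoded, wanted):
    """Decompress only the requested columns; the rest stay compressed."""
    return {
        col: CODECS[col][1](encoded[col]).decode("ascii").split("\n")
        for col in wanted
    }


records = [("r1", "ACGT", "IIII"), ("r2", "GGCA", "IIII")]
encoded = encode_columns(records)
sequences = decode_columns(encoded, ["sequence"])
```

A task such as nucleotide composition would then touch only the `sequence` column, and the columns could be decompressed in parallel since they share no state.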