streammd's Introduction

streammd

Single-pass probabilistic duplicate marking of alignments with a Bloom filter.

Input is a SAM format stream: a valid header followed by reads in qname-grouped order (e.g. the output of bwa mem). Detected duplicates have the SAM FLAG 0x400 bit set and a PG:Z: tag added, and summary metrics are written to a file at the end of processing.

Features

  • Fast — with default settings streammd is ~5x faster than Picard MarkDuplicates.
  • Low memory footprint even for large libraries — with default settings streammd requires just 4GiB to process 1B templates.
  • High concordance with Picard MarkDuplicates metrics.
  • Soft-clipped reads are correctly handled.
  • Tunable target false positive rate.
  • Streaming input and output.

Limitations

Due to the nature of the single-pass operation:

  • streammd retains the first encountered template as the original and marks subsequently encountered copies as duplicates. This differs from Picard MarkDuplicates which marks those of lowest quality as the duplicates.
  • streammd does not differentiate between optical duplicates and PCR duplicates.

Additionally, the current implementation handles SAM format input only.

Requirements

Development

  • autotools

Installation

  • Requires C++17 with <charconv>. gcc >= 8.1 or clang >= 7 should work.

Install

Download a distribution tarball streammd-[x.y.z].tar.gz from https://github.com/delocalizer/streammd/releases

# extract
tar -xzvf streammd-[x.y.z].tar.gz
cd streammd-[x.y.z]
# configure
./configure
# build
make
# run unit tests (optional)
make check
# install
sudo make install

The configure script and makefiles are generated by autotools and should be quite portable. The usual variables and options may be used to customize the build and install steps e.g.

./configure --prefix=$HOME/.local
make CXX='clang++' CXXFLAGS='-g -O3'

Usage

  1. get help
streammd --help
Usage: streammd [-h] [--input INPUT] [--output OUTPUT] [--fp-rate FP_RATE] [--mem MEM]
                [--allow-overcapacity] [--metrics METRICS_FILE] [--remove-duplicates]
                [--show-capacity] [--single] [--strip-previous]

Read a SAM file from STDIN, mark duplicates in a single pass and stream processed records to STDOUT.
Input must begin with a valid SAM header followed by qname-grouped records. Default log level is
'info' — set to something else (e.g. 'debug') via SPDLOG_LEVEL environment variable.

Optional arguments:
  -h, --help            	shows help message and exits 
  -v, --version         	prints version information and exits 
  --input INPUT         	Input file. [default: STDIN] 
  --output OUTPUT       	Output file. [default: STDOUT] 
  -p, --fp-rate FP_RATE 	The maximum acceptable marginal false-positive rate. [default: 1e-06]
  -m, --mem MEM         	Memory allowance for the Bloom filter, e.g "4GiB". Both binary
                                (kiB|MiB|GiB) and decimal (kB|MB|GB) formats are understood. As a
                                result of implementation details, a value that is an exact power
                                of 2 (512MiB, 1GiB, 2GiB etc) gives a modest processing speed
                                advantage (~5%) over neighbouring values. [default: "4GiB"]
  --allow-overcapacity  	Warn instead of error when Bloom filter capacity is exceeded.
                                [default: false] 
  --metrics METRICS_FILE	Output metrics file. [default: "streammd-metrics.json"]
  --remove-duplicates   	Omit detected duplicates from the output. 
  --show-capacity       	Do no work, just print the capacity of the Bloom filter that would
                                be constructed with the given --fp-rate and --mem values. 
  --single              	Accept single-ended reads as input. [default: paired-end] 
  --strip-previous      	Unset duplicate flag for any reads that have it set and are no
                                longer considered duplicate. Only ever required if records have
                                previously been through a duplicate marking step. [default: false]
  2. mark duplicates on an input SAM record stream
bwa mem ref.fa r1.fq r2.fq|streammd

Notes

Bloom filter capacity.

Bloom filter capacity n (in this context, the number of templates that can be effectively processed) depends on the desired maximum false positive rate p as well as the memory allowance. For a given p you should set the memory to a value sufficient to handle the expected number of templates. Run the tool with --show-capacity to view Bloom filter capacity for any -p and -m values.

p          mem      n
1.00E-02   128MiB   1.07E+08
1.00E-04   256MiB   1.09E+08
1.00E-06   512MiB   1.24E+08
1.00E-02   1GiB     8.56E+08
1.00E-04   2GiB     8.72E+08
1.00E-06   4GiB     9.94E+08

As a guide: 60x human WGS with 2x150bp paired-end sequencing comprises n ≈ 6.00E+08 templates (roughly 60 × 3×10^9 bp of genome / 300 bp per template), and the default memory setting of 4GiB is sufficient to process this at the default false positive rate of 1.00E-06.
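
For reference, the capacities in the table follow from standard Bloom filter algebra. The sketch below is illustrative only (it is not the streammd source) but it reproduces the table values from p, the memory allowance, and the fixed k = 10 discussed under Hash functions below.

// Standard Bloom filter relation p = (1 - exp(-k*n/m))^k, solved for n:
//   n = -(m/k) * ln(1 - p^(1/k))
#include <cmath>
#include <cstdint>
#include <cstdio>

// m_bits: filter size in bits; k: number of hash functions.
uint64_t capacity(double p, double m_bits, unsigned k) {
  return static_cast<uint64_t>(
      -(m_bits / k) * std::log(1.0 - std::pow(p, 1.0 / k)));
}

int main() {
  double m_bits = 4.0 * 1024 * 1024 * 1024 * 8;  // 4GiB expressed in bits
  std::printf("%llu\n",
              static_cast<unsigned long long>(capacity(1e-6, m_bits, 10)));
  // prints ~9.94e+08, matching the last row of the table above
  return 0;
}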

False-positive rate.

-p, --fp-rate sets the maximum acceptable marginal false-positive rate. This is the threshold at which the tool raises an error: if, with n items already stored, the predicted FP probability of a newly tested item exceeds this value, processing stops. This is not the same as the true bulk FP rate over all items processed by the Bloom filter, which will always be lower as long as we stop adding items once the marginal rate is exceeded.

The --allow-overcapacity option overrides this default error behaviour, generating only a warning when the target maximum marginal FP rate is exceeded. This is not recommended for general use.
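
Concretely, and as an assumed formulation rather than a quote from the streammd source: with m filter bits, k hash functions and n items already stored, the predicted false-positive probability of the next membership test is approximately (1 - exp(-k*n/m))^k, and it is this marginal quantity that is compared against --fp-rate.

#include <cmath>
#include <cstdint>

// Predicted FP probability of the *next* test after n insertions into a
// filter of m_bits bits using k hash functions (standard approximation;
// illustrative only, not the streammd source).
double marginal_fp_rate(uint64_t n, double m_bits, unsigned k) {
  return std::pow(1.0 - std::exp(-(static_cast<double>(k) * n) / m_bits), k);
}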

Hash functions.

The current implementation of streammd uses a fixed number of hash functions (k=10) regardless of the value of p (--fp-rate) or m (--mem). Other implementations are certainly possible, such as allowing a free choice of k, or using the value of k that maximizes capacity n for a given m and p (i.e. the capacity-optimal value). Each of these has its own drawbacks. In the case of a free choice, the user must understand and experiment with the non-trivial relationship between k and filter capacity, false-positive rate, memory use, and performance to pick a sensible value. Most users probably just want reasonable performance and minimal parameter space searching, given their computational constraint (mem) and analytic goal (FPR).

Moreover, the value of k that maximizes n for a given p and m is not a great default for performance reasons because Bloom filter capacity is only weakly dependent on k close to the capacity-optimal value, but computational cost is linear in k. For example, with defaults of m=4GiB and p=1e-6, the capacity-optimal Bloom filter requires k=20 to store 1.2e9 items, but halving the number of hashes to k=10 (doubling the speed) reduces capacity by just 17% to 1e9 items.

In summary, a fixed k=10 is a pragmatic compromise between performance and memory given values of p and n likely to be useful for large short-read datasets.
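
To see the weak dependence of capacity on k near the optimum, the same formula as in the capacity sketch above can be swept over k. This is illustrative only, for the default m = 4GiB and p = 1e-6:

#include <cmath>
#include <cstdio>

int main() {
  double m_bits = 4.0 * 1024 * 1024 * 1024 * 8;  // 4GiB in bits
  double p = 1e-6;
  for (unsigned k = 5; k <= 25; k += 5) {
    double n = -(m_bits / k) * std::log(1.0 - std::pow(p, 1.0 / k));
    std::printf("k=%2u  n=%.2e\n", k, n);
  }
  // n peaks near k = 20 (~1.2e9) but is already ~1.0e9 at k = 10, while
  // every insert/lookup costs k hash evaluations.
  return 0;
}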

Metrics

  • TEMPLATES = total count of templates seen.
  • TEMPLATES_UNMAPPED = count of templates containing no mapped read.
  • TEMPLATES_MARKED_DUPLICATE = count of templates marked duplicate.
  • TEMPLATE_DUPLICATE_FRACTION = TEMPLATES_MARKED_DUPLICATE / (TEMPLATES - TEMPLATES_UNMAPPED)
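
As a purely illustrative example (field names follow the metric names above, the values are invented, and the exact layout of the JSON written by streammd may differ), a metrics file might look like:

{
  "TEMPLATES": 600000000,
  "TEMPLATES_UNMAPPED": 1200000,
  "TEMPLATES_MARKED_DUPLICATE": 90000000,
  "TEMPLATE_DUPLICATE_FRACTION": 0.1503
}

Here 90000000 / (600000000 - 1200000) ≈ 0.1503, per the formula above.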

Pipelining

streammd can generate output at a very high rate. For efficient pipelining, downstream tools should be run with sufficient CPU resources to handle their input. For example, if you want to write the output to BAM format using samtools view, specify extra compression threads for optimal throughput:

bwa mem ref.fa r1.fq r2.fq|streammd|samtools view -@2 -o some.MD.bam

Citing

Please consider citing https://doi.org/10.1093/bioinformatics/btad181 when referring to streammd in publications.


streammd's Issues

Improve help for --fp-rate option

Make it clear in the help and documentation that this is the maximum acceptable marginal FP rate, not the true bulk FP rate that will be achieved after processing, which will be lower.

Documentation

Replace current README.md (which is essentially dev notes) with proper install instructions and usage.

Keep performance profiling, perhaps.

got 1 primary alignment(s). Input is not paired or not qname-grouped?

Hi,

I got this error when running streammd 4.0.2,
[2023-01-07 21:40:43.242] [main] [info] BloomFilter initialized with p=1e-06 n=993917924 m=34359738368 k=10
[2023-01-07 21:40:43.242] [main] [info] BloomFilter capacity: 993917924 items
[2023-01-07 21:40:43.993] [main] [error] A00601:371:HLNYMDSXY:3:1101:3983:19413: got 1 primary alignment(s). Input is not paired or not qname-grouped?

The pipeline I used (for huge sequencing data):
step 1: split fastq into many pieces
step 2: bwa mem map each piece separately
step 3: samtools sort each piece separately
step 4: samtools merge pieces into a single bam
step 5: samtools-1.9 view -@ 4 -h F1.sort.bam|streammd --metrics F1.dedup.metrics |samtools-1.9 view -@ 4 -bS -o F1.dedup.bam

And got the error I have mentioned above.

Best,
Kun

stored items count incorrect

The fractional capacity reported in the log message after processing is calculated as templates/n, where templates is actually "templates seen". The numerator should instead be templates - templates_marked_duplicate, i.e. "unique templates".

This has no effect on duplicate marking or metrics but can give an unnecessary exit due to overcapacity because the true used capacity of the Bloom filter is lower than stated.

investigate alternatives to pysam.AlignedSegment

For convenience we currently perform template end calculations using pysam.AlignedSegment.
We do however incur a fair deserialize/serialize cost to get reads into and out of that form.
Investigate using just the raw record.
Most of it is pretty easy:
split by TAB, cast FLAG and POS to int, and use bitwise masks on the flag for all the is_mapped, is_forward, is_first etc. checks.
Use a regex to capture leading (fwd) and trailing (rev) soft clips.
The only mildly tricky bit is doing the CIGAR calculation properly to get reference_end; it needs to handle D and I ops (see the sketch below).
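
A minimal sketch of the reference_end part, assuming the usual SAM conventions (illustrative only, not the streammd or pysam source): only the CIGAR ops M, D, N, = and X consume reference, so I and S contribute nothing to the end coordinate.

#include <cstdint>
#include <cstdlib>
#include <string>

// Return one past the last reference base covered by the alignment, given
// the 1-based POS and the CIGAR string. Returns pos for an unmapped ("*") CIGAR.
int64_t reference_end(int64_t pos, const std::string& cigar) {
  if (cigar == "*") return pos;
  int64_t end = pos;
  const char* c = cigar.c_str();
  while (*c) {
    char* next = nullptr;
    long len = std::strtol(c, &next, 10);  // op length
    switch (*next) {                       // op code
      case 'M': case 'D': case 'N': case '=': case 'X':
        end += len;                        // these ops consume reference
        break;
      default:                             // I, S, H, P do not
        break;
    }
    c = next + 1;
  }
  return end;
}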

Versioning

Use branches and tags, and bump versions when required.

some tests fail when compiling with clang

Tests of SamRecord.update_dup_status fail when compiling with clang...
Looks like pgtag_ is empty, and therefore pgidx_ that is set in parse() is actually the value of PNEXT instead of either the position of PG:Z (if present) or the end of the record.

something to do with the inline static const declaration and initialization of pgtag_ ???

Problem with installation

Hello!
I'd like to use your tool as an alternative to Picard Markdup, but I have a problem building the tool.
I use gcc version 7.5.0 (requirements gcc >= 7.1), but the make command returns an error:

depbase=`echo bloomfilter.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT bloomfilter.o -MD -MP -MF $depbase.Tpo -c -o bloomfilter.o bloomfilter.cxx &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo markdups.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT markdups.o -MD -MP -MF $depbase.Tpo -c -o markdups.o markdups.cxx &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo streammd.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DPACKAGE_NAME=\"streammd\" -DPACKAGE_TARNAME=\"streammd\" -DPACKAGE_VERSION=\"4.2.1\" -DPACKAGE_STRING=\"streammd\ 4.2.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"streammd\" -DVERSION=\"4.2.1\" -I.    -Wall -Wextra -std=c++17 -I ../external -g -O2 -MT streammd.o -MD -MP -MF $depbase.Tpo -c -o streammd.o streammd.cxx &&\
mv -f $depbase.Tpo $depbase.Po
In file included from streammd.cxx:4:0:
../external/argparse/argparse.hpp:36:10: fatal error: charconv: No such file or directory
 #include <charconv>
          ^~~~~~~~~~
compilation terminated.
Makefile:444: recipe for target 'streammd.o' failed
make: *** [streammd.o] Error 1

Could you help me, please?
Best regards,
Polina.

TEMPLATE_DUPLICATE_FRACTION calculation

Currently TEMPLATE_DUPLICATE_FRACTION uses the count of all templates in the denominator; i.e. it's just equal to TEMPLATES_MARKED_DUPLICATE/TEMPLATES. The denominator should however omit templates where neither end is aligned, as they are not assessed for duplicate status.
Whatever formula is used, it should be documented in the help or at least in the source README.

Add streammd to bioconda clouds

Hi,

streammd is a great tool, but it is not so easy to compile in some system environments. Would you please add it to bioconda?

Best,
Kun

align orphan read logic with SAMBLASTER

streammd uses broadly the same strategy as SAMBLASTER to detect and mark duplicates on templates where one end is not mapped:

the signature of the mapped read must match a previously seen orphan

i.e. orphans are compared only to orphans. This is in contrast to MarkDuplicates where orphans may be compared to and match reads in completely mapped templates.

There is, however, one difference between streammd 4.1.7 and SAMBLASTER: streammd currently follows the SAMBLASTER logic from before 0.1.25, i.e.

all orphans are treated as if on forward strand

Starting with SAMBLASTER version 0.1.25:

forward and reverse strand orphans/singletons are allowed to be duplicates of each other

In our test data this marks 5-10% more orphan duplicates. The logic behind this is probably sound, and in any case aligning streammd's logic exactly with the most recent SAMBLASTER makes it easier to compare the tools directly.
