
Benchmarking of WGS-based structural variant callers

This repository contains links to the datasets and the code used for our study, "A comprehensive benchmarking of WGS-based structural variant callers".


How to cite this study

Sarwal, Varuni, et al. "A comprehensive benchmarking of WGS-based structural variant callers." bioRxiv, doi: https://doi.org/10.1101/2020.04.16.045120

Reproducing results

Tools

We have evaluated 12 structural variant tools: Biograph, BreakDancer, CLEVER, DELLY, GASV, GRIDSS, indelMINER, LUMPY, MiStrVar, Pindel, PopDel, and RDXplorer. Details about the tools and instructions for running them can be found in our paper.

We have prepared wrapper scripts to run each of the tools and to produce standardized log files.
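A minimal sketch of what such a wrapper might look like; the caller name, command line, and log format below are placeholders for illustration, not the actual wrappers used in this repository:

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run one SV caller and write a standardized log."""
import subprocess
import sys
import time
from datetime import datetime

def run_caller(name, cmd, log_path):
    """Run `cmd`, recording the start time, exit code, and wall-clock runtime."""
    start = time.time()
    with open(log_path, "w") as log:
        log.write(f"caller\t{name}\n")
        log.write(f"started\t{datetime.now().isoformat()}\n")
        log.flush()  # flush before the child process writes to the same file
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
        log.write(f"exit_code\t{result.returncode}\n")
        log.write(f"runtime_seconds\t{time.time() - start:.1f}\n")
    return result.returncode

if __name__ == "__main__":
    # "sv_caller" and its arguments are placeholders, not a real command line.
    sys.exit(run_caller("sv_caller",
                        ["sv_caller", "--bam", "sample.bam", "--out", "calls.vcf"],
                        "sv_caller.log"))
```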

Data

The raw VCFs produced by the tools can be found here: https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/tree/master/Data/raw_data/mouse/raw_vcf

The custom VCF files, which are the raw VCFs converted to the VCFv4.2 format, can be found here: https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/tree/master/Data/raw_data/mouse

The FASTQ and BAM files used will be made available soon.

Scripts

The script that compares the deletions inferred by each SV caller against the true deletions is available here: https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/blob/master/Scripts/customvcf_mouse.py
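The general idea of such a comparison, matching each predicted deletion to a true deletion when the two intervals overlap reciprocally, can be sketched as follows. The 50% reciprocal-overlap threshold and the interval format are assumptions for illustration, not necessarily the exact criteria used in customvcf_mouse.py:

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two deletions, each given as (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))

def compare(predicted, truth, min_ro=0.5):
    """Count true positives, false positives, and false negatives at a cutoff."""
    matched_truth = set()
    tp = 0
    for call in predicted:
        hit = next((i for i, t in enumerate(truth)
                    if i not in matched_truth and reciprocal_overlap(call, t) >= min_ro),
                   None)
        if hit is not None:
            matched_truth.add(hit)
            tp += 1
    fp = len(predicted) - tp
    fn = len(truth) - len(matched_truth)
    return tp, fp, fn

# Toy example:
pred = [("chr1", 1000, 2000), ("chr1", 5000, 5400)]
true = [("chr1", 1050, 1950), ("chr2", 300, 900)]
print(compare(pred, true))  # (1, 1, 1)
```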

Notebooks and Figures

We have prepared Jupyter Notebooks that utilize the raw data described above to reproduce the results and figures presented in our manuscript.

License

This repository is under MIT license. For more information, please read our LICENSE.md file.

Contact

Please do not hesitate to contact us ([email protected]) if you have any comments, suggestions, or clarification requests regarding the study or if you would like to contribute to this resource.

Contributors

addicted-to-coding, aditya-sarkar441, alu52904, ndarci, ramayyala, smangul1


Issues

FILTER and QUAL should be taken into account

We weren't able to find any clear documentation on how to use QUAL and FILTER, so those were ignored.

If there's no documentation from the caller, then you should use the specification's definitions of those fields:

6. QUAL — quality: Phred-scaled quality score for the assertion made in ALT, i.e. −10log10 prob(call in ALT is wrong). If ALT is '.' (no variant) then this is −10log10 prob(variant), and if ALT is not '.' this is −10log10 prob(no variant). If unknown, the MISSING value must be specified. (Float)

7. FILTER — filter status: PASS if this position has passed all filters, i.e., a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail, e.g. "q10;s50" might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples. '0' is reserved and must not be used as a filter String. If filters have not been applied, then this field must be set to the MISSING value. (String, no white-space or semi-colons permitted, duplicate values not allowed.)

It seems unfair to penalise callers for following the VCF file format specifications.

QUAL could be ignored if you just want the 'default' call set but it can be used for generating ROC curves instead of single points for each caller.

FILTER should definitely be respected as those calls are not part of the 'default' call set for the caller - the caller itself already thinks they are bad for the reason specified in the FILTER field.
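A hedged sketch of how FILTER and QUAL could be respected when benchmarking, using plain-text VCF parsing (column positions follow the VCF specification; sweeping QUAL thresholds is just one way to turn a single call set into a ROC-style curve):

```python
import gzip

def read_vcf_records(path):
    """Yield (chrom, pos, qual, filter) tuples from a possibly gzipped VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            qual = None if fields[5] == "." else float(fields[5])
            yield fields[0], int(fields[1]), qual, fields[6]

def default_call_set(records):
    """Keep only calls the caller itself considers good: FILTER is PASS or '.'."""
    return [r for r in records if r[3] in ("PASS", ".")]

def qual_call_sets(records):
    """Sweep QUAL cutoffs over the default call set, giving one call set per
    threshold; scoring each against the truth set yields a ROC-style curve."""
    quals = sorted({r[2] for r in records if r[2] is not None})
    return {q: [r for r in records if r[2] is not None and r[2] >= q] for q in quals}
```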

bwa mem flags could be problematic

-a can have a significant impact on caller performance. In the case of GRIDSS, results were significantly worse when -a was used.

-Y is similarly problematic: some older callers pre-date the addition of the supplementary flag to the SAM specifications, so they essentially require -Y to be treated fairly. Recent callers may actually adhere to the SAM specifications, so -Y should not be supplied to those callers.

Other callers can rely on specific aligner versions. When I last benchmarked, I found that LUMPY sensitivity was halved due to a minor version change in the bwa used for alignment.

In summary, -a is a problematic default, and -Y may or may not be required depending on the age of the caller.
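One way to make the alignment settings explicit per caller is to build the bwa mem command programmatically, adding -Y only for callers that pre-date supplementary alignments and never passing -a by default. This is a hypothetical sketch; the caller grouping and file names are assumptions, not recommendations from the paper:

```python
import subprocess

# Callers assumed here to need soft-clipped supplementary alignments (-Y);
# this grouping is illustrative only.
LEGACY_CALLERS = {"breakdancer", "pindel"}

def align(caller, ref="ref.fa", r1="reads_1.fq", r2="reads_2.fq",
          out="aligned.sam", threads=8):
    """Run bwa mem with flags chosen for the downstream caller.

    Note: -a (output all alignments) is deliberately not used, as it can
    inflate false-positive rates for some SV callers.
    """
    cmd = ["bwa", "mem", "-t", str(threads)]
    if caller.lower() in LEGACY_CALLERS:
        cmd.append("-Y")  # soft-clip supplementary alignments for older callers
    cmd += [ref, r1, r2]
    with open(out, "w") as sam:
        subprocess.run(cmd, stdout=sam, check=True)
```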

Unable to obtain raw data for reproduction

Hello.

I just saw your preprint go up and wanted to generate ROC curves from the caller QUAL scores to compare to your single-point results, but it appears the raw data is not available. Would you be able to update your repo and readme with:

  • A link to the Google Drive location of the BAMs and/or FASTQs. Some aligners have significantly elevated FP rates when using bwa mem -a, and I am curious what effect your choice of aligner settings has on your results.
  • The raw VCF files as output by each caller. https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/tree/master/Data/raw_data/mouse only includes your processed results, not the raw VCFs. I am unable to generate ROC curves because the essential information (i.e. QUAL and FILTER) was stripped when generating those subset files.

LICENSE file is missing

The license is mentioned in the README and a link to a LICENSE file is provided, but the link appears to be broken.

git lfs quota exceeded

Downloading the raw VCFs is not possible, as the git lfs quota gets exceeded before all files can be downloaded.

Account for microhomology and CIPOS

Many germline events have microhomology at the breakpoint. When matching an event, this should be taken into account, as there are many ways to report the event in VCF. Figure 9 of the VCF specifications outlines this.

Some callers report CIPOS (and, in the case of GRIDSS, the non-standard HOMPOS field), which is what the caller itself thinks the extent of the homology is, but others do not.

Properly matching equivalent variants can be very messy. For example, in the HG002 truth set, a SINE duplication within a SINE repeat (ref=SINE-SINE-SINE, var=SINE-SINE-SINE-SINE) is reported as an INS event after the 3rd SINE, but the short-read callers report it as a DUP of the first SINE. These calls both result in the same sequence, but they're 600bp away from each other! Events such as these are a bit extreme and difficult to handle, but a basic check of homology, and/or respecting the CIPOS reported by the variant caller, will result in more accurate benchmarking results.
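A minimal sketch of breakpoint matching that respects a caller-reported confidence interval (CIPOS) or, when the caller does not report one, a fixed homology window; the default window size is an arbitrary assumption:

```python
def breakpoints_match(called_pos, true_pos, cipos=None, default_window=50):
    """Return True if a called breakpoint is consistent with the true breakpoint.

    `cipos` is the (lower, upper) offset pair from the caller's CIPOS field,
    e.g. (-20, 35); if the caller did not report one, fall back to a fixed
    +/- default_window padding (the 50 bp default is an arbitrary assumption).
    """
    if cipos is not None:
        lo, hi = called_pos + cipos[0], called_pos + cipos[1]
    else:
        lo, hi = called_pos - default_window, called_pos + default_window
    return lo <= true_pos <= hi

# Example: a call at 10,000 with CIPOS=(-10, 120) matches a truth breakpoint at 10,090.
print(breakpoints_match(10_000, 10_090, cipos=(-10, 120)))  # True
```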

The delta between the caller's event length and the actual event length for TPs is a good indicator of how well an SV caller gets the length correct. You should find that, in contrast to overall average lengths, the lengths predicted by BreakDancer do not match the actual event lengths very closely - something that has the potential to change someone's choice of caller.
