mangul-lab-usc / benchmarking_sv
Updated figures for "A benchmarking of WGS-based structural variant callers" paper
License: MIT License
-a can have a significant impact on caller performance. In the case of GRIDSS, results were significantly worse when -a was used.
-Y is similarly problematic: some older callers pre-date the supplementary flag being added to the SAM specification, so they essentially require -Y to be treated fairly. Recent callers may actually adhere to the SAM specification, so -Y must not be supplied to those callers.
Other callers can rely on specific aligner versions. When I last benchmarked, I found that LUMPY sensitivity was halved by a minor version change in the bwa used for alignment.
In summary, -a is a problematic default, and -Y may or may not be required for older callers.
We weren't able to find any clear documentation on how to use QUAL and FILTER, so those were ignored.
If there's no documentation in the caller, then you should use the specification's definitions of those fields:
6. QUAL — quality: Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is
wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10
prob(no variant). If unknown, the MISSING value must be specified. (Float)
7. FILTER — filter status: PASS if this position has passed all filters, i.e., a call is made at this position.
Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g.
“q10;s50” might indicate that at this site the quality is below 10 and the number of samples with data is below
50% of the total number of samples. ‘0’ is reserved and must not be used as a filter String. If filters have
not been applied, then this field must be set to the MISSING value. (String, no white-space or semi-colons
permitted, duplicate values not allowed.)
It seems unfair to penalise callers for following the VCF file format specifications.
QUAL could be ignored if you just want the 'default' call set but it can be used for generating ROC curves instead of single points for each caller.
FILTER should definitely be respected as those calls are not part of the 'default' call set for the caller - the caller itself already thinks they are bad for the reason specified in the FILTER field.
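As a rough illustration of the two points above, the sketch below restricts a call set to the caller's 'default' calls using FILTER and then sweeps a QUAL threshold to get a curve rather than a single point. The names `default_call_set` and `roc_points` are hypothetical, the calls are assumed to have already been parsed and matched against the truth set, and each true positive is assumed to correspond to exactly one truth event. It is not the method used in the paper.

```python
from typing import List, Optional, Tuple

# (QUAL or None if missing, FILTER string, matched a truth event?)
Call = Tuple[Optional[float], str, bool]


def default_call_set(calls: List[Call]) -> List[Call]:
    """Keep calls whose FILTER is PASS, or '.' (filters not applied)."""
    return [c for c in calls if c[1] in ("PASS", ".")]


def roc_points(calls: List[Call], n_truth: int) -> List[Tuple[float, float, float]]:
    """Return (QUAL threshold, precision, recall), sweeping QUAL from high to low.

    Calls with a missing QUAL cannot be ranked and are skipped here.
    """
    scored = sorted((c for c in calls if c[0] is not None),
                    key=lambda c: c[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (qual, _filter, is_tp) in enumerate(scored):
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct QUAL value (after its last call).
        if i + 1 == len(scored) or scored[i + 1][0] != qual:
            points.append((qual, tp / (tp + fp), tp / n_truth))
    return points


calls = [
    (50.0, "PASS", True),
    (40.0, "PASS", False),
    (30.0, "LOW_QUAL", True),   # filtered by the caller itself
    (20.0, ".", True),
]
curve = roc_points(default_call_set(calls), n_truth=4)
```

Plotting precision against recall for each caller's `curve` shows how the caller behaves across its whole QUAL range, with the default call set being the final point.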
Downloading the raw VCFs is not possible as the git lfs quota gets exceeded before all files can be downloaded.
The license is mentioned in the README and a link to a LICENSE file is provided, but the link appears to be broken.
Hello.
I just saw your preprint go up and was wanting to generate ROC curves from the caller qual scores to compare to your single-point results but it appears the raw data is not available. Would you be able to update your repo and readme with :
bwa mem -a
and am curious what the effect of your choice of aligner settings has on your results
Many germline events have microhomology at the breakpoint. When matching an event, this should be taken into account as there are many ways to report the event in VCF. Figure 9 of the VCF specifications outlines this.
Some callers report CIPOS (and in the case of GRIDSS, the non-standard HOMPOS field), which is what the caller itself thinks the extent of the homology is, but others do not.
Properly matching equivalent variants can be very messy. For example, in the HG002 truth set, a SINE duplication within a SINE repeat (ref=SINE-SINE-SINE, var=SINE-SINE-SINE-SINE) is reported as an INS event after the 3rd SINE, but the short read callers report it as a DUP of the first SINE. These calls both result in the same sequence, but they're 600bp away from each other! Events such as these are a bit extreme and difficult to handle, but a basic check of homology, and/or respecting the CIPOS reported by the variant caller, will result in more accurate benchmarking results.
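One simple way to respect a reported CIPOS when comparing breakpoints is to treat each position as an interval and test for overlap. The sketch below assumes CIPOS has been parsed into a (start offset, end offset) pair relative to POS, as the VCF specification describes it; `breakpoints_match` and the `slop` parameter are hypothetical names, and this does not handle extreme cases such as the SINE example.

```python
from typing import Tuple

Offsets = Tuple[int, int]  # CIPOS as (start offset, end offset) around POS


def breakpoints_match(pos_a: int, pos_b: int,
                      cipos_a: Offsets = (0, 0),
                      cipos_b: Offsets = (0, 0),
                      slop: int = 0) -> bool:
    """True if the two breakpoint intervals overlap, allowing `slop` bp of gap.

    A caller that reports no CIPOS gets the exact interval [POS, POS].
    """
    a_lo, a_hi = pos_a + cipos_a[0], pos_a + cipos_a[1]
    b_lo, b_hi = pos_b + cipos_b[0], pos_b + cipos_b[1]
    return a_lo - slop <= b_hi and b_lo - slop <= a_hi


# Same event reported at either end of 8 bp of microhomology.
exact = breakpoints_match(1000, 1008)                        # no homology info
with_cipos = breakpoints_match(1000, 1008, cipos_a=(0, 10))  # caller reports CIPOS
```

With exact positions the two calls are treated as different events; once the caller's CIPOS is taken into account they are recognised as the same breakpoint.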
The delta between the caller's event length and the actual event length for TPs is a good indicator of how well an SV caller gets the length correct. You should find that, in contrast to the overall average lengths, the lengths predicted by BreakDancer do not match the actual event lengths very closely, something that has the potential to change someone's choice of caller.
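A minimal sketch of that per-event comparison follows, assuming each true positive has already been paired with its truth event and that lengths are given as SVLEN-style values (negative for deletions). The function names are hypothetical and the numbers are made up for illustration only.

```python
from statistics import median
from typing import Dict, List, Tuple


def length_deltas(pairs: List[Tuple[int, int]]) -> List[int]:
    """Per-TP delta: |called length| - |true length|."""
    return [abs(called) - abs(true) for called, true in pairs]


def summarise(deltas: List[int]) -> Dict[str, float]:
    """Summarise absolute deltas; unlike a mean of lengths, errors cannot cancel."""
    abs_deltas = [abs(d) for d in deltas]
    return {
        "median_abs_delta": median(abs_deltas),
        "max_abs_delta": max(abs_deltas),
    }


# (called SVLEN, true SVLEN) for three matched events; illustrative values.
pairs = [(300, 310), (-500, -500), (1200, 1000)]
deltas = length_deltas(pairs)
summary = summarise(deltas)
```

Reporting a summary of the per-event deltas for each caller, rather than comparing average lengths across the whole call set, avoids over- and under-estimates cancelling each other out.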