
Benchmarking of WGS-based structural variant callers

This repository contains links to the datasets and the code used for our study, "A comprehensive benchmarking of WGS-based structural variant callers".


How to cite this study

Sarwal, Varuni, et al. "A comprehensive benchmarking of WGS-based structural variant callers." bioRxiv, doi: https://doi.org/10.1101/2020.04.16.045120

Reproducing results

Tools

We have evaluated 12 structural variant tools: Biograph, BreakDancer, CLEVER, DELLY, GASV, GRIDSS, indelMINER, LUMPY, MiStrVar, Pindel, PopDel, and RDXplorer. Details about the tools and instructions for running them can be found in our paper.

We have prepared wrapper scripts to run each of the tools and to produce standardized log files.
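A minimal sketch of what such a wrapper might look like; the caller name, command line, and log format below are placeholders for illustration, not the actual wrappers used in this repository:

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run one SV caller and write a standardized log."""
import subprocess
import sys
import time
from datetime import datetime

def run_caller(name, cmd, log_path):
    """Run `cmd`, recording the start time, exit code, and wall-clock runtime."""
    start = time.time()
    with open(log_path, "w") as log:
        log.write(f"caller\t{name}\n")
        log.write(f"started\t{datetime.now().isoformat()}\n")
        log.flush()  # flush before the child process writes to the same file
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
        log.write(f"exit_code\t{result.returncode}\n")
        log.write(f"runtime_seconds\t{time.time() - start:.1f}\n")
    return result.returncode

if __name__ == "__main__":
    # "sv_caller" and its arguments are placeholders, not a real command line.
    sys.exit(run_caller("sv_caller",
                        ["sv_caller", "--bam", "sample.bam", "--out", "calls.vcf"],
                        "sv_caller.log"))
```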

Data

The raw VCFs produced by the tools can be found here: https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/tree/master/Data/raw_data/mouse/raw_vcf

The custom VCF files, which are the raw VCFs converted to the VCFv4.2 format, can be found here: https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/tree/master/Data/raw_data/mouse

The FASTQ and BAM files used will be made available soon.

Scripts

The script that compares the deletions inferred by each SV caller against the true deletions is available here: https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/blob/master/Scripts/customvcf_mouse.py
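The general idea of such a comparison, matching each predicted deletion to a true deletion when the two intervals overlap reciprocally, can be sketched as follows. The 50% reciprocal-overlap threshold and the interval format are assumptions for illustration, not necessarily the exact criteria used in customvcf_mouse.py:

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two deletions, each given as (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))

def compare(predicted, truth, min_ro=0.5):
    """Count true positives, false positives, and false negatives at a cutoff."""
    matched_truth = set()
    tp = 0
    for call in predicted:
        hit = next((i for i, t in enumerate(truth)
                    if i not in matched_truth and reciprocal_overlap(call, t) >= min_ro),
                   None)
        if hit is not None:
            matched_truth.add(hit)
            tp += 1
    fp = len(predicted) - tp
    fn = len(truth) - len(matched_truth)
    return tp, fp, fn

# Toy example:
pred = [("chr1", 1000, 2000), ("chr1", 5000, 5400)]
true = [("chr1", 1050, 1950), ("chr2", 300, 900)]
print(compare(pred, true))  # (1, 1, 1)
```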

Notebooks and Figures

We have prepared Jupyter Notebooks that utilize the raw data described above to reproduce the results and figures presented in our manuscript.

License

This repository is under MIT license. For more information, please read our LICENSE.md file.

Contact

Please do not hesitate to contact us ([email protected]) if you have any comments, suggestions, or clarification requests regarding the study or if you would like to contribute to this resource.

Contributors

addicted-to-coding, aditya-sarkar441, alu52904, ndarci, ramayyala, smangul1


Issues

FILTER and QUAL should be taken into account

We weren't able to find any clear documentation on how to use QUAL and FILTER, so those were ignored.

If there's no documentation from the caller, then you should use the specification's definitions of those fields:

6. QUAL — quality: Phred-scaled quality score for the assertion made in ALT, i.e. −10log10 prob(call in ALT is wrong). If ALT is '.' (no variant) then this is −10log10 prob(variant), and if ALT is not '.' this is −10log10 prob(no variant). If unknown, the MISSING value must be specified. (Float)

7. FILTER — filter status: PASS if this position has passed all filters, i.e., a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail, e.g. "q10;s50" might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples. '0' is reserved and must not be used as a filter String. If filters have not been applied, then this field must be set to the MISSING value. (String, no white-space or semi-colons permitted, duplicate values not allowed.)

It seems unfair to penalise callers for following the VCF file format specifications.

QUAL could be ignored if you just want the 'default' call set but it can be used for generating ROC curves instead of single points for each caller.

FILTER should definitely be respected as those calls are not part of the 'default' call set for the caller - the caller itself already thinks they are bad for the reason specified in the FILTER field.
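A hedged sketch of how FILTER and QUAL could be respected when benchmarking, using plain-text VCF parsing (column positions follow the VCF specification; sweeping QUAL thresholds is just one way to turn a single call set into a ROC-style curve):

```python
import gzip

def read_vcf_records(path):
    """Yield (chrom, pos, qual, filter) tuples from a possibly gzipped VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            qual = None if fields[5] == "." else float(fields[5])
            yield fields[0], int(fields[1]), qual, fields[6]

def default_call_set(records):
    """Keep only calls the caller itself considers good: FILTER is PASS or '.'."""
    return [r for r in records if r[3] in ("PASS", ".")]

def qual_call_sets(records):
    """Sweep QUAL cutoffs over the default call set, giving one call set per
    threshold; scoring each against the truth set yields a ROC-style curve."""
    quals = sorted({r[2] for r in records if r[2] is not None})
    return {q: [r for r in records if r[2] is not None and r[2] >= q] for q in quals}
```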

bwa mem flags could be problematic

-a can have a significant impact on caller performance. In the case of GRIDSS, results were significantly worse when -a was used.

-Y is similarly problematic: some older callers pre-date the addition of the supplementary flag to the SAM specifications, so they essentially require -Y to be treated fairly. Recent callers may actually adhere to the SAM specifications, so -Y should not be supplied to those callers.

Other callers can rely on specific aligner versions. When I last benchmarked, I found that LUMPY sensitivity was halved due to a minor version change in the bwa used for alignment.

In summary, -a is a problematic default, and -Y may or may not be required depending on the age of the caller.
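One way to make the alignment settings explicit per caller is to build the bwa mem command programmatically, adding -Y only for callers that pre-date supplementary alignments and never passing -a by default. This is a hypothetical sketch; the caller grouping and file names are assumptions, not recommendations from the paper:

```python
import subprocess

# Callers assumed here to need soft-clipped supplementary alignments (-Y);
# this grouping is illustrative only.
LEGACY_CALLERS = {"breakdancer", "pindel"}

def align(caller, ref="ref.fa", r1="reads_1.fq", r2="reads_2.fq",
          out="aligned.sam", threads=8):
    """Run bwa mem with flags chosen for the downstream caller.

    Note: -a (output all alignments) is deliberately not used, as it can
    inflate false-positive rates for some SV callers.
    """
    cmd = ["bwa", "mem", "-t", str(threads)]
    if caller.lower() in LEGACY_CALLERS:
        cmd.append("-Y")  # soft-clip supplementary alignments for older callers
    cmd += [ref, r1, r2]
    with open(out, "w") as sam:
        subprocess.run(cmd, stdout=sam, check=True)
```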

Unable to obtain raw data for reproduction

Hello.

I just saw your preprint go up and wanted to generate ROC curves from the caller QUAL scores to compare to your single-point results, but it appears the raw data is not available. Would you be able to update your repo and readme with:

  • A link to the Google Drive location of the BAMs and/or FASTQs. Some aligners have significantly elevated FP rates when using bwa mem -a, and I am curious what effect your choice of aligner settings has on your results.
  • The raw VCF files as output by each caller. https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/tree/master/Data/raw_data/mouse only includes your processed results, not the raw VCFs. I am unable to generate ROC curves because the essential information (i.e. QUAL and FILTER) was stripped when generating those subset files.

LICENSE file is missing

The license is mentioned in the README and a link to a LICENSE file is provided, but the link appears to be broken.

git lfs quota exceeded

Downloading the raw VCFs is not possible, as the git lfs quota gets exceeded before all files can be downloaded.

Account for microhomology and CIPOS

Many germline events have microhomology at the breakpoint. When matching an event, this should be taken into account, as there are many ways to report the event in VCF. Figure 9 of the VCF specifications outlines this.

Some callers report CIPOS (and, in the case of GRIDSS, the non-standard HOMPOS field), which is what the caller itself thinks the extent of the homology is, but others do not.

Properly matching equivalent variants can be very messy. For example, in the HG002 truth set, a SINE duplication within a SINE repeat (ref=SINE-SINE-SINE, var=SINE-SINE-SINE-SINE) is reported as an INS event after the 3rd SINE, but the short-read callers report it as a DUP of the first SINE. These calls both result in the same sequence, but they're 600bp away from each other! Events such as these are a bit extreme and difficult to handle, but a basic check of homology, and/or respecting the CIPOS reported by the variant caller, will result in more accurate benchmarking results.
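A minimal sketch of breakpoint matching that respects a caller-reported confidence interval (CIPOS) or, when the caller does not report one, a fixed homology window; the default window size is an arbitrary assumption:

```python
def breakpoints_match(called_pos, true_pos, cipos=None, default_window=50):
    """Return True if a called breakpoint is consistent with the true breakpoint.

    `cipos` is the (lower, upper) offset pair from the caller's CIPOS field,
    e.g. (-20, 35); if the caller did not report one, fall back to a fixed
    +/- default_window padding (the 50 bp default is an arbitrary assumption).
    """
    if cipos is not None:
        lo, hi = called_pos + cipos[0], called_pos + cipos[1]
    else:
        lo, hi = called_pos - default_window, called_pos + default_window
    return lo <= true_pos <= hi

# Example: a call at 10,000 with CIPOS=(-10, 120) matches a truth breakpoint at 10,090.
print(breakpoints_match(10_000, 10_090, cipos=(-10, 120)))  # True
```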

The delta between the caller's event length and the actual event length for TPs is a good indicator of how well an SV caller gets the length correct. You should find that, in contrast to overall average lengths, the lengths predicted by BreakDancer do not match the actual event lengths very closely - something that has the potential to change someone's choice of caller.
