BayesTyper

BayesTyper performs genotyping of all types of variation (including structural and complex variation) based on an input set of variants and read k-mer counts. Internally, BayesTyper uses exact alignment of k-mers to a graph representation of the input variation and reference sequence in combination with a probabilistic model of k-mer counts to do genotyping. The variant representation ensures that the resulting calls are not biased towards the reference sequence as is otherwise generally the case when basing calls only on mapped reads.

The BayesTyper was used to integrate mapping- and assembly-based calls in the GenomeDenmark project. A manuscript describing the method is currently in revision.

The BayesTyper is being developed by Jonas Andreas Sibbesen, Lasse Maretty and Anders Krogh at the Section for Computational and RNA Biology, Department of Biology, University of Copenhagen.

Use cases

Variant integration

Sensitive calling of structural variation typically requires running multiple callers, which leads to the problem of integrating calls across call-sets. The BayesTyper can be used to produce a fully integrated call-set including SNVs, indels and complex variation from input variant candidates produced by a panel of methods; the panel must include standard SNV and indel calls, e.g. from GATK, Freebayes or Platypus.

Prior based genotyping

A significant amount of both simple and complex variation is already known from large population-scale studies. Because alignment bias can cause some of these variants to be missed in a study, even when running multiple methods, we provide a database containing common SNPs/indels together with complex variants. This database can be combined with in-sample calls (i.e. calls based only on the study data) to improve sensitivity. This approach can for instance be used to quickly augment a set of standard SNV and indel calls (e.g. from GATK) with structural variation by running BayesTyper on the SNV/indel calls combined with our variation database. For higher sensitivity, in-sample complex variation calls can be combined with the database to produce the final integrated call-set.

Installation

BayesTyper can either be built from source, or a static Linux x86_64 build can be downloaded under releases.

Building BayesTyper

Prerequisites

  • gcc (C++11 support required; tested with gcc 4.8 and 4.9)
  • CMake (version 2.8.0 or higher)
  • Boost (tested with version 1.55.0 and 1.56.0)

Compilation

  1. git clone https://github.com/bioinformatics-centre/BayesTyper.git
  2. cd BayesTyper
  3. mkdir build && cd build
  4. cmake ..
  5. make

The compiled bayesTyper and bayesTyperTools binaries are located in the bin directory.
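
Before building, it can help to confirm that the installed toolchain meets the minimum versions listed under Prerequisites. The following is a minimal sketch, not part of BayesTyper: the helper names `version_ge` and `check_tool` are hypothetical, the comparison relies on GNU `sort -V`, and Boost is not checked because it has no standard command-line version query.

```shell
#!/usr/bin/env bash
# Hypothetical pre-build check; minimum versions are taken from the
# Prerequisites list (gcc 4.8, CMake 2.8.0). Boost is not checked here.

# version_ge A B: succeed when version A >= version B (uses GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MINIMUM FOUND: report whether the found version is sufficient.
check_tool() {
    if version_ge "$3" "$2"; then
        echo "$1 $3: ok (>= $2)"
    else
        echo "$1 $3: too old (need >= $2)"
        return 1
    fi
}

status=0
# Only query tools that are actually installed.
if command -v gcc >/dev/null 2>&1; then
    check_tool gcc 4.8 "$(gcc -dumpversion)" || status=1
fi
if command -v cmake >/dev/null 2>&1; then
    check_tool cmake 2.8.0 "$(cmake --version | head -n 1 | awk '{print $3}')" || status=1
fi
echo "prerequisite check status: $status"
```

A status of 0 means every tool that was found satisfies its minimum version; a tool that is not installed is skipped rather than reported.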

Basic usage

The BayesTyper package contains bayesTyper, which does the genotyping, and bayesTyperTools, which is used to pre- and post-process VCF files for BayesTyper.

  1. Count k-mers

    1. Run KMC3 on each sample: kmc -k55 sample_1.fq sample_1
      • This will output k-mer counts to sample_1.kmc_pre and sample_1.kmc_suf.
      • For low coverage data (<20X), include singleton k-mers by adding -ci1 to the KMC3 command line.
  2. Prepare variant input

    IMPORTANT: The variant input must contain simple variants (SNPs and short indels). These can be obtained by first running a standard tool like GATK, Platypus or Freebayes and then combining the resulting variants with structural variant calls and/or prior variants as desired. At least 1 million simple variants are required.

    1. If required, convert allele IDs (e.g. <DEL>) to sequence: bayesTyperTools convertAlleleId -o sample_1_sv_calls_seq -v sample_1_sv_calls.vcf -g hg38.fa

      • Currently <DEL>, <DUP>, <CN[digit(s)]>, <CNV>, <INV> and <INS:ME:[sequence name]> are supported. The latter requires a fasta file with the mobile element insertion sequences.
      • This step can be skipped if the variant sets do not include any allele IDs (e.g. GATK, Platypus and Freebayes output).
    2. Normalise variants using Bcftools: bcftools norm -o sample_1_gatk_norm.vcf -f hg38.fa sample_1_gatk.vcf

    3. Combine variant sets: bayesTyperTools combine -o bayesTyper_input -v gatk:sample_1_gatk_norm.vcf,gatk:sample_2_gatk_norm.vcf,gatk:sample_3_gatk_norm.vcf,varDB:SNP_dbSNP150common_SV_1000g_dbSNP150all_GDK_GoNL_GTEx_GRCh38.vcf

      • The contig fields in the headers need to be identical between variant sets, and the variants must be sorted in the same order as those fields.
  3. Genotype variants

    IMPORTANT: If you want to run BayesTyper on more than 30 samples, run it in batches of 30 samples or fewer, but use the full set of variants (i.e. across all individuals) for every batch.

    1. Prepare sample information: Create a TSV file with one sample per row and the columns <sample_id>, <sex> and <path_to_kmc3_output> (example)

    2. Run BayesTyper: bayesTyper -o integrated_calls -s samples.tsv -v bayesTyper_input.vcf -g hg38.fa -p <threads> > bayesTyper_log.txt

      • Decoy sequences: BayesTyper can be provided with decoy sequences using '-d' to handle sequence similarities between genotyped regions and non-genotyped regions (e.g. the mitochondrial genome and unplaced contigs in the reference). Matching reference and decoy sequences are available for
  4. Filter output

    1. Run filtering: bayesTyperTools filter -o integrated_calls_filtered -v integrated_calls.vcf -g hg38.fa --kmer-coverage-filename integrated_calls_kmer_coverage_estimates.txt
      • By default only genotypes with high confidence (posterior probability >= 0.99) are kept. If low-confidence genotypes are needed in a downstream analysis, this threshold can be changed using the option --min-genotype-posterior.
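
The k-mer counting and batching rules above can be sketched as a dry run that only prints commands. This is an illustration under stated assumptions, not part of BayesTyper: the helper names (`kmc_command`, `split_samples`, `bayestyper_command`) and the `batch_` file prefix are hypothetical, the sample sheet is assumed to have no header line, and the sex value `F` is a placeholder. Only flags that appear in the steps above are used.

```shell
#!/usr/bin/env bash
# Dry-run sketch: commands are printed, never executed.

# kmc_command SAMPLE FASTQ COVERAGE: print the KMC3 call for one sample,
# adding -ci1 (keep singleton k-mers) when coverage is below 20X.
# Mirrors the command in step 1; some KMC versions may also expect a
# working directory argument (assumption, check your KMC usage text).
kmc_command() {
    local sample=$1 fastq=$2 coverage=$3 extra=""
    if [ "$coverage" -lt 20 ]; then
        extra=" -ci1"
    fi
    echo "kmc -k55${extra} ${fastq} ${sample}"
}

# split_samples SHEET PREFIX [SIZE]: split a sample sheet into files
# PREFIX01.tsv, PREFIX02.tsv, ... with at most SIZE rows each (default 30).
split_samples() {
    local sheet=$1 prefix=$2 size=${3:-30}
    awk -v size="$size" -v prefix="$prefix" '{
        file = sprintf("%s%02d.tsv", prefix, int((NR - 1) / size) + 1)
        print > file
    }' "$sheet"
}

# bayestyper_command SHEET VARIANTS GENOME THREADS: print the genotyping
# call for one batch. Every batch uses the same full variant set.
bayestyper_command() {
    local sheet=$1 variants=$2 genome=$3 threads=$4
    local out="${sheet%.tsv}_calls"
    echo "bayesTyper -o ${out} -s ${sheet} -v ${variants} -g ${genome} -p ${threads} > ${out}_log.txt"
}

# Example: 65 made-up samples give three batches (30, 30 and 5 rows).
workdir=$(mktemp -d)
sheet="$workdir/samples.tsv"
for i in $(seq 1 65); do
    # Columns: <sample_id>, <sex>, <path_to_kmc3_output>; values are placeholders.
    printf 'sample_%d\tF\tsample_%d\n' "$i" "$i"
done > "$sheet"

kmc_command sample_1 sample_1.fq 13
split_samples "$sheet" "$workdir/batch_"
for batch in "$workdir"/batch_*.tsv; do
    bayestyper_command "$batch" bayesTyper_input.vcf hg38.fa 32
done
```

Printing the commands first makes it easy to review them, or to hand each line to a job scheduler, before committing to runs that can take many hours and hundreds of gigabytes of memory.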

Variant databases

Variant database sources

GRCh37

| Source | Version | Filters* | Lifted | Reference |
|---|---|---|---|---|
| dbSNP | 150 | No rare SNVs | No | link |
| 1000 Genomes Project (1KG) | Phase 3 | No SNVs | No | link |
| Genome of the Netherlands Project (GoNL) | Release 6 | No SNVs | No | link |
| Genotype-Tissue Expression (GTEx) Project | GTEx Analysis V6 | No SNVs | No | link |
| GenomeDenmark (GDK) | v1.0 | No SNVs | From GRCh38 | link |

GRCh38

| Source | Version | Filters* | Lifted | Reference |
|---|---|---|---|---|
| dbSNP | 150 | No rare SNVs | No | link |
| 1000 Genomes Project (1KG) | Phase 3 | No SNVs | No | link |
| Genome of the Netherlands Project (GoNL) | Release 6 | No SNVs | From GRCh37 | link |
| Genotype-Tissue Expression (GTEx) Project | GTEx Analysis V6 | No SNVs | From GRCh37 | link |
| GenomeDenmark (GDK) | v1.0 | No SNVs | No | link |

*Reference and alternative alleles containing ambiguous nucleotides were removed from all variant sources.

Memory requirements

| Variants | Coverage | Samples | Singletons removed | Threads | Memory (GB) | Time (wall-time hours) |
|---|---|---|---|---|---|---|
| 15M | 30X | 10 | Yes | 32 | 235 | 26 |
| 21M | ~13X | 10 | No | 32 | 280 | 20 |
| 51M | ~50X | 13 | Yes | 32 | 430 | 107 |

Third-party

Third-party software used by BayesTyper (distributed together with the BayesTyper source code).
