1000 Genomes integration methods

This repository packages methods that are used in the 1000 Genomes Project phase3 integration pipeline, specifically for small non-SNP (and non-biallelic) sequence variants, such as indels and multiallelic or complex loci. They are packaged here to ease the distribution of computational effort for this task.

Please contact me ([email protected]) with any questions.

Getting started

First, use git to clone the repository:

git clone --recursive https://github.com/ekg/1000G-integration.git

Note the use of --recursive, which ensures that the many submodules included in the project are cloned as well.

Build using make. Executables (freebayes and glia) will be in bin/. You should modify scripts/run_region.sh to reflect their location, e.g.:

## Change this to your 1000G-integration directory, or unset if
## you put freebayes and glia into your system-wide path:
bin=/path/to/1000G-integration/bin

You will also need samtools. In my tests, I used version 0.1.19-44428cd.
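
Before launching jobs, it can help to confirm that the three executables are actually reachable. The following is a minimal sketch, not part of the repository: missing_tools is a hypothetical helper that looks for each named tool first in the given bin directory and then on your PATH, and prints the names it cannot find.

```shell
# Hypothetical helper (not shipped with 1000G-integration).
# Usage: missing_tools BIN_DIR TOOL...
# Prints the name of every tool found neither in BIN_DIR nor on PATH.
# The body runs in a subshell so its variables do not leak to the caller.
missing_tools() (
    bin_dir="$1"; shift
    for tool in "$@"; do
        # Accept an executable that lives in the build's bin/ directory.
        if [ -n "$bin_dir" ] && [ -x "$bin_dir/$tool" ]; then
            continue
        fi
        # Otherwise fall back to a PATH lookup.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
)
```

For example, `missing_tools /path/to/1000G-integration/bin freebayes glia samtools` prints nothing when everything is in place.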

Execution

Now, you can run the graph-based regenotyping pipeline using:

scripts/run_region.sh [outdir] \
                      [reference] \
                      [region] \
                      [union] \
                      [contamination] \
                      [cnvmap] \
                      [scratch] \
                      [merge_script]

This will generate outdir/region.vcf.gz in gzip format, along with *.err files for the components of the process. A *.sites.vcf.gz file is also made, and if the generated gzipped VCF decompresses without error, outdir/region.ok is touched.
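
Because the .ok file marks a region whose output decompressed cleanly, it can serve as a completion flag when many regions are run. The sketch below is an illustration only: unfinished_regions is a hypothetical helper, and it assumes that "region" in the output file names is the region string exactly as it was passed to run_region.sh.

```shell
# Hypothetical helper (not shipped with 1000G-integration).
# Usage: unfinished_regions OUTDIR < regions.txt
# Reads one region per line from stdin and prints every region for which
# OUTDIR/<region>.ok does not exist (assumed naming, see the text above).
unfinished_regions() (
    outdir="$1"
    while read -r region; do
        [ -n "$region" ] || continue          # skip blank lines
        [ -e "$outdir/$region.ok" ] || printf '%s\n' "$region"
    done
)
```

The printed regions are the candidates for rerunning; the matching *.err files are the first place to look for the cause.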

The cnvmap and contamination estimate files for the 2535 samples in the 1000G release are both in resources/. Use the 1000 Genomes reference as reference. You will also need to download the whole-genome union allele list and its .tbi index from the 1000 Genomes FTP site, and supply the list as union in the run_region.sh command.
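
To make the argument order concrete, here is a minimal dry-run sketch that prints one run_region.sh command per region without executing anything. Every path in it is a hypothetical placeholder (the real file names in resources/ and on the FTP site are not spelled out here), and print_commands is an illustrative name, not something the repository provides.

```shell
# All values below are hypothetical placeholders; substitute your own paths.
OUTDIR=${OUTDIR:-/path/to/outdir}
REFERENCE=${REFERENCE:-/path/to/reference.fa}
UNION=${UNION:-/path/to/union.vcf.gz}
CONTAMINATION=${CONTAMINATION:-/path/to/resources/contamination_file}
CNVMAP=${CNVMAP:-/path/to/resources/cnvmap_file}
SCRATCH=${SCRATCH:-/path/to/scratch}
MERGE_SCRIPT=${MERGE_SCRIPT:-/path/to/merge_script}

# Usage: print_commands < regions.txt
# Prints one command line per region, with arguments in the order that
# run_region.sh expects. Nothing is executed.
print_commands() {
    while read -r region; do
        [ -n "$region" ] || continue          # skip blank lines
        printf 'scripts/run_region.sh %s %s %s %s %s %s %s %s\n' \
            "$OUTDIR" "$REFERENCE" "$region" "$UNION" \
            "$CONTAMINATION" "$CNVMAP" "$SCRATCH" "$MERGE_SCRIPT"
    done
}
```

The printed lines can be inspected by eye or handed to whatever job scheduler your system uses.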

You will also need to provide a merge_script which takes a genomic region in samtools format (e.g. 5:300-400). This script should produce a merged, uncompressed BAM stream on stdout of all low-coverage and exome sequencing data mapped to the target region in the 2535 samples in the 1000G phase3. The exact functioning of this script is likely system-dependent, but an example that performs the merge in less than 700 MB of memory is provided in the scripts directory. That example requires a merged SAM header, which has been generated with bamtools and is provided in the resources directory.
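
The contract described above (region in, uncompressed BAM stream out) can be sketched as follows. This is not the example shipped in the scripts directory: it assumes that samtools merge is an acceptable way to do the merge on your system, and BAM_LIST and HEADER are hypothetical paths. Opening several thousand BAMs in one samtools merge call may exceed open-file limits on some systems, so treat this only as an outline of the interface.

```shell
# Hypothetical inputs; set these for your system.
BAM_LIST=${BAM_LIST:-/path/to/bam_list.txt}      # one BAM path per line, no spaces
HEADER=${HEADER:-/path/to/merged_header.sam}     # merged SAM header

# Succeeds only for regions of the form chrom:start-end, e.g. 5:300-400.
valid_region() {
    printf '%s\n' "$1" | grep -Eq '^[^:[:space:]]+:[0-9]+-[0-9]+$'
}

# Usage: merge_region REGION
# Writes a merged, uncompressed BAM for REGION to stdout.
merge_region() {
    valid_region "$1" || { echo "bad region: $1" >&2; return 1; }
    # -u: uncompressed BAM output   -R: restrict the merge to REGION
    # -h: copy the header from HEADER   "-": write to stdout
    # The unquoted $(cat ...) relies on word splitting of the path list.
    samtools merge -u -R "$1" -h "$HEADER" - $(cat "$BAM_LIST")
}
```

To use this as a merge_script, the file would end with `merge_region "$1"` so that the region given on the command line is merged. Check the flags against the samtools version you have installed.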

Considerations

The method assumes that you have stored your data in per-sample (exome and low-coverage) BAMs and that, for best performance, these are merged, processed by glia, written temporarily to disk, and then processed by freebayes. However, you may have already generated per-region files for the data.

You will need suitable scratch space (specified as scratch) on the nodes where you execute the script. If this is not available, the glia and freebayes components can run on purely streamed data, at the cost of doubling runtime memory requirements. In that case, you will need to modify the run_region.sh script so that it does not write the temporary file. Please contact me if this is necessary.
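
A quick pre-flight check of the scratch directory can catch an unusable node before a job is started. The sketch below is an illustration only: scratch_ok is a hypothetical helper, and the amount of free space a region actually needs is not stated here, so the threshold is yours to choose.

```shell
# Hypothetical helper (not shipped with 1000G-integration).
# Usage: scratch_ok DIR MIN_KB
# Succeeds if DIR exists, is writable, and has at least MIN_KB kilobytes free.
scratch_ok() (
    dir="$1"; min_kb="$2"
    [ -d "$dir" ] && [ -w "$dir" ] || exit 1
    # df -P gives the portable one-line-per-filesystem format and -k reports
    # 1024-byte blocks; column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$min_kb" ]
)
```

For example, `scratch_ok "$scratch" 10485760 || echo "scratch unusable" >&2` would require roughly 10 GB free; that figure is a placeholder, not a measured requirement.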

You will also need enough memory. On this alignment data, glia and freebayes will rarely use more than 4-5 GB at runtime. If a component does run out of memory, you can detect the failure by grepping for out-of-memory errors ("bad alloc"s) in the *.err files.
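
The grep mentioned above can be wrapped in a small helper. This is a minimal sketch with a hypothetical function name; the pattern accepts both "bad alloc" and "bad_alloc" (the C++ exception name), as a precaution against either spelling appearing in the logs.

```shell
# Hypothetical helper (not shipped with 1000G-integration).
# Usage: oom_logs OUTDIR
# Prints the names of the *.err files in OUTDIR that mention an allocation
# failure. -l lists matching file names only; -E enables the bracket pattern.
oom_logs() {
    grep -l -E 'bad[ _]alloc' "$1"/*.err 2>/dev/null
}
```

Regions whose logs are listed are candidates for rerunning on a node with more memory.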

In practice, I use a set of targets that have approximately equal amounts of sequencing data in them (this was estimated from ~20 exome and ~20 low-coverage files). This kind of even partitioning can improve performance, but again this is possibly system-dependent and thus the merging strategy is left up to the user.
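
One simple way to build such targets is to merge consecutive windows until each target holds roughly the same amount of data. The sketch below is an assumption about how this could be done, not the procedure used for the project: partition_targets is a hypothetical helper, and its input (per-window weights, for instance read counts summed over a sample of BAMs) must be produced separately.

```shell
# Hypothetical helper (not shipped with 1000G-integration).
# Usage: partition_targets BUDGET < windows.txt
# Input lines: chrom start end weight, sorted by chrom and then start.
# Consecutive windows on one chromosome are merged until their summed weight
# reaches BUDGET. Output lines: region in samtools format, a tab, the weight.
partition_targets() {
    awk -v budget="$1" '
        # Print the target accumulated so far (if any) and reset the state.
        function emit() {
            if (n) printf "%s:%d-%d\t%d\n", chrom, start, stop, sum
            n = 0; sum = 0
        }
        {
            if (n && $1 != chrom) emit()      # never span two chromosomes
            if (!n) { chrom = $1 ""; start = $2 }
            stop = $3; sum += $4; n++
            if (sum >= budget) emit()
        }
        END { emit() }
    '
}
```

A single window heavier than the budget still becomes its own target, so the budget is a soft limit rather than a guarantee.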
