ivcfmerge's Introduction

ivcfmerge: Incremental VCF merge

1. Purpose

We provide a utility that incrementally merges a large number of VCF files (possibly too many to open at once), using only about as much memory as one merged line takes.

2. Important assumptions

  • All input VCFs are positionally sorted, and the values for the FILTER column of each position are the same for all samples.
  • All input VCFs have the same headers and the same number of positions.
  • All input VCFs have the FORMAT column.
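
These assumptions can be checked before merging. The following is a minimal sketch, not part of ivcfmerge: `check_vcf_assumptions` is a hypothetical helper that assumes tab-separated data lines, and it only verifies that positions are sorted within each chromosome and that the FORMAT column is present.

```python
def check_vcf_assumptions(lines):
    """Check one VCF's data lines and return how many positions it has.

    Hypothetical helper for illustration; not part of ivcfmerge.
    """
    count = 0
    previous = None  # (chromosome, position) of the previous data line
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue  # skip header and blank lines
        columns = line.rstrip('\n').split('\t')
        if len(columns) < 9:
            # FORMAT is the 9th column of a VCF data line
            raise ValueError('FORMAT column is missing')
        current = (columns[0], int(columns[1]))
        if previous and current[0] == previous[0] and current[1] < previous[1]:
            raise ValueError('positions are not sorted')
        previous = current
        count += 1
    return count
```

Comparing the returned counts across all input files is one way to confirm that they have the same number of positions.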

3. Output format

Since the FILTER value can differ from sample to sample, we omit the FILTER column (set it to .) in the merged result and append each sample's original FILTER value to its call data. The appended value is declared by the FT field in the FORMAT column.

For example, these two input lines:

NC_000962.3 11 . A C . MIN_GCP . GT:DP:COV:GT_CONF:GT_CONF_PERCENTILE 0/0:6:6,0:73.54:0.74

and

NC_000962.3 11 . A C . MIN_DP;MIN_GCP . GT:DP:COV:GT_CONF:GT_CONF_PERCENTILE 0/0:3:3,0:36.98:0.01

will produce this output line:

NC_000962.3 11 . A C . . . GT:DP:COV:GT_CONF:GT_CONF_PERCENTILE:FT 0/0:6:6,0:73.54:0.74:MIN_GCP 0/0:3:3,0:36.98:0.01:MIN_DP;MIN_GCP
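
The transformation above can be sketched in a few lines of Python. This is an illustration of the output format only, not the utility's actual implementation: `merge_lines` is a hypothetical name, and the sketch assumes tab-separated lines that describe the same position.

```python
def merge_lines(lines):
    """Merge the lines for one position taken from several input VCFs.

    Hypothetical helper for illustration; not part of ivcfmerge.
    """
    rows = [line.rstrip('\n').split('\t') for line in lines]
    fixed = rows[0][:9]   # CHROM .. FORMAT, taken from the first input
    fixed[6] = '.'        # FILTER is omitted in the merged line
    fixed[8] += ':FT'     # FORMAT declares the appended per-sample filter
    calls = []
    for row in rows:
        original_filter = row[6]
        for call in row[9:]:
            calls.append(call + ':' + original_filter)
    return '\t'.join(fixed + calls)


first = '\t'.join(['NC_000962.3', '11', '.', 'A', 'C', '.', 'MIN_GCP', '.',
                   'GT:DP:COV:GT_CONF:GT_CONF_PERCENTILE',
                   '0/0:6:6,0:73.54:0.74'])
second = '\t'.join(['NC_000962.3', '11', '.', 'A', 'C', '.', 'MIN_DP;MIN_GCP', '.',
                    'GT:DP:COV:GT_CONF:GT_CONF_PERCENTILE',
                    '0/0:3:3,0:36.98:0.01'])
merged = merge_lines([first, second])
print(merged.replace('\t', ' '))
```

With the two example lines as input, the printed result matches the output line shown above.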

4. Usage

You can use the utility either as a Python library or from the command line.

4.1 As a Python library

4.1.1 If the number of input files is small (can be opened all at once)

from contextlib import ExitStack
from ivcfmerge import ivcfmerge

filenames = [...]    # List/iterator of relative/absolute paths to input files
output_path = '...'  # Where to write the merged VCF to

with ExitStack() as stack:
    files = map(lambda fname: stack.enter_context(open(fname)), filenames)
    with open(output_path, 'w') as outfile:  # open the output for writing
        ivcfmerge(files, outfile)

4.1.2 If the number of input files is big (cannot be opened all at once)

from ivcfmerge import ivcfmerge_batch

filenames = [...]    # List/iterator of relative/absolute paths to input files
output_path = '...'  # Where to write the merged VCF to
batch_size = 1000    # How many files to open and merge at once

ivcfmerge_batch(filenames, output_path, batch_size)

4.1.2.1 You may also need to specify a temporary directory

The batch processing version stores intermediate results in a temporary directory, which needs at least as much free space as the input files occupy.

...
temp_dir = '...'  # for example, a directory on a mounted disk like /mnt/big_disk/tmp or /media/big_disk/tmp

ivcfmerge_batch(filenames, output_path, batch_size, temp_dir)

4.2 From the command line

4.2.1 If the number of input files is small (can be opened all at once)

# Prepare a file of paths to input VCF files
> cat input_paths.txt
1.vcf
2.vcf
...

> python3 ivcfmerge.py input_paths.txt path/to/output/file
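
If the input files live in one directory, the list of paths can be generated rather than typed. A minimal sketch, assuming the inputs end in .vcf and that one path per line is the expected format, as in the listing above; `write_input_paths` is a hypothetical helper, not part of ivcfmerge.

```python
from pathlib import Path


def write_input_paths(vcf_dir, list_path):
    """Write the paths of all .vcf files in vcf_dir to list_path, one per line.

    Hypothetical helper for illustration; not part of ivcfmerge.
    """
    # Sorting is lexicographic, so '10.vcf' comes before '2.vcf'.
    # Reorder the list if the order of the inputs matters to you.
    paths = sorted(str(path) for path in Path(vcf_dir).glob('*.vcf'))
    Path(list_path).write_text(''.join(path + '\n' for path in paths))
    return paths
```

For example, `write_input_paths('.', 'input_paths.txt')` lists the VCF files of the current directory.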

4.2.2 If the number of input files is big (cannot be opened all at once)

# Prepare a file of paths to input VCF files
> cat input_paths.txt
1.vcf
2.vcf
...

> python3 ivcfmerge_batch.py --batch-size 1000 input_paths.txt path/to/output/file

4.2.2.1 You may also need to specify a temporary directory

The batch processing version stores intermediate results in a temporary directory, which needs at least as much free space as the input files occupy.

...

> python3 ivcfmerge_batch.py --batch-size 1000 --temp-dir /path/to/tmp/dir input_paths.txt path/to/output/file

4.3 As installed commands

pip3 install .
ivcfmerge -h
ivcfmerge_batch -h

All CLI arguments and options are the same as described in 4.2: replace python3 ivcfmerge.py with ivcfmerge, and python3 ivcfmerge_batch.py with ivcfmerge_batch.

5. Important parameters

5.1 batch_size

Indicates how many files to open and merge in each batch, for the batch processing version.

The default value for this parameter is 1000.

5.1.1 How batch_size affects computation resources and performance

  • Total memory usage will not exceed batch_size * size(one line of input VCFs).
  • Batch size equals the number of files the utility will open at once.
  • A bigger batch size reduces the total time taken, but requires more memory and more file handles from the OS.
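
To make the effect of batch_size concrete, here is a minimal sketch of how a list of input paths could be split into batches of at most batch_size files. It illustrates the idea only and is not the utility's actual implementation; `iter_batches` is a hypothetical name.

```python
from itertools import islice


def iter_batches(filenames, batch_size=1000):
    """Yield successive lists of at most batch_size file names.

    Hypothetical helper for illustration; not part of ivcfmerge.
    """
    iterator = iter(filenames)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Five inputs with a batch size of 2 give three batches.
for batch in iter_batches(['1.vcf', '2.vcf', '3.vcf', '4.vcf', '5.vcf'], 2):
    print(batch)
```

Only one batch of files needs to be open at a time, which is why the batch size bounds both the number of file handles and the memory in use. On Unix-like systems, `resource.getrlimit(resource.RLIMIT_NOFILE)` reports the per-process limit on open files, which is a reasonable upper bound to keep in mind when choosing batch_size.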

5.2 temp_dir

For the batch processing version, the utility needs to store the intermediate results somewhere with as much space as the total space occupied by the input files.

By default, the choice is left to the tempfile library. On Unix/Linux, this is usually /tmp.
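
Before a large batch merge, it can be worth checking that the temporary directory has enough room. The sketch below compares the total size of the input files with the free space reported for the directory, following the space requirement stated above. `has_enough_temp_space` is a hypothetical helper, not part of ivcfmerge; it assumes that `tempfile.gettempdir()` names the directory used by default.

```python
import os
import shutil
import tempfile


def has_enough_temp_space(filenames, temp_dir=None):
    """Return True if temp_dir has at least as much free space as the inputs occupy.

    Hypothetical helper for illustration; not part of ivcfmerge.
    """
    if temp_dir is None:
        temp_dir = tempfile.gettempdir()  # usually /tmp on Unix/Linux
    needed = sum(os.path.getsize(name) for name in filenames)
    free = shutil.disk_usage(temp_dir).free
    return free >= needed
```

If the check fails for the default location, pass a directory on a larger disk as temp_dir.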

6. Development

6.1 Running tests

pip3 install -r requirements/dev.txt
pip3 install -e .
pytest

ivcfmerge's People

Contributors

dependabot[bot], giang-nghg

ivcfmerge's Issues

Resolve warnings in pytest run

When you run the tests through pytest, there are quite a few warnings like this:

tests/test_ivcfmerge_batch.py::test_merging_example_input /home/vagrant/.local/lib/python3.6/site-packages/hypothesis/extra/pytestplugin.py:168: HypothesisDeprecationWarning: tests/test_ivcfmerge_batch.py::test_merging_example_input uses the 'input_paths' fixture, but function-scoped fixtures should not be used with @given(...) tests, because fixtures are not reset between generated examples! since="2020-02-29",

Please resolve these warnings.

Feature: Parallel merge

Batch merging currently processes batches sequentially in order to maintain the ordering of samples in the original input. In cases where this is not necessary, batches can be processed in parallel to speed up the entire process.

Add Travis build

Add a Travis build for this repo so that any commit pushed to it is picked up by Travis.
