pairsamtools

Join the chat at https://gitter.im/mirnylab/distiller

build Hi-C mapping pipelines with pairsamtools

pairsamtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.

pairsamtools processes paired-end sequence alignments and performs the following operations:

  • detect and classify ligation sites (a.k.a. Hi-C pairs) produced in Hi-C experiments
  • sort .pairs files for downstream analyses
  • detect, tag and remove PCR/optical duplicates
  • generate extensive statistics of Hi-C datasets
  • select Hi-C pairs given flexibly defined criteria
  • restore and tag .sam files for selected subsets of Hi-C pairs

To get started, check out the documentation.

pairsamtools produces and operates on tab-separated files compliant with the .pairs format defined by the 4D Nucleome Consortium. All pairsamtools tools properly manage file headers and keep track of the data processing history.
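As a rough illustration of the file layout described above, the sketch below reads a .pairs stream in plain Python. It does not use pairsamtools itself. The function name `read_pairs` is hypothetical, and the code assumes that header lines start with `#`, that a `#columns:` header line names the fields, and that the leading columns are readID, chr1, pos1, chr2, pos2, strand1, strand2; check these assumptions against the .pairs specification before relying on them.

```python
import io

# Assumed default column layout; used only if no "#columns:" line is present.
DEFAULT_COLUMNS = ["readID", "chr1", "pos1", "chr2", "pos2", "strand1", "strand2"]


def read_pairs(handle):
    """Split a .pairs text stream into header lines and a list of row dicts."""
    header, rows, columns = [], [], None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines are kept verbatim so that they can be written back out.
            header.append(line)
            if line.startswith("#columns:"):
                columns = line[len("#columns:"):].split()
            continue
        if columns is None:
            columns = DEFAULT_COLUMNS
        row = dict(zip(columns, line.split("\t")))
        row["pos1"], row["pos2"] = int(row["pos1"]), int(row["pos2"])
        rows.append(row)
    return header, rows


# A tiny in-memory example with made-up data.
example = io.StringIO(
    "## pairs format v1.0\n"
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2\n"
    "read1\tchr1\t100\tchr2\t2000\t+\t-\n"
)
header, rows = read_pairs(example)
print(len(header), rows[0]["chr2"], rows[0]["pos2"])
```

Keeping the header lines separate from the body mirrors the way the tools carry headers through each processing step.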

installation

Requirements:

  • python 3.x
  • unix sort
  • bgzip
  • Cython
  • numpy
  • click

Install using pip:

$ pip install git+https://github.com/mirnylab/pairsamtools

tools

  • parse: read .sam files produced by bwa and form Hi-C pairs

    • form Hi-C pairs by reporting the outer-most mapped positions and the strand on either side of each molecule;
    • report unmapped/multimapped (ambiguous alignments)/chimeric alignments as chromosome "!", position 0, strand "-";
    • identify and rescue chimeric alignments produced by singly-ligated Hi-C molecules with a sequenced ligation junction on one of the sides;
    • perform upper-triangular flipping of the sides of Hi-C molecules such that the first side has a lower sorting index than the second side;
    • form hybrid pairsam output, where each line contains all available data for one Hi-C molecule (outer-most mapped positions on either side, read ID, pair type, and .sam entries for each alignment);
    • print the .sam header as #-comment lines at the start of the file.
  • sort: sort pairsam files (lexicographic order for chromosomes, numeric order for positions, lexicographic order for pair types).

  • merge: merge sorted pairsam files

    • simple merge sort for pairsam entries;
    • combine the pairs headers from all input files;
    • check that each pairsam file was mapped to the same reference genome index (by checking the identity of the @SQ sam header lines).
  • select: select pairsam entries with specific field values

    • select pairsam entries according to the provided condition. A programmable interface allows for arbitrarily complex queries on specific pair types, chromosomes, positions, strands, read IDs (including matches to a wildcard/regexp/list).
    • optionally print the non-matching entries into a separate file.
  • dedup: remove PCR duplicates from a sorted triu-flipped pairsam file

    • remove PCR duplicates by finding pairs of entries with both sides mapped to similar genomic locations (+/- N bp);
    • optionally output the PCR duplicate entries into a separate file;
    • NOTE: in order to remove all PCR duplicates, the input must contain *all* mapped read pairs from a single experimental replicate.
  • maskasdup: mark all pairs in a pairsam as Hi-C duplicates

    • change the field pair_type to DD;
    • change the pair_type tag (Yt:Z:) for all sam alignments;
    • set the PCR duplicate binary flag for all sam alignments (0x400).
  • split: split a pairsam file into pairs and sam alignments.

  • stats: calculate various statistics of .pairs and .pairsam files

  • restrict: identify the span of the restriction fragment forming a Hi-C junction
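
The flipping, sorting and duplicate-removal steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pairsamtools implementation. The names `Pair`, `flip`, `sort_key` and `dedup` are hypothetical. The sketch assumes that chromosomes are compared lexicographically by name, that pairs are sorted by (chrom1, chrom2, pos1, pos2), and that duplicates must also share strands; the +/- N bp tolerance is the `max_mismatch` parameter.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    read_id: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    strand1: str
    strand2: str


def flip(p):
    """Swap the two sides so that (chrom1, pos1) <= (chrom2, pos2)."""
    if (p.chrom1, p.pos1) > (p.chrom2, p.pos2):
        return Pair(p.read_id, p.chrom2, p.pos2, p.chrom1, p.pos1,
                    p.strand2, p.strand1)
    return p


def sort_key(p):
    # Assumed order: chromosomes first (lexicographic), then positions (numeric).
    return (p.chrom1, p.chrom2, p.pos1, p.pos2)


def dedup(sorted_pairs, max_mismatch=3):
    """Split flipped, sorted pairs into (kept, duplicates)."""
    kept, dups = [], []
    for p in sorted_pairs:
        is_dup = False
        # Walk back over kept pairs while they can still lie within the tolerance.
        for q in reversed(kept):
            if (q.chrom1, q.chrom2) != (p.chrom1, p.chrom2):
                break
            if p.pos1 - q.pos1 > max_mismatch:
                break
            if (abs(p.pos2 - q.pos2) <= max_mismatch
                    and (q.strand1, q.strand2) == (p.strand1, p.strand2)):
                is_dup = True
                break
        (dups if is_dup else kept).append(p)
    return kept, dups


# Made-up example: r2 matches r1 within 3 bp on both sides after flipping.
pairs = [
    Pair("r1", "chr2", 500, "chr1", 100, "-", "+"),
    Pair("r2", "chr1", 101, "chr2", 502, "+", "-"),
    Pair("r3", "chr1", 100, "chr2", 900, "+", "-"),
]
flipped = sorted((flip(p) for p in pairs), key=sort_key)
kept, dups = dedup(flipped, max_mismatch=3)
print([p.read_id for p in kept], [p.read_id for p in dups])
```

Flipping before sorting is what lets duplicates that were sequenced in opposite orientations end up next to each other, so a single backward scan over nearby entries is enough.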

pipelines

We provide a simple bash mapping pipeline in /examples/. It serves as an illustration of pairsamtools' functionality and will not be developed further.

distiller is a powerful Hi-C data analysis workflow, based on pairsamtools and nextflow.
