read-paf's Introduction

readpaf

readpaf is a fast parser for minimap2 PAF (Pairwise mApping Format) files. It is written in pure Python and has no required dependencies; pandas is needed only if you want the output as a DataFrame.

Installation

Minimal install:

pip install readpaf

With optional pandas dependency:

pip install readpaf[pandas]

Direct download

As readpaf is a self-contained module, it can also be installed by downloading just the module file. The latest version is available from:

https://raw.githubusercontent.com/alexomics/read-paf/main/readpaf.py

or a specific version can be downloaded from a release/tag like so:

https://raw.githubusercontent.com/alexomics/read-paf/v0.0.5/readpaf.py

PyPI is the recommended install method.

Usage

readpaf has one user-facing function, parse_paf, which accepts a file-like object: any Python object with a file-oriented API (sys.stdin, stdout from subprocess, io.StringIO, or open files from gzip or open).
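To illustrate what "file-like" means here, the sketch below uses only the standard library and does not call readpaf. The helper first_two_columns is hypothetical; it assumes, as a line-oriented parser typically does, that the input is iterated as lines of text. Under that assumption a gzip stream needs to be opened in text mode ("rt") so that lines arrive as str rather than bytes.

```python
import gzip
import io

PAF_TEXT = "read1\t100\t0\t100\t+\tchr1\t1000\t10\t110\t95\t100\t60\n"


def first_two_columns(file_like):
    """Hypothetical helper: yield (query_name, query_length) from text lines."""
    for line in file_like:
        line = line.strip()
        if not line:
            continue
        cols = line.split("\t")
        yield cols[0], int(cols[1])


# An in-memory text buffer can be iterated like an open file.
from_string = list(first_two_columns(io.StringIO(PAF_TEXT)))

# A gzip stream opened in text mode yields str lines as well.
compressed = io.BytesIO(gzip.compress(PAF_TEXT.encode()))
with gzip.open(compressed, "rt") as handle:
    from_gzip = list(first_two_columns(handle))
```

Both sources produce the same result because the helper only relies on iterating over lines of text.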

The following script demonstrates how minimap2 output can be piped into readpaf:

from readpaf import parse_paf
from sys import stdin

for record in parse_paf(stdin):
    print(record.query_name, record.target_name)

readpaf can also generate a pandas DataFrame:

from readpaf import parse_paf

with open("test.paf", "r") as handle:
    df = parse_paf(handle, dataframe=True)

Functions

readpaf has a single user function

parse_paf

parse_paf(file_like=file_handle, fields=list, na_values=list, na_rep=numeric, dataframe=bool)

Parameters:

  • file_like: A file-like object, such as sys.stdin, a file handle from open, or an io.StringIO object
  • fields: A list of 13 field names to use for the PAF file, default:
    "query_name", "query_length", "query_start", "query_end", "strand",
    "target_name", "target_length", "target_start", "target_end",
    "residue_matches", "alignment_block_length", "mapping_quality", "tags"
    These are based on the PAF specification.
  • na_values: A list of values to interpret as NaN. This is only applied to numeric fields, default: ["*"]
  • na_rep: Value to use when a NaN value specified in na_values is found. This should ideally be 0 to match minimap2's output, default: 0
  • dataframe: bool, if True, return a pandas.DataFrame with the tags expanded into separate Series

If used as an iterator, each object returned is a named tuple representing a single line in the PAF file. Each named tuple has the field names specified by the fields parameter. The SAM-like tags are converted into their specified types and stored in a dictionary, with the tag name as the key and a named tuple with fields name, type, and value as the value. When print or str is called on a PAF record (named tuple), a formatted PAF string is returned, which is useful for writing records to a file. The PAF record also has a method, blast_identity, which calculates the BLAST identity for that record.
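The record layout described above can be sketched with the standard library alone. This is an illustration and not readpaf's actual implementation: PAFRecord, Tag, CONVERTERS, and parse_line are hypothetical names, only the "i" and "f" tag types are converted, and BLAST identity is assumed to be residue_matches divided by alignment_block_length.

```python
from collections import namedtuple

FIELDS = [
    "query_name", "query_length", "query_start", "query_end", "strand",
    "target_name", "target_length", "target_start", "target_end",
    "residue_matches", "alignment_block_length", "mapping_quality", "tags",
]

Tag = namedtuple("Tag", ["name", "type", "value"])

# Assumption: integer and float tags are converted, all other types stay str.
CONVERTERS = {"i": int, "f": float}


class PAFRecord(namedtuple("PAFRecord", FIELDS)):
    __slots__ = ()

    def blast_identity(self):
        # Assumed definition: matching residues over alignment block length.
        return self.residue_matches / self.alignment_block_length


def parse_line(line):
    """Parse one well-formed PAF line into a PAFRecord (no NaN handling)."""
    cols = line.rstrip("\n").split("\t")
    tags = {}
    for raw in cols[12:]:
        name, type_, value = raw.split(":", 2)
        tags[name] = Tag(name, type_, CONVERTERS.get(type_, str)(value))
    return PAFRecord(
        cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4],
        cols[5], int(cols[6]), int(cols[7]), int(cols[8]),
        int(cols[9]), int(cols[10]), int(cols[11]), tags,
    )


record = parse_line(
    "read1\t100\t0\t100\t+\tchr1\t1000\t10\t110\t95\t100\t60\ttp:A:P\tNM:i:5"
)
```

With this layout, record.tags["NM"].value is the integer 5 and record.blast_identity() divides 95 by 100.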

If used to generate a pandas DataFrame, then each row represents a line in the PAF file and the SAM-like tags are expanded into individual series.

read-paf's People

Contributors

alexomics

read-paf's Issues

Handle unknown tag fields

Currently, only the expected tags (from the minimap2 manual) have a known type when tags are converted back into record format for writing out records. Tools like uncalled and sigmap use their own tags, which will not be written correctly.

A failing test would be:

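from io import StringIO
from readpaf import parse_paf

def test_write_unmapped_line():
    # Unmapped record carrying a non-minimap2 tag (mt:f) that should round-trip unchanged
    _rec = "2a708733-5e95-49e3-8806-e181e9380cd9\t3715\t*\t*\t*\t*\t*\t*\t*\t*\t*\t61\tmt:f:0.000000"
    PAF_IO = StringIO(_rec)
    for rec in parse_paf(PAF_IO):
        assert str(rec) == _rec.strip(), "record didn't match"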

This fails because the record will be assigned the tag mt:None:0.00000 when it is written, as the tag type is looked up by name:

"{}:{}:{}".format(k, REV_TYPES.get(k), v) for k, v in self[-1].items()

read-paf/readpaf.py

Lines 54 to 70 in 8cdef6b

REV_TYPES = {
    "tp": "A",
    "cm": "i",
    "s1": "i",
    "s2": "i",
    "NM": "i",
    "MD": "Z",
    "AS": "i",
    "ms": "i",
    "nn": "i",
    "ts": "A",
    "cg": "Z",
    "cs": "Z",
    "dv": "f",
    "de": "f",
    "rl": "i",
}

Here we should change the tag holder to include the original tag type. This may create an issue with dataframe conversion.
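One possible shape for this change, as a minimal sketch rather than the project's fix: keep the type that was read from the input alongside each tag, so writing never depends on a lookup table. The names Tag, parse_tag, format_tag, and typed_value are hypothetical, and the value is kept as its original text here so that float formatting such as 0.000000 survives a round trip.

```python
from collections import namedtuple

# Hypothetical tag holder that remembers the type seen in the input.
Tag = namedtuple("Tag", ["name", "type", "value"])

# Assumption: only integer and float tags need conversion on access.
CONVERTERS = {"i": int, "f": float}


def parse_tag(raw):
    """Split 'name:type:value', keeping the value as its original text."""
    name, type_, value = raw.split(":", 2)
    return Tag(name, type_, value)


def typed_value(tag):
    """Convert the stored text on demand using the tag's own type."""
    return CONVERTERS.get(tag.type, str)(tag.value)


def format_tag(tag):
    """Write the tag back using the stored type, not a lookup by name."""
    return "{}:{}:{}".format(tag.name, tag.type, tag.value)


unknown = parse_tag("mt:f:0.000000")
round_tripped = format_tag(unknown)
```

Because the type travels with the tag, a tag that is not in REV_TYPES (mt in the failing test) is written back exactly as it was read. How such a holder would interact with the DataFrame conversion is left open here.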

Unmapped reads write nan into output records

Records that incorrectly use * in place of integer values are assigned nan as of v0.0.7. When these records are written to a string, the affected fields are given the value nan. readpaf should read such records back without issue, but they may cause problems with other tools.

Eg:

2a708733-5e95-49e3-8806-e181e9380cd9	3715	nan	nan	*	*	nan	nan	nan	nan	nan	61

These reads should be written out using zeros in place of nan, to mimic what minimap2 does.
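A minimal sketch of that behaviour, assuming NaN is represented as a Python float and that only NaN fields need replacing. The helper format_field and its fill parameter are hypothetical and not part of readpaf.

```python
import math


def format_field(value, fill=0):
    """Return the field as text, writing `fill` wherever the value is NaN."""
    if isinstance(value, float) and math.isnan(value):
        return str(fill)
    return str(value)


nan = float("nan")
fields = [
    "2a708733-5e95-49e3-8806-e181e9380cd9", 3715, nan, nan, "*", "*",
    nan, nan, nan, nan, nan, 61,
]
line = "\t".join(format_field(value) for value in fields)
```

Non-numeric placeholders such as * in the strand and target name columns are left untouched; only the NaN fields become 0.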

Empty PAF

Hi @alexomics ,

Thanks again for the great tool!

I hit a bug where I had an empty PAF (no read alignments; this seems to be valid output from minimap2?). I tried reading this in with parse_paf into a pandas DataFrame, but it fails (I believe because of the tag columns). Is it possible to produce an empty df? Should this be the desired behaviour?

To reproduce:

touch test.paf

from readpaf import parse_paf

with open("test.paf", "r") as handle:
    df = parse_paf(handle, dataframe=True)

Thanks!
