
pyranges


Introduction

PyRanges is a Python library specifically designed for efficient and intuitive manipulation of genomics data, particularly genomic intervals (like genes, genomic features, or reads). The library is optimized for fast querying and manipulation of genomic annotations.

"Finally ... This was what Python badly needed for years." - Heng Li

Documentation

The pyranges documentation, including installation instructions, API, tutorial, and how-to-pages, is available at https://pyranges.readthedocs.io/

Features

  • fast
  • memory-efficient
  • featureful
  • pythonic/pandastic
  • supports chaining with a terse syntax
  • uses Pandas DataFrames, so the whole Python data science stack works on PyRanges.

Paper/Cite

Stovner EB, Sætrom P (2020) PyRanges: efficient comparison of genomic intervals in Python. Bioinformatics 36(3):918-919 http://dx.doi.org/10.1093/bioinformatics/btz615

Supporting pyranges

  • most importantly, cite pyranges if you use it. It is the main metric funding sources care about.
  • use pyranges in Stack Overflow/biostars questions and answers
  • star the repo (possibly important for github visibility and as a proxy for project popularity)
  • if you are a business using pyranges, please give to one of the charities listed at https://www.givewell.org/

Asking for help

If you encounter bugs, if the documentation is not enough and you cannot accomplish a specific task of interest, or if you'd like new features implemented, open an Issue on GitHub: https://github.com/pyranges/pyranges/issues

Contributing to pyranges

Pyranges accepts code contributions in the form of pull requests. For details, visit https://pyranges.readthedocs.io/developer_guide.html

pyranges's People

Contributors

aallahyar, cfriedline, dependabot[bot], endrebak, fabriziog202, fairliereese, jergosh, liyao001, marco-mariotti, mkanai, muhammedhasan, sambryce-smith, wdecoster, y9c


pyranges's Issues

Subsetting exception

Just following the example for subsetting from the documentation but I get the following exception:

>>> print(cpg[cpg.CpG > 50])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyranges/pyranges.py", line 101, in __getitem__
    return _getitem(self, val)
  File "pyranges/methods/getitem.py", line 20, in _getitem
    raise Exception("Not valid subsetter: {}".format(str(val)))
Exception: Not valid subsetter: 0        True
1        True
2        True
3        True
4       False
        ...
1072     True
1073    False
1074    False
1075    False
1076    False
Name: CpG, Length: 1077, dtype: bool
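Until the boolean-mask subsetter works on this pyranges version, the same filter can be applied to the underlying DataFrame. A minimal pandas-only sketch, with a hypothetical stand-in for the cpg object's data:

```python
import pandas as pd

# Hypothetical stand-in for the cpg object's underlying data; any
# DataFrame with a numeric column behaves the same way.
df = pd.DataFrame({"Chromosome": ["chr1"] * 4,
                   "Start": [0, 10, 20, 30],
                   "End": [5, 15, 25, 35],
                   "CpG": [60, 40, 75, 10]})

# Plain-pandas equivalent of cpg[cpg.CpG > 50]: filter the DataFrame;
# with pyranges installed, pr.PyRanges(subset) rebuilds the object.
subset = df[df.CpG > 50].reset_index(drop=True)
```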

Documentation questions

Hi,

If you are like me, then you probably don't like writing documentation, or worse, keeping that documentation up to date. But your documentation is very helpful for getting started with pyranges. It looks like I can clean up some nasty data-format-to-pandas parsers I wrote and replace them with pyranges code. Perhaps once I get more experienced with your module I can contribute (to the docs), but I'm not comfortable doing that yet.

After going through the documentation I have some questions/suggestions though:

  1. I might have missed it, but I see you use .slack() and slack=2 without explaining what exactly they do. Slack seems to extend intervals according to chapter 12 (https://biocore-ntnu.github.io/pyranges/methods-for-manipulating-single-pyranges.html), but perhaps that should be specified earlier: it is first mentioned as a method in a code example in chapter 7.
  2. I think it would be very helpful if you could add drawings to the part about intersecting, overlapping and joining, like in the bedtools documentation (e.g. https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html). At least for me, it can initially be hard to wrap my head around how the intervals are combined. I like visualizations :)
  3. Perhaps it would be useful to add some wikipedia links for background information, for example when you start talking about run-length encodings in part 20 (https://en.wikipedia.org/wiki/Run-length_encoding). Those concepts may not be obvious to everyone. But now I see the next part (21) is an introduction to RLEs, so maybe the order of the chapters needs to be revised.

I could probably hack something together, but perhaps you can suggest an efficient solution for the following:
I have two data files, containing per genomic position some information. In addition to those, I have a bed file with intervals of interest.
Now for every interval in the bed file, I would like to slice the two data files, sum some numbers and statistically compare file A vs file B.
And then print some output, move on to the next interval.

Would it be most efficient to just iterate over the bed pyranges object and slice pyranges object of the data files? Or should I first annotate every record in the data files with the name of the interval in the bed file it belongs to?

Thanks,
Wouter
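For a workflow like the one described, one pandas-only sketch is to iterate the bed intervals and slice both per-position tables (the column names and data here are made up for illustration; with pyranges, an intersect or join would replace the manual slicing, and annotating the data files with the interval they belong to via a join is typically faster than per-interval iteration):

```python
import pandas as pd

# Hypothetical per-position data (file A and file B) and intervals
# of interest; these names are illustrative, not pyranges API.
data_a = pd.DataFrame({"pos": range(100), "score": [1.0] * 100})
data_b = pd.DataFrame({"pos": range(100), "score": [2.0] * 100})
bed = pd.DataFrame({"Start": [0, 50], "End": [10, 60]})

results = []
for iv in bed.itertuples():
    # Slice both data files to the half-open interval and sum the
    # scores; a statistical test (e.g. from scipy.stats) would go here.
    a = data_a[(data_a.pos >= iv.Start) & (data_a.pos < iv.End)].score.sum()
    b = data_b[(data_b.pos >= iv.Start) & (data_b.pos < iv.End)].score.sum()
    results.append((iv.Start, iv.End, a, b))
```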

Different results with join in 0.0.57

Prior to 0.0.57, the files I am processing produce much larger PyRanges objects. However, when I upgrade to 0.0.57, the ranges are much smaller and cause our pipeline to crash. I'm happy to share files offline, but can't post them here.

For example, in 0.0.56 (and 0.0.55), I log this:

2019-10-23 14:00:52,481|processor.processor|process|769|INFO|Processing 8577 exon overlaps from 16 segments in 964 transcripts of 355 genes

With the same files and code, but after upgrading to 0.0.57, this is logged.

2019-10-23 14:05:06,403|processor.processor|process|769|INFO|Processing 17 exon overlaps from 3 segments in 11 transcripts of 3 genes

The method in question is here:

def get_exon_join(self, other: PyRanges, flip: bool = True) -> PyRanges:
    """
    Join ranges, other, with exon ranges. If flip is false, the join
    is done with other as the target, rather than the source
    :param other: the ranges to join
    :param flip: if true, other is first. if false, other is second
    :return: a pyranges object
    """
    other.Chromosome = other.df.Chromosome.apply(self.get_target_contig_name)
    if flip:
        return other.join(self._exon_ranges, suffix="_exon")
    return self._exon_ranges.join(other, suffix="_other")

Has anything changed with respect to join in this regard? I've tried messing with new_pos with no success.

Thanks so much for taking a look!

license??

What license is this software released under? What are the terms for using it in my own project?

Or is the license embedded in another file?

Convention is to have a LICENSE file

overlap fails when other pyrange is empty

Hi,

I ran into this problem when finding what ranges from one set overlap ranges in several other sets. Here is a trivial test case:

pr1 = pr.PyRanges(pd.DataFrame({
    'Chromosome': ['1', '1'],
    'Start': [243, 645],
    'End': [267, 689],
}))
pr2 = pr.PyRanges(pd.DataFrame({
    'Chromosome': ['2', '2'],
    'Start': [243, 645],
    'End': [267, 689],
}))
pr3 = pr.PyRanges(pd.DataFrame({
    'Chromosome': ['3', '3'],
    'Start': [243, 645],
    'End': [267, 689],
}))

# overlap of pr1 and pr2; should be empty
s1 = pr1
s2 = pr2
res1 = s1.overlap(s2, strandedness=None, how='first')

# overlap of first comparison with pr3; should also be empty
s1 = res1
s2 = pr2
res2 = s1.overlap(s2, strandedness=None, how='first')

Which generates this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-158-047be8f5ac7d> in <module>
     17 s1 = res1
     18 s2 = pr2
---> 19 res2 = s1.overlap(s2, strandedness=None, how='first')
     20 len(res2.df)

~/jupyter-venv/lib/python3.7/site-packages/pyranges/pyranges.py in overlap(self, other, **kwargs)
    423         kwargs = fill_kwargs(kwargs)
    424 
--> 425         dfs = pyrange_apply(_overlap, self, other, **kwargs)
    426 
    427         return PyRanges(dfs)

~/jupyter-venv/lib/python3.7/site-packages/pyranges/multithreaded.py in pyrange_apply(function, self, other, **kwargs)
    215     else:
    216 
--> 217         if self.stranded and not other.stranded:
    218 
    219             for (c, s), df in items:

~/jupyter-venv/lib/python3.7/site-packages/pyranges/pyranges.py in stranded(self)
    686     def stranded(self):
    687         # print(self.keys())
--> 688         return len(list(self.keys())[0]) == 2
    689 
    690     @property

IndexError: list index out of range

The immediate problem is that since res1 is empty, it has nothing with which to populate the dataframe dictionary dfs. So when checking whether either set is stranded, it fails because self.keys() is empty:

    @property
    def stranded(self):
        return len(list(self.keys())[0]) == 2

Not sure if this is the best fix, but stranded should probably first check that len(self.keys()) > 0 or something. Alternatively, maybe just save a lot of computational work and check in overlap whether either PyRanges is empty, and if so return an empty PyRanges object.
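The suggested guard can be sketched in plain Python with a toy stand-in class (not the pyranges source); the point is only to check for emptiness before indexing into the keys:

```python
# Toy stand-in for a PyRanges-like object that stores per-chromosome
# DataFrames in a dict; keys are "chr1" (unstranded) or ("chr1", "+").
class RangesSketch:
    def __init__(self, dfs):
        self.dfs = dfs

    def keys(self):
        return self.dfs.keys()

    @property
    def stranded(self):
        keys = list(self.keys())
        # An empty object has no keys; report unstranded instead of
        # indexing keys[0] and raising IndexError.
        if not keys:
            return False
        return len(keys[0]) == 2

empty = RangesSketch({})
stranded = RangesSketch({("chr1", "+"): None})
```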

Bioconda package?

Hi, this looks like a promising library that could replace pybedtools in my workflow. Are there any plans to add it as a package in the bioconda channel?

What's the roadmap?

First of all, thanks so much for all the work you already put into this. I have been tentatively working with your package and think this is very promising. I am sure many people would be very happy to have a Ranges package of this quality in their python toolbox. I have two questions:

  1. Is there any roadmap? Can you give any estimate when you'll get to work on this package again and if and when you would recommend starting to use this package? Btw, I am already (test-)using the package with some success.
  2. Would you be interested in some pull requests, or do you still want to experiment with the package design and it's too early for outside contributions?

Thanks!

pyranges coverage - going beyond count of overlapping reads

This is a great package thanks - pyranges is easing a transition from R to python.

Previously using GenomicRanges I have used the binnedAverage R function to convert RLE data to mean depth of coverage across target bins.

I much appreciate the pyranges coverage function and the associated NumberOverlaps and FractionOverlaps and would now love to discover how to move beyond these metrics and include e.g. mean coverage and stddev of coverage.

Would welcome any thoughts and suggestions
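A binnedAverage-style summary can be sketched with NumPy alone, assuming the per-base depth has been expanded to an array (no pyranges/pyrle API details are assumed here):

```python
import numpy as np

# Hypothetical per-base depth for one chromosome, e.g. expanded from
# an Rle; the bins are half-open [start, end) target intervals.
depth = np.array([0, 0, 3, 3, 3, 5, 5, 0, 0, 0], dtype=float)
bins = [(0, 5), (5, 10)]

# Mean and standard deviation of coverage per bin.
means = [depth[s:e].mean() for s, e in bins]
stds = [depth[s:e].std() for s, e in bins]
```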

challenges with pyrle

Dear @endrebak

I have been looking into the updated pyrle = 0.0.28 and have some challenges - the code example that you presented works without issue

import pyranges as pr
gr = pr.random(int(3e6))
r = gr.to_rle()
t = gr.tile(100)
print(r[t])

I have two objects that are giving me some issues

(1) - object rle contains the RLE information from a BAM file corresponding to a target enrichment study - the data looks just like this

1 +
+--------+---------+-----+-----+------+------+---------+------+-------+-------+-------+-----+
| Runs   | 10001   | 2   | 4   | 25   | 66   |  ...    | 18   | 155   | 573   | 125   | 1   |
|--------+---------+-----+-----+------+------+---------+------+-------+-------+-------+-----|
| Values | 0.0     | 2.0 | 3.0 | 4.0  | 5.0  | ...     | 2.0  | 3.0   | 4.0   | 3.0   | 1.0 |
+--------+---------+-----+-----+------+------+---------+------+-------+-------+-------+-----+
Rle of length 248946422 containing 15316 elements
...
Y -
+--------+-----------+-------+--------+---------+----------+---------+--------+-------+--------+--------+--------+
| Runs   | 3328476   | 368   | 1201   | 13975   | 134481   |  ...    | 2861   | 326   | 9465   | 3329   | 3993   |
|--------+-----------+-------+--------+---------+----------+---------+--------+-------+--------+--------+--------|
| Values | 0.0       | 1.0   | 2.0    | 1.0     | 0.0      | ...     | 1.0    | 2.0   | 1.0    | 2.0    | 1.0    |
+--------+-----------+-------+--------+---------+----------+---------+--------+-------+--------+--------+--------+
Rle of length 56887898 containing 1110 elements
PyRles object with 50 chromosomes/strand pairs.

(2) -- ranges is a pyranges object that looks like this

+--------------+-----------+-----------+--------------+
| Chromosome   | Start     | End       | Strand       |
| (category)   | (int32)   | (int32)   | (category)   |
|--------------+-----------+-----------+--------------|
| 1            | 0         | 1000      | +            |
| 1            | 1000      | 2000      | +            |
| 1            | 2000      | 3000      | +            |
| 1            | 3000      | 4000      | +            |
| ...          | ...       | ...       | ...          |
| Y            | 57224000  | 57225000  | -            |
| Y            | 57225000  | 57226000  | -            |
| Y            | 57226000  | 57227000  | -            |
| Y            | 57227000  | 57227415  | -            |
+--------------+-----------+-----------+--------------+
Stranded PyRanges object has 6,176,562 rows and 4 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
When I run

print(rle[ranges])

I am met with an error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-44-3caf6f3aa493> in <module>
      1 import mkl
      2 mkl.set_num_threads(1)
----> 3 print(rle[ranges])

~/miniconda3/envs/megrim/lib/python3.7/site-packages/pyrle/rledict.py in __getitem__(self, key)
    227 
    228                 id_values, id_runs = find_runs(ids)
--> 229                 df = pd.DataFrame({"Start": np.repeat(v.Start.values, id_runs),
    230                                    "End": np.repeat(v.End.values, id_runs),
    231                                    "RunID": ids,

<__array_function__ internals> in repeat(*args, **kwargs)

~/miniconda3/envs/megrim/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
    480 
    481     """
--> 482     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    483 
    484 

~/miniconda3/envs/megrim/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     59 
     60     try:
---> 61         return bound(*args, **kwds)
     62     except TypeError:
     63         # A TypeError occurs if the object does have such a method in its

ValueError: operands could not be broadcast together with shape (248957,) (248947,)

I have no idea what I have done wrong here - the data looks very much similar to your simulated data - a brief comment would be very much welcomed - thanks!

possible bug joining int64 ranges

I was happy to see this package, thank you.

I am getting failed joins when using long ranges (ie that cover the genome rather than a chromosome)

import pyranges as pr
import pandas as pd
a = pr.PyRanges(pd.DataFrame(data = {'Chromosome':  [0], 'Start':[0], 'End': 5e10}), int64=True)
b = pr.PyRanges(pd.DataFrame(data = {'Chromosome':  [0, 0], 'Start':[0, 5], 'End': [5, 5e10]}), int64=True)
a.join(b)

produces:
Empty PyRanges

I expect it to produce the same as below

a = pr.PyRanges(pd.DataFrame(data = {'Chromosome':  [0], 'Start':[0], 'End': 5e8}), int64=True)
b = pr.PyRanges(pd.DataFrame(data = {'Chromosome':  [0, 0], 'Start':[0, 5], 'End': [5, 5e8]}), int64=True)
a.join(b)

which produces

+--------------+-----------+-----------+-----------+-----------+
|   Chromosome |     Start |       End |   Start_b |     End_b |
|   (category) |   (int64) |   (int64) |   (int64) |   (int64) |
|--------------+-----------+-----------+-----------+-----------|
|            0 |         0 | 500000000 |         0 |         5 |
|            0 |         0 | 500000000 |         5 | 500000000 |
+--------------+-----------+-----------+-----------+-----------+
Unstranded PyRanges object has 2 rows and 5 columns from 1 chromosomes.

Python version 3.7.4
pyranges version 0.0.54

Where did insert go? What to do instead?

I am a user of PyRanges and I found your contact information on GitHub. I am wondering why the PyRanges "insert" function was removed, which let one overlap PyRanges objects and add a column of data simultaneously. I used this quite often. Is there another way to accomplish the same thing? I cannot seem to find anything similar in the documentation. If you have the chance to respond with any suggestion, I would appreciate the guidance.

Pyranges.subset drops Strand

First, thank you for your work. Alternatives to compute genomic arithmetics are always welcome!

cs = pyranges.data.chipseq()

cs
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.


def too_long(x):
    return (x.End - x.Start) < 120

cs.subset(too_long)

+--------------+-----------+-----------+------------+-----------+
| Chromosome   | Start     | End       | Name       | Score     |
| (category)   | (int32)   | (int32)   | (object)   | (int64)   |
|--------------+-----------+-----------+------------+-----------|
| chr1         | 212609534 | 212609559 | U0         | 0         |
| chr1         | 169887529 | 169887554 | U0         | 0         |
| chr1         | 216711011 | 216711036 | U0         | 0         |
| chr1         | 144227079 | 144227104 | U0         | 0         |
| ...          | ...       | ...       | ...        | ...       |
| chrY         | 15224235  | 15224260  | U0         | 0         |
| chrY         | 13517892  | 13517917  | U0         | 0         |
| chrY         | 8010951   | 8010976   | U0         | 0         |
| chrY         | 7405376   | 7405401   | U0         | 0         |
+--------------+-----------+-----------+------------+-----------+
Unstranded PyRanges object has 10,000 rows and 5 columns from 24 chromosomes.

Why do we miss the Strand column with the subset function? Is it a bug or intended behaviour?

pip install does not install Cython as dependency

When trying to install from pip, I get the following error:
'ModuleNotFoundError: No module named 'Cython'

If I install cython manually, the pip installation for pyranges then works. Is there a reason cython is not included as a requirement for pip?

What to do about PyRanges too wide to display in terminal?

One thing I have been wondering about is what to do in case the range is too wide to display properly. I guess truncating the columns Chromosome Strand Start End to Position in the display and only have one column in the displayed output like

Position
chr1:100-200:+

without changing the underlying data would be an improvement.

But this is something that one should be able to change in the settings.

In case it is still too wide, I guess I will have to remove a few columns and just display ... instead.
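The proposed Position display column is straightforward to build from the four columns with pandas string concatenation; a sketch (formatting only, with no change to the underlying data):

```python
import pandas as pd

# Illustrative data; the Position string is purely a display helper,
# not an existing pyranges column.
df = pd.DataFrame({"Chromosome": ["chr1", "chr2"],
                   "Start": [100, 5],
                   "End": [200, 10],
                   "Strand": ["+", "-"]})

# Collapse Chromosome/Start/End/Strand into one "chrom:start-end:strand"
# column for compact printing.
position = (df.Chromosome.astype(str) + ":" + df.Start.astype(str)
            + "-" + df.End.astype(str) + ":" + df.Strand.astype(str))
```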

Behaviour of gr.merge

I have been testing gr overlap and merge functions. As far as I can tell right now overlap works by not considering the end (1,5] while merge works by considering it (1,5)?

I have this example below:

import pyranges as pr
from pyranges import PyRanges

import pandas as pd

from io import StringIO

f1 = """Chromosome Start End Score Strand
chr1 4 7 23.8 +
chr1 6 11 0.13 -
chr2 0 14 42.42 +"""


df1 = pd.read_csv(StringIO(f1), sep="\s+")

gr1 = PyRanges(df1)

f2 = """Chromosome Start End Score Strand
chr2 14 18 15.3 +
"""


df2 = pd.read_csv(StringIO(f2), sep="\s+")

gr2 = PyRanges(df2)

gr1.overlap(gr2)

f3 = """Chromosome Start End Score Strand
chr1 4 7 23.8 +
chr1 6 11 0.13 -
chr2 0 14 42.42 +
chr2 14 18 15.3 +"""


df3 = pd.read_csv(StringIO(f3), sep="\s+")

gr3 = PyRanges(df3)

gr3.merge(count=True)

overlap doesn't produce any overlaps while merge merges the intervals that start and end with the same coordinate. Personally I would have expected the two to work the same way. Is there a way to get the same results from merge as you get from overlap?
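The behaviour described is consistent with 0-based half-open coordinates: [0, 14) and [14, 18) share no base, so overlap reports nothing, while a merge that also unions touching ("bookended") intervals joins them. A sketch of the two predicates:

```python
# Half-open interval semantics: (start, end) covers positions
# start .. end-1, so the end coordinate itself is excluded.
def overlaps(a, b):
    # True only if the intervals share at least one position.
    return a[0] < b[1] and b[0] < a[1]

def bookended_or_overlapping(a, b):
    # Merge-style test: touching end-to-start also counts.
    return a[0] <= b[1] and b[0] <= a[1]

iv1, iv2 = (0, 14), (14, 18)
```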

Equivalent of findOverlaps() from GenomicRanges in R

Thanks for making a really useful package! I was wondering if it's possible to get the information about which pairs of regions form overlaps, similar to the Hits object from the GenomicRanges library in R?
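For reference, the Hits-style result is just the pairs of indices of overlapping intervals. A brute-force, purely illustrative sketch (pyranges itself uses NCLS, a nested containment list, for interval queries):

```python
# Return (query_index, subject_index) pairs of overlapping half-open
# intervals, like GenomicRanges' findOverlaps Hits object. Quadratic
# and illustrative only.
def find_overlap_pairs(query, subject):
    hits = []
    for qi, (qs, qe) in enumerate(query):
        for si, (ss, se) in enumerate(subject):
            if qs < se and ss < qe:
                hits.append((qi, si))
    return hits

pairs = find_overlap_pairs([(0, 10), (20, 30)], [(5, 25), (40, 50)])
```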

join: ValueError: Buffer dtype mismatch, expected 'const long' but got Python object

my_data contains the reads I want to map to genes
my_annotation contains the gene information

When I do this:

my_annotation.join(my_data)

It is OK.

But when I do this:

my_data.join(my_annotation)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.local/lib/python3.6/site-packages/pyranges/pyranges.py", line 230, in join
    dfs = pyrange_apply(_write_both, self, other, **kwargs)
  File "/.local/lib/python3.6/site-packages/pyranges/multithreaded.py", line 271, in pyrange_apply
    result = call_f(function, nparams, df, odf, kwargs)
  File "/.local/lib/python3.6/site-packages/pyranges/multithreaded.py", line 22, in call_f
    return f.remote(df, odf, kwargs)
  File "/.local/lib/python3.6/site-packages/pyranges/methods/join.py", line 45, in _write_both
    scdf, ocdf = _both_dfs(scdf, ocdf, how=how)
  File "/.local/lib/python3.6/site-packages/pyranges/methods/join.py", line 18, in _both_dfs
    starts, ends, indexes)
  File "ncls/src/ncls32.pyx", line 76, in ncls.src.ncls32.NCLS32.all_overlaps_both
ValueError: Buffer dtype mismatch, expected 'const long' but got Python object

Here is what my_data looks like:

+--------------+-----------+-----------+-----------+
| Chromosome   | Start     | End       | Index     |
| (category)   | (int32)   | (int32)   | (int64)   |
|--------------+-----------+-----------+-----------|
| 1            | 3003842   | 3014343   | 0         |
| 1            | 3025329   | 3035830   | 1         |
| 1            | 3030643   | 3041144   | 2         |
| 1            | 3057723   | 3068243   | 3         |

Here is what my_annotation looks like:

+--------------+------------+-------------+-----------+-----------+--------------+-------+
| Chromosome   | source     | gene_type   | Start     | End       | Strand       | ...   |
| (category)   | (object)   | (object)    | (int32)   | (int32)   | (category)   | ...   |
|--------------+------------+-------------+-----------+-----------+--------------+-------|
| 1            | havana     | gene        | 3073253   | 3074322   | +            | ...   |
| 1            | havana     | transcript  | 3073253   | 3074322   | +            | ...   |
| 1            | havana     | exon        | 3073253   | 3074322   | +            | ...   |
| 1            | ensembl    | gene        | 3102016   | 3102125   | +            | ...   |

Do you have any idea about this error?

add columns from overlap object

Is there a way to add columns from the PyRanges object that was used for overlapping? What I mean is: let's say one uses a subset of a GTF file as regions of interest, but after overlapping also wants to annotate with other columns from the GTF file, like the gene name.

It could be an option like this:
gr.overlap(gr_gtf, addcols = True)
The output would have all rows from gr that overlap with gtf, but in addition also all columns from gtf object for each overlap.

Keep headers bam (and vcf)

The multiline headers in bam/vcf files are often significant. How should they be preserved?

Possibility: add a meta-dict to PyRanges where the header is stored after reading a file. Or just in a variable gr.header.

Error during printing - Wrong number of dimensions

I'm trying to work through some of the examples in the documentation and run into an error when trying to print pyranges objects:

gr = pr.PyRanges(chromosomes="chr1", strands="+", starts=[0, 1, 2], ends=(3, 4, 5))
print(gr)

Traceback (most recent call last):
File "", line 1, in
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pyranges/pyranges.py", line 107, in str
return tostring(self)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pyranges/tostring2.py", line 276, in tostring
df = _get_df(self, n, sort)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pyranges/tostring2.py", line 72, in _get_df
df = self.df.astype(object)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/util/_decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3299, in astype
**kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3224, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3091, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 471, in astype
**kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 2197, in _astype
return self.make_block(values)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 206, in make_block
return make_block(values, placement=placement, ndim=ndim, **kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 2719, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 1844, in init
placement=placement, **kwargs)
File "/Users/jnasser/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 106, in init
raise ValueError('Wrong number of dimensions')
ValueError: Wrong number of dimensions

pyranges version: 0.0.54

Keep QNAMES when reading BAM files

Hi,

First of all, thanks a lot for the great package. This is exactly what the python ecosystem required.

It would be great, if the QNAME field of BAM files would be kept in the pyranges object when using the read_bam function. E.g. when working with paired end data, this is important to recover read pairs.

Thanks a lot

Unexpected behavior of coverage

Given a PyRanges object:

+--------------+-----------+-----------+------------------------+
| Chromosome   | Start     | End       | Tumor_Sample_Barcode   |
| (category)   | (int32)   | (int32)   | (object)               |
|--------------+-----------+-----------+------------------------|
| X            | 44938423  | 44938423  | DO52686                |
| X            | 49595038  | 49595038  | DO52686                |
| X            | 49957631  | 49957631  | DO52686                |
| X            | 51488226  | 51488226  | DO52686                |
| ...          | ...       | ...       | ...                    |
| X            | 137709431 | 137709431 | DO52686                |
| X            | 140993315 | 140993315 | DO52686                |
| X            | 144909350 | 144909350 | DO52686                |
| X            | 144909500 | 144909500 | DO52686                |
+--------------+-----------+-----------+------------------------+
Unstranded PyRanges object has 18 rows and 4 columns from 1 chromosomes.

The corresponding PyRles object from calling coverage() is:

+--------+-------------+
| Runs   | 144909500   |
|--------+-------------|
| Values | 0.0         |
+--------+-------------+
Rle of length 144909500 containing 1 elements

However, on R's GenomicRanges, the RleList with the same data is:

integer-Rle of length 155270560 with 37 runs
  Lengths: 44938422        1  4656614        1   362592        1  1530594 ...        1  3916034        1      149        1 10361060
  Values :        0        1        0        1        0        1        0 ...        1        0        1        0        1        0

serialization for multiprocessing

Hi again,
as usual I am super-thrilled at how this is coming along.

I found that I often want to pass a PyRanges instance to a function which I want to run in parallel (using multiprocessing). I can of course pass the underlying dataframe and then reconstruct the PyRanges within the child processes. However, i) PyRanges construction for large sets of regions takes some time, and ii) if I use the same function outside of the context of multiprocessing, I do want to pass a PyRanges instance directly. So it would be just a little more convenient if the PyRanges class were serializable.

Do you have any plans / thoughts on that? In the simplest case, __setstate__ and __getstate__ could automate the 'serialize the underlying DF, then reconstruct the PyRanges' part. I am not Cython-savvy enough to know whether there is a more performant alternative...

Thanks!
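The __getstate__/__setstate__ idea can be sketched with a toy wrapper class (a stand-in, not the pyranges implementation): pickle only the underlying table and rebuild the wrapper on load:

```python
import pickle

# Toy stand-in for a PyRanges-like wrapper around tabular data.
class RangesWrapper:
    def __init__(self, table):
        self.table = table

    def __getstate__(self):
        # Serialize just the underlying data.
        return {"table": self.table}

    def __setstate__(self, state):
        # Reconstruct the wrapper from the serialized data.
        self.table = state["table"]

gr = RangesWrapper([("chr1", 0, 10)])
restored = pickle.loads(pickle.dumps(gr))
```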

Enable compression of data frames when writing to csv

Thanks for the great addition to the Python ecosystem! We are starting to incorporate this into some of our systems, which operate on models built from large amounts of transcript and exon range data. We build these models now with the really nice interlap library and drop them to pickles for reuse.

Saving our PyRanges objects to csv files is an excellent replacement for this, but they are kind of large when written (~70MB). It would be great to pass the compression kwarg of pandas, like you're currently doing with sep and header to reduce that size.
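For comparison, plain pandas already honours the compression kwarg; the request is to forward the same thing from pyranges' csv writer. A pandas-only sketch:

```python
import os
import tempfile

import pandas as pd

# Illustrative data; the point is only pandas' compression kwarg,
# which infers gzip from the .gz extension as well.
df = pd.DataFrame({"Chromosome": ["chr1"], "Start": [0], "End": [10]})

path = os.path.join(tempfile.mkdtemp(), "ranges.csv.gz")
df.to_csv(path, index=False, compression="gzip")
back = pd.read_csv(path, compression="gzip")
```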

How to create/add multiple metadata columns?

What is the best way to add multiple metadata columns to a PyRanges object simultaneously (or to create a PyRanges object from a dataframe that has many metadata columns)?

The documentation describes adding one metadata column at a time, but it doesn't provide the syntax for multiple columns.
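One route that works today is to annotate the DataFrame first and construct the PyRanges from it, since the constructor accepts extra columns beyond Chromosome/Start/End. A pandas-only sketch using assign, which adds any number of columns in one call:

```python
import pandas as pd

# Core columns; pr.PyRanges(df) would accept the annotated frame.
df = pd.DataFrame({"Chromosome": ["chr1", "chr1"],
                   "Start": [0, 100],
                   "End": [50, 150]})

# assign adds multiple metadata columns simultaneously.
df = df.assign(Name=["a", "b"], Score=[1, 2])
```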

PyRanges constructor changes input df dtypes

The current behavior of the PyRanges constructor when initialising from a DataFrame is to change the dtypes of the Chromosome, Start and End columns of the input dataframe in place.

For me this behavior has been problematic. Would you consider changing it? The culprit is the set_dtypes call at the beginning of PyRanges.__init__.

Example:

import pyranges as pr
import pandas as pd

df = pd.DataFrame({
    "Chromosome": list("1122"),
    "Start": [1, 2, 3, 4],
    "End": [10, 11, 12, 13]}
)
df.dtypes
# object, int64, int64

gr = pr.PyRanges(df)

df.dtypes
# category, int32, int32
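Until the constructor copies internally, a caller-side workaround is to pass a defensive copy. The snippet below mimics the in-place dtype conversion on a copy to show the caller's frame stays untouched (a sketch, not pyranges code):

```python
import pandas as pd

df = pd.DataFrame({"Chromosome": list("1122"),
                   "Start": [1, 2, 3, 4],
                   "End": [10, 11, 12, 13]})

# What the constructor would receive instead of df itself:
safe = df.copy()
# ... the same kind of conversion set_dtypes performs, but on the copy
safe["Chromosome"] = safe["Chromosome"].astype("category")
safe["Start"] = safe["Start"].astype("int32")

# The caller's frame keeps its original dtypes
```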

Btw: thanks again for your great work!

join/merge function

I want to merge/join two PyRanges objects, say obj1 and obj2, and still be able to track back the indices of obj1 and obj2. I tried join, but it returns a completely new object from which I can no longer trace the original indices. I imagine something similar to the merge function in R. Any suggestions? Or should the join function be improved?
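One workaround, assuming join carries metadata columns from both sides into its output: stash each object's original index as an ordinary column before joining, so every output row can be traced back.

```python
import pandas as pd

df1 = pd.DataFrame({"Chromosome": ["chr1", "chr1"],
                    "Start": [0, 50], "End": [10, 60]})
df2 = pd.DataFrame({"Chromosome": ["chr1"], "Start": [5], "End": [15]})

# Record the original row indices as plain metadata columns
df1["obj1_index"] = df1.index
df2["obj2_index"] = df2.index

# obj1, obj2 = pr.PyRanges(df1), pr.PyRanges(df2)
# obj1.join(obj2) would then carry obj1_index / obj2_index in its output
```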

Using int64 PyRanges is buggy

I need to test the whole codebase with values larger than 2,147,483,647 (the int32 maximum) to see that it works as expected.

Known places that are buggy: pyranges.random, pyranges.cluster and possibly more :/

Exception: ('PyRanges object has no attribute', '__getstate__') when run in parallel

Hi @endrebak. Another snag I ran into. We have a large number of ranges we are processing, and it would be great to be able to do this in parallel. Some of our processing is fast, but for other situations it could take up to an hour or more to get all of the records processed. A use case to consider is iterating a VCF file and for each record, determining transcript and exon overlaps from PyRanges objects which contain this information. I'd like to be able to parallelize sending the overlap calls for each record in the VCF. However, the PyRanges objects can't be pickled in their current state.

Here's a trivial example, with some code to show the parallel syntax working.

# %%
import pyranges as pr
gr = pr.data.chipseq()

# %%
from joblib import Parallel, delayed


# %%
def overlap(source, target):
    return target.overlap(source)

def test(t):
    return t

samples = [pr.PyRanges(gr.df.sample()) for i in range(10)]     

# %%
with Parallel(n_jobs=8) as par:
    results = par((delayed(test)(i) for i in range(10)))
results

# %%
with Parallel(n_jobs=8) as par:
    results = par((delayed(overlap)(gr, s) for s in samples))
results

I will concede that there may be a better way to do this. Some info on the underlying data:

  • exons: Stranded PyRanges object has 790,511 rows and 19 columns from 25 chromosomes
  • transcripts: Stranded PyRanges object has 77,825 rows and 19 columns from 25 chromosomes.

When searching exons, I do filter the ranges first by transcript to speed things up, and doing this on a record-by-record basis in parallel would be better.

Any way it would be possible to make PyRanges objects pickleable?

Thanks very much!
Chris

Printing the files after writing them out is slightly annoying in the REPL

Being able to do gr.to_csv("bla").cluster() is convenient for storing the intermediate results in a call chain to view them later, but it is annoying to have the PyRanges printed when just doing

gr.to_gff3("bla")

in the REPL.

Solutions?

I could add a chain-flag to the to_* methods. I guess this is the most Pythonic way, as it is explicit.

gr.to_gff3("bla", chain=False)

is kinda verbose though.

There are ways to know whether you are in an interactive interpreter or not, but I do not see how to use them to turn off printing in the specific case of writing a pyrange to a file.
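A zero-API workaround that already works: bind the chained return value to a throwaway name, since the REPL only echoes unassigned expression results. The toy class below stands in for PyRanges.

```python
class Chainable:
    """Toy stand-in for a PyRanges whose to_* methods return self."""
    def to_gff3(self, path):
        # ... write the file here ...
        return self  # returned for chaining

gr = Chainable()
result = gr.to_gff3("bla")   # in a REPL, "_ = gr.to_gff3('bla')" stays silent
```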

tileGenome equivalent

Correct me if I'm wrong, but I don't believe there is an equivalent for R's tileGenome function

tileGenome returns a set of genomic regions that form a partitioning of the specified genome. Each region is called a "tile".
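For reference, the behaviour is easy to sketch on top of pandas (a hypothetical helper, not pyranges API): partition each chromosome into fixed-size tiles, clipping the last tile to the chromosome end.

```python
import pandas as pd

def tile_genome(chromsizes, tile_size):
    """Partition each chromosome into fixed-size, end-clipped tiles."""
    rows = []
    for chrom, size in chromsizes.items():
        for start in range(0, size, tile_size):
            rows.append((chrom, start, min(start + tile_size, size)))
    return pd.DataFrame(rows, columns=["Chromosome", "Start", "End"])

tiles = tile_genome({"chr1": 25, "chr2": 10}, 10)
# chr1 -> (0, 10), (10, 20), (20, 25); chr2 -> (0, 10)
```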

Use of git tags for releases

Hi - thanks for the great work and continuing to enhance this package. We're getting great use out of it. Would it be possible to incorporate git tagging/releases to match with what's in pypi? That would make tracking changes, especially breaking ones, much more straightforward since the tag would capture the whole repo, not just individual commit messages.

No module named raymock

Hello,

Thanks very much for bringing GRanges to python. Very much appreciated!

I tried to install pyranges with pip install pyranges. The installation reported no problems. However, when I tried to import pyranges, it failed with No module named raymock.

As a quick fix I changed import pyranges.raymock as ray to import raymock as ray in pyranges.py, but then it complained about other modules in the same way. Is there a way to fix all of them automatically? Thanks for any help.

Simo

Buffer dtype mismatch, expected 'long' but got 'long long'

I'm running the pyranges example code for nearest.py on Windows (Python 3.7.3 on Anaconda). I installed pyranges with pip. I'm getting this error:

gr = pr.data.chipseq()
gr2 = pr.data.chipseq_background()
print(gr.nearest(gr2, suffix="_Input"))
ValueError Traceback (most recent call last)

      1 gr = pr.data.chipseq()
      2 gr2 = pr.data.chipseq_background()
----> 3 print(gr.nearest(gr2, suffix="_Input"))

~\Anaconda3\lib\site-packages\pyranges\pyranges.py in nearest(self, other, **kwargs)
    199             assert other.stranded, "If doing upstream or downstream nearest, other pyranges must be stranded"
    200 
--> 201         dfs = pyrange_apply(_nearest, self, other, **kwargs)
    202 
    203         return PyRanges(dfs)

~\Anaconda3\lib\site-packages\pyranges\multithreaded.py in pyrange_apply(function, self, other, **kwargs)
    290                 # dbg(odf)
    291 
--> 292                 result = call_f(function, nparams, df, odf, kwargs)
    293                 results.append(result)
    294 

~\Anaconda3\lib\site-packages\pyranges\multithreaded.py in call_f(f, nparams, df, odf, kwargs)
     20 
     21     if nparams == 3:
---> 22         return f.remote(df, odf, kwargs)
     23     else:
     24         return f.remote(df, odf)

~\Anaconda3\lib\site-packages\pyranges\methods\nearest.py in _nearest(scdf, ocdf, kwargs)
    120         else:
    121             previous_r_idx, previous_dist = _previous_nonoverlapping(
--> 122                 df_to_find_nearest_in.Start, ocdf.End)
    123 
    124             next_r_idx, next_dist = _next_nonoverlapping(

~\Anaconda3\lib\site-packages\pyranges\methods\nearest.py in _previous_nonoverlapping(left_starts, right_ends)
     73     right_ends = right_ends.sort_values()
     74     r_idx, dist = nearest_previous_nonoverlapping(
---> 75         left_starts.values, right_ends.values - 1, right_ends.index.values)
     76 
     77     r_idx = pd.Series(r_idx, index=left_starts.index).sort_index().values

~\Anaconda3\lib\site-packages\sorted_nearest\src\sorted_nearest.pyx in sorted_nearest.src.sorted_nearest.nearest_previous_nonoverlapping()

ValueError: Buffer dtype mismatch, expected 'const long' but got 'long long'
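The root cause is that C long is 32-bit on Windows, so a Cython buffer declared as long rejects NumPy int64 ("long long") arrays. A user-side workaround worth trying is casting the coordinate arrays down to int32 before the call (a sketch; whether this suffices depends on the pyranges version):

```python
import numpy as np

# int64 is "long long" on Windows, where C long is only 32 bits,
# hence the buffer dtype mismatch in the Cython extension
coords = np.array([100, 200, 300], dtype=np.int64)

# Cast down to the 32-bit type the compiled code expects there
coords32 = coords.astype(np.int32)
```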

feature request: subset by index or bool array

Hey!

Would be lovely to have a full-fledged granges clone in python!

How do you feel about implementing subsetting of ranges objects by indexing into them, and/or with a bool array? Right now the workaround I found is to go via pandas, i.e.

gtf_df = gtf.as_df()
gtf_df_genes = gtf_df[gtf_df.Feature=='gene']
gtf_genes = pyranges.PyRanges(gtf_df_genes)

Would be nice to do this directly on the PyRanges object, no? :-)

Thx,
J.
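The requested behaviour is a thin wrapper around exactly that round-trip; here is a toy sketch of a __getitem__ accepting a boolean mask (SimpleRanges is a stand-in, not pyranges code):

```python
import pandas as pd

class SimpleRanges:
    """Toy stand-in for a PyRanges wrapping a DataFrame."""
    def __init__(self, df):
        self.df = df.reset_index(drop=True)

    def __getitem__(self, mask):
        # Boolean-mask the underlying frame and rebuild
        return SimpleRanges(self.df[mask])

gtf = SimpleRanges(pd.DataFrame({"Feature": ["gene", "exon", "gene"]}))
genes = gtf[gtf.df.Feature == "gene"]
```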

Autodetecting columns?

I'd love an API to try to make a PyRanges automatically from a dataframe, without the user having to rename any columns.

gr = pr.from_dataframe(df)

But then we would need a method to guess which columns contain chromosomes, starts, ends and optionally strand.

Starts and Ends: the first two integer columns where all End >= Start.
Strand: any column containing only "+", "-" or ".".
Chromosome: harder....

Also, would be great with a method like

df = pr.to_dataframe(gr)

which remembers the original chromosome, start, end and strand names.
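The Start/End/Strand heuristics above can be sketched directly (a hypothetical helper, not pyranges API; Chromosome detection is left open, as noted):

```python
import pandas as pd

def detect_range_columns(df):
    """Guess Start/End as the first two integer columns with all
    End >= Start, and Strand as a column holding only '+', '-' or '.'."""
    int_cols = [c for c in df.columns if pd.api.types.is_integer_dtype(df[c])]
    start = end = None
    for i, c1 in enumerate(int_cols):
        for c2 in int_cols[i + 1:]:
            if (df[c2] >= df[c1]).all():
                start, end = c1, c2
                break
        if start is not None:
            break
    strand = next((c for c in df.columns
                   if df[c].isin(["+", "-", "."]).all()), None)
    return start, end, strand

df = pd.DataFrame({"chrom": ["1", "2"], "beg": [5, 0],
                   "fin": [10, 7], "str": ["+", "-"]})
```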

read_bedlike

The current read_bed is very strict. I'd like a method which reads any table whose first three columns are Chromosome, Start and End, and where the optional strand can be anywhere else. This would lessen the need for

gr = pr.PyRanges(pd.read_table("bla"))
# pr.read_bedlike("bla") much more convenient
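A sketch of what such a reader could look like (hypothetical helper, not pyranges API): assign the three required names positionally and keep any remaining columns with generic names.

```python
import io

import pandas as pd

def read_bedlike(path_or_buf):
    """Read any tab-separated table; name the first three columns
    Chromosome/Start/End positionally, keep the rest generically."""
    df = pd.read_table(path_or_buf, header=None)
    extra = [f"Col{i}" for i in range(3, df.shape[1])]
    df.columns = ["Chromosome", "Start", "End"] + extra
    return df

df = read_bedlike(io.StringIO("chr1\t0\t10\t+\nchr2\t5\t15\t-\n"))
```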

What convention do PyRanges objects follow: 0-based or 1-based?

In the 0-based convention (used by the BED format) the range that spans the first 10 bases of a chromosome is represented by setting start-end to 0-10 (the end position is excluded). In the 1-based convention (used by the GTF and GFF3 formats and in Bioconductor) this range is represented by setting start-end to 1-10.

I couldn't find anything in PyRanges documentation about which convention is used.

Note that the following result:

>>> import pyranges as pr
>>> x = pr.PyRanges(chromosomes='chr1', starts=[10], ends=[20])
>>> y = pr.PyRanges(chromosomes='chr1', starts=[20], ends=[30])
>>> x.intersect(y)
Empty PyRanges

suggests that the 0-based convention is used. With the 1-based convention, this intersection wouldn't be empty (it would contain exactly one base).

But OTOH the following result:

>>> print(x.to_gff3())
chr1	.	.	10	20	.	.	.	

suggests the opposite!

Also note that having a PyRanges method that returns the number of bases in each range (width in Bioconductor terminology) would be very useful. Right now, given that it's not clear which convention is used, one has no way to know whether the width should be computed with x.End - x.Start (0-based convention) or with x.End - x.Start + 1 (1-based convention).

Thanks!
H.

PS: FWIW I started a small Bioconductor package to facilitate access to PyRanges objects from Bioconductor: https://github.com/hpages/BiocPyRanges
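If the 0-based, end-exclusive (BED-style) convention holds, the width is simply End - Start, which vectorizes directly over the columns; illustrated here on a plain DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Start": [10, 20], "End": [20, 35]})

# 0-based, end-exclusive: number of bases covered is End - Start
width = df.End - df.Start
```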

Chromosome bounds in slack

Hello,

A useful feature would be to be able to ensure that the slack function does not extend genomic ranges beyond the ends of chromosomes. This is implemented in bedtools: https://bedtools.readthedocs.io/en/latest/content/tools/slop.html

It seems there is some logic to prevent slack from creating negative positions (but maybe not for stranded slack), but there is no check for the other end of the chromosome.

I'm thinking this could be implemented as an optional argument to slack. If the optional 'bounds' object is passed, then the output of the slack function is guaranteed to be within the bounds.
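The proposed semantics can be sketched in a few lines of pandas (a hypothetical helper, not pyranges API): extend both ends, then clip to [0, chromosome length] using the passed bounds.

```python
import pandas as pd

def slack_with_bounds(df, slack, chromsizes):
    """Extend ranges by `slack` on both sides, then clip each range to
    [0, chromosome length]."""
    out = df.copy()
    out["Start"] = (out.Start - slack).clip(lower=0)
    out["End"] = [min(end + slack, chromsizes[chrom])
                  for chrom, end in zip(out.Chromosome, out.End)]
    return out

df = pd.DataFrame({"Chromosome": ["chr1"], "Start": [5], "End": [95]})
clipped = slack_with_bounds(df, 10, {"chr1": 100})
# -> Start 0 (not -5), End 100 (not 105)
```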

genomic features tss not working with GTF source

I ran across this bug trying to use the features.tss() method. After doing some research the problem seems to be that this code was written expecting GFF column names which are different if the source is GTF.

In genomicfeatures.py the tss() definition includes this line:

df = df.drop(["ExonNumber", "ExonID"], 1)

But if my features are read from a gtf file (either by myself or using pyranges_db from gencode) the corresponding columns are actually called ['exon_number', 'exon_id']. This is in contrast to what the columns are called when parsed from GFF files.

It looks like this is also the case with tes()

Additionally, in looking at the code, the drop_duplicates argument to tss and tes has no effect when set to False. It does disable the drop-duplicates statement at that level, but two calls deeper, _tss() and _tes() take their own drop_duplicates argument which defaults to True, and the outer argument is not passed down through tss_or_tes(), so duplicates are still dropped.
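A source-agnostic fix for the column-name half of this could drop whichever exon columns are present, covering both GFF ("ExonNumber"/"ExonID") and GTF ("exon_number"/"exon_id") naming (a sketch, not the actual pyranges patch):

```python
import pandas as pd

def drop_exon_columns(df):
    """Drop exon columns under either GFF- or GTF-style names,
    ignoring whichever are absent."""
    candidates = ["ExonNumber", "ExonID", "exon_number", "exon_id"]
    present = [c for c in candidates if c in df.columns]
    return df.drop(columns=present)

df = pd.DataFrame({"Start": [0], "exon_number": [1], "exon_id": ["e1"]})
out = drop_exon_columns(df)
```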

Add slack as an option to join

The slack parameter seems to work for merge but not join:

gr1 = pr.PyRanges(chromosomes="chr1", starts=[0], ends=[10], strands = "+")
gr2 = pr.PyRanges(chromosomes="chr1", starts=[15], ends=[20], strands = "+")

gr1.merge(gr2, slack = 10)
+--------------+-----------+-----------+--------------+
| Chromosome   |     Start |       End | Strand       |
| (category)   |   (int32) |   (int32) | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         |         0 |        10 | +            |
+--------------+-----------+-----------+--------------+

gr1.join(gr2, slack = 10)
Empty PyRanges
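For reference, slack-aware joining just widens the overlap test by the slack amount; a toy predicate makes the expected semantics concrete:

```python
def within_slack(a, b, slack=0):
    """True if half-open intervals a and b overlap once either side is
    widened by `slack` (toy sketch of join(..., slack=...) semantics)."""
    (start1, end1), (start2, end2) = a, b
    return start1 < end2 + slack and start2 < end1 + slack

# gr1's [0, 10) and gr2's [15, 20) above are 5 bp apart, so they
# should join with slack=10 but not with the default slack=0
```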
