4dn-dcic / pairix

1D/2D indexing and querying on bgzipped text file with a pair of genomic coordinates

License: MIT License

Makefile 0.52% C++ 2.79% C 65.33% Python 17.20% Shell 8.46% Perl 5.70%

hi-c random-access bgzip bioinformatics c python pypairix pairs

pairix's Issues

assume symmetrical query

First - love the tool, worked right out of the box.
2D queries on BEDPE will be quite powerful, looking forward to formal tabix support!

One suggestion: if a query gives only a single coordinate, could the tool assume a symmetrical query?
For example, it could internally change the query
pairix -h test.bedpe.gz '1:1387294-1468095'
to
pairix -h test.bedpe.gz '1:1387294-1468095|1:1387294-1468095'

The command "pairix -h test.bedpe.gz '1:1387294-1468095'" currently returns nothing. Given that input, I suspect most users would intend the tool to run the query "pairix -h test.bedpe.gz '1:1387294-1468095|1:1387294-1468095'".

Thoughts?
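
The proposed behaviour can be sketched as a small wrapper applied to the query string before it is passed on. This is only an illustration of the suggestion, not pairix's implementation; the function name `symmetric_query` is hypothetical, and the `|` separator is taken from the example above.

```python
def symmetric_query(region: str, sep: str = "|") -> str:
    """Expand a 1D region into a symmetric 2D query.

    Hypothetical helper: a query that already contains the separator
    is treated as 2D and returned unchanged.
    """
    region = region.strip()
    if sep in region:
        return region
    return f"{region}{sep}{region}"


if __name__ == "__main__":
    # 1D input is duplicated on both sides of the separator.
    print(symmetric_query("1:1387294-1468095"))
```

A caller could pass the result to pairix on the command line in place of the original 1D region.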

Two questions about the pairs format

The example in the spec gives two strands, one for each of pos1 and pos2. However, I speculate that only one relative strand (=strand1*strand2) is needed. For example, do these two lines make a difference in downstream processing?
```
EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 + -
EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 - +
```

Another example in the spec shows that only one of the following two lines should be retained:

EAS139:136:FC706VJ:2:1286:25:275154 chr1 10000 chr2 2000 + +
EAS139:136:FC706VJ:2:1286:25:275154 chr2 2000 chr1 10000 + +

which makes sense. However, is it legitimate to encode a triplet with an identical first column, like

EAS139:136:FC706VJ:2:1286:25:275154 chr1 10000 chr2 2000 + +
EAS139:136:FC706VJ:2:1286:25:275154 chr2 2000 chr3 10000 + +

Thanks!
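
As a sketch of the "retain only one of the two lines" rule discussed above, the snippet below flips a record so that the first mate always sorts before the second. It assumes chromosome order is given by a lookup table (here a hypothetical `chrom_order` dict) and that the record holds the seven columns shown in the examples. Note that the strands are swapped together with the coordinates, so each strand stays attached to its own mate.

```python
def to_upper_triangle(pair, chrom_order):
    """Return the pair with mates ordered by (chromosome rank, position).

    `pair` is (readID, chrom1, pos1, chrom2, pos2, strand1, strand2).
    `chrom_order` maps chromosome name to rank; it is an assumption of
    this sketch, not something read from a real file.
    """
    read_id, chrom1, pos1, chrom2, pos2, strand1, strand2 = pair
    key1 = (chrom_order[chrom1], pos1)
    key2 = (chrom_order[chrom2], pos2)
    if key1 > key2:
        # Swap both mates, including their strands.
        return (read_id, chrom2, pos2, chrom1, pos1, strand2, strand1)
    return pair


if __name__ == "__main__":
    order = {"chr1": 0, "chr2": 1}
    flipped = ("EAS139:136:FC706VJ:2:1286:25:275154", "chr2", 2000, "chr1", 10000, "+", "+")
    print(to_upper_triangle(flipped, order))
```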

Integer overflow for -n option: reporting negative numbers!

At least it is observed on the largest Bonev dataset.

magus@wiz:~/work/distillate/rao2014Test$ pairix -n HiC_NPC_4.nodups.pairs.gz
-2003369355
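
A negative count like this is what a signed 32-bit counter would show after wrapping around. The sketch below assumes that the counter is a 32-bit signed integer and that it wrapped exactly once; neither assumption has been checked against the pairix source. Under those assumptions it recovers a candidate for the real line count.

```python
def as_int32(n: int) -> int:
    """Interpret the low 32 bits of n as a signed two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n


reported = -2003369355

# Assuming exactly one wraparound of a 32-bit counter.
candidate = reported + (1 << 32)

if __name__ == "__main__":
    print("candidate true count:", candidate)
    print("wraps back to:", as_int32(candidate))
```

If the file holds more than 2^32 lines, the true count would be larger by a further multiple of 2^32, so the candidate is only a lower bound.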

Cannot install on Mac M2

Hi,

I am using Apple silicon (M2). When trying to compile I get the following error:

cd src; make; cd ..
gcc -g -w -O2 -fPIC -o pairix main.o -L. -lpairix -lm -lz
gcc -c -g -w -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE -DBGZF_CACHE pairs_merger.c -o pairs_merger.o
pairs_merger.c:41:15: error: call to undeclared function 'pairs_merger'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
int res = pairs_merger(fn_list, num_fn, NULL);
^
1 error generated.
make[2]: *** [pairs_merger.o] Error 1
make[1]: *** [all-recur] Error 1

Does this look familiar at all?

Strand fields in pairs file format

Perhaps this is obvious, but in a pairs file format, can strand fields be specified as "." or not specified at all?

Thank you in advance.

Error indexing a custom contact text file

Hi, thanks a lot for making and maintaining all these very useful tools.

I have a custom-made contact pairs file in the following format: chr1 10000 20000 chr2 30000 50000 3.5

that I'm trying to index using pairix to load in cooler. I tried

pairix -s 1 -d 4 -b 2 -e 3 -u 5 -v 6 -T 3R.txt.gz

but I get

[get_intv] the following line cannot be parsed and skipped: chr3R 0 20000 chr3R 40000 60000 252

[ti_index_core] the indexes overlap or are out of bounds

It's the first line of my file, and I don't really understand the error. My file only contains contacts for chr3R, as I'm only interested in intra-chromosomal contacts.

3R.txt.gz

Could you please let me know if this error is reproducible or if I'm doing something wrong?
Thanks in advance

Converting pairs to merged_nodups

Hi pairix developers! I'd like to ask whether it's possible to convert a pairs file to a merged_nodups file using pairix. There's a script in pairix that does the reverse.

Thanks a lot!

Fetching small regions can be very slow

Currently, even with the smallest .pairs file (~100MB), it takes 0.3 seconds to fetch a small region (two lines) from a file with 23 chromosomes, and 0.6 seconds to fetch it from a file with the full hg38 set of 455 chromosomes. It takes up to a second for larger files (e.g. Bonev et al.), even when the result is just a few lines or is empty.

For comparison, if the file only has one chromosome, a fetch takes 0.007 seconds, which is 100 times faster!

Basically, there is a flat price of fetching a region of any size, and sometimes it is pretty steep.

This can interfere with analyses that use pairix to iterate over pairs of chromosomes, especially if someone forgot to trim the extra contigs from hg38 and started working with the whole file of 455 chromosomes. Fetching all chromosome pairs would take 455 * 455 * 0.6 seconds, about 34 hours, of overhead. In that time one could, for example, sort, ingest, and process the largest existing pairs file several times.
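
The overhead estimate can be reproduced with a few lines of arithmetic. The numbers (455 chromosomes, 0.6 seconds per fetch) come from the measurements above; the second figure, which counts each unordered chromosome pair only once, is an added assumption about how a caller might iterate.

```python
n_chroms = 455            # full hg38 set, as stated above
seconds_per_fetch = 0.6   # flat cost per query, as measured above

# Every ordered (chrom1, chrom2) combination.
all_pairs = n_chroms * n_chroms
hours_all = all_pairs * seconds_per_fetch / 3600

# Assumption: only one triangle is queried, diagonal included.
unordered_pairs = n_chroms * (n_chroms + 1) // 2
hours_unordered = unordered_pairs * seconds_per_fetch / 3600

if __name__ == "__main__":
    print(f"{all_pairs} fetches, {hours_all:.1f} hours")
    print(f"{unordered_pairs} fetches, {hours_unordered:.1f} hours")
```

Even when only one triangle is queried, the flat per-fetch cost still adds up to many hours.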

One possible solution would be to provide an interactive library that pre-loads the pairix index into memory, parses it as needed, and then does a very quick fetch. Working with this library interactively would speed up development of several algorithms, such as "cooler cload pairix".

Is there a binary spec for the pairix format?

Hi there, is there a binary spec for pairix, or if not, just some code I could reference to understand it? Thanks :)

pypairix to capture pairs that have specific readIDs

Hello,

I would like to know whether pypairix can be used to capture readID related pairs quickly.
Just like:

import pypairix
readID_list = ['read1','read2','read3'] # string or regular expression
tb = pypairix.open(pair_file)
tb.query2D(readID = readID_list)

It would help a lot if pypairix could perform like this!
(Or would pandas work just the same, without much help from the index?)
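
Since the pairix index is built on genomic coordinates, a lookup by readID would presumably need a scan of the file. Below is a minimal sketch of such a scan in plain Python. It does not use pypairix at all, and the function names are hypothetical. It assumes a tab-delimited pairs file with the readID in the first column and header lines starting with `#`.

```python
import gzip


def filter_by_read_id(lines, read_ids):
    """Yield data lines whose first tab-separated field is in read_ids."""
    wanted = set(read_ids)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        if line.split("\t", 1)[0] in wanted:
            yield line.rstrip("\n")


def filter_pairs_file(path, read_ids):
    """Stream a bgzipped/gzipped pairs file and keep matching reads."""
    with gzip.open(path, "rt") as handle:
        yield from filter_by_read_id(handle, read_ids)


if __name__ == "__main__":
    demo = [
        "## pairs format v1.0\n",
        "read1\tchr1\t10\tchr2\t20\t+\t-\n",
        "read9\tchr1\t30\tchr1\t40\t+\t+\n",
    ]
    for hit in filter_by_read_id(demo, ["read1", "read2", "read3"]):
        print(hit)
```

The scan reads every line once, so its cost grows with file size rather than with the number of requested reads.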

executing get_blocknames() twice crashes Python

Hi,
I've found that executing get_blocknames() twice crashes the whole python session:

f = pypairix.open(pairs_path)
f.get_blocknames()
f.get_blocknames()

Is there a tool to convert bam file to juicer's merged_nodups.txt file?

Just a question,
Is there a tool to convert bam file to juicer's merged_nodups.txt file?

Best,
Kun

Which coordinates does pairix return?

For example, let's say the .pairs output is:

READID chrX 16145203 chrX 16213822 + - UU

Is 16145203 the left-most or the right-most coordinate on this read? Likewise for 16213822? Does this change based on the strand?

max chrom size too small

Hi there,

I am using pairix in conjunction with cooler to generate cool files for my Hi-C data. Pairix generation seemed to work without an issue using cooler csort pairix. However, I am now generating the base resolution with cooler cload pairix and getting maximum chromosome size warnings, since some of the chromosomes are larger than 2^30.
Does this have an effect on my results and can this be resolved in your implementation to also support larger chromosomes?

These are the sizes:

chr1p   1416415443
chr1q   1471228731
chr2p   1381227519
chr2q   1413870220
chr3p   867469519
chr3q   1525846468
chr4p   1204678091
chr4q   1215118724
chr5p   1279525745
chr5q   1285306171
chr6p   1462196291
chr6q   1625855492
chr7p   907274393
chr7q   1083033678
chr8p   745249056
chr8q   885924281
chr9p   454801983
chr9q   1001906603
chr10p  1081352037
chr10q  525520881
chr11p  305453894
chr11q  1078813816
chr12p  288882840
chr12q  857146890
chr13p  241582323
chr13q  482900735
chr14p  184541253
chr14q  436048769

Thanks
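
To see which of the listed chromosomes are affected, the sizes can be compared against 2^30. The limit is taken from the report above and has not been checked against the pairix or cooler source; the helper name `chroms_over_limit` is hypothetical.

```python
MAX_COORD = 2 ** 30  # limit quoted in the report; an assumption here

# Sizes copied from the list above.
CHROM_SIZES = """\
chr1p 1416415443
chr1q 1471228731
chr2p 1381227519
chr2q 1413870220
chr3p 867469519
chr3q 1525846468
chr4p 1204678091
chr4q 1215118724
chr5p 1279525745
chr5q 1285306171
chr6p 1462196291
chr6q 1625855492
chr7p 907274393
chr7q 1083033678
chr8p 745249056
chr8q 885924281
chr9p 454801983
chr9q 1001906603
chr10p 1081352037
chr10q 525520881
chr11p 305453894
chr11q 1078813816
chr12p 288882840
chr12q 857146890
chr13p 241582323
chr13q 482900735
chr14p 184541253
chr14q 436048769
"""


def chroms_over_limit(text, limit=MAX_COORD):
    """Return the names of chromosomes whose size exceeds the limit."""
    over = []
    for line in text.splitlines():
        name, size = line.split()
        if int(size) > limit:
            over.append(name)
    return over


if __name__ == "__main__":
    print(chroms_over_limit(CHROM_SIZES))
```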

Problem with fragment_4dnpairs.pl

Support for python3.8

Hello,
I was wondering if you were planning to support python 3.8 (bioconda/bioconda-recipes#24502).
Thanks

pypairix check triangle

[ti_index_core] the indexes overlap or are out of bounds

Hello,
I am trying to run the pairix tool, for analysis with cooler, to create a contact map, but I am running into this error:

(base) ubuntu@ip-172-31-18-119:/Data1$ pairix corrected2_porec_test.concatemers.pairs.txt.gz
[get_intv] the following line cannot be parsed and skipped: CONCAT0 + - UU 1 chr2 5443 chr1 3003 32 R1
[ti_index_core] the indexes overlap or are out of bounds

zcat corrected2_porec_test.concatemers.pairs.txt.gz | head -n 20

## pairs format v1.0.0

#shape: whole matrix
#genome_assembly: unknown
#chromsize: chr1 3577
#chromsize: chr2 7551
#samheader: @SQ SN:chr1 LN:3577
#samheader: @SQ SN:chr2 LN:7551
#samheader: CL:minimap2 -ay -t 2 @PG PN:minimap2 ID:minimap2 VN:2.24-r1122 map-ont -x
#samheader: PP:minimap2 CL:/home/epi2melabs/conda/bin/pore-c-py annotate - @PG PN:pore-c-py ID:pore-c-py-2 VN:2.0.1 --monomers porec_test.concatemers
#samheader: parse2 --output-stats porec_test.concatemers.stats.txt -c @PG ID:pairtools_parse2 PN:pairtools_parse2 CL:/home/epi2melabs/conda/bin/pairtools --single-end fasta.fai
#samheader: restrict -f fragments.bed -o @PG ID:pairtools_restrict PN:pairtools_restrict CL:/home/epi2melabs/conda/bin/pairtools extract_pairs.tmp porec_test.concatemers.pairs.gz
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type walk_pair_index walk_pair_type mapq1 mapq2 pos51 pos52 pos31 pos32 cigar1 cigar2 read_len1 read_len2 matched_bp1 matched_bp2 algn_ref_span1 algn_ref_span2 algn_read_span1 algn_read_span2 dist_to_51 dist_to_52 dist_to_31 dist_to_32 mismatches1 mismatches2 rfrag1 rfrag_start1 rfrag_end1 rfrag2 rfrag_start2 rfrag_end2
CONCAT0 + - UU 1 chr2 5443 chr1 3003 32 R1
CONCAT0 + - UU 2 chr1 1104 chr1 1103 60 R1
CONCAT0 + - UU 3 chr1 602 chr2 6455 60 R1
CONCAT0 + + UN 4 chr2 5530 ! 0 60 R1
CONCAT0 - - NU 5 ! 0 chr2 6538 0 R1
CONCAT0 + - UU 6 chr2 6456 chr2 5442 51 R1
CONCAT1 + - UU 1 chr1 3004 chr2 6538 60 R1
CONCAT2 + - UU 1 chr1 1104 chr1 601 60 R1
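
One thing that can be checked from the output above is whether the data lines follow the order declared in the #columns header. The sketch below is a diagnostic aid only, not pairix's parser, and the function name is hypothetical. It maps the first seven fields onto the declared names and reports positions that are not integers. It splits on any whitespace for convenience, whereas a real pairs file is tab-delimited.

```python
# First seven column names from the #columns header shown above.
PAIRS_COLUMNS = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]


def check_pairs_line(line, columns=PAIRS_COLUMNS):
    """Return a list of problems found when reading the line as `columns`."""
    fields = line.split()
    if len(fields) < len(columns):
        return [f"expected at least {len(columns)} fields, got {len(fields)}"]
    record = dict(zip(columns, fields))
    problems = []
    for key in ("pos1", "pos2"):
        if not record[key].isdigit():
            problems.append(f"{key} is not an integer: {record[key]!r}")
    return problems


if __name__ == "__main__":
    # First data line from the file above.
    print(check_pairs_line("CONCAT0 + - UU 1 chr2 5443 chr1 3003 32 R1"))
```

For the first data line, the field in the pos1 position is a strand symbol, which suggests the columns were written in a different order than the header declares.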
