4dn-dcic / pairix
1D/2D indexing and querying on bgzipped text file with a pair of genomic coordinates
License: MIT License
First - love the tool, worked right out of the box.
2D queries on BEDPE will be quite powerful, looking forward to formal tabix support!
One suggestion: if one queries with only a single coordinate, could the tool assume a symmetrical query?
e.g. internally change the query from
pairix -h test.bedpe.gz '1:1387294-1468095'
to
pairix -h test.bedpe.gz '1:1387294-1468095|1:1387294-1468095'
The command "pairix -h test.bedpe.gz '1:1387294-1468095'" currently returns nothing. I suspect that most users, given the original input, would intend the tool to run the query "pairix -h test.bedpe.gz '1:1387294-1468095|1:1387294-1468095'".
Thoughts?
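The suggested behaviour can be sketched as a small wrapper that mirrors a 1D query onto both axes before it is handed to pairix. The helper name symmetrize_query is hypothetical and not part of pairix; the only thing taken from the commands above is the '|' separator between the two regions.

```python
def symmetrize_query(query: str) -> str:
    """Return a 2D query string; a 1D query is mirrored onto both axes.

    Hypothetical helper for illustration only, not a pairix function.
    """
    # A 2D query already contains the '|' separator, so leave it unchanged.
    if "|" in query:
        return query
    # A 1D query is duplicated so that both mates must fall in the region.
    return f"{query}|{query}"


print(symmetrize_query("1:1387294-1468095"))
```

The result can then be passed as the query argument in place of the original string.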
The example in the spec gives two strands, one each for pos1 and pos2. However, I speculate that only one relative strand (= strand1 * strand2) is needed. For example, do these two lines make a difference in downstream processing?
EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 + -
EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 - +
Another example in the spec shows that only one of the following two lines should be retained:
EAS139:136:FC706VJ:2:1286:25:275154 chr1 10000 chr2 2000 + +
EAS139:136:FC706VJ:2:1286:25:275154 chr2 2000 chr1 10000 + +
which makes sense. However, is it legitimate to encode a triplet as two lines with an identical first column, like this:
EAS139:136:FC706VJ:2:1286:25:275154 chr1 10000 chr2 2000 + +
EAS139:136:FC706VJ:2:1286:25:275154 chr2 2000 chr3 10000 + +
Thanks!
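To make the two points in this question concrete, here is a minimal sketch. It assumes a chromosome order of chr1, chr2, chr3 and represents a pair as a plain tuple; both function names are illustrative and are not taken from pairix or the spec. The first function computes the relative strand proposed above, and shows that "+ -" and "- +" collapse to the same value, so the product alone cannot tell them apart. The second picks one of two mirrored lines by ordering the mates.

```python
def relative_strand(strand1: str, strand2: str) -> int:
    """Product of the two strands, with '+' as 1 and '-' as -1."""
    sign = {"+": 1, "-": -1}
    return sign[strand1] * sign[strand2]


def flip_to_upper_triangle(pair, chrom_order):
    """Order the two mates so that mate 1 comes first in (chromosome, position)."""
    read_id, chrom1, pos1, chrom2, pos2, strand1, strand2 = pair
    key1 = (chrom_order.index(chrom1), pos1)
    key2 = (chrom_order.index(chrom2), pos2)
    if key1 > key2:
        # Swap every mate-specific field, including the strands.
        return (read_id, chrom2, pos2, chrom1, pos1, strand2, strand1)
    return pair


order = ["chr1", "chr2", "chr3"]  # assumed chromosome order
a = ("EAS139:136:FC706VJ:2:1286:25:275154", "chr1", 10000, "chr2", 2000, "+", "+")
b = ("EAS139:136:FC706VJ:2:1286:25:275154", "chr2", 2000, "chr1", 10000, "+", "+")

# Both mirrored lines reduce to the same record.
print(flip_to_upper_triangle(a, order) == flip_to_upper_triangle(b, order))
# Both strand combinations give the same product.
print(relative_strand("+", "-"), relative_strand("-", "+"))
```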
At least it is observed on the largest Bonev dataset.
magus@wiz:~/work/distillate/rao2014Test$ pairix -n HiC_NPC_4.nodups.pairs.gz
-2003369355
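One possible explanation, offered as an assumption and not confirmed in this thread, is that the count is held in a 32-bit signed integer and wraps around for very large files. The sketch below shows which true count would print as the value above under that assumption; the function names are illustrative.

```python
def as_int32(n: int) -> int:
    """Interpret the low 32 bits of n as a signed two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n


def smallest_count_for(printed: int) -> int:
    """Smallest non-negative count that would be printed as `printed`
    if the counter wrapped at 32 bits (an assumption)."""
    return printed % (1 << 32)


candidate = smallest_count_for(-2003369355)
print(candidate)            # candidate true line count under this assumption
print(as_int32(candidate))  # wraps back to the negative value shown above
```

Comparing the candidate with the output of a plain line count on the decompressed file would confirm or rule out this explanation.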
Hi,
I am using Apple silicon (M2). When trying to compile I get the following error:
cd src; make; cd ..
gcc -g -w -O2 -fPIC -o pairix main.o -L. -lpairix -lm -lz
gcc -c -g -w -O2 -fPIC -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE -DBGZF_CACHE pairs_merger.c -o pairs_merger.o
pairs_merger.c:41:15: error: call to undeclared function 'pairs_merger'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
int res = pairs_merger(fn_list, num_fn, NULL);
^
1 error generated.
make[2]: *** [pairs_merger.o] Error 1
make[1]: *** [all-recur] Error 1
Does this look familiar at all?
Perhaps this is obvious, but in a pairs file format, can strand fields be specified as "." or not specified at all?
Thank you in advance.
Hi, thanks a lot for making and maintaining all these very useful tools.
I have a custom-made contact pairs file in the following format: chr1 10000 20000 chr2 30000 50000 3.5
that I'm trying to index using pairix so that I can load it in cooler. I tried
pairix -s 1 -d 4 -b 2 -e 3 -u 5 -v 6 -T 3R.txt.gz
but I get
[get_intv] the following line cannot be parsed and skipped: chr3R 0 20000 chr3R 40000 60000 252
[ti_index_core] the indexes overlap or are out of bounds
It's the 1st line of my file and I don't really understand the error, my file only contains contacts for chr3R as I'm only interested in intra-chromosomal contacts.
Could you please let me know if this error is reproducible or if I'm doing something wrong?
Thanks in advance
Hi pairix developers! I'd like to ask whether it's possible to convert a pairs file to a merged_nodups file using pairix. There's a script in pairix that does the reverse.
Thanks a lot!
Currently, even with the smallest .pairs file (~100MB), it takes 0.3 seconds to fetch a small region (two lines) from a file with 23 chromosomes, and 0.6 seconds to fetch it from a file with the full hg38 set of 455 chromosomes. It takes up to a second for larger files (e.g. Bonev et al.), even when the result is just a few lines or is empty.
For comparison, if the file only has one chromosome, a fetch takes 0.007 seconds, which is 100 times faster!
Basically, there is a flat price of fetching a region of any size, and sometimes it is pretty steep.
This can interfere with analyses that use pairix to iterate over pairs of chromosomes, especially if someone forgot to trim down all the contigs from hg38 and started working with the whole file with 455 chromosomes. Trying to fetch all chromosome pairs would take 455 * 455 * 0.6 s ≈ 34 hours of overhead. In that time, for example, one could sort, ingest, and process the largest existing pairs file several times.
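The overhead estimate can be reproduced with a few lines of arithmetic. The function name and the upper_triangle_only switch are illustrative additions; the 455 chromosomes and 0.6 seconds per fetch are the figures quoted above.

```python
def fetch_overhead_hours(n_chroms: int, seconds_per_fetch: float,
                         upper_triangle_only: bool = False) -> float:
    """Fixed per-query cost summed over all chromosome pairs, in hours."""
    if upper_triangle_only:
        # Unordered pairs, including each chromosome paired with itself.
        n_queries = n_chroms * (n_chroms + 1) // 2
    else:
        n_queries = n_chroms * n_chroms
    return n_queries * seconds_per_fetch / 3600


print(round(fetch_overhead_hours(455, 0.6), 1))
print(round(fetch_overhead_hours(455, 0.6, upper_triangle_only=True), 1))
```

Querying only the upper triangle roughly halves the overhead, but the fixed cost per fetch still dominates.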
One possible solution would be to provide an interactive library that pre-loads the pairix index into memory, parses it as needed, and then does a very quick fetch. Working with this library interactively would speed up development of several algorithms, such as "cooler cload pairix".
Hi there, is there a binary spec for pairix, or if not, some code I could reference to understand it? Thanks :)
Hello,
I would like to know whether pypairix can be used to capture readID related pairs quickly.
Just like:
import pypairix
readID_list = ['read1', 'read2', 'read3'] # string or regular expression
tb = pypairix.open(pair_file)
tb.query2D(readID = readID_list)
It would help a lot if pypairix could perform like this!
(or pandas would just work the same, without much help from the index?)
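Since the index is built on genomic coordinates, a lookup by readID would presumably have to scan the file. A minimal sketch of such a scan with only the standard library is shown below. It assumes a tab-separated, gzip-readable pairs file with the readID in the first column; the function name and the demo file are made up for illustration, and no pypairix call is used.

```python
import gzip
import os
import tempfile


def pairs_with_read_ids(path, read_ids):
    """Return the rows (as lists of fields) whose first column is in read_ids."""
    wanted = set(read_ids)  # set membership keeps the scan linear
    hits = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            if fields[0] in wanted:
                hits.append(fields)
    return hits


# Tiny demo file written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pairs.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2\n")
    out.write("read1\tchr1\t10000\tchr2\t2000\t+\t+\n")
    out.write("read9\tchr1\t30000\tchr3\t40000\t+\t-\n")

hits = pairs_with_read_ids(demo_path, ["read1", "read2", "read3"])
print(hits)
```

A single pass like this reads every line once, so its cost grows with file size, not with the number of read IDs requested.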
Hi,
I've found that executing get_blocknames() twice crashes the whole python session:
f = pypairix.open(pairs_path)
f.get_blocknames()
f.get_blocknames()
Hi
Just a question,
Is there a tool to convert bam file to juicer's merged_nodups.txt file?
Best,
Kun
For example, let's say the .pairs output is:
READID chrX 16145203 chrX 16213822 + - UU
Is 16145203 the left-most or the right-most coordinate on this read? Likewise for 16213822? Does this change based on the strand?
Hi there,
I am using pairix in conjunction with cooler to generate cool files for my Hi-C data. Pairix generation seemed to work without an issue using cooler csort pairix. However, I am now generating the base resolution with cooler cload pairix and getting maximum chromosome size warnings, since some of the chromosomes are larger than 2^30.
Does this have an effect on my results and can this be resolved in your implementation to also support larger chromosomes?
These are the sizes:
chr1p 1416415443
chr1q 1471228731
chr2p 1381227519
chr2q 1413870220
chr3p 867469519
chr3q 1525846468
chr4p 1204678091
chr4q 1215118724
chr5p 1279525745
chr5q 1285306171
chr6p 1462196291
chr6q 1625855492
chr7p 907274393
chr7q 1083033678
chr8p 745249056
chr8q 885924281
chr9p 454801983
chr9q 1001906603
chr10p 1081352037
chr10q 525520881
chr11p 305453894
chr11q 1078813816
chr12p 288882840
chr12q 857146890
chr13p 241582323
chr13q 482900735
chr14p 184541253
chr14q 436048769
Thanks
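To see which chromosomes cross the 2^30 threshold mentioned in the warning, the sizes can be checked directly. The sketch below uses a handful of the sizes listed above; it only identifies the affected chromosomes and says nothing about how the tools handle them.

```python
LIMIT = 2 ** 30  # 1,073,741,824, the threshold mentioned in the warning

# A subset of the chromosome sizes listed above.
sizes = {
    "chr1p": 1416415443,
    "chr3p": 867469519,
    "chr7q": 1083033678,
    "chr9q": 1001906603,
    "chr14q": 436048769,
}

# Keep the names of chromosomes longer than the threshold.
too_long = [name for name, size in sizes.items() if size > LIMIT]
print(too_long)
```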
Hello,
I was wondering if you were planning to support python 3.8 (bioconda/bioconda-recipes#24502).
Thanks
Hello,
I am trying to run pairix as part of an analysis with cooler to create a contact map, but I am running into this error:
(base) ubuntu@ip-172-31-18-119:/Data1$ pairix corrected2_porec_test.concatemers.pairs.txt.gz
[get_intv] the following line cannot be parsed and skipped: CONCAT0 + - UU 1 chr2 5443 chr1 3003 32 R1
[ti_index_core] the indexes overlap or are out of bounds
zcat corrected2_porec_test.concatemers.pairs.txt.gz | head -n 20
#shape: whole matrix
#genome_assembly: unknown
#chromsize: chr1 3577
#chromsize: chr2 7551
#samheader: @SQ SN:chr1 LN:3577
#samheader: @SQ SN:chr2 LN:7551
#samheader: CL:minimap2 -ay -t 2 @PG PN:minimap2 ID:minimap2 VN:2.24-r1122 map-ont -x
#samheader: PP:minimap2 CL:/home/epi2melabs/conda/bin/pore-c-py annotate - @PG PN:pore-c-py ID:pore-c-py-2 VN:2.0.1 --monomers porec_test.concatemers
#samheader: parse2 --output-stats porec_test.concatemers.stats.txt -c @PG ID:pairtools_parse2 PN:pairtools_parse2 CL:/home/epi2melabs/conda/bin/pairtools --single-end fasta.fai
#samheader: restrict -f fragments.bed -o @PG ID:pairtools_restrict PN:pairtools_restrict CL:/home/epi2melabs/conda/bin/pairtools extract_pairs.tmp porec_test.concatemers.pairs.gz
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type walk_pair_index walk_pair_type mapq1 mapq2 pos51 pos52 pos31 pos32 cigar1 cigar2 read_len1 read_len2 matched_bp1 matched_bp2 algn_ref_span1 algn_ref_span2 algn_read_span1 algn_read_span2 dist_to_51 dist_to_52 dist_to_31 dist_to_32 mismatches1 mismatches2 rfrag1 rfrag_start1 rfrag_end1 rfrag2 rfrag_start2 rfrag_end2
CONCAT0 + - UU 1 chr2 5443 chr1 3003 32 R1
CONCAT0 + - UU 2 chr1 1104 chr1 1103 60 R1
CONCAT0 + - UU 3 chr1 602 chr2 6455 60 R1
CONCAT0 + + UN 4 chr2 5530 ! 0 60 R1
CONCAT0 - - NU 5 ! 0 chr2 6538 0 R1
CONCAT0 + - UU 6 chr2 6456 chr2 5442 51 R1
CONCAT1 + - UU 1 chr1 3004 chr2 6538 60 R1
CONCAT2 + - UU 1 chr1 1104 chr1 601 60 R1
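In the listing above, the data rows do not follow the order declared in the #columns line: the header puts chrom1 and pos1 in the second and third fields, while the rows have strands there. This may be what the [get_intv] message refers to, although that is an inference from the listing and not something confirmed in this thread. A minimal sketch of a pre-check follows; the function name is made up, and it splits on any whitespace only because the listing here is shown that way.

```python
def rows_with_bad_positions(lines):
    """Return data rows whose pos1/pos2 fields, located via the #columns
    header, are missing or not plain integers."""
    columns = None
    bad = []
    for line in lines:
        if line.startswith("#columns:"):
            columns = line.split()[1:]  # drop the '#columns:' token
        elif not line.startswith("#") and columns:
            fields = line.split()
            for name in ("pos1", "pos2"):
                i = columns.index(name)
                if i >= len(fields) or not fields[i].isdigit():
                    bad.append(line)
                    break
    return bad


sample = [
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type",
    "CONCAT0 + - UU 1 chr2 5443 chr1 3003 32 R1",
    "READID chrX 16145203 chrX 16213822 + - UU",
]
print(rows_with_bad_positions(sample))
```

Only the first sample row is reported, because its third field is a strand and not a position.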