decodegenetics / chopbai Goto Github PK
View Code? Open in Web Editor NEWChops a bam index file into pieces
License: GNU General Public License v2.0
Chops a bam index file into pieces
License: GNU General Public License v2.0
Ideally chopBAI would include some bundled test cases which could be invoked by the commonly used idiom of make test
.
Right now there isn't even a provided sample dataset for human testing.
Using Mac OS X 10.10.5,
$ uname
Darwin
$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.0 (clang-700.1.76)
Target: x86_64-apple-darwin14.5.0
Thread model: posix
Attempting to build the latest code for chopBAI, specifically commit 6dfdaa5 as follows:
$ git clone https://github.com/seqan/seqan.git
$ git clone https://github.com/DecodeGenetics/chopBAI.git
$ cd chopBAI
$ emacs Makefile
...
$ git diff
diff --git a/Makefile b/Makefile
index 24ec4fb..04eb225 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,7 @@ CC=$(CXX)
# Set this to include SeqAn libraries, either system wide
# or download into current folder and set to .
-SEQAN_LIB=.
+SEQAN_LIB=../seqan/include/
CXXFLAGS+=-I$(SEQAN_LIB) -DSEQAN_HAS_ZLIB=1
LDLIBS=-lz -lpthread
Then compiling gives:
$ make
g++ -I../seqan/include/ -DSEQAN_HAS_ZLIB=1 -DDATE=\""on 2015-11-06 17:36:57"\" -W -Wall -Wno-long-long -pedantic -Wno-variadic-macros -O3 -DSEQAN_ENABLE_TESTING=0 -DSEQAN_ENABLE_DEBUG=0 -Qunused-arguments -c -o chopBAI.o chopBAI.cpp
In file included from chopBAI.cpp:7:
./bam_index_csi.h:173:55: warning: unused parameter 'bamFile' [-Wunused-parameter]
bool jumpToOrphans(FormattedFile<Bam, Input, TSpec> & bamFile,
^
./bam_index_csi.h:174:27: warning: unused parameter 'hasAlignments' [-Wunused-parameter]
bool & hasAlignments,
^
./bam_index_csi.h:175:42: warning: unused parameter 'index' [-Wunused-parameter]
BamIndex<Csi> const & index)
^
3 warnings generated.
g++ chopBAI.o -lz -lpthread -o chopBAI
Compilation gives three warnings as shown above, but seems to work:
$ ./chopBAI --version
./chopBAI version 0.1 beta
chopBAI fails to populate the BAI psuedo-bins used to record the total number of reads mapped to or placed against each reference, as used by samtools idxstats
. These values appears in the references' lists of distinct bins as bin number 37450 (which is beyond the normal range), see https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L1071
To reproduce, using 2df1721 (current master branch tip)
$ git reset --hard
$ git checkout 2df17217b0ed4296858d33de966b76f267eac8d4
$ make
$ make test
Note I am using Mac OS X and samtools 1.3 but any recent version should work.
$ samtools --version
samtools 1.3
Using htslib 1.3
Copyright (C) 2015 Genome Research Ltd.
Now for the bug, first let's look at the reads mapped to or placed on each reference:
$ samtools idxstats tests/test.sorted.bam
chrA:B:C:D 300 44 0
chrA:B 10000 2236 0
chrB 80001 17718 0
* 0 0 0
Note that this sample test file has no placed but unmapped reads (i.e. unmapped partners of paired reads which are placed at the same POS/RNAME as their mapped partner for sorting purposes), so the final column is all zeros.
Now looking at one of the windowed BAI files and its BAM symlink,
$ samtools idxstats tests/chrB/test.sorted.bam
chrA:B:C:D 300 0 0
chrA:B 10000 0 0
chrB 80001 0 0
* 0 0 0
Expected result would be to have zeros for all the non-selected regions, but the full count for chrB
, i.e.
chrA:B:C:D 300 0 0
chrA:B 10000 0 0
chrB 80001 17718 0
* 0 0 0
Likewise for a region of a reference,
$ samtools idxstats "tests/chrB:1-100/test.sorted.bam"
chrA:B:C:D 300 0 0
chrA:B 10000 0 0
chrB 80001 0 0
* 0 0 0
Here I would expect the same numbers as if I have extracted reads in this region via samtools view
and indexed the sub-file:
$ samtools view -b -o test_chrB_1-100_subset.bam tests/test.sorted.bam "chrB:1-100"
$ samtools index test_chrB_1-100_subset.bam
$ samtools idxstats test_chrB_1-100_subset.bam
chrA:B:C:D 300 0 0
chrA:B 10000 0 0
chrB 80001 24 0
* 0 0 0
I have tried using chopBAI -l
just in case that made any difference, it did not.
The README file just has a generic example assuming files sequence.bam
and sequence.bam.bai
in the current directory. It would be helpful to document a real world example where chopBAI is expected to be useful - perhaps something from the 1000 Genomes Project, or an excerpt of the full dataset used but not described in the preprint?
The preprint http://biorxiv.org/content/early/2015/11/06/030825 talks about typical (human) BAI files of 10Mb [sic, see below], and Figure 1 plots data from a specific but unspecified example dataset - presumably human chromosome one. The discussion reports "With chopBAI’s preprocessing, the I/O imposed by BAM index files in the analysis of 10 Kbp regions of 10,000 BAM files can be reduced
from 93 Gb to 4.5 Gb per job", again presumably the same unspecified multiple-individual human data set.
P.S. The preprint could be be misusing the abbreviations Gb and Mb (gigabit and megabit) rather than GB and MB (gigabyte and megabytes). For network IO using bits rather than bytes is (unfortunately) common, but for file size on disk mega or gigabytes is normal, likewise hard drive IO is usually measured in megabytes/second (MB/s).
The README file explains that if you would like to use the reduced BAI file with a tool that assumes the BAM and the BAI file to share a common prefix (such as samtools), you can create a symbolic link, e.g
$ ln -s sequence.bam chr4:15000000-16000000/sequence.bam
$ samtools view -c chr4:15000000-16000000/sequence.bam chr4:15500000-15600000
This can then use the original bam file, but the reduced index.
It would be nice to have these links created by the tool, perhaps via a new command line option. You've used -l
which would have been the obvious shorthand, so maybe -s
and --symlink
would work?
chopBAI fails to follow the convention long defined in samtools for how to specify a region, e.g. from samtools 0.1.18:
$ samtools view --help
...
4. A region should be presented in one of the following formats:
`chr1', `chr2:1,000' and `chr3:1000-2,000'. When a region is
specified, the input alignment file must be an indexed BAM file.
...
or samtools 1.2,
$ samtools view --help
...
5. A region should be presented in one of the following formats:
`chr1', `chr2:1,000' and `chr3:1000-2,000'. When a region is
specified, the input alignment file must be a sorted and indexed
alignment (BAM/CRAM) file.
...
This is currently implemented via htslib, see https://github.com/samtools/htslib/blob/develop/hts.c#L1877 - allowed syntax is reference
, reference:begin
and reference:begin-end
and comma separators are allowed in the coordinates.
Example:
$ samtools idxstats example_bam_file.bam
chr1 1575 1446 18
chr2 1584 1789 17
* 0 0 0
$ ./chopBAI example_bam_file.bam chr1 chr2
ERROR: Could not parse region 0: chr1
Please specify region in format chr:beg-end.
Note that chopBAI also fails to accept comma thousand separators:
$ ./chopBAI example_bam_file.bam chr1:1-1,575 chr2:1-1,584
ERROR: Could not parse region 0: chr1:1-1,575
Please specify region in format chr:beg-end.
Workaround:
$ ./chopBAI example_bam_file.bam chr1:1-1575 chr2:1-1584
Due to a failure to fully define the specification, historically samtools index example.bam
would create example.bam.bai
while the Picard Java equivalent would create example.bai
(shorter extension).
Therefore recent versions of samtools will check for and use either form, giving preference to example.bam.bai
, e.g.
$ rm *.bai
$ samtools index example_bam_file.bam
$ ls example_bam_file.*
example_bam_file.bam example_bam_file.bam.bai
$ samtools idxstats example_bam_file.bam
chr1 1575 1446 18
chr2 1584 1789 17
* 0 0 0
$ mv example_bam_file.bam.bai example_bam_file.bai
$ ls example_bam_file.*
example_bam_file.bai example_bam_file.bam
$ ../samtools/samtools idxstats example_bam_file.bam
chr1 1575 1446 18
chr2 1584 1789 17
* 0 0 0
Therefore ideally chopBAI would also include the same fall back:
$ ls example_bam_file.*
example_bam_file.bai example_bam_file.bam
$ ./chopBAI example_bam_file.bam chr1:1-1575 chr2:1-1584
ERROR: Open failed on bam index file example_bam_file.bam.bai
Workaround: Use the samtools convention .bam.bai
instead:
$ mv example_bam_file.bai example_bam_file.bam.bai
$ ls example_bam_file.*
example_bam_file.bam example_bam_file.bam.bai
$ ./chopBAI example_bam_file.bam chr1:1-1575 chr2:1-1584
The SAM specification http://samtools.github.io/hts-specs/SAMv1.pdf (source https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L196 ) defines the reference sequence names with the regular expression [!-)+-<>-~][!-~]*
.
i.e. While not commonly used, a colon within the reference name is valid, e.g. chr1:A
and chr1:B
.
Consider this test case,
$ more colons.sam
@SQ SN:chr1:A LN:1000
@SQ SN:chr1:B LN:2000
EAS56_57:6:190:289:82 69 chr1:A 100 0 * = 100 0 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:i:192
EAS56_57:6:190:289:82 137 chr1:A 100 73 35M = 100 0 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; MF:i:64 Aq:i:0 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS54_65:3:320:20:250 147 chr1:B 1532 77 35M = 1367 -200 TTTTTTTTTTTTTTTTTTTTTTTGCATGCCAGAAA +'''/<<<<7:;+<;::<<<;;<<<<<<<<<<<<< MF:i:18 Aq:i:6 NM:i:2 UQ:i:24 H0:i:1 H1:i:2
EAS114_26:7:37:79:581 83 chr1:B 1533 68 35M = 1349 -219 TTTTTTTTTTTTTTTTTTTTTTTCATGCCAGAAAA 3,,,===6===<===<;=====-============ MF:i:18 Aq:i:27 NM:i:2 UQ:i:23 H0:i:0 H1:i:1
$ samtools view -b -S colons.sam > colons.bam
[samopen] SAM header is present: 2 sequences.
$ samtools index colons.bam
$ samtools idxstats colons.bam
chr1:A 1000 1 1
chr1:B 2000 2 0
* 0 0 0
$ ls colons.*
colons.bam colons.bam.bai colons.sam
Viewing a valid region with samtools view
,
$ samtools view colons.bam "chr1:A:1-500"
EAS56_57:6:190:289:82 69 chr1:A 100 0 * = 100 0 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:i:192
EAS56_57:6:190:289:82 137 chr1:A 100 73 35M = 100 0 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; MF:i:64 Aq:i:0 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
Trying to use the same region with chopBAI:
$ ./chopBAI colons.bam "chr1:A:1-500"
ERROR: Could not open file listing the regions: chr1:A:1-500
See also #2 about not accepting samtools style regions, and how samtools parses the region names.
Once you have added some automated testing (#4), it would be good practice to then have these tests automatically run whenever the repository is updated.
I would suggest using TravisCI, as used by samtools for example, see http://docs.travis-ci.com/ and https://github.com/samtools/samtools/blob/develop/.travis.yml
Another big plus of TravisCI is that any user submitted pull requests would automatically have the functional test results shown on them which is a big help when reviewing contributions.
$ ./chopBAI --version ; echo "Returned $?"
./chopBAI version 0.1 beta
Returned 1
$ ./chopBAI --help ; echo "Returned $?"
./chopBAI - chops a bam index file into pieces
==============================================
SYNOPSIS
./chopBAI [OPTIONS] BAM-FILE REGION1 [... REGIONn]
./chopBAI [OPTIONS] BAM-FILE REGION-FILE
DESCRIPTION
Writes small index files for the specified regions based on an existing bai or csi file for the input bamfile. The
regions have to be specified in the format 'chr:begin-end' and can be listed directly on the command line
separated by spaces or in a file listing one region per line. The program writes a smaller index file for each
region to the directory '<output prefix>/<region>/<bamfile>.[bai|csi]'. The output directories are created if they
do not exist.
-h, --help
Displays this help message.
--version
Display version information.
Output options:
-p, --prefix STR
Output prefix. Default: current directory.
-l, --linear
Include linear index of BAI in the output.
VERSION
./chopBAI version: 0.1 beta
Last update on 2015-11-06 17:36:57
Returned 1
These should both return 0 (zero) as any non-zero return value indicated failure under POSIX conventions.
The initial BAM index format BAI has a number of known limitations including fixed bin sizes (making it unsuitable for ultra-high coverage datasets where smaller bins would help) and a 512Mbp limit on reference length (a problem in some marsupial genomes, but more importantly major crop plants like wheat).
These limits are overcome in the replacement CSI index format, which is supported in samtools.
The chopBAI name and main documentation only talks about BAI files, making it unclear if the CSI alternative is supported or not.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.