Code Monkey home page Code Monkey logo

chopbai's People

Contributors

bkehr avatar pmelsted avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Forkers

peterjc

chopbai's Issues

chopBAI does not include any functional tests

Ideally chopBAI would include some bundled test cases which could be invoked by the commonly used idiom of make test.

Right now there isn't even a provided sample dataset for human testing.

compiler warnings: unused parameters

Using Mac OS X 10.10.5,

$ uname
Darwin
$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.0 (clang-700.1.76)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

Attempting to build the latest code for chopBAI, specifically commit 6dfdaa5 as follows:

$ git clone https://github.com/seqan/seqan.git
$ git clone https://github.com/DecodeGenetics/chopBAI.git
$ cd chopBAI
$ emacs Makefile
...
$ git diff
diff --git a/Makefile b/Makefile
index 24ec4fb..04eb225 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,7 @@ CC=$(CXX)

 # Set this to include SeqAn libraries, either system wide
 # or download into current folder and set to .
-SEQAN_LIB=.
+SEQAN_LIB=../seqan/include/

 CXXFLAGS+=-I$(SEQAN_LIB) -DSEQAN_HAS_ZLIB=1
 LDLIBS=-lz -lpthread

Then compiling gives:

$ make
g++ -I../seqan/include/ -DSEQAN_HAS_ZLIB=1 -DDATE=\""on 2015-11-06 17:36:57"\" -W -Wall -Wno-long-long -pedantic -Wno-variadic-macros -O3 -DSEQAN_ENABLE_TESTING=0 -DSEQAN_ENABLE_DEBUG=0 -Qunused-arguments  -c -o chopBAI.o chopBAI.cpp
In file included from chopBAI.cpp:7:
./bam_index_csi.h:173:55: warning: unused parameter 'bamFile' [-Wunused-parameter]
bool jumpToOrphans(FormattedFile<Bam, Input, TSpec> & bamFile,
                                                      ^
./bam_index_csi.h:174:27: warning: unused parameter 'hasAlignments' [-Wunused-parameter]
                   bool & hasAlignments,
                          ^
./bam_index_csi.h:175:42: warning: unused parameter 'index' [-Wunused-parameter]
                   BamIndex<Csi> const & index)
                                         ^
3 warnings generated.
g++   chopBAI.o  -lz -lpthread -o chopBAI

Compilation gives three warnings as shown above, but seems to work:

$ ./chopBAI --version
./chopBAI version 0.1 beta

Mapped and placed counts in BAI not reported (samtools idxstats)

chopBAI fails to populate the BAI psuedo-bins used to record the total number of reads mapped to or placed against each reference, as used by samtools idxstats. These values appears in the references' lists of distinct bins as bin number 37450 (which is beyond the normal range), see https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L1071

To reproduce, using 2df1721 (current master branch tip)

$ git reset --hard
$ git checkout 2df17217b0ed4296858d33de966b76f267eac8d4
$ make
$ make test

Note I am using Mac OS X and samtools 1.3 but any recent version should work.

$ samtools --version
samtools 1.3
Using htslib 1.3
Copyright (C) 2015 Genome Research Ltd.

Now for the bug, first let's look at the reads mapped to or placed on each reference:

$ samtools idxstats tests/test.sorted.bam
chrA:B:C:D  300 44  0
chrA:B  10000   2236    0
chrB    80001   17718   0
*   0   0   0

Note that this sample test file has no placed but unmapped reads (i.e. unmapped partners of paired reads which are placed at the same POS/RNAME as their mapped partner for sorting purposes), so the final column is all zeros.

Now looking at one of the windowed BAI files and its BAM symlink,

$ samtools idxstats tests/chrB/test.sorted.bam
chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   0   0
*   0   0   0

Expected result would be to have zeros for all the non-selected regions, but the full count for chrB, i.e.

chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   17718   0
*   0   0   0

Likewise for a region of a reference,

$ samtools idxstats "tests/chrB:1-100/test.sorted.bam"
chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   0   0
*   0   0   0

Here I would expect the same numbers as if I have extracted reads in this region via samtools view and indexed the sub-file:

$ samtools view -b -o test_chrB_1-100_subset.bam tests/test.sorted.bam "chrB:1-100"
$ samtools index test_chrB_1-100_subset.bam 
$ samtools idxstats test_chrB_1-100_subset.bam 
chrA:B:C:D  300 0   0
chrA:B  10000   0   0
chrB    80001   24  0
*   0   0   0

I have tried using chopBAI -l just in case that made any difference, it did not.

Document example with real world data

The README file just has a generic example assuming files sequence.bam and sequence.bam.bai in the current directory. It would be helpful to document a real world example where chopBAI is expected to be useful - perhaps something from the 1000 Genomes Project, or an excerpt of the full dataset used but not described in the preprint?

The preprint http://biorxiv.org/content/early/2015/11/06/030825 talks about typical (human) BAI files of 10Mb [sic, see below], and Figure 1 plots data from a specific but unspecified example dataset - presumably human chromosome one. The discussion reports "With chopBAI’s preprocessing, the I/O imposed by BAM index files in the analysis of 10 Kbp regions of 10,000 BAM files can be reduced
from 93 Gb to 4.5 Gb per job
", again presumably the same unspecified multiple-individual human data set.

P.S. The preprint could be be misusing the abbreviations Gb and Mb (gigabit and megabit) rather than GB and MB (gigabyte and megabytes). For network IO using bits rather than bytes is (unfortunately) common, but for file size on disk mega or gigabytes is normal, likewise hard drive IO is usually measured in megabytes/second (MB/s).

Option to create symlink to master BAM file?

The README file explains that if you would like to use the reduced BAI file with a tool that assumes the BAM and the BAI file to share a common prefix (such as samtools), you can create a symbolic link, e.g

$ ln -s sequence.bam chr4:15000000-16000000/sequence.bam
$ samtools view -c chr4:15000000-16000000/sequence.bam chr4:15500000-15600000

This can then use the original bam file, but the reduced index.

It would be nice to have these links created by the tool, perhaps via a new command line option. You've used -l which would have been the obvious shorthand, so maybe -s and --symlink would work?

Fails to accepts valid samtools style regions

chopBAI fails to follow the convention long defined in samtools for how to specify a region, e.g. from samtools 0.1.18:

$ samtools view --help
...
  4. A region should be presented in one of the following formats:
     `chr1', `chr2:1,000' and `chr3:1000-2,000'. When a region is
     specified, the input alignment file must be an indexed BAM file.
...

or samtools 1.2,

$ samtools view --help
...
  5. A region should be presented in one of the following formats:
     `chr1', `chr2:1,000' and `chr3:1000-2,000'. When a region is
     specified, the input alignment file must be a sorted and indexed
     alignment (BAM/CRAM) file.
...

This is currently implemented via htslib, see https://github.com/samtools/htslib/blob/develop/hts.c#L1877 - allowed syntax is reference, reference:begin and reference:begin-end and comma separators are allowed in the coordinates.

Example:

$ samtools idxstats example_bam_file.bam
chr1    1575    1446    18
chr2    1584    1789    17
*   0   0   0
$ ./chopBAI example_bam_file.bam chr1 chr2
ERROR: Could not parse region 0: chr1
       Please specify region in format chr:beg-end.

Note that chopBAI also fails to accept comma thousand separators:

$ ./chopBAI example_bam_file.bam chr1:1-1,575 chr2:1-1,584
ERROR: Could not parse region 0: chr1:1-1,575
       Please specify region in format chr:beg-end.

Workaround:

$ ./chopBAI example_bam_file.bam chr1:1-1575 chr2:1-1584

Fails to find example.bai for example.bam

Due to a failure to fully define the specification, historically samtools index example.bam would create example.bam.bai while the Picard Java equivalent would create example.bai (shorter extension).

Therefore recent versions of samtools will check for and use either form, giving preference to example.bam.bai, e.g.

$ rm *.bai
$ samtools index example_bam_file.bam 
$ ls example_bam_file.*
example_bam_file.bam     example_bam_file.bam.bai
$ samtools idxstats example_bam_file.bam
chr1    1575    1446    18
chr2    1584    1789    17
*   0   0   0
$ mv example_bam_file.bam.bai example_bam_file.bai
$ ls example_bam_file.*
example_bam_file.bai example_bam_file.bam
$ ../samtools/samtools idxstats example_bam_file.bam
chr1    1575    1446    18
chr2    1584    1789    17
*   0   0   0

Therefore ideally chopBAI would also include the same fall back:

$ ls example_bam_file.*
example_bam_file.bai example_bam_file.bam
$ ./chopBAI example_bam_file.bam chr1:1-1575 chr2:1-1584
ERROR: Open failed on bam index file example_bam_file.bam.bai

Workaround: Use the samtools convention .bam.bai instead:

$ mv example_bam_file.bai example_bam_file.bam.bai
$ ls example_bam_file.*
example_bam_file.bam     example_bam_file.bam.bai
$ ./chopBAI example_bam_file.bam chr1:1-1575 chr2:1-1584

Fails to accept regions where the reference has a colon

The SAM specification http://samtools.github.io/hts-specs/SAMv1.pdf (source https://github.com/samtools/hts-specs/blob/master/SAMv1.tex#L196 ) defines the reference sequence names with the regular expression [!-)+-<>-~][!-~]*.

i.e. While not commonly used, a colon within the reference name is valid, e.g. chr1:A and chr1:B.

Consider this test case,

$ more colons.sam
@SQ     SN:chr1:A       LN:1000
@SQ     SN:chr1:B       LN:2000
EAS56_57:6:190:289:82   69      chr1:A  100     0       *       =       100     0       CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA     <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;     MF:i:192
EAS56_57:6:190:289:82   137     chr1:A  100     73      35M     =       100     0       AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC     <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;     MF:i:64 Aq:i:0  NM:i:0  UQ:i:0  H0:i:1  H1:i:0
EAS54_65:3:320:20:250   147     chr1:B  1532    77      35M     =       1367    -200    TTTTTTTTTTTTTTTTTTTTTTTGCATGCCAGAAA     +'''/<<<<7:;+<;::<<<;;<<<<<<<<<<<<<     MF:i:18 Aq:i:6  NM:i:2  UQ:i:24 H0:i:1  H1:i:2
EAS114_26:7:37:79:581   83      chr1:B  1533    68      35M     =       1349    -219    TTTTTTTTTTTTTTTTTTTTTTTCATGCCAGAAAA     3,,,===6===<===<;=====-============     MF:i:18 Aq:i:27 NM:i:2  UQ:i:23 H0:i:0  H1:i:1
$ samtools view -b -S colons.sam > colons.bam
[samopen] SAM header is present: 2 sequences.
$ samtools index colons.bam
$ samtools idxstats colons.bam
chr1:A  1000    1   1
chr1:B  2000    2   0
*   0   0   0
$ ls colons.*
colons.bam     colons.bam.bai colons.sam

Viewing a valid region with samtools view,

$ samtools view colons.bam "chr1:A:1-500"
EAS56_57:6:190:289:82   69  chr1:A  100 0   *   =   100 0   CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:i:192
EAS56_57:6:190:289:82   137 chr1:A  100 73  35M =   100 0   AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; MF:i:64 Aq:i:0  NM:i:0  UQ:i:0  H0:i:1  H1:i:0

Trying to use the same region with chopBAI:

$ ./chopBAI colons.bam "chr1:A:1-500"
ERROR: Could not open file listing the regions: chr1:A:1-500

See also #2 about not accepting samtools style regions, and how samtools parses the region names.

chopBAI does not use continuous integration testing

Once you have added some automated testing (#4), it would be good practice to then have these tests automatically run whenever the repository is updated.

I would suggest using TravisCI, as used by samtools for example, see http://docs.travis-ci.com/ and https://github.com/samtools/samtools/blob/develop/.travis.yml

Another big plus of TravisCI is that any user submitted pull requests would automatically have the functional test results shown on them which is a big help when reviewing contributions.

chopBAI --version and --help return failure error level

$ ./chopBAI --version ; echo "Returned $?"
./chopBAI version 0.1 beta
Returned 1
$ ./chopBAI --help ; echo "Returned $?"
./chopBAI - chops a bam index file into pieces
==============================================

SYNOPSIS
    ./chopBAI [OPTIONS] BAM-FILE REGION1 [... REGIONn]
    ./chopBAI [OPTIONS] BAM-FILE REGION-FILE

DESCRIPTION
    Writes small index files for the specified regions based on an existing bai or csi file for the input bamfile. The
    regions have to be specified in the format 'chr:begin-end' and can be listed directly on the command line
    separated by spaces or in a file listing one region per line. The program writes a smaller index file for each
    region to the directory '<output prefix>/<region>/<bamfile>.[bai|csi]'. The output directories are created if they
    do not exist.

    -h, --help
          Displays this help message.
    --version
          Display version information.

  Output options:
    -p, --prefix STR
          Output prefix. Default: current directory.
    -l, --linear
          Include linear index of BAI in the output.

VERSION
    ./chopBAI version: 0.1 beta
    Last update on 2015-11-06 17:36:57
Returned 1

These should both return 0 (zero) as any non-zero return value indicated failure under POSIX conventions.

Make it clear if CSI is supported as well as BAI

The initial BAM index format BAI has a number of known limitations including fixed bin sizes (making it unsuitable for ultra-high coverage datasets where smaller bins would help) and a 512Mbp limit on reference length (a problem in some marsupial genomes, but more importantly major crop plants like wheat).

These limits are overcome in the replacement CSI index format, which is supported in samtools.

The chopBAI name and main documentation only talks about BAI files, making it unclear if the CSI alternative is supported or not.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.