Code Monkey home page Code Monkey logo

gramtools's People

Contributors

bricoletc avatar ffranr avatar iqbal-lab avatar jeromekelleher avatar martinghunt avatar sm0179 avatar zamiqbal avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gramtools's Issues

mask storage

compress sparse mask with some rank/select supporting data structure

Faster/lower RAM kmer hash

I don't know the exact RAM usage by our kmer hash, but for future reference, we could replace it with https://github.com/rob-p/BooM which is a minimal perfect hash map based on Rayan's BBhash - uses about 3bits per element apparently. So for human, if we want to store 13mers, we need to store 66 million kmers--->this would only take 25Mb

coverages obtained with bitmask branch on joint DBLMSP prg

Coverages obtained by running gramtools from bitmask branch on the 5 pf3k validation samples with the PRG built from the 2 DBLMSP genes are different from the ones obtained with master (previous pf3k analysis), dev and ranks.

cmd on aws: parallel --gnu -a /data1/gramtools_test_data/input/validation_samples './gramtools -p /data1/gramtools_test_data/input/final_joint_prg/updated_graph_gn01/Chr10_div_in_msp34_prg.in -c /data1/gramtools_test_data/input/final_joint_prg/updated_graph_gn01/csa_file -i /data1/gramtools_test_data/input/fastq/{}.fq.trim.gz -s /data1/gramtools_test_data/input/final_joint_prg/updated_graph_gn01/Mask_sites -a /data1/gramtools_test_data/input/final_joint_prg/updated_graph_gn01/Mask_alleles -v /data1/gramtools_test_data/output/test_bitmask/{}/Coverages -r /data1/gramtools_test_data/output/test_bitmask/{}/Reads_processed -b /data1/gramtools_test_data/output/test_bitmask/{}/int_al_file_bv -l /data1/gramtools_test_data/output/test_bitmask/{}/csa_memory -k 9 -f /data1/gramtools_test_data/input/final_joint_prg/updated_graph_gn01/precalc-kmers/kmersandrev > /data1/gramtools_test_data/output/test_bitmask/{}/out'

examples attached for sample 7G8 (for other samples, see /data1/gramtools_test_data/output/test_bitmask/ and /data1/gramtools_test_data/output/validation_samples/ on aws)
Coverages_bitmask.txt
Coverages_master.txt

Coverages obtained by running gramtools on the whole-genome simulation are identical between the 4 branches

k-mers overlapping variants

modify the variantkmers.py script to take in the read length as arg and generate kmers surrounding the variants (+/- RLen bases) in addition to the ones overlapping variants

Periodic assertion failures in SDSL

eg here:

bash-4.1$ ./slow_unittest_bidir_search_bwd_fwd
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from BackwardSearchTest
[ RUN ] BackwardSearchTest.NoVariantsSlowTest2
slow_unittest_bidir_search_bwd_fwd: /home/sm0179/include/sdsl/int_vector.hpp:1235: sdsl::int_vector< >::reference sdsl::int_vector< >::operator[](const size_type&) [with unsigned char t_width = 0u; sdsl::int_vector< >::reference = sdsl::int_vector_referencesdsl::int_vector<0u >; sdsl::int_vector< >::size_type = long unsigned int]: Assertion `idx < this->size()' failed.

I also saw this in the fast tests, and it went away when I fixed some errors in the input pRG/masks.
I can't see any problem with the input file, and on rerunning I can't reproduce this

False interval added in precalc

PRG
ATCGCT5CCGCCGGCGA6G5TTTTT

perl ~/dev/git/pg/ben_langmead_materials/integer_bwt.pl ATCGCT5CCGCCGGCGA6G5TTTTT
bwm of 1423245223223323163544444$ is
$1423245223223323163544444
1423245223223323163544444$
163544444$1423245223223323
223223323163544444$1423245
223323163544444$1423245223
23163544444$14232452232233
23223323163544444$14232452
23245223223323163544444$14
23323163544444$14232452232
245223223323163544444$1423
3163544444$142324522322332
3223323163544444$142324522
323163544444$1423245223223
3245223223323163544444$142
3323163544444$142324522322
3544444$142324522322332316
4$142324522322332316354444
423245223223323163544444$1
44$14232452232233231635444
444$1423245223223323163544 index 19
4444$142324522322332316354
44444$14232452232233231635
45223223323163544444$14232 index 22
5223223323163544444$142324 index 23
544444$1423245223223323163
63544444$14232452232233231

This is what the kmer precalc file looks like for TTT

4 4 4 |1|19 22 22 23 |18 24 25 27 ||5 @| <<<<< look at this, says intervals [19,22) and [22,23)

[19,22) is correct.
[22,23) is wrong, corresponds to a TTT crossing and ignoring both alleles
ATCGCT5CCGCCGGCGA6G5TTTTT

Sorina knows about this, just logging this here so we don't forget.

Unexpected assertion for specific kmer

Sorina spotted some samples were hitting/failing an assertion. Basically, if a read was mapping horizontally uniquely but not vertically uniquely, this assert was checking that at that time, nothing had been pushed onto the sites list, in the knowledge that it was going to be pushed after exiting the current function. Actually the problem can be reproduced using just the last kmer of the read, and putting it in the kmers input file.

So, what is the command?

gramtools -p Chr10_div_in_msp34_prg.in 
-c csa_file -i test.fastq -s Mask_sites -a Mask_alleles 
-v Coverages -r Reads_processed -b int_alphabet_file 
-l csa_memory -k 9 -f  onekmer

where

cat onekmer
TATGAAGAA

cat test.fastq
@0
AAACCAGGATGATTCACAACTATTAGAAATATCAAAAAAGGATAAATGTAAAGAAGCATTAAAGCATTATGAAGAA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

OK, so what happens. First of all, if you look in the PRG file, you can see that the kmer overlaps site 79, and overlaps the end of many alleles, with AA poking out of the right hand end of the site.
So, when we precalculate the sa_interval and sites for this kmer, you should see all these alleles in the sites object. But in fact somehow it reduces it all down to just alleles 24 and 34.

I guess the thing to do is run it in gdb and look at where sa_intervals and sites are erased.

Suffix array sampling

Generally people don't store the entire suffix array, but only a sample of it in order to save memory. When you want to access a position of the SA, if that position is stored all good, if not, it is calculated from the nearest neighbour that is stored, which takes time. So there is this tradeoff between the proportion of values in the SA that you store and the speed of random access. At the moment, the default in sdsl is storing 1/32 SA values, but this parameter can be changed, so we can store more and then querying the SA will be faster.

Segfault in latest unit test of multiple matches

At commit ef3cf76

Add testing of sites object to multiple-matches unit test - segfaults

./unittest_bidir_search_bwd_fwd 
[==========] Running 10 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 10 tests from BackwardSearchTest
[ RUN      ] BackwardSearchTest.NoVariants1
[       OK ] BackwardSearchTest.NoVariants1 (4426 ms)
[ RUN      ] BackwardSearchTest.OneSNP
[       OK ] BackwardSearchTest.OneSNP (11693 ms)
[ RUN      ] BackwardSearchTest.TwoSNPs
[       OK ] BackwardSearchTest.TwoSNPs (338 ms)
[ RUN      ] BackwardSearchTest.Two_matches_one_variable_one_nonvariable_region
[       OK ] BackwardSearchTest.Two_matches_one_variable_one_nonvariable_region (3140 ms)
[ RUN      ] BackwardSearchTest.Two_matches_one_variable_second_allele_one_nonvariable_region
[       OK ] BackwardSearchTest.Two_matches_one_variable_second_allele_one_nonvariable_region (366 ms)
[ RUN      ] BackwardSearchTest.Two_long_sites
[       OK ] BackwardSearchTest.Two_long_sites (1201 ms)
[ RUN      ] BackwardSearchTest.Match_within_long_site_match_outside
[       OK ] BackwardSearchTest.Match_within_long_site_match_outside (331 ms)
[ RUN      ] BackwardSearchTest.Long_site_and_repeated_snp_on_edge_of_site
[       OK ] BackwardSearchTest.Long_site_and_repeated_snp_on_edge_of_site (322 ms)
[ RUN      ] BackwardSearchTest.Multiple_matches_over_multiple_sites
Segmentation fault
This is because the sites object ends up with a dummy entry in it with
zero capacity, ready for the next entry, and my unit test does not expect that.
I'm commiting it as it is for the moment. I need to get my head around whether
the behaviour is correct. I didn't need to allow for such a thing in all the previous unit tests

TwoSNPs unit tests shows sites object storing same site-id for both SNPs

At this commit

commit 1c0bcdc
Author: Zamin Iqbal [email protected]
Date: Sun Jul 31 21:57:32 2016 +0100

Add check of sites object in TwoSNPs unit test - FAILS.

[ RUN      ] BackwardSearchTest.TwoSNPs
./test/unittest_bidir_search_bwd_fwd.cpp:202: Failure
Value of: 5
Expected: sites.back().front().first
Which is: 7
[  FAILED  ] BackwardSearchTest.TwoSNPs (522 ms)

Seems to have same site-id for both SNPs it crosses  in
the sites object

Underflowing a uint64_t leads to unexpected lack of problems

PRG is A5C6G5T
Integer encoding is 1526354$

BWM is
$1526354
1526354$
26354$15
354$1526
4$152635
526354$1
54$15263
6354$152

Now, am stepping through the precalc kmers step, for the kmer AC.
Start with C==2, and you get the interval [2,3). So far so good.
Then try AC and step into bidir_search()

  uint64_t bidir_search(csa_wt<wt_int<bit_vector,rank_support_v5<>>,2,16777216> &csa,
                        uint64_t& left, uint64_t& right,
                        uint64_t& left_rev, uint64_t& right_rev,
                        uint8_t c)

c is 1.
left=2, right=3
So far so good.

// c_begin (below) is the first occurrence/posn // of char c in the far left column // of the BW matrix B uint64_t c_begin = csa.C[csa.char2comp[c]];

c_begin =1. This makes sense - look at the BWM.

Now we get here

//rank_l is the number of times c occurs in BWT[0,left] uint64_t rank_l = std::get<0>(r_s_b); // s is total number of times // chars smaller than c occur in BWT[left,right] uint64_t s = std::get<1>(r_s_b); // b is total number of times // chars bigger than c occur in BWT[left,right] uint64_t b = std::get<2>(r_s_b);

rank_l = 0
s=0
b=1
Comparing the BWM, this all looks ok. However the following fails

//rank_r is the number of times c occurs in BWT[0,right] uint64_t rank_r = right - left - s - b + rank_l - 1;

look at what we get:

(gdb) p rank_r $65 = 18446744073709551615

Look again at the definition of rank_r

(gdb) p right - left - s - b + rank_l - 1
$66 = 18446744073709551615
(gdb) p right - left - s - b + rank_l
$67 = 0

So we are trying to subtract 1 from a uint64_t zero.

What does this lead to?

//get new interval, after appending c left = c_begin + rank_l; right = c_begin + rank_r + 1;
which gives us

(gdb) p left
$72 = (uint64_t &) @0x7ffff0000a10: 1
(gdb) p right
$73 = (uint64_t &) @0x7ffff0000a18: 1

Segfaults when it meets a read of all Ns

Mapped 7G8 to a WG PRG, and by chance one read was all Ns.
Segfaulted
#11 0x0000000000440e8e in main (argc=12, argv=0x7fffffffbdb8) at ./map.cpp:91

(gdb) frame 11
#11 0x0000000000440e8e in main (argc=12, argv=0x7fffffffbdb8) at ./map.cpp:91

91 std::vector<uint8_t> kmer(p.begin()+p.size()-k,p.end()); //is there a way to avoid making this copy?
(gdb) l
86 if (q->seq[i]=='C' or q->seq[i]=='c') p.push_back(2);
87 if (q->seq[i]=='G' or q->seq[i]=='g') p.push_back(3);
88 if (q->seq[i]=='T' or q->seq[i]=='t') p.push_back(4);
89 }
90
91 std::vector<uint8_t> kmer(p.begin()+p.size()-k,p.end()); //is there a way to avoid making this copy?
92 if (kmer_idx.find(kmer)!=kmer_idx.end() && kmer_idx_rev.find(kmer)!=kmer_idx_rev.end() && kmer_sites.find(kmer)!=kmer_sites.end()) {
93 sa_intervals=kmer_idx[kmer];
94 sa_intervals_rev=kmer_idx_rev[kmer];
95 sites=kmer_sites[kmer];
(gdb) p p
$1 = std::vector of length 0, capacity 0
(gdb) p q.seq
$2 = 0x6905b0 'N' <repeats 76 times>

Precalculate some/all ranks to speed up mapping

This from Sorina

"I just realised something that could potentially speed up the mapping. Those ranks you need for jumping from the last column to the first column in the Burrows Wheeler matrix. In the standard FM-index algorithm they don’t compute them right at the time when you’re mapping each character from each read, because you’d have to compute the same thing multiple times, every time a read covers that region. Instead, what they do is precalculate all ranks for all letters and store them. Because this takes up quite a lot of memory, they actually store only a fraction of the rows and infer the gaps from the nearest stored value. I thought that’s what the sdsl guys are doing as well when I’m calling the rank function, they’re just accessing the structure that stores the ranks. But no, I just digged through their code and they’re actually calculating the rank on the spot. So I think it’s worth trying this out."

Typo - festa

We seem to have a lot of files containing "festa" - I guess typo for fasta ?
You might replace with fastaq to signify taking either fasta or fastq?

bidir_search_fwd is broken

OK, I know we don't use this function right now, so I'm raising this bug as a placeholder for when we need it.

First, I note that in bidir_search_fwd() as it currently stands, we do not clear allele_empty after passing it (by reference) into get_location, whereas we do in _bwd.
Second, I note that sites() ends up with far too many entries in it, including the same site and allele more than once.
Third, when you add empty_allele.clear() in the obvious places (mirroring what is done in _bwd() ), we get segfaults, and notice this:

`
#2 0x0000000000437e82 in BackwardSearchTest_OneSNP_Test::TestBody (this=0x6d4840) at ./test/unittest_bidir_search_bwd_fwd.cpp:157

157 EXPECT_EQ(sites.front().front().second.front(), 1);
(gdb) l
152 no_occ=(_sa_intervals.begin()).second-(_sa_intervals.begin()).first;
153 EXPECT_EQ(true,first_del);
154 EXPECT_EQ(1,sa_intervals.size());
155 EXPECT_EQ(no_occ,1);
156 EXPECT_EQ(sites.front().front().first, 5);
157 EXPECT_EQ(sites.front().front().second.front(), 1);
158 EXPECT_EQ(sites.front().front().second.size(), 1);
159
160
161 sa_intervals.clear();
(gdb) p sites
$1 = std::list = {[0] = std::vector of length 2, capacity 2 = {{first = 5, second = std::vector of length 0, capacity 0}, {first = 5, second = std::vector of length 1, capacity 1 = {0}}}}

`

Note the allele number {0} at the end - this must be wrong as we are indexing alleles 1-based, and all our _bwd() unit tests are passing with correct storing of alleles 1,2,.. etc in the sites object.

This is all true at commit 318a43b (which does not have the allele_empty.clear() added on the two lines after get_location is called).

valgrind issues in parse_masks

Running this command:

vBWT /data3/projects/pf3k/pilot_paper_analysis/prg-constr/new_format/Chr10_div_in_msp34_prg_letter_added_between_adj_sites.in csa_file /data3/projects/pf3k/pilot_paper_analysis/prg-constr/simulate_reads_from_alt_ref/sim_reads_from_alt.fq /data3/projects/pf3k/pilot_paper_analysis/prg-constr/new_format/Mask_sites /data3/projects/pf3k/pilot_paper_analysis/prg-constr/new_format/Mask_alleles Coverages Reads_processed int_alphabet_file csa_memory 9 /data3/projects/gramtools/wabi/precalc-kmers/kmersandrev

on Carlos' dev branch, through valgrind, we get a lot of errors.
Without valgrind, we segfault, and I'm not debugging that til the valgrind errors are fixed.

Run valgrind thus:

valgrind --leak-check=yes

I'm documenting a lot of detail here, might be useful for Sorina

We see:

First some stuff which is due to htslib - you can ignore this

==3352== Conditional jump or move depends on uninitialised value(s)
==3352== at 0x3A11A17546: index (in /lib64/ld-2.12.so)
==3352== by 0x3A11A06237: expand_dynamic_string_token (in /lib64/ld-2.12.so)
==3352== by 0x3A11A082F1: _dl_map_object (in /lib64/ld-2.12.so)
==3352== by 0x3A11A016DD: map_doit (in /lib64/ld-2.12.so)
==3352== by 0x3A11A0E265: _dl_catch_error (in /lib64/ld-2.12.so)
==3352== by 0x3A11A015F6: do_preload (in /lib64/ld-2.12.so)
==3352== by 0x3A11A0480E: dl_main (in /lib64/ld-2.12.so)
==3352== by 0x3A11A15B4D: _dl_sysdep_start (in /lib64/ld-2.12.so)
==3352== by 0x3A11A014A3: _dl_start (in /lib64/ld-2.12.so)
==3352== by 0x3A11A00B07: ??? (in /lib64/ld-2.12.so)
==3352== by 0xB: ???
==3352== by 0xFFEFFEDEE: ???
==3352==
==3352== Conditional jump or move depends on uninitialised value(s)
==3352== at 0x3A11A1754B: index (in /lib64/ld-2.12.so)
==3352== by 0x3A11A06237: expand_dynamic_string_token (in /lib64/ld-2.12.so)
==3352== by 0x3A11A082F1: _dl_map_object (in /lib64/ld-2.12.so)
==3352== by 0x3A11A016DD: map_doit (in /lib64/ld-2.12.so)
==3352== by 0x3A11A0E265: _dl_catch_error (in /lib64/ld-2.12.so)
==3352== by 0x3A11A015F6: do_preload (in /lib64/ld-2.12.so)
==3352== by 0x3A11A0480E: dl_main (in /lib64/ld-2.12.so)
==3352== by 0x3A11A15B4D: _dl_sysdep_start (in /lib64/ld-2.12.so)
==3352== by 0x3A11A014A3: _dl_start (in /lib64/ld-2.12.so)
==3352== by 0x3A11A00B07: ??? (in /lib64/ld-2.12.so)
==3352== by 0xB: ???
==3352== by 0xFFEFFEDEE: ???
==3352==

First error:


Mon Apr 25 22:32:27 2016
==3352== Invalid read of size 8
==3352== at 0x40B500: std::vector<int, std::allocator >::capacity() const (stl_vector.h:707)
==3352== by 0x4175AC: std::vector<int, std::allocator >::_M_fill_assign(unsigned long, int const&) (vector.tcc:219)
==3352== by 0x417120: void std::vector<int, std::allocator >::M_assign_dispatch(int, int, std::_true_type) (stl_vector.h:1207)
==3352== by 0x416BFF: void std::vector<int, std::allocator >::assign(int, int) (stl_vector.h:488)
==3352== by 0x41691D: parse_masks(std::vector<unsigned long, std::allocator >&, std::vector<int, std::allocator >&, std::string, std::string, std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > >&) (parse_masks.cpp:37)
==3352== by 0x440321: main (map.cpp:51)
==3352== Address 0x56c3840 is 1,472 bytes inside a block of size 1,536 free'd
==3352== at 0x4A07373: operator delete(void
) (vg_replace_malloc.c:457)
==3352== by 0x4178AB: __gnu_cxx::new_allocator<std::vector<int, std::allocator > >::deallocate(std::vector<int, std::allocator >
, unsigned long) (new_allocator.h:100)
==3352== by 0x417347: std::_Vector_base<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > >::_M_deallocate(std::vector<int, std::allocator >*, unsigned long) (stl_vector.h:175)
==3352== by 0x416DE4: void std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > >::_M_emplace_back_aux<std::vector<int, std::allocator > const&>(std::vector<int, std::allocator > const&&&) (vector.tcc:428)
==3352== by 0x416B1C: std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > >::push_back(std::vector<int, std::allocator > const&) (stl_vector.h:891)
==3352== by 0x416850: parse_masks(std::vector<unsigned long, std::allocator >&, std::vector<int, std::allocator >&, std::string, std::string, std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > >&) (parse_masks.cpp:24)
==3352== by 0x440321: main (map.cpp:51)
==3352==

OK, what is on line 24 of parse_masks?

std::ifstream h1(sites_fname);
std::ifstream h2(alleles_fname);
std::vector v;

no_sites=0;
while (h1>>d)
{
if (d>no_sites) {
no_sites=d;
covgs.push_back(v); <<<<<<<<<< this line
}
mask_s.push_back(d);
}
h1.close();

where covgs is passed in by reference as an argument of that function

std::vectorstd::vector& covgs

and what about line 37 (see above)

covgs[mask_s[i]-6].assign(no_alleles,0); //should have -5? might change anyway if I don't end up using mask_s

Why are we subtracting 6 btw?

Anyway,

..lots of that error.

Then:

Reading KMERS...
DONE!


Mon Apr 25 22:41:07 2016
==3352== Use of uninitialised value of size 8
==3352== at 0x438E7C: sdsl::rank_support_v5<(unsigned char)1, (unsigned char)1>::rank(unsigned long) const (rank_support_v5.hpp:123)
==3352== by 0x438FB9: sdsl::rank_support_v5<(unsigned char)1, (unsigned char)1>::operator()(unsigned long) const (rank_support_v5.hpp:136)
==3352== by 0x407297: std::tuple<unsigned long, unsigned long, unsigned long> sdsl::wt_int<sdsl::int_vector<(unsigned char)1>, sdsl::rank_support_v5<(unsigned char)1, (unsigned ch
ar)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >::lex_count<std::tuple<unsigned long, unsigned long
, unsigned long> >(unsigned long, unsigned long, unsigned long) const (wt_int.hpp:551)
==3352== by 0x406945: bidir_search(sdsl::csa_wt<sdsl::wt_int<sdsl::int_vector<(unsigned char)1>, sdsl::rank_support_v5<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mc
l<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, 2u, 2u, sdsl::sa_order_sa_sampling<(unsigned char)0>, sdsl::isa_sampling<(unsig
ned char)0>, sdsl::int_alphabet<sdsl::sd_vector<sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned c
har)0, (unsigned char)1> >, sdsl::rank_support_sd<(unsigned char)1, sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_sup
port_mcl<(unsigned char)0, (unsigned char)1> >, sdsl::select_support_sd<(unsigned char)1, sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned cha
r)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, sdsl::int_vector<(unsigned char)0> > >&, unsigned long&, unsigned long&, unsigned long&, unsigned long&, unsigne
d char) (bidir_search.cpp:43)
==3352== by 0x409F53: bidir_search_bwd(sdsl::csa_wt<sdsl::wt_int<sdsl::int_vector<(unsigned char)1>, sdsl::rank_support_v5<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, 2u, 2u, sdsl::sa_order_sa_sampling<(unsigned char)0>, sdsl::isa_sampling<(unsigned char)0>, sdsl::int_alphabet<sdsl::sd_vector<sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, sdsl::rank_support_sd<(unsigned char)1, sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, sdsl::select_support_sd<(unsigned char)1, sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, sdsl::int_vector<(unsigned char)0> > >&, unsigned long, unsigned long, unsigned long, unsigned long, __gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator > >, __gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator > >, std::list<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long> > >&, std::list<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long> > >&, std::list<std::vector<std::pair<unsigned int, std::vector<int, std::allocator > >, std::allocator<std::pair<unsigned int, std::vector<int, std::allocator > > > >, std::allocator<std::vector<std::pair<unsigned int, std::vector<int, std::allocator > >, std::allocator<std::pair<unsigned int, std::vector<int, std::allocator > > > > > >&, std::vector<int, std::allocator >&, unsigned long, bool&) (bidir_search_bwd.cpp:143)
==3352== by 0x440B60: main (map.cpp:101)
==3352==

which takes us to lex_count, as Sorina found also.

OK bug raised for now, am looking into it

Something odd with sites object in unit test for forward bidir search

In test Long_site_and_repeated_snp_on_edge_of_site at commit 2a41ce1

We have
prg = gacatagacacacagt5gtcgcctcgtcggctttgagt6gtcgctgctccacacagagact5ggtgctagac7c8a7ccagctgctccacacagaga
and

  //read aligns across sites 5 and 7, allele 1 in both cases                                                                                                              
    query="tagacacacagtgtcgcctcgtcggctttgagtggtgctagacccca";

First the unit test does a bidir backward search and all the unti test assertions pass

` EXPECT_EQ(true,first_del);
EXPECT_EQ(1,sa_intervals.size());
EXPECT_EQ(no_occ,1);

EXPECT_EQ(sites.front().front().first, 7);
EXPECT_EQ(sites.front().front().second.front(), 1);
EXPECT_EQ(sites.front().front().second.size(), 1);
EXPECT_EQ(sites.front().back().first, 5);
EXPECT_EQ(sites.front().back().second.front(), 1);
EXPECT_EQ(sites.front().back().second.size(), 1);`

If we print the sites variable we see this

p sites
$1 = std::list = {[0] = std::vector of length 2, capacity 100 = {{first = 7, second= std::vector of length 1, capacity 1 = {1}}, {first = 5, second = std::vector of length 1, capacity 1 = {1}}}} 

Then we clear variables:

sa_intervals.clear(); sa_intervals_rev.clear(); sites.clear();

and search with the bidir forwards search. Now as soon as we finish the forward search, we print sites and see this:

p sites $3 = std::list = {[0] = std::vector of length 4, capacity 4 = {{first = 5, second =std::vector of length 3, capacity 3 = {1, 2, 1}}, {first = 5, second = std::vector of length 4, capacity 4 = {1, 2, 1, 2}}, {first = 7, second = std::vector of length 4, capacity 4 = {1, 2, 1, 2}}, {first = 7, second = std::vector of length 5, capacity 5 = {1, 2, 1, 2, 1}}}}

That doesnt look right to me at all, and when we get to the assertions we have this code

EXPECT_EQ(sites.front().front().first, 5); EXPECT_EQ(sites.front().front().second.front(), 1); EXPECT_EQ(sites.front().front().second.size(), 1); EXPECT_EQ(sites.front().back().first, 7); EXPECT_EQ(sites.front().back().second.front(), 1); EXPECT_EQ(sites.front().back().second.size(), 1);

and this output
`./test/unittest_bidir_search_bwd_fwd.cpp:586: Failure
Value of: 1
Expected: sites.front().front().second.size()
Which is: 3

./test/unittest_bidir_search_bwd_fwd.cpp:589: Failure
Value of: 1
Expected: sites.front().back().second.size()
Which is: 5
`

Excesive memory consumption whilst precalculating kmers

Precalculating kmers takes an excessive amount of memory.

RAM memory usage peaked at ~32MiB whilst calculating the ksize_9.precalc kmers file for the test1 dataset. See attached file test1_massif.out.12449.zip

This is unlikely due to the thread creation routine. Exiting just after thread creation showed low memory usage on the WG dataset. Memory usage seems to increase quickly as the threads are running.

Read mapping fails due to bidir_search returning an interval [a,a)

PRG:
ATCGCT5CCGCCGGCGA6G5TTTTT

BWM:
$1423245223223323163544444
1423245223223323163544444$
163544444$1423245223223323
223223323163544444$1423245
223323163544444$1423245223
23163544444$14232452232233
23223323163544444$14232452
23245223223323163544444$14
23323163544444$14232452232
245223223323163544444$1423
3163544444$142324522322332
3223323163544444$142324522
323163544444$1423245223223
3245223223323163544444$142
3323163544444$142324522322
3544444$142324522322332316
4$142324522322332316354444
423245223223323163544444$1
44$14232452232233231635444
444$1423245223223323163544 19=left
4444$142324522322332316354
44444$14232452232233231635
45223223323163544444$14232 22=right
5223223323163544444$142324
544444$1423245223223323163
63544444$14232452232233231

read which should map, but which does not:

overlap_end_allele1_and_rhs_flank
GCGATTT

Here's the entry in the kmer precalc file
4 4 4 |1|19 22 22 23 |18 24 25 27 ||5 @|

[19,22) corresponds to the correct interval in the BWM starting with 444=TTT
[22,23) is the interval caused by T5...5TT

Now, as I step through, starting at 444, thing make sense:

p kmer
$12 = std::vector of length 3, capacity 3 = {4 '\004', 4 '\004', 4 '\004'}
p sa_intervals
$13 = std::list = {[0] = {first = 19, second = 22}, [1] = {first = 22, second = 23}}

Then we step into bidir_search_bwd from map.cpp, we have kmer=444 and c=1

As we step in here

while (it!=sa_intervals.end() && it_rev!=sa_intervals_rev.end() && it_s!=sites.end()) { //calculate sum to return- can do this in top fcns => if (bidir_search(csa,(*it).first,(*it).second,(*it_rev).first,(*it_rev).second,c)>0)

we get into bidir_search.cpp with input args (left, right,c) = (19,22,1)

We get c_begin=1 - makes sense
Then:

//rank_l is the number of times c occurs in BWT[0,left]  - number of 1s in [0,19]                                                                                      

(gdb) p rank_l
$19 = 1 << correct, look at RH column of BWM above

// s is total number of times                                                                                                                  
//      chars smaller than c occur in BWT[left,right]                                                                                          

(gdb) p s
$20 = 0 <<<< correct

// b is total number of times                                                                                                                  
//      chars bigger than c occur in BWT[left,right] 

(gdb) p b
$21 = 3 <<< correct

Now this:

//rank_r is the number of times c occurs in BWT[0,right]                                                                                       
uint64_t rank_r = right - left - s - b + rank_l - 1;

p rank_r |
$28 = 0

If we look at RH column the BWM, this should be 1 - it certainly can't be less than rank_l which was 1

That I think must be wrong.

As a result of this

left = c_begin + rank_l;

(gdb) p left
$29 = (uint64_t &) @0x69f8f0: 2

right = c_begin + rank_r + 1;

(gdb) p right
$30 = (uint64_t &) @0x69f8f8: 2

This is because rank_r<rank_l

Back in bidir_search_bwd.cpp, we return here

                    while (it!=sa_intervals.end() && it_rev!=sa_intervals_rev.end() && it_s!=sites.end()) {
                      //calculate sum to return- can do this in top fcns                                                                       
                      if (bidir_search(csa,(*it).first,(*it).second,(*it_rev).first,(*it_rev).second,c)>0) {
                        ++it;
                        ++it_rev;
                        ++it_s;
                      }

That if condition fails, because bidir_search returns right-left=0.
So we never increment the iterator. We now have iterator.first=2, iterator.second=2 (left and right),
which we then erase
it=sa_intervals.erase(it);

And from then on there's no way the read can map. The next loop round, we start on the sa_interval [22,23) which is the false one from T5...5TT, which doesn't extend to allow the read to map.
Again we go into bidir_search
rank_l=1
rank_r=0
s=0
b=1
and that results in right=left=2

.. etc - the read fails to map.

Compiler warnings

precalc_gen.hpp: In function ‘void* worker(void_)’:
precalc_gen.hpp:84:1: warning: no return statement in function returning non-void [-Wreturn-type]
precalc_gen.hpp: In function ‘void read_precalc_kmers(std::string, sequence_map<std::vector, std::list<std::pair<long unsigned int, long unsigned int> > >&, sequence_map<std::vector, std::list<std::pair<long unsigned int, long unsigned int> > >&, sequence_map<std::vector, std::list<std::vector<std::pair<unsigned int, std::vector > > > >&, sequence_set<std::vector >&)’:
precalc_gen.hpp:201:48: warning: deprecated conversion from string constant to ‘char_’ [-Wwrite-strings]
precalc_gen.hpp:206:34: warning: deprecated conversion from string constant to ‘char_’ [-Wwrite-strings]
precalc_gen.hpp:211:54: warning: deprecated conversion from string constant to ‘char_’ [-Wwrite-strings]
precalc_gen.hpp:212:58: warning: deprecated conversion from string constant to ‘char_’ [-Wwrite-strings]
precalc_gen.hpp:213:33: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
precalc_gen.hpp:214:37: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
precalc_gen.hpp:230:33: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
precalc_gen.hpp:233:42: warning: deprecated conversion from string constant to ‘char_’ [-Wwrite-strings]
precalc_gen.hpp:234:56: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
precalc_gen.hpp:240:37: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
make: warning: Clock skew detected. Your build may be incomplete.
bash-4.1$

Add tests to linear PRG construction script

need more tests for the linear PRG construction script. Should check if site markers appear in increasing order, if even numbers are always enclosed by 2 odds and if there are exactly 2 occurences of each odd marker

deletion

need to add support for whole site deletion in the mapping function (i.e. 2 numbers next to each other)

Bug parsing human PRG with all 1000g variants (80 million)

There seems to be a bug in the actual parsing of the vBWT.

First - the human PRG was constructed like this

perl ~/dev/git/gramtools/utils/vcf_to_linear_prg.pl --vcf /Net/banyan/data0/users/zam/results/20150429_build_1000g_for_gramtools/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf --ref /Net/birch/data/zam/ref/hs/fasta/grc37/Homo_sapiens.GRCh37.60.dna.WHOLE_GENOME.fa --outfile human_prg_all
Finished printing linear PRG. Final number in alphabet is 158548838

This implies about 80million variants, which matches the input. Fine.

Then

pwd
/data3/projects/gramtools/wabi/human/vbwt_construction/ALL

bash-4.1$ time ~/dev/git/gramtools/src/vBWT /data2/users/zam/analyses/2016/0405_human_lin_prg_construction/human_prg_all.without_final_carriage_return csa dummy_reads.fa dummymask1 dummymask2 output_allelcovgs output_reads binaryprg memlog 19 dummykmers >& output

What does output show:


Thu May 5 10:17:41 2016
Start CSA construction
PRG size: 5266662401
Alphabet size: 117591703

PRG size is right, but alphabet size is wrong. Should not be 11million, but 15.85 million, as above.

So i can't remember all of my reasoning, but I thought this might have been a problem loading the PRG file.

There is no matching destructor for csa's constructor - mem leak

from valgrind

==55756== 688 bytes in 1 blocks are definitely lost in loss record 1 of 1
==55756== at 0x4A09320: malloc (vg_replace_malloc.c:263)
==55756== by 0x40DA4C: csa_constr(std::string, std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > >&, char_, char_, char*, bool) (csa_construction.cpp:33)

Move kmers indexing to build step

At the moment, if the .precalc file is not found, the SA-intervals for small kmers are precalculated at the beginning of mapping. Better to move this step as part of the --build process, as it only needs to be done once

Strange sa_intervals found in kmer.precalc for toy example

PRG is:

ATCGCT5CCGCCGGCGA6G5TTTTT

prg.mask_alleles is
0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 2 0 0 0 0 0 0

prg.mask_sites is
0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5 0 5 0 0 0 0 0 0

This is the list of kmers generated by Carlos's script using the -n option:

CTC
TCC
CCG
CGC
GCC
CGG
GGC
GCG
CGA
GAT
ATT
CTG
TGT
GTT
ATC
TCG
GCT
TTT

and this is the line of the precalc file corresponding to TTT

4 4 4 |1|19 22 22 23 |18 24 25 27 ||5 @|

So this thinks TTT can be found at interval (19,22) and (22,23)

here is the BWM of the PRG

5CCGCCGGCGA6G5TTTTTATCGCT
5TTTTTATCGCT5CCGCCGGCGA6G
6G5TTTTTATCGCT5CCGCCGGCGA
A6G5TTTTTATCGCT5CCGCCGGCG
ATCGCT5CCGCCGGCGA6G5TTTTT
CCGCCGGCGA6G5TTTTTATCGCT5
CCGGCGA6G5TTTTTATCGCT5CCG
CGA6G5TTTTTATCGCT5CCGCCGG
CGCCGGCGA6G5TTTTTATCGCT5C
CGCT5CCGCCGGCGA6G5TTTTTAT
CGGCGA6G5TTTTTATCGCT5CCGC
CT5CCGCCGGCGA6G5TTTTTATCG
G5TTTTTATCGCT5CCGCCGGCGA6
GA6G5TTTTTATCGCT5CCGCCGGC
GCCGGCGA6G5TTTTTATCGCT5CC
GCGA6G5TTTTTATCGCT5CCGCCG
GCT5CCGCCGGCGA6G5TTTTTATC
GGCGA6G5TTTTTATCGCT5CCGCC
T5CCGCCGGCGA6G5TTTTTATCGC
TATCGCT5CCGCCGGCGA6G5TTTT index 19
TCGCT5CCGCCGGCGA6G5TTTTTA index 20
TTATCGCT5CCGCCGGCGA6G5TTT index 21
TTTATCGCT5CCGCCGGCGA6G5TT index 22
TTTTATCGCT5CCGCCGGCGA6G5T index 23
TTTTTATCGCT5CCGCCGGCGA6G5 index 24

reminder, so you don't need to scroll
prg=ATCGCT5CCGCCGGCGA6G5TTTTT

The intervals (19,22) and (22,23) make no sense. Also bear in mind Sorina uses [a,b) intervals , so these should be [19,22) and [22,23). None of the rows 19-21 even start with TTT

100a/36a test is too slow

Sorina, I tried shortening the 100a.txt PRG to 36a.txt but it still does not run overnight, even with -O3.
Seems to me we'll get the same value from a shorter string - even down to 10a - otherwise the total number of substrings is just too big. OK with you? Otherwise, we could have a set of veryslow tests that run for weeks, but if we were doing that I might target other tests thanthis one

Add toy data to allow new user try out the code

It would be useful to have some tiny data included in the distribution to allow a new user to try out the code. E.g., this

perl utils/vcf_to_linear_prg.pl --outfile my_species.prg --vcf big.vcf --ref reference.fa

would refer to some small VCF and reference included in the git repo.

Crash on reading PRG with no variation

cat zam
AACGTT

cat zam.all.kmers
AA
AC
CG
GT
TT

cat zam.mask_alleles
0 0 0 0 0 0
cat zam.mask_sites
0 0 0 0 0 0

/home/zam/dev/git/gramtools/src/gramtools --prg zam5 --csa out/csa --ps zam.
mask_sites --pa zam.mask_alleles --co out/covg --ro out/read --po out/bin -
-log out/memlog --ksize 2 --kfile zam.all.kmers --input read.fa


Mon Jul 25 23:37:15 2016
Start CSA construction
PRG size: 6
Alphabet size: 0


Mon Jul 25 23:37:16 2016
End CSA construction

6 6
Segmentation fault

Run through valgrind:


Mon Jul 25 23:38:00 2016
End CSA construction

6 6
==46905== Invalid read of size 8
==46905== at 0x408BCC: std::vector<int, std::allocator >::size() const (stl_vector.h:626)
==46905== by 0x417702: parse_masks(std::vector<unsigned long, std::allocator >&, std::vector<int, std::allocator >&, std::string, std::string, std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > >&) (parse_masks.cpp:57)
==46905== by 0x4422AA: main (map.cpp:146)
==46905== Address 0xfffffffffffffff0 is not stack'd, malloc'd or (recently) free'd
==46905==
==46905==
==46905== Process terminating with default action of signal 11 (SIGSEGV)
==46905== Access not within mapped region at address 0xFFFFFFFFFFFFFFF0
==46905== at 0x408BCC: std::vector<int, std::allocator >::size() const (stl_vector.h:626)
==46905== by 0x417702: parse_masks(std::vector<unsigned long, std::allocator >&, std::vector<int, std::allocator >&, std::string, std::string, std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > >&) (parse_masks.cpp:57)
==46905== by 0x4422AA: main (map.cpp:146)

Run through gdb and we see this

Program received signal SIGSEGV, Segmentation fault.
0x0000000000408bcc in std::vector<int, std::allocator >::size (
this=0xffffffffffffffe8)
at /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/bits/stl_vector.h:626
626 { return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }
(gdb) bt
#0 0x0000000000408bcc in std::vector<int, std::allocator >::size (

this=0xffffffffffffffe8)
at /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/bits/stl_vector.h:626

#1 0x0000000000417703 in parse_masks (

mask_s=std::vector of length 6, capacity 8 = {...}, 
mask_a=std::vector of length 6, capacity 8 = {...}, sites_fname=
"/data2/users/zam/analyses/2016/0720_debug_gramtools/zam5.mask_sites", 
alleles_fname=
"/data2/users/zam/analyses/2016/0720_debug_gramtools/zam5.mask_alleles", 
covgs=std::vector of length 0, capacity 0) at ./parse_masks.cpp:57

#2 0x00000000004422ab in main (argc=23, argv=0x7fffffffcaa8)

at ./map.cpp:146

(gdb) frame 2
#2 0x00000000004422ab in main (argc=23, argv=0x7fffffffcaa8)

at ./map.cpp:146

146 uint64_t maxx=parse_masks(mask_s,mask_a,_sitemask,_allmask,covgs);
(gdb) p mask_s
$1 = std::vector of length 6, capacity 8 = {0, 0, 0, 0, 0, 0}
(gdb) p mask_a
$2 = std::vector of length 6, capacity 8 = {0, 0, 0, 0, 0, 0}
(gdb) p _sitemask
$3 = "/data2/users/zam/analyses/2016/0720_debug_gramtools/zam5.mask_sites"
(gdb) p covgs
$4 = std::vector of length 0, capacity 0
(gdb) frame 1
#1 0x0000000000417703 in parse_masks (

mask_s=std::vector of length 6, capacity 8 = {...}, 
mask_a=std::vector of length 6, capacity 8 = {...}, sites_fname=
"/data2/users/zam/analyses/2016/0720_debug_gramtools/zam5.mask_sites", 
alleles_fname=
"/data2/users/zam/analyses/2016/0720_debug_gramtools/zam5.mask_alleles", 
covgs=std::vector of length 0, capacity 0) at ./parse_masks.cpp:57

57 cout<<"Covgs dim"<<covgs.size()<<" "<<covgs.front().size()<<" "<<covgs.back().size()<<endl;
(gdb) l
52 mask_a.push_back(0);
53 }*/
54
55 cout<<endl<<mask_s.size()<<" "<<mask_a.size()<<endl;
56
57 cout<<"Covgs dim"<<covgs.size()<<" "<<covgs.front().size()<<" "<<covgs.back().size()<<endl;
58 return(no_sites+1); //no_sites is last odd number in mask_sites, but alphabet size is the even number corresponding to it
59 }
(gdb) p covgs.size()
$5 = 0
(gdb) p covgs.front()
$6 = <error reading variable: Cannot access memory at address 0x8>
(gdb)

ie annoying thing - crashes if you call front() on an empty covgs array trying to print this

cout<<"Covgs dim"<<covgs.size()<<" "<<covgs.front().size()<<" "<<covgs.back().size()<<endl;

kmers in ref

Do we really need the kmers_in_ref hash? is kmers_in_ref[kmer] not equivalent with kmers_sites[kmer].empty() ?

Valgrind finds Invalid read (means bad memory access) in BackwardSearchTest.NoVariants1

At this commit:

[master a7c1d1e] Fixed bugs in references to data files, and fixed buggy mask input file.

I ran valgrind on the fast unit tests:

valgrind --leak-check=yes ./unittest_bidir_search_bwd_fwd

[==========] Running 9 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 9 tests from BackwardSearchTest
[ RUN ] BackwardSearchTest.NoVariants1
==78121== Invalid read of size 1
==78121== at 0x43547E: bidir_search_fwd(sdsl::csa_wt<sdsl::wt_int<sdsl::int_vector<(unsigned char)1>, sdsl::rank_support_v5<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, 2u, 16777216u, sdsl::sa_order_sa_sampling<(unsigned char)0>, sdsl::isa_sampling<(unsigned char)0>, sdsl::int_alphabet<sdsl::sd_vector<sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, sdsl::rank_support_sd<(unsigned char)1, sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, sdsl::select_support_sd<(unsigned char)1, sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >, sdsl::int_vector<(unsigned char)0> > >&, unsigned long, unsigned long, unsigned long, unsigned long, __gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator > >, __gnu_cxx::_normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator > >, std::list<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long> > >&, std::list<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long> > >&, std::list<std::vector<std::pair<unsigned int, std::vector<int, std::allocator > >, std::allocator<std::pair<unsigned int, std::vector<int, std::allocator > > > >, std::allocator<std::vector<std::pair<unsigned int, std::vector<int, std::allocator > >, std::allocator<std::pair<unsigned int, std::vector<int, std::allocator > > > > > >&, std::vector<int, std::allocator >&, unsigned long, bool&, bool) (bidir_search_fwd.cpp:134)
==78121== by 0x4360AE: BackwardSearchTest_NoVariants1_Test::TestBody() (unittest_bidir_search_bwd_fwd.cpp:78)
==78121== by 0x4815C7: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test
, void (testing::Test::)(), char const) (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x47CA91: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test_, void (testing::Test::)(), char const) (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x464042: testing::Test::Run() (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x46481F: testing::TestInfo::Run() (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x464EB9: testing::TestCase::Run() (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x46B67F: testing::internal::UnitTestImpl::RunAllTests() (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x4828C7: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl_, bool (testing::internal::UnitTestImpl::)(), char const) (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x47D7C9: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl_, bool (testing::internal::UnitTestImpl::)(), char const) (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x46A2A5: testing::UnitTest::Run() (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x43F343: RUN_ALL_TESTS() (gtest.h:2288)
==78121== Address 0x5341b61 is 0 bytes after a block of size 1 alloc'd
==78121== at 0x4A08EB6: operator new(unsigned long) (vg_replace_malloc.c:287)
==78121== by 0x42981B: gnu_cxx::new_allocator::allocate(unsigned long, void const) (new_allocator.h:94)
==78121== by 0x427BE2: std::Vector_base<unsigned char, std::allocator >::M_allocate(unsigned long) (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x440955: void std::vector<unsigned char, std::allocator >::M_emplace_back_aux(unsigned char&&) (vector.tcc:402)
==78121== by 0x440186: void std::vector<unsigned char, std::allocator >::emplace_back(unsigned char&&) (vector.tcc:102)
==78121== by 0x43F95B: std::vector<unsigned char, std::allocator >::push_back(unsigned char&&) (stl_vector.h:900)
==78121== by 0x4359CB: BackwardSearchTest_NoVariants1_Test::TestBody() (unittest_bidir_search_bwd_fwd.cpp:59)
==78121== by 0x4815C7: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test
, void (testing::Test::
)(), char const
) (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x47CA91: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test
, void (testing::Test::_)(), char const*) (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x464042: testing::Test::Run() (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x46481F: testing::TestInfo::Run() (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121== by 0x464EB9: testing::TestCase::Run() (in /Net/fs1/home/zam/dev/git/gramtools/src/unittest_bidir_search_bwd_fwd)
==78121==
[ OK ] BackwardSearchTest.NoVariants1 (64701 ms)
[ RUN ] BackwardSearchTest.OneSNP
[ OK ] BackwardSearchTest.OneSNP (34704 ms)
..etc

The bad access happens here

`
while (it!=sa_intervals.end() && it_rev!=sa_intervals_rev.end() && it_s!=sites.end()) {

                    //calculate sum to return- can do this in top fcns                                                                         

                    if (bidir_search(csa_rev,(*it).first,(*it).second,(*it_rev).first,(*it_rev).second,c)>0) {

                            ++it;

                            ++it_rev;

                            ++it_s;
                    }
                    else {
                            if (it==sa_intervals.begin()) first_del=true;
                            //might need to see first_del from top fcns to check if there are matches in the reference                         
                            it=sa_intervals.erase(it);
                            it_rev=sa_intervals_rev.erase(it_rev);
                            it_s=sites.erase(it_s);
                    }
            }
            ++pat_it;  
            c=*pat_it;   <<<<<<<<<<<< HERE!!

`

gramtools no longer compiles on refactor_bidir_bwd

I get this

bash-4.1$ make
Scanning dependencies of target gram
make[2]: Warning: File CMakeFiles/gram.dir/depend.make' has modification time 39 s in the future [ 4%] Building CXX object CMakeFiles/gram.dir/src/map.cpp.o [ 8%] Building CXX object CMakeFiles/gram.dir/src/kmers.cpp.o [ 12%] Building CXX object CMakeFiles/gram.dir/src/bidir_search_bwd.cpp.o [ 16%] Building CXX object CMakeFiles/gram.dir/src/bidir_search_fwd.cpp.o [ 20%] Building CXX object CMakeFiles/gram.dir/src/skip.cpp.o [ 25%] Building CXX object CMakeFiles/gram.dir/src/bidir_search.cpp.o [ 29%] Building CXX object CMakeFiles/gram.dir/src/masks.cpp.o [ 33%] Building CXX object CMakeFiles/gram.dir/src/fm_index.cpp.o [ 37%] Building CXX object CMakeFiles/gram.dir/src/variants.cpp.o [ 41%] Building CXX object CMakeFiles/gram.dir/src/ranks.cpp.o Linking CXX static library libgram.a make[2]: warning: Clock skew detected. Your build may be incomplete. [ 41%] Built target gram Scanning dependencies of target git_version_py [ 41%] Built target git_version_py Scanning dependencies of target git_version [ 41%] Built target git_version Scanning dependencies of target gramtools make[2]: Warning: File CMakeFiles/gramtools.dir/depend.make' has modification time 39 s in the future
[ 45%] Building CXX object CMakeFiles/gramtools.dir/src/main.cpp.o
In file included from /home/zam/dev/git/gramtools/src/main.cpp:11:0:
/home/zam/dev/git/gramtools/include/git_version/git_version.hpp:1:1: error: stray ‘\’ in program
compilation terminated due to -Wfatal-errors.
make[2]: *** [CMakeFiles/gramtools.dir/src/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/gramtools.dir/all] Error 2
make: *** [all] Error 2
bash-4.1$

Tried this yesterday after git pull, and today I deleted the entire dir and recloned - same issue

git log | head -n 10
commit 0e07f66
Author: Robyn Ffrancon [email protected]
Date: Wed Jun 28 19:35:05 2017 +0100

stdout stored in quasimap run report.json

Move fm-index construction to build step

the fm-index construction function should be moved to the --build step, after the generation of PRG and kmers. The function already dumps the fm-index data structure to a file, so it should just be loaded from that file during the --quasimap step. Especially important for human data, where the fm-index construction takes a few hours

Fails to map/match to a PRG with no variation

Just reinstated the unit tests with 1byte and 100a's, both with no variation. In both cases, all reads/strings fail to match

Commit: 32b286f

./unittest_bidir_search_bwd_fwd
[==========] Running 9 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 9 tests from BackwardSearchTest
[ RUN ] BackwardSearchTest.NoVariants1
PRG size: 1
Alphabet size: 0
./test/unittest_bidir_search_bwd_fwd.cpp:76: Failure
Value of: occ_expt
Actual: 0
Expected: no_occ
Which is: 1
PRG size: 1
Alphabet size: 0
./test/unittest_bidir_search_bwd_fwd.cpp:89: Failure
Value of: occ_expt
Actual: 0
Expected: no_occ
Which is: 1
[ FAILED ] BackwardSearchTest.NoVariants1 (1663 ms)
[ RUN ] BackwardSearchTest.NoVariants2
PRG size: 100
Alphabet size: 0
./test/unittest_bidir_search_bwd_fwd.cpp:152: Failure
Value of: occ_expt
Actual: 0
Expected: no_occ
Which is: 100
...
etc

Pointing gramtools at current SDSL results in assertion failure in unit tests

At commit 53598fc

Unit tests pass pointing at Sorina's SDSL work fine:

./unittest_bidir_search_bwd_fwd
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from BackwardSearchTest
[ RUN ] BackwardSearchTest.OneSNP
PRG size: 27
Alphabet size: 5
PRG size: 27
Alphabet size: 5
[ OK ] BackwardSearchTest.OneSNP (1044 ms)
[ RUN ] BackwardSearchTest.TwoSNPs
PRG size: 34
Alphabet size: 7
PRG size: 34
Alphabet size: 7
[ OK ] BackwardSearchTest.TwoSNPs (722 ms)
[ RUN ] BackwardSearchTest.Two_matches_one_variable_one_nonvariable_region
PRG size: 39
Alphabet size: 5
PRG size: 39
Alphabet size: 5
[ OK ] BackwardSearchTest.Two_matches_one_variable_one_nonvariable_region (599 ms)
[ RUN ] BackwardSearchTest.Two_long_sites
PRG size: 86
Alphabet size: 7
PRG size: 86
Alphabet size: 7
[ OK ] BackwardSearchTest.Two_long_sites (483 ms)
[ RUN ] BackwardSearchTest.Match_within_long_site_match_outside
PRG size: 97
Alphabet size: 7
PRG size: 97
Alphabet size: 7
[ OK ] BackwardSearchTest.Match_within_long_site_match_outside (484 ms)
[ RUN ] BackwardSearchTest.Multiple_matches_over_multiple_sites
PRG size: 93
Alphabet size: 7
PRG size: 93
Alphabet size: 7
[ OK ] BackwardSearchTest.Multiple_matches_over_multiple_sites (434 ms)
[ RUN ] BackwardSearchTest.One_match_many_sites
PRG size: 76
Alphabet size: 15
PRG size: 76
Alphabet size: 15
[ OK ] BackwardSearchTest.One_match_many_sites (443 ms)
[----------] 7 tests from BackwardSearchTest (4209 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (4209 ms total)
[ PASSED ] 7 tests.

If I point instead at current tip of SDSL:

then we hit an assert in skip()

Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from BackwardSearchTest
[ RUN ] BackwardSearchTest.OneSNP
PRG size: 27
Alphabet size: 5
unittest_bidir_search_bwd_fwd: ./skip.cpp:111: bool skip(sdsl::csa_wtsdsl::wt_int<sdsl::int_vector<1u, sdsl::rank_support_v5<> >, 2u, 16777216u>&, uint64_t&, uint64_t&, uint64_t&, uint64_t&, uint32_t): Assertion `right>left' failed.
Aborted

Install/compile fails

git clone --recursive [email protected]:iqbal-lab/gramtools.git

bash install.sh

Initialized empty Git repository in /Net/fs1/home/zam/dev/git/gramtools/sdsl-lite/.git/
remote: Counting objects: 12888, done.
remote: Total 12888 (delta 0), reused 0 (delta 0), pack-reused 12888
Receiving objects: 100% (12888/12888), 9.01 MiB | 4.09 MiB/s, done.
Resolving deltas: 100% (7357/7357), done.
Submodule 'external/googletest' (https://chromium.googlesource.com/external/googletest) registered for path 'external/googletest'
Submodule 'external/libdivsufsort' (https://github.com/simongog/libdivsufsort.git) registered for path 'external/libdivsufsort'
Initialized empty Git repository in /Net/fs1/home/zam/dev/git/gramtools/sdsl-lite/external/googletest/.git/
remote: Total 4340 (delta 3230), reused 4340 (delta 3230)
Receiving objects: 100% (4340/4340), 1.69 MiB, done.
Resolving deltas: 100% (3230/3230), done.
Submodule path 'external/googletest': checked out '8245545b6dc9c4703e6496d1efd19e975ad2b038'
Initialized empty Git repository in /Net/fs1/home/zam/dev/git/gramtools/sdsl-lite/external/libdivsufsort/.git/
remote: Counting objects: 209, done.
remote: Total 209 (delta 0), reused 0 (delta 0), pack-reused 209
Receiving objects: 100% (209/209), 93.53 KiB, done.
Resolving deltas: 100% (133/133), done.
Submodule path 'external/libdivsufsort': checked out '0f24acd8de208464769c782119dacf158647f7ed'
Library will be installed in '/Net/fs1/home/zam'
Copy pre-commit into .git/hooks
-- The CXX compiler identification is GNU 4.7.2
-- The C compiler identification is GNU 4.7.2
-- Check for working CXX compiler: /opt/centos/devtoolset-1.1/root/usr/bin/c++
-- Check for working CXX compiler: /opt/centos/devtoolset-1.1/root/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /opt/centos/devtoolset-1.1/root/usr/bin/cc
-- Check for working C compiler: /opt/centos/devtoolset-1.1/root/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Compiler is recent enough to support C++11.
-- Performing Test HAVE_GCC_STD=C__11__WALL__WEXTRA___DNDEBUG
-- Performing Test HAVE_GCC_STD=C__11__WALL__WEXTRA___DNDEBUG - Success
-- Performing Test HAVE_GCC_O3__FFAST_MATH__FUNROLL_LOOPS
-- Performing Test HAVE_GCC_O3__FFAST_MATH__FUNROLL_LOOPS - Success
-- Performing Test HAVE_GCC_MSSE4.2
-- Performing Test HAVE_GCC_MSSE4.2 - Success
CMake Warning (dev) at external/libdivsufsort/CMakeLists.txt:7 (project):
Policy CMP0048 is not set: project() command manages VERSION variables.
Run "cmake --help-policy CMP0048" for policy details. Use the cmake_policy
command to set the policy and suppress this warning.

The following variable(s) would be set to empty:

PROJECT_VERSION
PROJECT_VERSION_MAJOR
PROJECT_VERSION_MINOR
PROJECT_VERSION_PATCH

This warning is for project developers. Use -Wno-dev to suppress it.

-- Performing Test HAVE_GCC_WALL
-- Performing Test HAVE_GCC_WALL - Success
-- Performing Test HAVE_GCC_FOMIT_FRAME_POINTER
-- Performing Test HAVE_GCC_FOMIT_FRAME_POINTER - Success
-- Looking for inttypes.h
-- Looking for inttypes.h - found
-- Looking for memory.h
-- Looking for memory.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stdlib.h
-- Looking for stdlib.h - found
-- Looking for string.h
-- Looking for string.h - found
-- Looking for strings.h
-- Looking for strings.h - found
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Performing Test HAVE_INLINE
-- Performing Test HAVE_INLINE - Success
-- Performing Test HAVE___INLINE
-- Performing Test HAVE___INLINE - Success
-- Performing Test HAVE___INLINE__
-- Performing Test HAVE___INLINE__ - Success
-- Performing Test HAVE___DECLSPEC_DLLEXPORT_
-- Performing Test HAVE___DECLSPEC_DLLEXPORT_ - Failed
-- Performing Test HAVE___DECLSPEC_DLLIMPORT_
-- Performing Test HAVE___DECLSPEC_DLLIMPORT_ - Failed
-- Check size of uint8_t
-- Check size of uint8_t - done
-- Check size of int32_t
-- Check size of int32_t - done
-- Looking for PRId32
-- Looking for PRId32 - found
-- Check size of int64_t
-- Check size of int64_t - done
-- Looking for PRId64
-- Looking for PRId64 - found
CMake Warning (dev) at external/googletest/CMakeLists.txt:42 (project):
Policy CMP0048 is not set: project() command manages VERSION variables.
Run "cmake --help-policy CMP0048" for policy details. Use the cmake_policy
command to set the policy and suppress this warning.

The following variable(s) would be set to empty:

PROJECT_VERSION
PROJECT_VERSION_MAJOR
PROJECT_VERSION_MINOR
PROJECT_VERSION_PATCH

This warning is for project developers. Use -Wno-dev to suppress it.

-- Found PythonInterp: /apps/well/python/2.7.7-venv/bin/python (found version "2.7.7")
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Looking for C++ include cstdio
-- Looking for C++ include cstdio - found
-- Found Doxygen: /usr/bin/doxygen (found version "1.6.1")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/zam/dev/git/gramtools/sdsl-lite/build
make: Warning: File Makefile' has modification time 71 s in the future make[1]: Warning: FileCMakeFiles/Makefile2' has modification time 78 s in the future
make[2]: Warning: File CMakeFiles/Makefile2' has modification time 78 s in the future make[3]: Warning: Filelib/CMakeFiles/sdsl.dir/flags.make' has modification time 72 s in the future
Scanning dependencies of target sdsl
make[3]: warning: Clock skew detected. Your build may be incomplete.
make[3]: Warning: File `lib/CMakeFiles/sdsl.dir/flags.make' has modification time 71 s in the future
[ 0%] Building CXX object lib/CMakeFiles/sdsl.dir/bits.cpp.o
[ 0%] Building CXX object lib/CMakeFiles/sdsl.dir/bp_support_algorithm.cpp.o
[ 0%] Building CXX object lib/CMakeFiles/sdsl.dir/coder_elias_delta.cpp.o
[ 14%] Building CXX object lib/CMakeFiles/sdsl.dir/coder_elias_gamma.cpp.o
[ 14%] Building CXX object lib/CMakeFiles/sdsl.dir/coder_fibonacci.cpp.o
[ 14%] Building CXX object lib/CMakeFiles/sdsl.dir/config.cpp.o
[ 28%] Building CXX object lib/CMakeFiles/sdsl.dir/construct_config.cpp.o
[ 28%] Building CXX object lib/CMakeFiles/sdsl.dir/construct_isa.cpp.o
[ 28%] Building CXX object lib/CMakeFiles/sdsl.dir/construct_lcp.cpp.o
In file included from /home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_huff.hpp:25:0,
from /home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/construct_lcp.hpp:33,
from /home/zam/dev/git/gramtools/sdsl-lite/lib/construct_lcp.cpp:6:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp: In member function ‘std::array<std::vector<std::array<long unsigned int, 2ul> >, 2ul> sdsl::wt_pc<t_shape, t_bitvector, t_rank, t_select, t_select_zero, t_tree_strat>::expand(const node_type&, const range_vec_type&) const’:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:744:52: error: return type ‘struct std::array<std::vector<std::array<long unsigned int, 2ul> >, 2ul>’ is incomplete
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp: In member function ‘std::array<std::vector<std::array<long unsigned int, 2ul> >, 2ul> sdsl::wt_pc<t_shape, t_bitvector, t_rank, t_select, t_select_zero, t_tree_strat>::expand(const node_type&, sdsl::range_vec_type&&) const’:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:761:47: error: return type ‘struct std::array<std::vector<std::array<long unsigned int, 2ul> >, 2ul>’ is incomplete
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:766:67: error: no match for ‘operator[]’ in ‘r[0]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:767:67: error: no match for ‘operator[]’ in ‘r[1]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:769:39: error: no match for ‘operator[]’ in ‘r[1]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:769:44: error: no match for ‘operator[]’ in ‘r[0]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:772:36: error: no match for ‘operator[]’ in ‘r[0]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp: In member function ‘std::array<std::array<long unsigned int, 2ul>, 2ul> sdsl::wt_pc<t_shape, t_bitvector, t_rank, t_select, t_select_zero, t_tree_strat>::expand(const node_type&, const range_type&) const’:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:791:63: error: return type ‘struct std::array<std::array<long unsigned int, 2ul>, 2ul>’ is incomplete
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:793:63: error: no match for ‘operator[]’ in ‘r[0]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:794:63: error: no match for ‘operator[]’ in ‘r[1]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:796:35: error: no match for ‘operator[]’ in ‘r[1]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:796:40: error: no match for ‘operator[]’ in ‘r[0]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_pc.hpp:799:32: error: no match for ‘operator[]’ in ‘r[0]’
In file included from /home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/construct_lcp.hpp:34:0,
from /home/zam/dev/git/gramtools/sdsl-lite/lib/construct_lcp.cpp:6:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp: In lambda function:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:72:36: error: no match for ‘operator[]’ in ‘r[1]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:72:41: error: no match for ‘operator[]’ in ‘r[0]’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp: In function ‘void sdsl::_interval_symbols_rec(const t_wt&, sdsl::range_type, typename t_wt::size_type&, std::vector&, std::vector&, std::vector&, const typename t_wt::node_type&)’:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:129:1: error: ‘r’ has incomplete type
In file included from /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/functional:56:0,
from /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/memory:81,
from /home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/structure_tree.hpp:13,
from /home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/int_vector.hpp:25,
from /home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/construct_lcp.hpp:26,
from /home/zam/dev/git/gramtools/sdsl-lite/lib/construct_lcp.cpp:6:
/opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/tuple:857:42: error: declaration of ‘sdsl::range_type {aka struct std::array<long unsigned int, 2ul>}’
In file included from /home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/construct_lcp.hpp:34:0,
from /home/zam/dev/git/gramtools/sdsl-lite/lib/construct_lcp.cpp:6:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp: In instantiation of ‘void sdsl::_interval_symbols(const t_wt&, typename t_wt::size_type, typename t_wt::size_type, typename t_wt::size_type&, std::vector&, std::vector&, std::vector&) [with t_wt = sdsl::wt_pc<sdsl::huff_shape, sdsl::int_vector<1u>, sdsl::rank_support_v<1u, 1u>, sdsl::select_support_scan<1u>, sdsl::select_support_scan<0u>, sdsl::byte_tree<> >; typename t_wt::size_type = long unsigned int; typename t_wt::value_type = unsigned char; typename t_wt::size_type = long unsigned int]’:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:216:9: required from ‘void sdsl::interval_symbols(const t_wt&, typename t_wt::size_type, typename t_wt::size_type, typename t_wt::size_type&, std::vector&, std::vector&, std::vector&) [with t_wt = sdsl::wt_pc<sdsl::huff_shape, sdsl::int_vector<1u>, sdsl::rank_support_v<1u, 1u>, sdsl::select_support_scan<1u>, sdsl::select_support_scan<0u>, sdsl::byte_tree<> >; typename t_wt::size_type = long unsigned int; typename t_wt::value_type = unsigned char; typename t_wt::size_type = long unsigned int]’
/home/zam/dev/git/gramtools/sdsl-lite/lib/construct_lcp.cpp:681:68: required from here
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:175:9: error: no matching function for call to ‘_interval_symbols_rec(const sdsl::wt_pc<sdsl::huff_shape, sdsl::int_vector<1u>, sdsl::rank_support_v<1u, 1u>, sdsl::select_support_scan<1u>, sdsl::select_support_scan<0u>, sdsl::byte_tree<> >&, , sdsl::wt_pc<sdsl::huff_shape, sdsl::int_vector<1u>, sdsl::rank_support_v<1u, 1u>, sdsl::select_support_scan<1u>, sdsl::select_support_scan<0u>, sdsl::byte_tree<> >::size_type&, std::vector&, std::vector&, std::vector&, sdsl::wt_pc<sdsl::huff_shape, sdsl::int_vector<1u>, sdsl::rank_support_v<1u, 1u>, sdsl::select_support_scan<1u>, sdsl::select_support_scan<0u>, sdsl::byte_tree<> >::node_type)’
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:175:9: note: candidate is:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:129:1: note: template void sdsl::_interval_symbols_rec(const t_wt&, sdsl::range_type, typename t_wt::size_type&, std::vector&, std::vector&, std::vector&, const typename t_wt::node_type&)
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:129:1: note: template argument deduction/substitution failed:
/home/zam/dev/git/gramtools/sdsl-lite/build/lib/../include/sdsl/wt_algorithm.hpp:175:9: note: cannot convert ‘{i, (j + 18446744073709551615ul)}’ (type ‘’) to type ‘sdsl::range_type {aka std::array<long unsigned int, 2ul>}’
make[3]: *** [lib/CMakeFiles/sdsl.dir/construct_lcp.cpp.o] Error 1
make[2]: *** [lib/CMakeFiles/sdsl.dir/all] Error 2
make[1]: *** [lib/CMakeFiles/sdsl.dir/rule] Error 2
make: *** [sdsl] Error 2
ERROR: Build failed.
Initialized empty Git repository in /Net/fs1/home/zam/dev/git/gramtools/sdsl-lite/googletest/.git/
remote: Counting objects: 7145, done.
remote: Total 7145 (delta 0), reused 0 (delta 0), pack-reused 7145
Receiving objects: 100% (7145/7145), 2.61 MiB | 1.37 MiB/s, done.
Resolving deltas: 100% (5307/5307), done.
bash-4.1$

Ranks of rare characters

Remember when Simon Gog was saying this?

'Frequent symbols will have a O(log(log(sigma))) access/rank/select. The worst case is still log(sigma) for rare symbols (although we could plug in the wt_gmr to solve this).
But also note that the log(log(sigma)) solution makes 2 or 3 accesses to WTs of hight log(log(simga)), while the factor is just one for the traditional log(simga) WTs...
Experiments on a IR collection showed that wt_ap outperform traditional WTs.
You can also have a look in the paper of Barbay, Claude, Gagie, and Navarro for experiments with alphabet partitioned WTs or use the WT benchmark of this library ;) '

I know he's talking about the wt_ap here, but I wonder if there are similar differences between frequent and rare symbols in the standard wt_int. I realised we never need ranks of rare symbols (i.e. non ACGT)
because we can get their position in the leftmost column of the BW matrix from the ordering. We could probably leverage this into better performance, but would need to ask Simon Gog.

Store bitmask over BWT to file during build step

The bit-vector mask over the BWT can also be calculated together with the fm-index as part of the --build step and stored to a file in the cache dir. I think sdsl makes it easy to store its data structures.

Store ranks over BWT to file during build step

In the ranks branch, gramtools precalculates the ranks array, but doesn't store it to file, so this step is made redundantly when mapping each sample. Should read in from file if the precalculation was already made and as with the kmers, the rank precalc step should be moved from --quasimap to --build and result stored in the cache dir.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.