Code Monkey home page Code Monkey logo

pandora's People

Contributors

ffranr avatar iqbal-lab avatar leoisl avatar mbhall88 avatar rmcolq avatar rob-p avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pandora's Issues

Boost linker error

Boost linker error when compiling ca57d6c.

/usr/local/bin/cmake --build /home/rffrancon/Documents/pandora/cmake-build-debug --target pandora -- -j 2
[ 20%] Built target boost
Scanning dependencies of target pandora
[ 25%] Building CXX object CMakeFiles/pandora.dir/src/check_kmergraph_main.cpp.o
[ 25%] Building CXX object CMakeFiles/pandora.dir/src/compare_main.cpp.o
[ 25%] Building CXX object CMakeFiles/pandora.dir/src/estimate_parameters.cpp.o
[ 28%] Building CXX object CMakeFiles/pandora.dir/src/fastaq_handler.cpp.o
[ 30%] Building CXX object CMakeFiles/pandora.dir/src/get_vcf_ref_main.cpp.o
[ 33%] Building CXX object CMakeFiles/pandora.dir/src/index.cpp.o
[ 35%] Building CXX object CMakeFiles/pandora.dir/src/index_main.cpp.o
/home/rffrancon/Documents/pandora/src/index_main.cpp: In function ‘int pandora_index(int, char**)’:
/home/rffrancon/Documents/pandora/src/index_main.cpp:54:10: warning: variable ‘update’ set but not used [-Wunused-but-set-variable]
     bool update = false;
          ^~~~~~
[ 38%] Building CXX object CMakeFiles/pandora.dir/src/interval.cpp.o
[ 41%] Building CXX object CMakeFiles/pandora.dir/src/inthash.cpp.o
[ 41%] Building CXX object CMakeFiles/pandora.dir/src/kmergraph.cpp.o
[ 43%] Building CXX object CMakeFiles/pandora.dir/src/kmernode.cpp.o
[ 46%] Building CXX object CMakeFiles/pandora.dir/src/localPRG.cpp.o
[ 48%] Building CXX object CMakeFiles/pandora.dir/src/localgraph.cpp.o
[ 51%] Building CXX object CMakeFiles/pandora.dir/src/localnode.cpp.o
[ 53%] Building CXX object CMakeFiles/pandora.dir/src/main.cpp.o
[ 56%] Building CXX object CMakeFiles/pandora.dir/src/map_main.cpp.o
[ 56%] Building CXX object CMakeFiles/pandora.dir/src/minihit.cpp.o
[ 58%] Building CXX object CMakeFiles/pandora.dir/src/minihits.cpp.o
[ 61%] Building CXX object CMakeFiles/pandora.dir/src/minimizer.cpp.o
[ 64%] Building CXX object CMakeFiles/pandora.dir/src/minirecord.cpp.o
[ 66%] Building CXX object CMakeFiles/pandora.dir/src/noise_filtering.cpp.o
[ 69%] Building CXX object CMakeFiles/pandora.dir/src/path.cpp.o
[ 71%] Building CXX object CMakeFiles/pandora.dir/src/seq.cpp.o
[ 71%] Building CXX object CMakeFiles/pandora.dir/src/utils.cpp.o
/home/rffrancon/Documents/pandora/src/utils.cpp: In function ‘void infer_localPRG_order_for_reads(const std::vector<LocalPRG*>&, MinimizerHits*, pangenome::Graph*, int, const uint&, const float&, uint, uint)’:
/home/rffrancon/Documents/pandora/src/utils.cpp:415:69: warning: unused parameter ‘genome_size’ [-Wunused-parameter]
                                     const int max_diff, const uint &genome_size, const float &scale_cluster_size,
                                                                     ^~~~~~~~~~~
[ 74%] Building CXX object CMakeFiles/pandora.dir/src/vcf.cpp.o
[ 76%] Building CXX object CMakeFiles/pandora.dir/src/vcfrecord.cpp.o
[ 79%] Building CXX object CMakeFiles/pandora.dir/src/walk_main.cpp.o
[ 82%] Building CXX object CMakeFiles/pandora.dir/src/de_bruijn/graph.cpp.o
[ 84%] Building CXX object CMakeFiles/pandora.dir/src/de_bruijn/node.cpp.o
[ 87%] Building CXX object CMakeFiles/pandora.dir/src/de_bruijn/ns.cpp.o
[ 87%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/ns.cpp.o
[ 89%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/pangraph.cpp.o
[ 92%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/pannode.cpp.o
/home/rffrancon/Documents/pandora/src/pangenome/pannode.cpp:189:15: warning: ‘std::ostream& pangenome::operator<<(std::ostream&, const pangenome::Node&)’ has not been declared within pangenome
 std::ostream &pangenome::operator<<(std::ostream &out, pangenome::Node const &n) {
               ^~~~~~~~~
In file included from /home/rffrancon/Documents/pandora/src/pangenome/pannode.cpp:5:0:
/home/rffrancon/Documents/pandora/include/pangenome/pannode.h:43:26: note: only here as a friend
     friend std::ostream &operator<<(std::ostream &out, const Node &n);
                          ^~~~~~~~
/home/rffrancon/Documents/pandora/src/pangenome/pangraph.cpp:481:15: warning: ‘std::ostream& pangenome::operator<<(std::ostream&, const pangenome::Graph&)’ has not been declared within pangenome
 std::ostream &pangenome::operator<<(std::ostream &out, pangenome::Graph const &m) {
               ^~~~~~~~~
In file included from /home/rffrancon/Documents/pandora/include/utils.h:12:0,
                 from /home/rffrancon/Documents/pandora/src/pangenome/pangraph.cpp:12:
/home/rffrancon/Documents/pandora/include/pangenome/pangraph.h:77:26: note: only here as a friend
     friend std::ostream &operator<<(std::ostream &out, const Graph &m);
                          ^~~~~~~~
[ 94%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/panread.cpp.o
/home/rffrancon/Documents/pandora/src/pangenome/panread.cpp:216:15: warning: ‘std::ostream& pangenome::operator<<(std::ostream&, const pangenome::Read&)’ has not been declared within pangenome
 std::ostream &pangenome::operator<<(std::ostream &out, const pangenome::Read &r) {
               ^~~~~~~~~
In file included from /home/rffrancon/Documents/pandora/src/pangenome/panread.cpp:10:0:
/home/rffrancon/Documents/pandora/include/pangenome/panread.h:41:26: note: only here as a friend
     friend std::ostream &operator<<(std::ostream &out, const Read &r);
                          ^~~~~~~~
[ 97%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/pansample.cpp.o
/home/rffrancon/Documents/pandora/src/pangenome/pansample.cpp:37:15: warning: ‘std::ostream& pangenome::operator<<(std::ostream&, const pangenome::Sample&)’ has not been declared within pangenome
 std::ostream &pangenome::operator<<(std::ostream &out, pangenome::Sample const &s) {
               ^~~~~~~~~
In file included from /home/rffrancon/Documents/pandora/src/pangenome/pansample.cpp:7:0:
/home/rffrancon/Documents/pandora/include/pangenome/pansample.h:31:26: note: only here as a friend
     friend std::ostream &operator<<(std::ostream &out, const Sample &s);
                          ^~~~~~~~
[100%] Linking CXX executable pandora
/usr/bin/ld: /home/rffrancon/Documents/pandora/cmake-build-debug/lib/libboost_filesystem.a(operations.o): relocation R_X86_64_32 against `.rodata.str1.1' cannot be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /home/rffrancon/Documents/pandora/cmake-build-debug/lib/libboost_filesystem.a(path.o): relocation R_X86_64_32 against `.rodata.str1.8' cannot be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /home/rffrancon/Documents/pandora/cmake-build-debug/lib/libboost_system.a(error_code.o): relocation R_X86_64_32 against `.rodata.str1.1' cannot be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
CMakeFiles/pandora.dir/build.make:979: recipe for target 'pandora' failed
make[3]: *** [pandora] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/pandora.dir/all' failed
make[2]: *** [CMakeFiles/pandora.dir/all] Error 2
CMakeFiles/Makefile2:79: recipe for target 'CMakeFiles/pandora.dir/rule' failed
make[1]: *** [CMakeFiles/pandora.dir/rule] Error 2
Makefile:129: recipe for target 'pandora' failed
make: *** [pandora] Error 2

Crazy RAM use

Here is a quick excel graph showing for some values of coverage (x-axis) the time, max memory (RAM in MB) and average memory used by pandora.
image
To be updated with a better graph and extended to 900x (since we have the dataset for it.
Note that max RAM seems to grow linearly with coverage.

Duplicate coordinates

In the function get_read_overlap_coordinates https://github.com/rmcolq/pandora/blob/a4a596cb17c319048f294b286301e2bcbe9ee17a/src/extract_reads.cpp#L157-L200 the output vector often has multiple instances of the exact same coordinates.
As we later loop over this vector of coordinates and do operations on reads based on them (https://github.com/rmcolq/pandora/blob/a4a596cb17c319048f294b286301e2bcbe9ee17a/src/extract_reads.cpp#L239-L301) maybe we should remove duplicates or use a set of some sort?

Incorrect detection of non-ACGT base

When running pandora map on /nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/toy.fastq the log file shows that non-ACGT bases were found in multiple reads. On inspection of the reads in question, I am unable to find any non-ACGT bases.

Log file and fastq file attached below.

pandora.map.log
toy.fastq.gz

Memory leaks when running unit tests

Valgrind output: valgrind_output.txt

Valgrind command used:
valgrind --log-file="valgrind_output" ../../cmake-build-debug/test/pandora_test
with working directory: pandora/test/test_cases. This command was executed on my laptop. All of the tests passed.

Commit: b914159
Branch: dev

For example:

==15934== Process terminating with default action of signal 6 (SIGABRT)
==15934==    at 0x5DF9E97: raise (raise.c:51)
==15934==    by 0x5DFB800: abort (abort.c:79)
==15934==    by 0x5DEB399: __assert_fail_base (assert.c:92)
==15934==    by 0x5DEB411: __assert_fail (assert.c:101)
==15934==    by 0x5E29E9: Interval::Interval(unsigned int, unsigned int) (interval.cpp:11)
==15934==    by 0x6136FD: Minimizer::Minimizer(unsigned long, unsigned int, unsigned int, bool) (minimizer.cpp:11)
==15934==    by 0x71C536: MinimizerTest_create_Test::TestBody() (minimizer_test.cpp:51)
==15934==    by 0x800F76: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934==    by 0x7FAEA2: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934==    by 0x7D99E9: testing::Test::Run() (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934==    by 0x7DA346: testing::TestInfo::Run() (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934==    by 0x7DA9C9: testing::TestCase::Run() (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934== 
==15934== HEAP SUMMARY:
==15934==     in use at exit: 7,154,922 bytes in 24,129 blocks
==15934==   total heap usage: 571,866 allocs, 547,737 frees, 125,362,426,822 bytes allocated
==15934== 
==15934== LEAK SUMMARY:
==15934==    definitely lost: 5,792 bytes in 71 blocks
==15934==    indirectly lost: 2,722,572 bytes in 14,408 blocks
==15934==      possibly lost: 1,996,141 bytes in 931 blocks
==15934==    still reachable: 2,430,417 bytes in 8,719 blocks
==15934==         suppressed: 0 bytes in 0 blocks
==15934== Rerun with --leak-check=full to see details of leaked memory

Fix `hits_along_path`

For a test case with lmp (1 [4, 5) G) (2 [8, 9) C)
It doesn't think hit (3, 3, 3, 3{[4, 5)[8, 9)[16, 17)}, 1, 0) lies on this path

Fix `kmernode_path_from_localnode_path`

Found a wrong answer in function:

    uint32_t pnode_id = 3, prg_id = 3, read_id = 0, knode_id = 0;
    string pnode_name = "three";
    bool orientation(true);
    deque<Interval> d;
    prg::Path prg_path;
    MinimizerHitPtr mh;

    // define localPRG
    LocalPRG l3(prg_id, "nested varsite", "A 5 G 7 C 8 T 7 T 9 CCG 10 CGG 9  6 G 5 TAT");
    Index *idx;
    idx = new Index();
    auto w = 1, k = 3;
    l3.minimizer_sketch(idx, w, k);

    // define localpath and kmerpath
    // corresponds to sequence (A) G C T CGG  (TAT)
    vector<LocalNodePtr> lmp = {//l3.prg.nodes[0],
            l3.prg.nodes[1], l3.prg.nodes[2], l3.prg.nodes[4],
            l3.prg.nodes[6], l3.prg.nodes[7]//, l3.prg.nodes[9]
    };
    for (const auto &l : lmp)
        cout << *l << endl;

    vector<KmerNodePtr> kmp = l3.kmernode_path_from_localnode_path(lmp);
    for (const auto &n : kmp)
        cout << *n << endl;

Finds an additional kmernode 3 3{[0, 1)[36, 37)[40, 41)} 0, 0 which it shouldn't

Change path termination process for local assembly

Currently, as we build a path I test whether it ends in the end k-mer. But it would make more sense to test whether it ends in any of the end k-mers we have which exist in the graph.

To test which end k-mers exist we can use the hash/method we produce for #59 to mark all end k-mers that exist in the graph upfront (and their reverse).

Speed up update_covgs

Currently takes about 10 mins for a single nanopore read, and 1 PRG, everything else takes seconds.

Make sure compare command runs

Add tests (multisample dataset)
Update in line with how map now works

  • include genotyping
  • updates which fix the localPRG as something that doesn't change and creates a copy of kmergraphs within pangraph structure where counts/coverages are added
  • see how new vcf structure works with adding samples/reads in this case

Fix anything with vcf_ref

Function get_vcf_ref sometimes outputs vcfs which are far too short to be real. Also must output some other sort of bad vcf (not yet diagnosed).
Latest fail on gene GC00000258_r1_1 in vcf_ref /nfs/leia/research/iqbal/rmcolq/projects/pangenome_prg/ecoli/290818_all/ecoli_pangenome_PRG_290818.k12_vcf_ref.fa in directory /nfs/leia/research/iqbal/rmcolq/projects/pandora_variation/analysis/loman_k12_debug/work/5b/be999de3e77a9778ffc7f449005cee/.

In the very least, asserts need to be added to catch bad vcf_refs.

Memory leaks

Check the output from command:
/home/zam/bin/bin/valgrind --leak-check=yes --track-origins=yes -v build/pandora test/test_cases/prg4567.fa test/test_cases/read0.fa wow &> build/pandora_valgrind_output.txt

START: Wed Oct 26 20:23:44 2016

Building Index from PRG file
--39147-- REDIR: 0x3a146bd1d0 (operator new[](unsigned long)) redirected to 0x4a0794a (operator new[](unsigned long))
--39147-- REDIR: 0x3a11e83350 (memchr) redirected to 0x4a09890 (memchr)
--39147-- REDIR: 0x3a146bb2f0 (operator delete(void_)) redirected to 0x4a072ec (operator delete(void_))
--39147-- REDIR: 0x3a11e833d0 (bcmp) redirected to 0x480260a (_vgnU_ifunc_wrapper)
--39147-- REDIR: 0x3a11f3e6d0 (__memcmp_sse4_1) redirected to 0x4a0a8c0 (bcmp)
--39147-- REDIR: 0x3a11e83800 (memmove) redirected to 0x4a0ad40 (memmove)
--39147-- REDIR: 0x3a146bb330 (operator delete) redirected to 0x4a06f68 (operator delete)
Reading Read file
Finished building PanGraph from reads
Writing PanGraph to file
FINISH: Wed Oct 26 20:23:46 2016

==39147==
==39147== HEAP SUMMARY:
==39147== in use at exit: 504 bytes in 16 blocks
==39147== total heap usage: 1,934 allocs, 1,918 frees, 454,491 bytes allocated
==39147==
==39147== Searching for pointers to 16 not-freed blocks
==39147== Checked 207,608 bytes
==39147==
==39147== 504 (48 direct, 456 indirect) bytes in 1 blocks are definitely lost in loss record 9 of 9
==39147== at 0x4A07EB6: operator new(unsigned long) (vg_replace_malloc.c:287)
==39147== by 0x44790C: main (in /data2/users/rachel/projects/pandora/build/pandora)
==39147==
==39147== LEAK SUMMARY:
==39147== definitely lost: 48 bytes in 1 blocks
==39147== indirectly lost: 456 bytes in 15 blocks
==39147== possibly lost: 0 bytes in 0 blocks
==39147== still reachable: 0 bytes in 0 blocks
==39147== suppressed: 0 bytes in 0 blocks
==39147==
==39147== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 4 from 4)
==39147==
==39147== 1 errors in context 1 of 3:
==39147== Conditional jump or move depends on uninitialised value(s)
==39147== at 0x3A11A1754B: index (in /lib64/ld-2.12.so)
==39147== by 0x3A11A06237: expand_dynamic_string_token (in /lib64/ld-2.12.so)
==39147== by 0x3A11A082F1: _dl_map_object (in /lib64/ld-2.12.so)
==39147== by 0x3A11A016DD: map_doit (in /lib64/ld-2.12.so)
==39147== by 0x3A11A0E265: _dl_catch_error (in /lib64/ld-2.12.so)
==39147== by 0x3A11A015F6: do_preload (in /lib64/ld-2.12.so)
==39147== by 0x3A11A0480E: dl_main (in /lib64/ld-2.12.so)
==39147== by 0x3A11A15B4D: _dl_sysdep_start (in /lib64/ld-2.12.so)
==39147== by 0x3A11A014A3: _dl_start (in /lib64/ld-2.12.so)
==39147== by 0x3A11A00B07: ??? (in /lib64/ld-2.12.so)
==39147== by 0x3: ???
==39147== by 0xFFEFFF912: ???
==39147== Uninitialised value was created by a stack allocation
==39147== at 0x3A11A032A1: dl_main (in /lib64/ld-2.12.so)
==39147==
==39147==
==39147== 1 errors in context 2 of 3:
==39147== Conditional jump or move depends on uninitialised value(s)
==39147== at 0x3A11A17546: index (in /lib64/ld-2.12.so)
==39147== by 0x3A11A06237: expand_dynamic_string_token (in /lib64/ld-2.12.so)
==39147== by 0x3A11A082F1: _dl_map_object (in /lib64/ld-2.12.so)
==39147== by 0x3A11A016DD: map_doit (in /lib64/ld-2.12.so)
==39147== by 0x3A11A0E265: _dl_catch_error (in /lib64/ld-2.12.so)
==39147== by 0x3A11A015F6: do_preload (in /lib64/ld-2.12.so)
==39147== by 0x3A11A0480E: dl_main (in /lib64/ld-2.12.so)
==39147== by 0x3A11A15B4D: _dl_sysdep_start (in /lib64/ld-2.12.so)
==39147== by 0x3A11A014A3: _dl_start (in /lib64/ld-2.12.so)
==39147== by 0x3A11A00B07: ??? (in /lib64/ld-2.12.so)
==39147== by 0x3: ???
==39147== by 0xFFEFFF912: ???
==39147== Uninitialised value was created by a stack allocation
==39147== at 0x3A11A032A1: dl_main (in /lib64/ld-2.12.so)
==39147==
--39147--
--39147-- used_suppression: 4 U1004-ARM-_dl_relocate_object
==39147==
==39147== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 4 from 4)

Optimise writes

Use \n rather than endl and write longer strings less often, based on Martin's advice about the performance hit.

Save mapped read strings and coordinates takes forever

In src/pangenome/pangraph.cpp I now have a function:

void Graph::save_mapped_read_strings(const string& readfilepath, \\
const string& outdir, const int32_t buff)

Which for each gene, finds the coordinates on the reads for each cluster of hits and creates a file with this segment of read output to it.
This function takes about 2 days (or more) to run on full K12 pangenome and 900X nanopore reads (compared to an hour and a half for the rest of the map code)

Annotation of mosaics

Much of the time, if I have a set of 30 genomes I am comparing, there's no need to give a VCF with respect to a known reference (we just want to compare our samples).

{yes yes, I know that sometimes one of these 30 is special and we want to compare with it}

One thing we can do to add value to the best-reference-per-gene that Pandora infers, is simply to annotate it. We could just run Prokka

https://github.com/tseemann/prokka

and since we have a file with one read per gene, I think it would be super easy.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.