iqbal-lab-org / pandora Goto Github PK
View Code? Open in Web Editor NEWPan-genome inference and genotyping with long noisy or short accurate reads
License: MIT License
Pan-genome inference and genotyping with long noisy or short accurate reads
License: MIT License
GATB dumps a .h5
file for each graph it builds. These are not needed later and can clog up the file system so we should remove them at the end of each local assembly.
Boost linker error when compiling ca57d6c.
/usr/local/bin/cmake --build /home/rffrancon/Documents/pandora/cmake-build-debug --target pandora -- -j 2
[ 20%] Built target boost
Scanning dependencies of target pandora
[ 25%] Building CXX object CMakeFiles/pandora.dir/src/check_kmergraph_main.cpp.o
[ 25%] Building CXX object CMakeFiles/pandora.dir/src/compare_main.cpp.o
[ 25%] Building CXX object CMakeFiles/pandora.dir/src/estimate_parameters.cpp.o
[ 28%] Building CXX object CMakeFiles/pandora.dir/src/fastaq_handler.cpp.o
[ 30%] Building CXX object CMakeFiles/pandora.dir/src/get_vcf_ref_main.cpp.o
[ 33%] Building CXX object CMakeFiles/pandora.dir/src/index.cpp.o
[ 35%] Building CXX object CMakeFiles/pandora.dir/src/index_main.cpp.o
/home/rffrancon/Documents/pandora/src/index_main.cpp: In function ‘int pandora_index(int, char**)’:
/home/rffrancon/Documents/pandora/src/index_main.cpp:54:10: warning: variable ‘update’ set but not used [-Wunused-but-set-variable]
bool update = false;
^~~~~~
[ 38%] Building CXX object CMakeFiles/pandora.dir/src/interval.cpp.o
[ 41%] Building CXX object CMakeFiles/pandora.dir/src/inthash.cpp.o
[ 41%] Building CXX object CMakeFiles/pandora.dir/src/kmergraph.cpp.o
[ 43%] Building CXX object CMakeFiles/pandora.dir/src/kmernode.cpp.o
[ 46%] Building CXX object CMakeFiles/pandora.dir/src/localPRG.cpp.o
[ 48%] Building CXX object CMakeFiles/pandora.dir/src/localgraph.cpp.o
[ 51%] Building CXX object CMakeFiles/pandora.dir/src/localnode.cpp.o
[ 53%] Building CXX object CMakeFiles/pandora.dir/src/main.cpp.o
[ 56%] Building CXX object CMakeFiles/pandora.dir/src/map_main.cpp.o
[ 56%] Building CXX object CMakeFiles/pandora.dir/src/minihit.cpp.o
[ 58%] Building CXX object CMakeFiles/pandora.dir/src/minihits.cpp.o
[ 61%] Building CXX object CMakeFiles/pandora.dir/src/minimizer.cpp.o
[ 64%] Building CXX object CMakeFiles/pandora.dir/src/minirecord.cpp.o
[ 66%] Building CXX object CMakeFiles/pandora.dir/src/noise_filtering.cpp.o
[ 69%] Building CXX object CMakeFiles/pandora.dir/src/path.cpp.o
[ 71%] Building CXX object CMakeFiles/pandora.dir/src/seq.cpp.o
[ 71%] Building CXX object CMakeFiles/pandora.dir/src/utils.cpp.o
/home/rffrancon/Documents/pandora/src/utils.cpp: In function ‘void infer_localPRG_order_for_reads(const std::vector<LocalPRG*>&, MinimizerHits*, pangenome::Graph*, int, const uint&, const float&, uint, uint)’:
/home/rffrancon/Documents/pandora/src/utils.cpp:415:69: warning: unused parameter ‘genome_size’ [-Wunused-parameter]
const int max_diff, const uint &genome_size, const float &scale_cluster_size,
^~~~~~~~~~~
[ 74%] Building CXX object CMakeFiles/pandora.dir/src/vcf.cpp.o
[ 76%] Building CXX object CMakeFiles/pandora.dir/src/vcfrecord.cpp.o
[ 79%] Building CXX object CMakeFiles/pandora.dir/src/walk_main.cpp.o
[ 82%] Building CXX object CMakeFiles/pandora.dir/src/de_bruijn/graph.cpp.o
[ 84%] Building CXX object CMakeFiles/pandora.dir/src/de_bruijn/node.cpp.o
[ 87%] Building CXX object CMakeFiles/pandora.dir/src/de_bruijn/ns.cpp.o
[ 87%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/ns.cpp.o
[ 89%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/pangraph.cpp.o
[ 92%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/pannode.cpp.o
/home/rffrancon/Documents/pandora/src/pangenome/pannode.cpp:189:15: warning: ‘std::ostream& pangenome::operator<<(std::ostream&, const pangenome::Node&)’ has not been declared within pangenome
std::ostream &pangenome::operator<<(std::ostream &out, pangenome::Node const &n) {
^~~~~~~~~
In file included from /home/rffrancon/Documents/pandora/src/pangenome/pannode.cpp:5:0:
/home/rffrancon/Documents/pandora/include/pangenome/pannode.h:43:26: note: only here as a friend
friend std::ostream &operator<<(std::ostream &out, const Node &n);
^~~~~~~~
/home/rffrancon/Documents/pandora/src/pangenome/pangraph.cpp:481:15: warning: ‘std::ostream& pangenome::operator<<(std::ostream&, const pangenome::Graph&)’ has not been declared within pangenome
std::ostream &pangenome::operator<<(std::ostream &out, pangenome::Graph const &m) {
^~~~~~~~~
In file included from /home/rffrancon/Documents/pandora/include/utils.h:12:0,
from /home/rffrancon/Documents/pandora/src/pangenome/pangraph.cpp:12:
/home/rffrancon/Documents/pandora/include/pangenome/pangraph.h:77:26: note: only here as a friend
friend std::ostream &operator<<(std::ostream &out, const Graph &m);
^~~~~~~~
[ 94%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/panread.cpp.o
/home/rffrancon/Documents/pandora/src/pangenome/panread.cpp:216:15: warning: ‘std::ostream& pangenome::operator<<(std::ostream&, const pangenome::Read&)’ has not been declared within pangenome
std::ostream &pangenome::operator<<(std::ostream &out, const pangenome::Read &r) {
^~~~~~~~~
In file included from /home/rffrancon/Documents/pandora/src/pangenome/panread.cpp:10:0:
/home/rffrancon/Documents/pandora/include/pangenome/panread.h:41:26: note: only here as a friend
friend std::ostream &operator<<(std::ostream &out, const Read &r);
^~~~~~~~
[ 97%] Building CXX object CMakeFiles/pandora.dir/src/pangenome/pansample.cpp.o
/home/rffrancon/Documents/pandora/src/pangenome/pansample.cpp:37:15: warning: ‘std::ostream& pangenome::operator<<(std::ostream&, const pangenome::Sample&)’ has not been declared within pangenome
std::ostream &pangenome::operator<<(std::ostream &out, pangenome::Sample const &s) {
^~~~~~~~~
In file included from /home/rffrancon/Documents/pandora/src/pangenome/pansample.cpp:7:0:
/home/rffrancon/Documents/pandora/include/pangenome/pansample.h:31:26: note: only here as a friend
friend std::ostream &operator<<(std::ostream &out, const Sample &s);
^~~~~~~~
[100%] Linking CXX executable pandora
/usr/bin/ld: /home/rffrancon/Documents/pandora/cmake-build-debug/lib/libboost_filesystem.a(operations.o): relocation R_X86_64_32 against `.rodata.str1.1' cannot be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /home/rffrancon/Documents/pandora/cmake-build-debug/lib/libboost_filesystem.a(path.o): relocation R_X86_64_32 against `.rodata.str1.8' cannot be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /home/rffrancon/Documents/pandora/cmake-build-debug/lib/libboost_system.a(error_code.o): relocation R_X86_64_32 against `.rodata.str1.1' cannot be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
CMakeFiles/pandora.dir/build.make:979: recipe for target 'pandora' failed
make[3]: *** [pandora] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/pandora.dir/all' failed
make[2]: *** [CMakeFiles/pandora.dir/all] Error 2
CMakeFiles/Makefile2:79: recipe for target 'CMakeFiles/pandora.dir/rule' failed
make[1]: *** [CMakeFiles/pandora.dir/rule] Error 2
Makefile:129: recipe for target 'pandora' failed
make: *** [pandora] Error 2
To convert to an integer, round rather than implicit casting on line
https://github.com/rmcolq/pandora/blob/3df77ce343f2e1483f60a7188d98197209e92b18/src/utils.cpp#L461
Also add an assert which fails at this point if coverage is 0 - too low.
Want to avoid poly-A kmers always achieving the minimizing value - distribute minimizing kmers more evenly across the set of all kmers
In the function get_read_overlap_coordinates
https://github.com/rmcolq/pandora/blob/a4a596cb17c319048f294b286301e2bcbe9ee17a/src/extract_reads.cpp#L157-L200 the output vector often has multiple instances of the exact same coordinates.
As we later loop over this vector of coordinates and do operations on reads based on them (https://github.com/rmcolq/pandora/blob/a4a596cb17c319048f294b286301e2bcbe9ee17a/src/extract_reads.cpp#L239-L301) maybe we should remove duplicates or use a set of some sort?
@rmcolq Can this folder be removed?
https://github.com/rffrancon/pandora/tree/dev/temp
When running pandora map on /nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/toy.fastq
the log file shows that non-ACGT bases were found in multiple reads. On inspection of the reads in question, I am unable to find any non-ACGT bases.
Log file and fastq file attached below.
Valgrind output: valgrind_output.txt
Valgrind command used:
valgrind --log-file="valgrind_output" ../../cmake-build-debug/test/pandora_test
with working directory: pandora/test/test_cases
. This command was executed on my laptop. All of the tests passed.
Commit: b914159
Branch: dev
For example:
==15934== Process terminating with default action of signal 6 (SIGABRT)
==15934== at 0x5DF9E97: raise (raise.c:51)
==15934== by 0x5DFB800: abort (abort.c:79)
==15934== by 0x5DEB399: __assert_fail_base (assert.c:92)
==15934== by 0x5DEB411: __assert_fail (assert.c:101)
==15934== by 0x5E29E9: Interval::Interval(unsigned int, unsigned int) (interval.cpp:11)
==15934== by 0x6136FD: Minimizer::Minimizer(unsigned long, unsigned int, unsigned int, bool) (minimizer.cpp:11)
==15934== by 0x71C536: MinimizerTest_create_Test::TestBody() (minimizer_test.cpp:51)
==15934== by 0x800F76: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934== by 0x7FAEA2: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934== by 0x7D99E9: testing::Test::Run() (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934== by 0x7DA346: testing::TestInfo::Run() (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934== by 0x7DA9C9: testing::TestCase::Run() (in /home/rffrancon/github/pandora/cmake-build-debug/test/pandora_test)
==15934==
==15934== HEAP SUMMARY:
==15934== in use at exit: 7,154,922 bytes in 24,129 blocks
==15934== total heap usage: 571,866 allocs, 547,737 frees, 125,362,426,822 bytes allocated
==15934==
==15934== LEAK SUMMARY:
==15934== definitely lost: 5,792 bytes in 71 blocks
==15934== indirectly lost: 2,722,572 bytes in 14,408 blocks
==15934== possibly lost: 1,996,141 bytes in 931 blocks
==15934== still reachable: 2,430,417 bytes in 8,719 blocks
==15934== suppressed: 0 bytes in 0 blocks
==15934== Rerun with --leak-check=full to see details of leaked memory
For a test case with lmp (1 [4, 5) G) (2 [8, 9) C)
It doesn't think hit (3, 3, 3, 3{[4, 5)[8, 9)[16, 17)}, 1, 0)
lies on this path
Found a wrong answer in function:
uint32_t pnode_id = 3, prg_id = 3, read_id = 0, knode_id = 0;
string pnode_name = "three";
bool orientation(true);
deque<Interval> d;
prg::Path prg_path;
MinimizerHitPtr mh;
// define localPRG
LocalPRG l3(prg_id, "nested varsite", "A 5 G 7 C 8 T 7 T 9 CCG 10 CGG 9 6 G 5 TAT");
Index *idx;
idx = new Index();
auto w = 1, k = 3;
l3.minimizer_sketch(idx, w, k);
// define localpath and kmerpath
// corresponds to sequence (A) G C T CGG (TAT)
vector<LocalNodePtr> lmp = {//l3.prg.nodes[0],
l3.prg.nodes[1], l3.prg.nodes[2], l3.prg.nodes[4],
l3.prg.nodes[6], l3.prg.nodes[7]//, l3.prg.nodes[9]
};
for (const auto &l : lmp)
cout << *l << endl;
vector<KmerNodePtr> kmp = l3.kmernode_path_from_localnode_path(lmp);
for (const auto &n : kmp)
cout << *n << endl;
Finds an additional kmernode 3 3{[0, 1)[36, 37)[40, 41)} 0, 0
which it shouldn't
Currently, as we build a path I test whether it ends in the end k-mer. But it would make more sense to test whether it ends in any of the end k-mers we have which exist in the graph.
To test which end k-mers exist we can use the hash/method we produce for #59 to mark all end k-mers that exist in the graph upfront (and their reverse).
Robyn suggested this as a cleaner way to have parameters and help messages
get_node
is currently causing up to 6 iterations through the graph per start/end kmer combination.
Currently takes about 10 mins for a single nanopore read, and 1 PRG, everything else takes seconds.
Change code so use sstream and can compare these during test runs without writing to file
Remove goto
statements and replace with a function call.
Add tests (multisample dataset)
Update in line with how map
now works
probably good to do
Function get_vcf_ref sometimes outputs vcfs which are far too short to be real. Also must output some other sort of bad vcf (not yet diagnosed).
Latest fail on gene GC00000258_r1_1
in vcf_ref /nfs/leia/research/iqbal/rmcolq/projects/pangenome_prg/ecoli/290818_all/ecoli_pangenome_PRG_290818.k12_vcf_ref.fa
in directory /nfs/leia/research/iqbal/rmcolq/projects/pandora_variation/analysis/loman_k12_debug/work/5b/be999de3e77a9778ffc7f449005cee/
.
In the very least, asserts need to be added to catch bad vcf_refs.
If at the end of the loop through pangraph nodes there are no nodes due to this being called
https://github.com/rmcolq/pandora/blob/3df77ce343f2e1483f60a7188d98197209e92b18/src/map_main.cpp#L285
Then output a message querying if genome_size is correct. Default was 5,000,000
Please improve title.
Should add a license to this repo
Currently outputs a pangenome matrix file with a dodgy name (missing /?)
Currently outputs a directory for every gene found in any sample - make this one for everything.
Note to self
Re: comments from @rffrancon in #15 all occurrences of uint
should be changed to uint32_t
for consistency.
Check the output from command:
/home/zam/bin/bin/valgrind --leak-check=yes --track-origins=yes -v build/pandora test/test_cases/prg4567.fa test/test_cases/read0.fa wow &> build/pandora_valgrind_output.txt
START: Wed Oct 26 20:23:44 2016
Building Index from PRG file
--39147-- REDIR: 0x3a146bd1d0 (operator new[](unsigned long)) redirected to 0x4a0794a (operator new[](unsigned long))
--39147-- REDIR: 0x3a11e83350 (memchr) redirected to 0x4a09890 (memchr)
--39147-- REDIR: 0x3a146bb2f0 (operator delete(void_)) redirected to 0x4a072ec (operator delete(void_))
--39147-- REDIR: 0x3a11e833d0 (bcmp) redirected to 0x480260a (_vgnU_ifunc_wrapper)
--39147-- REDIR: 0x3a11f3e6d0 (__memcmp_sse4_1) redirected to 0x4a0a8c0 (bcmp)
--39147-- REDIR: 0x3a11e83800 (memmove) redirected to 0x4a0ad40 (memmove)
--39147-- REDIR: 0x3a146bb330 (operator delete) redirected to 0x4a06f68 (operator delete)
Reading Read file
Finished building PanGraph from reads
Writing PanGraph to file
FINISH: Wed Oct 26 20:23:46 2016
==39147==
==39147== HEAP SUMMARY:
==39147== in use at exit: 504 bytes in 16 blocks
==39147== total heap usage: 1,934 allocs, 1,918 frees, 454,491 bytes allocated
==39147==
==39147== Searching for pointers to 16 not-freed blocks
==39147== Checked 207,608 bytes
==39147==
==39147== 504 (48 direct, 456 indirect) bytes in 1 blocks are definitely lost in loss record 9 of 9
==39147== at 0x4A07EB6: operator new(unsigned long) (vg_replace_malloc.c:287)
==39147== by 0x44790C: main (in /data2/users/rachel/projects/pandora/build/pandora)
==39147==
==39147== LEAK SUMMARY:
==39147== definitely lost: 48 bytes in 1 blocks
==39147== indirectly lost: 456 bytes in 15 blocks
==39147== possibly lost: 0 bytes in 0 blocks
==39147== still reachable: 0 bytes in 0 blocks
==39147== suppressed: 0 bytes in 0 blocks
==39147==
==39147== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 4 from 4)
==39147==
==39147== 1 errors in context 1 of 3:
==39147== Conditional jump or move depends on uninitialised value(s)
==39147== at 0x3A11A1754B: index (in /lib64/ld-2.12.so)
==39147== by 0x3A11A06237: expand_dynamic_string_token (in /lib64/ld-2.12.so)
==39147== by 0x3A11A082F1: _dl_map_object (in /lib64/ld-2.12.so)
==39147== by 0x3A11A016DD: map_doit (in /lib64/ld-2.12.so)
==39147== by 0x3A11A0E265: _dl_catch_error (in /lib64/ld-2.12.so)
==39147== by 0x3A11A015F6: do_preload (in /lib64/ld-2.12.so)
==39147== by 0x3A11A0480E: dl_main (in /lib64/ld-2.12.so)
==39147== by 0x3A11A15B4D: _dl_sysdep_start (in /lib64/ld-2.12.so)
==39147== by 0x3A11A014A3: _dl_start (in /lib64/ld-2.12.so)
==39147== by 0x3A11A00B07: ??? (in /lib64/ld-2.12.so)
==39147== by 0x3: ???
==39147== by 0xFFEFFF912: ???
==39147== Uninitialised value was created by a stack allocation
==39147== at 0x3A11A032A1: dl_main (in /lib64/ld-2.12.so)
==39147==
==39147==
==39147== 1 errors in context 2 of 3:
==39147== Conditional jump or move depends on uninitialised value(s)
==39147== at 0x3A11A17546: index (in /lib64/ld-2.12.so)
==39147== by 0x3A11A06237: expand_dynamic_string_token (in /lib64/ld-2.12.so)
==39147== by 0x3A11A082F1: _dl_map_object (in /lib64/ld-2.12.so)
==39147== by 0x3A11A016DD: map_doit (in /lib64/ld-2.12.so)
==39147== by 0x3A11A0E265: _dl_catch_error (in /lib64/ld-2.12.so)
==39147== by 0x3A11A015F6: do_preload (in /lib64/ld-2.12.so)
==39147== by 0x3A11A0480E: dl_main (in /lib64/ld-2.12.so)
==39147== by 0x3A11A15B4D: _dl_sysdep_start (in /lib64/ld-2.12.so)
==39147== by 0x3A11A014A3: _dl_start (in /lib64/ld-2.12.so)
==39147== by 0x3A11A00B07: ??? (in /lib64/ld-2.12.so)
==39147== by 0x3: ???
==39147== by 0xFFEFFF912: ???
==39147== Uninitialised value was created by a stack allocation
==39147== at 0x3A11A032A1: dl_main (in /lib64/ld-2.12.so)
==39147==
--39147--
--39147-- used_suppression: 4 U1004-ARM-_dl_relocate_object
==39147==
==39147== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 4 from 4)
Have been using a uint16_t which is too small
Add assert that read sequence and quality lines are the same length
Is it correct that spec is
@Header
SEQ
+
QUA
Use \n rather than endl and write longer strings less often, based on Martin's advice about the performance hit.
I'm not sure how to properly install GATB so that we can use it as a library. I've opened an issue here: GATB/gatb-core#21
This is the work that we've done so far on this issue: mbhall88@346c4f7
In src/pangenome/pangraph.cpp
I now have a function:
void Graph::save_mapped_read_strings(const string& readfilepath, \\
const string& outdir, const int32_t buff)
Which for each gene, finds the coordinates on the reads for each cluster of hits and creates a file with this segment of read output to it.
This function takes about 2 days (or more) to run on full K12 pangenome and 900X nanopore reads (compared to an hour and a half for the rest of the map code)
In cluster finding/filtering use collinearity to avoid some spurious hits
Much of the time, if I have a set of 30 genomes I am comparing, there's no need to give a VCF with respect to a known reference (we just want to compare our samples).
{yes yes, I know that sometimes one of these 30 is special and we want to compare with it}
One thing we can do to add value to the best-reference-per-gene that Pandora infers, is simply to annotate it. We could just run Prokka
https://github.com/tseemann/prokka
and since we have a file with one read per gene, I think it would be super easy.
make a clear error message
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.