mcveanlab / mccortex
De novo genome assembly and multisample variant calling
Home Page: https://github.com/mcveanlab/mccortex/wiki
License: MIT License
https://github.com/mcveanlab/mccortex/wiki/Graph-Construction
The page uses --fq_threshold 5; should this be --fq-cutoff 5?
Interleaved files may not be ordered first mate then second mate. If mates get swapped, this will cause issues when creating links and removing PCR duplicates.
What is the -G option of the contigs command used for? Does it impact assembly quality?
I'm impressed by how cleanly mccortex installs and runs!
I got these warnings when threading reads:
[16 Apr 2016 12:18:33-LOx][generate_paths.c:422] Warn: Reads may overlap in fragment: 151 + 151 > frag len min: 0; max: 1000
We get lots of overlapping PE reads from NextSeq and MiSeq due to suboptimal Nextera XT library prep.
Should I be concerned?
Will this affect the results?
Any value for -C <float> in contigs causes an assert error:
% mccortex63 contigs -C 0.5 -f -o foo1.fa -p infer.ctp.gz infer.ctx
[25 Apr 2016 10:59:35-XuR][cmd] mccortex63 contigs -C 0.5 -f -o foo1.fa -p infer.ctp.gz infer.ctx
[25 Apr 2016 10:59:35-XuR][cwd] /mnt/seq/JOBS/MDU/J2016-06490/nullarbor.ST80/2014-23625
[25 Apr 2016 10:59:35-XuR][version] mccortex=86b4ffe zlib=1.2.7 htslib=1.3-37-gfc93dfc ASSERTS=ON hash=Lookup3 CHECKS=ON k=33..63
[25 Apr 2016 10:59:35-XuR] Taking number of kmers as genome size: 2,918,130
[25 Apr 2016 10:59:35-XuR][memory] 202 bits per kmer
[25 Apr 2016 10:59:35-XuR][memory] graph: 94.9MB
[25 Apr 2016 10:59:35-XuR][memory] paths: 30.1MB
[25 Apr 2016 10:59:35-XuR][memory] total: 95.1MB of 377.6GB RAM
[25 Apr 2016 10:59:35-XuR][hasht] Allocating table with 3,932,160 entries, using 60.2MB
[25 Apr 2016 10:59:35-XuR][hasht] number of buckets: 131,072, bucket size: 30
[25 Apr 2016 10:59:35-XuR][graph] kmer-size: 63; colours: 1; capacity: 3,932,160
[25 Apr 2016 10:59:35-XuR][GPathReader] need 7216 paths 16810 bytes
[25 Apr 2016 10:59:35-XuR][GPathSet] Allocating for 7,216 paths, 7KB colset, 9.4KB seq => 143.3KB total
[25 Apr 2016 10:59:35-XuR][FileFilter] Loading file infer.ctx [1 colour]
[25 Apr 2016 10:59:35-XuR][GReader] 2,918,130 kmers, 58.4MB filesize
[25 Apr 2016 10:59:35-XuR][GReader] Loaded 2,918,130 / 2,918,130 (100.00%) of kmers parsed
[25 Apr 2016 10:59:35-XuR][hasht] buckets: 131,072 [2^17]; bucket size: 30; memory: 60.2MB; occupancy: 2,918,130 / 3,932,160 (74.21%)
[25 Apr 2016 10:59:35-XuR][hasht] collisions 0: 2901076
[25 Apr 2016 10:59:35-XuR][hasht] collisions 1: 16416
[25 Apr 2016 10:59:35-XuR][hasht] collisions 2: 609
[25 Apr 2016 10:59:35-XuR][hasht] collisions 3: 25
[25 Apr 2016 10:59:35-XuR][hasht] collisions 4: 3
[25 Apr 2016 10:59:35-XuR][hasht] collisions 5: 1
[25 Apr 2016 10:59:35-XuR][FileFilter] Loading file infer.ctp.gz [1 colour]
[25 Apr 2016 10:59:35-XuR][GPathSet] Allocating for 32,768 paths, 32KB colset, 384KB seq => 1MB total
[25 Apr 2016 10:59:35-XuR] Loaded 7,216 paths from 1,688 kmers
[25 Apr 2016 10:59:35-XuR][Assemble] Assembling contigs with 2 threads, walking colour 0
[25 Apr 2016 10:59:35-XuR][Assemble] Using missing info check: yes
[25 Apr 2016 10:59:35-XuR][Assemble] Stop traversal if step cummulative confidence < 0.500000
[25 Apr 2016 10:59:35-XuR][Assemble] Writing contigs to foo1.fa
[25 Apr 2016 10:59:35-XuR][Assemble] Seeding with random kmers...
[src/tools/assemble_stats.c:43] Assert Failed assem2str(): strlen(assem_stop_str[assem]) < size
[25 Apr 2016 10:59:35-XuR] Assert Error
Aborted (core dumped)
Hello
I tried to follow the "Workflow Examples" in the wiki to do variant calling with two plant genotypes based on paired end reads. Here is the sequence of commands I used:
ctx63 build -m 150G -t 10 -k 61 --sample genotype1 --seq2 genotype1_r1.fastq:genotype1_r2.fastq genotype1.ctx
ctx63 build -m 150G -t 10 -k 61 --sample genotype2 --seq2 genotype2_r1.fastq:genotype2_r2.fastq genotype2.ctx
ctx63 build -m 150G -t 10 -k 61 --sample reference --seq reference.fa reference.ctx
ctx63 clean -m 150G -t 10 --out genotype1.clean.ctx genotype1.ctx
ctx63 clean -m 150G -t 10 --out genotype2.clean.ctx genotype2.ctx
ctx63 join -m 150G --out refAndSamples.clean.ctx reference.ctx genotype1.clean.ctx genotype2.clean.ctx
ctx63 inferedges -m 150G -t 10 refAndSamples.clean.ctx
ctx63 thread -m 150G -t 10 --seq2 genotype1_r1.fastq:genotype1_r2.fastq --out genotype1.ctp.gz refAndSamples.clean.ctx:1
ctx63 thread -m 150G -t 10 --seq2 genotype2_r1.fastq:genotype2_r2.fastq --out genotype2.ctp.gz refAndSamples.clean.ctx:2
Unfortunately, both read threading steps failed with this error:
[src/alignment/correct_aln_stats.c:137] Assert Failed correct_aln_stats_print_summary(): num_reads > 0
[15 Aug 2014 16:22:40-keZ] Assert Error
I tried to "fix" this by adding this code above line 135 in alignment/correct_aln_stats.c:
if(num_reads == 0)
{
  status("[CorrectAln] Didn't see any SE reads");
}
After recompiling, the threading runs smoothly. However, now I am stuck at the contigs command
ctx63 contigs -m 150G -t 10 -p 1:genotype1.ctp.gz -p 2:genotype2.ctp.gz refAndSamples.clean.ctx
which yields:
[src/graph_paths/gpath_checks.c:143] Error gpath_load_sample_pop(): No point loading paths for colours other than --colour 0
[18 Aug 2014 09:42:17-gOC] Fatal Error
Any clues what went wrong?
e.g.
ATGCGTTATATTCGCCTGTGT
{
"key": "ACACAGGCGAATATAACGCAT", "colours": [1],
"left": "AT", "right": "A",
"edges": "a3",
"links": []
}
I'm not sure if this is intended behaviour though?
Running mccortex clean without any options prints the help output, which says "If neither -T or -U specified, just saves output statistics." However, the default behaviour seems to be to automatically apply -T and -U with computed values.
It would be good to integrate VCF annotation/filtering with the variant calling pipeline, e.g. to flag or remove homopolymer variants.
The next version of the graph format should use a JSON header.
I ran into the following error on a very small sample dataset:
[05 Sep 2016 02:26:45-HUD][version] mccortex=v0.0.3-503-gf025dbf zlib=1.2.7 htslib=1.3.1-61-ge87ae87 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[05 Sep 2016 02:26:45-HUD][memory] 179 bits per kmer
[05 Sep 2016 02:26:45-HUD][memory] graph: 3.6GB
[05 Sep 2016 02:26:45-HUD][memory](of which threads: 20 x 42991616 = 820MB)
[05 Sep 2016 02:26:45-HUD][memory] paths: 1.4GB
[05 Sep 2016 02:26:45-HUD][memory] total: 3.7GB of 377.6GB RAM
[05 Sep 2016 02:26:45-HUD][hasht] Allocating table with 171,966,464 entries, using 1.3GB
[05 Sep 2016 02:26:45-HUD][hasht] number of buckets: 4,194,304, bucket size: 41
[05 Sep 2016 02:26:45-HUD][graph] kmer-size: 31; colours: 3; capacity: 171,966,464
[05 Sep 2016 02:26:45-HUD][GPathReader] need 3589273 paths 21033608 bytes
[05 Sep 2016 02:26:45-HUD][GPathSet] Allocating for 3,589,273 paths, 3.4MB colset, 17.5MB seq => 82.5MB total
[05 Sep 2016 02:26:45-HUD][FileFilter] Reading file ./inb1_pl/k31/graphs/MA-ES-3.clean.ctx [1 src colour]
[05 Sep 2016 02:26:45-HUD][GReader] 121,137,470 kmers, 1.5GB filesize
[05 Sep 2016 02:27:15-HUD][GReader] Loaded 121,137,470 / 121,137,470 (100.00%) of kmers parsed
[05 Sep 2016 02:27:15-HUD][FileFilter] Reading file ./inb1_pl/k31/graphs/MN-DM-1.clean.ctx [1 src colour] with filter: 0->1
[05 Sep 2016 02:27:15-HUD][GReader] 127,420,164 kmers, 1.5GB filesize
[05 Sep 2016 02:27:40-HUD][hasht] buckets: 4,194,304 [2^22]; bucket size: 41; memory: 1.3GB; occupancy: 161,890,833 / 171,966,464 (94.14%)
[05 Sep 2016 02:27:40-HUD][hasht] collisions 0: 154224356
[05 Sep 2016 02:27:40-HUD][hasht] collisions 1: 5397571
[05 Sep 2016 02:27:40-HUD][hasht] collisions 2: 1426637
[05 Sep 2016 02:27:40-HUD][hasht] collisions 3: 495114
[05 Sep 2016 02:27:40-HUD][hasht] collisions 4: 195947
[05 Sep 2016 02:27:40-HUD][hasht] collisions 5: 82824
[05 Sep 2016 02:27:40-HUD][hasht] collisions 6: 36706
[05 Sep 2016 02:27:40-HUD][hasht] collisions 7: 16913
[05 Sep 2016 02:27:40-HUD][hasht] collisions 8: 7662
[05 Sep 2016 02:27:40-HUD][hasht] collisions 9: 3639
[05 Sep 2016 02:27:40-HUD][hasht] collisions 10: 1772
[05 Sep 2016 02:27:40-HUD][hasht] collisions 11: 834
[05 Sep 2016 02:27:40-HUD][hasht] collisions 12: 426
[05 Sep 2016 02:27:40-HUD][hasht] collisions 13: 219
[05 Sep 2016 02:27:40-HUD][hasht] collisions 14: 113
[05 Sep 2016 02:27:40-HUD][hasht] collisions 15: 52
[05 Sep 2016 02:27:40-HUD][hasht] collisions 16: 24
[05 Sep 2016 02:27:40-HUD][hasht] collisions 17: 12
[05 Sep 2016 02:27:40-HUD][hasht] collisions 18: 6
[05 Sep 2016 02:27:40-HUD][hasht] collisions 19: 6
[05 Sep 2016 02:27:40-HUD][hash_table.c:293] Fatal Error: Hash table is full
The error arises during the mccortex bubbles step of the pipeline.
When threading reads, a large amount of time is spent writing the link files (.ctp.gz) to disk. Currently the compression is single-threaded. If many threads compressed blocks in parallel, a single lock could be used to write one compressed block to disk at a time.
Not sure if we should use bgzf block compression from htslib or roll our own block headers.
E.g.
TGCCGGCGCGGGTGGCCGCCG
{
"key": "CGGCGGCCACCCGCGCCGGCA", "colours": [1],
", "right": "C",
"edges": "02",
"links": []
}
Looks like it's missing the "left" entry partially.
Add a new command, query, to query sorted and indexed graphs on disk:
mccortex31 query [-k <kmer>|-1 <kmers.fa>|-o out.ctx] <graph.ctx> <graph.ctx.idx>
Output is either in the same format as the view command or in graph format, written to disk.
_ _ _
_/a\__/b\__/c\_
\_/ \_/ \_/
A B C
We currently report adjacent bubbles as one big bubble. For the given example we report abc|ABC, bc|BC, c|C, and the reverse calls. Instead we should only report stand-alone bubbles: a|A, b|B, c|C.
Any plans to output your actual contigs in GFA?
Inferring edges is only required with multiple samples AND links. If not required, we can skip it and improve performance. This is risky, so not a suggested setting.
Hello,
as discussed briefly with Zam, it would be pretty neat if cortex could handle two layers of colors:
I am using mccortex compiled from git (SHA: 22e27b7) using make all.
Configure graph walking with a common argument across commands:
-a,--assembly [missInfoCheck=T|F,minCumulConf=,minStepConf=]
e.g.
mccortex31 contigs --assembly missInfoCheck=F,minStepConf=0.99 graph.ctx > contigs.fa
Hi @noporpoise,
I'm running mccortex building and cleaning on a bunch of samples from the ENA and a small percentage are failing on the cleaning step without much information as to why.
Here's a dump of the log. Building works fine but then cleaning dies with Assert Failed db_graph_alloc(): num_of_cols > 0
Any ideas?
[29 Dec 2016 11:12:04-QaP][cmd] /users/iqbal/phelimb/apps/mccortex/bin/mccortex31 build -t 1 -m 7G -k 31 -s ERR233401 --seq /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz --seq /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:12:04-QaP][cwd] /gpfs2/well/iqbal/people/phelim
[29 Dec 2016 11:12:04-QaP][version] mccortex=v0.0.3-539-g22e27b7 zlib=1.2.8 htslib=1.3.2-134-g1bc5c56 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[29 Dec 2016 11:12:04-QaP] Saving graph to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:12:04-QaP][sample] 0: ERR233401
[29 Dec 2016 11:12:04-QaP][task] /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz; FASTQ offset: auto-detect, threshold: off; cut homopolymers: off; remove PCR duplicates: no; colour: 0
[29 Dec 2016 11:12:04-QaP][task] /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz; FASTQ offset: auto-detect, threshold: off; cut homopolymers: off; remove PCR duplicates: no; colour: 0
[29 Dec 2016 11:12:04-QaP][memory] 104 bits per kmer
[29 Dec 2016 11:12:04-QaP][memory] graph: 6.9GB
[29 Dec 2016 11:12:04-QaP][memory] total: 6.9GB of 94.6GB RAM
[29 Dec 2016 11:12:04-QaP] Writing 1 colour graph to /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:12:04-QaP][hasht] Allocating table with 570,425,344 entries, using 4.3GB
[29 Dec 2016 11:12:04-QaP][hasht] number of buckets: 16,777,216, bucket size: 34
[29 Dec 2016 11:12:04-QaP][graph] kmer-size: 31; colours: 1; capacity: 570,425,344
[29 Dec 2016 11:12:04-QaP][hasht] buckets: 16,777,216 [2^24]; bucket size: 34;
[29 Dec 2016 11:12:04-QaP][hasht] memory: 4.3GB; filled: 0 / 570,425,344 (0.00%)
[29 Dec 2016 11:12:04-QaP][asyncio] Inputs: 2; Threads: 1
[29 Dec 2016 11:12:04-QaP][seq] Parsing sequence file /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz
[29 Dec 2016 11:12:04-QaP][seq] Parsing sequence file /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz
[29 Dec 2016 11:12:04-QaP] /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz: Qual scores: Sanger (Phred+33) [offset: 33, range: [33,73], sample: [33,69]]
[29 Dec 2016 11:12:04-QaP] /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz: Qual scores: Sanger (Phred+33) [offset: 33, range: [33,73], sample: [33,73]]
[29 Dec 2016 11:17:05-QaP][BuildGraph] Read 5,000,000 entries (reads / read pairs)
[29 Dec 2016 11:20:57-QaP][BuildGraph] Read 10,000,000 entries (reads / read pairs)
[29 Dec 2016 11:21:02-QaP][seq] Loaded 5,082,194 reads and 0 reads pairs (file: /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz)
[29 Dec 2016 11:21:04-QaP][seq] Loaded 5,082,194 reads and 0 reads pairs (file: /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz)
[29 Dec 2016 11:21:04-QaP][hasht] buckets: 16,777,216 [2^24]; bucket size: 34;
[29 Dec 2016 11:21:04-QaP][hasht] memory: 4.3GB; filled: 488,170,572 / 570,425,344 (85.58%)
[29 Dec 2016 11:21:04-QaP][hasht] collisions 0: 477616581
[29 Dec 2016 11:21:04-QaP][hasht] collisions 1: 9132135
[29 Dec 2016 11:21:04-QaP][hasht] collisions 2: 1181354
[29 Dec 2016 11:21:04-QaP][hasht] collisions 3: 195799
[29 Dec 2016 11:21:04-QaP][hasht] collisions 4: 35793
[29 Dec 2016 11:21:04-QaP][hasht] collisions 5: 7104
[29 Dec 2016 11:21:04-QaP][hasht] collisions 6: 1436
[29 Dec 2016 11:21:04-QaP][hasht] collisions 7: 297
[29 Dec 2016 11:21:04-QaP][hasht] collisions 8: 60
[29 Dec 2016 11:21:04-QaP][hasht] collisions 9: 11
[29 Dec 2016 11:21:04-QaP][hasht] collisions 11: 2
[29 Dec 2016 11:21:04-QaP][task] input: /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz colour: 0
[29 Dec 2016 11:21:04-QaP] SE reads: 10,164,388 PE reads: 0
[29 Dec 2016 11:21:04-QaP] good reads: 10,163,375 bad reads: 1,013
[29 Dec 2016 11:21:04-QaP] dup SE reads: 0 dup PE pairs: 0
[29 Dec 2016 11:21:04-QaP] bases read: 1,524,658,200 bases loaded: 1,520,553,440
[29 Dec 2016 11:21:04-QaP] num contigs: 10,186,918 num kmers: 1,214,945,900 novel kmers: 488,170,572
[29 Dec 2016 11:21:04-QaP][task] input: /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz colour: 0
[29 Dec 2016 11:21:04-QaP] SE reads: 0 PE reads: 0
[29 Dec 2016 11:21:04-QaP] good reads: 0 bad reads: 0
[29 Dec 2016 11:21:04-QaP] dup SE reads: 0 dup PE pairs: 0
[29 Dec 2016 11:21:04-QaP] bases read: 0 bases loaded: 0
[29 Dec 2016 11:21:04-QaP] num contigs: 0 num kmers: 0 novel kmers: 0
[29 Dec 2016 11:21:04-QaP] Dumping graph...
[29 Dec 2016 11:21:04-QaP][graphwriter] Saving file to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:21:04-QaP][FileFilter] Writing graph [1 src colour]
[29 Dec 2016 11:21:47-QaP][graphwriter] Dumped 488,170,572 kmers in 1 colour into: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx (ver: 6)
[29 Dec 2016 11:21:47-QaP][memory] We made 19 allocs
[29 Dec 2016 11:21:47-QaP] Done.
[29 Dec 2016 11:21:47-QaP][time] 583.00 seconds (9 mins 43 secs)
[29 Dec 2016 11:21:47-mIT][cmd] /users/iqbal/phelimb/apps/mccortex/bin/mccortex31 clean -m 6GB -B 2 -U -T -o /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/ERR233401.ctx -c /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_before.csv -C /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_after.csv -l /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_lbefore.csv -L /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_lafter.csv /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:21:47-mIT][cwd] /gpfs2/well/iqbal/people/phelim
[29 Dec 2016 11:21:47-mIT][version] mccortex=v0.0.3-539-g22e27b7 zlib=1.2.8 htslib=1.3.2-134-g1bc5c56 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[29 Dec 2016 11:21:47-mIT] Actions:
[29 Dec 2016 11:21:47-mIT] 0. Saving kmer coverage distribution to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_before.csv
[29 Dec 2016 11:21:47-mIT] 1. Saving unitig length distribution to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_lbefore.csv
[29 Dec 2016 11:21:47-mIT] 2. Cleaning tips shorter than 62 nodes
[29 Dec 2016 11:21:47-mIT] 3. Cleaning unitigs with auto-detected threshold
[29 Dec 2016 11:21:47-mIT] 4. Saving kmer coverage distribution to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_after.csv
[29 Dec 2016 11:21:47-mIT] 5. Saving unitig length distribution to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_lafter.csv
[29 Dec 2016 11:21:47-mIT][memory] 72 bits per kmer
[29 Dec 2016 11:21:47-mIT][memory] graph: 5.5GB
[29 Dec 2016 11:21:47-mIT][cleaning] 1 input graph, max kmers: 488,170,572, using 0 colours
[29 Dec 2016 11:21:47-mIT][memory] total: 5.5GB of 94.6GB RAM
[src/graph/db_graph.c:39] Assert Failed db_graph_alloc(): num_of_cols > 0
[29 Dec 2016 11:21:47-mIT] Assert Error
I am trying to clean a joined graph containing three colors as follows:
ctx63 clean ${GRAPH_DIR} -o /refAndSamples.basalAndropogonae.clean.ctx ${GRAPH_DIR}/refAndSamples.basalAndropogonae.ctx
I get the following output:
[05 Jun 2014 23:46:12-Wij][cmd] /programs/mccortex/bin/ctx63 clean /workdir/er432/andropogonae/mccortex_out -o /refAndSamples.basalAndropogonae.clean.ctx /workdir/er432/andropogonae/mccortex_out/refAndSamples.basalAndropogonae.ctx
[05 Jun 2014 23:46:12-Wij][cwd] /local/workdir/er432/andropogonae/mccortex_out
[05 Jun 2014 23:46:12-Wij][version] ctx=470b4ca zlib=1.2.3 htslib=0.2.0-rc8-6-gd49dfa6 ASSERTS=ON CHECKS=ON k=33..63
[src/kmer/graph_reader.c:101] Error graph_file_read_header(): Couldn't read 'Magic word': expected 6; recieved: 0; [file: /workdir/er432/andropogonae/mccortex_out]
[05 Jun 2014 23:46:12-Wij] Fatal Error
I try running the following:
$MCCORTEX contigs -m 490G -n 12G --colour 1 -p 0:Coelorachis.clean.ctp -p 1:Vossia.k63.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx > Vossia.clean.k63.fa
And I get this:
[16 Jun 2014 13:01:26-cEm][cmd] /programs/mccortex_5_30_14/bin/ctx63 contigs -m 490G -n 12G --colour 1 -p 0:Coelorachis.clean.ctp -p 1:Vossia.k63.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx
[16 Jun 2014 13:01:26-cEm][cwd] /local/workdir/er432/andropogonae/mccortex_out
[16 Jun 2014 13:01:26-cEm][version] ctx=v0.0.3 zlib=1.2.3 htslib=0.2.0-rc8-6-gd49dfa6-dirty ASSERTS=ON CHECKS=ON k=33..63
[16 Jun 2014 13:01:26-cEm][memory] graph: 305GB
[16 Jun 2014 13:01:26-cEm][memory] paths: 49.6GB
[16 Jun 2014 13:01:26-cEm][memory] total: 354.6GB of 504.8GB RAM
[16 Jun 2014 13:01:26-cEm][hashtable] Allocating table with 12,884,901,888 entries, using 192.5GB
[16 Jun 2014 13:01:26-cEm][hashtable] number of buckets: 268,435,456, bucket size: 48
[16 Jun 2014 13:02:50-cEm][graph] kmer-size: 63; colours: 3; capacity: 12,884,901,888
[16 Jun 2014 13:04:27-cEm][paths] Setting up path store to use 49.6GB main
[16 Jun 2014 13:04:27-cEm] Loading file refAndSamples.basalAndropogonae.inferredEdges.clean.ctx [3 colours] into colours 0-2
[16 Jun 2014 13:04:27-cEm] 2,223,283,362 kmers, 64.2GB filesize
[16 Jun 2014 13:04:27-cEm][CtxLoad] First col 0, into cols 0..2, file has 3 cols: refAndSamples.basalAndropogonae.inferredEdges.clean.ctx
[16 Jun 2014 13:14:42-cEm] Loaded 2,223,283,362 / 2,223,283,362 (100.00%) of kmers parsed
[16 Jun 2014 13:14:42-cEm][hash] buckets: 268,435,456 [2^28]; bucket size: 48; memory: 192.5GB; occupancy: 2,223,283,362 / 12,884,901,888 (17.25%)
[16 Jun 2014 13:14:42-cEm] collisions 0: 2223283362
[16 Jun 2014 13:14:42-cEm][PathFormat] With 2 files, require 11859397612 tmp memory [0 extra bytes]
[16 Jun 2014 13:14:42-cEm] Loading file Coelorachis.clean.ctp [1 colour] into colour 0
[16 Jun 2014 13:14:42-cEm] 2,039,725,230 paths, 38.6GB path-bytes, 27,492,743 kmers, 39.2GB filesize
[16 Jun 2014 13:16:45-cEm][paths] Setup tmp path memory to use 11GB [remaining 38.6GB]
[16 Jun 2014 13:16:45-cEm] Loading file Vossia.k63.clean.ctp [1 colour] with colour filter: 0 into colour 1
[16 Jun 2014 13:16:45-cEm] 633,841,256 paths, 11GB path-bytes, 25,553,986 kmers, 11.6GB filesize
[src/kmer/path_store.c:186] Error path_store_add_packed(): Out of memory for paths
[16 Jun 2014 13:18:45-cEm] Fatal Error
Running only with the path for Vossia as follows: $MCCORTEX contigs -m 490G -n 12G --ncontigs 1000000 --print --colour 1 -p 1:Vossia.k63.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx > Vossia.clean.k63.fa
Gives this:
[16 Jun 2014 12:43:19-cEm][cmd] /programs/mccortex_5_30_14/bin/ctx63 contigs -m 490G -n 12G --ncontigs 1000000 --colour 1 -p 1:Vossia.k63.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx
[16 Jun 2014 12:43:19-cEm][cwd] /local/workdir/er432/andropogonae/mccortex_out
[16 Jun 2014 12:43:19-cEm][version] ctx=v0.0.3 zlib=1.2.3 htslib=0.2.0-rc8-6-gd49dfa6-dirty ASSERTS=ON CHECKS=ON k=33..63
[16 Jun 2014 12:43:19-cEm][memory] graph: 305GB
[16 Jun 2014 12:43:19-cEm][memory] paths: 11GB
[16 Jun 2014 12:43:19-cEm][memory] total: 316GB of 504.8GB RAM
[16 Jun 2014 12:43:19-cEm][hashtable] Allocating table with 12,884,901,888 entries, using 192.5GB
[16 Jun 2014 12:43:19-cEm][hashtable] number of buckets: 268,435,456, bucket size: 48
[16 Jun 2014 12:44:45-cEm][graph] kmer-size: 63; colours: 3; capacity: 12,884,901,888
[16 Jun 2014 12:46:28-cEm][paths] Setting up path store to use 11GB main
[16 Jun 2014 12:46:28-cEm] Loading file refAndSamples.basalAndropogonae.inferredEdges.clean.ctx [3 colours] into colours 0-2
[16 Jun 2014 12:46:28-cEm] 2,223,283,362 kmers, 64.2GB filesize
[16 Jun 2014 12:46:28-cEm][CtxLoad] First col 0, into cols 0..2, file has 3 cols: refAndSamples.basalAndropogonae.inferredEdges.clean.ctx
[16 Jun 2014 12:57:16-cEm] Loaded 2,223,283,362 / 2,223,283,362 (100.00%) of kmers parsed
[16 Jun 2014 12:57:16-cEm][hash] buckets: 268,435,456 [2^28]; bucket size: 48; memory: 192.5GB; occupancy: 2,223,283,362 / 12,884,901,888 (17.25%)
[16 Jun 2014 12:57:16-cEm] collisions 0: 2223283362
[16 Jun 2014 12:57:16-cEm][PathFormat] With 1 files, require 0 tmp memory [0 extra bytes]
[16 Jun 2014 12:57:16-cEm] Loading file Vossia.k63.clean.ctp [1 colour] with colour filter: 0 into colour 1
[16 Jun 2014 12:57:16-cEm] 633,841,256 paths, 11GB path-bytes, 25,553,986 kmers, 11.6GB filesize
[src/kmer/path_format.c:476] Assert Failed paths_format_merge(): hdr->num_path_bytes == 0 || pstore->tmpstore != ((void *)0)
[16 Jun 2014 12:57:16-cEm] Assert Error
However, I can successfully run when I only try to get contigs for color 0, as follows:
$MCCORTEX contigs -m 490G -n 12G --ncontigs 1000000 --print --colour 0 -p 0:Coelorachis.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx > Coelorachis.clean.k63.fa
Using the reads command to obtain the subset of read pairs represented in the population graph, e.g. using the command below:
$ mccortex31 reads -m 460G -n 6G -t 52 -2 R1.fastq.gz:R2.fastq.gz:pg pg.ctx
produces three output files:
pg.1.fq.gz
pg.2.fq.gz
pg.fq.gz
The last file is empty. That's not expected, right?
https://github.com/mcveanlab/mccortex/wiki/Workflow-Examples#read-threading
The example for thread with --seq2 R1 R2 should be --seq2 R1:R2.
Write a server command to respond to queries about kmers and links. Requests would be HTTP and responses in JSON. At first the server would hold the graph and links in memory and the data would be read only. Javascript apps could then be written to visualise and interact with the graph.
The tests/pipeline test sometimes fails. It could be due to unlucky coverage, but it seems too common at 50X coverage.
The help says "If neither -t or -s specified, just saves output statistics.", but -t is threads and -s doesn't exist?
usage: mccortex63 clean [options] <in.ctx> [in2.ctx ...]
Clean a cortex graph. Joins graphs first, if multiple inputs given.
If neither -t or -s specified, just saves output statistics.
-h, --help This help message
-q, --quiet Silence status output normally printed to STDERR
-f, --force Overwrite output files
-o, --out <out.ctx> Save output graph file [required]
-m, --memory <mem> Memory to use
-n, --nkmers <kmers> Number of hash table entries (e.g. 1G ~ 1 billion)
-t, --threads <T> Number of threads to use [default: 2]
-N, --ncols <N> Number of graph colours to use
Cleaning:
-T[L], --tips[=L] Clip tips shorter than <L> kmers [default: auto]
-U[X], --unitigs[=X] Remove low coverage unitigs with median cov < X [default: auto]
-B, --fallback <T> Fall back threshold if we can't pick
Statistics:
-c, --covg-before <out.csv> Save kmer coverage histogram before cleaning
-C, --covg-after <out.csv> Save kmer coverage histogram after cleaning
-l, --len-before <out.csv> Save unitig length histogram before cleaning
-L, --len-after <out.csv> Save unitig length histogram after cleaning
--unitigs without a threshold, causes a calculated threshold to be used
Default: --tips 2*kmer_size --unitigs
Set thresholds to zero to turn-off cleaning
Add a breakpoint ref filter, similar to the --haploid <col> filter for the bubble caller, to filter out calls where the non-ref path has ref coverage. Suggest -H,--haploid-ref.
The breakpoint caller should mark kmers if they are used in a call. Unused kmers that are in a sample and not the reference ("novel kmers") should be used to seed a second breakpoint caller.
The novel kmer breakpoint caller should perform a breadth-first search (BFS) to find the nearest kmer that occurs in the reference either side. The shortest path between these kmers should be reported as a putative breakpoint call. This caller will have a high false positive rate, but may catch a few reasonable SNPs / indels.
Using a filename that contains a colon as input results in an error:
$ ./ctx31 build -k 31 -sA -1 A:2.fasta out
[07 Jul 2014 13:34:59-Zid][cmd] ./ctx31 build -k 31 -sA -1 A:2.fasta out
[07 Jul 2014 13:34:59-Zid][cwd] /home/jk/work/wt/prg/hla_graph
[07 Jul 2014 13:34:59-Zid][version] ctx=v0.0.3-29-gf1407a7 zlib=1.2.7 htslib=0.2.0-rc8-6-gd49dfa6 ASSERTS=ON CHECKS=ON k=3..31
[src/basic/async_read_io.c:51] Error asyncio_task_parse(): Expected -1
[07 Jul 2014 13:34:59-Zid] Fatal Error
One can pass multiple graphs to the popbubbles command at once. Will it join the graphs first, or pop bubbles in each graph and then join? I am assuming it's the former, as for the clean command. It would be helpful to clarify this in the help output (as for clean), or at least on the wiki for now.
Hello, does the current McCortex perform more reliably than the stable cortex release? Or might there be major undiscovered problems?
Thanks,
Yannick
When running the standard pipeline, the vcfcov step doesn't seem to include the additional NKMERS argument. I've found this to be a problem for larger graphs.
Genotype a VCF against a multicolour graph. Much of the work has already been done in ctx_geno.c.
a la https://stackoverflow.com/questions/29534519/why-gcc-doesnt-recognize-rdynamic-option
make all | grep bcftools
cd bcftools && /Applications/Xcode.app/Contents/Developer/usr/bin/make
gcc -rdynamic -o bcftools main.o vcfindex.o tabix.o vcfstats.o vcfisec.o vcfmerge.o vcfquery.o vcffilter.o filter.o vcfsom.o vcfnorm.o vcfgtcheck.o vcfview.o vcfannotate.o vcfroh.o vcfconcat.o vcfcall.o mcall.o vcmp.o gvcf.o reheader.o convert.o vcfconvert.o tsv2vcf.o vcfcnv.o HMM.o vcfplugin.o consensus.o ploidy.o version.o ccall.o em.o prob1.o kmin.o ../htslib/libhts.a -lpthread -lz -lm -ldl
gcc: error: unrecognized command line option '-rdynamic'
make[2]: *** [bcftools] Error 1
make[1]: *** [bcftools] Error 2
make: *** [libs-other] Error 2
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/local/libexec/gcc/x86_64-apple-darwin13/4.9.3/lto-wrapper
Target: x86_64-apple-darwin13
Configured with: /opt/local/var/macports/build/_opt_mports_dports_lang_gcc49/gcc49/work/gcc-4.9.3/configure --prefix=/opt/local --build=x86_64-apple-darwin13 --enable-languages=c,c++,objc,obj-c++,lto,fortran,java --libdir=/opt/local/lib/gcc49 --includedir=/opt/local/include/gcc49 --infodir=/opt/local/share/info --mandir=/opt/local/share/man --datarootdir=/opt/local/share/gcc-4.9 --with-local-prefix=/opt/local --with-system-zlib --disable-nls --program-suffix=-mp-4.9 --with-gxx-include-dir=/opt/local/include/gcc49/c++/ --with-gmp=/opt/local --with-mpfr=/opt/local --with-mpc=/opt/local --with-isl=/opt/local --disable-isl-version-check --with-cloog=/opt/local --disable-cloog-version-check --enable-stage1-checking --disable-multilib --enable-lto --enable-libstdcxx-time --with-as=/opt/local/bin/as --with-ld=/opt/local/bin/ld --with-ar=/opt/local/bin/ar --with-bugurl=https://trac.macports.org/newticket --with-pkgversion='MacPorts gcc49 4.9.3_0'
Thread model: posix
gcc version 4.9.3 (MacPorts gcc49 4.9.3_0)
Use a buffer to read a block of a file, then decompose it. stream_buffer.h already supports buffered reading, so it should be pretty simple.
mmap() failing for very large files:
[src/graph/graph_reader.c:690] Error graph_files_merge(): Cannot memory map file: graph.ctx [Cannot allocate memory]
Should try using fseek / fread / fwrite in blocks of ~10,000 kmers.
Consider writing links to disk to reduce peak memory.
Writing links to ~256 files, which can then be sorted and merged.
Temporary file will look like:
<kmer><junctions><tab><count>
Sorting can be done with the unix sort command.
mccortex is quite chatty, which is useful to give users information about what is going on. However, in general, a user should be able to control the amount of information printed out.
One option is to provide a -v option, which turns up verbosity. Multiple -v options provided increase the verbosity. This usually implies that the program is silent by default. Another alternative is to provide a -q --quiet option to allow a user to silence the program.
Are k-mers containing N automatically ignored?
The bubble caller has less power to assemble the reference allele vs sample allele since the reference colour is just a bag of kmers without link information. Reference sequences can provide reasonable link info if we break at repetitive kmers.
It would be good if it hyperlinked to the paper or pubmed abstract:
Would be nice if the McCortex pipeline could automatically generate a report with plots and figures on the processing that has been done, summary of the results, QA etc.
It would require R and LaTeX as dependencies.
The last release is quite old: 414 commits to master since this release. Can you make a new pre-release? It will help me package it.
We are getting this error message during installation. Any ideas on how to resolve this? Thanks.
make all
cd libs; make
make[1]: Entering directory '/home/microstaff/Software/mccortex/libs'
make[1]: *** No rule to make target 'xxHash/Makefile', needed by 'xxHash'. Stop.
make[1]: Leaving directory '/home/microstaff/Software/mccortex/libs'
Makefile:324: recipe for target 'libs/string_buffer/string_buffer.h' failed
make: *** [libs/string_buffer/string_buffer.h] Error 2
Hi Mac
/home/mgonnet/Soft/mccortex//bin/mccortex31 build -m 1G -t 2 -k 27 --sample 74_373 --seq2 74_373_1.fastq:74_373_2.fastq out/k27/graphs/74_373.raw.ctx >& out/k27/graphs/74_373.raw.ctx.log
my_job_script.mk:329: recipe for target 'out/k27/graphs/74_373.raw.ctx' failed
make: *** [out/k27/graphs/74_373.raw.ctx] Error 1
make: *** Deleting file 'out/k27/graphs/74_373.raw.ctx'
The end of the corresponding log file that produced the issue is:
[05 Aug 2015 19:05:07-vab][hasht] collisions 15: 10
[05 Aug 2015 19:05:07-vab][hasht] collisions 16: 6
[05 Aug 2015 19:05:07-vab][hasht] collisions 17: 6
[src/graph/hash_table.c:311] Error hash_table_find_or_insert_mt(): Hash table is full
[05 Aug 2015 19:05:07-vab] Fatal Error
any idea?
Best regards
Not sure what the spec says, but Bandage expects a tab, not a space, after the initial H on line 1. (GFA fields are tab-separated throughout, so a tab is correct here.)
Currently McCortex outputs a space.
I think out.ctp should be out.ctx?
usage: mccortex63 inferedges [options] <pop.ctx>
-o, --out <out.ctp> Save output file