mcveanlab / mccortex
De novo genome assembly and multisample variant calling
Home Page: https://github.com/mcveanlab/mccortex/wiki
License: MIT License
https://github.com/mcveanlab/mccortex/wiki/Graph-Construction
The page uses --fq_threshold 5; should this be --fq-cutoff 5?
Interleaved files may not be ordered first mate then second mate. If mates get swapped, this will cause issues when creating links and removing PCR duplicates.
What is the -G option of the contigs command used for? Does it impact assembly quality?
I'm impressed by how cleanly mccortex installs and runs!
I got these warnings when threading reads:
[16 Apr 2016 12:18:33-LOx][generate_paths.c:422] Warn: Reads may overlap in fragment: 151 + 151 > frag len min: 0; max: 1000
We get lots of overlapping PE reads from NextSeq and MiSeq due to suboptimal Nextera XT library prep.
Should I be concerned?
Will this affect the results?
Any value for -C <float> in contigs causes an assert error:
% mccortex63 contigs -C 0.5 -f -o foo1.fa -p infer.ctp.gz infer.ctx
[25 Apr 2016 10:59:35-XuR][cmd] mccortex63 contigs -C 0.5 -f -o foo1.fa -p infer.ctp.gz infer.ctx
[25 Apr 2016 10:59:35-XuR][cwd] /mnt/seq/JOBS/MDU/J2016-06490/nullarbor.ST80/2014-23625
[25 Apr 2016 10:59:35-XuR][version] mccortex=86b4ffe zlib=1.2.7 htslib=1.3-37-gfc93dfc ASSERTS=ON hash=Lookup3 CHECKS=ON k=33..63
[25 Apr 2016 10:59:35-XuR] Taking number of kmers as genome size: 2,918,130
[25 Apr 2016 10:59:35-XuR][memory] 202 bits per kmer
[25 Apr 2016 10:59:35-XuR][memory] graph: 94.9MB
[25 Apr 2016 10:59:35-XuR][memory] paths: 30.1MB
[25 Apr 2016 10:59:35-XuR][memory] total: 95.1MB of 377.6GB RAM
[25 Apr 2016 10:59:35-XuR][hasht] Allocating table with 3,932,160 entries, using 60.2MB
[25 Apr 2016 10:59:35-XuR][hasht] number of buckets: 131,072, bucket size: 30
[25 Apr 2016 10:59:35-XuR][graph] kmer-size: 63; colours: 1; capacity: 3,932,160
[25 Apr 2016 10:59:35-XuR][GPathReader] need 7216 paths 16810 bytes
[25 Apr 2016 10:59:35-XuR][GPathSet] Allocating for 7,216 paths, 7KB colset, 9.4KB seq => 143.3KB total
[25 Apr 2016 10:59:35-XuR][FileFilter] Loading file infer.ctx [1 colour]
[25 Apr 2016 10:59:35-XuR][GReader] 2,918,130 kmers, 58.4MB filesize
[25 Apr 2016 10:59:35-XuR][GReader] Loaded 2,918,130 / 2,918,130 (100.00%) of kmers parsed
[25 Apr 2016 10:59:35-XuR][hasht] buckets: 131,072 [2^17]; bucket size: 30; memory: 60.2MB; occupancy: 2,918,130 / 3,932,160 (74.21%)
[25 Apr 2016 10:59:35-XuR][hasht] collisions 0: 2901076
[25 Apr 2016 10:59:35-XuR][hasht] collisions 1: 16416
[25 Apr 2016 10:59:35-XuR][hasht] collisions 2: 609
[25 Apr 2016 10:59:35-XuR][hasht] collisions 3: 25
[25 Apr 2016 10:59:35-XuR][hasht] collisions 4: 3
[25 Apr 2016 10:59:35-XuR][hasht] collisions 5: 1
[25 Apr 2016 10:59:35-XuR][FileFilter] Loading file infer.ctp.gz [1 colour]
[25 Apr 2016 10:59:35-XuR][GPathSet] Allocating for 32,768 paths, 32KB colset, 384KB seq => 1MB total
[25 Apr 2016 10:59:35-XuR] Loaded 7,216 paths from 1,688 kmers
[25 Apr 2016 10:59:35-XuR][Assemble] Assembling contigs with 2 threads, walking colour 0
[25 Apr 2016 10:59:35-XuR][Assemble] Using missing info check: yes
[25 Apr 2016 10:59:35-XuR][Assemble] Stop traversal if step cummulative confidence < 0.500000
[25 Apr 2016 10:59:35-XuR][Assemble] Writing contigs to foo1.fa
[25 Apr 2016 10:59:35-XuR][Assemble] Seeding with random kmers...
[src/tools/assemble_stats.c:43] Assert Failed assem2str(): strlen(assem_stop_str[assem]) < size
[25 Apr 2016 10:59:35-XuR] Assert Error
Aborted (core dumped)
Hello
I tried to follow the "Workflow Examples" in the wiki to do variant calling with two plant genotypes based on paired end reads. Here is the sequence of commands I used:
ctx63 build -m 150G -t 10 -k 61 --sample genotype1 --seq2 genotype1_r1.fastq:genotype1_r2.fastq genotype1.ctx
ctx63 build -m 150G -t 10 -k 61 --sample genotype2 --seq2 genotype2_r1.fastq:genotype2_r2.fastq genotype2.ctx
ctx63 build -m 150G -t 10 -k 61 --sample reference --seq reference.fa reference.ctx
ctx63 clean -m 150G -t 10 --out genotype1.clean.ctx genotype1.ctx
ctx63 clean -m 150G -t 10 --out genotype2.clean.ctx genotype2.ctx
ctx63 join -m 150G --out refAndSamples.clean.ctx reference.ctx genotype1.clean.ctx genotype2.clean.ctx
ctx63 inferedges -m 150G -t 10 refAndSamples.clean.ctx
ctx63 thread -m 150G -t 10 --seq2 genotype1_r1.fastq:genotype1_r2.fastq --out genotype1.ctp.gz refAndSamples.clean.ctx:1
ctx63 thread -m 150G -t 10 --seq2 genotype2_r1.fastq:genotype2_r2.fastq --out genotype2.ctp.gz refAndSamples.clean.ctx:2
Unfortunately, both read threading steps failed with this error:
[src/alignment/correct_aln_stats.c:137] Assert Failed correct_aln_stats_print_summary(): num_reads > 0
[15 Aug 2014 16:22:40-keZ] Assert Error
I tried to "fix" this by adding this code above line 135 in alignment/correct_aln_stats.c:
if(num_reads == 0)
{
  status("[CorrectAln] Didn't see any SE reads");
}
After recompiling, the threading runs smoothly. However, now I am stuck at the contigs command
ctx63 contigs -m 150G -t 10 -p 1:genotype1.ctp.gz -p 2:genotype2.ctp.gz refAndSamples.clean.ctx
which yields:
[src/graph_paths/gpath_checks.c:143] Error gpath_load_sample_pop(): No point loading paths for colours other than --colour 0
[18 Aug 2014 09:42:17-gOC] Fatal Error
Any clues what went wrong?
e.g.
ATGCGTTATATTCGCCTGTGT
{
"key": "ACACAGGCGAATATAACGCAT", "colours": [1],
"left": "AT", "right": "A",
"edges": "a3",
"links": []
}
I'm not sure if this is intended behaviour though?
Running mccortex clean without any options prints the help output, which says "If neither -T or -U specified, just saves output statistics." However, the default behaviour seems to be to automatically apply -T and -U with computed values.
It would be good to integrate VCF annotation/filtering with the variant calling pipeline, e.g. to flag or remove homopolymer variants.
The next version of the graph format should use a JSON header.
I ran into the following error on a very small sample dataset:
[05 Sep 2016 02:26:45-HUD][version] mccortex=v0.0.3-503-gf025dbf zlib=1.2.7 htslib=1.3.1-61-ge87ae87 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[05 Sep 2016 02:26:45-HUD][memory] 179 bits per kmer
[05 Sep 2016 02:26:45-HUD][memory] graph: 3.6GB
[05 Sep 2016 02:26:45-HUD][memory](of which threads: 20 x 42991616 = 820MB)
[05 Sep 2016 02:26:45-HUD][memory] paths: 1.4GB
[05 Sep 2016 02:26:45-HUD][memory] total: 3.7GB of 377.6GB RAM
[05 Sep 2016 02:26:45-HUD][hasht] Allocating table with 171,966,464 entries, using 1.3GB
[05 Sep 2016 02:26:45-HUD][hasht] number of buckets: 4,194,304, bucket size: 41
[05 Sep 2016 02:26:45-HUD][graph] kmer-size: 31; colours: 3; capacity: 171,966,464
[05 Sep 2016 02:26:45-HUD][GPathReader] need 3589273 paths 21033608 bytes
[05 Sep 2016 02:26:45-HUD][GPathSet] Allocating for 3,589,273 paths, 3.4MB colset, 17.5MB seq => 82.5MB total
[05 Sep 2016 02:26:45-HUD][FileFilter] Reading file ./inb1_pl/k31/graphs/MA-ES-3.clean.ctx [1 src colour]
[05 Sep 2016 02:26:45-HUD][GReader] 121,137,470 kmers, 1.5GB filesize
[05 Sep 2016 02:27:15-HUD][GReader] Loaded 121,137,470 / 121,137,470 (100.00%) of kmers parsed
[05 Sep 2016 02:27:15-HUD][FileFilter] Reading file ./inb1_pl/k31/graphs/MN-DM-1.clean.ctx [1 src colour] with filter: 0->1
[05 Sep 2016 02:27:15-HUD][GReader] 127,420,164 kmers, 1.5GB filesize
[05 Sep 2016 02:27:40-HUD][hasht] buckets: 4,194,304 [2^22]; bucket size: 41; memory: 1.3GB; occupancy: 161,890,833 / 171,966,464 (94.14%)
[05 Sep 2016 02:27:40-HUD][hasht] collisions 0: 154224356
[05 Sep 2016 02:27:40-HUD][hasht] collisions 1: 5397571
[05 Sep 2016 02:27:40-HUD][hasht] collisions 2: 1426637
[05 Sep 2016 02:27:40-HUD][hasht] collisions 3: 495114
[05 Sep 2016 02:27:40-HUD][hasht] collisions 4: 195947
[05 Sep 2016 02:27:40-HUD][hasht] collisions 5: 82824
[05 Sep 2016 02:27:40-HUD][hasht] collisions 6: 36706
[05 Sep 2016 02:27:40-HUD][hasht] collisions 7: 16913
[05 Sep 2016 02:27:40-HUD][hasht] collisions 8: 7662
[05 Sep 2016 02:27:40-HUD][hasht] collisions 9: 3639
[05 Sep 2016 02:27:40-HUD][hasht] collisions 10: 1772
[05 Sep 2016 02:27:40-HUD][hasht] collisions 11: 834
[05 Sep 2016 02:27:40-HUD][hasht] collisions 12: 426
[05 Sep 2016 02:27:40-HUD][hasht] collisions 13: 219
[05 Sep 2016 02:27:40-HUD][hasht] collisions 14: 113
[05 Sep 2016 02:27:40-HUD][hasht] collisions 15: 52
[05 Sep 2016 02:27:40-HUD][hasht] collisions 16: 24
[05 Sep 2016 02:27:40-HUD][hasht] collisions 17: 12
[05 Sep 2016 02:27:40-HUD][hasht] collisions 18: 6
[05 Sep 2016 02:27:40-HUD][hasht] collisions 19: 6
[05 Sep 2016 02:27:40-HUD][hash_table.c:293] Fatal Error: Hash table is full
The error arises during the mccortex bubbles step of the pipeline.
When threading reads, a large amount of time is spent writing the link files (.ctp.gz) to disk. Currently the compression is single-threaded. If many threads compressed blocks in parallel, a single lock could be used to write one compressed block to disk at a time.
Not sure if we should use bgzf block compression from htslib or roll our own block headers.
E.g.
TGCCGGCGCGGGTGGCCGCCG
{
"key": "CGGCGGCCACCCGCGCCGGCA", "colours": [1],
", "right": "C",
"edges": "02",
"links": []
}
Looks like it's missing the "left" entry partially.
Add a new command, query, to query sorted and indexed graphs on disk:
mccortex31 query [-k <kmer>|-1 <kmers.fa>|-o out.ctx] <graph.ctx> <graph.ctx.idx>
Output is either in the same format as the view command or in graph format, written to disk.
_ _ _
_/a\__/b\__/c\_
\_/ \_/ \_/
A B C
We currently report adjacent bubbles as one big bubble. For the given example we report abc|ABC, bc|BC, c|C, and the reverse calls. Instead we should only report stand-alone bubbles: a|A, b|B, c|C.
Any plans to output your actual contigs in GFA?
Inferring edges is only required with multiple samples AND links. If not required, we can skip it and improve performance. This is risky, so not a suggested setting.
Hello,
as discussed briefly with Zam, it would be pretty neat if cortex could handle two layers of colors:
I am using mccortex compiled from git (SHA: 22e27b7) using make all.
Configure graph walking with a common argument across commands:
-a,--assembly [missInfoCheck=T|F,minCumulConf=,minStepConf=]
e.g.
mccortex31 contigs --assembly missInfoCheck=F,minStepConf=0.99 graph.ctx > contigs.fa
Hi @noporpoise,
I'm running mccortex building and cleaning on a bunch of samples from the ENA and a small percentage are failing on the cleaning step without much information as to why.
Here's a dump of the log. Building works fine but then cleaning dies with Assert Failed db_graph_alloc(): num_of_cols > 0
Any ideas?
[29 Dec 2016 11:12:04-QaP][cmd] /users/iqbal/phelimb/apps/mccortex/bin/mccortex31 build -t 1 -m 7G -k 31 -s ERR233401 --seq /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz --seq /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:12:04-QaP][cwd] /gpfs2/well/iqbal/people/phelim
[29 Dec 2016 11:12:04-QaP][version] mccortex=v0.0.3-539-g22e27b7 zlib=1.2.8 htslib=1.3.2-134-g1bc5c56 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[29 Dec 2016 11:12:04-QaP] Saving graph to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:12:04-QaP][sample] 0: ERR233401
[29 Dec 2016 11:12:04-QaP][task] /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz; FASTQ offset: auto-detect, threshold: off; cut homopolymers: off; remove PCR duplicates: no; colour: 0
[29 Dec 2016 11:12:04-QaP][task] /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz; FASTQ offset: auto-detect, threshold: off; cut homopolymers: off; remove PCR duplicates: no; colour: 0
[29 Dec 2016 11:12:04-QaP][memory] 104 bits per kmer
[29 Dec 2016 11:12:04-QaP][memory] graph: 6.9GB
[29 Dec 2016 11:12:04-QaP][memory] total: 6.9GB of 94.6GB RAM
[29 Dec 2016 11:12:04-QaP] Writing 1 colour graph to /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:12:04-QaP][hasht] Allocating table with 570,425,344 entries, using 4.3GB
[29 Dec 2016 11:12:04-QaP][hasht] number of buckets: 16,777,216, bucket size: 34
[29 Dec 2016 11:12:04-QaP][graph] kmer-size: 31; colours: 1; capacity: 570,425,344
[29 Dec 2016 11:12:04-QaP][hasht] buckets: 16,777,216 [2^24]; bucket size: 34;
[29 Dec 2016 11:12:04-QaP][hasht] memory: 4.3GB; filled: 0 / 570,425,344 (0.00%)
[29 Dec 2016 11:12:04-QaP][asyncio] Inputs: 2; Threads: 1
[29 Dec 2016 11:12:04-QaP][seq] Parsing sequence file /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz
[29 Dec 2016 11:12:04-QaP][seq] Parsing sequence file /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz
[29 Dec 2016 11:12:04-QaP] /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz: Qual scores: Sanger (Phred+33) [offset: 33, range: [33,73], sample: [33,69]]
[29 Dec 2016 11:12:04-QaP] /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz: Qual scores: Sanger (Phred+33) [offset: 33, range: [33,73], sample: [33,73]]
[29 Dec 2016 11:17:05-QaP][BuildGraph] Read 5,000,000 entries (reads / read pairs)
[29 Dec 2016 11:20:57-QaP][BuildGraph] Read 10,000,000 entries (reads / read pairs)
[29 Dec 2016 11:21:02-QaP][seq] Loaded 5,082,194 reads and 0 reads pairs (file: /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz)
[29 Dec 2016 11:21:04-QaP][seq] Loaded 5,082,194 reads and 0 reads pairs (file: /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz)
[29 Dec 2016 11:21:04-QaP][hasht] buckets: 16,777,216 [2^24]; bucket size: 34;
[29 Dec 2016 11:21:04-QaP][hasht] memory: 4.3GB; filled: 488,170,572 / 570,425,344 (85.58%)
[29 Dec 2016 11:21:04-QaP][hasht] collisions 0: 477616581
[29 Dec 2016 11:21:04-QaP][hasht] collisions 1: 9132135
[29 Dec 2016 11:21:04-QaP][hasht] collisions 2: 1181354
[29 Dec 2016 11:21:04-QaP][hasht] collisions 3: 195799
[29 Dec 2016 11:21:04-QaP][hasht] collisions 4: 35793
[29 Dec 2016 11:21:04-QaP][hasht] collisions 5: 7104
[29 Dec 2016 11:21:04-QaP][hasht] collisions 6: 1436
[29 Dec 2016 11:21:04-QaP][hasht] collisions 7: 297
[29 Dec 2016 11:21:04-QaP][hasht] collisions 8: 60
[29 Dec 2016 11:21:04-QaP][hasht] collisions 9: 11
[29 Dec 2016 11:21:04-QaP][hasht] collisions 11: 2
[29 Dec 2016 11:21:04-QaP][task] input: /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_1.fastq.gz colour: 0
[29 Dec 2016 11:21:04-QaP] SE reads: 10,164,388 PE reads: 0
[29 Dec 2016 11:21:04-QaP] good reads: 10,163,375 bad reads: 1,013
[29 Dec 2016 11:21:04-QaP] dup SE reads: 0 dup PE pairs: 0
[29 Dec 2016 11:21:04-QaP] bases read: 1,524,658,200 bases loaded: 1,520,553,440
[29 Dec 2016 11:21:04-QaP] num contigs: 10,186,918 num kmers: 1,214,945,900 novel kmers: 488,170,572
[29 Dec 2016 11:21:04-QaP][task] input: /well/iqbal/projects/atlas/fastq/ERR233/ERR233401/ERR233401_2.fastq.gz colour: 0
[29 Dec 2016 11:21:04-QaP] SE reads: 0 PE reads: 0
[29 Dec 2016 11:21:04-QaP] good reads: 0 bad reads: 0
[29 Dec 2016 11:21:04-QaP] dup SE reads: 0 dup PE pairs: 0
[29 Dec 2016 11:21:04-QaP] bases read: 0 bases loaded: 0
[29 Dec 2016 11:21:04-QaP] num contigs: 0 num kmers: 0 novel kmers: 0
[29 Dec 2016 11:21:04-QaP] Dumping graph...
[29 Dec 2016 11:21:04-QaP][graphwriter] Saving file to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:21:04-QaP][FileFilter] Writing graph [1 src colour]
[29 Dec 2016 11:21:47-QaP][graphwriter] Dumped 488,170,572 kmers in 1 colour into: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx (ver: 6)
[29 Dec 2016 11:21:47-QaP][memory] We made 19 allocs
[29 Dec 2016 11:21:47-QaP] Done.
[29 Dec 2016 11:21:47-QaP][time] 583.00 seconds (9 mins 43 secs)
[29 Dec 2016 11:21:47-mIT][cmd] /users/iqbal/phelimb/apps/mccortex/bin/mccortex31 clean -m 6GB -B 2 -U -T -o /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/ERR233401.ctx -c /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_before.csv -C /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_after.csv -l /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_lbefore.csv -L /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_lafter.csv /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/uncleaned/ERR233401.ctx
[29 Dec 2016 11:21:47-mIT][cwd] /gpfs2/well/iqbal/people/phelim
[29 Dec 2016 11:21:47-mIT][version] mccortex=v0.0.3-539-g22e27b7 zlib=1.2.8 htslib=1.3.2-134-g1bc5c56 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[29 Dec 2016 11:21:47-mIT] Actions:
[29 Dec 2016 11:21:47-mIT] 0. Saving kmer coverage distribution to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_before.csv
[29 Dec 2016 11:21:47-mIT] 1. Saving unitig length distribution to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_lbefore.csv
[29 Dec 2016 11:21:47-mIT] 2. Cleaning tips shorter than 62 nodes
[29 Dec 2016 11:21:47-mIT] 3. Cleaning unitigs with auto-detected threshold
[29 Dec 2016 11:21:47-mIT] 4. Saving kmer coverage distribution to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_after.csv
[29 Dec 2016 11:21:47-mIT] 5. Saving unitig length distribution to: /well/iqbal/people/phelim/bins/k31/ERR233/ERR233401/cleaned/stats/ERR233401_lafter.csv
[29 Dec 2016 11:21:47-mIT][memory] 72 bits per kmer
[29 Dec 2016 11:21:47-mIT][memory] graph: 5.5GB
[29 Dec 2016 11:21:47-mIT][cleaning] 1 input graph, max kmers: 488,170,572, using 0 colours
[29 Dec 2016 11:21:47-mIT][memory] total: 5.5GB of 94.6GB RAM
[src/graph/db_graph.c:39] Assert Failed db_graph_alloc(): num_of_cols > 0
[29 Dec 2016 11:21:47-mIT] Assert Error
I am trying to clean a joined graph containing three colors as follows:
ctx63 clean ${GRAPH_DIR} -o /refAndSamples.basalAndropogonae.clean.ctx ${GRAPH_DIR}/refAndSamples.basalAndropogonae.ctx
I get the following output:
[05 Jun 2014 23:46:12-Wij][cmd] /programs/mccortex/bin/ctx63 clean /workdir/er432/andropogonae/mccortex_out -o /refAndSamples.basalAndropogonae.clean.ctx /workdir/er432/andropogonae/mccortex_out/refAndSamples.basalAndropogonae.ctx
[05 Jun 2014 23:46:12-Wij][cwd] /local/workdir/er432/andropogonae/mccortex_out
[05 Jun 2014 23:46:12-Wij][version] ctx=470b4ca zlib=1.2.3 htslib=0.2.0-rc8-6-gd49dfa6 ASSERTS=ON CHECKS=ON k=33..63
[src/kmer/graph_reader.c:101] Error graph_file_read_header(): Couldn't read 'Magic word': expected 6; recieved: 0; [file: /workdir/er432/andropogonae/mccortex_out]
[05 Jun 2014 23:46:12-Wij] Fatal Error
I try running the following:
$MCCORTEX contigs -m 490G -n 12G --colour 1 -p 0:Coelorachis.clean.ctp -p 1:Vossia.k63.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx > Vossia.clean.k63.fa
And I get this:
[16 Jun 2014 13:01:26-cEm][cmd] /programs/mccortex_5_30_14/bin/ctx63 contigs -m 490G -n 12G --colour 1 -p 0:Coelorachis.clean.ctp -p 1:Vossia.k63.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx
[16 Jun 2014 13:01:26-cEm][cwd] /local/workdir/er432/andropogonae/mccortex_out
[16 Jun 2014 13:01:26-cEm][version] ctx=v0.0.3 zlib=1.2.3 htslib=0.2.0-rc8-6-gd49dfa6-dirty ASSERTS=ON CHECKS=ON k=33..63
[16 Jun 2014 13:01:26-cEm][memory] graph: 305GB
[16 Jun 2014 13:01:26-cEm][memory] paths: 49.6GB
[16 Jun 2014 13:01:26-cEm][memory] total: 354.6GB of 504.8GB RAM
[16 Jun 2014 13:01:26-cEm][hashtable] Allocating table with 12,884,901,888 entries, using 192.5GB
[16 Jun 2014 13:01:26-cEm][hashtable] number of buckets: 268,435,456, bucket size: 48
[16 Jun 2014 13:02:50-cEm][graph] kmer-size: 63; colours: 3; capacity: 12,884,901,888
[16 Jun 2014 13:04:27-cEm][paths] Setting up path store to use 49.6GB main
[16 Jun 2014 13:04:27-cEm] Loading file refAndSamples.basalAndropogonae.inferredEdges.clean.ctx [3 colours] into colours 0-2
[16 Jun 2014 13:04:27-cEm] 2,223,283,362 kmers, 64.2GB filesize
[16 Jun 2014 13:04:27-cEm][CtxLoad] First col 0, into cols 0..2, file has 3 cols: refAndSamples.basalAndropogonae.inferredEdges.clean.ctx
[16 Jun 2014 13:14:42-cEm] Loaded 2,223,283,362 / 2,223,283,362 (100.00%) of kmers parsed
[16 Jun 2014 13:14:42-cEm][hash] buckets: 268,435,456 [2^28]; bucket size: 48; memory: 192.5GB; occupancy: 2,223,283,362 / 12,884,901,888 (17.25%)
[16 Jun 2014 13:14:42-cEm] collisions 0: 2223283362
[16 Jun 2014 13:14:42-cEm][PathFormat] With 2 files, require 11859397612 tmp memory [0 extra bytes]
[16 Jun 2014 13:14:42-cEm] Loading file Coelorachis.clean.ctp [1 colour] into colour 0
[16 Jun 2014 13:14:42-cEm] 2,039,725,230 paths, 38.6GB path-bytes, 27,492,743 kmers, 39.2GB filesize
[16 Jun 2014 13:16:45-cEm][paths] Setup tmp path memory to use 11GB [remaining 38.6GB]
[16 Jun 2014 13:16:45-cEm] Loading file Vossia.k63.clean.ctp [1 colour] with colour filter: 0 into colour 1
[16 Jun 2014 13:16:45-cEm] 633,841,256 paths, 11GB path-bytes, 25,553,986 kmers, 11.6GB filesize
[src/kmer/path_store.c:186] Error path_store_add_packed(): Out of memory for paths
[16 Jun 2014 13:18:45-cEm] Fatal Error
Running only with the path for Vossia as follows: $MCCORTEX contigs -m 490G -n 12G --ncontigs 1000000 --print --colour 1 -p 1:Vossia.k63.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx > Vossia.clean.k63.fa
Gives this:
[16 Jun 2014 12:43:19-cEm][cmd] /programs/mccortex_5_30_14/bin/ctx63 contigs -m 490G -n 12G --ncontigs 1000000 --colour 1 -p 1:Vossia.k63.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx
[16 Jun 2014 12:43:19-cEm][cwd] /local/workdir/er432/andropogonae/mccortex_out
[16 Jun 2014 12:43:19-cEm][version] ctx=v0.0.3 zlib=1.2.3 htslib=0.2.0-rc8-6-gd49dfa6-dirty ASSERTS=ON CHECKS=ON k=33..63
[16 Jun 2014 12:43:19-cEm][memory] graph: 305GB
[16 Jun 2014 12:43:19-cEm][memory] paths: 11GB
[16 Jun 2014 12:43:19-cEm][memory] total: 316GB of 504.8GB RAM
[16 Jun 2014 12:43:19-cEm][hashtable] Allocating table with 12,884,901,888 entries, using 192.5GB
[16 Jun 2014 12:43:19-cEm][hashtable] number of buckets: 268,435,456, bucket size: 48
[16 Jun 2014 12:44:45-cEm][graph] kmer-size: 63; colours: 3; capacity: 12,884,901,888
[16 Jun 2014 12:46:28-cEm][paths] Setting up path store to use 11GB main
[16 Jun 2014 12:46:28-cEm] Loading file refAndSamples.basalAndropogonae.inferredEdges.clean.ctx [3 colours] into colours 0-2
[16 Jun 2014 12:46:28-cEm] 2,223,283,362 kmers, 64.2GB filesize
[16 Jun 2014 12:46:28-cEm][CtxLoad] First col 0, into cols 0..2, file has 3 cols: refAndSamples.basalAndropogonae.inferredEdges.clean.ctx
[16 Jun 2014 12:57:16-cEm] Loaded 2,223,283,362 / 2,223,283,362 (100.00%) of kmers parsed
[16 Jun 2014 12:57:16-cEm][hash] buckets: 268,435,456 [2^28]; bucket size: 48; memory: 192.5GB; occupancy: 2,223,283,362 / 12,884,901,888 (17.25%)
[16 Jun 2014 12:57:16-cEm] collisions 0: 2223283362
[16 Jun 2014 12:57:16-cEm][PathFormat] With 1 files, require 0 tmp memory [0 extra bytes]
[16 Jun 2014 12:57:16-cEm] Loading file Vossia.k63.clean.ctp [1 colour] with colour filter: 0 into colour 1
[16 Jun 2014 12:57:16-cEm] 633,841,256 paths, 11GB path-bytes, 25,553,986 kmers, 11.6GB filesize
[src/kmer/path_format.c:476] Assert Failed paths_format_merge(): hdr->num_path_bytes == 0 || pstore->tmpstore != ((void *)0)
[16 Jun 2014 12:57:16-cEm] Assert Error
However, I can successfully run when I only try to get contigs for color 0, as follows:
$MCCORTEX contigs -m 490G -n 12G --ncontigs 1000000 --print --colour 0 -p 0:Coelorachis.clean.ctp refAndSamples.basalAndropogonae.inferredEdges.clean.ctx > Coelorachis.clean.k63.fa
Using the reads command to obtain the subset of read pairs represented in the population graph, e.g. using the command below:
$ mccortex31 reads -m 460G -n 6G -t 52 -2 R1.fastq.gz:R2.fastq.gz:pg pg.ctx
produces three output files:
pg.1.fq.gz
pg.2.fq.gz
pg.fq.gz
The last file is empty. That's not expected, right?
https://github.com/mcveanlab/mccortex/wiki/Workflow-Examples#read-threading
The example for thread with --seq2 R1 R2 should be --seq2 R1:R2.
Write a server command to respond to queries about kmers and links. Requests would be HTTP and responses in JSON. At first the server would hold the graph and links in memory and the data would be read only. Javascript apps could then be written to visualise and interact with the graph.
The tests/pipeline test sometimes fails. It could be due to unlucky coverage, but it seems too common at 50X coverage.
The help says "If neither -t or -s specified, just saves output statistics.", but -t is threads and -s doesn't exist?
usage: mccortex63 clean [options] <in.ctx> [in2.ctx ...]
Clean a cortex graph. Joins graphs first, if multiple inputs given.
If neither -t or -s specified, just saves output statistics.
-h, --help This help message
-q, --quiet Silence status output normally printed to STDERR
-f, --force Overwrite output files
-o, --out <out.ctx> Save output graph file [required]
-m, --memory <mem> Memory to use
-n, --nkmers <kmers> Number of hash table entries (e.g. 1G ~ 1 billion)
-t, --threads <T> Number of threads to use [default: 2]
-N, --ncols <N> Number of graph colours to use
Cleaning:
-T[L], --tips[=L] Clip tips shorter than <L> kmers [default: auto]
-U[X], --unitigs[=X] Remove low coverage unitigs with median cov < X [default: auto]
-B, --fallback <T> Fall back threshold if we can't pick
Statistics:
-c, --covg-before <out.csv> Save kmer coverage histogram before cleaning
-C, --covg-after <out.csv> Save kmer coverage histogram after cleaning
-l, --len-before <out.csv> Save unitig length histogram before cleaning
-L, --len-after <out.csv> Save unitig length histogram after cleaning
--unitigs without a threshold, causes a calculated threshold to be used
Default: --tips 2*kmer_size --unitigs
Set thresholds to zero to turn-off cleaning
Add a breakpoint ref filter, similar to the --haploid <col> filter for the bubble caller, to filter out calls where the non-ref path has ref coverage. Suggest -H,--haploid-ref.
The breakpoint caller should mark kmers if they are used in a call. Unused kmers that are in a sample and not the reference ("novel kmers") should be used to seed a second breakpoint caller.
The novel kmer breakpoint caller should perform a breadth-first search (BFS) to find the nearest kmer that occurs in the reference either side. The shortest path between these kmers should be reported as a putative breakpoint call. This caller will have a high false positive rate, but may catch a few reasonable SNPs / indels.
Using a filename that contains a colon as input results in an error:
$ ./ctx31 build -k 31 -sA -1 A:2.fasta out
[07 Jul 2014 13:34:59-Zid][cmd] ./ctx31 build -k 31 -sA -1 A:2.fasta out
[07 Jul 2014 13:34:59-Zid][cwd] /home/jk/work/wt/prg/hla_graph
[07 Jul 2014 13:34:59-Zid][version] ctx=v0.0.3-29-gf1407a7 zlib=1.2.7 htslib=0.2.0-rc8-6-gd49dfa6 ASSERTS=ON CHECKS=ON k=3..31
[src/basic/async_read_io.c:51] Error asyncio_task_parse(): Expected -1
[07 Jul 2014 13:34:59-Zid] Fatal Error
One can pass multiple graphs to the popbubbles command at once. Will it join the graphs first, or pop bubbles in each graph and then join? I am assuming it's the former, as for the clean command. It would be helpful to clarify this in the help output (as for clean), or at least on the wiki for now.
Hello, does the current McCortex perform more reliably than the stable cortex release? Or might there be major undiscovered problems?
Thanks,
Yannick
When running the standard pipeline, the vcfcov step doesn't seem to include the additional NKMERS argument. I've found this to be a problem for larger graphs.
Genotype a VCF against a multicolour graph. Much of the work has already been done in ctx_geno.c.
a la https://stackoverflow.com/questions/29534519/why-gcc-doesnt-recognize-rdynamic-option
make all | grep bcftools
cd bcftools && /Applications/Xcode.app/Contents/Developer/usr/bin/make
gcc -rdynamic -o bcftools main.o vcfindex.o tabix.o vcfstats.o vcfisec.o vcfmerge.o vcfquery.o vcffilter.o filter.o vcfsom.o vcfnorm.o vcfgtcheck.o vcfview.o vcfannotate.o vcfroh.o vcfconcat.o vcfcall.o mcall.o vcmp.o gvcf.o reheader.o convert.o vcfconvert.o tsv2vcf.o vcfcnv.o HMM.o vcfplugin.o consensus.o ploidy.o version.o ccall.o em.o prob1.o kmin.o ../htslib/libhts.a -lpthread -lz -lm -ldl
gcc: error: unrecognized command line option '-rdynamic'
make[2]: *** [bcftools] Error 1
make[1]: *** [bcftools] Error 2
make: *** [libs-other] Error 2
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/local/libexec/gcc/x86_64-apple-darwin13/4.9.3/lto-wrapper
Target: x86_64-apple-darwin13
Configured with: /opt/local/var/macports/build/_opt_mports_dports_lang_gcc49/gcc49/work/gcc-4.9.3/configure --prefix=/opt/local --build=x86_64-apple-darwin13 --enable-languages=c,c++,objc,obj-c++,lto,fortran,java --libdir=/opt/local/lib/gcc49 --includedir=/opt/local/include/gcc49 --infodir=/opt/local/share/info --mandir=/opt/local/share/man --datarootdir=/opt/local/share/gcc-4.9 --with-local-prefix=/opt/local --with-system-zlib --disable-nls --program-suffix=-mp-4.9 --with-gxx-include-dir=/opt/local/include/gcc49/c++/ --with-gmp=/opt/local --with-mpfr=/opt/local --with-mpc=/opt/local --with-isl=/opt/local --disable-isl-version-check --with-cloog=/opt/local --disable-cloog-version-check --enable-stage1-checking --disable-multilib --enable-lto --enable-libstdcxx-time --with-as=/opt/local/bin/as --with-ld=/opt/local/bin/ld --with-ar=/opt/local/bin/ar --with-bugurl=https://trac.macports.org/newticket --with-pkgversion='MacPorts gcc49 4.9.3_0'
Thread model: posix
gcc version 4.9.3 (MacPorts gcc49 4.9.3_0)
Use a buffer to read a block of a file, then decompose it. stream_buffer.h already supports buffered reading, so it should be pretty simple.
mmap() failing for very large files:
[src/graph/graph_reader.c:690] Error graph_files_merge(): Cannot memory map file: graph.ctx [Cannot allocate memory]
Should try using fseek / fread / fwrite in blocks of ~10,000 kmers.
Consider writing links to disk to reduce peak memory.
Writing links to ~256 files, which can then be sorted and merged.
Temporary file will look like:
<kmer><junctions><tab><count>
Sorting can be done with the unix sort command.
mccortex is quite chatty, which is useful to give users information about what is going on. However, in general, a user should be able to control the amount of information printed out.
One option is to provide a -v option, which turns up verbosity. Multiple -v options provided increase the verbosity. This usually implies that the program is silent by default. Another alternative is to provide a -q --quiet option to allow a user to silence the program.
Are k-mers containing N automatically ignored?
The bubble caller has less power to assemble the reference allele vs sample allele since the reference colour is just a bag of kmers without link information. Reference sequences can provide reasonable link info if we break at repetitive kmers.
It would be good if it hyperlinked to the paper or pubmed abstract:
Would be nice if the McCortex pipeline could automatically generate a report with plots and figures on the processing that has been done, summary of the results, QA etc.
It would require R and LaTeX as dependencies.
The last release is quite old: 414 commits to master since this release. Can you make a new pre-release? It will help me package it.
We are getting this error message during installation. Any ideas on how to resolve this? Thanks.
make all
cd libs; make
make[1]: Entering directory '/home/microstaff/Software/mccortex/libs'
make[1]: *** No rule to make target 'xxHash/Makefile', needed by 'xxHash'. Stop.
make[1]: Leaving directory '/home/microstaff/Software/mccortex/libs'
Makefile:324: recipe for target 'libs/string_buffer/string_buffer.h' failed
make: *** [libs/string_buffer/string_buffer.h] Error 2
Hi Mac
/home/mgonnet/Soft/mccortex//bin/mccortex31 build -m 1G -t 2 -k 27 --sample 74_373 --seq2 74_373_1.fastq:74_373_2.fastq out/k27/graphs/74_373.raw.ctx >& out/k27/graphs/74_373.raw.ctx.log
my_job_script.mk:329: recipe for target 'out/k27/graphs/74_373.raw.ctx' failed
make: *** [out/k27/graphs/74_373.raw.ctx] Error 1
make: *** Deleting file 'out/k27/graphs/74_373.raw.ctx'
The end of the corresponding log file that produced the issue is:
[05 Aug 2015 19:05:07-vab][hasht] collisions 15: 10
[05 Aug 2015 19:05:07-vab][hasht] collisions 16: 6
[05 Aug 2015 19:05:07-vab][hasht] collisions 17: 6
[src/graph/hash_table.c:311] Error hash_table_find_or_insert_mt(): Hash table is full
[05 Aug 2015 19:05:07-vab] Fatal Error
any idea?
Best regards
Not sure what the spec says, but Bandage expects a tab, not a space, after the initial H on line 1. (GFA fields are tab-separated throughout, so a tab is correct here.)
Currently McCortex outputs a space.
I think out.ctp should be out.ctx?
usage: mccortex63 inferedges [options] <pop.ctx>
-o, --out <out.ctp> Save output file