<div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clip

-k 63 worked just fine! Are the files named <code cla

these *.glue.* are intermediate filed used by <code c

k=64 bug & unrelated discussions on assembly ;),about gatb/bcalm

Comments (38)

rchikhi commented on May 31, 2024 1

closing this discussion because now bcalm handles GFA output, but let's keep in mind it was a great discussion :)

from bcalm.

sjackman commented on May 31, 2024 1

I have a particular interest in support for gap G edges. For my current purposes, they could be visualized the same as regular overlap E edges. Support for displaying paths would be a cool new feature.

from bcalm.

rchikhi commented on May 31, 2024

thanks for posting those bug issues! i guess it's limited to k<=63 then. will provide support for k>=64 shortly

from bcalm.

sjackman commented on May 31, 2024

No worries. Thanks for the quick response.

from bcalm.

sjackman commented on May 31, 2024

-k 63 worked just fine! Are the files named *.glue.* of any use? Their sum size is 18 GB and the final FASTA file is 4.9 GB.

from bcalm.

rchikhi commented on May 31, 2024

these *.glue.* are intermediate filed used by bglue.
you need to run ./bglue [same arguments] after bcalm, (for now.. the software is splitted into 2 modules and there's no wrapper). after bglue, you can delete them. oh, since you already have a final fasta file, then indeed, delete them.

from bcalm.

sjackman commented on May 31, 2024

I've run bglue and got the final unitigs file. I'll delete those glue files now. Thanks for the clarification.

from bcalm.

rchikhi commented on May 31, 2024

sounds good. a 4.8 GB unitigs file, looks like you've completed a human run :) how'd it go?

from bcalm.

sjackman commented on May 31, 2024

Easy to run. Fast and memory efficient. I'm impressed. You have no idea how many eukaryotic assemblers have segfaulted on me this week. DSK took 8.4h, bcalm took 37m and bglue took 18m with maxRSS of 4423 MB and 47708 MB respectively.

bcalm -in pe400.in -out hsapiens-unitigs -k 64 -abundance 5 -nb-cores 64

Contiguity stats

n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
23.05e6	1413270	291114	589225	1195	500	943	1969	4622	3107	43099	2.183e9	hsapiens-unitigs.fa

Alignment to GRCh38

NGA50	Breakpoints
1191	236

from bcalm.

sjackman commented on May 31, 2024

Here are the sizes of the 64 buckets (.glue. files) in MB (sorted by size):
123 127 127 131 131 139 139 144 147 148 149 150 151 157 161 163 163 168 176 177 185 187 188 189 189 190 195 201 203 205 206 208 212 214 224 225 227 230 232 233 258 261 283 288 321 338 366 371 426 485 501 502 507 508 512 513 514 518 528 542 550 551 552 567

from bcalm.

sjackman commented on May 31, 2024

Stats for ABySS 1.9.0 with k=144 at the first output file (hsapiens-1.fa), which includes the error removal algorithms (erosion, tip trimming, low mean coverage unitig removal) and bubble popping of non-indel bubbles. It'd be fun to try with some of these algorithms disabled. The data set is the GIAB HG004 250-bp Illumina reads after error correction with BFC.

n	n:500	L50	LG50	NG50	min	N80	N50	N20	E-size	max	sum	name
3590968	727484	98509	120204	6554	500	2921	7580	16969	11112	214335	2.782e9	hsapiens-1.fa

ABySS took 5h59m and 644 MB GB maxRSS.

from bcalm.

rchikhi commented on May 31, 2024

thanks for the extensive feedback. Whoa, 644 MB maxRSS for Abyss is very impressive! just ~2 bits per kmer.. is it (1) a typo, (2) using the disk, (3) using the new bloom filter variant?

47 GB maxRSS for bglue is abnormal.. oh, I think it's because of the number of threads (64). With 16 threads it'd take 4x more time but 4x less mem.

from bcalm.

sjackman commented on May 31, 2024

Hah! Typo. Fixed above. Sorry. That's the standard hash table implementation. We're working on reporting the Bloom filter implementation.

Any thoughts on the uneven distribution of buckets, and whether it can be improved?

from bcalm.

sjackman commented on May 31, 2024

If you want to try some error removal of the BCALM unitigs, abyss-filtergraph takes a GFA file as input and can filter based on length/coverage/topology. ABySS PopBubbles takes a FASTA file and GFA file does what you'd expect. AdjList and abyss-overlap take a FASTA file and produce a GFA file. See their usage here: https://github.com/bcgsc/abyss/blob/master/bin/abyss-pe#L473-L494
Add --assemble to abyss-filtergraph to compact unambiguous edges.

from bcalm.

rchikhi commented on May 31, 2024

re: abyss-filtergraph. this would be neat! ~~would there happen to also be a program that does fasta2gfa. I.e. find all overlaps in multi-fasta sequences that have exact (k-1)-overlaps?~~ Nevermind, just saw your sentence on abyss-overlap! looks like it could be straightforward then. Thanks much for the pointers.
regarding uneven buckets: each *.glue.* file is the result of all compactions done by a given thread (referenced by thread ID). At a high level, the algorithm does multiple passes. At each pass, it takes a group of buckets (1 bucket = 1 minimizer) and assigns a bucket to an available thread (implemented with a thread pool). Groups of buckets are balanced, overall: making sure that the number of all kmers across all threads is more or less constant. So, groups have the same size. However, within a group, some buckets may have more kmers than others (due to partitioning with minimizers, even when using frequency ordering). Hence, I think this is what explains the difference in *.glue.* file sizes: some threads are unlucky and got assigned heavier buckets overall. But it turns out it actually doesn't matter :) The immediate next step is bglue, which reads all glues sequentially, then performs some other partitioning.

from bcalm.

sjackman commented on May 31, 2024

re: abyss-filtergraph. this would be neat! would there happen to also be a program that does fasta2gfa. I.e. find all overlaps in multi-fasta sequences that have exact (k-1)-overlaps? Nevermind, just saw your sentence on abyss-overlap! looks like it could be straightforward then. Thanks much for the pointers.

AdjList finds exact k-1 overlaps much faster than abyss-overlap. It uses a hash table for k-1 overlaps and a suffix array for m ≤ x < k-1 overlaps. abyss-overlap uses an FM-index to find arbitrary overlaps, but does not do transitive reduction. For that, use sga index && sga overlap and convert the ASQG file to GFA using abyss-todot --gfa.

from bcalm.

rchikhi commented on May 31, 2024

Oh ok, then AdjList it is then. Should remind myself that AdjList == Unitigs2Gfa :) Thanks.

Do you actually look for shorter overlaps than (k-1) in abyss? Then it's not strictly a dBG..

from bcalm.

sjackman commented on May 31, 2024

ABYSS the dBG assembler does not look for <k-1 overlaps. After the unitigs (terminology aside) are built by ABYSS dBG, AdjList finds the m < x ≤ k-1 overlaps after the dBG assembly. At this point it's no longer a dBG but a sequence overlap graph.

from bcalm.

sjackman commented on May 31, 2024

Name your unitigs file ${name}-1.fa and you can run the entire abyss-pe contigging/scaffolding pipeline. It assumes that tip-trimming, low-coverage contig removal has already been done by ABYSS (dBG), so you'll need to tweak the parameters to abyss-filtergraph. Try

k=63
abyss-pe k=$k FILTERGRAPH_OPTIONS="-t$((2*k)) -c5 --assemble" …

from bcalm.

sjackman commented on May 31, 2024

47 GB maxRSS for bglue is abnormal.. oh, I think it's because of the number of threads (64). With 16 threads it'd take 4x more time but 4x less mem.

I thought that I had ran bglue with one thread, by omitting -nb-cores, but I had in fact used 128 threads, which is a reasonable default, I think. I reduced bglue to 12 threads, and that reduced its memory usage by a factor of 9. Incidentally, it also reduced the wall-clock run time by a factor of 2, which may or may not be caused by external factors like NFS file performance.

bcalm -in pe400.in -out hsapiens-unitigs -k 63 -abundance 5 -nb-cores 64
133458.02s user 8514.68s system 426% cpu 9:15:18.60 total 4423 MB

bglue -in hsapiens-unitigs.h5 -out hsapiens-unitigs.fa -k 63 -nb-cores 128
9026.65s user 6561.02s system 1290% cpu 20:08.08 total 47627 MB

bglue -in hsapiens-unitigs.h5 -out hsapiens-unitigs.fa -k 63 -nb-cores 12
4106.23s user 197.08s system 654% cpu 10:57.89 total 5382 MB

from bcalm.

rchikhi commented on May 31, 2024

interesting, thanks for the follow-up investigation!

from bcalm.

rchikhi commented on May 31, 2024

now bcalm outputs the mean kmer abundance per unitig, can abyss-pe take it into account somehow?

from bcalm.

sjackman commented on May 31, 2024

Yes, absolutely. That information either needs to be in the comment of each unitig in the FASTA file, or you can put it in the GraphViz .dot or GFA .gfa file. ABySS uses the sum k-mer abundance of the unitig by default (before dividing by the number of k-mers). It'd be easier to output that info for ABySS. abyss-todot can convert from sum coverage (the default) to mean coverage, but I don't think it will convert in the opposite direction.

from bcalm.

rchikhi commented on May 31, 2024

oh, nice. For bcalm, it's currently in the comment of the FASTA file, in the form:
>[id] MA=[mean abundance], e.g. >1 MA=20.9. What form should it be for abyss? (sure, I can add sum of abundances, it's just MA times number of kmers..)

from bcalm.

sjackman commented on May 31, 2024

I wish ABySS had used some typed comment like that, but no luck. It's

>ID LENGTH COVERAGE

LENGTH is length in bp (not k-mers) and COVERAGE is sum coverage (not mean).

from bcalm.

rchikhi commented on May 31, 2024

ok! no problem, it's a matter of writing another wrapper :)

from bcalm.

sjackman commented on May 31, 2024

I've thought of moving the FASTA comment to SAM-style tags.

>123 LN:i:100 KC:i:5678

from bcalm.

sjackman commented on May 31, 2024

I'd heartily recommend outputting a GFA file though with the sum-coverage in the KC:i tag.

from bcalm.

rchikhi commented on May 31, 2024

I agree, GFA would be useful. I'll keep it in mind. It isn't a straightforward patch because currently the glue step "forgets" about the graph structure.

from bcalm.

sjackman commented on May 31, 2024

Yeah, the ABySS dBG assembler also forgets the graph structure, and it's recreated using AdjList. Speaking of which, AdjList extracts the KC value from the FASTA header.

from bcalm.

rchikhi commented on May 31, 2024

so, probably a hacky fasta2gfa could be made for bcalm using AdjList, I suppose?

from bcalm.

sjackman commented on May 31, 2024

Certainly. It would be pretty easy I'd say. Would you consider changing the FASTA header output of bcalm to

>123 LN:i:100 KC:i:5678 KM:f:113.6

There's no support in ABySS for that format yet, but I could add it for the upcoming ABySS 2.0.0 release, and if we both do it, it's standard! AdjList would then work out of the box for bcalm. Note: KC is in the GFA spec. KM is not, but it could be added. It would be useful, I think.

I'm thinking with ABySS 2.0.0 to move the default graph format from GraphViz to GFA. GFA is currently an option with abyss-pe graph=gfa.

from bcalm.

rchikhi commented on May 31, 2024

sure thing, here's to a new standard!

from bcalm.

sjackman commented on May 31, 2024

from bcalm.

sjackman commented on May 31, 2024

That's great! GFA version 1 or 2?

from bcalm.

rchikhi commented on May 31, 2024

v1: L lines. Bandage accepts gfa2?

from bcalm.

sjackman commented on May 31, 2024

@rrwick Any designs on implementing GFA 2 in Bandage? Thanks again for the indispensable tool, Ryan.

from bcalm.

rrwick commented on May 31, 2024

Sorry for the late response... I missed this one!

I will someday (I promise!) make Bandage GFA2 compatible. Or at least for a subset of GFA2, as the GFA2 standard supports more stuff than Bandage can reasonably visualise. It's on my interminable to do list 😄

from bcalm.

k=64 bug & unrelated discussions on assembly ;) about bcalm HOT 38 CLOSED

Comments (38)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent