Comments (38)
closing this discussion because now bcalm handles GFA output, but let's keep in mind it was a great discussion :)
from bcalm.
I have a particular interest in support for gap G
edges. For my current purposes, they could be visualized the same as regular overlap E
edges. Support for displaying paths would be a cool new feature.
from bcalm.
thanks for posting those bug issues! i guess it's limited to k<=63 then. will provide support for k>=64 shortly
from bcalm.
No worries. Thanks for the quick response.
from bcalm.
-k 63
worked just fine! Are the files named *.glue.*
of any use? Their sum size is 18 GB and the final FASTA file is 4.9 GB.
from bcalm.
these *.glue.*
are intermediate filed used by bglue
.
you need to run ./bglue [same arguments]
after bcalm, (for now.. the software is splitted into 2 modules and there's no wrapper). after bglue, you can delete them. oh, since you already have a final fasta file, then indeed, delete them.
from bcalm.
I've run bglue
and got the final unitigs file. I'll delete those glue files now. Thanks for the clarification.
from bcalm.
sounds good. a 4.8 GB unitigs file, looks like you've completed a human run :) how'd it go?
from bcalm.
Easy to run. Fast and memory efficient. I'm impressed. You have no idea how many eukaryotic assemblers have segfaulted on me this week. DSK took 8.4h, bcalm took 37m and bglue took 18m with maxRSS of 4423 MB and 47708 MB respectively.
bcalm -in pe400.in -out hsapiens-unitigs -k 64 -abundance 5 -nb-cores 64
Contiguity stats
n | n:500 | L50 | LG50 | NG50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|---|---|
23.05e6 | 1413270 | 291114 | 589225 | 1195 | 500 | 943 | 1969 | 4622 | 3107 | 43099 | 2.183e9 | hsapiens-unitigs.fa |
Alignment to GRCh38
NGA50 | Breakpoints |
---|---|
1191 | 236 |
from bcalm.
Here are the sizes of the 64 buckets (.glue. files) in MB (sorted by size):
123 127 127 131 131 139 139 144 147 148 149 150 151 157 161 163 163 168 176 177 185 187 188 189 189 190 195 201 203 205 206 208 212 214 224 225 227 230 232 233 258 261 283 288 321 338 366 371 426 485 501 502 507 508 512 513 514 518 528 542 550 551 552 567
from bcalm.
Stats for ABySS 1.9.0 with k=144 at the first output file (hsapiens-1.fa
), which includes the error removal algorithms (erosion, tip trimming, low mean coverage unitig removal) and bubble popping of non-indel bubbles. It'd be fun to try with some of these algorithms disabled. The data set is the GIAB HG004 250-bp Illumina reads after error correction with BFC.
n | n:500 | L50 | LG50 | NG50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3590968 | 727484 | 98509 | 120204 | 6554 | 500 | 2921 | 7580 | 16969 | 11112 | 214335 | 2.782e9 | hsapiens-1.fa |
ABySS took 5h59m and 644 MB GB maxRSS.
from bcalm.
thanks for the extensive feedback. Whoa, 644 MB maxRSS for Abyss is very impressive! just ~2 bits per kmer.. is it (1) a typo, (2) using the disk, (3) using the new bloom filter variant?
47 GB maxRSS for bglue is abnormal.. oh, I think it's because of the number of threads (64). With 16 threads it'd take 4x more time but 4x less mem.
from bcalm.
Hah! Typo. Fixed above. Sorry. That's the standard hash table implementation. We're working on reporting the Bloom filter implementation.
Any thoughts on the uneven distribution of buckets, and whether it can be improved?
from bcalm.
If you want to try some error removal of the BCALM unitigs, abyss-filtergraph
takes a GFA file as input and can filter based on length/coverage/topology. ABySS PopBubbles
takes a FASTA file and GFA file does what you'd expect. AdjList
and abyss-overlap
take a FASTA file and produce a GFA file. See their usage here: https://github.com/bcgsc/abyss/blob/master/bin/abyss-pe#L473-L494
Add --assemble
to abyss-filtergraph
to compact unambiguous edges.
from bcalm.
-
re: abyss-filtergraph. this would be neat!
would there happen to also be a program that does fasta2gfa. I.e. find all overlaps in multi-fasta sequences that have exact (k-1)-overlaps?Nevermind, just saw your sentence onabyss-overlap
! looks like it could be straightforward then. Thanks much for the pointers. -
regarding uneven buckets: each
*.glue.*
file is the result of all compactions done by a given thread (referenced by thread ID). At a high level, the algorithm does multiple passes. At each pass, it takes a group of buckets (1 bucket = 1 minimizer) and assigns a bucket to an available thread (implemented with a thread pool). Groups of buckets are balanced, overall: making sure that the number of all kmers across all threads is more or less constant. So, groups have the same size. However, within a group, some buckets may have more kmers than others (due to partitioning with minimizers, even when using frequency ordering). Hence, I think this is what explains the difference in*.glue.*
file sizes: some threads are unlucky and got assigned heavier buckets overall. But it turns out it actually doesn't matter :) The immediate next step is bglue, which reads all glues sequentially, then performs some other partitioning.
from bcalm.
- re: abyss-filtergraph. this would be neat! would there happen to also be a program that does fasta2gfa. I.e. find all overlaps in multi-fasta sequences that have exact (k-1)-overlaps? Nevermind, just saw your sentence on abyss-overlap! looks like it could be straightforward then. Thanks much for the pointers.
AdjList
finds exact k-1
overlaps much faster than abyss-overlap
. It uses a hash table for k-1
overlaps and a suffix array for m ≤ x < k-1
overlaps. abyss-overlap
uses an FM-index to find arbitrary overlaps, but does not do transitive reduction. For that, use sga index && sga overlap
and convert the ASQG file to GFA using abyss-todot --gfa
.
from bcalm.
Oh ok, then AdjList
it is then. Should remind myself that AdjList == Unitigs2Gfa
:) Thanks.
Do you actually look for shorter overlaps than (k-1) in abyss? Then it's not strictly a dBG..
from bcalm.
ABYSS the dBG assembler does not look for <k-1
overlaps. After the unitigs (terminology aside) are built by ABYSS dBG, AdjList
finds the m < x ≤ k-1
overlaps after the dBG assembly. At this point it's no longer a dBG but a sequence overlap graph.
from bcalm.
Name your unitigs file ${name}-1.fa
and you can run the entire abyss-pe
contigging/scaffolding pipeline. It assumes that tip-trimming, low-coverage contig removal has already been done by ABYSS
(dBG), so you'll need to tweak the parameters to abyss-filtergraph
. Try
k=63
abyss-pe k=$k FILTERGRAPH_OPTIONS="-t$((2*k)) -c5 --assemble" …
from bcalm.
47 GB maxRSS for bglue is abnormal.. oh, I think it's because of the number of threads (64). With 16 threads it'd take 4x more time but 4x less mem.
I thought that I had ran bglue
with one thread, by omitting -nb-cores
, but I had in fact used 128 threads, which is a reasonable default, I think. I reduced bglue
to 12 threads, and that reduced its memory usage by a factor of 9. Incidentally, it also reduced the wall-clock run time by a factor of 2, which may or may not be caused by external factors like NFS file performance.
bcalm -in pe400.in -out hsapiens-unitigs -k 63 -abundance 5 -nb-cores 64
133458.02s user 8514.68s system 426% cpu 9:15:18.60 total 4423 MB
bglue -in hsapiens-unitigs.h5 -out hsapiens-unitigs.fa -k 63 -nb-cores 128
9026.65s user 6561.02s system 1290% cpu 20:08.08 total 47627 MB
bglue -in hsapiens-unitigs.h5 -out hsapiens-unitigs.fa -k 63 -nb-cores 12
4106.23s user 197.08s system 654% cpu 10:57.89 total 5382 MB
from bcalm.
interesting, thanks for the follow-up investigation!
from bcalm.
now bcalm outputs the mean kmer abundance per unitig, can abyss-pe take it into account somehow?
from bcalm.
Yes, absolutely. That information either needs to be in the comment of each unitig in the FASTA file, or you can put it in the GraphViz .dot or GFA .gfa file. ABySS uses the sum k-mer abundance of the unitig by default (before dividing by the number of k-mers). It'd be easier to output that info for ABySS. abyss-todot
can convert from sum coverage (the default) to mean coverage, but I don't think it will convert in the opposite direction.
from bcalm.
oh, nice. For bcalm, it's currently in the comment of the FASTA file, in the form:
>[id] MA=[mean abundance]
, e.g. >1 MA=20.9
. What form should it be for abyss? (sure, I can add sum of abundances, it's just MA times number of kmers..)
from bcalm.
I wish ABySS had used some typed comment like that, but no luck. It's
>ID LENGTH COVERAGE
LENGTH is length in bp (not k-mers) and COVERAGE is sum coverage (not mean).
from bcalm.
ok! no problem, it's a matter of writing another wrapper :)
from bcalm.
I've thought of moving the FASTA comment to SAM-style tags.
>123 LN:i:100 KC:i:5678
from bcalm.
I'd heartily recommend outputting a GFA file though with the sum-coverage in the KC:i
tag.
from bcalm.
I agree, GFA would be useful. I'll keep it in mind. It isn't a straightforward patch because currently the glue step "forgets" about the graph structure.
from bcalm.
Yeah, the ABySS dBG assembler also forgets the graph structure, and it's recreated using AdjList
. Speaking of which, AdjList
extracts the KC
value from the FASTA header.
from bcalm.
so, probably a hacky fasta2gfa could be made for bcalm using AdjList
, I suppose?
from bcalm.
Certainly. It would be pretty easy I'd say. Would you consider changing the FASTA header output of bcalm
to
>123 LN:i:100 KC:i:5678 KM:f:113.6
There's no support in ABySS for that format yet, but I could add it for the upcoming ABySS 2.0.0 release, and if we both do it, it's standard! AdjList
would then work out of the box for bcalm
. Note: KC
is in the GFA spec. KM
is not, but it could be added. It would be useful, I think.
I'm thinking with ABySS 2.0.0 to move the default graph format from GraphViz to GFA. GFA is currently an option with abyss-pe graph=gfa
.
from bcalm.
sure thing, here's to a new standard!
from bcalm.
from bcalm.
That's great! GFA version 1 or 2?
from bcalm.
v1: L lines. Bandage accepts gfa2?
from bcalm.
@rrwick Any designs on implementing GFA 2 in Bandage? Thanks again for the indispensable tool, Ryan.
from bcalm.
Sorry for the late response... I missed this one!
I will someday (I promise!) make Bandage GFA2 compatible. Or at least for a subset of GFA2, as the GFA2 standard supports more stuff than Bandage can reasonably visualise. It's on my interminable to do list
from bcalm.
Related Issues (20)
- new release soon? HOT 2
- Running bcalm raises error: `cannot create a union-find data structure, too many elements.` HOT 6
- error: 'predecessors' following the 'template' keyword does not refer to a template HOT 8
- Error: libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: basic_string HOT 5
- Compiling bcalm causes trouble when no .git repository is found HOT 1
- Output of bcalm2 similar to minia 3.2.1 unitigs? HOT 2
- Memory error with many reference genomes as input HOT 16
- High frequency of nodes with degree of two with length = 2k-1 HOT 2
- .fa files contain lowercase `km` instead of the documented uppercase `KM` HOT 3
- bcalm 2.2.1 on conda for Mac OS X produces incorrect cDBG (the zombie bug returns!!) HOT 12
- Add better input parsing HOT 1
- Improve error message for bad fasta input HOT 1
- Apparently bcalm crashes with very large k's (4096) HOT 1
- bcalm -version return code 1 to shell
- Glue is too strong, because glue files are not deleted.
- insert unitig disregarding the reverse complement HOT 1
- Option `-out-dir` is not used ? HOT 1
- bug report on a metagenomics dataset HOT 3
- Add header description with `-all-abundance-counts` HOT 2
- Is the output file reproducible? HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from bcalm.