Code Monkey home page Code Monkey logo

Comments (38)

rchikhi avatar rchikhi commented on May 31, 2024 1

closing this discussion because now bcalm handles GFA output, but let's keep in mind it was a great discussion :)

from bcalm.

sjackman avatar sjackman commented on May 31, 2024 1

I have a particular interest in support for gap G edges. For my current purposes, they could be visualized the same as regular overlap E edges. Support for displaying paths would be a cool new feature.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

thanks for posting those bug issues! i guess it's limited to k<=63 then. will provide support for k>=64 shortly

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

No worries. Thanks for the quick response.

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

-k 63 worked just fine! Are the files named *.glue.* of any use? Their sum size is 18 GB and the final FASTA file is 4.9 GB.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

these *.glue.* are intermediate filed used by bglue.
you need to run ./bglue [same arguments] after bcalm, (for now.. the software is splitted into 2 modules and there's no wrapper). after bglue, you can delete them. oh, since you already have a final fasta file, then indeed, delete them.

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

I've run bglue and got the final unitigs file. I'll delete those glue files now. Thanks for the clarification.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

sounds good. a 4.8 GB unitigs file, looks like you've completed a human run :) how'd it go?

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Easy to run. Fast and memory efficient. I'm impressed. You have no idea how many eukaryotic assemblers have segfaulted on me this week. DSK took 8.4h, bcalm took 37m and bglue took 18m with maxRSS of 4423 MB and 47708 MB respectively.

bcalm -in pe400.in -out hsapiens-unitigs -k 64 -abundance 5 -nb-cores 64

Contiguity stats

n n:500 L50 LG50 NG50 min N80 N50 N20 E-size max sum name
23.05e6 1413270 291114 589225 1195 500 943 1969 4622 3107 43099 2.183e9 hsapiens-unitigs.fa

Alignment to GRCh38

NGA50 Breakpoints
1191 236

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Here are the sizes of the 64 buckets (.glue. files) in MB (sorted by size):
123 127 127 131 131 139 139 144 147 148 149 150 151 157 161 163 163 168 176 177 185 187 188 189 189 190 195 201 203 205 206 208 212 214 224 225 227 230 232 233 258 261 283 288 321 338 366 371 426 485 501 502 507 508 512 513 514 518 528 542 550 551 552 567

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Stats for ABySS 1.9.0 with k=144 at the first output file (hsapiens-1.fa), which includes the error removal algorithms (erosion, tip trimming, low mean coverage unitig removal) and bubble popping of non-indel bubbles. It'd be fun to try with some of these algorithms disabled. The data set is the GIAB HG004 250-bp Illumina reads after error correction with BFC.

n n:500 L50 LG50 NG50 min N80 N50 N20 E-size max sum name
3590968 727484 98509 120204 6554 500 2921 7580 16969 11112 214335 2.782e9 hsapiens-1.fa

ABySS took 5h59m and 644 MB GB maxRSS.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

thanks for the extensive feedback. Whoa, 644 MB maxRSS for Abyss is very impressive! just ~2 bits per kmer.. is it (1) a typo, (2) using the disk, (3) using the new bloom filter variant?

47 GB maxRSS for bglue is abnormal.. oh, I think it's because of the number of threads (64). With 16 threads it'd take 4x more time but 4x less mem.

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Hah! Typo. Fixed above. Sorry. That's the standard hash table implementation. We're working on reporting the Bloom filter implementation.

Any thoughts on the uneven distribution of buckets, and whether it can be improved?

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

If you want to try some error removal of the BCALM unitigs, abyss-filtergraph takes a GFA file as input and can filter based on length/coverage/topology. ABySS PopBubbles takes a FASTA file and GFA file does what you'd expect. AdjList and abyss-overlap take a FASTA file and produce a GFA file. See their usage here: https://github.com/bcgsc/abyss/blob/master/bin/abyss-pe#L473-L494
Add --assemble to abyss-filtergraph to compact unambiguous edges.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024
  1. re: abyss-filtergraph. this would be neat! would there happen to also be a program that does fasta2gfa. I.e. find all overlaps in multi-fasta sequences that have exact (k-1)-overlaps? Nevermind, just saw your sentence on abyss-overlap! looks like it could be straightforward then. Thanks much for the pointers.

  2. regarding uneven buckets: each *.glue.* file is the result of all compactions done by a given thread (referenced by thread ID). At a high level, the algorithm does multiple passes. At each pass, it takes a group of buckets (1 bucket = 1 minimizer) and assigns a bucket to an available thread (implemented with a thread pool). Groups of buckets are balanced, overall: making sure that the number of all kmers across all threads is more or less constant. So, groups have the same size. However, within a group, some buckets may have more kmers than others (due to partitioning with minimizers, even when using frequency ordering). Hence, I think this is what explains the difference in *.glue.* file sizes: some threads are unlucky and got assigned heavier buckets overall. But it turns out it actually doesn't matter :) The immediate next step is bglue, which reads all glues sequentially, then performs some other partitioning.

from bcalm.

sjackman avatar sjackman commented on May 31, 2024
  1. re: abyss-filtergraph. this would be neat! would there happen to also be a program that does fasta2gfa. I.e. find all overlaps in multi-fasta sequences that have exact (k-1)-overlaps? Nevermind, just saw your sentence on abyss-overlap! looks like it could be straightforward then. Thanks much for the pointers.

AdjList finds exact k-1 overlaps much faster than abyss-overlap. It uses a hash table for k-1 overlaps and a suffix array for m ≤ x < k-1 overlaps. abyss-overlap uses an FM-index to find arbitrary overlaps, but does not do transitive reduction. For that, use sga index && sga overlap and convert the ASQG file to GFA using abyss-todot --gfa.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

Oh ok, then AdjList it is then. Should remind myself that AdjList == Unitigs2Gfa :) Thanks.

Do you actually look for shorter overlaps than (k-1) in abyss? Then it's not strictly a dBG..

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

ABYSS the dBG assembler does not look for <k-1 overlaps. After the unitigs (terminology aside) are built by ABYSS dBG, AdjList finds the m < x ≤ k-1 overlaps after the dBG assembly. At this point it's no longer a dBG but a sequence overlap graph.

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Name your unitigs file ${name}-1.fa and you can run the entire abyss-pe contigging/scaffolding pipeline. It assumes that tip-trimming, low-coverage contig removal has already been done by ABYSS (dBG), so you'll need to tweak the parameters to abyss-filtergraph. Try

k=63
abyss-pe k=$k FILTERGRAPH_OPTIONS="-t$((2*k)) -c5 --assemble"

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

47 GB maxRSS for bglue is abnormal.. oh, I think it's because of the number of threads (64). With 16 threads it'd take 4x more time but 4x less mem.

I thought that I had ran bglue with one thread, by omitting -nb-cores, but I had in fact used 128 threads, which is a reasonable default, I think. I reduced bglue to 12 threads, and that reduced its memory usage by a factor of 9. Incidentally, it also reduced the wall-clock run time by a factor of 2, which may or may not be caused by external factors like NFS file performance.

bcalm -in pe400.in -out hsapiens-unitigs -k 63 -abundance 5 -nb-cores 64
133458.02s user 8514.68s system 426% cpu 9:15:18.60 total 4423 MB
bglue -in hsapiens-unitigs.h5 -out hsapiens-unitigs.fa -k 63 -nb-cores 128
9026.65s user 6561.02s system 1290% cpu 20:08.08 total 47627 MB
bglue -in hsapiens-unitigs.h5 -out hsapiens-unitigs.fa -k 63 -nb-cores 12
4106.23s user 197.08s system 654% cpu 10:57.89 total 5382 MB

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

interesting, thanks for the follow-up investigation!

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

now bcalm outputs the mean kmer abundance per unitig, can abyss-pe take it into account somehow?

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Yes, absolutely. That information either needs to be in the comment of each unitig in the FASTA file, or you can put it in the GraphViz .dot or GFA .gfa file. ABySS uses the sum k-mer abundance of the unitig by default (before dividing by the number of k-mers). It'd be easier to output that info for ABySS. abyss-todot can convert from sum coverage (the default) to mean coverage, but I don't think it will convert in the opposite direction.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

oh, nice. For bcalm, it's currently in the comment of the FASTA file, in the form:
>[id] MA=[mean abundance], e.g. >1 MA=20.9. What form should it be for abyss? (sure, I can add sum of abundances, it's just MA times number of kmers..)

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

I wish ABySS had used some typed comment like that, but no luck. It's

>ID LENGTH COVERAGE

LENGTH is length in bp (not k-mers) and COVERAGE is sum coverage (not mean).

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

ok! no problem, it's a matter of writing another wrapper :)

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

I've thought of moving the FASTA comment to SAM-style tags.

>123 LN:i:100 KC:i:5678

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

I'd heartily recommend outputting a GFA file though with the sum-coverage in the KC:i tag.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

I agree, GFA would be useful. I'll keep it in mind. It isn't a straightforward patch because currently the glue step "forgets" about the graph structure.

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Yeah, the ABySS dBG assembler also forgets the graph structure, and it's recreated using AdjList. Speaking of which, AdjList extracts the KC value from the FASTA header.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

so, probably a hacky fasta2gfa could be made for bcalm using AdjList, I suppose?

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Certainly. It would be pretty easy I'd say. Would you consider changing the FASTA header output of bcalm to

>123 LN:i:100 KC:i:5678 KM:f:113.6

There's no support in ABySS for that format yet, but I could add it for the upcoming ABySS 2.0.0 release, and if we both do it, it's standard! AdjList would then work out of the box for bcalm. Note: KC is in the GFA spec. KM is not, but it could be added. It would be useful, I think.

I'm thinking with ABySS 2.0.0 to move the default graph format from GraphViz to GFA. GFA is currently an option with abyss-pe graph=gfa.

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

sure thing, here's to a new standard!

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

Standards

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

That's great! GFA version 1 or 2?

from bcalm.

rchikhi avatar rchikhi commented on May 31, 2024

v1: L lines. Bandage accepts gfa2?

from bcalm.

sjackman avatar sjackman commented on May 31, 2024

@rrwick Any designs on implementing GFA 2 in Bandage? Thanks again for the indispensable tool, Ryan.

from bcalm.

rrwick avatar rrwick commented on May 31, 2024

Sorry for the late response... I missed this one!

I will someday (I promise!) make Bandage GFA2 compatible. Or at least for a subset of GFA2, as the GFA2 standard supports more stuff than Bandage can reasonably visualise. It's on my interminable to do list 😄

from bcalm.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.