
Comments (6)

glennhickey commented on September 26, 2024

What's your vg clip command? Did you use -P to specify one or more reference path prefixes?

general options: 
    -P, --path-prefix STRING  Do not clip out alleles on paths beginning with given prefix (such references must be specified either with -P or -b). Multiple allowed

from vg.

Overcraft90 commented on September 26, 2024

@glennhickey thanks for answering. My vg clip commands are as follows.

For paths

vg clip -r <SNARL_coordinate_1>.pb -d 2 -m 1000 <graph_1>.gfa -P GRCh38_p14v2.0-snarls#1# -t 16 -v > <output_graph_1>.gfa

For SNARLs

vg clip -r <SNARL_coordinate_2>.pb -n 5000 -m 1000 <graph_2>.gfa -P GRCh38_p14v2.0-snarls#1# -t 16 -v > <output_graph_2>.gfa

I regenerated the SNARL coordinates between clipping steps, as otherwise the command wouldn't work properly. Also, I believe @adamnovak mentioned this could be related to the reference paths being visited multiple times, which in turn messes up the way vg clip works. However, this is inherent to the PGGB graph structure.

This same procedure has been successfully tested on my minigraph-CACTUS pangenome graph, leading to the recent run of Giraffe-DV by simply extracting the reference from the graph and using it for variant calling.


glennhickey commented on September 26, 2024

OK, thanks for sharing the command line and data. To summarize:

# convert to usable format (todo: change vg chunk's default format)
vg convert public/groups/vg/mungaro/PGGB_giraffeDV/graph_ref/original/chunk_chr7.vg -p > chunk_chr7_original.vg

# get the snarls
vg snarls chunk_chr7_original.vg > chunk_chr7_original.snarls

# the paths are unclipped
vg paths -Lx chunk_chr7_original.vg -Q GRC
GRCh38_p14v2.0-snarls#1#chr7#0
GRCh38_p14v2.0-snarls#1#unpl_cont-KI270371.1#0
GRCh38_p14v2.0-snarls#1#chr7_gen_cont-KQ031388.1#0
etc. etc.

# but then after clipping the snarls
vg clip -r chunk_chr7_original.snarls -n 5000 -m 1000 chunk_chr7_original.vg -P GRCh38_p14v2.0-snarls#1# -t 16 -v > chunk_chr7_n5000.vg 2> chunk_chr7_n5000.vg.log

# the paths are all clipped
vg paths -Lv chunk_chr7_n5000.vg -Q GRCh38_p14v2.0-snarls#1#chr7#0
GRCh38_p14v2.0-snarls#1#chr7#0[142659520-142659615]
GRCh38_p14v2.0-snarls#1#chr7#0[142638984-142640192]
etc. etc.

# and we can even see it in the log
[vg-clip]: Creating 49 fragments from path GRCh38_p14v2.0-snarls#1#chr7#0
etc. etc.

This is indeed a bug: the snarl clipping code picks a single reference path through the snarl and removes everything else. That works fine on mc and construct graphs, but it will clip reference nodes in PGGB graphs.
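A quick way to check whether a clipped graph is affected, sketched here using only the two output formats shown in the listing above: fragmented paths are printed with a [start-end] suffix, and the clip log reports "Creating N fragments from path NAME". The function names are hypothetical helpers, not part of vg, and the sketch assumes the log and path-name formats stay exactly as shown.

```shell
#!/usr/bin/env bash

# Print only path names that carry a [start-end] suffix, i.e. path fragments.
# Reads path names on stdin, one per line (as printed by `vg paths -L`).
fragmented_paths() {
    grep -E '\[[0-9]+-[0-9]+\]$' || true   # no match is not an error here
}

# Sum the fragment counts a vg clip log reports for paths with a given prefix.
#   $1 = log file, $2 = reference path prefix
count_fragments() {
    awk -v prefix="$2" '
        /Creating [0-9]+ fragments from path/ {
            # Fields: [vg-clip]: Creating N fragments from path NAME
            if (index($NF, prefix) == 1) total += $3
        }
        END { print total + 0 }
    ' "$1"
}
```

For example, `vg paths -Lv chunk_chr7_n5000.vg | fragmented_paths` would list the chopped paths, and `count_fragments chunk_chr7_n5000.vg.log 'GRCh38_p14v2.0-snarls#1#'` should print 0 for a graph whose reference paths survived intact.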

There are some simple hacks to fix this (like just whitelisting every node in a reference path as is done for -d) but I think a bit more care needs to be taken than that in order for -L to work properly. I'll see what I can do BUT:

Are you sure you need the snarl filter? Just on this graph, the -d2 is getting rid of much more complexity (and does not chop reference paths):

vg clip chunk_chr7_original.vg -d2 -m 1000 -P GRCh38_p14v2.0-snarls#1# -t 16 -v > chunk_chr7_d2.vg

stats from vg stats -lz on the three graphs:

graph        nodes    edges    sequence
original     2594284  3582337  183316965
clip n5000   2348856  3217157  160978694
clip d2      2264445  2899950  160763753
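
To make the comparison easier to read, the relative reductions can be worked out from the table above. This is a small awk sketch; the figures are copied from the table, and the function name and the row labels with underscores are only there for the example.

```shell
#!/usr/bin/env bash

# Percentage of nodes, edges and sequence removed relative to the original
# graph. The input figures are copied from the vg stats table above.
reduction_table() {
    awk '
        NR == 1 { n = $2; e = $3; s = $4; next }   # first row is the baseline
        {
            printf "%s\t%.1f%%\t%.1f%%\t%.1f%%\n", $1,
                   100 * (n - $2) / n,
                   100 * (e - $3) / e,
                   100 * (s - $4) / s
        }
    ' <<'EOF'
original   2594284 3582337 183316965
clip_n5000 2348856 3217157 160978694
clip_d2    2264445 2899950 160763753
EOF
}

reduction_table
```

By these figures, -n 5000 removes about 9.5% of nodes, 10.2% of edges and 12.2% of sequence, while -d2 removes about 12.7% of nodes, 19.0% of edges and 12.3% of sequence. The sequence removed is nearly the same; the difference is mostly in nodes and edges.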

Also, are you sure you want the unplaced contigs in your graph? They will survive the filters (the d2 one anyway) and probably only serve to add false complexity to your graph. I'm pretty sure Erik did not include them in his HPRC graph.
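
As a rough illustration of separating the unplaced contigs from the primary chromosome paths, the sketch below filters path names by their contig field. It assumes the names follow the sample#haplotype#contig#offset layout seen in the listing above and that primary contigs are named chr1..chrN, chrX, chrY or chrM; both are assumptions about this particular graph. The function name is hypothetical, and it only selects names. It does not modify the graph.

```shell
#!/usr/bin/env bash

# Keep only path names whose contig field (third '#'-separated field) looks
# like a primary chromosome. Reads path names on stdin, one per line.
primary_paths() {
    awk -F'#' '$3 ~ /^chr([0-9]+|X|Y|M)$/'
}
```

For example, `vg paths -Lx chunk_chr7_original.vg -Q GRC | primary_paths` would keep GRCh38_p14v2.0-snarls#1#chr7#0 and drop the unpl_cont and chr7_gen_cont entries.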


Overcraft90 commented on September 26, 2024

Thanks @glennhickey very clear!

About the SNARLs filter: you're right that the simple -d 2 removes much of the sequence. However, when it comes to building the distance and minimizer indexes, vg complains about oversized SNARLs. Also, I've been applying the same filter to the minigraph-CACTUS graph, so for a fair comparison between the two it would be ideal to do so here as well. That said, it is definitely something I might skip in future builds, especially for small pangenomes.

Regarding the unplaced contigs, it is something I will most likely get rid of; I was discussing exactly that with my supervisor a couple of weeks ago, and it seems they create more problems than they solve. However, here too I kept them when building with minigraph-CACTUS, so for a fair comparison I should also include them in the PGGB build.


glennhickey commented on September 26, 2024

This will be fixed once #3943 is merged in.

I think your work with PGGB has led to many useful fixes in the vg tools (such as this). Thanks for your patient reporting! And sorry for all the bugs!

"for having a fair comparison between the two would be ideal to do so."

But I'm really doubtful, unfortunately, that you will end up with a useful, fair giraffe-based short-read mapping comparison between PGGB and Minigraph-Cactus anytime soon.

PGGB's strengths relative to minigraph-cactus (in my opinion) are in satellites and complex regions. I also think its collapsed representation may be more useful for copy number variation detection. As such, I don't think filtering out these complex regions and then calling SNPs and indels (and no CNVs) with giraffe-deepvariant is going to show much value, especially relative to all the effort it is requiring.


Overcraft90 commented on September 26, 2024

Thanks to you guys for all the effort during these months.

Fair points about what you're saying here @glennhickey

PGGB's strengths relative to minigraph-cactus (in my opinion) are in satellites and complex regions. I also think its collapsed representation may be more useful for copy number variation detection. As such, I don't think filtering out these complex regions and then calling SNPs and indels (and no CNVs) with giraffe-deepvariant is going to show much value, especially relative to all the effort it is requiring.

I totally understand the situation. Nonetheless, since we don't really have a reliable way to call new SVs from the graph at the moment, this approach might work as a starting point to fuel new ideas from the development team later on. Erik and Andrea are already working on diplotype inference from the graph, and you and the VG team are defining this new rGFA.

I trust all this will address the current problems, although it will take some time before it is implemented and usable and, unfortunately, my project is bound by time limits I have to deal with. On top of that, as you said, this has been a nice little thing I was able to work on, while learning lots of new stuff, during my time here in SC.
