Comments (6)
What's your vg clip command? Did you use -P to specify reference path prefix(es)?
general options:
-P, --path-prefix STRING Do not clip out alleles on paths beginning with given prefix (such references must be specified either with -P or -b). Multiple allowed
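A quick way to sanity-check a prefix before clipping is to count how many path names actually begin with it. This is only a sketch: the function name count_protected is hypothetical, and it assumes the path names arrive one per line on stdin (for example from vg paths -L).

```shell
# count_protected PREFIX: read path names on stdin and count those that begin
# with the literal string PREFIX (no regex, so '#' and '.' are matched as-is).
# Hypothetical helper, not part of vg.
count_protected() {
    awk -v p="$1" 'index($0, p) == 1 { n++ } END { print n + 0 }'
}
```

A count of 0 would suggest the prefix given to -P does not match any path in the graph.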
from vg.
@glennhickey thanks for answering. My vg clip commands are the following:
For paths
vg clip -r <SNARL_coordinate_1>.pb -d 2 -m 1000 <graph_1>.gfa -P GRCh38_p14v2.0-snarls#1# -t 16 -v > <output_graph_1>.gfa
For SNARLs
vg clip -r <SNARL_coordinate_2>.pb -n 5000 -m 1000 <graph_2>.gfa -P GRCh38_p14v2.0-snarls#1# -t 16 -v > <output_graph_2>.gfa
I regenerated the SNARL coordinates between clipping steps, as otherwise the command wouldn't work properly. Also, I believe @adamnovak mentioned this could be related to the reference paths being visited multiple times, which in turn messes up the way vg clip works. However, this is something inherent to the PGGB graph structure.
This same procedure has been successfully tested on my minigraph-CACTUS pangenome graph, leading to the recent run of Giraffe-DV by simply extracting the reference from the graph and using it for variant calling.
from vg.
OK, thanks for sharing the command line and data. To summarize:
# convert to usable format (todo: change vg chunk's default format)
vg convert public/groups/vg/mungaro/PGGB_giraffeDV/graph_ref/original/chunk_chr7.vg -p > chunk_chr7_original.vg
# get the snarls
vg snarls chunk_chr7_original.vg > chunk_chr7_original.snarls
# the paths are unclipped
vg paths -Lx chunk_chr7_original.vg -Q GRC
GRCh38_p14v2.0-snarls#1#chr7#0
GRCh38_p14v2.0-snarls#1#unpl_cont-KI270371.1#0
GRCh38_p14v2.0-snarls#1#chr7_gen_cont-KQ031388.1#0
etc. etc.
# but then after clipping the snarls
vg clip -r chunk_chr7_original.snarls -n 5000 -m 1000 chunk_chr7_original.vg -P GRCh38_p14v2.0-snarls#1# -t 16 -v > chunk_chr7_n5000.vg 2> chunk_chr7_n5000.vg.log
# the paths are all clipped
vg paths -Lv chunk_chr7_n5000.vg -Q GRCh38_p14v2.0-snarls#1#chr7#0
GRCh38_p14v2.0-snarls#1#chr7#0[142659520-142659615]
GRCh38_p14v2.0-snarls#1#chr7#0[142638984-142640192]
etc etc.
# and we can even see it in the log
[vg-clip]: Creating 49 fragments from path GRCh38_p14v2.0-snarls#1#chr7#0
etc. etc.
This is indeed a bug where the snarl clipping code picks a single reference path through the snarl and removes everything else. It works fine on Minigraph-Cactus (mc) and vg construct graphs, but will clip reference nodes in PGGB graphs.
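To see how badly a clipped graph's reference paths were fragmented, the names listed by vg paths -L can be grouped by base name. A minimal sketch, assuming fragments carry a trailing [start-end] suffix as in the listing above; the function name count_fragments is hypothetical:

```shell
# count_fragments: read path names on stdin (one per line) and print
# "<count><TAB><base name>" for each base path.
# Assumption: clipped fragments end in a [start-end] suffix, which is stripped
# before counting. Hypothetical helper, not part of vg.
count_fragments() {
    sed -E 's/\[[0-9]+-[0-9]+\]$//' | sort | uniq -c | awk '{ print $1 "\t" $2 }'
}
```

It could be fed the path listing, e.g. vg paths -Lv chunk_chr7_n5000.vg | count_fragments. A count above 1 means that path was split into several fragments.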
There are some simple hacks to fix this (like just whitelisting every node in a reference path, as is done for -d) but I think a bit more care needs to be taken than that in order for -L to work properly. I'll see what I can do BUT:
Are you sure you need the snarl filter? Just on this graph, the -d2 is getting rid of much more complexity (and does not chop reference paths):
vg clip chunk_chr7_original.vg -d2 -m 1000 -P GRCh38_p14v2.0-snarls#1# -t 16 -v > chunk_chr7_d2.vg
stats from vg stats -lz on the three graphs:
graph        nodes     edges     sequence (bp)
original     2594284   3582337   183316965
clip n5000   2348856   3217157   160978694
clip d2      2264445   2899950   160763753
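The comparison is easier to read as percentages removed relative to the original graph. A small sketch of that arithmetic, using the counts from the table (pct_removed is a hypothetical helper name):

```shell
# pct_removed ORIGINAL CLIPPED: percentage of ORIGINAL removed by clipping,
# printed with one decimal place. Hypothetical helper, not part of vg.
pct_removed() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * (a - b) / a }'
}

pct_removed 2594284 2348856       # nodes, -n 5000
pct_removed 2594284 2264445       # nodes, -d 2
pct_removed 3582337 3217157       # edges, -n 5000
pct_removed 3582337 2899950       # edges, -d 2
pct_removed 183316965 160978694   # sequence, -n 5000
pct_removed 183316965 160763753   # sequence, -d 2
```

On these numbers the two filters remove about the same amount of sequence (12.2% vs 12.3%), but -d 2 removes more nodes (12.7% vs 9.5%) and nearly twice as many edges (19.0% vs 10.2%).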
Also, are you sure you want the unplaced contigs in your graph? They will survive the filters (the d2 one anyway) and probably only serve to add false complexity to your graph. I'm pretty sure Erik did not include them in his HPRC graph.
from vg.
Thanks @glennhickey very clear!
About the SNARLs filter, indeed you're right, the simple -d 2 removes much of the sequence; however, when it comes to building the distance and minimizer indexes, vg complains about having oversized SNARLs... also, I've been doing so for the minigraph-CACTUS graph; therefore, for having a fair comparison between the two would be ideal to do so. That said, it is definitely something I might not be doing in future builds, especially for small pangenomes.
Regarding the unplaced contigs, it is something I will most likely get rid of; we were discussing exactly that with my supervisor a couple of weeks ago, and it seems they create more problems than they solve... however, in this case too I kept them when building with minigraph-CACTUS, and to have a fair comparison I should also include them in the PGGB build.
from vg.
This will be fixed once #3943 is merged in.
I think your work with PGGB has led to many useful fixes in the vg tools (such as this). Thanks for your patient reporting! And sorry for all the bugs!
"for having a fair comparison between the two would be ideal to do so."
But I'm really doubtful, unfortunately, that you will end up with a useful, fair giraffe-based short-read mapping comparison between PGGB and Minigraph-Cactus anytime soon.
PGGB's strengths relative to minigraph-cactus (in my opinion) are in satellites and complex regions. I also think its collapsed representation may be more useful for copy number variation detection. As such, I don't think filtering out these complex regions then calling SNPs and indels (and no CNVs) with giraffe-deepvariant is going to show much value, especially relative to all the effort it is requiring.
from vg.
Thanks to you guys for all the effort during these months.
Fair points about what you're saying here @glennhickey
"PGGB's strengths relative to minigraph-cactus (in my opinion) are in satellites and complex regions. I also think its collapsed representation may be more useful for copy number variation detection. As such, I don't think filtering out these complex regions then calling SNPs and indels (and no CNVs) with giraffe-deepvariant is going to show much value, especially relative to all the effort it is requiring."
I totally understand the situation; nonetheless, since we don't really have a reliable way to call new SVs from the graph at the moment, this approach might work as a starting point to fuel new ideas from the development team later on. Erik and Andrea are already working on diplotype inference from the graph, and you and the VG team are defining this new rGFA.
I trust all this will address current problems, although it will require some time before being implemented and usable and, unfortunately, my project is bound by time limits I have to deal with. On top of that, as you said, this has been a nice little thing I was able to work on, while learning lots of new stuff, during my time here in SC.
from vg.