Code Monkey home page Code Monkey logo

Comments (7)

adamnovak avatar adamnovak commented on September 23, 2024

We generate the IDs by using the node numbers for the bounding nodes around the variable site in vg's graph, and giving their orientations using > and < characters. So >73711>73698 is the variable site between the end of node 73711 (read left to right) and the start of node 73698 (read left to right). We call a site like this a "snarl".

I'm not sure what you mean about locating the specific chromosomal position. This is VCF on a linear reference, so each variant has its chromosomal position in the CHROM and POS fields of the VCF record.

The vg call option -S is documented as:

-S, --ref-sample NAME   Call on all paths with given sample name (cannot be used with -p)

This is used to specify which linear reference in the graph you want the output VCF in terms of. If you have a graph that has the GRCh37, GRCh38, and CHM13 references all in it, you can use -S CHM13 to get calls against the CHM13 reference. Internally we treat named reference assemblies and named non-reference sample assemblies in a graph both as "samples" containing multiple contigs.

from vg.

Kled111 avatar Kled111 commented on September 23, 2024

We generate the IDs by using the node numbers for the bounding nodes around the variable site in vg's graph, and giving their orientations using > and < characters. So >73711>73698 is the variable site between the end of node 73711 (read left to right) and the start of node 73698 (read left to right). We call a site like this a "snarl".

I'm not sure what you mean about locating the specific chromosomal position. This is VCF on a linear reference, so each variant has its chromosomal position in the CHROM and POS fields of the VCF record.

The vg call option -S is documented as:

-S, --ref-sample NAME   Call on all paths with given sample name (cannot be used with -p)

This is used to specify which linear reference in the graph you want the output VCF in terms of. If you have a graph that has the GRCh37, GRCh38, and CHM13 references all in it, you can use -S CHM13 to get calls against the CHM13 reference. Internally we treat named reference assemblies and named non-reference sample assemblies in a graph both as "samples" containing multiple contigs.

Thank you for your response. The content previously shown represents the results obtained using -S CHM13. However,when I find the specific sequence of CHM13 on https://genome.ucsc.edu/, it was found that the content in the ALT field corresponds to the sequence of CHM13. Is this perhaps not entirely accurate?

from vg.

adamnovak avatar adamnovak commented on September 23, 2024

the content in the ALT field corresponds to the sequence of CHM13

That shouldn't happen; if you call against CHM13, the REF field should have the sequence that is in CHM13, and the ALT field should have different sequence(s).

If you have a variant where the ALT field has the sequence that's in CHM13 on the browser, what sequence is in the REF field?

Also, are you sure you are using the same CHM13 in your graph as is in the browser? The browser is on the 2.0 version, but there were 1.0 and 1.1. releases.

from vg.

adamnovak avatar adamnovak commented on September 23, 2024

Maybe you're looking at a tandem insertion?

If you look at your variant:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR15368033
chr1 292 >73711>73698 T TAACCCTAACCCT 1702.3 PASS AT=>73711>73698,>73711<73710<73709<73708<73706<73703<73702<73701<73699>73698;DP=98 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:98:12,86:-185.838884,-16.991398,-16.143265:2:-2.735818:118.137253:12

You can go to chr1:292-312 on the browser and get the sequence:

>hub_3671779_hs1_dna range=chr1:292-312 5'pad=0 3'pad=0 strand=+ repeatMasking=none
TAACCCTAACCCTAACCCTAA

And so you can see that the ALT sequence TAACCCTAACCCT is the same as the sequence in CHM13 at that position. But the REF sequence for this variant is just T, so the variant represents adding another copy of AACCCTAACCCT between the T and the existing copy; it's not a mistake in this case that the variant's ALT sequence is the same as the CHM13 reference's sequence at the variant's start position.

from vg.

Kled111 avatar Kled111 commented on September 23, 2024

Maybe you're looking at a tandem insertion?

If you look at your variant:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR15368033
chr1 292 >73711>73698 T TAACCCTAACCCT 1702.3 PASS AT=>73711>73698,>73711<73710<73709<73708<73706<73703<73702<73701<73699>73698;DP=98 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:98:12,86:-185.838884,-16.991398,-16.143265:2:-2.735818:118.137253:12

You can go to chr1:292-312 on the browser and get the sequence:

>hub_3671779_hs1_dna range=chr1:292-312 5'pad=0 3'pad=0 strand=+ repeatMasking=none
TAACCCTAACCCTAACCCTAA

And so you can see that the ALT sequence TAACCCTAACCCT is the same as the sequence in CHM13 at that position. But the REF sequence for this variant is just T, so the variant represents adding another copy of AACCCTAACCCT between the T and the existing copy; it's not a mistake in this case that the variant's ALT sequence is the same as the CHM13 reference's sequence at the variant's start position.

Thanks, I tried to examine many entries on chr1 and found that the situation is basically as you described. The 'ref' still corresponds to the sequence of CHM13, but many of the preceding contents of the vcf file happen to match the 'alt' sequence as well. Apart from this, I also want to ask, after using the '-S' parameter, is most of the content in the output VCF file actually the detection results of variations between other samples and CHM13 on the original pan-genome graph? (I used the file https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.gbz along with a short read sequencing fastq file (obtained through vg giraffe)). I noticed that the probability values calculated through the GL field are all very small. If I want to know the variations between the sequencing sample and the other samples that constitute the pan-genome, what should I do?"

from vg.

adamnovak avatar adamnovak commented on September 23, 2024

Apart from this, I also want to ask, after using the '-S' parameter, is most of the content in the output VCF file actually the detection results of variations between other samples and CHM13 on the original pan-genome graph?

A lot of the snarls in the graph will be places where other samples in the graph differ from CHM13 but your sample matches it. If you use --genotype-snarls I think vg snarls emits all of these. Otherwise I think it only talks about 0/0 calls if there was some plausible chance they wouldn't have been called as 0/0. So I think your VCF should by default only contain descriptions of where your particular sample might vary from the linear reference you chose, but those descriptions will be informed by what other samples do at those locations.

I noticed that the probability values calculated through the GL field are all very small.

I think this is normal. The GL probabilities are the probability of getting the sequencing data that you actually have, given that the genotype is true. There's a very low probability of observing any particular sequencing data. I think it is computing something like "if my genotype is A/T, what is the chance that I sequence exactly 12 reverse-strand As, 11 forward-strand As, 7 reverse-strand Ts, 5 forward-strand Ts, and one forward-strand G?".

If I want to know the variations between the sequencing sample and the other samples that constitute the pan-genome, what should I do?

I don't think you can plug each sample in the pangenome in as -S and get a VCF against it, because most of the samples in https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.gbz are diploid samples. They have two haplotypes in them, and so you can't call against the sample as a whole; you'd have to call against one haplotype or the other. And maybe @glennhickey knows something different, but I don't think that vg call lets you ask for that.

One option is to take your VCF from vg call, and use vg deconstruct to make a VCF describing the genotypes for all the samples in the graph, against the same reference. Then you could compare the two VCF files to see how your called sample relates to each of the samples already in the graph.

Another option would be to take a tool like the Sequence Tube Map and use it to visualize your sequencing reads as aligned to the graph, alongside the locally distinct haplotypes of samples stored in the graph. So if you have a VCF call for your sample you want to know more about, you could look at the reads that gave rise to it and determine whether they agree with any known variation or any haplotypes already in the graph. But this approach aggregates across the haplotypes stored in the graph and combines them when they match, so it won't let you get a list of, for example, the names of samples that have the same genotype as your sample does.

from vg.

adamnovak avatar adamnovak commented on September 23, 2024

Another tip: we've mostly been using the "AF-Filtered VG Indexes" version of the graph for mapping and calling: https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.d9.gbz. This throws out anything that doesn't appear at least a certain number of times (I think 9 times), and we think it gets better mapping and calling accuracy. When the probability that your sample actually has any given piece of sequence that's in the graph starts to get small enough that it is close to the probability that one of your sequencing reads has an error that would match that piece of sequence, including more rare variants in the graph can start to be counterproductive for read mapping.

from vg.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.