iqbal-lab-org / jvcf-spec


Makefile 1.53% TeX 98.47%

jvcf-spec's Issues

HAPG and GT consistency

As pointed out by @leoisl, a missing HAPG is currently encoded as an empty array [[]], whereas an empty GT is encoded as [[null]]. The two encodings need to be consistent.
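Until the spec settles on one encoding, a reader of the format could accept both. The following is a minimal sketch, assuming HAPG and GT entries are nested arrays as written above; the helper names `is_missing` and `normalise`, and the choice of `[[null]]` as the canonical form, are hypothetical and not part of the spec.

```python
import json

# Assumption: [[null]] is picked arbitrarily as the canonical "no call" form.
CANONICAL_MISSING = [[None]]


def is_missing(entry):
    """Return True if a HAPG or GT entry carries no call.

    Accepts both encodings from the issue: [[]] and [[null]]
    (null becomes None once the JSON is parsed).
    """
    return all(all(item is None for item in inner) for inner in entry)


def normalise(entry):
    """Map either 'missing' encoding to one canonical form; keep real calls."""
    return [list(inner) for inner in CANONICAL_MISSING] if is_missing(entry) else entry


hapg = json.loads("[[]]")
gt = json.loads("[[null]]")
print(normalise(hapg) == normalise(gt))
```

Whichever form the spec adopts, the point is only that both fields should pass through the same check.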

Graph restrictions

I'm trying to lay out what restrictions jVCF imposes on the graph. I'll list each criterion and what may happen if it is not met.

Possibilities:

Graph is directed
If the graph is not directed, what is the POS of the following SNP: 2 or 3?
adirected_graph.pdf
Graph is acyclic
In the following cyclic graph, we have one site with alleles C vs G. But do we then also have an infinite number of sites with arbitrarily long alleles?
cyclic_graph.pdf
Graph is non-crossing
By non-crossing, I mean that two variant sites can only share a node if one is contained in the other.
In the following graph this does not hold: a G node is part of several sites (one has alleles CA, GA, GT; the other has alleles AT, GT, GA).
crossing_graph.pdf
Two things here:
1. gramtools requires sites to be 'non-crossing'. The reason is that variation is flanked by numeric IDs, so the G node shared by the two variant sites cannot be represented in one place only: A 5 T 7 C 8 **G** 8 A 6 A 9 **G** 10 A 10 6 T . I know this is horrible on the eyes, but the G node has to be duplicated. make_prg produces non-crossing, directed acyclic graphs.
2. I don't think non-crossing is required for jVCF. The example graph can be represented in jVCF; the Child_Map would have two different sites that contain the A/T SNP as a child.

On that basis I'd say jVCF can represent variation on directed, acyclic graphs.
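The two structural criteria can be checked mechanically. Below is a rough sketch, not taken from gramtools or make_prg: `is_acyclic` uses Kahn's topological-sort algorithm on a directed adjacency mapping, and `is_non_crossing` tests the "share a node only if one contains the other" rule. The input shapes (a dict of successor lists, and a dict from site name to the set of nodes inside that site) are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def is_acyclic(edges):
    """Kahn's algorithm. `edges` maps a node to a list of its successors."""
    nodes = set(edges) | {v for succ in edges.values() for v in succ}
    indegree = {n: 0 for n in nodes}
    for succ in edges.values():
        for v in succ:
            indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for v in edges.get(n, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Nodes on a cycle never reach in-degree 0, so they are never visited.
    return visited == len(nodes)


def is_non_crossing(sites):
    """`sites` maps a site name to the set of node ids inside that site.

    Two sites may share nodes only if one is contained in the other.
    """
    for a, b in combinations(sites.values(), 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True


dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
crossing = {"site1": {1, 2}, "site2": {2, 3}}
print(is_acyclic(dag), is_non_crossing(crossing))
```

Under the conclusion above, only the first check would be a requirement for jVCF; the second would apply to gramtools input.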

Possible new fields

Any of these could be implemented, or none at all:

ID (of a site). I know the ID is implicit from the order of the objects in "Sites", but this restricts the format a lot. For example, if we filter the sites and create a filtered jVCF containing just a fraction of the sites from the original jVCF, we will not be able to match the new IDs with the original IDs, and the IDs will get messed up. Nor can we sort the sites in any way, as that would also mess up the ID information. The ID is very important here, because the Child_Map is built from it, unlike in VCF (where it is not very important). I think filtering VCFs (and thus jVCFs) is an essential operation, and it is really hard to do without an ID field. Note: we can mark sites as filtered without removing them, but I think assigning IDs to objects based on their order in an array is a brittle design; at some point it will make some operations hard, or just impossible. The solution, giving each object an explicit ID, is simple and solves these issues;
HAPG-REF: the haplogroup reference (POS is an offset into this). This is nice to have, as otherwise, every time I see POS, I have to look at the graph and work out the haplogroup reference myself. This is fine if we have small examples and a drawing of the graph, but we won't have this in real jVCFs. HAPG-REF might get very long, though, so it is tricky to add...
QUAL: it might be good to encourage users to provide a quality score or a confidence for the call. This is not actually required, as users are free to add whatever fields they want. But call quality or GT confidence is pretty standard, and it is good to have a standard field for such a common and important value.
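To make the ID argument concrete, here is a small sketch of filtering under both designs. The layout is a deliberate simplification and an assumption: sites are dicts in a list, and the child map goes from a parent site straight to a list of child sites. The function names and the `"ID"` key are hypothetical.

```python
def filter_implicit(sites, child_map, keep):
    """Filter when a site's ID is its index in the list.

    Surviving sites are renumbered, so the child map must be rewritten and
    the old-to-new mapping kept around to relate back to the original file.
    """
    kept = [i for i, site in enumerate(sites) if keep(site)]
    old_to_new = {old: new for new, old in enumerate(kept)}
    new_sites = [sites[old] for old in kept]
    new_map = {
        old_to_new[parent]: [old_to_new[c] for c in children if c in old_to_new]
        for parent, children in child_map.items()
        if parent in old_to_new
    }
    return new_sites, new_map, old_to_new


def filter_explicit(sites, child_map, keep):
    """Filter when every site carries its own ID: nothing is renumbered."""
    kept_ids = {site["ID"] for site in sites if keep(site)}
    new_sites = [site for site in sites if site["ID"] in kept_ids]
    new_map = {
        parent: [c for c in children if c in kept_ids]
        for parent, children in child_map.items()
        if parent in kept_ids
    }
    return new_sites, new_map


sites = [{"ID": "s0", "POS": 1}, {"ID": "s1", "POS": 3}, {"ID": "s2", "POS": 5}]
drop_s1 = lambda site: site["ID"] != "s1"

_, by_index, mapping = filter_implicit(sites, {0: [1, 2]}, drop_s1)
_, by_id = filter_explicit(sites, {"s0": ["s1", "s2"]}, drop_s1)
print(by_index, mapping)  # the child formerly at index 2 is now index 1
print(by_id)              # "s2" is still "s2"
```

With implicit IDs the filtered file is only interpretable alongside the `old_to_new` table; with explicit IDs it can be compared with the original directly.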

POS field refers to different linear references

The POS fields of sites inside a Lvl-1 site refer to different linear references. POS depends on the haplogroup reference of each site, which can differ even between sites in the same Lvl-1 site. For example, in the graph below:

the 4 innermost sites (2, 3, 5, 6) all have the same POS value (3), but these POS values, although equal, refer to different linear references:

site 2: ref = TAA
site 3: ref = TTC
site 5: ref = CCA
site 6: ref = CGC

This may or may not be confusing; whether it should be changed needs careful thought. Whenever we think about POS as defined here, it is with respect to a linear reference (a path) in the graph. I think it would be more natural and easier to understand if all sites had a POS with respect to a single fixed linear reference. With a single fixed linear reference, we would have a reference for transforming a jVCF into a VCF, and the POS of a site in the jVCF would equal the POS of that site in the VCF. A single linear reference, however, gives us a representation problem (see Figure 3 of the pandora paper). Having several linear references, as is done now, minimises the representation problem, but it also means that a single graph may have several linear references, which makes it hard to compare POS between different sites.
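The comparison problem can be illustrated with the four sites listed above. This is only a sketch: the dict layout and the `group_sites` helper are hypothetical, and the only data used are the site numbers, references and POS value given in this issue.

```python
from collections import defaultdict

# Site id -> (haplogroup reference, POS), as listed in the issue.
sites = {2: ("TAA", 3), 3: ("TTC", 3), 5: ("CCA", 3), 6: ("CGC", 3)}


def group_sites(sites, use_reference):
    """Group site ids by POS alone, or by the (reference, POS) pair."""
    groups = defaultdict(list)
    for site_id, (ref, pos) in sites.items():
        key = (ref, pos) if use_reference else pos
        groups[key].append(site_id)
    return dict(groups)


print(group_sites(sites, use_reference=False))  # {3: [2, 3, 5, 6]}
print(len(group_sites(sites, use_reference=True)))  # 4
```

Keyed by POS alone, the four sites are indistinguishable; a POS only becomes comparable once the linear reference it is measured against is part of the key.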
