jvcf-spec's Issues
HAPG and GT consistency
As pointed out by @leoisl, no HAPG is currently encoded as an empty array [[]], but empty GT is encoded as [[null]]. Need consistency.
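To make the inconsistency concrete, here is a minimal sketch of a reader that accepts both encodings and rewrites them to a single form. The helper names `is_missing` and `normalise_missing` are hypothetical, and picking [null] as the common form is an arbitrary choice for illustration, not a decision taken in this issue.

```python
import json

def _entry_is_missing(entry):
    # An entry is "missing" if it is empty ([]) or holds only nulls ([None]).
    return len(entry) == 0 or all(x is None for x in entry)

def is_missing(per_sample_value):
    """True for [[]] (current no-HAPG form) and [[None]] (current empty-GT form)."""
    return all(_entry_is_missing(entry) for entry in per_sample_value)

def normalise_missing(per_sample_value):
    """Rewrite every missing entry to [None]; leave real calls untouched.

    Choosing [None] over [] is an assumption made for this sketch only.
    """
    return [[None] if _entry_is_missing(e) else e for e in per_sample_value]

no_hapg = json.loads("[[]]")       # how a missing HAPG is written today
empty_gt = json.loads("[[null]]")  # how a missing GT is written today
print(normalise_missing(no_hapg) == normalise_missing(empty_gt))
```

Whichever form is chosen, a writer would then only ever emit that one form for both fields.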
Graph restrictions
I'm trying to lay out what restrictions jVCF imposes on the graph. I'll list the criteria and what may happen if they are not met.
Possibilities:
- Graph is directed
  If the graph is not directed, what is the POS of the following SNP, 2 or 3?
  adirected_graph.pdf
- Graph is acyclic
  In the following cyclic graph, we have one site with alleles C vs G. But do we then also have an infinite number of sites with arbitrarily long alleles?
  cyclic_graph.pdf
- Graph is non-crossing
  By non-crossing, I mean two variant sites can only share a node if one is contained in the other.
  In the following graph, this does not hold: a G node is part of several sites (one has alleles CA, GA, GT, the other has alleles AT, GT, GA).
  crossing_graph.pdf
Two things here:
- gramtools requires sites to be 'non-crossing'. The reason is that variation is flanked with number IDs. It would not be possible to represent the G node in the two variant sites in one place only: A 5 T 7 C 8 **G** 8 A 6 A 9 **G** 10 A 10 6 T. I know this is horrible on the eyes, but the G node has to be duplicated. make_prg produces non-crossing, directed acyclic graphs.
- I don't think non-crossing is required for jVCF. The example graph can be represented in jVCF; the Child_Map would have two different sites contain the A/T SNP as a child.
On that basis I'd say jVCF can represent variation on directed, acyclic graphs.
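For the "directed, acyclic" requirement, a tool could validate its input graph before writing a jVCF. Below is a minimal sketch using Kahn's algorithm on an edge list. The function name `is_acyclic` and the toy node names are made up for illustration; they are not taken from the attached PDFs or from gramtools/make_prg.

```python
from collections import deque

def is_acyclic(nodes, edges):
    """Kahn's algorithm: a directed graph is acyclic iff all of its nodes
    can be removed one by one in topological order."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from nodes with no incoming edge.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    # Any node left over sits on, or downstream of, a cycle.
    return removed == len(nodes)

# Toy SNP bubble: start -> {C, G} -> end.
nodes = ["start", "C", "G", "end"]
dag_edges = [("start", "C"), ("start", "G"), ("C", "end"), ("G", "end")]
# Same bubble with a back edge, which makes allele paths unbounded in length.
cyclic_edges = dag_edges + [("end", "start")]

print(is_acyclic(nodes, dag_edges), is_acyclic(nodes, cyclic_edges))
```

Directedness is implicit here because edges are ordered (src, dst) pairs; the non-crossing property is deliberately not checked, since the discussion above suggests jVCF does not need it.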
Possible new fields
Any of these could be implemented, or none at all:
- ID (of a site). I know the ID is implicit from the order of the objects in "Sites", but this restricts the format a lot. For example, if we filter the sites and create a filtered jVCF with just a fraction of the sites from the original jVCF, we will not be able to match the new IDs with the original IDs, and the IDs will get messed up. Nor can we sort the sites in any way, as that would also mess up the ID information. And the ID is very important here, as the Child_Map is built from it, unlike in VCF (where it is not very important). I think filtering VCFs (and thus jVCFs) is an essential operation, and it is really hard to do without an ID field. Note: we can mark sites as filtered without removing them, but I think assigning IDs to objects based on their order in an array is a brittle design; at some point it will make some operations hard to do, or just impossible. The solution, giving each object an explicit ID, is simple and solves all of these issues.
- HAPG-REF: the haplogroup reference (POS is an offset into this). This is nice to have, as otherwise, every time I see POS, I have to look at the graph and work out the haplogroup reference myself. This is fine if we have small examples and a drawing of the graph, but we won't have this in real jVCFs. HAPG-REF might get very long though, so it is tricky to add.
- QUAL: it might be good to encourage users to provide a quality score or a confidence for the call. This is not strictly required, as users are free to add whatever fields they want, but call quality or GT confidence is pretty standard, and it is good to have a standard field for such a common and important value.
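To illustrate the ID point, here is a minimal sketch of filtering when sites carry an explicit ID. The layout below (a list of site objects with an "ID" key, and a Child_Map keyed by parent ID, then haplogroup) is a simplified assumption for illustration and not the actual jVCF schema; `filter_sites` is a hypothetical helper.

```python
# Simplified, hypothetical layout: three sites, s0 is the parent of s1 and s2.
sites = [
    {"ID": "s0", "POS": 1},
    {"ID": "s1", "POS": 3},
    {"ID": "s2", "POS": 3},
]
# parent ID -> haplogroup -> list of child IDs
child_map = {"s0": {"0": ["s1"], "1": ["s2"]}}

def filter_sites(sites, child_map, keep):
    """Keep only sites whose ID is in `keep` and prune dangling Child_Map links.

    IDs of surviving sites are unchanged, so they still match the original file.
    """
    kept = [s for s in sites if s["ID"] in keep]
    new_map = {}
    for parent, by_hapg in child_map.items():
        if parent not in keep:
            continue
        pruned = {}
        for hapg, children in by_hapg.items():
            surviving = [c for c in children if c in keep]
            if surviving:
                pruned[hapg] = surviving
        if pruned:
            new_map[parent] = pruned
    return kept, new_map

kept, new_map = filter_sites(sites, child_map, keep={"s0", "s2"})
print([s["ID"] for s in kept], new_map)
```

With purely positional IDs, dropping the middle site would silently turn the old index 2 into index 1, and every Child_Map entry would have to be renumbered to stay valid.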
POS field refers to different linear references
The POS fields of sites inside a Lvl-1 site refer to different linear references. The reference depends on the haplogroup reference of each site, which can differ even between sites in the same Lvl-1 site. For example, in the graph below:
the 4 innermost sites (2, 3, 5, 6) all have the same POS value (3), but these POS values, although equal, refer to different linear references:
- site 2: ref = TAA
- site 3: ref = TTC
- site 5: ref = CCA
- site 6: ref = CGC
This might be confusing or not; it needs careful thought as to whether it should be changed. Whenever we think about POS as defined here, it is with respect to a linear reference (path) in the graph. I think it is more natural, or easier to understand, if all sites have a POS with respect to a single fixed linear reference. With a single fixed linear reference, we have a reference with which to transform jVCFs into VCFs, and then the POS of a site in the jVCF == the POS of the same site in the VCF. A single linear reference, however, gives us a representation problem (see Figure 3 of the pandora paper). Having several linear references, as is done now, minimises the representation problem, but it also means that we might have several linear references for a single graph, which makes it hard to compare POS between different sites.
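A small sketch of why equal POS values are not comparable on their own. The POS and ref values are copied from the example above; the dictionary layout and the helper `same_coordinate` are assumptions made for illustration.

```python
# The four innermost sites from the example: same POS, different references.
inner_sites = {
    2: {"POS": 3, "ref": "TAA"},
    3: {"POS": 3, "ref": "TTC"},
    5: {"POS": 3, "ref": "CCA"},
    6: {"POS": 3, "ref": "CGC"},
}

def same_coordinate(a, b):
    """POS values only name the same place if the linear reference is shared."""
    return a["ref"] == b["ref"] and a["POS"] == b["POS"]

ids = sorted(inner_sites)
# Every pair of sites agrees on POS...
all_pos_equal = len({inner_sites[i]["POS"] for i in ids}) == 1
# ...yet no pair actually shares a coordinate.
shared = [
    (i, j)
    for i in ids
    for j in ids
    if i < j and same_coordinate(inner_sites[i], inner_sites[j])
]
print(all_pos_equal, shared)
```

In other words, under the current definition a site's coordinate is effectively the pair (reference, POS), which is also what a HAPG-REF field would make explicit.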