
jvcf-spec's People

Contributors

bricoletc

jvcf-spec's Issues

HAPG and GT consistency

As pointed out by @leoisl, a missing HAPG is currently encoded as an array holding an empty array, [[]], but a missing GT is encoded as [[null]]. The two encodings need to be consistent.
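One way to get consistency is to normalise both fields to a single convention when reading a site. The sketch below assumes each field is a list with one inner list per sample, and it arbitrarily picks [[null]] as the target; the helper name `normalise_empty` is hypothetical and picking [[]] instead would work just as well.

```python
import json


def normalise_empty(field):
    """Rewrite empty per-sample entries ([]) as [None]; leave the rest untouched."""
    return [sample if sample else [None] for sample in field]


# A site as described in this issue: empty HAPG is [[]], empty GT is [[null]].
site = json.loads('{"HAPG": [[]], "GT": [[null]]}')
normalised = {key: normalise_empty(value) for key, value in site.items()}
print(json.dumps(normalised))
```

After normalisation both fields use the same encoding, so downstream code only has to test for one "missing" value.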

Graph restrictions

I'm trying to lay out what restrictions jVCF imposes on the graph. I'll list the criteria and what may happen if they are not met.

Possibilities:

  • Graph is directed
    If the graph is not directed, what is the POS of the following SNP, 2 or 3?
    adirected_graph.pdf

  • Graph is acyclic
    In the following cyclic graph, we have one site with alleles C vs G. But do we then also have an infinite number of sites with arbitrarily long alleles?
    cyclic_graph.pdf

  • Graph is non-crossing
    By non-crossing, I mean that two variant sites can only share a node if one is contained in the other.
    In the following graph, this does not hold: a G node is part of several sites (one has alleles CA, GA, GT; the other has alleles AT, GT, GA).
    crossing_graph.pdf
    Two things here:

    1. gramtools requires sites to be 'non-crossing'. The reason is that variation is flanked by numeric IDs, so the G node shared by the two variant sites cannot be represented in one place only: A 5 T 7 C 8 **G** 8 A 6 A 9 **G** 10 A 10 6 T . I know this is hard on the eyes, but the G node has to be duplicated. make_prg produces non-crossing, directed acyclic graphs.
    2. I don't think non-crossing is required for jVCF. The example graph can be represented in jVCF; the Child_Map would have two different sites contain the A/T SNP as a child.

On that basis I'd say jVCF can represent variation on directed, acyclic graphs.
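The "directed, acyclic" condition can be checked mechanically. Below is a minimal sketch using Kahn's topological sort; the node and edge representation (integer node IDs, `(source, target)` pairs) and the function name `is_dag` are assumptions for illustration, not part of jVCF or gramtools.

```python
from collections import deque


def is_dag(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {node: 0 for node in nodes}
    successors = {node: [] for node in nodes}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1

    # Start from nodes with no incoming edge and peel the graph layer by layer.
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(nodes)


# A single SNP bubble: 0 -> {1, 2} -> 3.
bubble_nodes = [0, 1, 2, 3]
bubble_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(is_dag(bubble_nodes, bubble_edges))
print(is_dag(bubble_nodes, bubble_edges + [(3, 1)]))
```

The first call describes a plain bubble and passes; the second adds a back edge from node 3 to node 1, which creates the kind of cycle discussed above.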

Possible new fields

Any of these could be implemented, or none at all:

  • ID (of a site). I know the ID is implicit from the order of the objects in "Sites", but this restricts the format a lot. For example, if we filter the sites and create a filtered jVCF with just a fraction of the sites from the original jVCF, we will not be able to match the new IDs to the original IDs, and the IDs will get messed up. We can't sort these sites in any way either, as that would also mess up the ID information. The ID is very important here because the Child_Map is built from it, unlike in VCF (where it is not very important). I think filtering VCFs (and thus jVCFs) is an essential operation, and it is really hard to do without an ID field. Note: we can mark sites as filtered without removing them, but I think giving IDs to objects based on their order in an array is a brittle design; at some point it will make some operations hard or just impossible. The solution, giving each object its own ID, is simple and solves all of these issues;

  • HAPG-REF: the haplogroup reference (POS is an offset into this). This is nice to have, as otherwise, every time I see a POS, I have to look at the graph and work out the haplogroup reference myself. This is fine if we have small examples and a drawing of the graph, but we won't have this in real jVCFs. HAPG-REF might get very long, though, so it is tricky to add...

  • QUAL: it might be good to encourage users to provide a quality score or a confidence for the call. This is not strictly required, as users are free to add whatever fields they want, but call quality or GT confidence is pretty standard, and it is good to have a standard field for such a common and important value.
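To illustrate the ID point, here is a minimal sketch of filtering when each site carries an explicit ID. The "ID" field is the proposal from this issue, not an existing jVCF field, and the Child_Map layout assumed here (parent ID, then haplogroup, then a list of child IDs) as well as the function name `filter_sites` are assumptions for illustration.

```python
def filter_sites(jvcf, keep):
    """Keep only sites whose explicit ID is in `keep`, pruning Child_Map to match."""
    sites = [site for site in jvcf["Sites"] if site["ID"] in keep]
    child_map = {}
    for parent, by_hapg in jvcf["Child_Map"].items():
        if parent not in keep:
            continue
        # Drop children that were filtered out, then drop empty haplogroups.
        pruned = {
            hapg: [child for child in children if child in keep]
            for hapg, children in by_hapg.items()
        }
        pruned = {hapg: children for hapg, children in pruned.items() if children}
        if pruned:
            child_map[parent] = pruned
    return {"Sites": sites, "Child_Map": child_map}


# Hypothetical toy jVCF: site s0 contains s1 (haplogroup 0) and s2 (haplogroup 1).
jvcf = {
    "Sites": [{"ID": "s0", "POS": 1}, {"ID": "s1", "POS": 2}, {"ID": "s2", "POS": 2}],
    "Child_Map": {"s0": {"0": ["s1"], "1": ["s2"]}},
}
filtered = filter_sites(jvcf, keep={"s0", "s2"})
print(filtered["Child_Map"])
```

Because the IDs travel with the sites, the surviving Child_Map entry still points at the right site after s1 is removed. With positional IDs, s2 would shift from index 2 to index 1 and every reference to it would need rewriting.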

POS field refers to different linear references

The POS fields of sites nested inside a Lvl-1 site refer to different linear references. Each POS depends on the haplogroup reference of its site, which can differ even between sites in the same Lvl-1 site. For example, in the graph below:

image

the 4 innermost sites (2, 3, 5, 6) all have the same POS value (3), but although the values are equal, they refer to different linear references:

  • site 2: ref = TAA
  • site 3: ref = TTC
  • site 5: ref = CCA
  • site 6: ref = CGC

This may or may not be confusing; whether it should be changed needs careful thought. POS as defined here is always relative to a linear reference (a path) in the graph. I think it is more natural, and easier to understand, if all sites have a POS relative to a single fixed linear reference. With a single fixed linear reference, we have a reference for transforming a jVCF into a VCF, and the POS of a site in the jVCF equals the POS of that site in the VCF. A single linear reference, however, gives us a representation problem (see Figure 3 of the pandora paper). Having several linear references, as is done now, minimises the representation problem, but it also means that one graph can have several linear references, which makes it hard to compare POS values between different sites.
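The comparison problem can be made concrete with the four sites listed above. The sketch below groups sites by POS and reports positions that are shared by sites with different haplogroup references; the "HAPG-REF" key is the field proposed in another issue, not an existing jVCF field, and the function name `ambiguous_positions` is hypothetical.

```python
from collections import defaultdict

# The four innermost sites from the example, keyed by site number.
sites = {
    2: {"POS": 3, "HAPG-REF": "TAA"},
    3: {"POS": 3, "HAPG-REF": "TTC"},
    5: {"POS": 3, "HAPG-REF": "CCA"},
    6: {"POS": 3, "HAPG-REF": "CGC"},
}


def ambiguous_positions(sites):
    """Return POS values that are used against more than one linear reference."""
    refs_by_pos = defaultdict(set)
    for site in sites.values():
        refs_by_pos[site["POS"]].add(site["HAPG-REF"])
    return {pos: sorted(refs) for pos, refs in refs_by_pos.items() if len(refs) > 1}


print(ambiguous_positions(sites))
# → {3: ['CCA', 'CGC', 'TAA', 'TTC']}
```

Equal POS values here do not mean equal coordinates: each one is an offset into a different path, so they can only be compared after mapping them onto a common reference.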
