ga4gh / ga4gh-schemas
Models and APIs for Genomic data. RETIRED 2018-01-24
Home Page: http://ga4gh.org
License: Apache License 2.0
In the current schema, ReadGroup is an attribute of a Read, accessible only via the tags field of Read (line 145). It is not a first-class object and is probably not indexed.
We should promote ReadGroup to a first-class data type for several reasons:
In SAM, a read may not belong to any read groups, but in ReadStore, we can require a read to belong to one and only one ReadGroup. If the input file lacks read groups, we can automatically generate a new read group for all the reads in the input file.
The relationships between the data types described in ga4gh.avdl are not clear to me. More specifically:
If we want to follow the SAM structure, perhaps it would be clearer to define Read as a part of ReadGroup and a ReadGroup as a part of ReadSet. A ReadSet could represent one SAM file which contains a single HeaderSection (in this case, we can merge HeaderSection into ReadSet). Another option is to skip the concept of ReadSet.
All coordinates in the GA4GH schema are currently 0-based. Some concern came up recently when I was discussing this with a colleague who knows the genetics research field very well. My impression, and his, is that nearly all biologists (and many existing databases/tools/repositories) communicate in terms of 1-based genomic coordinates. For example, scientific papers typically refer to a specific locus by a position which is implicitly 1-based. I didn't see a convincing amount of discussion of this topic in past GitHub issues (#5, #93), and I'm unsure what message the choice of a 0-based coordinate is sending. I can imagine the GA4GH taking one of three positions on the matter:
Computer scientists including myself dislike (1) for reasons of consistency. (2) seems very error-prone to me, as all clients must remember to add one before presenting results to the user. Off by one errors will be rampant; for instance, it might be difficult to pipe one GA4GH tool into another, as both would assume 0-based input and 1-based output. (3) would require a large amount of migration and community buy-in; I’m unqualified to assess the impact.
Currently, the GA4GH is either in camp (2) or (3). I’m trying to understand which.
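To make the off-by-one hazard in option (2) concrete, here is a minimal sketch (assuming a 0-based, half-open interval convention in the API, which is how I read the current schema) of the conversion every client would have to apply before showing results to a biologist; the function names are hypothetical:

def zero_based_to_one_based(start, end):
    """Convert a 0-based, half-open [start, end) interval to a
    1-based, fully-closed [start, end] interval for display."""
    return start + 1, end  # end is unchanged going from half-open to closed

def one_based_to_zero_based(start, end):
    """Convert a 1-based, fully-closed interval back to 0-based, half-open."""
    return start - 1, end

# A SNP reported in a paper at position 61,098 (1-based) corresponds to
# position 61,097 in a 0-based API request.
assert one_based_to_zero_based(61098, 61098) == (61097, 61098)
assert zero_based_to_one_based(61097, 61098) == (61098, 61098)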
Proposed topic for ASHG
Synchronisation tools. There are tools within the Hadoop ecosystem to pull down data one is missing. How do we synchronise data between distributed repositories? How do we cache data, handle local copies/remote copies in a multi-cloud world?
This is directed primarily at @lh3 and @richarddurbin, but I'm sure others are qualified to answer.
I noticed that when an assembly is selected from GenBank for RefSeq, it gets assigned a different accession ID - presumably some versions of these parallel accession IDs correspond to identical sequence data. I can't find enough information to determine whether repositories like ENA or DDBJ do the same kind of thing, but I assume there are many accession IDs which may exist for a given unique sequence of bases. It would be nice if my system could store as close to 1 ReferenceSet per unique md5 as possible to reduce duplication; the actual reference bases are the important thing, and I don't think I should need to create a new ReferenceSet in my repository for each existing view of that data in other repositories.
Said differently, as far as I can tell it's meaningless to choose to align between two ReferenceSets which differ only in their accession ID, but it might be useful to provide all relevant accession IDs to a user who has data which is aligned to a given ReferenceSet.
If my ReferenceSets are not unique in their MD5s, is the intention that they are unique by accession? Unique by {accession, md5} perhaps?
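As an illustration of the kind of deduplication I have in mind, here is a rough sketch (hypothetical class and field names, not part of the schema) that keys ReferenceSets by their md5 checksum and simply accumulates accession IDs as aliases:

class ReferenceSetStore:
    """Toy store that keeps one ReferenceSet record per unique md5 and
    records every accession ID (GenBank, RefSeq, ENA, ...) as an alias."""

    def __init__(self):
        self._by_md5 = {}  # md5 checksum -> {"md5": ..., "accessions": set()}

    def register(self, md5, accession):
        record = self._by_md5.setdefault(md5, {"md5": md5, "accessions": set()})
        record["accessions"].add(accession)
        return record

store = ReferenceSetStore()
store.register("4d3a0000deadbeef", "GCA_000001405.15")         # placeholder md5, GenBank accession
rec = store.register("4d3a0000deadbeef", "GCF_000001405.26")   # parallel RefSeq accession
print(rec["accessions"])  # both accessions map to the same ReferenceSet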
Proposed topic for ASHG
How do we work with accessioned entities? What's our pattern for referring to things that have some pre-existing global namespace, and where do we want to use that? It would be helpful to have some principles in place.
/**
The orientation and the distance between reads from the fragment are
consistent with the sequencing protocol (extension to SAM flag 0x2)
*/
union { null, boolean } properPlacement = false;
@lh3 Can this be derived given only SAM 0x2, or is more information needed? My interpretation is that properPlacement is a stronger assertion than the SAM 0x2 flag. Here is one possible mapping:
SAM 0x2 | properPlacement
---|---
set | null
unset | false
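A minimal sketch of that mapping (my interpretation only, pending @lh3's confirmation):

def proper_placement_from_sam_flag(flag):
    """Map SAM FLAG bit 0x2 to the proposed properPlacement value:
    0x2 set   -> None  (placement may be proper, but 0x2 alone can't prove it)
    0x2 unset -> False (placement is definitely not proper)"""
    return None if flag & 0x2 else False

assert proper_placement_from_sam_flag(0x63) is None   # 0x2 set
assert proper_placement_from_sam_flag(0x41) is False  # 0x2 unset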
Hi Cassie, David, et al,
I just read the paper regarding Google's Mesa from the following links:
http://research.google.com/pubs/pub42851.html
http://research.google.com/pubs/archive/42851.pdf
I noticed it has several advantages, such as online schema changes and near real-time querying with functions over sets of values. Another advantage is petascale data warehousing with ACID transactions. Functions over sets could be especially useful for the many-to-many relationships we have in our schema.
This seems to have some advantages over Megastore, Spanner, and F1 that we can try to leverage.
I was wondering if we can test the schema with data on a development area of Mesa.
Thank you,
Paul
All of our Avro files are namespaced under ga4gh, so a read alignment is currently org.ga4gh.GAReadAlignment.
So the GA prefix seems a little redundant from a packaging perspective, and it is making all of the code/docs a bit harder to read (and type :)
What does everyone think about removing GA from everything?
We'd then have org.ga4gh.ReadAlignment.
After much discussion (mostly in #24 with @richarddurbin, @lh3, @cassiedoll, and @fnothaft), I have a suggestion on a way forward that hopefully addresses all the requirements people have raised. This writeup is meant to replace #24, which I'm closing, so people don't have to wade through all the history of how we got here.
The design principles are:
To get there, I suggest we proceed in two logical steps:
I believe that will get us an API that helps callers by making “simple things simple and complex things possible”. See below for a sketch of the details -- if people like this direction, we can turn it into pull requests quickly, since most of the work is already at least partly in flight.
Please comment on whether you’re comfortable taking that next step and putting this in code. If folks like both step 1 and step 2, we can turn it all into pull requests at once, which will be a bit more efficient. If folks are comfortable with step 1 but not sure about step 2, we can do them one at a time. (And of course details like object and field names can be hashed out in the pull requests themselves.)
GARead changes:
- /reads/search method that takes an array of RGid's (pending in #26)

GAReadGroup changes:
- /readgroups/search method, with RG-level params (e.g. id, library, sample, tags)
- /readgroups/get method, and require RGid to be unique

Note on implementation-specific extensibility -- implementers:

GAReadSet object:
- /readsets/search method

Use of GAReadSet in other places:

Note on implementation-specific flexibility:
The ga4gh/Beacon repository is not empty - it has 3 markdown files describing what the project is, and what Beacons are currently available.
Is the plan to keep this repository around?
In which case we would have 3 repos: ga4gh/schemas, /Beacon and /ga4gh.github.io (docs)
Or, is there a plan to move these files somewhere else and then delete that repository?
The Beacon avdl files already live in this schemas repository: https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/beacon.avdl
Hi Everyone,
This is just a friendly suggestion, but could we have in each of the (non-method) data schemas (i.e. reads.avdl, variants.avdl, etc.) a record, as follows, that stores the URI of the raw file? I have experienced too many times schemas dramatically changing after a significant period of time, where portions of the data were later deemed important and access to the original raw files was needed - I won't even mention what it took to update the schema with the additional data. This way each read, variant, readgroup, etc. can reference its associated original data. Below is a suggested record:
record GAOriginalData {
/* This is an ID to use in a data record to reference the original raw data */
string ID;
/* This stores a link to the original data - this can be an FTP, Google Cloud, etc. */
string URI;
}
Thanks,
Paul
I would strongly support moving development of the schema to a git 'development' branch. This is a preference of mine from working on shared projects where we adopted the following model to great success:
http://nvie.com/posts/a-successful-git-branching-model/
TL;DR = the clone of the default master of a repository should always be the working latest stable release.
I think for the purposes of this group, it would be sufficient to have just two branches, master and development, with tags for the releases.
We already namespace by protocol, is the "GA4GH" a necessary (or useful) prefix for Avro and/or Protobuf?
Primarily this is directed towards @richarddurbin.
While not stated explicitly, the schema implies that a GAReference belongs to exactly one GAReferenceSet. The implication is made by GAReference comments referring to its parent GAReferenceSet to inherit values for various fields.
Would there be value in supporting a many:many relationship here? I can see it being useful if two GAReferenceSets have a moderate probability of being similar; for instance if they differed only in one contig, or if one were a subset of another. Furthermore, a many:many relationship would enable a repository to hold a 1 GAReference per unique MD5 invariant, if it desired to do so. This is not feasible given the current 1:many semantics.
If #120 were accepted, the inheritance semantics of the current schema could be safely removed, as getting a GAReference would be a lightweight metadata-only operation. The last piece I'm not quite grasping is where isDerived fits in.
Note that I'm not necessarily advising that the schema dictate a many:many relationship (that is a separate question on which I haven't formed an opinion); my primary desire is to shift the semantics so that it would not be disallowed.
We have good definitions of the objects ("nouns") in ga4gh.avdl, but we don't yet have definitions of access methods ("verbs"). Let's add them.
@massie -- what do method definitions look like in AVRO land? We might want to start with /reads/search and /readsets/search (as documented here), since they're sufficient to build basic useful interop.
Consider a researcher who writes a script against the GA4GH APIs, accesses data and publishes the results. The current APIs do not guarantee that subsequent researchers will get the same result when running the original script, therefore the published results are not assured to be reproducible.
If the GA4GH APIs are really going to change the way bioinformatics is done, they need to facilitate the reproducibility of results. For results to be reproducible, one needs to be able to obtain exactly the same data and associated metadata that were used in an experiment. For the GA4GH APIs this means that every time a given data object is returned, it is always the same. This means that the APIs must present data as immutable: data objects are never modified; instead, new derived versions are created.
Mark Diekhans, David Haussler and I think this is important to address and that it would be relatively straightforward to implement immutability into an update of the v0.5 API. What do people think?
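As a rough illustration of what we mean by immutability (a sketch only; the object and field names are hypothetical, not a schema proposal), an update never changes an existing object but instead creates a new one that records its parent:

import uuid

def make_variant_set(data, derived_from=None):
    """Create an immutable variant-set record; updates create new records
    that point back at the version they were derived from."""
    return {
        "id": str(uuid.uuid4()),      # never reused
        "data": tuple(data),          # frozen payload
        "derivedFrom": derived_from,  # provenance chain for reproducibility
    }

v1 = make_variant_set(["callA", "callB"])
v2 = make_variant_set(["callA", "callB", "callC"], derived_from=v1["id"])
# A script that recorded v1["id"] will always retrieve exactly the v1 data.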
Interpreting GA4GHRead.baseQuality requires some funny string manipulation. It might be useful to turn this field into something other than a string.
(Issue split out from #3)
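For reference, this is the kind of string manipulation a client currently has to do (assuming the field carries SAM-style Phred+33 ASCII qualities, which is my reading; an array of integers would remove the need for it):

def decode_base_quality(qual_string):
    """Decode a SAM-style Phred+33 quality string into integer Phred scores."""
    return [ord(c) - 33 for c in qual_string]

assert decode_base_quality("II5!") == [40, 40, 20, 0]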
We might consider an alternative representation of the cigar that requires less regex usage while still being compact.
(maybe a more formal structure, like [{type: deletion, count: 10}, {type: match, count: 30}]? or anything else that might help with parsing and be easy for api providers to support)
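A sketch of what clients do today versus what a structured representation could look like (the operation shapes below are illustrative, not a schema proposal):

import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Turn a CIGAR string into the kind of structured list suggested above,
    e.g. '30M10D' -> [{'type': 'M', 'count': 30}, {'type': 'D', 'count': 10}]."""
    return [{"type": op, "count": int(n)} for n, op in CIGAR_RE.findall(cigar)]

assert parse_cigar("30M10D") == [{"type": "M", "count": 30},
                                 {"type": "D", "count": 10}]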
The fileData on GA4GHReadSet should be cleaned up and pulled into top level ReadSet fields that have a cleaner structure and are easier for users to navigate.
We could possibly first delete it altogether and then add back in the crucial fields as we figure out what they are.
There are some variants and call info fields which are repeated in VCF. (e.g. any tag that is repeated per-allele - AC, AF, HQ, EC etc)
How should these fields be represented as GAKeyValue - which has a single string value for each key?
Should we:
a) smash values together in some well known way :(
b) make GAKeyValue::value
an array :(
c) some better idea I'm missing?
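To make option (b) concrete, here is a toy sketch of how per-allele INFO tags would round-trip if GAKeyValue::value became an array (the field shapes are illustrative only):

def info_to_key_values(info):
    """Represent VCF INFO fields as key/value records where every value is a
    list of strings, so per-allele tags like AC/AF keep one entry per allele."""
    return [{"key": k, "values": [str(v) for v in vs]} for k, vs in info.items()]

# AC and AF carry one value per alternate allele at a multi-allelic site.
records = info_to_key_values({"AC": [12, 3], "AF": [0.4, 0.1], "DB": [True]})
print(records[0])  # {'key': 'AC', 'values': ['12', '3']}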
My reading of what a "fragment length" usually means disagrees with the definition of TLEN in the SAM spec.
Current GAReadAlignment::fragmentLength doc:
/** The observed length of the fragment, equivalent to TLEN in SAM. */
union { null, int } fragmentLength = null;
From the SAM spec:
TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable.
From my reading of the above, TLEN is closer to what I believe is the 'insert size as aligned to the reference'. From looking at various docs, the fragment length is usually defined as the original size of the fragment, which includes the length of the adapters.
Here is the easiest visual documentation on this I've found, although this may be too specific to Illumina: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3906532/figure/F1/.
My guess is that this is just overloading of terminology, so the only request here may be for a small clarification to the docs, but I wanted to first verify that my understanding is correct.
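My reading of the SAM definition, as a sketch (positions are 0-based here; this is only how I understand the spec, not an authoritative implementation):

def observed_template_length(leftmost_start, rightmost_end):
    """Unsigned TLEN per the SAM spec: the number of bases from the leftmost
    mapped base to the rightmost mapped base, when all segments map to the
    same reference. This measures the aligned span, not the original
    (adapter-inclusive) fragment size."""
    return rightmost_end - leftmost_start

# Read 1 aligned at [100, 150), its mate at [300, 350) on the same reference:
assert observed_template_length(100, 350) == 250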
It would be very useful (e.g. in k-mer based analyses) if the /reads/search method provided a mechanism to find all reads that contain a given substring. In general this may be quite hard to do, but perhaps if a limited range of fixed substring sizes were supported, it might be feasible?
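A toy sketch of the limited form I have in mind, where the server indexes a single fixed substring length k up front (the names and structure are hypothetical):

from collections import defaultdict

def build_kmer_index(reads, k=21):
    """Index reads by every length-k substring they contain."""
    index = defaultdict(set)
    for read_id, sequence in reads.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(read_id)
    return index

reads = {"r1": "ACGTACGTAC", "r2": "TTTTACGTAC"}
index = build_kmer_index(reads, k=6)
print(index["ACGTAC"])  # -> {'r1', 'r2'}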
There's general (not universal) agreement that it's useful to have a Read id, to (among other things) allow searches to return a short handle that can be used to get full Read details later, and to allow humans to communicate to other humans about a particular read. We should decide and document the rules for assigning those ids. Some options:
a) read id is globally unique, across all instances of the API.
b) read id is API unique -- it's a persistent reliable identifier for callers of a given API instance
c) read id is session unique -- it's unique for callers of a given API, but may not persist between sessions
Personally, I'm in favor of (b).
Note that (b) does not require storage of a long ID for each read -- API implementers can synthesize the unique ID however they want from existing data.
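For example, one hypothetical way an implementer could synthesize an API-unique (option b) id without storing anything extra is to hash fields the server already has:

import hashlib

def synthesize_read_id(read_group_id, fragment_name, reference_name, position):
    """Derive a stable, API-unique read id from existing fields rather than
    storing a separate identifier per read."""
    key = f"{read_group_id}|{fragment_name}|{reference_name}|{position}"
    return hashlib.sha1(key.encode()).hexdigest()

rid = synthesize_read_id("rg-42", "HWI-ST1276:71:C1162ACXX:1:1101:1208:2458",
                         "chr20", 61097)
print(rid)  # same inputs always yield the same id for callers of this API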
Proposed topic for ASHG
How to support Variant calling with the API?
I tried to improve this comment myself, but so far I've come up with:
Genome assembly identifier.
Remaining unknowns:
I apologize if these are basic questions which are answered somewhere, but I've not found a definitive answer on the subject after searching for some time.
/** Public id of this reference set, such as `GRCh37`. */
union { null, string } assemblyId = null;
I really don't understand the purpose of this flag. The documentation reads:
A sequence X is said to be derived from source sequence Y, if X and Y
are of the same length and the per-base sequence divergence at A/C/G/T bases
is sufficiently small. Two sequences derived from the same official
sequence share the same coordinates and annotations, and
can be replaced with the official sequence for certain use cases.
Is this a common use? The only instance in which this would be possible would be a reference sequence which had only single base changes. This seems extremely unusual biologically.
Could someone please clarify?
GAReferenceSequenceSet and all of its friends are quite wordy. Can we come up with a shorter name for these objects?
Possibly: ReferenceSet (which would mean Reference, /referencesets/search, etc)
Or is there another option perhaps? (I kinda liked contig, but I know that probably won't work :)
This issue is moved over from comments on #52; there were two different interwoven discussions, and I think they'll both be clearer when separated
From @pgrosu:
Hi Everyone,
It took me some time to catch up to all the discussions, but I just wanted to add a few comments:
Even though everyone here is already very familiar with most sequence analysis approaches, I wanted to present a general workflow, since sometimes having a general view of the process, and of where one generally ends up, can help with formulating the data structures needed for it.
Per @richarddurbin's nice suggestion, I took the following diagram (at the link below) and converted it to a UML-style written format. It can easily be converted to AVDL, and I used some of the records from the GAReads protocol:
http://www.ebi.ac.uk/ena/about/sra_submissions
Submission {
string Experiment_ID;
string Study_ID;
}
Study {
string ID;
string Name;
string Description;
ClinicalInformation CI;
}
Experiment {
string ID;
}
// Even if I used inherits (as API-format), this can easily
// be translated to a table, store or object.
ClinicalExperiment inherits Experiment {
//string ID is already inherited
array<GAKeyValue> tags;
}
//Even if I used inherits (as API-format), this can easily
// be translated to a table, store or object.
SequenceExperiment inherits Experiment {
//string ID is already inherited
array<Run> Runs;
array<SampleInformation> Samples;
}
Run {
string ID;
string Name;
string platformUnit;
string sequencingCenter;
string sequencingTechnology;
date Date;
}
SampleInformation {
string ID;
string Name;
date Date;
string Library;
ContactInformation Contact;
}
ContactInformation {
int ID;
string Name;
string Email;
string Address;
}
ClinicalInformation {
string ID;
array<GAKeyValue> CI; //collection of clinical information fields
}
Hope this might make it a little more flexible,
Paul
From @dglazer:
@pgrosu -- thanks for the input. I'm not clear on the context, though -- these sound like good general ideas for representing study/sample/... metadata, and as such would be great to include in the metadata discussion Tanya et al. are driving. If that's right, I suggest moving this comment to a new issue, so the people thinking about those topics can weigh in. (And so they don't get lost in the detailed discussion this issue about lower-level constructs.)
Note that the general sense I've gotten from the last few calls is that folks are leaning towards handling that metadata elsewhere than the Reads API. However that comes out, though, I think we should divide and conquer on the discussion.
From @pgrosu:
Hi David,
My next step was to find out from the web more about the deliverables, in order to sync with what the goals are. I found a document "Priorities 2014 04 28 Final for Posting.pdf" which detailed the following regarding the DWG:
This reinforced the need for referenced or embedded experimental information and/or metadata associated with a ReadGroup, especially for any queries and analysis workflows one might perform.
I understand @lh3's approach to having a recursive template collection for different types of information about a study. The only issue from a search perspective is the O() complexity of finding relevant information across many ReadGroups.
Thus I have three questions, and one comment:
Comment: I would be happy to continue to post to this group when I have relevant contributions. I promise not to make my posts so long in the future :)
From @lh3:
@pgrosu thanks for your comments. Re time complexity: if the tree exactly matches the current SRA hierarchy, the number of searches required to identify read groups is the same. In addition, the time spent on retrieving read groups should be negligible in comparison to pulling read data. Even if performance became a concern, we could use some techniques internally to improve the speed (e.g. duplicating the path to the root).
It's too confusing to find, and to navigate between:
(Thanks @haussler for reporting.)
Hi Guys,
I am Max Mikheev - CTO of Biodatomics. We are now using an Avro-based format internally for storing BAM/VCF/GFF files and running all analytics on Hadoop. I talked with Matt about it before.
I have been following your discussion for a while. I didn't find a way to join your group from the web site. Would you be willing to add me?
I would like to participate in the discussions and your weekly calls. I am sure that Biodatomics can bring some additional expertise in format and tool development.
Sincerely,
Max Mikheev
See discussion on #110; essentially, we need to fix our indentation so that our comments are indented by <= 3 spaces.
We currently have 2 APIs that are tangentially related to our main v0.5 - beacon and matchmaker. And in each of those APIs, there is a concept of a chromosome field, which is currently a string in both cases.
It would be good to help these other APIs benefit from GAReference and introduce some standardization.
Possible solutions:
http://www.googleapis.com/genomics/v1beta/references/{id}
or http://trace.ncbi.nlm.nih.gov/Traces/gg/references/{id})
Thoughts?
In the reference variation git repository we have a pull request that proposes including a copy of a portion of the read task team schema:
https://github.com/ga4gh/RefVariationTaskTeam/pull/3
This is a small dependency at this stage, but this could quickly get out of hand. In particular, we'd like to make a small pull request to the read task team's schema to include support for cigar alignments to graphs - see:
But if we took the duplication approach we'd need to include lots of the ref var schema in the read task team schema. That's obviously not tenable.
Would somebody with more Avro experience tell me the right way to sort this out (and see Adam's comment in the referenced issue above)?
A new repository should be created that serves up a minimum viable reference implementation of the API.
The schema definition says
/**
The 0-based offset from the start of the forward strand for that reference.
*/
long position;
So some people may assume these are non-negative (unsigned) numbers. Nevertheless, negative genomic coordinates are often used for circular genomes, as shown in this example from ENSEMBL's Escherichia_coli_o104_h4_str_2011c_3493:
$ zcat GCA_000299455.1.22/genes.gtf.gz | cut -f 2-5 | grep -e "-"
protein_coding exon -1567 1477
protein_coding CDS -1564 1477
protein_coding stop_codon -1567 -1565
protein_coding exon -236 240
protein_coding CDS -236 237
protein_coding start_codon -236 -234
How should we handle these cases? The word "offset" suggests that negative coordinates should be converted to non-negative by adding the total length, but I didn't see any explicit comment about it (sorry if I missed it). Should we clarify how to handle negative coordinates for circular genomes? If so, what's the preferred way?
i) Using negative coordinates is a pain
ii) Using positive coordinates introduces problems due to intervals having their 'start' position after their 'end'
So I don't have a strong bias for/against either option.
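If the answer is "convert negative coordinates to non-negative by adding the total length", a minimal sketch of that normalization would be as follows (the sequence length here is made up purely to pin down the arithmetic):

def normalize_circular(position, sequence_length):
    """Map a possibly negative coordinate on a circular genome onto the
    0-based range [0, sequence_length)."""
    return position % sequence_length

# A feature annotated at -1567 on a hypothetical 5,273,097 bp circular
# chromosome would start at offset 5,271,530 under this convention.
assert normalize_circular(-1567, 5_273_097) == 5_271_530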
Proposed topic for ASHG
What else do we need to do to build out our ecosystem?
Main goal: Given an API endpoint, output which of the API methods are passing, failing, or unimplemented.
Bonus:
This would be lovely as another repository alongside schemas.
From the SAM/BAM spec...
The MD field aims to achieve SNP/indel calling without looking at the reference. For
example, a string `10A5^AC6' means from the leftmost reference base in the
alignment, there are 10 matches followed by an A on the reference which is different
from the aligned read base; the next 5 reference bases are matches followed by a
2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are
matches. The MD field ought to match the CIGAR string.
The MD tag provides reference information and allows for calculation of edit distance. It's also more compact than the sequence (mis)match tags in the CIGAR.
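A small sketch of how the example MD string decomposes (the regex is based on my reading of the spec grammar; it is not the only way to parse it):

import re

MD_RE = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")

def parse_md(md):
    """Split an MD string into match lengths, reference mismatch bases, and
    deleted reference sequence, e.g. '10A5^AC6'."""
    events = []
    for length, deletion, mismatch in MD_RE.findall(md):
        if length:
            events.append(("match", int(length)))
        elif deletion:
            events.append(("deletion", deletion[1:]))
        else:
            events.append(("mismatch", mismatch))
    return events

assert parse_md("10A5^AC6") == [("match", 10), ("mismatch", "A"),
                                ("match", 5), ("deletion", "AC"),
                                ("match", 6)]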
So this stems from a discussion in #142. These phases of promotion would need to be a controlled vocabulary, and should fit any form of data (record). All forms of data start at the "Deposited" level. I only put three levels for now for conciseness, but I am flexible on the names and the number.
Because of the granularity, it also has the added feature of cascading a level (GAReadGroupSet -> GAVariantSet -> GAExperiment), but that would be up to the developer implementing it. I used the word "Content" to be more general, but as the API evolves that can change. I also provided a few alternate names, in case people are more comfortable with some rather than others. The enum GAContentStatus will most likely reside in common.avdl, and most data records should probably have it set to "Deposited":
/* Alternative names are: GAContentStage, GAContentRank, GAContentLevel */
enum GAContentStatus {
Deposited,
Verified, /* Alternatives are: Validated */
Archived /* Alternatives are: Retired, Quarantined */
}
record GAVariantSet {
...
GAContentStatus contentStatus = Deposited
}
Let me know what you think.
Thanks,
Paul
If we all agree, then let's declare the current API to be ready for v0.5, tag it with git, and put up the final docs on ga4gh.org.
Please vote!
@massie started a discussion in the dwgreadtaskteam forum, recommending that we try out https://github.com/ept/avrodoc; opening this issue to track progress.
Richard Durbin's summary of proposals from the Reads Task Team email thread:
The full data model for reads is more complicated than I first realized -- it took me a while to understand the subtleties (thanks @lh3 for all the clarifications in various pull request threads), and I'm still not sure I have the details right. We'll want to make the model as clear as possible in our documentation -- here's a first crack at capturing my understanding, using the terminology from #60. Comments and corrections welcome.
Reads and alignments are arranged in a logical hierarchy:
GAReadgroupSet -< GAReadgroup -< fragment -< read -< alignment -< GAReadAlignment
- A GAReadgroupSet is a logical collection of GAReadgroups.
- A GAReadgroup is all the data that's processed the same way by the sequencer. There are typically 1-10 GAReadgroups in a GAReadgroupSet.
- A fragment belongs to a GAReadgroup. A fragment has a name (QNAME in the BAM spec), a length (TLEN in the BAM spec), and an array of reads.
- A GAReadAlignment is a linear alignment -- it maps a string of bases to an area of reference using a single CIGAR string. There's always a representative alignment, and there can also be one or more supplementary alignments in the case of a chimeric read, but that's rare.
EDIT (25-may): tweaked some definitions based on comments below
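As a sanity check on the hierarchy, here is a toy nesting of those objects (purely illustrative field names, not the Avro schema):

read_group_set = {
    "id": "rgs-1",
    "readGroups": [{
        "id": "rg-1",                       # data processed the same way
        "fragments": [{
            "name": "frag-0001",            # QNAME in the BAM spec
            "length": 250,                  # TLEN in the BAM spec
            "reads": [{
                "bases": "ACGTACGTAC",
                "alignments": [{            # one GAReadAlignment per linear alignment
                    "cigar": "10M",
                    "representative": True,
                }],
            }],
        }],
    }],
}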
Proposed topic for ASHG
Execution environment - code touching the data - where do we stand on actually processing the data? Currently we have a REST API, which is good for fetching and storing data. But in terms of actually running analyses, there are a lot of approaches. It would be good to tease out the overall vision for that - how do we envision it working?
Proposed topic for ASHG
So far we have been ignoring Auth. But we all know that in the real world a large fraction of the data will require it. Are there parts of an auth model that should be standardised - or not - and how would we do that?
Proposed topic for ASHG
How do we interact with metadata?
I don't see contig (like 'chr20') represented on the read object, but I could be overlooking it.
The original goal of GA4GHRead.alignedBases was to reduce the need for users to parse the cigar field. When using the current implementation of Google's API, I haven't found this to be particularly useful.
We might consider an alternative representation of the cigar that requires less regex usage while still being compact.
(maybe a more formal structure, like [{type: deletion, count: 10}, {type: match, count: 30}]? or anything else that might help with parsing and be easy for api providers to support)