ga4gh / ga4gh-schemas
Models and APIs for Genomic data. RETIRED 2018-01-24
Home Page: http://ga4gh.org
License: Apache License 2.0
In the current schema, ReadGroup is an attribute of a Read, accessible only via the tags field of Read (line 145). It is not a first-class object and is probably not indexed.
We should promote ReadGroup to a first-class data type for several reasons:
In SAM, a read may not belong to any read groups, but in ReadStore, we can require a read to belong to one and only one ReadGroup. If the input file lacks read groups, we can automatically generate a new read group for all the reads in the input file.
The relationships between the data types described in ga4gh.avdl are not clear to me. More specifically:
If we want to follow the SAM structure, perhaps it would be clearer to define Read as a part of ReadGroup and a ReadGroup as a part of ReadSet. A ReadSet could represent one SAM file which contains a single HeaderSection (in this case, we can merge HeaderSection into ReadSet). Another option is to skip the concept of ReadSet.
All coordinates in the GA4GH schema are currently 0-based. Some concern came up recently when I was discussing this with a colleague who knows the genetics research field very well. My impression, and his, is that nearly all biologists (and many existing databases/tools/repositories) communicate in terms of 1-based genomic coordinates. For example, scientific papers typically refer to a specific locus by a position which is implicitly 1-based. I didn't see a convincing amount of discussion of this topic in past GitHub issues (#5, #93), and I'm unsure what message the choice of a 0-based coordinate is sending. I can imagine the GA4GH taking one of three positions on the matter:
Computer scientists including myself dislike (1) for reasons of consistency. (2) seems very error-prone to me, as all clients must remember to add one before presenting results to the user. Off by one errors will be rampant; for instance, it might be difficult to pipe one GA4GH tool into another, as both would assume 0-based input and 1-based output. (3) would require a large amount of migration and community buy-in; I’m unqualified to assess the impact.
Currently, the GA4GH is either in camp (2) or (3). I’m trying to understand which.
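To make the off-by-one hazard in option (2) concrete, here is a minimal sketch (assuming a 0-based, half-open interval convention in the API, which is how I read the current schema) of the conversion every client would have to apply before showing results to a biologist; the function names are hypothetical:

def zero_based_to_one_based(start, end):
    """Convert a 0-based, half-open [start, end) interval to a
    1-based, fully-closed [start, end] interval for display."""
    return start + 1, end  # end is unchanged going from half-open to closed

def one_based_to_zero_based(start, end):
    """Convert a 1-based, fully-closed interval back to 0-based, half-open."""
    return start - 1, end

# A SNP reported in a paper at position 61,098 (1-based) corresponds to
# position 61,097 in a 0-based API request.
assert one_based_to_zero_based(61098, 61098) == (61097, 61098)
assert zero_based_to_one_based(61097, 61098) == (61098, 61098)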
Proposed topic for ASHG
Synchronisation tools. There are tools within the Hadoop ecosystem to pull down data one is missing. How do we synchronise data between distributed repositories? How do we cache data, handle local copies/remote copies in a multi-cloud world?
This is directed primarily at @lh3 and @richarddurbin, but I'm sure others are qualified to answer.
I noticed that when an assembly is selected from GenBank for RefSeq, it gets assigned a different accession ID - presumably some versions of these parallel accession IDs correspond to identical sequence data. I can't find enough information to determine whether repositories like ENA or DDBJ do the same kind of thing, but I assume there are many accession IDs which may exist for a given unique sequence of bases. It would be nice if my system could store as close to 1 ReferenceSet per unique md5 as possible to reduce duplication; the actual reference bases are the important thing, and I don't think I should need to create a new ReferenceSet in my repository for each existing view of that data in other repositories.
Said differently, as far as I can tell it's meaningless to choose to align between two ReferenceSets which differ only in their accession ID, but it might be useful to provide all relevant accession IDs to a user who has data which is aligned to a given ReferenceSet.
If my ReferenceSets are not unique in their MD5s, is the intention that they are unique by accession? Unique by {accession, md5} perhaps?
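As an illustration of the kind of deduplication I have in mind, here is a rough sketch (hypothetical class and field names, not part of the schema) that keys ReferenceSets by their md5 checksum and simply accumulates accession IDs as aliases:

class ReferenceSetStore:
    """Toy store that keeps one ReferenceSet record per unique md5 and
    records every accession ID (GenBank, RefSeq, ENA, ...) as an alias."""

    def __init__(self):
        self._by_md5 = {}  # md5 checksum -> {"md5": ..., "accessions": set()}

    def register(self, md5, accession):
        record = self._by_md5.setdefault(md5, {"md5": md5, "accessions": set()})
        record["accessions"].add(accession)
        return record

store = ReferenceSetStore()
store.register("4d3a0000deadbeef", "GCA_000001405.15")         # placeholder md5, GenBank accession
rec = store.register("4d3a0000deadbeef", "GCF_000001405.26")   # parallel RefSeq accession
print(rec["accessions"])  # both accessions map to the same ReferenceSet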
Proposed topic for ASHG
How do we work with accessioned entities? What's our pattern for referring to things that have some pre-existing global namespace, and where do we want to use that? It would be helpful to have some principles in place.
/**
The orientation and the distance between reads from the fragment are
consistent with the sequencing protocol (extension to SAM flag 0x2)
*/
union { null, boolean } properPlacement = false;
@lh3 Can this be derived given only SAM 0x2, or is more information needed? My interpretation is that properPlacement is a stronger assertion than the SAM 0x2 flag. Here is one possible mapping:
SAM 0x2 | properPlacement
---|---
set | null
unset | false
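A minimal sketch of that mapping (my interpretation only, pending @lh3's confirmation):

def proper_placement_from_sam_flag(flag):
    """Map SAM FLAG bit 0x2 to the proposed properPlacement value:
    0x2 set   -> None  (placement may be proper, but 0x2 alone can't prove it)
    0x2 unset -> False (placement is definitely not proper)"""
    return None if flag & 0x2 else False

assert proper_placement_from_sam_flag(0x63) is None   # 0x2 set
assert proper_placement_from_sam_flag(0x41) is False  # 0x2 unset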
Hi Cassie, David, et al,
I just read the paper regarding Google's Mesa from the following links:
http://research.google.com/pubs/pub42851.html
http://research.google.com/pubs/archive/42851.pdf
I noticed it has several advantages, such as online schema changes and near real-time querying with functions over sets of values. Another advantage is petascale data warehousing with ACID transactions. Functions over sets could be especially useful for the many-to-many relationships we have in our schema.
This seems to have some advantages over Megastore, Spanner, and F1 that we can try to leverage.
I was wondering if we can test the schema with data on a development area of Mesa.
Thank you,
Paul
All of our Avro files are namespaced under ga4gh, so a read alignment is currently org.ga4gh.GAReadAlignment.
So the GA prefix seems a little redundant from a packaging perspective, and it is making all of the code/docs a bit harder to read (and type :)
What does everyone think about removing GA from everything?
We'd then have org.ga4gh.ReadAlignment.
After much discussion (mostly in #24 with @richarddurbin, @lh3, @cassiedoll, and @fnothaft), I have a suggestion on a way forward that hopefully addresses all the requirements people have raised. This writeup is meant to replace #24, which I'm closing, so people don't have to wade through all the history of how we got here.
The design principles are:
To get there, I suggest we proceed in two logical steps:
I believe that will get us an API that helps callers by making “simple things simple and complex things possible”. See below for a sketch of the details -- if people like this direction, we can turn it into pull requests quickly, since most of the work is already at least partly in flight.
Please comment on whether you’re comfortable taking that next step and putting this in code. If folks like both step 1 and step 2, we can turn it all into pull requests at once, which will be a bit more efficient. If folks are comfortable with step 1 but not sure about step 2, we can do them one at a time. (And of course details like object and field names can be hashed out in the pull requests themselves.)
GARead changes:
- /reads/search method that takes an array of RGid's (pending in #26)

GAReadGroup changes:
- /readgroups/search method, with RG-level params (e.g. id, library, sample, tags)
- /readgroups/get method, and require RGid to be unique

Note on implementation-specific extensibility -- implementers:

GAReadSet object:
- /readsets/search method

Use of GAReadSet in other places:

Note on implementation-specific flexibility:
The ga4gh/Beacon repository is not empty - it has 3 markdown files describing what the project is, and what Beacons are currently available.
Is the plan to keep this repository around?
In which case we would have 3 repos: ga4gh/schemas, /Beacon and /ga4gh.github.io (docs)
Or, is there a plan to move these files somewhere else and then delete that repository?
The Beacon avdl files already live in this schemas repository: https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/beacon.avdl
Hi Everyone,
This is just a friendly suggestion, but could we have in each of the (non-method) data schemas (i.e. reads.avdl, variants.avdl, etc.) a record, as follows, that stores the URI of the raw file? I have experienced too many times schemas dramatically changing after a significant period of time, where portions of the data were later deemed important and access to the original raw files was needed - I won't even mention what it took to update the schema with the additional data. This way each read, variant, readgroup, etc. can reference its associated original data. Below is a suggested record:
record GAOriginalData {
/* This is an ID to use in a data record to reference the original raw data */
string ID;
/* This stores a link to the original data - this can be an FTP, Google Cloud, etc. */
string URI;
}
Thanks,
Paul
I would strongly support moving development of the schema to a git 'development' branch. This is a preference of mine from working on shared projects where we adopted the following model to great success:
http://nvie.com/posts/a-successful-git-branching-model/
TL;DR = the clone of the default master of a repository should always be the working latest stable release.
I think for the purposes of this group, it would be sufficient to have just two branches, master and development, with tags for the releases.
We already namespace by protocol, is the "GA4GH" a necessary (or useful) prefix for Avro and/or Protobuf?
Primarily this is directed towards @richarddurbin.
While not stated explicitly, the schema implies that a GAReference belongs to exactly one GAReferenceSet. The implication is made by GAReference comments referring to its parent GAReferenceSet to inherit values for various fields.
Would there be value in supporting a many:many relationship here? I can see it being useful if two GAReferenceSets have a moderate probability of being similar; for instance if they differed only in one contig, or if one were a subset of another. Furthermore, a many:many relationship would enable a repository to hold a 1 GAReference per unique MD5 invariant, if it desired to do so. This is not feasible given the current 1:many semantics.
If #120 were accepted, the inheritance semantics of the current schema could be safely removed, as getting a GAReference would be a lightweight metadata-only operation. The last piece I'm not quite grasping is where isDerived fits in.
Note that I'm not necessarily advising that the schema dictate a many:many relationship (that is a separate question on which I haven't formed an opinion); my primary desire is to shift the semantics so that it would not be disallowed.
We have good definitions of the objects ("nouns") in ga4gh.avdl, but we don't yet have definitions of access methods ("verbs"). Let's add them.
@massie -- what do method definitions look like in AVRO land? We might want to start with /reads/search and /readsets/search (as documented here), since they're sufficient to build basic useful interop.
Consider a researcher who writes a script against the GA4GH APIs, accesses data and publishes the results. The current APIs do not guarantee that subsequent researchers will get the same result when running the original script, therefore the published results are not assured to be reproducible.
If the GA4GH APIs are really going to change the way bioinformatics is done, they need to facilitate the reproducibility of results. For results to be reproducible, one needs to be able to obtain exactly the same data and associated metadata that were used in an experiment. For the GA4GH APIs this means that every time a given data object is returned, it is always the same. This means that the APIs must present data as immutable: data objects are never modified; instead, new derived versions are created.
Mark Diekhans, David Haussler and I think this is important to address and that it would be relatively straightforward to implement immutability into an update of the v0.5 API. What do people think?
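As a rough illustration of what we mean by immutability (a sketch only; the object and field names are hypothetical, not a schema proposal), an update never changes an existing object but instead creates a new one that records its parent:

import uuid

def make_variant_set(data, derived_from=None):
    """Create an immutable variant-set record; updates create new records
    that point back at the version they were derived from."""
    return {
        "id": str(uuid.uuid4()),      # never reused
        "data": tuple(data),          # frozen payload
        "derivedFrom": derived_from,  # provenance chain for reproducibility
    }

v1 = make_variant_set(["callA", "callB"])
v2 = make_variant_set(["callA", "callB", "callC"], derived_from=v1["id"])
# A script that recorded v1["id"] will always retrieve exactly the v1 data.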
Interpreting GA4GHRead.baseQuality requires some funny string manipulation. It might be useful to turn this field into something other than a string.
(Issue split out from #3)
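For reference, this is the kind of string manipulation a client currently has to do (assuming the field carries SAM-style Phred+33 ASCII qualities, which is my reading; an array of integers would remove the need for it):

def decode_base_quality(qual_string):
    """Decode a SAM-style Phred+33 quality string into integer Phred scores."""
    return [ord(c) - 33 for c in qual_string]

assert decode_base_quality("II5!") == [40, 40, 20, 0]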
We might consider an alternative representation of the cigar that requires less regex usage while still being compact.
(maybe a more formal structure, like [{type: deletion, count: 10}, {type: match, count: 30}]? or anything else that might help with parsing and be easy for api providers to support)
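A sketch of what clients do today versus what a structured representation could look like (the operation shapes below are illustrative, not a schema proposal):

import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Turn a CIGAR string into the kind of structured list suggested above,
    e.g. '30M10D' -> [{'type': 'M', 'count': 30}, {'type': 'D', 'count': 10}]."""
    return [{"type": op, "count": int(n)} for n, op in CIGAR_RE.findall(cigar)]

assert parse_cigar("30M10D") == [{"type": "M", "count": 30},
                                 {"type": "D", "count": 10}]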
The fileData on GA4GHReadSet should be cleaned up and pulled into top level ReadSet fields that have a cleaner structure and are easier for users to navigate.
We could possibly first delete it altogether and then add back in the crucial fields as we figure out what they are.
There are some variants and call info fields which are repeated in VCF. (e.g. any tag that is repeated per-allele - AC, AF, HQ, EC etc)
How should these fields be represented as GAKeyValue - which has a single string value for each key?
Should we:
a) smash values together in some well known way :(
b) make GAKeyValue::value
an array :(
c) some better idea I'm missing?
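To make option (b) concrete, here is a toy sketch of how per-allele INFO tags would round-trip if GAKeyValue::value became an array (the field shapes are illustrative only):

def info_to_key_values(info):
    """Represent VCF INFO fields as key/value records where every value is a
    list of strings, so per-allele tags like AC/AF keep one entry per allele."""
    return [{"key": k, "values": [str(v) for v in vs]} for k, vs in info.items()]

# AC and AF carry one value per alternate allele at a multi-allelic site.
records = info_to_key_values({"AC": [12, 3], "AF": [0.4, 0.1], "DB": [True]})
print(records[0])  # {'key': 'AC', 'values': ['12', '3']}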
My reading of what a "fragment length" usually means disagrees with the definition of TLEN in the SAM spec.
Current GAReadAlignment::fragmentLength doc:
/** The observed length of the fragment, equivalent to TLEN in SAM. */
union { null, int } fragmentLength = null;
From the SAM spec:
TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable.
From my reading of the above, TLEN is closer to what I believe is the 'insert size as aligned to the reference'. From looking at various docs, the fragment length is usually defined as the original size of the fragment, which includes the length of the adapters.
Here is the easiest visual documentation on this I've found, although this may be too specific to Illumina: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3906532/figure/F1/.
My guess is that this is just overloading of terminology, so the only request here may be for a small clarification to the docs, but I wanted to first verify that my understanding is correct.
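My reading of the SAM definition, as a sketch (positions are 0-based here; this is only how I understand the spec, not an authoritative implementation):

def observed_template_length(leftmost_start, rightmost_end):
    """Unsigned TLEN per the SAM spec: the number of bases from the leftmost
    mapped base to the rightmost mapped base, when all segments map to the
    same reference. This measures the aligned span, not the original
    (adapter-inclusive) fragment size."""
    return rightmost_end - leftmost_start

# Read 1 aligned at [100, 150), its mate at [300, 350) on the same reference:
assert observed_template_length(100, 350) == 250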
It would be very useful (e.g. in k-mer based analyses) if the /reads/search method provided a mechanism to find all reads that contain a given substring. In general this may be quite hard to do, but perhaps if a limited range of fixed substring sizes were supported, it might be feasible?
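A toy sketch of the limited form I have in mind, where the server indexes a single fixed substring length k up front (the names and structure are hypothetical):

from collections import defaultdict

def build_kmer_index(reads, k=21):
    """Index reads by every length-k substring they contain."""
    index = defaultdict(set)
    for read_id, sequence in reads.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(read_id)
    return index

reads = {"r1": "ACGTACGTAC", "r2": "TTTTACGTAC"}
index = build_kmer_index(reads, k=6)
print(index["ACGTAC"])  # -> {'r1', 'r2'}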
There's general (not universal) agreement that it's useful to have a Read id, to (among other things) allow searches to return a short handle that can be used to get full Read details later, and to allow humans to communicate to other humans about a particular read. We should decide and document the rules for assigning those ids. Some options:
a) read id is globally unique, across all instances of the API.
b) read id is API unique -- it's a persistent reliable identifier for callers of a given API instance
c) read id is session unique -- it's unique for callers of a given API, but may not persist between sessions
Personally, I'm in favor of (b).
Note that (b) does not require storage of a long ID for each read -- API implementers can synthesize the unique ID however they want from existing data.
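For example, one hypothetical way an implementer could synthesize an API-unique (option b) id without storing anything extra is to hash fields the server already has:

import hashlib

def synthesize_read_id(read_group_id, fragment_name, reference_name, position):
    """Derive a stable, API-unique read id from existing fields rather than
    storing a separate identifier per read."""
    key = f"{read_group_id}|{fragment_name}|{reference_name}|{position}"
    return hashlib.sha1(key.encode()).hexdigest()

rid = synthesize_read_id("rg-42", "HWI-ST1276:71:C1162ACXX:1:1101:1208:2458",
                         "chr20", 61097)
print(rid)  # same inputs always yield the same id for callers of this API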
Proposed topic for ASHG
How to support Variant calling with the API?
I tried to improve this comment myself, but so far I've come up with:
Genome assembly identifier.
Remaining unknowns:
I apologize if these are basic questions which are answered somewhere, but I've not found a definitive answer on the subject after searching for some time.
/** Public id of this reference set, such as `GRCh37`. */
union { null, string } assemblyId = null;
I really don't understand the purpose of this flag. The documentation reads:
A sequence X is said to be derived from source sequence Y, if X and Y
are of the same length and the per-base sequence divergence at A/C/G/T bases
is sufficiently small. Two sequences derived from the same official
sequence share the same coordinates and annotations, and
can be replaced with the official sequence for certain use cases.
Is this a common use? The only instance in which this would be possible would be a reference sequence which had only single base changes. This seems extremely unusual biologically.
Could someone please clarify?
GAReferenceSequenceSet and all of its friends are quite wordy. Can we come up with a shorter name for these objects?
Possibly: ReferenceSet (which would mean Reference, /referencesets/search, etc)
Or is there another option perhaps? (I kinda liked contig, but I know that probably won't work :)
This issue is moved over from comments on #52; there were two different interwoven discussions, and I think they'll both be clearer when separated
From @pgrosu:
Hi Everyone,
It took me some time to catch up to all the discussions, but I just wanted to add a few comments:
Even though everyone here is already very familiar with most sequence analysis approaches, I wanted to present a general workflow, since sometimes having a general view of the process, and of where one generally ends up, can help with formulating the data structures needed for it.
Per @richarddurbin's nice suggestion, I took the following diagram (at the link below) and converted it to a UML-style written format. It can easily be converted to AVDL, and I used some of the records from the GAReads protocol:
http://www.ebi.ac.uk/ena/about/sra_submissions
Submission {
string Experiment_ID;
string Study_ID;
}
Study {
string ID;
string Name;
string Description;
ClinicalInformation CI;
}
Experiment {
string ID;
}
// Even if I used inherits (as API-format), this can easily
// be translated to a table, store or object.
ClinicalExperiment inherits Experiment {
//string ID is already inherited
array<GAKeyValue> tags;
}
//Even if I used inherits (as API-format), this can easily
// be translated to a table, store or object.
SequenceExperiment inherits Experiment {
//string ID is already inherited
array<Run> Runs;
array<SampleInformation> Samples;
}
Run {
string ID;
string Name;
string platformUnit;
string sequencingCenter;
string sequencingTechnology;
date Date;
}
SampleInformation {
string ID;
string Name;
date Date;
string Library;
ContactInformation Contact;
}
ContactInformation {
int ID;
string Name;
string Email;
string Address;
}
ClinicalInformation {
string ID;
array<GAKeyValue> CI; //collection of clinical information fields
}
Hope this might make it a little more flexible,
Paul
From @dglazer:
@pgrosu -- thanks for the input. I'm not clear on the context, though -- these sound like good general ideas for representing study/sample/... metadata, and as such would be great to include in the metadata discussion Tanya et al. are driving. If that's right, I suggest moving this comment to a new issue, so the people thinking about those topics can weigh in. (And so they don't get lost in the detailed discussion this issue about lower-level constructs.)
Note that the general sense I've gotten from the last few calls is that folks are leaning towards handling that metadata elsewhere than the Reads API. However that comes out, though, I think we should divide and conquer on the discussion.
From @pgrosu:
Hi David,
My next step was to find out from the web more about the deliverables, in order to sync with what the goals are. I found a document "Priorities 2014 04 28 Final for Posting.pdf" which detailed the following regarding the DWG:
This reinforced the need for referenced or embedded experimental information and/or metadata associated with a ReadGroup, especially for any queries and analysis workflows one might perform.
I understand @lh3's approach to having a recursive template collection for different types of information about a study. The only issue from a search perspective is the O() complexity of finding relevant information across many ReadGroups.
Thus I have three questions, and one comment:
Comment: I would be happy to continue to post to this group when I have relevant contributions. I promise not to make my posts so long in the future :)
From @lh3:
@pgrosu thanks for your comments. Re time complexity: if the tree exactly matches the current SRA hierarchy, the number of searches required to identify read groups is the same. In addition, the time spent on retrieving read groups should be negligible in comparison to pulling read data. Even if performance became a concern, we could use some techniques internally to improve the speed (e.g. duplicating the path to the root).
It's too confusing to find, and to navigate between:
(Thanks @haussler for reporting.)
Hi Guys,
I am Max Mikheev - CTO of Biodatomics. We are now using an Avro-based format internally for storing BAM/VCF/GFF files and running all analytics on Hadoop. I talked with Matt about it before.
I have been following your discussion for a while. I didn't find a way to join your group from the web site. Would you be willing to add me?
I would like to participate in the discussions and your weekly calls. I am sure that Biodatomics can bring some additional expertise in format and tool development.
Sincerely,
Max Mikheev
See discussion on #110; essentially, we need to fix our indentation so that our comments are indented by <= 3 spaces.
We currently have 2 APIs that are tangentially related to our main v0.5 - beacon and matchmaker. And in each of those APIs, there is a concept of a chromosome field, which is currently a string in both cases.
It would be good to help these other APIs benefit from GAReference and introduce some standardization.
Possible solutions:
http://www.googleapis.com/genomics/v1beta/references/{id}
or http://trace.ncbi.nlm.nih.gov/Traces/gg/references/{id})
Thoughts?
In the reference variation git repository we have a pull request that proposes including a copy of a portion of the read task team schema:
https://github.com/ga4gh/RefVariationTaskTeam/pull/3
This is a small dependency at this stage, but this could quickly get out of hand. In particular, we'd like to make a small pull request to the read task team's schema to include support for cigar alignments to graphs - see:
But if we took the duplication approach we'd need to include lots of the ref var schema in the read task team schema. That's obviously not tenable.
Would somebody with more Avro experience tell me the right way to sort this out (and see Adam's comment in the referenced issue above)?
A new repository should be created that serves up a minimum viable reference implementation of the API.
The schema definition says
/**
The 0-based offset from the start of the forward strand for that reference.
*/
long position;
So some people may assume these are non-negative (unsigned) numbers. Nevertheless, negative genomic coordinates are often used for circular genomes, as shown in this example from ENSEMBL's Escherichia_coli_o104_h4_str_2011c_3493:
$ zcat GCA_000299455.1.22/genes.gtf.gz | cut -f 2-5 | grep -e "-"
protein_coding exon -1567 1477
protein_coding CDS -1564 1477
protein_coding stop_codon -1567 -1565
protein_coding exon -236 240
protein_coding CDS -236 237
protein_coding start_codon -236 -234
How should we handle these cases? The word "offset" suggests that negative coordinates should be converted to non-negative by adding the total length, but I didn't see any explicit comment about it (sorry if I missed it). Should we clarify how to handle negative coordinates for circular genomes? If so, what's the preferred way?
i) Using negative coordinates is a pain
ii) Using positive coordinates introduces problems due to intervals having their 'start' position after their 'end'
So I don't have a strong bias for/against either option.
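If the answer is "convert negative coordinates to non-negative by adding the total length", a minimal sketch of that normalization would be as follows (the sequence length here is made up purely to pin down the arithmetic):

def normalize_circular(position, sequence_length):
    """Map a possibly negative coordinate on a circular genome onto the
    0-based range [0, sequence_length)."""
    return position % sequence_length

# A feature annotated at -1567 on a hypothetical 5,273,097 bp circular
# chromosome would start at offset 5,271,530 under this convention.
assert normalize_circular(-1567, 5_273_097) == 5_271_530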
Proposed topic for ASHG
What else do we need to do to build out our ecosystem?
Main goal: Given an API endpoint, output which of the API methods are passing, failing, or unimplemented.
Bonus:
This would be lovely as another repository alongside schemas.
From the SAM/BAM spec...
The MD field aims to achieve SNP/indel calling without looking at the reference. For
example, a string `10A5^AC6' means from the leftmost reference base in the
alignment, there are 10 matches followed by an A on the reference which is different
from the aligned read base; the next 5 reference bases are matches followed by a
2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are
matches. The MD field ought to match the CIGAR string.
The MD tag provides reference information and allows for calculation of edit distance. It's also more compact than the sequence (mis)match tags in the CIGAR.
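A small sketch of how the example MD string decomposes (the regex is based on my reading of the spec grammar; it is not the only way to parse it):

import re

MD_RE = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")

def parse_md(md):
    """Split an MD string into match lengths, reference mismatch bases, and
    deleted reference sequence, e.g. '10A5^AC6'."""
    events = []
    for length, deletion, mismatch in MD_RE.findall(md):
        if length:
            events.append(("match", int(length)))
        elif deletion:
            events.append(("deletion", deletion[1:]))
        else:
            events.append(("mismatch", mismatch))
    return events

assert parse_md("10A5^AC6") == [("match", 10), ("mismatch", "A"),
                                ("match", 5), ("deletion", "AC"),
                                ("match", 6)]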
So this stems from a discussion in #142. These phases of promotion would need to be a controlled vocabulary, and should fit any form of data (record). All forms of data start at the "Deposited" level. I only put three levels for now for conciseness, but I am flexible on the names and the number.
Because of the granularity, it also has the added feature of cascading a level (GAReadGroupSet -> GAVariantSet -> GAExperiment), but that would be up to the developer implementing it. I used the word "Content" to be more general, but as the API evolves that can change. I also provided a few alternate names, in case people are more comfortable with some rather than others. The enum GAContentStatus will most likely reside in common.avdl, and most data records should probably have it set to "Deposited":
/* Alternative names are: GAContentStage, GAContentRank, GAContentLevel */
enum GAContentStatus {
Deposited,
Verified, /* Alternatives are: Validated */
Archived /* Alternatives are: Retired, Quarantined */
}
record GAVariantSet {
...
GAContentStatus contentStatus = Deposited
}
Let me know what you think.
Thanks,
Paul
If we all agree, then let's declare the current API to be ready for v0.5, tag it with git, and put up the final docs on ga4gh.org.
Please vote!
@massie started a discussion in the dwgreadtaskteam forum, recommending that we try out https://github.com/ept/avrodoc; opening this issue to track progress.
Richard Durbin's summary of proposals from the Reads Task Team email thread:
The full data model for reads is more complicated than I first realized -- it took me a while to understand the subtleties (thanks @lh3 for all the clarifications in various pull request threads), and I'm still not sure I have the details right. We'll want to make the model as clear as possible in our documentation -- here's a first crack at capturing my understanding, using the terminology from #60. Comments and corrections welcome.
Reads and alignments are arranged in a logical hierarchy:
GAReadgroupSet -< GAReadgroup -< fragment -< read -< alignment -< GAReadAlignment
- A GAReadgroupSet is a logical collection of GAReadgroups.
- A GAReadgroup is all the data that's processed the same way by the sequencer. There are typically 1-10 GAReadgroups in a GAReadgroupSet.
- A fragment belongs to a GAReadgroup. A fragment has a name (QNAME in the BAM spec), a length (TLEN in the BAM spec), and an array of reads.
- A GAReadAlignment is a linear alignment -- it maps a string of bases to an area of reference using a single CIGAR string. There's always a representative alignment, and there can also be one or more supplementary alignments in the case of a chimeric read, but that's rare.
EDIT (25-may): tweaked some definitions based on comments below
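As a sanity check on the hierarchy, here is a toy nesting of those objects (purely illustrative field names, not the Avro schema):

read_group_set = {
    "id": "rgs-1",
    "readGroups": [{
        "id": "rg-1",                       # data processed the same way
        "fragments": [{
            "name": "frag-0001",            # QNAME in the BAM spec
            "length": 250,                  # TLEN in the BAM spec
            "reads": [{
                "bases": "ACGTACGTAC",
                "alignments": [{            # one GAReadAlignment per linear alignment
                    "cigar": "10M",
                    "representative": True,
                }],
            }],
        }],
    }],
}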
Proposed topic for ASHG
Execution environment - code touching the data - where do we stand on actually processing the data? Currently we have a REST API, which is good for fetching and storing data. But in terms of actually running analyses, there are a lot of approaches. It would be good to tease out the overall vision for that - how do we envision it working?
Proposed topic for ASHG
So far we have been ignoring Auth. But we all know that in the real world a large fraction of the data will require it. Are there parts of an auth model that should be standardised - or not - and how would we do that?
Proposed topic for ASHG
How do we interact with metadata?
I don't see contig (like 'chr20') represented on the read object, but I could be overlooking it.
The original goal of GA4GHRead.alignedBases was to reduce the need for users to parse the cigar field. When using the current implementation of Google's API, I haven't found this to be particularly useful.
We might consider an alternative representation of the cigar that requires less regex usage while still being compact.
(maybe a more formal structure, like [{type: deletion, count: 10}, {type: match, count: 30}]? or anything else that might help with parsing and be easy for api providers to support)