
Comments (65)

haussler avatar haussler commented on August 11, 2024

Yes, very important and fundamental to support reproducible scientific and
medical analysis.

On Wed, Sep 10, 2014 at 1:10 PM, Benedict Paten [email protected]
wrote:

Consider a researcher who writes a script against the GA4GH APIs, accesses
data and publishes the results. The current APIs do not guarantee that
subsequent researchers will get the same result when running the original
script; therefore, the published results are not assured to be reproducible.

If the GA4GH APIs are really going to change the way bioinformatics is done,
they need to facilitate the reproducibility of results. In order for results
to be reproducible, one needs to be able to obtain exactly the same data
and associated metadata that were used in an experiment. For the GA4GH APIs
this means that every time a given data object is returned it is always the
same. This means that the APIs must present data as immutable. Data objects are
never modified; instead, new derived versions are created.

Mark Diekhans, David Haussler and I think this is important to address and
that it would be relatively straightforward to implement immutability into
an update of the v0.5 API. What do people think?



from ga4gh-schemas.

pgrosu avatar pgrosu commented on August 11, 2024

Absolutely agree! This is a given axiom of science, and we must have this as a requirement. I can't count how many times I had to take a paper and reconstruct the steps to try to get the same results, if possible. Needless to say, it was usually a painful process. In industry, we had a more stringent set of criteria that was part of our QA/validation process, which guaranteed that the data, analysis and any processing remained consistent between versions. Any changes were required to satisfy a very detailed set of written criteria and to pass an agreed-upon set of tests, accompanied by quite a lot of documentation.

from ga4gh-schemas.

cassiedoll avatar cassiedoll commented on August 11, 2024

I do not agree that this is a good idea for Variants.
Read and Reference data is fairly fixed, but Variant data should be allowed to change for at least some period of time.

One of our best use cases over here is that we will help users take a continuous stream of per-sample VCF files and merge them into one logical set of Variants - which will make population analysis much easier. (Imagine you are sequencing and calling 10 samples a week over the course of a year or something)

Eventually I agree that you might want to say "this data is all done now - go ahead and depend on it forever", but the time at which that occurs is not always == creation time.

from ga4gh-schemas.

fnothaft avatar fnothaft commented on August 11, 2024

@cassiedoll I get what you're saying and both agree and disagree. For variants, I think some things should be immutable. Specifically, once you've got a final (recalibrated) read set, you should be able to generate "canonical" genotype likelihoods from those reads. I agree with you that final genotype calls for a single sample will depend on the sample set that you joint call against, but fundamentally, that's not changing the genotype likelihoods, it's just changing the prior.

The correct approach IMO is to ensure immutability per program run/lineage; e.g., if I process a data set (with a specific toolchain and settings), I can't go back and reprocess part of that data with a new toolchain, or new settings, and overwrite the data. If I reprocess the data, I should wholly rewrite my dataset with new program group/lineage information.
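
To make the likelihood-vs-prior point concrete, here is a toy numerical sketch (all names and numbers are invented for illustration): the stored genotype likelihoods never change; joint calling only swaps the prior applied to them.

// Hypothetical sketch: per-sample genotype likelihoods are computed once from a
// final read set and stored immutably; joint calling only changes the prior.
object GenotypePosteriorSketch {
  // P(reads | genotype) for genotypes AA, AB, BB -- fixed once the read set is final.
  val likelihoods = Map("AA" -> 1e-6, "AB" -> 0.02, "BB" -> 0.4)

  def posterior(prior: Map[String, Double]): Map[String, Double] = {
    val unnormalized = likelihoods.map { case (gt, lik) => gt -> lik * prior(gt) }
    val z = unnormalized.values.sum
    unnormalized.map { case (gt, p) => gt -> p / z } // normalize to sum to 1
  }

  def main(args: Array[String]): Unit = {
    // A single-sample prior vs. a prior re-estimated from a larger joint-called cohort.
    val singleSamplePrior = Map("AA" -> 0.98, "AB" -> 0.015, "BB" -> 0.005)
    val cohortPrior       = Map("AA" -> 0.90, "AB" -> 0.08,  "BB" -> 0.02)
    println(posterior(singleSamplePrior)) // the reported genotype may differ...
    println(posterior(cohortPrior))       // ...but `likelihoods` above never changed
  }
}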

from ga4gh-schemas.

benedictpaten avatar benedictpaten commented on August 11, 2024

@cassiedoll: You want the flexibility behind the API to create variant sets
that are transient? - i.e. derived from a source dataset but not stored and
therefore having no permanent UUID? What happens when the user wants to
publish and reference this dataset? While convenient, I think this is
antithetical to the goals of a storage API.

from ga4gh-schemas.

mcvean avatar mcvean commented on August 11, 2024

There's clearly a need for versioned immutable data for reproducibility. However, variant calls are an inference that may well change. Presumably we want to be able to support requests such as 'give me the genotype for this individual that would have been returned on this date'.

from ga4gh-schemas.

delagoya avatar delagoya commented on August 11, 2024

@benedictpaten I don't think that is what @cassiedoll is getting at, but I'll let her reply.

The underlying alignments and variant calls for a given genomic sequence set will be context dependent and will change over time as the set is re-analyzed. These changed result sets are new data sets. Data provenance is always an issue, but there are efforts in the use of runtime metadata to track data analysis workflows. I think that these other frameworks are sufficient for this request, and should be specified outside of the datastore API.

I am also a bit hard pressed to see how this can be easily implemented as part of the API without significant interface and use case changes. For example, how would you implement this as a formal part of the API (i.e. not just in the documentation) without requiring some time-based component in all of the API calls? Here time/date parameters are acting as a proxy for runtime metadata, so why not rely on metadata queries to get the proper result set?

from ga4gh-schemas.

benedictpaten avatar benedictpaten commented on August 11, 2024

On Wed, Sep 10, 2014 at 2:19 PM, Gil McVean [email protected]
wrote:

There's clearly a need for versioned immutable data for reproducibility.
However, variant calls are an inference that may well change.

Yes, they are inferences, but that does not stop one from wanting to refer concretely to a set of inferences, even if they are subsequently changed/improved - it helps to untangle, as Paul Grosu nicely points out, the ingredients that led to a conclusion.

Presumably we want to be able to support requests such as 'give me the
genotype for this individual that would have been returned on this date'.

Yes! - we could support that very easily by moving to an immutable system.


from ga4gh-schemas.

richarddurbin avatar richarddurbin commented on August 11, 2024

I think this conversation is confusing the API and the data store.

It may well be good practice to have data stores that store immutable objects. GA4GH can encourage that and the API should definitely support it.

But of course I should be allowed to use the API over transient representations that I make locally for exploratory or other purposes. We do this sort of thing all the time. Telling me that merely accessing a data set through the API means it has to be permanent and immutable is crazy. Maybe I want to transiently consider alternative alignments and make new calls from them using standard GA4GH calling software - I should not be bound to store everything I ever do forever.

So, I think Benedict's reasonable request concerns long term data stores, not the API as such.

Richard


from ga4gh-schemas.

diekhans avatar diekhans commented on August 11, 2024

We believe that immutability is essential for all data. The variant use case Cassie describes isn't about mutability but about versioning: that is, when and for how long do you keep a given version of a data set.

One of the main tasks one does when re-running an analysis is to compare against a previous result, or maybe several previous runs. Each run would create a new set of immutable objects with unique ids. Once one decides on a final version, the previous versions could be deleted. Queries for those previous versions would return an error, possibly with the unique id of the newest version.

This allows support for as many versions of the data as needed, without confusion about which version one is working with.

Immutability is a computer science concept dating back to the 1950s that many of us are relearning. Its huge advantage is that it greatly simplifies data management for both the producer and the consumer of the data.

Not following the principle of all data being immutable and having a unique id is one of the major reasons behind the current bioinformatics data mess. The only way to make an experiment reproducible is to save all of the data files used and become the distributor of the data.
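
As an illustration of that life-cycle, here is a minimal in-memory sketch (the types and method names are hypothetical, not part of any GA4GH schema): publishing never mutates an existing object, superseding creates a new object with a new id, and a query for a deleted version returns an error that can carry the id of the newest version.

import java.util.UUID

// Hypothetical store: objects are immutable once published; a "change" publishes a
// new object under a new id and records which id superseded which.
object ImmutableStoreSketch {
  final case class VariantSet(id: String, description: String)

  private var objects      = Map.empty[String, VariantSet]
  private var supersededBy = Map.empty[String, String]

  def publish(description: String): VariantSet = {
    val vs = VariantSet(UUID.randomUUID().toString, description)
    objects += vs.id -> vs
    vs
  }

  // A re-run derives a new object; the old version is deleted but its id is never reused.
  def supersede(oldId: String, description: String): VariantSet = {
    val vs = publish(description)
    supersededBy += oldId -> vs.id
    objects -= oldId
    vs
  }

  // Either the object, or an error carrying the id of the newest version (if known).
  def get(id: String): Either[String, VariantSet] =
    objects.get(id).toRight(
      supersededBy.get(id).fold(s"$id not found")(n => s"$id was deleted; superseded by $n"))

  def main(args: Array[String]): Unit = {
    val v1 = publish("freeze 1")
    val v2 = supersede(v1.id, "freeze 2, re-called with new settings")
    println(get(v2.id)) // Right(VariantSet(...))
    println(get(v1.id)) // Left(... superseded by <v2.id>)
  }
}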

from ga4gh-schemas.

cassiedoll avatar cassiedoll commented on August 11, 2024

I think I'm just coming from a slightly different world. Some of our customers over here don't have all of their data right now. Let's pretend that they are receiving one new BAM file a day. They might do the following:

  • upload that BAM file into our API.
    -- the read data is all immutable.
    -- if they want to realign the data, they would need to make a new readGroupSet.
    -- I think we all agree on this
  • call variants on that BAM file, which basically results in a new CallSet
  • merge that CallSet data into all the other CallSet data they have
    -- so every day, their VariantSet grows by one sample. So a bunch of Variants that had n Calls might now have n + 1 calls
    -- by 'merge' - I'm not talking about anything fancy right now. Let's just pretend we are using exact equivalence here: if you have 2 CallSets which share exactly the same parent Variant (same name, pos, contig, etc etc etc), they get merged - so don't get distracted by this point :)
    -- @fnothaft - I think I agree with you here in that for a particular Call, its data isn't changing - but the Variant.calls field may get a new Call

I think generally though, our customers should have the right to do whatever they want to the data. What if they want to delete an entire VariantSet? In a perfectly immutable world, they wouldn't be able to. It may possibly ruin one of those provenance chains.

That's not okay with us though - that choice should be in the hands of our users. If I made some VariantSet, realized I had a small bug and called everything incorrectly - I should be allowed to delete it without having to prove that there aren't any users of that data. As a user, it's my responsibility to ensure that I'm not screwing up some downstream dependency - this should not be a burden on the API provider.

Let's additionally pretend that I had some new info tag I was messing around with. I should be able to run some analysis on my Variants, come up with my snazzy info tag, and store it back into the API. I shouldn't have to have a whole new VariantSet while I'm just running a bunch of test analysis on my data - and I also shouldn't have to resort to storing that test analysis in some random text file.

I could come up with many more examples here - but basically, this is the user's responsibility and should not be the job of API implementors who do not have all the necessary context.

from ga4gh-schemas.

cassiedoll avatar cassiedoll commented on August 11, 2024

+1 to @richarddurbin

from ga4gh-schemas.

fnothaft avatar fnothaft commented on August 11, 2024

I think generally though, our customers should have the right to do whatever they want to the data. What if they want to delete an entire VariantSet? In a perfectly immutable world, they wouldn't be able to. It may possibly ruin one of those provenance chains.

That's not okay with us though - that choice should be in the hands of our users. If I made some VariantSet, realized I had a small bug and called everything incorrectly - I should be allowed to delete it without having to prove that there aren't any users of that data. As a user, it's my responsibility to ensure that I'm not screwing up some downstream dependency - this should not be a burden on the API provider.

I agree here; this may have been unclear in my earlier email, but I envision the data being immutable with the exception of delete. In-place update is disallowed, but delete is OK. Practically, you can't forbid delete; that just makes your data management problems worse...

To be realistic, we're not going to solve the "reproducibility crisis" by mandating immutability. However, we will significantly reduce our implementation flexibility, and as @cassiedoll is pointing out, this enforces pretty strict limitations on how the users of our system can manage their data. If you're using the GA4GH APIs to implement an archival datastore, sure, immutability makes sense: archival implies write once, read many times. If you're using the GA4GH APIs to access a scratch space (@cassiedoll's example of n + 1 calling), immutability may not be what you want.

from ga4gh-schemas.

diekhans avatar diekhans commented on August 11, 2024

Hi Richard, immutability doesn't mean keeping data forever; it can be deleted, just like an immutable object in memory can be garbage collected. It simply means that once an object is published with a unique id, it never changes. Any change results in a new logical object with a new id.

from ga4gh-schemas.

benedictpaten avatar benedictpaten commented on August 11, 2024

+1 for @diekhans comment.

Concretely, consider adding a UUID to each of the container types, e.g. readGroup, readGroupSet, etc. The only rule is that the UUID is updated anytime the container changes in any way.

For persistent storage APIs the UUID acts as a way of referencing a specific instance of a dataset. For transient stores the UUID could be NULL, if no mechanism for subsequent retrieval is provided, or it could be provided for caching purposes, with no guarantee that the instance will be retrievable for the long term.

To implement simple versioning, as with version control, a function could be provided which takes a UUID for a given container and returns any UUIDs that refer to containers directly derived from that container. An inverse of this function could also be provided. Given versioning, support for @mcvean's query would be a straightforward extension.

For sanity, we would probably want to add a query to distinguish API instances that attempt to provide persistent storage from those that are naturally transient.

If we don't have reproducibility supported in our APIs we'll be either relying on convention, or worse, leading people to download all the datasets they use for publication and host them as files(!) for later research.
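
A rough sketch of how those additions could look, using Scala for illustration like the hashing example later in this thread (every field and method name here is invented, not v0.5 schema):

// Illustrative only: a container carries a UUID that is regenerated on any change;
// a transient store may simply leave it unset.
final case class ReadGroupSet(
  id: String,                      // stable, human-facing identifier
  uuid: Option[String],            // regenerated whenever the container changes; None if transient
  derivedFromUuid: Option[String]) // optional provenance link to the parent instance

trait VersionQueries {
  // Given a container's UUID, return the UUIDs of containers directly derived from it.
  def derivedFrom(uuid: String): Seq[String]
  // Inverse: the UUID(s) of the container(s) this one was derived from.
  def derivationOf(uuid: String): Seq[String]
  // Distinguish persistent-storage endpoints from naturally transient ones.
  def isPersistent: Boolean
}

The Option-valued uuid is one way of capturing the persistent-versus-transient distinction above: a transient store can simply return None.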

from ga4gh-schemas.

delagoya avatar delagoya commented on August 11, 2024

I am in support of UUID meaning static data set. I am not in support of requiring stronger data versioning capabilities such as date/time parameters or requiring version tracking

from ga4gh-schemas.

cassiedoll avatar cassiedoll commented on August 11, 2024

@benedictpaten your uuid could kinda be handled by the updated field we already have on our objects.

the updated field is always updated whenever an object is changed (just like your uuid). and an API provider could definitely provide a way to look up an object at some past time - but I'm with Angel in that we shouldn't require this functionality

from ga4gh-schemas.

fnothaft avatar fnothaft commented on August 11, 2024

I am in support of UUID meaning static data set. I am not in support of requiring stronger data versioning capabilities such as date/time parameters or requiring version tracking

+1 @delagoya; we should implement a genomics API, and avoid going down the rathole of building a version control system. If the presence of a UUID means that the dataset is static, our API is fine. UUID assignment is a metadata problem and we shouldn't tackle it in the reads/variants APIs.

from ga4gh-schemas.

cassiedoll avatar cassiedoll commented on August 11, 2024

@fnothaft - metadata objects are coming to an API near you real soon now. #136 will definitely affect us.

so while I agree with your conclusion :) I do think we can't simply call it a metadata problem - cause metadata problems are now our problems, lol

from ga4gh-schemas.

benedictpaten avatar benedictpaten commented on August 11, 2024

On Wed, Sep 10, 2014 at 5:26 PM, Angel Pizarro [email protected]
wrote:

I am in support of UUID meaning static data set. I am not in support of
requiring stronger data versioning capabilities such as date/time
parameters or requiring version tracking

Great: static = immutable. I would not require version control either - just the potential for it as an optional, simple function. Consider @cassiedoll's example, where a user wants to make a small change to a dataset. A simple derivation function could be very useful for understanding data provenance and avoiding a mess. I am not arguing we should go any further, or mandate it.



from ga4gh-schemas.

fnothaft avatar fnothaft commented on August 11, 2024

@cassiedoll definitely. My point is just that as long as our read/variant API defines clear semantics for what it means for a UUID to be set/null (dataset is immutable/mutable), we can delegate (dataset level) UUID assignment to the metadata API.
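
A one-line sketch of that convention (names hypothetical, not the GA4GH schema): a defined uuid means "this dataset is frozen and immutable"; None means "still mutable".

// Illustrative convention only.
final case class VariantSet(id: String, name: String, uuid: Option[String]) {
  def isFrozen: Boolean = uuid.isDefined // uuid set => immutable snapshot; None => mutable
}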

from ga4gh-schemas.

pgrosu avatar pgrosu commented on August 11, 2024

I agree that the UUID approach, with a standard set of tests and data for validation/timing would suffice for now, but I have a simple question :) What if a drug company is required to keep an audit trail for everything that went into the development of the drug over many years. Part of that would be the whole NGS processing and analysis platform. This would mean reads, variants, pipelines, analysis processes (including results) and all validation along the way. This can mean repeated reprocessing of the same data through variations of the same pipeline - with different settings - on different dates for comparison and validation purposes. I know versioning is not something we want to explore now, but many commercial products support Versioned Data Management for good reason (i.e. Google's Mesa, Oracle, etc.). Is this something to be handled at a later time with a different version of the schema, or would it be beyond the scope of what we want to deliver?

from ga4gh-schemas.

fnothaft avatar fnothaft commented on August 11, 2024

What if a drug company is required to keep an audit trail for everything that went into the development of the drug over many years. Part of that would be the whole NGS processing and analysis platform. This would mean reads, variants, pipelines, analysis processes (including results) and all validation along the way. This can mean repeated reprocessing of the same data through variations of the same pipeline - with different settings - on different dates for comparison and validation purposes. I know versioning is not something we want to explore now, but many commercial products support Versioned Data Management for good reason (i.e. Google's Mesa, Oracle, etc.). Is this something to be handled at a later time with a different version of the schema, or would it be beyond the scope of what we want to deliver?

Others may disagree, but IMO, way beyond the scope of what we should want to deliver. It is difficult enough to build an end-to-end replication framework for an application-specific data and processing environment; I don't think it is tractable to build a "one-size-fits-all" end-to-end pipeline reproducibility solution that is blind to:

  • Compute environment (machine, OS, envar, local storage, application deployment configuration)
  • Source control environment for code
  • Versioning environment for data
  • Machine generated metadata required to be kept for auditing purposes
  • Authentication
  • Etc.

from ga4gh-schemas.

lh3 avatar lh3 commented on August 11, 2024

Back to the very beginning of the thread (as I am reading it now). In the current non-API world, reproducibility is somehow achieved by data releases/freezes. Ensembl, UCSC, 1000g, hapmap and GRC, among the many other databases and projects, keep different releases of processed data for several years on FTP (PS: UCSC/Ensembl also keep complete live web interfaces to older data). In a paper, we just say what release number we are using. This approach is not ideal, but usually works well in practice.

From the discussion above, the scenario I am imagining is: a user can submit different releases of data over time and request each release to be static/readonly. We keep different releases as independent CallSets, with versioning or not, for some time and then drop old ones gradually. Each released CallSet is referenced by UUID (or by accession number). Some call sets may be dynamic; then they do not have UUIDs. In addition, for processed data like variants, we can afford to keep multiple releases. For raw data like read alignments, we probably wouldn't want to keep more than one release.

Is this what people are thinking of?

from ga4gh-schemas.

fnothaft avatar fnothaft commented on August 11, 2024

From the discussion above, the scenario I am imagining is: a user can submit different releases of data over time and request each release to be static/readonly. We keep different releases as independent CallSets, with versioning or not, for some time and then drop old ones gradually. Each released CallSet is referenced by UUID (or by accession number). Some call sets may be dynamic; then they do not have UUIDs. In addition, for processed data like variants, we can afford to keep multiple releases. For raw data like read alignments, we probably wouldn't want to keep more than one release.

+1, I generally agree.

In a paper, we just say what release number we are using. This approach is not ideal, but usually works well in practice.

Agreed, reproducibility is complex; I think documenting your setup/workflow and ensuring that you can get the correct data gets you 95% of the way there. Documenting your setup/workflow is largely human factors engineering, so that's out of the scope of our API, but I think the UUID approach suggested above will address the problem of getting the correct data.

<digression>
I think a lot of people are doing good work with container based approaches, but smart folks are also making good points about the limitations of these approaches.

If you want to make it to 100% reproducibility, it is a hard but doable struggle. In a past life, I implemented an end-to-end reproducibility system for semiconductor design. Alas, it's not genomics, but there's a fair bit of cross-over between the two. We were able to build this system because we had complete control over:

  • The computing environment (OS version/installation, environment variable setup, disk mount points and network setup, etc)
  • The way users accessed data and scripts
  • Version control for:
    ** Scripts to run tools
    ** Tool installations
    ** All of the data

The system took over a year and a half to build with about 3-4 FTEs, took several prototypes, and was very application/environment specific. It was a massive undertaking, but we could reproduce several-year-old protocols on several-year-old datasets with full concordance. Extreme reproducibility is a great goal, but it is really hard to achieve, and a reads/variant access API is the wrong place to implement it.
</digression>

from ga4gh-schemas.

richarddurbin avatar richarddurbin commented on August 11, 2024

I am sympathetic about some of these ideas (see the PS), but still think that people are thinking in terms of properties of a data store rather than an API to compute with.

I'd like to think about the consequences of this proposal. If I want to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map, what would have to change? Will I end up with two complete call sets, and if I query will I get one by default, or if not, how will I know which one to use? How far up the chain does this go - do I get a new study when I add a new call set to the study?

What will be the time and memory implementation cost of this proposal? I am a bit concerned that we are losing sight of the fact that we need to deal at scale. Real systems need to handle 100,000 full genome sequences within a year - yesterday I was on a pair of calls where we have 32000 full genome sequences and are planning to impute into over 100,000 in the next 6 months. We won't switch to GA4GH unless it works better at that scale than what we have. I'd like some guidance from Google. To what extent, when thinking about computing on petabyte-scale data structures, do you think about formal desiderata like putting uuids on each object and requiring immutability, and to what extent do you think about the implementation being lean and restricted to what is required to deliver the goals?

My current position is still to think that this should be an optional add-on, not a required part of the design. Our primary goal should be to access and compute on genomic sequence data as cleanly and efficiently as possible. Other things should be optional extensions.

Richard

PS As it happens, I worked on a non-standard database system for genomic data 20 or so years ago called Acedb that also supported the ability to retrieve objects from arbitrary times in the past. It kept data in low level (typed) tree structures, and rather than deleting old or changed branches kept them in shadow mode, which meant they were ignored in normal operations but could be retrieved on request. Functionality a bit like TimeMachine on Macs, but I presume implemented differently. Anyway, it supported complete history within objects and was lightweight on normal function. (Interestingly, it was also used for a time by Intel for chip testing data.)


from ga4gh-schemas.

pcingola avatar pcingola commented on August 11, 2024

Immutability is a sufficient condition for reproducibility, but not a necessary one. I prefer a 'data freeze' (as mentioned by @lh3) which seems leaner and faster for the scales we have in mind.

Creating immutable duplicates of all variants just because we decided to add "Allele Frequency" information seems like far too much of a burden, not to mention the fact that we would have to either re-calculate or copy all data that depends on each variant record (such as functional annotations).

The idea of "setting UUID" = "data freeze" could be implementable, but only for some "main" objects. As @richarddurbin mentioned, adding UUIDs to all objects seems unfeasible for the scales we have in mind. Setting UUID for each variant might be doable, but setting a UUID for each call in each variant is not efficient.

Pablo

P.S.: I also had the painful experience of implementing fully reproducible (financial) systems. My advice is the same as @fnothaft: Full reproducibility is a massive undertaking, don't go there.

from ga4gh-schemas.

fnothaft avatar fnothaft commented on August 11, 2024

+1 to @richarddurbin

@pcingola

P.S.: I also had the painful experience of implementing fully reproducible (financial) systems. My advice is the same as @fnothaft: Full reproducibility is a massive undertaking, don't go there.

Indeed; I'd also note that in the fully reproducible system I was working on, data wasn't immutable. The reason we had a fully reproducible system was so we could easily make critical engineering changes to products that had not been modified in several years. All data was versioned and had no delete option, and we had to eat the cost of keeping lots of extra disk around. So, a single snapshot of the database in time was immutable, but it needed to be easy to branch and update from any point in the database.

from ga4gh-schemas.

lh3 avatar lh3 commented on August 11, 2024

I'd like to think about the consequences of this proposal. If I want to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map, what would have to change?

Adding new key-value pairs without touching the rest of the data is complicated and fairly infrequent for released variant data.

Will I end up with two complete call sets, and if I query will I get one by default, or if not, how will I know which one to use? How far up the chain does this go - do I get a new study when I add a new call set to the study?

This is a good point. Some projects/databases solve this by providing a "latest-release" symbolic link on FTP. We could mimic this behavior in GA4GH such that older releases are not retrieved unless the user asks for them by explicitly specifying UUIDs. This might need some light versioning, though perhaps there are better solutions.
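
For illustration, a small sketch of that "latest unless pinned" behaviour (all names and UUID values below are invented):

// Hypothetical "latest-release pointer": callers get the newest frozen release
// unless they pin an explicit UUID, which gives the exact data cited in a paper.
object ReleaseResolverSketch {
  final case class Release(uuid: String, name: String)

  // Ordered oldest -> newest; in a real backend this would be a query.
  private val releases = Vector(
    Release("uuid-freeze-1", "callset freeze 1"),
    Release("uuid-freeze-2", "callset freeze 2"))

  def resolve(pinnedUuid: Option[String]): Option[Release] = pinnedUuid match {
    case Some(u) => releases.find(_.uuid == u) // explicit, reproducible reference
    case None    => releases.lastOption        // default: the "latest-release" link
  }

  def main(args: Array[String]): Unit = {
    println(resolve(None))                  // Some(Release(uuid-freeze-2,...))
    println(resolve(Some("uuid-freeze-1"))) // the pinned, older freeze
  }
}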

To what extent when thinking about computing on petabyte scale data structures do you think about formal desirability like putting uuids on each object, and requiring immutability, and to what extent do you think about the implementation being lean and restricted to what is required to deliver the goals.

Personally, I am thinking of just generating a UUID for a complete released CallSet (or, equivalently, a VCF), not for smaller objects. For projects with a continuous flow of new samples, the current practice is still to set a few milestones for data releases. In publications, we do not often use transient data.

from ga4gh-schemas.

ekg avatar ekg commented on August 11, 2024

We should leave the versioning for the data storage layer. For instance, see the dat project, which aims to provide revision control and distributed data synchronization for arbitrarily-large files, databases, and data streams. We do not have to solve this problem. It already has a huge amount of attention in a more general context than genomics.

from ga4gh-schemas.

lh3 avatar lh3 commented on August 11, 2024

No, dat is a completely different layer. I doubt ga4gh ever wants to go into this complexity. From my own point of view, all I need is something roughly equivalent to data freezes that I can reference in my papers. It would be a disaster if the data used in my manuscript were dramatically changed and led to a different conclusion during review or immediately after publication.

from ga4gh-schemas.

cassiedoll avatar cassiedoll commented on August 11, 2024

@lh3 could the updated date work well enough for that?
As long as the updated date is older than your paper, you'll know your data is still in its stale state.

(And if it isn't, you'll know someone changed something - and an API backend could choose to help you recover from this if they wanted to - in one of a hundred different ways :)

from ga4gh-schemas.

massie avatar massie commented on August 11, 2024

Science requires reproducibility. We all agree on that.

Reproducibility is a hard problem that we don't need to tackle completely now. However, we don't want our APIs to get in the way when others (or we) decide to work on this later -- and they will.

We can also tackle this problem in pieces, focusing on data verification first. Since we all know and use git, I'll use it as an example. Git is a content-addressable filesystem. If you have a git repo and I have a git repo checked out with the same SHA-1, we know we're both looking at the same source code. That guarantee was designed into the system from the start.

While I'm not advocating that we build a full revision control system, I do think defining a standard for hashing over the (sub)content of our objects makes sense. That hash should be stored and exposed (instead of a random UUID) to make it easy (and fast) to create hashes over sets of objects (since we don't want to recalculate them when sets are created).

This design would also allow developers to create tools (similar to git-fsck) to verify the connectivity and validity of our data. It also answers @richarddurbin's question about how to handle reproducibility at scale. If your 100,000 genomes have the same GA4GH hash as mine, we know that we're operating on the same data.
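
For illustration, a minimal sketch of content addressing in this spirit (the record, field names and canonical serialization are invented; a real standard would have to fix these choices precisely):

import java.security.MessageDigest

// Hash a canonical serialization of an object, so two backends holding the same
// content derive the same address.
object ContentAddressSketch {
  final case class Call(callSetId: String, genotype: Seq[Int])

  // A canonical, order-stable serialization is what makes the hash reproducible.
  private def canonical(c: Call): String = s"${c.callSetId}|${c.genotype.mkString(",")}"

  def sha1Hex(s: String): String =
    MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
      .map(b => f"${b & 0xff}%02x").mkString

  def address(c: Call): String = sha1Hex(canonical(c))

  def main(args: Array[String]): Unit = {
    val c = Call("NA12878", Seq(0, 1))
    println(address(c))                            // same content -> same address on any backend
    println(address(c.copy(genotype = Seq(1, 1)))) // any change -> a new address
  }
}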

@ekg, as an aside, another project similar to the dat project is git-annex, which enables you to use git to track large binary files without checking them into git (just a symlink is used).

from ga4gh-schemas.

diekhans avatar diekhans commented on August 11, 2024

Richard Durbin [email protected] writes:

I am sympathetic about some of these ideas (see the PS), but still think that
people are thinking in terms of properties of a data store rather than an API
to compute with.

This has to do with the semantics of the data model presented by the API: how does the data change and what kind of life-cycle can one expect? There doesn't need to be a single life-cycle policy for all data sets; however, the API needs to be able to implement and express the behaviors.

To me, this is something that differentiates an API from a schema.

I think we confused things a bit in our description. Reproducibility
is built on both immutability and versioning. Immutability gives
a functional programming view of the data where all layers of the
system can assume that the data doesn't change in arbitrary ways.
This greatly simplifies programming and data management tasks.

Versioning is useful for more than archives, especially in an environment where one is experimenting with algorithms. However, the policy on levels of persistence can vary; for a lot of environments, only keeping the latest version makes sense.

Having the API define the unit of immutability is required for implementing versioning. For instance, it would be an insane amount of overhead to version every read, and it would be of almost no value. A read group is a very logical immutable unit; normally, it never changes.

Even if a given data source only keeps one version, it's simpler to have one API model that supports 1 or N versions rather than having it diverge. Even if it needs to diverge for efficiency reasons, the semantics need to be defined as part of the API.

I'd like to think about the consequences of this proposal. If
I want to calculate a new quality control metric for each call
in a call set, and add it as a new key:value attribute in the
info map, what would have to change?

A new version of the call set would be created. In a system
that only supports one version, this just means assigning a new
UUID and maybe recording that the old UUID has been replaced.

Will I end up with two complete call sets,

That depends on the underlying implementation

and if I query will
I get one by default, or if not, how will I know which one to
use?

There would be different types of queries. Probably the most
common just returns the latest version. You can ask for specific
versions via UUIDs, which might return an error if the version
is not retained.

How far up the chain does this go - do I get a new study when I add
a new call set to the study?

That depends on the data model, but I think it would be a bad design
to have study -> callset be a strict containment relationship, rather than a
relation that is queried. What happens now? If you add a new call set,
does the modification time on the study change?

What will be the time and memory implementation cost of this proposal?

For a system that doesn't keep multiple versions, there should be very little difference from updating a modification time. For a system that does keep multiple versions, the immutability requirement facilitates copy-on-write operations, which makes new versions cheap.
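
For illustration, a small copy-on-write sketch (all names invented): adding the new QC metric from Richard's example creates a new version id and a new info map, while the underlying call data is shared by reference rather than copied.

import java.util.UUID

// Illustrative copy-on-write: a new call-set version reuses the unchanged Call
// objects, so assigning a new UUID stays cheap even for large sets.
object CopyOnWriteSketch {
  final case class Call(callSetId: String, genotype: Seq[Int])
  final case class CallSetVersion(uuid: String, info: Map[String, String], calls: Vector[Call])

  def withNewInfo(old: CallSetVersion, key: String, value: String): CallSetVersion =
    // old.calls is reused, not duplicated: both versions point at the same Vector.
    old.copy(uuid = UUID.randomUUID().toString, info = old.info + (key -> value))

  def main(args: Array[String]): Unit = {
    val v1 = CallSetVersion(UUID.randomUUID().toString, Map.empty,
      Vector(Call("NA12878", Seq(0, 1)), Call("NA12891", Seq(1, 1))))
    val v2 = withNewInfo(v1, "newQcMetric", "0.93")
    println(v1.calls eq v2.calls) // true: the call data was shared, not copied
  }
}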

I am a bit concerned that we are losing sight of the fact that
we need to deal at scale.

I am very concerned about the scalability of the read API in
general. I have seen no performance analysis of the current API
design. JSON encoding, while better than XML, is not efficient.
Single-thread performance still matters.

My current position is still to think that this should be an optional add-on,
not a required part of the design.

The important thing is that the design needs to facilitate versioning by having immutability as part of its semantics. The implementation of versioning should be optional, but would not require a different API.

PS As it happens, I worked on a non-standard database system for genomic data
20 or so years ago called Acedb that also supported the ability to retrieve
objects

Nice story!! Glad someone was thinking of it.

from ga4gh-schemas.

lh3 avatar lh3 commented on August 11, 2024

@cassiedoll Typically we take a data freeze and work on that for months before the publication. The data are often analyzed multiple times in different but related ways. If I want to use APIs to access data, as opposed to storing the data locally, I need to get the exact data every time during these months and preferably after the publication. As long as ga4gh can achieve this, I am fine. I don't really care about how it is implemented. A static CallSet is just a simple solution among the many possibilities.

from ga4gh-schemas.

pgrosu avatar pgrosu commented on August 11, 2024

@lh3, I understand all publications are precious to their respective authors, but if you look at the collection of data across all of them, then the publications are just blips across this gigantic, ever-growing, yet critical set of data/processed results - especially for clinical studies.

So a couple of years ago there was a publication on the comparison of 1000 Genomes with HapMap. By now this would probably be considered a small study. As @richarddurbin mentioned, what about 100,000 Genomes, what about 1 billion genomes? Will you freeze that for every variant that will be published for a specific study?

That's why I keep mentioning petascale (or larger) data-processing APIs with parallel algorithms and data-structures. Yes, we can have an API for sharing data, but will it scale? I posted #131 for a reason, referring to Google Mesa, Pregel, etc. to expand our approach. Many places either have that in-house, AWS, or some other "cloud" approach which seems to handle such throughput. Will we have this API targeted for the web, or just cloud-based data-centers where the transfer is "local"? So using this approach, will new key:value pairs - or a settings-change in the QC pipeline - propagate across all selected studies, thus generating a duplicate version of the studies within hours for comparison? Having silos of data-freezes might not always be conducive to fully integrated, online, updated large studies. What if the studies become so large that you have to duplicate variants that were made from a collection of reads across 10+ years? Which published version(s) of the silos at different sites should we select, and how should we integrate/update the data in a global variant dataset/study/project for a specific disease? I imagine that something like the T2D (Type 2 Diabetes) studies at the Broad must be quite large. We're not talking about just a large data-store duplication, but an API that might have trouble handling the throughput to share that data, which should be ready to stream into processing/analysis pipelines.

from ga4gh-schemas.

diekhans avatar diekhans commented on August 11, 2024

Yes, this is precisely the issue. Given a repository that saves versions of data, I have no idea how I could go about retrieving the data matching a given freeze using the GA4GH APIs. I don't think it's possible.

One is back to making snapshots of data and sharing them.


from ga4gh-schemas.

vadimzalunin avatar vadimzalunin commented on August 11, 2024

Let me remind everyone about the existing archives (yes, I work for one, therefore biased) that need to be compatible with the API, or indeed the other way around. If the existing SRA model is not drastically bad then maybe it should be used as the basis. To me the problem is two-fold:

  1. Reads etc.
    some (most) objects are immutable, and others are provisional, for example pre-publication data.
    In rare cases the archives must be able to suppress/replace/kill data. I can't remember cases of replaced data but suppress and kill do happen. Shouldn't this propagate into the API?
  2. Calls etc.
    incremental updates exposed as a separate (virtual) object linked to the origin. Implementations may choose to make copies instead, but it should be abstracted from the API. Alternatively some may prefer to flip a series of increments into decrements, but again these are implementation details.

TLDR:

  1. enum status {DRAFT, FINAL, SUPPRESSED}
  2. incremental VCF updates.
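
For illustration only, the two TLDR items might be rendered along these lines (this is a sketch, not the SRA or GA4GH schema):

// A status flag: FINAL objects are immutable; SUPPRESSED objects are hidden but not rewritten.
object DataStatus extends Enumeration {
  val DRAFT, FINAL, SUPPRESSED = Value
}

// An incremental update exposed as a separate virtual object linked to its origin,
// rather than an in-place rewrite of the original variant set.
final case class VariantSetIncrement(
  id: String,
  originVariantSetId: String,   // the frozen set this increment applies to
  addedCallSetIds: Seq[String], // e.g. the one new sample merged in this week
  status: DataStatus.Value)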

from ga4gh-schemas.

diekhans avatar diekhans commented on August 11, 2024

Erik Garrison [email protected] writes:

We should leave the versioning for the data storage layer. For instance, see
the dat project, which aims to provide revision control and distributed data
synchronization for arbitrarily-large files, databases, and data streams. We do
not have to solve this problem. It already has a huge amount of attention in a
more general context than genomics.

Dat looks like an interesting system, and I completely agree that GA4GH should not solve the problem; that isn't the goal of this issue. GA4GH is defining APIs, and the APIs should specify semantics that allow implementers to solve the versioning problem.

The current API definition is defining file system-like
semantics that will make this harder.

from ga4gh-schemas.

ekg avatar ekg commented on August 11, 2024

@diekhans, would you elaborate on this a little?

The current API definition is defining file system-like semantics that will make this harder.

What specifically is the problem?

from ga4gh-schemas.

richarddurbin avatar richarddurbin commented on August 11, 2024

I still don't understand.

I agree with Heng that in practice what we need is to have data freezes or snapshots.

Currently we do this by making a fixed copy of the data at the freeze. Clearly that works, but it is inefficient. There can be more complex solutions that share unchanged objects. But I don't see how these change the user API. In either case I say right at the start, when I open my connection, that I want to access a named version of the data, then after that I just use the interface we have. I don't need to know additional uuids, or the semantics of the storage solution. It seems to me that all this discussion about immutable objects belongs in a layer that should be hidden from the user. The user should be equally happy for snapshots to be copies of the whole data set, or things maintained by other solutions at the level of whole objects, or parts of objects. The API shouldn't care.

Richard


from ga4gh-schemas.

massie avatar massie commented on August 11, 2024

We have different users here. We have end-users that want to access a simple "named" version of the data without any knowledge of UUIDs, hashes or implementation details. We also have developers that want to build interesting tools for data management, syncing, verifying, and sharing (which end-users will ultimately use).

We need APIs with both groups in mind: end-users and developers.

Currently we do this by making a fixed copy of the data at the freeze.

This is one problem that we want to solve since copying doesn't scale. Freezing petabytes of data isn't realistic. By having a content-addressable layer, we solve the issue of versioning/verification and minimize data movement between GA4GH teams that want to replicate data (and results).

This would be a great topic to discuss in person at our October meeting.

from ga4gh-schemas.

lh3 avatar lh3 commented on August 11, 2024

I say right at the start when I open my connection that I want to access a named version of the data, then after that I just use the interface we have.

@richarddurbin Currently ga4gh objects do not have stable names. They have IDs internal to each backend, but these IDs are not required to be stable by the schema. My understanding is that the UUIDs proposed by others serve as stable names, though I prefer the accession system that nearly all biological databases are using.

from ga4gh-schemas.

massie avatar massie commented on August 11, 2024

@lh3 Correct! We want to standardize the way we generate addresses for data content (at some fixed point in time).

A UUID is not the right tool to use here. While UUIDs are unique, they have no connection to content. Hashes (like SHA-1, SHA-2) are not just unique but also provide guarantees about the content of the underlying data (of course, hash collisions are possible but extremely rare).

End users wouldn't need to worry about the SHA-1 for their data (that's an implementation detail). They could just use names from a bio accession system that are then translated into content addresses (e.g. SHA-1). It is also imperative that these content addresses are the same across GA4GH backend implementations.
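
A tiny sketch of that two-level naming (every value below is invented, example values only): the human-facing accession resolves to a content address that any conforming backend would compute identically for the same content.

// Hypothetical resolver from accession (what a paper cites) to content address.
object AccessionResolverSketch {
  private val accessionToContentAddress = Map(
    "XYC0000001.2" -> "1a50d065799c4d32637dbe11eb66e5f1e8b35b89") // invented mapping

  def contentAddress(accession: String): Option[String] =
    accessionToContentAddress.get(accession)

  def main(args: Array[String]): Unit =
    println(contentAddress("XYC0000001.2")) // Some(1a50d0...)
}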

from ga4gh-schemas.

benedictpaten avatar benedictpaten commented on August 11, 2024

On Fri, Sep 12, 2014 at 11:56 AM, Matt Massie [email protected]
wrote:

@lh3 https://github.com/lh3 Correct! We want to standardize the way we
generate addresses for data content (at some fixed point in time).

A UUID is not the right tool
http://en.wikipedia.org/wiki/Universally_unique_identifier to use here.
While UUIDs are unique, they have no connection to content. Hashes (like
SHA-1, SHA-2) are not just unique but also provide guarantees about the
content of the underlying data (of course, hash collisions are possible but
extremely rare).

I agree this would enforce the connection, but computing the hashes might be computationally expensive? Hence the compromise of using UUIDs and the convention that each such id maps to a unique, static version of the dataset. I am no expert here - and am not wedded to UUIDs.

End users wouldn't need to worry about the SHA-1 for their data (that's an
implementation detail). They could just use names from a bio accession
system that are then translated into content addresses (e.g. SHA-1). It is
also imperative that these content addresses are the same across GA4GH
backend implementations.

I like this idea. Quoting IDs (whatever form - even hashes/UUIDs) however
is also a very precise, succinct way of referring to objects that does not
require centralisation.



from ga4gh-schemas.

mbaudis avatar mbaudis commented on August 11, 2024

@lh3

the accession system nearly all biological databases are using

But this is not a database. It is a format recommendation for an API; implementations, local naming schemas etc. may differ hugely.
At least for metadata, we make these distinctions and suggest the use of a UUID (all objects), a localID, and an accession.

@massie

... and for immutable objects (that is, most likely raw data, reads ... ?) you can add a hashedID. Many of the metadata objects (e.g. GAIndividual) will change content over time.

from ga4gh-schemas.

lh3 avatar lh3 commented on August 11, 2024

For the purpose of referencing a "named version", I don't mind whether the stable name is a UUID or an accession. Nonetheless, accessions do have some advantages, along with some downsides. For example, when I see GOC0123456789.3, I would know this is the 3rd freeze (3; if we keep the version) of a Google (GO) GACallSet (C). It is more informative and more user friendly than 123e4567-e89b-12d3-a456-426655440000. It may also be more flexible.
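
For illustration, a small sketch of how such an accession string could be unpacked (the format is hypothetical, following the GOC0123456789.3 example: source prefix, object type, serial, version):

// Invented accession format: two-letter source, one-letter object type, serial, ".version".
object AccessionParseSketch {
  private val AccessionPattern = """([A-Z]{2})([A-Z])(\d+)\.(\d+)""".r

  def parse(acc: String): Option[(String, String, String, Int)] = acc match {
    case AccessionPattern(source, objType, serial, version) =>
      Some((source, objType, serial, version.toInt))
    case _ => None
  }

  def main(args: Array[String]): Unit =
    println(parse("GOC0123456789.3")) // Some((GO,C,0123456789,3)): 3rd freeze of a "GO" CallSet
}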

I realize that perhaps I have not understood the UUID proposal when I see @benedictpaten talking about computational cost. I thought UUIDs for GACallSets are computed once and then stored in the database as stable names. I know we cannot store UUIDs for very small objects, but I do not care whether they have stable names or not.

from ga4gh-schemas.

diekhans avatar diekhans commented on August 11, 2024

Heng Li [email protected] writes:

For example, when I see GOC0123456789.3, I
would know this is the 3rd freeze (3; if we keep the version) of a Google (GO)
GACallSet (C). It is more informative and more user friendly than
123e4567-e89b-12d3-a456-426655440000. It may also be more flexible.

Human readable ids do have value to human intuition, and this shouldn't be ignored. UUIDs have value to computer algorithms managing data; they are more akin to a pointer or foreign key than a name. It leads to a lot of complexity to try to combine the two into one value.

TCGA created a huge mess by trying to use barcodes, which encode metadata about the sample, as the primary, unique key. Barcodes are incredibly valuable for humans, who can scan lists of them quickly. However, it turned out that the metadata encoded in the barcodes was sometimes wrong and had to be changed, which you don't want to do with your unique identifier.

TCGA switched to using UUIDs as the primary key, with barcodes kept as a human-readable description. This fixed a lot of problems.

For the details of TCGA barcode: https://wiki.nci.nih.gov/display/TCGA/TCGA+Barcode

The GA4GH APIs should provide for both a GUID and a name.

I realize that perhaps I have not understood the UUID proposal when I see
@benedictpaten talking about computational cost. I thought UUIDs for GACallSets
are computed once and then stored in the database as stable names.

This comes from Matt's proposal to use SHA1 hashes as GUIDs instead of UUIDs. Either approach provides a unique, fixed-width identifier. SHA1s can be recomputed from the data and used to validate the data against the GUID. UUIDs are very easy and cheap to create. It's entirely possible to use both, depending on the type of object.

It is important that the API not impose implementation details
on the data provider. One really wants creating a new version
to be implementable with very fast copy-on-write style of
algorithms. Needing to compute a hash may preclude this
implementation.

Defining the API to have opaque, fixed-width GUIDs allows the data providers to trade off UUIDs vs SHA1s as the implementation.

from ga4gh-schemas.

massie avatar massie commented on August 11, 2024

One thing we need to keep in mind: composability.

UUIDs are not composable whereas hashes are. In addition, hashes help prevent duplication of data (the same data has the same hash) whereas UUIDs do not (the same data could be stored under different UUIDs).

For example, let's say that we have a set object that contains 10 "foo" objects and each "foo" has a calculated hash field. To create the hash for the set only requires a quick merge of the 10 hashes (instead of rerunning the hash over all 10 "foo" objects' data). This is the power of composability. The hash that is calculated would be the same across all GA4GH repositories.

Performance is something we need to consider, of course.

Here's a code snippet for people to play with if you like (of course, you'll want to change the FileInputStream path)...

package example

import java.io.FileInputStream
import java.security.MessageDigest

object Sha1Example {

  def main(args: Array[String]): Unit = {
    val start = System.currentTimeMillis()
    val sha1 = MessageDigest.getInstance("SHA1")
    val bytes = new Array[Byte](1024 * 1024) // 1 MB read buffer
    val fis = new FileInputStream("/workspace/data/ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz")
    // Stream the file through the digest in 1 MB chunks until EOF (read returns -1).
    Stream.continually(fis.read(bytes)).takeWhile(_ != -1).foreach(sha1.update(bytes, 0, _))
    fis.close()
    val hashBytes = sha1.digest()

    // convert the hash bytes into a more human-readable hex form...
    val sb = new StringBuffer()
    hashBytes.foreach { a => sb.append(Integer.toString((a & 0xff) + 0x100, 16).substring(1)) }
    val end = System.currentTimeMillis()
    println(sb.toString)
    println("%d ms".format(end - start))
  }
}

Output on my MacBook:

1a50d065799c4d32637dbe11eb66e5f1e8b35b89
9570 ms

On my MacBook Pro, I was able to hash a ~2GB file at about ~180MB/s (single-threaded, single flash disk). This is just a very rough example and shouldn't be seen as a real benchmark. I just wanted to explain with working code since it's a language we all understand. Note: I also confirmed the hash using the shasum commandline utility.

Since hashes are composable, it's very easy to distribute the processing for performance too. Keep in mind, we will never have to recalculate a hash of data. It is calculated once, stored and composed for sets of objects.

from ga4gh-schemas.

delagoya avatar delagoya commented on August 11, 2024

Caveat emptor: hashes composed of other hashes are highly dependent on the order of the supplied component hashes. There is no guarantee that a particular data store will implement the hash ordering in exactly the same way as others.

This may seem trivial, but I've run into enough sorting problems in bioinformatics in my time that it should not be treated as a trivial concern.
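
A small illustration of that concern (a sketch only, not part of any schema): composing the same component hashes in a different order yields a different set hash, unless the store first sorts them into an agreed canonical order.

import java.security.MessageDigest

object ComposedHashOrder {

  private def sha1Hex(s: String): String =
    MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
      .map(b => f"${b & 0xff}%02x").mkString

  def main(args: Array[String]): Unit = {
    // pretend these are the per-object hashes of three "foo" objects
    val componentHashes = Seq("aaa111", "bbb222", "ccc333")

    // naive composition: the result depends on the order the components arrive in
    println(sha1Hex(componentHashes.mkString) == sha1Hex(componentHashes.reverse.mkString)) // false

    // canonical composition: sort the component hashes first, so every store agrees
    println(sha1Hex(componentHashes.sorted.mkString) == sha1Hex(componentHashes.reverse.sorted.mkString)) // true
  }
}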

from ga4gh-schemas.

massie avatar massie commented on August 11, 2024

@delagoya Agree.

For the data content, we can use a position-independent hash like CRC32 where order doesn't matter (you can feed bytes in any order you like). The CRC32 will allow us to validate data integrity but, being only 32 bits, will not provide the unique identifier we need, since collisions are fairly common.

from ga4gh-schemas.

diekhans avatar diekhans commented on August 11, 2024

Michael Baudis [email protected] writes:

... and for immutable objects (that is, most likely raw data, reads ... ?) you
can add a hashedID. Many of the metadata objects (e.g. GAIndividual) will
change content over time.

Ah, but that is the point of making all objects immutable.
Objects don't change; new ones, with new ids, are created from
the old ones. It solves version tracking and state management.

Getting the metadata corresponding to a particular experiment is
as important as getting the data. For example, trying to
understand why you get different results than a previous
experiment might hinge on the fact that the metadata was wrong
in the previous experiment.

We have spent an incredible amount of time trying to straighten
out a metadata mess for only ~6200 sequencing runs because the
metadata was mutable and modified with no way to track what was
done.

from ga4gh-schemas.

pgrosu avatar pgrosu commented on August 11, 2024

One small request, if possible. Since some aligners such as SNAP use seed strings, can we have our API automatically generate/update a variety of inverted indices keyed on genome seed strings, storing information about the reads/variants/annotations/etc. for faster searches? This concept is used in Information Retrieval and would help tremendously with a lot of the later analysis; the update step would also be fairly fast. Here are a couple of examples:

[figure: inv_index_call]

Or for annotations we can have it reversed:

[figure: inv_index_disease]

These can be distributed in parallel using Parquet, as in ADAM, or adapted to other layouts depending on what processing we want to perform. This can be extended to parallel updates for variant calling and annotations, though some changes would need to be implemented. Also, since genome assemblies of the same species have only minor variations relative to the whole genome, only a few new seeds would need to be added, with their associated information updated.
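
As a toy sketch of the idea (hypothetical record ids, nothing GA4GH-specific), an inverted index simply maps each k-mer seed string to the set of records that contain it:

object SeedInvertedIndex {
  // Extract fixed-length seed strings (k-mers) from a sequence.
  def seeds(sequence: String, k: Int): Iterator[String] =
    sequence.sliding(k)

  // Build seed -> set of record ids; adding new records is just more appends.
  def build(records: Map[String, String], k: Int): Map[String, Set[String]] =
    records.foldLeft(Map.empty[String, Set[String]]) { case (index, (id, seq)) =>
      seeds(seq, k).foldLeft(index) { (idx, seed) =>
        idx.updated(seed, idx.getOrElse(seed, Set.empty[String]) + id)
      }
    }

  def main(args: Array[String]): Unit = {
    val reads = Map("read1" -> "ACGTACGT", "read2" -> "TTACGTAA") // toy data
    val index = build(reads, k = 4)
    println(index.getOrElse("ACGT", Set.empty[String])) // which records contain seed ACGT
  }
}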

from ga4gh-schemas.

lh3 avatar lh3 commented on August 11, 2024

For a more concrete proposal, I suggest we add union {string,null} stableName=null to GAVariantSet (sorry that I was saying GACallSet but I really meant GAVariantSet) and allow it to be requested. If stableName is not null, the whole VariantSet should not be updated, though the data associated with it may be completely deleted later. If the data is deleted, the stableName should not be reused in the future. The stableName could be an accession or a UUID; that is entirely up to the implementors to decide. Hashes and UUIDs for all objects are useful, but we may discuss those in another thread.

For other objects, we can add stableNames to the objects currently accessioned by SRA.

Alternatively, we can add two fields, string stableName; bool released;, to GAVariantSet. This may be cleaner and more flexible.
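
Restated as a plain record for illustration (the actual schema is Avro IDL; the datasetId field is assumed for context, and the freeze check is just one possible reading of the proposal):

// Sketch of the proposed fields; not the actual Avro schema.
case class GAVariantSet(
  id: String,
  datasetId: String,                  // existing field, shown for context (assumed)
  stableName: Option[String] = None,  // union { null, string } stableName = null
  released: Boolean = false           // alternative: explicit one-way flag
)

object GAVariantSetOps {
  // Once a stable name is assigned (or the set is released), reject in-place updates.
  def update(vs: GAVariantSet)(change: GAVariantSet => GAVariantSet): Either[String, GAVariantSet] =
    if (vs.released || vs.stableName.isDefined)
      Left(s"variant set ${vs.id} is frozen; create a new variant set instead")
    else
      Right(change(vs))
}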

from ga4gh-schemas.

pgrosu avatar pgrosu commented on August 11, 2024

@lh3, I know what you're trying to say, but if it's considered stable then that's a tautology, and if the data can be erased then that's a contradiction: either it is locked, or it is not stable. Usually there are several levels of promotion for a dataset. Again, from what I've seen in the past, this can lead to a whole mess; you need an organizational layer of structure with oversight.

from ga4gh-schemas.

vadimzalunin avatar vadimzalunin commented on August 11, 2024

@pgrosu agreed, this should be a one-way road: DRAFT -> FINAL -> SUPPRESSED. There is no need to imply stability just from the name. Since this is so important, why not have a separate status for it? The API should be explicit where it matters.
PS: Suppressed SRA objects still have stable names and must remain accessible.
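
For illustration, a minimal sketch of such a one-way status (the status names are just the ones suggested above; this is not part of any schema):

sealed trait Status
case object Draft      extends Status
case object Final      extends Status
case object Suppressed extends Status

object Status {
  // One-way road: DRAFT -> FINAL -> SUPPRESSED; nothing ever moves backwards.
  private val allowed: Map[Status, Set[Status]] = Map(
    Draft      -> Set[Status](Final),
    Final      -> Set[Status](Suppressed),
    Suppressed -> Set.empty[Status]
  )

  def transition(from: Status, to: Status): Either[String, Status] =
    if (allowed(from).contains(to)) Right(to)
    else Left(s"illegal transition $from -> $to")
}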

from ga4gh-schemas.

pgrosu avatar pgrosu commented on August 11, 2024

@vadimzalunin, I agree with the one-way approach, and to capture it and the rest, I posted it as a new issue as #143.

from ga4gh-schemas.

adamnovak avatar adamnovak commented on August 11, 2024

Integrating over everyone, it looks like we need:

  1. IDs we can use to retrieve data sets and be sure we got the same data as was used in a publication (if we get it at all).

  2. The ability to cheaply update/replace data sets when we're testing pipelines or adding samples to experiments or adding annotations to existing data sets.

I agree with the idea of "don't give global IDs to data sets that you want to update". Whether those IDs should be hashes or not is not really clear to me. I'm not sure the cost of rehashing on updates will be high in practice, since you don't really take a freeze, give it a minor update, and declare it a new freeze. However, the finickiness of getting everything down to identical bits to be hashed makes this a hard problem.

from ga4gh-schemas.

fnothaft avatar fnothaft commented on August 11, 2024

Moving over from #135, cc @benedictpaten @cassiedoll @pgrosu @diekhans, also CC @massie who I know is interested.

Hi people,

This is something we will discuss at ASHG (Stephen marked it as an ASHG topic, thanks!). Gil and I think it would be good to have a point person for each ASHG topic. I nominate (he can disagree) Mark Diekhans as the point person for this reproducibility issue. He created some nice slides on his views (which I share) of both the issue and how we might tackle it:

https://www.dropbox.com/s/v8gu5rlo9yaeack/ga4gh-functional-objects.pdf?dl=0

From a quick glance, this looks reasonable; one concern with the pointing approach from the last slide arises with respect to deletions. E.g., if you point by reference and delete GAReadGroupSet 00100, do you then recursively try to delete GAReadGroup 00200 & 00300? If you don't, do you need to manually reclaim blocks, do you "garbage collect" unreferenced blocks, etc.? This decision will impact the API semantics and implementation.
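
For illustration only (hypothetical store and ids; the second parent is made up, and this is not a proposal for the API itself), one way to handle shared children is to reference-count them and reclaim a child only when its last referencing parent is deleted:

object RefCountedStore {
  // parent id -> ids of the children it points to (toy data)
  private var parents: Map[String, Set[String]] = Map(
    "GAReadGroupSet:00100" -> Set("GAReadGroup:00200", "GAReadGroup:00300"),
    "GAReadGroupSet:00101" -> Set("GAReadGroup:00300")
  )

  // Deleting a parent returns the children that no surviving parent still references,
  // i.e. the blocks that are now safe to garbage collect.
  def deleteParent(parentId: String): Set[String] = {
    val children = parents.getOrElse(parentId, Set.empty[String])
    parents -= parentId
    val stillReferenced = parents.values.flatten.toSet
    children.filterNot(stillReferenced.contains)
  }

  def main(args: Array[String]): Unit = {
    println(deleteParent("GAReadGroupSet:00100")) // Set(GAReadGroup:00200); 00300 is still referenced
  }
}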

from ga4gh-schemas.

cassiedoll avatar cassiedoll commented on August 11, 2024

(I'm removing the task team labels, as this is now covered under the ASHG topic umbrella)

from ga4gh-schemas.

benedictpaten avatar benedictpaten commented on August 11, 2024

Thanks Frank, deleted the post from the other topic.

from ga4gh-schemas.

tetron avatar tetron commented on August 11, 2024

Content addressing (identifying data by a hash of its contents) is a very powerful technique, forming the basis for systems such as Git and Arvados Keep. Deriving the identifier from the content, as opposed to assigning a random database ID, enables third-party verification that the content and identifier match. When it is necessary to assign a human-readable name and logically update a dataset, one can use techniques like Git branches or Arvados collections, which use an updatable name record that simply points to a specific content hash.

Even if the underlying database does not support versioning (so past versions are not stored and are thus inaccessible), providing a content hash field at least reveals that the content has changed substantively, in a way that is not captured by a simple timestamp field.

One challenge with hashing is that it is essential to define a bit-for-bit precise "normalized form" for a given data record so that different implementations will produce the same identifier given the same data. When using structured text formats such as JSON, this is tricky because differences in whitespace and object key ordering don't change the semantics of the actual record but will change the computed hash identifier.
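
As a sketch of what a normalized form could look like (this is not a defined GA4GH canonicalization, just an illustration): sort object keys and strip insignificant whitespace before hashing, so semantically equal records hash identically.

import java.security.MessageDigest

object CanonicalHash {
  // A minimal JSON-ish value type, purely for illustration.
  sealed trait Json
  case class JStr(s: String)                 extends Json
  case class JNum(n: BigDecimal)             extends Json
  case class JObj(fields: Map[String, Json]) extends Json

  // Normalized serialization: keys sorted, no whitespace.
  // (A real canonical form would also pin down string escaping, number formatting, etc.)
  def canonical(j: Json): String = j match {
    case JStr(s)  => "\"" + s + "\""
    case JNum(n)  => n.toString
    case JObj(fs) => fs.toSeq.sortBy(_._1)
      .map { case (k, v) => "\"" + k + "\":" + canonical(v) }
      .mkString("{", ",", "}")
  }

  def sha1Hex(s: String): String =
    MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"))
      .map(b => f"${b & 0xff}%02x").mkString

  def main(args: Array[String]): Unit = {
    val a = JObj(Map("name" -> JStr("NA12878"), "version" -> JNum(BigDecimal(2))))
    val b = JObj(Map("version" -> JNum(BigDecimal(2)), "name" -> JStr("NA12878")))
    println(sha1Hex(canonical(a)) == sha1Hex(canonical(b))) // true: key order no longer matters
  }
}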

Hash identifiers can be computed and provided alongside existing database identifiers, so it is not necessary to choose to use one or the other (although users may need to be educated when to use one or the other).

from ga4gh-schemas.

larssono avatar larssono commented on August 11, 2024

I realize that I am joining the conversation rather late, but scanning through the discussion it seems that many of the ideas overlap and are related, so I will drop my 2c. In order to record provenance (and by provenance I don't necessarily mean being able to reproduce a result identically to double precision, but to reproduce it in principle), it is necessary to store versions. And if you are storing versions, it is no longer enough to reference elements only by a globally unique identifier; the relationship between identifiers as different versions of each other is also important. Furthermore, it becomes useful to publish combinations of versions as freezes, much the same way that software commits can be tagged for a release.

In Synapse we have taken the approach that every item is referenceable by three methods: an accession id, a globally unique identifier (i.e., a hash of the data), and the provenance that generated it. One piece of data has an accession, and each of its versions has a version number. So, for example, a piece of data might have accession syn123, and version 2 of it would be accessible as syn123.2 (not specifying a version returns the latest version). These versions can also be retrieved by an md5 hash of the data, or by traversing a provenance graph as specified by the W3C spec (http://www.w3.org/TR/2013/REC-prov-dm-20130430/).
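
For readers unfamiliar with that convention, here is a rough sketch (with made-up storage) of how such versioned references resolve, with an unversioned accession falling back to the latest version:

object AccessionResolver {
  // accession -> (version number -> content hash); toy data, purely illustrative
  private val store: Map[String, Map[Int, String]] = Map(
    "syn123" -> Map(1 -> "md5-of-version-1", 2 -> "md5-of-version-2")
  )

  // "syn123.2" resolves to that exact version; "syn123" resolves to the latest one.
  def resolve(ref: String): Option[String] = {
    val (accession, version) = ref.split("\\.", 2) match {
      case Array(a, v) => (a, Some(v.toInt))
      case Array(a)    => (a, None)
    }
    store.get(accession).flatMap { versions =>
      version match {
        case Some(v) => versions.get(v)
        case None    => if (versions.isEmpty) None else Some(versions.maxBy(_._1)._2)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    println(resolve("syn123.2")) // Some(md5-of-version-2)
    println(resolve("syn123"))   // Some(md5-of-version-2) -- latest version
  }
}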

from ga4gh-schemas.

delagoya avatar delagoya commented on August 11, 2024

Closing this issue for lack of a PR or recent comments. It seems that #167 takes precedence for this issue.

from ga4gh-schemas.

awz avatar awz commented on August 11, 2024

@delagoya is the Containers and Workflows task team working on this issue? Maybe wait until they make progress before closing it? It seems to be pretty important and has quite a lot of content that is referenced by #167. Also pinging @fnothaft and @tetron, since they may be happy to close this.

from ga4gh-schemas.
