Comments (7)
from ga4gh-schemas.
The Treehouse compendium is ~11k samples by ~30k genes/features. Stored on disk in an HDF5 file it takes about 1 GB and loads into a dataframe (R or Python) in under 600 ms. Differential expression analysis typically requires getting a subset of the 'columns' - say 500 expression vectors after subsetting by disease - which works out to ~500 samples by 30k features. I suspect the emerging single-cell world will look much the same, with the addition of very sparse data that lends itself to compression. The current GA4GH reference server database schema, which stores a single row per expression level, gets none of the optimizations of either row or column orientation, significantly inflates the data (a float becomes ~128 bytes), and can't be compressed. The protocol buffer schema has the same issue. I suggest this be reconsidered in favor of storing the levels in an array - either a single-row blob, a column database, or an external file with metadata in the existing relational database - so that it is usable for current clinical differential-expression use cases as well as emerging single-cell research cases.
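To make the access pattern concrete, here is a minimal pandas sketch of the subsetting step described above. The sample/gene names and the disease annotation are made-up placeholders standing in for the compendium and its metadata:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Tiny stand-in for the ~11k-sample x ~30k-gene expression matrix.
expression = pd.DataFrame(
    rng.random((6, 4)),
    index=[f"sample_{i}" for i in range(6)],
    columns=[f"gene_{j}" for j in range(4)],
)
# Hypothetical per-sample clinical metadata.
metadata = pd.DataFrame(
    {"disease": ["glioma", "glioma", "leukemia",
                 "glioma", "leukemia", "glioma"]},
    index=expression.index,
)

# Differential-expression prep: select the expression vectors
# for one disease (in practice ~500 samples x 30k features).
glioma_samples = metadata.index[metadata["disease"] == "glioma"]
subset = expression.loc[glioma_samples]
print(subset.shape)
```

A column-oriented or array-backed store makes exactly this slice cheap; the one-row-per-level schema forces it through a large join instead.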
@saupchurch We were going to bring up this use case on the call today.
To bootstrap the effort of modeling the Genomics domain the way that we have, we've picked up a lot of assumptions of the underlying file types. Our variant representation is weighed down by VCF, read alignments by BAM, etc. The community moves slowly, and the schemas are our opportunity to decouple the database/file layer from the interaction layer.
As we evolve the API, in addition to adding the methods and fields that biologists find practical based on existing usage patterns, we should work to remove the legacy of the file representation and move more of that logic to standardized ETL pipelines, allowing biologists to spend more time reasoning about their domain.
An important concern Rob raises (thanks @rcurrie) is that we don't present methods that are useful for analysis. The same objection has been raised about the variants API: you can get everything back and reason well about single documents, but you always get more than you need.
I can imagine an interesting experiment where we provide an alternative way of querying expressions. Instead of splitting an expression-level lookup across ~5 requests, you pass in a list of sample identifiers and feature names.
Add an endpoint called "expressionlevels/select" that takes a Select message containing a list of sample_id and feature_id (or name).
message RNASelectRequest {
  repeated string sample_ids = 1;
  repeated string feature_ids = 2;
}
It then returns a table of quantifications, with the requested sample_ids as the columns and the requested feature_ids as the rows. Cells simply contain the expression level.
message ExpressionVector {
  string biosample_id = 1;
  string rna_quantification_id = 2;
  repeated float expression = 3;
}
message RNASelectResponse {
  // All expression vectors in the response are of the same length.
  repeated ExpressionVector expression_vectors = 1;
  // A list of feature ids whose length matches each expression vector.
  repeated string feature_ids = 2;
}
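On the client side, a response shaped like RNASelectResponse pivots directly into a samples-by-features table. A minimal sketch, with plain dicts standing in for the protobuf messages and made-up ids and values:

```python
import pandas as pd

# Hypothetical decoded RNASelectResponse; field names mirror
# the proposed schema above.
response = {
    "feature_ids": ["BRCA1", "TP53"],
    "expression_vectors": [
        {"biosample_id": "sampleA", "rna_quantification_id": "q1",
         "expression": [5.2, 0.9]},
        {"biosample_id": "sampleB", "rna_quantification_id": "q2",
         "expression": [4.8, 1.1]},
    ],
}

# One row per expression vector, tagged by biosample;
# columns follow the shared feature_ids list.
vectors = response["expression_vectors"]
table = pd.DataFrame(
    [v["expression"] for v in vectors],
    index=[v["biosample_id"] for v in vectors],
    columns=response["feature_ids"],
)
print(table)
```

A single response thus carries the whole slice, instead of one document per (sample, feature) pair.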
I believe this is tractable based on the data model we currently present and would demonstrate a valuable analysis use case. It reduces the transfer required for the most common access pattern (show me these samples against these genes) down to a minimum. The returned response is essentially a table where each row is tagged with the sample it came from, and the experiment it came from.
Constructing a select request will require iterating over metadata to get the list of sample IDs one is interested in. With a list of genes and samples in hand, one should be able to query arbitrarily large stores using different slicing techniques.
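That metadata-then-select flow might look like the following sketch, where the biosample records, disease field, and request shape are all assumptions based on the proposal above rather than the current API:

```python
def build_select_request(biosamples, disease, feature_ids):
    """Filter biosample metadata records by disease, then pair the
    surviving sample ids with the genes of interest."""
    sample_ids = [b["id"] for b in biosamples if b["disease"] == disease]
    return {"sample_ids": sample_ids, "feature_ids": feature_ids}

# Hypothetical metadata, e.g. gathered from a biosamples search.
biosamples = [
    {"id": "s1", "disease": "glioma"},
    {"id": "s2", "disease": "leukemia"},
    {"id": "s3", "disease": "glioma"},
]

request = build_select_request(biosamples, "glioma", ["BRCA1", "TP53"])
print(request["sample_ids"])  # ['s1', 's3']
```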
I would note that having the raw_read_count might also be useful, as I believe it is common to recalculate normalization when aggregating results. In that case we could provide a flag in the SelectRequest, or include both values in the vector.
This same pattern could be used with the variants API, I believe, where a list of sample IDs and variant IDs could be used to assemble a list of call vectors.
One of the problems with this approach is that it assumes the data can be easily queried in the way the method presents. In practice, it requires that API implementers keep their data in structures that can be joined and merged. The same assumption was made for the Reads API, whose ReadsSearchRequest has a method for assembling alignments from multiple BAMs. The problem is that few practical implementations can fetch across multiple BAMs; because of this, the reference server has never fully implemented that feature.
Part of the value of the API we present is that it aims to be low cost to implement over existing stores, tries to present only the methods required for interchange, and returns documents that can be reasoned about in isolation. We don't want everyone to need Hadoop, or all their RNA in one table, or to consult documentation/external metadata to see what a value means.
I believe a service could use the API as it stands to build the Select application described. Given that, one can imagine optimizing the lower layer to cache API results that would satisfy various requests, eventually storing them in a database.
We should definitely work to make the API efficient, but I imagine you can think of this layer of the API like the FASTQ of read alignment. It is a low-level protocol that analysis applications build on top of, and those applications take responsibility for filtering, removing keys, and making sure only the data needed to drive inquiry is transferred to you, the analyst.
The idea of a select query is a good one - this would essentially recreate an earlier proposed pattern where ExpressionLevel objects were associated with a "featureSet" and retrievable by that set. Your idea @david4096 would do a similar thing, with the added flexibility of making the "feature set" definable on the query side.
+1 on this idea. It would be great to have @david4096's approach implemented.