
Comments (7)

kozbo commented on August 11, 2024

@rcurrie


rcurrie commented on August 11, 2024

The Treehouse compendium is ~11k samples by 30k genes/features. Stored on disk as an HDF5 file it takes about 1 GB and loads into a dataframe (R or Python) in under 600 ms. Differential expression analysis typically requires getting a subset of the 'columns' - say 500 expression vectors after subsetting by disease. This works out to ~500 columns by 30k feature rows. I suspect the emerging single-cell world will be much the same, with the addition of very sparse data that lends itself to compression.

The current GA4GH reference server database schema stores a single row per expression level, so it gets none of the optimizations of either row or column orientation, significantly expands the data (each float into ~128 bytes), and can't be compressed. The protocol buffer schema has the same issue. I suggest this be reconsidered towards storing the levels in an array - either a single-row blob, a column database, or an external file with metadata in the existing relational database - so that it is usable for current clinical differential expression use cases as well as emerging single-cell research cases.
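To make the access pattern concrete, here is a minimal sketch of that slicing in Python/pandas; the file name, HDF5 keys, and disease label are hypothetical, chosen only for illustration, not Treehouse's actual layout:

import pandas as pd

# Full compendium: 30k feature rows x ~11k sample columns stored as one
# HDF5 dataset. Path and keys are hypothetical.
expression = pd.read_hdf("compendium.h5", key="expression")
samples = pd.read_hdf("compendium.h5", key="samples")  # per-sample metadata

# Subset by disease, then slice the matching columns in one shot:
# ~500 expression vectors by 30k feature rows.
case_ids = samples.index[samples["disease"] == "neuroblastoma"]
subset = expression[case_ids]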


kozbo commented on August 11, 2024

@saupchurch We were going to bring up this use case on the call today.


david4096 commented on August 11, 2024

To bootstrap the effort of modeling the genomics domain the way we have, we've picked up a lot of assumptions from the underlying file types. Our variant representation is weighed down by VCF, read alignments by BAM, etc. The community moves slowly, and the schemas are our opportunity to decouple the database/file layer from the interaction layer.

As we evolve the API, in addition to adding the methods and fields that biologists find practical based on existing usage patterns, we should work to remove the legacy of the file representation and move more of that logic to standardized ETL pipelines, allowing biologists to spend more time reasoning about their domain.

An important concern Rob raises (thanks @rcurrie) is that we don't present useful methods for analysis. The same objection has been raised about the variants API: you can get everything back and reason about single documents well, but you always get more than you need.

I can imagine an interesting experiment where we provide an alternative way of querying expressions. Instead of splitting across ~5 requests to get an expression level, you pass in a list of sample identifiers and feature names.

Add an endpoint called "expressionlevels/select" that takes a select message containing a list of sample_ids and feature_ids (or names).

message RNASelectRequest {
  // Identifiers of the biosamples whose quantifications are requested.
  repeated string sample_ids = 1;
  // Identifiers (or names) of the features to include in each vector.
  repeated string feature_ids = 2;
}

It then returns a table of quantifications with the requested sample_ids as columns and the requested feature_ids as rows. Cells simply contain the expression level.

message ExpressionVector {
  // The biosample this expression vector was measured from.
  string biosample_id = 1;
  // The RNA quantification the levels were drawn from.
  string rna_quantification_id = 2;
  // Expression levels, ordered to match feature_ids in the response.
  repeated float expression = 3;
}

message RNASelectResponse {
  // All expression vectors in the response are of the same length.
  repeated ExpressionVector expression_vectors = 1;

  // A list of feature ids that matches the length of each expression
  // vector.
  repeated string feature_ids = 2;
}

I believe this is tractable based on the data model we currently present and would demonstrate a valuable analysis use case. It reduces the transfer required for the most common access pattern (show me these samples against these genes) down to a minimum. The returned response is essentially a table where each row is tagged with the sample and the experiment it came from.

Constructing a select request will require iterating over metadata to get the list of sample IDs one is interested in. With a list of genes and samples, one should be able to query arbitrarily large stores using different slicing techniques.
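As a rough sketch of what that client side could look like in Python, assuming a hypothetical server URL and JSON field spellings taken from the messages above (nothing here is implemented yet):

import pandas as pd
import requests

# Hypothetical endpoint; host and route are assumptions.
URL = "http://localhost:8000/expressionlevels/select"

request = {
    "sample_ids": ["biosample-1", "biosample-2"],
    "feature_ids": ["ENSG00000141510", "ENSG00000012048"],
}
response = requests.post(URL, json=request).json()

# Reassemble the vectors into the samples-as-columns,
# features-as-rows table described above.
table = pd.DataFrame(
    {v["biosample_id"]: v["expression"] for v in response["expression_vectors"]},
    index=response["feature_ids"],
)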

I would note that having the raw_read_count might also be useful, as I believe it is common to recalculate normalization when aggregating results. In that case we could provide a flag in the RNASelectRequest, or include both values in the vector.
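As an illustration of why raw counts matter, here is a minimal sketch of the kind of renormalization (counts per million, purely as an example) a client might redo after aggregating raw_read_count values from several quantifications:

def counts_per_million(counts):
    # counts: features x samples dataframe of raw read counts, e.g.
    # assembled from select responses that include raw_read_count.
    # Rescale each sample by its library size so samples drawn from
    # different quantifications are comparable.
    return counts / counts.sum(axis=0) * 1e6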

This same pattern could be used with the variants API, I believe, where a list of sample IDs and variant IDs could be used to assemble a list of call vectors.

One of the problems with this approach is that it assumes the data can be easily queried in the way the method presents. In practice, it requires that API implementors keep their data in structures that can be joined and merged. The same assumption was made for the Reads API, which has a method for assembling alignments from multiple BAMs in the ReadsSearchRequest. The problem is that few practical implementations provide the ability to fetch across multiple BAMs, and because of this the reference server has never fully implemented that feature.

Part of the value of the API we present is that it aims to be low cost to implement over existing stores, presents only the methods required for interchange, and returns documents that can be reasoned about on their own. We don't want everyone to need Hadoop, or all their RNA in one table, or to consult documentation or external metadata to see what a value means.

I believe a service could use the API as it presents itself to create the Select application described. Considering that, one can imagine optimizing the lower layer to cache API results that would satisfy various requests, eventually storing them in a database.
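A minimal sketch of that layering, where get_expression_level stands in for a hypothetical wrapper around the existing per-object API, and a plain dict stands in for the cache or database:

# Lazily filled cache of (sample_id, feature_id) -> expression level;
# a real service might persist this in a database instead.
cache = {}

def select(sample_ids, feature_ids, get_expression_level):
    # Satisfy a select request by composing per-object API calls,
    # reusing any levels already fetched.
    vectors = {}
    for sample_id in sample_ids:
        vector = []
        for feature_id in feature_ids:
            key = (sample_id, feature_id)
            if key not in cache:
                cache[key] = get_expression_level(sample_id, feature_id)
            vector.append(cache[key])
        vectors[sample_id] = vector
    return vectors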

We should definitely work to make the API efficient, but I imagine you can think of this layer of the API like the FASTQ of read alignment. It is a low layer protocol that analysis applications will work on top of, and those analysis applications take the responsibility of filtering/removing keys, and making sure only the data needed to drive inquiry is transferred to you, the analyst.


saupchurch commented on August 11, 2024

The idea of a select query is a good one - it would essentially recreate an earlier proposed pattern where ExpressionLevel objects were associated with a "featureSet" and retrievable by that set. Your idea @david4096 would do a similar thing with the added flexibility of making the "feature set" definable on the query side.


kozbo commented on August 11, 2024

+1 on this idea. It would be great to have @david4096's approach implemented.


david4096 commented on August 11, 2024

@apoliakov

