eriksjolund / st_exp_protobuf

File format for spatial gene data. The file contains a small header, gene expression data, and image tiles from a high-resolution photo. The header serves as a table of contents, so a single image tile can be retrieved without reading the whole file.

License: Other

JavaScript 3.36% CMake 9.98% Protocol Buffer 4.26% C++ 80.57% C 1.01% Shell 0.82%

file-format gene-expression image-tiles demo protobuf

st_exp_protobuf's Introduction

Note: The file format might change in any way.

This experimental GitHub project is just a test bed for demonstrating how spatial gene data could be stored together with some microscope photos.

To try the file format out, first install this software and then create st_exp_protobuf files by downloading and converting online research data. The shell script sh/download_and_convert_example_data.sh automates the download and conversion.

st_exp_protobuf files can be viewed with a web viewer from osd-spot-viewer.

Design goals:

The file format should be efficient both for local file usage and for usage over a network. Network usage differs in that the size of the downloaded data and the number of read round-trips become important. If information about only one gene or one spot is needed, preferably only a few reads (few round-trips) should be required, and preferably only the required information should be downloaded. In other words, you shouldn't have to download the whole file if you are only interested in one specific gene. (Consider that AWS charges per downloaded GB.)
The header should be small. That is important if a web browser needs to download the headers of many experiments in order to present an overview of all experiments. (To be improved: the current header could be made somewhat smaller.)
C++, Python and JavaScript parsing and serialization examples should be included.

A note about the choice of serialization library:

Protobuf version 2 was chosen because it is a well-proven library that has been used by many for a long time. Two newer competitors, capnproto and flatbuffers, were considered but not chosen because:

capnproto, although very high-quality software, is labeled as beta software, and the JavaScript project capnp-js has had no commits since Feb 27 2015 (today's date is May 19 2016).
The JavaScript support for flatbuffers is not quite there yet. The JavaScript fuzz test fails as of May 19 2016. See support for different programming languages.

Installation (on Ubuntu 16.04)

sudo apt-get install libyajl-dev libyajl2
sudo apt-get install cmake ninja-build

Download and install Qt5.7.

mkdir /tmp/build
cd /tmp/build
/home/user/cmake-3.4.0-Linux-x86_64/bin/cmake "-DCMAKE_PREFIX_PATH=/home/user/Qt5.6.0/5.6/gcc_64" /path/to/st_exp_protobuf
make

Or, if you use Ninja as the build tool:

mkdir /tmp/build
cd /tmp/build
cmake -G Ninja "-DCMAKE_PREFIX_PATH=/home/user/Qt/5.7/gcc_64"  ~/st_exp_protobuf/
ninja

st_exp_protobuf's Issues

Store arrays of byte ranges more efficiently

Instead of storing byte ranges like this

message FileRegion {
   required uint64 regionOffset = 1;
   required uint64 regionSize = 2;
}

message InterestingByteRanges {
    repeated FileRegion fileRegions = 1;
}

I think we could store them like

message FileRegions {
    // A start position could define the end position of the previous entry.
    //  To define the endPosition of the last entry we would have to
    // add an extra StartPosition in the end of the array.
    repeated uint64 startPositions = 1;
}

message InterestingByteRanges {
    required FileRegions fileRegions = 1;
}

That should reduce the size of the header.
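
A minimal sketch of how the two representations relate, assuming the regions are stored back to back with no gaps (otherwise a start position cannot double as the previous end position). The struct mirrors the FileRegion message above; the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Plain C++ mirror of the FileRegion message shown above.
struct FileRegion {
    std::uint64_t regionOffset;
    std::uint64_t regionSize;
};

// Hypothetical helper: pack N contiguous regions into N + 1 start positions.
// The extra last entry is the end position of the final region.
std::vector<std::uint64_t> to_start_positions(const std::vector<FileRegion>& regions) {
    std::vector<std::uint64_t> starts;
    if (regions.empty()) return starts;
    starts.reserve(regions.size() + 1);
    for (std::size_t i = 0; i < regions.size(); ++i) {
        if (i > 0 && regions[i].regionOffset !=
                         regions[i - 1].regionOffset + regions[i - 1].regionSize) {
            throw std::invalid_argument("regions must be contiguous");
        }
        starts.push_back(regions[i].regionOffset);
    }
    starts.push_back(regions.back().regionOffset + regions.back().regionSize);
    return starts;
}

// Hypothetical helper: recover offset/size pairs from the start positions.
std::vector<FileRegion> to_regions(const std::vector<std::uint64_t>& starts) {
    std::vector<FileRegion> regions;
    for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
        if (starts[i + 1] < starts[i]) {
            throw std::invalid_argument("start positions must be non-decreasing");
        }
        regions.push_back({starts[i], starts[i + 1] - starts[i]});
    }
    return regions;
}
```

With N regions this stores N + 1 integers instead of 2N, and it also avoids one nested message per region.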

Should the barcode be included or not?

Right now the barcode is included in the st_exp_protobuf format.
Maybe we should leave it out of the format?

To remove it, the file st_exp.proto would have to be modified here:

message Spot {
  // We limit the total number of spots to (2^32-1) as the spot id is uint32_t.
  // Therefore we limit the grid coordinates to be max (2^16-1).
  required uint32 xCoordGrid = 1;  // integer in range: 0 ... (2^16-1) 
  required uint32 yCoordGrid = 2;  // integer in range: 0 ... (2^16-1)
  required float xCoordPhyscial = 3;  // millimeter
  required float yCoordPhyscial = 4;  // millimeter
  required string barcode = 5;
}
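
The comment in the Spot message ties the 16-bit coordinate limit to the 32-bit spot id. The sketch below shows one way such an id could be derived from the grid coordinates. This is only an assumption for illustration: the source does not say how the project actually computes spot ids, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <stdexcept>

// Largest grid coordinate allowed by the comment in the Spot message: 2^16 - 1.
constexpr std::uint32_t kMaxGridCoord = 65535;

// Hypothetical packing (an assumption, not the project's scheme):
// y goes into the upper 16 bits of the id, x into the lower 16 bits.
std::uint32_t make_spot_id(std::uint32_t xCoordGrid, std::uint32_t yCoordGrid) {
    if (xCoordGrid > kMaxGridCoord || yCoordGrid > kMaxGridCoord) {
        throw std::out_of_range("grid coordinate exceeds 2^16 - 1");
    }
    return (yCoordGrid << 16) | xCoordGrid;
}

// Inverse of make_spot_id.
std::uint32_t spot_x(std::uint32_t id) { return id & 0xFFFFu; }
std::uint32_t spot_y(std::uint32_t id) { return id >> 16; }
```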

Use correct image alignment information and not the identity matrix

Right now the crickconvert command does not store a correct imageAlignment (one that relates
image pixels to micrometers in the physical world).

$ grep "imageAlignment =" protobuf_schema/st_exp_protobuf/fullsize_image.proto 
  repeated float imageAlignment = 5;
$

It just uses an identity matrix.

$ grep identity_matrix{ c++/serialize_to_st_exp_protobuf/serialize_to_st_exp_protobuf.cc
  const std::array<float, 4> identity_matrix{1.0, 0.0, 0.0, 1.0};
$

The reason for this is that the crick file format does not contain any information about how the photo pixels relate to the physical real world. This should be fixed.

As long as there is incorrect information in the imageAlignment, it doesn't make sense to introduce
any scalebar (e.g. https://pages.nist.gov/OpenSeadragonScalebar/) in a viewer.
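
To make the role of the four floats concrete, here is a minimal sketch of applying such a matrix to a pixel coordinate. The row-major element order and the interpretation of the result as a physical coordinate are assumptions; the source only shows that four floats are stored and that the identity {1.0, 0.0, 0.0, 1.0} is currently used. The names Point and apply_alignment are hypothetical.

```cpp
#include <array>

struct Point {
    float x;
    float y;
};

// ASSUMPTION: the four floats form a row-major 2x2 matrix {a, b, c, d}, so that
//   physical.x = a * pixel.x + b * pixel.y
//   physical.y = c * pixel.x + d * pixel.y
Point apply_alignment(const std::array<float, 4>& m, Point pixel) {
    return {m[0] * pixel.x + m[1] * pixel.y,
            m[2] * pixel.x + m[3] * pixel.y};
}
```

With the identity matrix the output equals the input, so pixel units would be reported as if they were physical units. Note also that a 2x2 matrix can express scale and rotation but not an offset between the image origin and the physical origin.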

What information should be stored in the header?

Information we want to store can either be placed inside the header or as a byte range
outside of the header.
This is sort of an open question: What information should be stored inside the header?
The choice of what to put inside the header influences both the number of bytes read and the number of read operations needed to retrieve a given piece of information from the file.

Right now the gene names and the spot coordinate information are stored outside the header but maybe we should put them inside the header?
Advantage: The osd-spot-viewer
(https://github.com/eriksjolund/osd-spot-viewer/blob/master/from_layout/index.html)
would make fewer read calls. The parsing code would also be somewhat simpler.

Drawback: The header would get bigger. There might be times when that information is not needed
and someone retrieving other information out of the file would have to pay an extra cost.
