
Comments (22)

bmaranville commented on May 18, 2024

I haven't made a new release yet, but if you want to try it out, here is a zipped version of the library with WORKERFS support built in: h5wasm-0.4.7-extra.tar.gz. It also has an IIFE build, which you might want for playing around with workers (though you could of course bundle your own worker script). Here is a working demo based on the SO post mentioned earlier (I put it in a folder within the unpacked h5wasm package, so that the path ../dist/iife/h5wasm.js makes sense. YMMV)

<html>
<head>
<script>
    const worker = new Worker("worker.js");
    function onClick() {
        const f = document.getElementById("in-file").files[0];
        worker.postMessage([ f ]);
    }
</script>
</head>
<body>
    <input type="file" id="in-file" />
    <input type="button" onClick="onClick()" value="ok" />
</body>
</html>
// worker.js
// Load the IIFE build first so the h5wasm global exists before any messages arrive.
self.importScripts('../dist/iife/h5wasm.js');

onmessage = async function(e) {
    const { FS } = await h5wasm.ready;

    const f_in = e.data[0];

    // Mount the user's File object via WORKERFS: read-only, no copy into memory.
    FS.mkdir('/work');
    FS.mount(FS.filesystems.WORKERFS, { files: [f_in] }, '/work');

    const f = new h5wasm.File(`/work/${f_in.name}`, 'r');
    console.log(f);
}

from h5wasm.

bmaranville commented on May 18, 2024

For local files, I'm working on a web worker proxy that exposes most of the h5wasm API (though all of it becomes async) through Comlink.

See the new PR at: #70

bmaranville commented on May 18, 2024

jsfive already loads datasets (and groups, for that matter) "on demand", in the sense that calling Dataset.value triggers a read of the relevant bytes (either chunked or contiguous) to construct the desired output value. The Group and Dataset classes hold a reference to a Dataobjects instance once they are loaded, which has a bunch of addressing information similar to what you describe. If the underlying buffer for jsfive.File is changed to a random-access async system, I think it will already be pretty efficient.

bmaranville commented on May 18, 2024

That's a good catch! Maybe this is already supported out of the box with WORKERFS... it's worth a try.

bmaranville commented on May 18, 2024

See new release v0.4.8 on npm and github

jrobinso commented on May 18, 2024

Hi all, I might have a solution for this based on jsfive here: https://github.com/jrobinso/hdf5-indexed-reader. We are now remotely accessing individual datasets from an HDF5 file ~180 GB in size in the spacewalk project. @bmaranville had done most of the work in the async branch of jsfive. For now I have forked jsfive and made additions to (1) use byte-range requests for just the portions of the file needed, and (2) support an optional index to find dataset offsets without walking the tree of ancestors and siblings. Walking HDF5's internal b-tree container index turns out to be very expensive over the web, as this metadata can be anywhere in the file: you end up generating an HTTP request for each individual container visited. In our use case this can be in the thousands, hence the need for the index. In some schemas, however, the index might not be needed; it is optional.
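The byte-range access described here boils down to HTTP Range requests. A minimal sketch with fetch (the function names are illustrative, not the hdf5-indexed-reader API):

```javascript
// Sketch: read an arbitrary byte extent of a remote HDF5 file.
// Function names here are hypothetical, not library API.
function rangeHeader(start, length) {
  // HTTP Range headers use inclusive end offsets: bytes=start-(start+length-1)
  return `bytes=${start}-${start + length - 1}`;
}

async function fetchRange(url, start, length) {
  const response = await fetch(url, {
    headers: { Range: rangeHeader(start, length) },
  });
  // 206 Partial Content confirms the server honored the range request
  if (response.status !== 206) {
    throw new Error(`server did not honor range request: ${response.status}`);
  }
  return response.arrayBuffer();
}
```

Each container visited during tree-walking would cost one such request, which is why an up-front index that maps dataset paths to offsets saves thousands of round trips.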

bmaranville commented on May 18, 2024

This is an interesting question! The combination you want - local files, accessed through the browser, larger than 2 GB (or even larger than system memory) - presents several challenges...

For h5wasm, you could indeed use lazyFileLRU to deal with the large size issue, but local filesystem access is only asynchronous and (at the moment) the emscripten emulated filesystem can only be used in a synchronous mode. This is why lazyFile (and lazyFileLRU) have to run in a worker, because there they can take advantage of synchronous fetch calls that are no longer allowed in the main JS thread.

For jsfive, the library was written to be synchronous and load the whole file into an ArrayBuffer on instantiation of the root jsfive.File object. For your use case, this is bad because the maximum size of an ArrayBuffer is set by the browser and is e.g. 2GB in the current Chrome version on OSX, which is much smaller than the files you are wanting to process.

The only possibility I can see for solving your problem is to create a new version of jsfive ('jsfive-async') that replaces all buffer.slice operations with async function calls that read a local file through the javascript File API, which seems to allow random access once a file is picked through an <input type="file" /> element. That access is always async though, so the ArrayBuffer used as the main "storage" of jsfive would be replaced with a wrapper of the File object...

class AsyncBuffer {
  // Wraps a File object, exposing the async slice() interface jsfive-async expects.
  constructor(file_obj) {
    this.file_obj = file_obj;
  }
  async slice(start, stop) {
    // File.slice() is synchronous and returns a Blob; reading its bytes is async.
    return this.file_obj.slice(start, stop).arrayBuffer();
  }
}

On top of that, all the classes that are instantiated by reading from the file (which is most of the classes in jsfive) would have to be rewritten to have an async init() method in addition to the constructor, that would have to be awaited after each construction.

It's probably doable, but it would be a bit of work.
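That constructor-plus-init pattern might look like the following sketch (a hypothetical class, not actual jsfive code): construction stays cheap and synchronous, and every file read moves into an awaited init() method.

```javascript
// Sketch of the async-init pattern: JS constructors cannot be async,
// so reads from the async buffer are deferred to init().
class LazyObject {
  constructor(buffer, offset) {
    this.buffer = buffer;   // AsyncBuffer-like: provides async slice(start, stop)
    this.offset = offset;
    this.header = null;     // filled in by init()
  }
  async init() {
    // e.g. read a hypothetical 16-byte header at this object's offset
    this.header = await this.buffer.slice(this.offset, this.offset + 16);
    return this; // returning `this` allows `await new LazyObject(buf, 0).init()`
  }
}
```

Every construction site in the library would then become `const obj = await new LazyObject(buf, addr).init();`, which is exactly the "awaited after each construction" rewrite described above.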

bmaranville commented on May 18, 2024

Note that loading HDF5 requires random access to the file, not sequential "streaming" access, as the internal structure of the file is not linear (and is in fact fractal at times! - see FractalHeap)

eweitz commented on May 18, 2024

Thanks Brian, that's rich and insightful guidance. Your outline broadly makes sense to me.

Recap

My main takeaway from your comments above is to use the File API for random access without loading the whole file into memory as an ArrayBuffer. Your suggestions to use web workers and refactor to async functions also seem on point.

Values -> pointers

Beyond async, I think your suggestions entail that a new jsfive-async library would benefit from using pointers instead of directly loading values that contain large amounts of data. So, for example, I might read the whole file in small chunks via the File API, and track the byte offsets where large datasets in the HDF5 file start and stop. Then, in subsequent operations, I could quickly look up the addresses of datasets A, B, C and so forth in the source HDF5 file, whose bytes stay outside memory. That would let me load only dataset A after the initial whole-file scan, rather than needing to stream-read the whole file more than once.

Such an instantiated jsfive-async HDF5 File object would be more of an index than a file with directly-useful content.

HDF5 index file companion

I could also see making that jsfive-async HDF5 index object into a file itself. The index file would be much smaller than the source HDF5 file, but enable fast retrieval and random access. The index file might even travel alongside the larger source HDF5 file. That'd help memory-constrained clients. The index file would aid my particular use case, but I suspect it'd be even more valuable for fast random-access retrieval from remote servers via HTTP range requests. I imagine that'd be a more prevalent use case.

This idea underpins BAM and BAI files, which are common in genomics. This new index file would be like BAI files for HDF5. HDF5 files seem more structurally complex than BAMs, which I think would be the main barrier here.
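To make the idea concrete, here is a rough sketch of what such a companion index might contain and how a client could use it with the File API. The format is entirely hypothetical, not an existing HDF5 convention:

```javascript
// Hypothetical companion index: dataset path -> byte extent in the HDF5 file.
// This structure is illustrative only; nothing like it is standardized for HDF5.
const index = {
  "/raw/images": { offset: 4096, length: 1_000_000 },
  "/models/3d":  { offset: 1_004_096, length: 250_000 },
};

// Given a File object picked via an <input type="file"> element,
// read just one dataset's bytes without scanning the whole file.
async function readDatasetBytes(file, index, path) {
  const entry = index[path];
  if (!entry) throw new Error(`no index entry for ${path}`);
  // File.slice() is cheap: it creates a Blob view, not a copy
  return file.slice(entry.offset, entry.offset + entry.length).arrayBuffer();
}
```

Served alongside the big file, the same lookup would translate directly into an HTTP Range request for the remote-access case.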

Questions

Beyond the considerable effort, do you see any fundamental issues with the outline above?

Also, it's worth noting that I'm an HDF5 novice, so I may well be overlooking something. Briefly researching, the closest construct I found to an HDF5 index file is HDF5 virtual datasets (VDS). However, the VDS reference doesn't mention "stream" and has no relevant hits for "memory", so at a glance my hunch is that VDS does not address the use cases that an HDF5 index file would. Is there an existing solution in the HDF5 community for what HDF5 index files would solve?

bmaranville commented on May 18, 2024

I think I have something that works... you can extract the compiled esm/index.mjs from the attachment at the bottom (had to fix a problem with the filter pipeline, it's all working now...), and then in your page do something like the following. I tested on a 16 GB local file and was able to load and browse it just fine.

import * as jsfive_async from './jsfive/dist/esm/index.mjs';

class AsyncBuffer {
  // Wraps a File object, exposing the async slice() interface jsfive-async expects.
  constructor(file_obj) {
    this.file_obj = file_obj;
  }
  async slice(start, stop) {
    // File.slice() is synchronous and returns a Blob; reading its bytes is async.
    return this.file_obj.slice(start, stop).arrayBuffer();
  }
}

const file_input = document.getElementById("file_input");
file_input.onchange = async function() {
  const file = file_input.files[0];
  const async_buf = new AsyncBuffer(file);
  const f = new jsfive_async.File(async_buf);
  await f.ready;
  // ... then do stuff with the file, e.g. :
  window.f = f; // now you can play with it in the console
  console.log(f.keys);
  // if you have a group called 'entry':
  let entry = await f.get('entry');
  let dataset = await entry.get('data');
  // shape, dtype and value are all async now:
  console.log(await dataset.shape);
  console.log(await dataset.dtype);
  console.log(await dataset.value); // don't do this if your dataset is big!
}

(this is built from the async branch of jsfive, which I just pushed)
dist.zip

axelboc commented on May 18, 2024

For h5wasm, you could indeed use lazyFileLRU to deal with the large size issue, but local filesystem access is only asynchronous and (at the moment) the emscripten emulated filesystem can only be used in a synchronous mode. This is why lazyFile (and lazyFileLRU) have to run in a worker, because there they can take advantage of synchronous fetch calls that are no longer allowed in the main JS thread.

Would it be realistic for h5wasm to provide a way to bypass Emscripten and let JavaScript take care of random file access with await file.slice(start, end).arrayBuffer() or HTTP range requests?

bmaranville commented on May 18, 2024

Yes, @axelboc you can use HTTP range requests (lazyFile, lazyFileLRU etc.) with h5wasm, but only in synchronous mode, not async. For the moment I think there is not support for async filesystems in emscripten, even if you write your own FS driver (the interface on the emscripten side is synchronous).
Note that because the access has to be synchronous, you have to make sync fetch calls with the range requests, and sync fetch calls are only permitted in a worker.
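For reference, the synchronous fetch that lazyFile-style backends rely on is essentially a blocking XMLHttpRequest with a Range header, which browsers permit only inside a worker. A sketch of the idea (not lazyFile's actual code):

```javascript
// Sketch: blocking byte-range read, legal only inside a web worker.
// This mirrors what lazyFile-style backends do; it is not their exact code.
function readRangeSync(url, start, end) {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url, false); // third argument false => synchronous request
  // Note: setting a non-text responseType on a sync XHR is only allowed in workers
  xhr.responseType = "arraybuffer";
  xhr.setRequestHeader("Range", `bytes=${start}-${end}`); // inclusive end offset
  xhr.send(); // blocks the worker thread until the response arrives
  if (xhr.status !== 206) {
    throw new Error(`range request failed: ${xhr.status}`);
  }
  return xhr.response; // ArrayBuffer with just the requested bytes
}
```

Because the call blocks, emscripten's synchronous filesystem interface can sit directly on top of it, at the cost of stalling the worker for each read.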

axelboc commented on May 18, 2024

Right okay, so await file.slice(start, end).arrayBuffer() is just not possible because file system calls have to be synchronous.

While taking a closer look at Emscripten's file system API, I noticed that they provide a WORKERFS file system that supposedly "provides read-only access to File and Blob objects inside a worker without copying the entire data into memory and can potentially be used for huge files."

Would building h5wasm with WORKERFS and mounting it instead of the default MEMFS solve the issue of reading huge local files?

EDIT: found this StackOverflow thread that might be of help: https://stackoverflow.com/questions/59128901/reading-large-user-provided-file-from-emscripten-chunk-at-a-time

turner commented on May 18, 2024

Hi,
I just found this thread. I am a bit unclear on the strategy for reading a file that exceeds client memory. My files range from a few hundred MB to a few GB. I will need the ability to retrieve a user-selected chunk from the larger file; the chunks are of a size that will fit in client memory.

Can I use h5wasm for this?

Thanks

turner commented on May 18, 2024

@Carnageous the issue is that h5wasm - as an intermediate step - immediately creates an ArrayBuffer which then gets written to disk:

        const { FS } = await ready
        FS.writeFile(name, new Uint8Array(arrayBuffer))
        const hdf5 = new h5wasmFile(name, 'r')

This clearly makes it impossible to use a file that exceeds client memory, effectively removing a key feature of HDF5: the ability to work easily with humongous files.

What are my options here?

bmaranville commented on May 18, 2024

You don't have to use the emscripten MEMFS filesystem if you don't want to. You'll get a synchronous "traditional" MEMFS file backed by an ArrayBuffer if you invoke FS.writeFile as in the comment above and in the h5wasm example docs.

There are other virtual filesystems to choose from: see https://emscripten.org/docs/api_reference/Filesystem-API.html#file-systems. The one that might solve the problem here is the WORKERFS filesystem. I haven't found any examples of usage yet, but the documentation suggests it does exactly what you are looking for.

The h5wasm library would have to be compiled with support for the WORKERFS, which is not happening right now. I am building it with support for IDBFS (allowing persisting files to browser IndexedDB storage between sessions) and can easily add support for WORKERFS. I don't think that will cause any conflicts or add much size to the library.

EDIT: here is an example of using WORKERFS I found: https://stackoverflow.com/questions/59128901/reading-large-user-provided-file-from-emscripten-chunk-at-a-time

EDITED again: I can't believe I missed that @axelboc had already posted the same stackoverflow link earlier in this thread.

turner commented on May 18, 2024

You don't have to use the emscripten MEMFS filesystem if you don't want to. You'll get a synchronous "traditional" MEMFS file backed by an ArrayBuffer if you invoke FS.writeFile as in the comment above and in the h5wasm example docs.

There are other virtual filesystems to choose from: see https://emscripten.org/docs/api_reference/Filesystem-API.html#file-systems. The one that might solve the problem here is the WORKERFS filesystem. I haven't found any examples of usage yet, but the documentation suggests it does exactly what you are looking for.

The h5wasm library would have to be compiled with support for the WORKERFS, which is not happening right now. I am building it with support for IDBFS (allowing persisting files to browser IndexedDB storage between sessions) and can easily add support for WORKERFS. I don't think that will cause any conflicts or add much size to the library.

EDIT: here is an example of using WORKERFS I found: https://stackoverflow.com/questions/59128901/reading-large-user-provided-file-from-emscripten-chunk-at-a-time

@bmaranville thanks for the rapid response. I'll check out your example. For more context: my use case involves super-resolution microscopy for 3D imaging of chromosomes, with thousands of raw high-res images (video frames) that get processed downstream into 3D models and Hi-C maps. Until I found HDF5 these files have all been separate entities. HDF5 could be a game changer for us.

turner commented on May 18, 2024

See new release v0.4.8 on npm and github

Very cool. Thanks Brian.

turner commented on May 18, 2024

I will start experimenting with this worker approach to handling large files (greater than available RAM). I am a bit unclear on how to interactively retrieve various datasets from within my app. Once mounted, how is this file made available to the hosting app?

bmaranville commented on May 18, 2024

The downside of the worker is that you have to indirectly access the h5wasm object through the worker interface. With a service worker you can intercept fetch requests and define a REST API in the worker that is accessed through the main page script, or you can use an add-on like 'promise-worker' with a regular webworker if you want to get responses to messages sent to a worker.
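The "responses to messages" approach can also be hand-rolled with an id-to-promise map. This sketch shows the idea behind add-ons like promise-worker, not their actual API:

```javascript
// Sketch: match worker replies to requests with an incrementing id,
// so the main thread can `await` dataset reads across the worker boundary.
function makeRequester(worker) {
  const pending = new Map(); // id -> resolve callback
  let nextId = 0;
  worker.onmessage = (e) => {
    const { id, result } = e.data;
    const resolve = pending.get(id);
    if (resolve) {
      pending.delete(id);
      resolve(result);
    }
  };
  return function request(payload) {
    const id = nextId++;
    return new Promise((resolve) => {
      pending.set(id, resolve);
      worker.postMessage({ id, payload });
    });
  };
}
// Hypothetical usage, assuming the worker answers {op, path} messages:
//   const request = makeRequester(new Worker("h5-worker.js"));
//   const shape = await request({ op: "shape", path: "/entry/data" });
```

The worker side would echo the id back with each result so replies can be matched even when requests overlap.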

turner commented on May 18, 2024

Brian, this is a bit off topic, but I have a basic question about h5wasm regarding these larger-than-memory files. As a sanity check I threw together a Jupyter notebook to play with a 6.5 GB file I am working with. The notebook uses h5py and works perfectly. Is there some fundamental limitation of the JS implementation that prevents using large files directly (without resorting to a worker)? Or is it just an issue of this being early in the development of h5wasm, with support on the roadmap for sometime in the future? Thanks.

bmaranville commented on May 18, 2024

The issue is really two things: the emscripten file system does not allow async access right now, and all major browsers forbid synchronous File API access from the main thread. You are only allowed to run synchronous (blocking) file access (or URL fetches!) from a web worker, so that you don't block the main javascript thread.

The second thing is not likely to ever change - they are probably not going to allow sync file/fetch access from the main javascript thread again. The first thing might change - emscripten might support async file access at some point. I don't completely understand all the discussions along this topic but you can see emscripten-core/emscripten#15041

I don't usually recommend jsfive over h5wasm, but for this particular use case it is possible to build jsfive in async mode, and make all accessors (getting data, attributes etc.) async as well, and then you can use it in the main thread. See #40 (comment) above. If there is demand for this, I will release this async version of jsfive, probably as a separate package.

EDIT: to answer your question more directly, the reason it works in Jupyter is that the HDF5 file is being read by the python process running directly on your OS, while the h5wasm code is running in your browser. You can also run h5wasm in nodejs, and it can do random access on files in the OS just fine - just like h5py!
