Comments (5)
There are indeed very few instances of raw image data in datasets: it's usually better to keep images encoded as JPEG/PNG files to save disk space. The common way to store image/audio data is simply to store the path to the image/audio file.
from dataset-viewer.
I agree: we can provide JPEG or PNG files from the endpoint, which are then linkable, cacheable, easier to manage on the frontend side. We will have to manage a special type for them: see #25 (maybe return both the raw values + the image URL)
re audio: from https://observablehq.com/@huggingface/types-of-the-datasets-columns, I understand that all the audio datasets have columns with a path to the audio file. Tensors are just used for images (Array2D for mnist: black and white, and Array3D for cifar: color)
After chatting with @lhoestq, I think that for all the types of images (see https://github.com/huggingface/moon-landing/pull/1040), I will provide the image inside the /rows endpoint response as:
image: {
data: base64
type: str (mimetype)
filename: str (optional)
}
This would allow showing the images with data URIs (https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs), and downloading them as files with the File constructor (https://developer.mozilla.org/en-US/docs/Web/API/File/File).
This would avoid having to maintain a complex backend (keep a cache of all the images, manage the uniqueness of the image URLs, have an internet-facing API, etc.)
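As a rough sketch of the proposal above (not the actual dataset-viewer code), the image cell and the matching data URI could be built as follows. The function names `to_image_cell` and `to_data_uri` are hypothetical; only the `data`, `type` and `filename` fields come from the structure described in this comment.

```python
import base64
from typing import Optional


def to_image_cell(data: bytes, mimetype: str, filename: Optional[str] = None) -> dict:
    """Build the {data, type, filename} cell proposed above (hypothetical helper)."""
    cell = {
        # Base64 text is plain ASCII, so it can be embedded in a JSON response.
        "data": base64.b64encode(data).decode("ascii"),
        "type": mimetype,
    }
    if filename is not None:
        cell["filename"] = filename  # optional, as in the proposal
    return cell


def to_data_uri(cell: dict) -> str:
    """Turn a cell into a data URI of the form data:<mimetype>;base64,<payload>."""
    return f"data:{cell['type']};base64,{cell['data']}"
```

A frontend could put the resulting string directly in an `<img src=...>` attribute. Note that base64 grows the payload by about a third compared to the raw bytes.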
In some cases, the data will come directly from the datasets library; in other cases, like mnist or cifar10, I will have to generate an image (in one of the formats listed at https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types#image_types) and then provide the bytes.
Same idea for the audio files.
Note: https://github.com/ahupp/python-magic might be used to detect the mimetype
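For array-based datasets such as mnist, generating the image bytes could look like the sketch below. It is only an illustration under stated assumptions: it encodes an 8-bit grayscale 2D array as PNG using the standard library only, whereas a real implementation would more likely rely on an imaging library. The function name `grayscale_to_png` is hypothetical.

```python
import struct
import zlib
from typing import Sequence


def _chunk(tag: bytes, payload: bytes) -> bytes:
    """Serialize one PNG chunk: length, tag, payload, CRC over tag + payload."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def grayscale_to_png(pixels: Sequence[Sequence[int]]) -> bytes:
    """Encode a non-empty 2D array of 0..255 values as an 8-bit grayscale PNG."""
    height = len(pixels)
    width = len(pixels[0])
    # IHDR: width, height, bit depth 8, color type 0 (grayscale),
    # compression 0, filter method 0, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in pixels)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )
```

The returned bytes would be served with the `image/png` mimetype. A color array (Array3D, as for cifar) would need color type 2 and three bytes per pixel, which this sketch does not cover.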
In the end, I store the data in a file, serve that file as a static file, and give its URL in the response. For example, see:
- the JSON: https://datasets-preview.huggingface.tech/rows?dataset=mnist
- the image asset: https://datasets-preview.huggingface.tech/assets/mnist/___/mnist/train/0/image/image.jpg
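A minimal sketch of this approach, assuming the asset path follows the layout visible in the example URL above (dataset, `___`, config, split, row index, column, filename). The function name `write_asset` and the `assets_dir` / `base_url` parameters are hypothetical, not taken from the actual code.

```python
from pathlib import Path
from urllib.parse import quote


def write_asset(
    assets_dir: str,
    base_url: str,
    dataset: str,
    config: str,
    split: str,
    row_idx: int,
    column: str,
    filename: str,
    data: bytes,
) -> str:
    """Write the bytes under assets_dir and return the URL to put in the JSON."""
    # Assumed layout, inferred from the example URL:
    # <dataset>/___/<config>/<split>/<row_idx>/<column>/<filename>
    parts = [dataset, "___", config, split, str(row_idx), column, filename]
    path = Path(assets_dir).joinpath(*parts)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    # Percent-encode each component so that spaces and similar characters
    # do not break the URL.
    return base_url.rstrip("/") + "/" + "/".join(quote(part) for part in parts)
```

The directory passed as `assets_dir` would then be exposed by whatever static file server sits at `base_url`. This sketch does not validate the components, so untrusted names would need sanitizing before being used in a path.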
Support for image datasets is now tracked in #63.
For audio, see #70.