
hffs's Introduction

Warning: hffs is no longer maintained. Please use huggingface_hub's FileSystem API instead.

hffs

hffs builds on huggingface_hub and fsspec to provide a convenient Python filesystem interface to 🤗 Hub.

Basic usage

Locate and read a file from a 🤗 Hub repo:

>>> import hffs
>>> fs = hffs.HfFileSystem()
>>> fs.ls("datasets/my-username/my-dataset-repo", detail=False)
['datasets/my-username/my-dataset-repo/.gitattributes', 'datasets/my-username/my-dataset-repo/my-file.txt']
>>> with fs.open("datasets/my-username/my-dataset-repo/my-file.txt", "r") as f:
...     f.read()
'Hello, world'

Write a file to the repo:

>>> with fs.open("datasets/my-username/my-dataset-repo/my-file-new.txt", "w") as f:
...     f.write("Hello, world1")
...     f.write("Hello, world2")
>>> fs.exists("datasets/my-username/my-dataset-repo/my-file-new.txt")
True
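>>> # Get the size of the file in bytes (the two 13-byte writes total 26)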
>>> fs.du("datasets/my-username/my-dataset-repo/my-file-new.txt")
26

Instantiation via fsspec:

>>> import fsspec

>>> # Instantiate a `hffs.HfFileSystem` object
>>> fs = fsspec.filesystem("hf")
>>> fs.ls("my-username/my-model-repo")
['my-username/my-model-repo/.gitattributes', 'my-username/my-model-repo/config.json', 'my-username/my-model-repo/pytorch_model.bin']

>>> # Instantiate a `hffs.HfFileSystem` object and write a file to it
>>> with fsspec.open("hf://datasets/my-username/my-dataset-repo/my-file-new.txt", "w") as f:
...     f.write("Hello, world1")
...     f.write("Hello, world2")

Note: To be recognized as an hffs URL, the URL path passed to fsspec.open must adhere to the following scheme:

hf://[<repo_type_prefix>]<repo_id>/<path/in/repo>

The prefix for datasets is "datasets/", the prefix for spaces is "spaces/", and models don't need a prefix in the URL.
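
For example, the same file name is addressed differently depending on the repo type (the repo ids below are illustrative):

hf://my-username/my-model-repo/my-file.txt             (model: no prefix)
hf://datasets/my-username/my-dataset-repo/my-file.txt  (dataset)
hf://spaces/my-username/my-space-repo/my-file.txt      (space)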

Installation

pip install hffs

Usage examples

>>> import pandas as pd

>>> # Read a remote CSV file into a dataframe
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

>>> # Write a dataframe to a remote CSV file
>>> df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")

>>> import datasets

>>> # Export a (large) dataset to a repo
>>> output_dir = "hf://datasets/my-username/my-dataset-repo"
>>> builder = datasets.load_dataset_builder("path/to/local/loading_script/loading_script.py")
>>> builder.download_and_prepare(output_dir, file_format="parquet")

>>> # Stream the dataset from the repo
>>> dset = datasets.load_dataset("my-username/my-dataset-repo", split="train", streaming=True)
>>> # Process the examples
>>> for ex in dset:
...    ...

>>> import numpy as np
>>> import zarr

>>> embeddings = np.random.randn(50000, 1000).astype("float32")

>>> # Write an array to a repo acting as a remote zarr store
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
...    foo = root.create_group("embeddings")
...    foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
...    foobar[:] = embeddings

>>> # Read from a remote zarr store
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
...    first_row = root["embeddings/experiment_0"][0]

>>> import hffs
>>> import duckdb

>>> fs = hffs.HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> # Query a remote file and get the result as a dataframe
>>> df = duckdb.query("SELECT * FROM 'hf://datasets/my-username/my-dataset-repo/data.parquet' LIMIT 10").df()

Authentication

To write to your repositories or access your private repositories, you can log in by running

huggingface-cli login

Or pass a token (from your HF settings) to

>>> import hffs
>>> fs = hffs.HfFileSystem(token=token)

or as storage_options:

>>> storage_options = {"token": token}
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv", storage_options=storage_options)

hffs's People

Contributors

deep-diver, guspan-tanadi, lhoestq, mariosasko, wauplin


hffs's Issues

Optimize repo siblings cache control

Currently, we fetch all the siblings of a repo after each modification, which can be problematic for repos with many files due to the size of the returned payload. We can optimize this by fetching only the siblings of the repo directories affected by a particular modification. For this to be added, server-side querying of the repo siblings would need to be implemented (+ accompanying hfh API). (cc @julien-c @SBrandeis @Wauplin )
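
A minimal sketch of what path-scoped invalidation could look like, assuming fsspec's standard dircache; the class name is illustrative, not hffs's actual implementation:

>>> from fsspec.spec import AbstractFileSystem

>>> class PathScopedCacheFS(AbstractFileSystem):
...     def invalidate_cache(self, path=None):
...         if path is None:
...             self.dircache.clear()  # no path given: flush the whole cache
...         else:
...             # Drop only the listings affected by the modification
...             path = self._strip_protocol(path)
...             self.dircache.pop(path, None)                # the directory itself
...             self.dircache.pop(self._parent(path), None)  # its parent's listing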

Default value of `repo_type` in `HfFilesystem.__init__` (and in `hfh`)

@lhoestq thinks model users should use hfh rather than hffs, so he believes it would make more sense to change the default value of repo_type to "dataset", which would avoid having to explicitly pass repo_type="dataset" to pandas.read_<format>, for instance. However, this would mean hfh and hffs are no longer aligned on repo_type's default value (None/"model" vs. "dataset").

To preserve consistency (and to make this part more convenient), perhaps we could infer the repo type from the repo id when repo_type=None, instead of defaulting to "model"? Of course, this would only be possible when the repo id is unambiguous.

Also, this would help us avoid situations like these in the future:
https://discuss.huggingface.co/t/huggingface-hub-upload-file-returns-a-404-error/16014
https://discuss.huggingface.co/t/upload-file-api-for-saving-to-persistent-datasets-on-hf-spaces/23689

cc @Wauplin @julien-c
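
A possible shape for that inference, sketched with huggingface_hub's public API (the helper name is hypothetical, and the probing order is just one choice):

>>> from huggingface_hub import HfApi
>>> from huggingface_hub.utils import RepositoryNotFoundError

>>> def infer_repo_type(repo_id, api=None):
...     # Probe the repo id against each repo type until one resolves.
...     # Note: an id that exists under several types resolves to the first
...     # hit, which is why ambiguous ids would have to be ruled out.
...     api = api or HfApi()
...     for repo_type in ("model", "dataset", "space"):
...         try:
...             api.repo_info(repo_id, repo_type=repo_type)
...             return repo_type
...         except RepositoryNotFoundError:
...             continue
...     raise ValueError(f"No repo found for id {repo_id!r}")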

pip install not working

Maybe I'm a bit too early, but should the pip install hffs command already work? Because for me it doesn't:

ERROR: Could not find a version that satisfies the requirement hffs (from versions: none)
ERROR: No matching distribution found for hffs
