
hffs's Introduction

Warning: hffs is no longer maintained. Please use huggingface_hub's FileSystem API instead.

hffs

hffs builds on huggingface_hub and fsspec to provide a convenient Python filesystem interface to 🤗 Hub.

Basic usage

Locate and read a file from a 🤗 Hub repo:

>>> import hffs
>>> fs = hffs.HfFileSystem()
>>> fs.ls("datasets/my-username/my-dataset-repo", detail=False)
['datasets/my-username/my-dataset-repo/.gitattributes', 'datasets/my-username/my-dataset-repo/my-file.txt']
>>> with fs.open("datasets/my-username/my-dataset-repo/my-file.txt", "r") as f:
...     f.read()
'Hello, world'

Write a file to the repo:

>>> with fs.open("datasets/my-username/my-dataset-repo/my-file-new.txt", "w") as f:
...     f.write("Hello, world1")
...     f.write("Hello, world2")
>>> fs.exists("datasets/my-username/my-dataset-repo/my-file-new.txt")
True
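>>> # Get the size of the file in bytes (the two 13-byte writes total 26)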
>>> fs.du("datasets/my-username/my-dataset-repo/my-file-new.txt")
26

Instantiation via fsspec:

>>> import fsspec

>>> # Instantiate a `hffs.HfFileSystem` object
>>> fs = fsspec.filesystem("hf")
>>> fs.ls("my-username/my-model-repo")
['my-username/my-model-repo/.gitattributes', 'my-username/my-model-repo/config.json', 'my-username/my-model-repo/pytorch_model.bin']

>>> # Instantiate a `hffs.HfFileSystem` object and write a file to it
>>> with fsspec.open("hf://datasets/my-username/my-dataset-repo/my-file-new.txt", "w") as f:
...     f.write("Hello, world1")
...     f.write("Hello, world2")

Note: To be recognized as an hffs URL, the URL path passed to fsspec.open must adhere to the following scheme:

hf://[<repo_type_prefix>]<repo_id>/<path/in/repo>

The prefix for datasets is "datasets/", the prefix for spaces is "spaces/", and models don't need a prefix in the URL.
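
For example, the same file name is addressed differently depending on the repo type (the repo ids below are illustrative):

hf://my-username/my-model-repo/my-file.txt             (model: no prefix)
hf://datasets/my-username/my-dataset-repo/my-file.txt  (dataset)
hf://spaces/my-username/my-space-repo/my-file.txt      (space)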

Installation

pip install hffs

Usage examples

>>> import pandas as pd

>>> # Read a remote CSV file into a dataframe
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

>>> # Write a dataframe to a remote CSV file
>>> df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")

>>> import datasets

>>> # Export a (large) dataset to a repo
>>> output_dir = "hf://datasets/my-username/my-dataset-repo"
>>> builder = datasets.load_dataset_builder("path/to/local/loading_script/loading_script.py")
>>> builder.download_and_prepare(output_dir, file_format="parquet")

>>> # Stream the dataset from the repo
>>> dset = datasets.load_dataset("my-username/my-dataset-repo", split="train", streaming=True)
>>> # Process the examples
>>> for ex in dset:
...    ...

>>> import numpy as np
>>> import zarr

>>> embeddings = np.random.randn(50000, 1000).astype("float32")

>>> # Write an array to a repo acting as a remote zarr store
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
...    foo = root.create_group("embeddings")
...    foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
...    foobar[:] = embeddings

>>> # Read from a remote zarr store
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
...    first_row = root["embeddings/experiment_0"][0]

>>> import hffs
>>> import duckdb

>>> fs = hffs.HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> # Query a remote file and get the result as a dataframe
>>> df = duckdb.query("SELECT * FROM 'hf://datasets/my-username/my-dataset-repo/data.parquet' LIMIT 10").df()

Authentication

To write to your repositories or access your private repositories, you can log in by running

huggingface-cli login

Or pass a token (from your HF settings) to

>>> import hffs
>>> fs = hffs.HfFileSystem(token=token)

or as storage_options:

>>> storage_options = {"token": token}
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv", storage_options=storage_options)

hffs's People

Contributors

deep-diver, guspan-tanadi, lhoestq, mariosasko, wauplin


hffs's Issues

Optimize repo siblings cache control

Currently, we fetch all the siblings of a repo after each modification, which can be problematic for repos with many files due to the size of the returned payload. We can optimize this by fetching only the siblings of the repo directories affected by a particular modification. For this to be added, server-side querying of the repo siblings would need to be implemented (+ accompanying hfh API). (cc @julien-c @SBrandeis @Wauplin )
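
A minimal sketch of what path-scoped invalidation could look like, assuming fsspec's standard dircache; the class name is illustrative, not hffs's actual implementation:

>>> from fsspec.spec import AbstractFileSystem

>>> class PathScopedCacheFS(AbstractFileSystem):
...     def invalidate_cache(self, path=None):
...         if path is None:
...             self.dircache.clear()  # no path given: flush the whole cache
...         else:
...             # Drop only the listings affected by the modification
...             path = self._strip_protocol(path)
...             self.dircache.pop(path, None)                # the directory itself
...             self.dircache.pop(self._parent(path), None)  # its parent's listing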

Default value of `repo_type` in `HfFilesystem.__init__` (and in `hfh`)

@lhoestq thinks model users should use hfh rather than hffs, so he believes it would make more sense to change the default value of repo_type to "dataset", which would avoid having to explicitly pass repo_type="dataset" to pandas.read_<format>, for instance. However, this would mean hfh and hffs are no longer aligned on repo_type's default value (None/"model" vs. "dataset").

To preserve consistency (and to make this part more convenient), perhaps we could infer the repo type from the repo id when repo_type=None, instead of defaulting to "model"? Of course, this would only be possible when the repo id is unambiguous.

Also, this would help us avoid situations like these in the future:
https://discuss.huggingface.co/t/huggingface-hub-upload-file-returns-a-404-error/16014
https://discuss.huggingface.co/t/upload-file-api-for-saving-to-persistent-datasets-on-hf-spaces/23689

cc @Wauplin @julien-c
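
A possible shape for that inference, sketched with huggingface_hub's public API (the helper name is hypothetical, and the probing order is just one choice):

>>> from huggingface_hub import HfApi
>>> from huggingface_hub.utils import RepositoryNotFoundError

>>> def infer_repo_type(repo_id, api=None):
...     # Probe the repo id against each repo type until one resolves.
...     # Note: an id that exists under several types resolves to the first
...     # hit, which is why ambiguous ids would have to be ruled out.
...     api = api or HfApi()
...     for repo_type in ("model", "dataset", "space"):
...         try:
...             api.repo_info(repo_id, repo_type=repo_type)
...             return repo_type
...         except RepositoryNotFoundError:
...             continue
...     raise ValueError(f"No repo found for id {repo_id!r}")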

pip install not working

Maybe I'm a bit too early, but should the pip install hffs command already work? Because for me it doesn't:

ERROR: Could not find a version that satisfies the requirement hffs (from versions: none)
ERROR: No matching distribution found for hffs
