
huggingface_hub's Introduction



The official Python client for the Hugging Face Hub.



Documentation: https://hf.co/docs/huggingface_hub

Source Code: https://github.com/huggingface/huggingface_hub


Welcome to the huggingface_hub library

The huggingface_hub library allows you to interact with the Hugging Face Hub, a platform democratizing open-source Machine Learning for creators and collaborators. Discover pre-trained models and datasets for your projects or play with the thousands of machine learning apps hosted on the Hub. You can also create and share your own models, datasets and demos with the community. The huggingface_hub library provides a simple way to do all these things with Python.

Key features

  • Download files from the Hub.
  • Upload files and create repositories on the Hub.
  • Run inference on models hosted on the Hub (with the optional [inference] dependencies).
  • Integrate your own ML library with the Hub.

Installation

Install the huggingface_hub package with pip:

pip install huggingface_hub

If you prefer, you can also install it with conda.

In order to keep the package minimal by default, huggingface_hub comes with optional dependencies useful for some use cases. For example, if you want to have a complete experience for inference, run:

pip install huggingface_hub[inference]

To learn more about installation and optional dependencies, check out the installation guide.

Quick start

Download files

Download a single file

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="tiiuae/falcon-7b-instruct", filename="config.json")

Or an entire repository

from huggingface_hub import snapshot_download

snapshot_download("stabilityai/stable-diffusion-2-1")

Files are downloaded to a local cache folder. More details in this guide.

Login

The Hugging Face Hub uses tokens to authenticate applications (see docs). To log in from your machine, run the following CLI command:

huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
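
If you prefer to stay in Python, recent versions of the library also expose a login() helper. A minimal sketch, assuming a recent enough huggingface_hub:

import os

from huggingface_hub import login

# Prompts for a token interactively if none is given; passing token=
# (e.g. read from an environment variable) skips the prompt.
login(token=os.environ.get("HUGGINGFACE_TOKEN"))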

Create a repository

from huggingface_hub import create_repo

create_repo(repo_id="super-cool-model")

Upload files

Upload a single file

from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="/home/lysandre/dummy-test/README.md",
    path_in_repo="README.md",
    repo_id="lysandre/test-model",
)

Or an entire folder

from huggingface_hub import upload_folder

upload_folder(
    folder_path="/path/to/local/space",
    repo_id="username/my-cool-space",
    repo_type="space",
)

For more details, check out the upload guide.

Integrating with the Hub

We're partnering with cool open source ML libraries to provide free model hosting and versioning. You can find the existing integrations here.

The advantages are:

  • Free model or dataset hosting for libraries and their users.
  • Built-in file versioning, even with very large files, thanks to a git-based approach.
  • A serverless Inference API for all publicly available models.
  • In-browser widgets to play with the uploaded models.
  • Anyone can upload a new model for your library; they just need to add the corresponding tag for the model to be discoverable.
  • Fast downloads! We use CloudFront (a CDN) to geo-replicate downloads, so they're blazing fast from anywhere on the globe.
  • Usage stats and more features to come.

If you would like to integrate your library, feel free to open an issue to begin the discussion. We wrote a step-by-step guide with ❤️ showing how to do this integration.

Contributions (feature requests, bugs, etc.) are super welcome 💙💚💛💜🧡❤️

Everyone is welcome to contribute, and we value everybody's contribution. Code is not the only way to help the community. Answering questions, helping others, reaching out, and improving the documentation are immensely valuable to the community. We wrote a contribution guide to summarize how to get started contributing to this repository.


huggingface_hub's Issues

ModelHubMixin 400 Client Error

Hi guys,

I hope this finds you well! I am trying to add a dummy model to my models using ModelHubMixin with the following snippet:

import torch
from torch import nn

from huggingface_hub import ModelHubMixin

class MyModel(nn.Module, ModelHubMixin):
    def __init__(self, **kwargs):
        super().__init__()
        self.config = kwargs.pop('config', None)
        self.model = nn.Conv2d(self.config['in_channels'], self.config['channels'], kernel_size=3)
        
    def forward(self, x):
        return self.model(x)
    
config = {
    'in_channels': 3,
    'channels': 32
}
x = torch.randn((1, 3, 48, 48))
model = MyModel(config=config)

model.save_pretrained('Francesco/dummy', push_to_hub=True, config=config)

I also checked that I am correctly logged in to my account (I used huggingface-cli login).

Unfortunately, I got the following error

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-32-9e0cb9790ccd> in <module>
     22 model(x)
     23 
---> 24 model.save_pretrained('Francesco/dummy', push_to_hub=True, config=config)

~/anaconda3/envs/dl/lib/python3.8/site-packages/huggingface_hub/hub_mixin.py in save_pretrained(self, save_directory, config, push_to_hub, **kwargs)
     80 
     81         if push_to_hub:
---> 82             return self.push_to_hub(save_directory, **kwargs)
     83 
     84     def _save_pretrained(self, path):

~/anaconda3/envs/dl/lib/python3.8/site-packages/huggingface_hub/hub_mixin.py in push_to_hub(save_directory, model_id, repo_url, commit_message, organization, private)
    233         token = HfFolder.get_token()
    234         if repo_url is None:
--> 235             repo_url = HfApi().create_repo(
    236                 token,
    237                 model_id,

~/anaconda3/envs/dl/lib/python3.8/site-packages/huggingface_hub/hf_api.py in create_repo(self, token, name, organization, private, repo_type, exist_ok, lfsmultipartthresh)
    219             d = r.json()
    220             return d["url"]
--> 221         r.raise_for_status()
    222         d = r.json()
    223         return d["url"]

~/anaconda3/envs/dl/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
    938 
    939         if http_error_msg:
--> 940             raise HTTPError(http_error_msg, response=self)
    941 
    942     def close(self):

HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/repos/create

Any idea?

Thank you very much

Better error message when cloning without token

When cloning a repository without the use_auth_token parameter, the error is unintuitive:

from huggingface_hub import Repository
repo = Repository("hubert-base-ls960", clone_from="facebook/hubert-base-ls960")
OSError: fatal: repository 'facebook/hubert-base-ls960' does not exist

When the use_auth_token parameter is omitted, the URL isn't correctly reconstructed.

Python versions with `rc...` suffix are not recognized

Error Message

../lib/python3.8/site-packages/huggingface_hub/file_download.py", line 37, in <genexpr>
    if tuple(int(i) for i in _PY_VERSION.split(".")) < (3, 8, 0):
ValueError: invalid literal for int() with base 10: '3rc1'

My Python version: 3.8.3rc1
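
A more tolerant parse would keep only the leading digits of each version component so that pre-release suffixes like rc1 are ignored. A minimal sketch (not the library's actual fix):

import re

_PY_VERSION = "3.8.3rc1"

# Keep only the leading digits of each dot-separated component,
# so "3rc1" parses as 3 instead of raising ValueError.
version = tuple(
    int(re.match(r"\d+", part).group()) for part in _PY_VERSION.split(".")
)
assert version == (3, 8, 3)
if version < (3, 8, 0):
    raise RuntimeError("Python >= 3.8 is required")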

Detect and track large files automatically

When pushing sentence-transformers models to the hub, I had the issue that the tokenizers for some multilingual transformer models create a unigram.json file that is larger than 10 MB. The push to the hub / git resulted in an error due to the 10 MB file size limit.

remote: -------------------------------------------------------------------------
remote: Your push was rejected because it contains files larger than 10M.
remote: Please use https://git-lfs.github.com/ to store larger files.
remote: -------------------------------------------------------------------------
remote: Offending files:
remote:  - unigram.json (ref: refs/heads/main)
To https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'

I am currently testing a fix that detects large files in a repo and adds them to git-lfs before the push:
https://github.com/UKPLab/sentence-transformers/blob/695f12f9a7839e3697957271bbc9108caaf24ca8/sentence_transformers/SentenceTransformer.py#L525

I think this could be relevant for other libraries, and for transformers as well, when someone tries to push tokenizer files that are larger than 10 MB.

Ideally, files larger than 10 MB would be detected automatically and pushed via git-lfs.
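
A rough sketch of such a check, walking the working tree and handing any file over 10 MB to git-lfs before committing (the helper below is illustrative, not huggingface_hub API):

import os
import subprocess

LFS_THRESHOLD = 10 * 1024 * 1024  # the Hub rejects plain-git files above 10M

def track_large_files(repo_dir: str) -> None:
    """Register every file above the threshold with git-lfs."""
    for root, _dirs, files in os.walk(repo_dir):
        if ".git" in root.split(os.sep):
            continue  # skip git internals
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) > LFS_THRESHOLD:
                rel = os.path.relpath(path, repo_dir)
                # Writes a pattern for this file into .gitattributes
                subprocess.run(["git", "lfs", "track", rel], cwd=repo_dir, check=True)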

Give the user feedback when the API returns something unexpected with a 200

When the API returns something unexpected with a 200 status code, no feedback is currently given to the user (i.e. no error message).
Necessary updates: when the API returns something unexpected with a 200,

  • set an error message along the lines of "Invalid/unexpected ..."
  • set outputJson to the API output

[Feature request] Add a hook to allow user to manipulate model init kwargs

The current from_pretrained implementation in ModelHubMixin reads config.json (if available) and adds it as a kwarg named 'config' to the model init kwargs. This is more or less in line with transformers style.

I'd rather write explicit config kwargs in my model's __init__. To do that, we have to unpack the config dict instead of passing it as a single kwarg. The workaround right now is basically to copy-paste 99% of ModelHubMixin.from_pretrained to update one line of code.


What do you think about adding a simple hook that wraps this line and lets users manipulate model_kwargs by overriding it?

model = cls(**model_kwargs)
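
One possible shape for such a hook: the mixin calls an overridable classmethod right before instantiation. A sketch of the idea (the class and method names below are hypothetical, not the merged API):

class ModelHubMixinSketch:
    @classmethod
    def _instantiate(cls, model_kwargs):
        # config.json has already been merged into model_kwargs here.
        model_kwargs = cls.prepare_model_kwargs(model_kwargs)
        return cls(**model_kwargs)

    @classmethod
    def prepare_model_kwargs(cls, model_kwargs):
        """Hook: subclasses may reshape kwargs before __init__ is called."""
        return model_kwargs

class MyModel(ModelHubMixinSketch):
    def __init__(self, in_channels=3, channels=32):
        self.in_channels, self.channels = in_channels, channels

    @classmethod
    def prepare_model_kwargs(cls, model_kwargs):
        # Unpack the config dict into explicit init kwargs.
        config = model_kwargs.pop("config", {})
        return {**model_kwargs, **config}

model = MyModel._instantiate({"config": {"in_channels": 1}})
assert model.in_channels == 1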

Support for private models?

Looks like hf_hub_url and/or cached_download do not (yet ?) support authentication for private models.

What is the recommended way of downloading private models?
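
For reference, later releases added token handling directly on the download helpers; a sketch assuming a recent huggingface_hub (the repo name is hypothetical):

from huggingface_hub import hf_hub_download

# token=True reuses the token stored by `huggingface-cli login`;
# a raw token string can be passed instead.
path = hf_hub_download(
    repo_id="username/my-private-model",
    filename="config.json",
    token=True,
)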

Feature request: snapshot_download - add library_name and library_version

With cached_download() I can provide the library name & version, which is sent via the HTTP user-agent.

For snapshot_download() I did not find such an option. For statistics purposes, it would be nice if you could also provide library_name and library_version. (And hopefully later get statistics on which models are most frequently accessed from your library.)

PS: Are the GitHub issues the right place to propose feature wishes, or should I put them somewhere else?
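
The requested call would simply mirror what cached_download already accepts, e.g. (a proposed signature, not available at the time of writing):

from huggingface_hub import snapshot_download

# library_name / library_version would be forwarded in the HTTP
# user-agent for download statistics, as cached_download does today.
snapshot_download(
    "sentence-transformers/LaBSE",
    library_name="sentence-transformers",
    library_version="2.0.0",
)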

[Feature request] linking docs

While transformers is growing pretty fast and new models keep arriving, it would be cool if the docs were linked. For example, I saw LUKE on the Hub and had absolutely no idea how to use it until I found it in the transformers docs.

Repository `use_auth_token` should default to `True` when in a HF-hub specific workflow

The Repository class is thought out as a wrapper around git/git-lfs and isn't specific to the Hugging Face Hub. It's a useful helper for managing offline repositories.

However, the recent PRs #132, #143, #150 and #151 allow cloning repositories from the hub using only <namespace>/<model_id> or even <model_id> as identifiers, which is hf-hub specific.

@thomwolf rightfully asks why the use_auth_token parameter isn't True by default, and I tend to agree that for hf-specific workflows this should be the default as it would prevent a lot of pain points when cloning a repository without authentication.

Uniformize logging with the other HF libs

The huggingface_hub library should also have a logging module like the other libraries (and maybe we should actually centralize all the logging commands in this library once both datasets and transformers depend on it, so that people can simply control all the logging of our tools).

The current reason for raising the issue is that uploading a very large dataset gives no indication of status (since logging is at info level by default).
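
For comparison, the verbosity interface the other HF libraries expose looks like this (transformers shown; the request is for the same surface in huggingface_hub):

from transformers.utils import logging

logging.set_verbosity_info()  # surface info-level messages, e.g. upload status
logger = logging.get_logger(__name__)
logger.info("upload started")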

[Feature request] Allow specifying `save_dir` in `snapshot_download`

This will make it easier to load without relying on cache_dir. IMO it also matches the function name snapshot_download better. (The current functionality of snapshot_download feels more like cached_download.)

snapshot_download('bert-base-uncased', save_dir='local_dir')
AutoModel.from_pretrained('local_dir')

Currently it's like this:

storage_folder = os.path.join(
    cache_dir, repo_id.replace("/", REPO_ID_SEPARATOR) + "." + model_info.sha
)

Add model card via the APIs

Hi guys,

First of all, great work! The Hub is amazing. I would like to ask if there is an easy way to automatically add a model card to an uploaded model.

Thank you.

Cheers,

Francesco
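
For reference: since a model card is just the repository's README.md, one workaround with the APIs shown earlier in this document is to write the card locally and push it with upload_file. A sketch; the repo id and card content below are placeholders:

from huggingface_hub import upload_file

card = "---\nlicense: mit\n---\n\n# My model\n\nShort description."
with open("README.md", "w") as f:
    f.write(card)

upload_file(
    path_or_fileobj="README.md",
    path_in_repo="README.md",
    repo_id="Francesco/dummy",
)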

Doing `push_to_hub` with a large model fails by default

I created a 1.5B-parameter model and want to push the newly initialized model to the hub.

Doing a simple model.push_to_hub("thomwolf/my-model") fails with:

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
~/miniconda2/envs/datasets/lib/python3.7/site-packages/huggingface_hub/repository.py in git_push(self)
    445                 encoding="utf-8",
--> 446                 cwd=self.local_dir,
    447             )

~/miniconda2/envs/datasets/lib/python3.7/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    511             raise CalledProcessError(retcode, process.args,
--> 512                                      output=stdout, stderr=stderr)
    513     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['git', 'push']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-4-3b4d432f9948> in <module>
----> 1 model.push_to_hub("thomwolf/codeparrot")

~/miniconda2/envs/datasets/lib/python3.7/site-packages/transformers/file_utils.py in push_to_hub(self, repo_path_or_name, repo_url, use_temp_dir, commit_message, organization, private, use_auth_token)
   2029         self.save_pretrained(repo_path_or_name)
   2030         # Commit and push!
-> 2031         url = self._push_to_hub(repo, commit_message=commit_message)
   2032 
   2033         # Clean up! Clean up! Everybody everywhere!

~/miniconda2/envs/datasets/lib/python3.7/site-packages/transformers/file_utils.py in _push_to_hub(cls, repo, commit_message)
   2109                 commit_message = "add model"
   2110 
-> 2111         return repo.push_to_hub(commit_message=commit_message)

~/miniconda2/envs/datasets/lib/python3.7/site-packages/huggingface_hub/repository.py in push_to_hub(self, commit_message)
    460         self.git_add()
    461         self.git_commit(commit_message)
--> 462         return self.git_push()
    463 
    464     @contextmanager

~/miniconda2/envs/datasets/lib/python3.7/site-packages/huggingface_hub/repository.py in git_push(self)
    448             logger.info(result.stdout)
    449         except subprocess.CalledProcessError as exc:
--> 450             raise EnvironmentError(exc.stderr)
    451 
    452         return self.git_head_commit_url()

OSError: batch response:
You need to configure your repository to enable upload of files > 5GB.
Run "huggingface-cli lfs-enable-largefiles ./path/to/your/repo" and try again.

error: failed to push some refs to 'https://huggingface.co/thomwolf/codeparrot'

Since push_to_hub() creates a temporary folder (afaiu), I can't run huggingface-cli lfs-enable-largefiles ./path/to/your/repo in it by default.

The workaround is probably to manually create the folder to save to and run the lfs command in it before pushing, but maybe this shouldn't fail by default?

`HfApi.model_info(revision=)` does not resolve hash prefix

import huggingface_hub

huggingface_hub.__version__  # '0.0.13'

from huggingface_hub import snapshot_download, HfApi

api = HfApi()

# success
api.model_info("flax-community/ft5-cnn-dm")

# success
api.model_info(
    "flax-community/ft5-cnn-dm", revision="859350e337148108b32b6f9eef45d0d4c6b668a9"
)

# fail
api.model_info("flax-community/ft5-cnn-dm", revision="859350e")

Error message:

HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/flax-community/ft5-cnn-dm/revision/859350e

This is what snapshot_download calls internally, so snapshot_download fails if only a hash prefix is provided.
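
Until the endpoint resolves prefixes server-side, one client-side workaround is to expand the short hash before calling model_info. A sketch assuming a recent huggingface_hub where HfApi.list_repo_commits is available:

from huggingface_hub import HfApi

api = HfApi()

def resolve_revision(repo_id: str, prefix: str) -> str:
    # Expand a short hash by scanning the repo's commit history.
    for commit in api.list_repo_commits(repo_id):
        if commit.commit_id.startswith(prefix):
            return commit.commit_id
    raise ValueError(f"no commit matching {prefix!r}")

revision = resolve_revision("flax-community/ft5-cnn-dm", "859350e")
api.model_info("flax-community/ft5-cnn-dm", revision=revision)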

Feature Request: Have a programmatic way of adding metadata to a repo

If a user creates dozens of repos programmatically and then realizes they forgot to add some tag or other metadata info, they will have to either go through each repo manually, re-generate the README.md, or change the README.md content by hand, which can be error-prone.

cached_download, but with custom file name

Hi there!

Thanks for building this very helpful interface. We are integrating Hugging Face into SpeechBrain to release all our pretrained speech models. However, we would like to allow users to specify the name of the file they download. In practice, cached_download stores the downloaded file under a very long, hashed filename. What is the best way to allow for a custom name (in the end, a custom save dir + a custom file name)?

Thanks !

Netlify authorization issue on `JS Widgets` action

When I push new widget code, the JS Widgets action run fails with:

Run npx netlify deploy --auth "$NETLIFY_AUTH_TOKEN" --dir ./build/ --site ${NETLIFY_SITE_ID} ${NETLIFY_EXTRA_FLAG}
Logging into your Netlify account...
Opening https://app.netlify.com/authorize?response_type=ticket&ticket=3991f04ac8fafbce2ad54c29bf5fd8b6
- Waiting for authorization...
 ›   Error: Timed out waiting for authorization. If you do not have a Netlify 
 ›   account, please create one at https://app.netlify.com/signup, then run 
 ›   netlify login again.
Error: Process completed with exit code 2.

See the failed workflow run here.
What should I do to solve this issue? Thanks!

`Repository` can't be used with datasets repositories anymore

How to reproduce

from huggingface_hub import Repository
repo = Repository(local_dir=".", clone_from="https://huggingface.co/datasets/<username>/<repo_name>", use_auth_token="token")

Observed behaviour

The command fails with a 404 error

The culprit is probably this line:

endpoint = "/".join(repo_url.split("/")[:-2])

It does not strip the datasets/ prefix from HF dataset repo URLs.

Stacktrace

----> 1 Repository(local_dir=".", clone_from="https://huggingface.co/datasets/<username>/<repo_name>", use_auth_token=True)

~\miniconda3\envs\venv\lib\site-packages\huggingface_hub\repository.py in __init__(self, local_dir, clone_from, use_auth_token, git_user, git_email)
    105                 clone_from = f"{ENDPOINT}/{clone_from}"
    106
--> 107             self.clone_from(repo_url=clone_from)
    108         else:
    109             if is_git_repo(self.local_dir):

~\miniconda3\envs\venv\lib\site-packages\huggingface_hub\repository.py in clone_from(self, repo_url, use_auth_token)
    169             organization, repo_id = repo_url.split("/")[-2:]
    170
--> 171             HfApi(endpoint=endpoint).create_repo(
    172                 token,
    173                 repo_id,

~\miniconda3\envs\venv\lib\site-packages\huggingface_hub\hf_api.py in create_repo(self, token, name, organization, private, repo_type, exist_ok, lfsmultipartthresh)
    322                     pass
    323
--> 324                 raise err
    325
    326         d = r.json()

~\miniconda3\envs\venv\lib\site-packages\huggingface_hub\hf_api.py in create_repo(self, token, name, organization, private, repo_type, exist_ok, lfsmultipartthresh)
    311
    312         try:
--> 313             r.raise_for_status()
    314         except HTTPError as err:
    315             if not (exist_ok and err.response.status_code == 409):

~\miniconda3\envs\venv\lib\site-packages\requests\models.py in raise_for_status(self)
    941
    942         if http_error_msg:
--> 943             raise HTTPError(http_error_msg, response=self)
    944
    945     def close(self):

HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/datasets/api/repos/create

Add a simple class to ease saving/loading from the HF hub

I would like to have a class that a model class can inherit from to add support for the Hugging Face Hub. It would look something like this:

from torch import nn

from huggingface_hub import SavingUtils  # proposed mixin

class MyModel(nn.Module, SavingUtils):

    def __init__(self):
        super().__init__()
        self.layer1 = ....

    def forward(self, ...):
        return ...

model = MyModel()
model.save_pretrained("model_id")

# loading model
model = MyModel.from_pretrained("model_id")

I can make a PR to add this simple feature if you approve.

[Feature proposal] Ignore files for snapshot_download

I use the snapshot_download method to download all model files for sentence-transformers.

@patrickvonplaten converted the transformers models to flax and added them to the respective repository, e.g. for the LaBSE model:
https://huggingface.co/sentence-transformers/LaBSE/tree/main

Right now, when you load the LaBSE model with sentence-transformers, flax_model.msgpack is also downloaded but never used (only the PyTorch files are used). This adds another 1.8 GB to the download and to the storage requirements. This gets potentially worse as models are converted to further formats, e.g. to TensorFlow.

Would it make sense to add a blacklist / whitelist parameter for the snapshot_download method, so that you can control which files get checked out? Potentially also with wildcards, e.g. only download *.json and *.bin files?

Or do you know a better way to determine which files are needed and must be downloaded by a library? For sentence-transformers, the number of files you must download varies greatly from model to model, i.e. using cached_download on a fixed set of files would not be possible.
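
A minimal sketch of what such filtering could look like client-side, using fnmatch-style wildcards (the allow/ignore parameter names are chosen here for illustration):

from fnmatch import fnmatch

def keep_file(filename, allow_patterns=None, ignore_patterns=None):
    """Decide whether a snapshot download should fetch this file."""
    if allow_patterns is not None and not any(
        fnmatch(filename, p) for p in allow_patterns
    ):
        return False
    if ignore_patterns is not None and any(
        fnmatch(filename, p) for p in ignore_patterns
    ):
        return False
    return True

# e.g. skip the flax weights but keep everything else:
assert not keep_file("flax_model.msgpack", ignore_patterns=["*.msgpack"])
assert keep_file("pytorch_model.bin", ignore_patterns=["*.msgpack"])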

Added to conda-forge

Hi, this is more of an FYI than an issue, but I just wanted to let the maintainers here know that I added huggingface_hub to conda-forge. Repo: https://github.com/conda-forge/huggingface_hub-feedstock
I think you're already pushing to your own Anaconda channel, but let me know if you'd like to be maintainers there (or take over completely) as well. Otherwise, feel free to just close the issue.
Thanks!

[Feature request] Scikit learn integration

As an initial step, a simple integration with the Inference API would be similar to what is done in this repo.

import joblib
from huggingface_hub import cached_download, hf_hub_url

# REPO_ID is defined elsewhere in the referenced repo
model = joblib.load(cached_download(
    hf_hub_url(REPO_ID, "sklearn_model.joblib")
))

We could do something similar to the Table QA widget and add a structured-data-classification task.

Unless there's a bigger plan at the moment, this could be a simple enough thing to add from our side to showcase simple classification/regression use cases. We could upload some of the example models from the documentation to a scikit-learn-examples org and let users test them directly in the browser.

WDYT @julien-c?

Feature request: enable search and download of dataset metadata

As an end-user, I would like to be able to query and download dataset metadata from the Hub in a similar fashion to HfApi.list_models.

In particular, I would like to extract the information stored in the dataset_infos.json file that is associated with each canonical (and community?) dataset.

Thanks to a tip from @lhoestq, the current way we can do this is by using datasets.load.prepare_module to download and cache the files:

from datasets.load import import_main_class, prepare_module

# dataset_name and dataset_config assumed defined
module, module_hash = prepare_module(dataset_name)
builder_cls = import_main_class(module)
builder = builder_cls(hash=module_hash, name=dataset_config)
# get infos
builder.info

However, this is much, much slower than HfApi.list_models, presumably because the dataset_infos.json information isn't exposed as an endpoint.

Would it make sense to implement something analogous to ModelFile, ModelInfo and HfApi.list_models, but for datasets?

Try wrapping git-lfs to improve its ergonomics for large files

I personally find the built-in progress reporting from git-lfs pretty bad.

If others feel the same way, it might make sense to experiment with wrapping git-lfs with our own helper which we can customize to our use cases.

git-lfs exposes a GIT_LFS_PROGRESS setting (described in https://github.com/git-lfs/git-lfs/blob/dce20b0d18213d720ff2897267e68960d296eb5e/docs/man/git-lfs-config.5.ronn) which makes it write progress into a file, and which we could maybe use to improve the progress reporting.

Thoughts?

Only use git credentials, no more token

The current authentication system isn't ideal for git-based workflows. It isn't clear to users why they should first authenticate with huggingface-cli and then re-authenticate with git push.

It also isn't simple to git push from a Colab notebook, a shell-less environment that can't prompt for a username and password. All of these issues could be handled more simply by using only git credentials, not an authentication token, for all git-related operations.

Currently looking at:
git credential fill
git credential-store store
git credential-store get

Feature request: Change repo from private to public (or vice versa) programmatically

I think this use case might be common:

  1. User A decided to use huggingface_hub to share their company's models in their org.
  2. User A decided to begin by creating private repos to see if everything works.
  3. User A now wants to make their repos public. Unfortunately, they have over 50 repos, and going through each of them manually might be painful.
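
For reference, a sketch of what this could look like with a recent huggingface_hub, where HfApi.update_repo_visibility exists (the org name is hypothetical, and the caller must be logged in with write access):

from huggingface_hub import HfApi

api = HfApi()

# Flip every repo of the org from private to public.
for model in api.list_models(author="my-company"):
    api.update_repo_visibility(model.id, private=False)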

[Issue]: Too long file-paths for Windows

Hi,
I was running the following code:

from huggingface_hub import snapshot_download, hf_hub_url, cached_download
model_name = 'osanseviero/full-sentence-distillroberta2'
url = hf_hub_url(model_name, 'modules.json')
print(url)
path = cached_download(url, cache_dir='local_cache')
print("file downloaded")

The call to cached_download() freezes on my Windows machine.

The issue is in this line:

with FileLock(lock_path):

The lock_path was:

D:\some\long\subfolder\dir\i\am\using\....\local_cache\1547cad195ef32a189cf8d0de51c99ab3c5b767f1c64e01a5711d3a6af9dfdb1.ab4f0ee45ce35db416c1d6d811634418f90db5d7a46bb8ff552d13c6468ec38e.lock

In total the path had 275 characters. However, Windows only allows file paths of up to 255 characters. The generated filename (1547[....].lock) has 135 characters, leaving at most 120 characters for the folder path.

If you want to download to any path with more than 120 characters, the process will freeze as the FileLock cannot be acquired.

How to fix it:

  • Not sure what a good fix would be. It would be nice if FileLock raised an exception when the path is too long, but sadly it does not. Is there a way an exception could be thrown here? (See the sketch after this list.)
  • I was able to work around the issue by passing force_filename to use a shorter file name:
path = cached_download(url, cache_dir='local_cache', force_filename='modules.json')
  • Does it make sense to use a file name shorter than 135 characters?
  • Is there some way to check whether the lock can be created before we acquire it for the download? That way we would at least notice that the lock cannot be created.
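
A sketch of the guard suggested in the first bullet, failing loudly before FileLock can hang (not the shipped fix):

import os
import sys

MAX_WINDOWS_PATH = 255

def check_lock_path(lock_path: str) -> None:
    # FileLock freezes silently on Windows when the path is too long,
    # so raise an explicit error up front instead.
    full_path = os.path.abspath(lock_path)
    if sys.platform == "win32" and len(full_path) > MAX_WINDOWS_PATH:
        raise OSError(
            f"Lock path has {len(full_path)} characters; Windows allows "
            f"at most {MAX_WINDOWS_PATH}: {full_path}"
        )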

Discussion: download and upload models have a somewhat inconsistent API

Hi all. Just a small-ish thing to discuss.

Methods that get information about/from repos take the full model id (e.g. hf_api.model_info("osanseviero/my-model") or snapshot_download("osanseviero/my-model")).

On the other hand, create_repo takes the model name and the organization (if any) as two separate arguments. I wonder if this could lead to confusion in the future; if so, we should check whether the name param contains a / and determine whether there's an org in there, as in the sketch below.
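
A sketch of that check (illustrative only):

def split_repo_id(name: str):
    """Split 'org/model' into (organization, model_name); plain names have no org."""
    if "/" in name:
        organization, model_name = name.split("/", 1)
        return organization, model_name
    return None, name

assert split_repo_id("osanseviero/my-model") == ("osanseviero", "my-model")
assert split_repo_id("my-model") == (None, "my-model")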

Add filtering and fields to /api/datasets

As @lhoestq noted in #164, there are differences that we may need to address compared to the /api/models endpoint:

  • /api/datasets doesn't return the sha
  • /api/datasets does return branch: main even if the specified revision is on another branch
  • /api/datasets doesn't support filtering afaik

@julien-c and @SBrandeis following this

push was rejected because it contains files larger than 10M

Hello,

I would like to push a custom model on model hub based on the original one: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2

When running git push, I got the following error message:

Username for 'https://huggingface.co': Matthieu
Password for 'https://[email protected]': 
Username for 'https://huggingface.co': Matthieu                                                                                                                                                                                                                                    
Password for 'https://[email protected]': 
Uploading LFS objects: 100% (1/1), 471 MB | 0 B/s, done.                                                                                                                                                                                                                           
Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 48 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 3.57 MiB | 2.06 MiB/s, done.
Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
remote: -------------------------------------------------------------------------
remote: Your push was rejected because it contains files larger than 10M.
remote: Please use https://git-lfs.github.com/ to store larger files.
remote: -------------------------------------------------------------------------
remote: Offending files:
remote:  - unigram.json (ref: refs/heads/main)
To https://huggingface.co/Matthieu/paraphrase-multilingual-MiniLM-L12-v2-custom
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/Matthieu/paraphrase-multilingual-MiniLM-L12-v2-custom'

It seems to be caused by the unigram.json file, which is larger than 10 MB, but I don't know how to deal with this. I have already uploaded previous sentence-transformers models to the model hub and never had this unigram.json file.

Thanks!

Feature request: Check if repo / file exists

Hi,
not sure if these functions already exist, but I was not able to find them. It would be nice to have the following two methods:

  • a function that checks whether a repository exists: I provide a name, it checks whether huggingface.co/[model_name] exists and returns True / False.
  • a function that checks whether a specific file exists in a repository. For sentence-transformers, I check for a sentence-transformers-specific file (modules.json) in the repository and execute different code depending on the result. Currently I do this with cached_download(url_to_modules_json_file) and catch the 404 exception, but a function that just checks whether the file exists (without downloading) and returns True / False would be nicer. (A workaround sketch follows this list.)
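
A workaround sketch using only calls that appear elsewhere in this document: probe with model_info and a HEAD request, treating an HTTP error as "does not exist" (later library versions added dedicated helpers for this):

import requests

from huggingface_hub import HfApi, hf_hub_url

api = HfApi()

def repo_exists(repo_id: str) -> bool:
    try:
        api.model_info(repo_id)
        return True
    except requests.HTTPError:
        return False

def file_exists(repo_id: str, filename: str) -> bool:
    # HEAD avoids downloading the file body.
    url = hf_hub_url(repo_id, filename)
    return requests.head(url, allow_redirects=True).status_code == 200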

Feature Request: Be able to search the hub by substring and author

Hi! Thank you so much for a wonderful API and Hub; I'm finding it super useful with the AdaptNLP project! I do have one request though:

I wrote a wrapper API to make the search results a bit easier for users to read and comprehend.

One issue I've come across though: if I search for a model by a substring of its name, such as "danish-large", then "flair/ner-danish-large" isn't returned; supporting this would be a very nice quality-of-life improvement in the search API. Similarly, if I try to search for models created by a specific user, the API simply returns nothing. It would be excellent if these features were added!

My method in the meantime is just to download the entire list (i.e. api.list_models()) and filter the results, but it's far from ideal and not memory-efficient.

Thank you for all your hard work!

(If there is a different API route or flag I should be using rather than the default, that would be good to know too.)
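
For reference, later versions of list_models grew exactly these knobs; a sketch assuming a recent huggingface_hub:

from huggingface_hub import HfApi

api = HfApi()

# Substring search across model names:
for model in api.list_models(search="danish-large", limit=5):
    print(model.id)

# All models from a specific author or org:
for model in api.list_models(author="flair", limit=5):
    print(model.id)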
