
client's Introduction

DagsHub Client




What is DagsHub?

DagsHub is a platform where machine learning and data science teams can build, manage, and collaborate on their projects. With DagsHub you can:

  1. Version code, data, and models in one place. Use the free DagsHub storage provided, or connect it to your cloud storage
  2. Track experiments using Git, DVC, or MLflow to provide a fully reproducible environment
  3. Visualize pipelines, data, and notebooks in an interactive, diff-able, and dynamic way
  4. Label your data directly on the platform using Label Studio
  5. Share your work with your team members
  6. Stream and upload your data in an intuitive and easy way, while preserving versioning and structure.

DagsHub is built firmly around open, standard formats for your project, so you can work with DagsHub regardless of your chosen programming language or frameworks.

DagsHub Client API & CLI

This client library is meant to help you get started quickly with DagsHub. It consists of experiment tracking and Direct Data Access (DDA), a component that lets you stream and upload your data.

For more details on the different functions of the client, check out the docs sections:

  1. Installation & Setup
  2. Data Streaming
  3. Data Upload
  4. Experiment Tracking
    1. Autologging
  5. Data Engine

Some functionality is supported only in Python.

To read about some of the awesome use cases for Direct Data Access, check out the relevant doc page.

Installation

pip install dagshub

Direct Data Access (DDA) functionality requires authentication, which you can easily do by running the following command in your terminal:

dagshub login
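Alternatively, if you need non-interactive authentication (for example in CI), you can register an app token from Python; a minimal sketch, assuming your token is stored in a DAGSHUB_TOKEN environment variable (that variable name is just a convention used here):

import os
import dagshub.auth

token = os.getenv("DAGSHUB_TOKEN")  # assumed location of your DagsHub app token
if token:
    dagshub.auth.add_app_token(token=token)  # registers the token for subsequent client calls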

Quickstart for Data Streaming

The easiest way to start using DagsHub is via the Python Hooks method. To do this:

  1. Install and authenticate the client (see Installation above),
  2. Copy the following 2 lines of code into the Python code that accesses your data:
    from dagshub.streaming import install_hooks
    install_hooks()
  3. That's it! You now have streaming access to all your project files (see the short sketch below).
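Once the hooks are installed, plain Python file access can read files that only exist on your DagsHub remote; a minimal sketch (the path below is hypothetical, so use a file that actually exists in your repo):

from dagshub.streaming import install_hooks
install_hooks()

# After install_hooks(), ordinary open() calls can read files that aren't on disk yet --
# they are streamed from your DagsHub repo on demand.
with open("data/train.csv") as f:  # hypothetical path
    print(f.readline())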

🀩 Check out this Colab notebook to see an end-to-end example of Data Streaming at work:

Open In Colab

Next Steps

You can dive into the expanded documentation to learn more about data streaming, data upload, and experiment tracking with DagsHub.


Analytics

To improve your experience, we collect analytics on client usage. If you want to disable analytics collection, set the DAGSHUB_DISABLE_ANALYTICS environment variable to any value.
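If you prefer to set this from Python rather than the shell, a minimal sketch (the assumption that it should be set before the client is used is mine):

import os

# Any value disables analytics collection; set it before using the dagshub client
os.environ["DAGSHUB_DISABLE_ANALYTICS"] = "true"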

Made with 🐢 by DagsHub.

client's People

Contributors

arjvik, deanp70, evgenileonti, guysmoilov, idonov8, jacob-zietek, jinensetpal, kbolashev, krishnaduttpanchagnula, martintali, mohithg, nikitha-narendra, nirbarazida, pyup-bot, rabroldan, sdafni, simonlsk, talmalka123, yairl


client's Issues

Truncated command help text for `dagshub repo --help`

Tried running dagshub repo --help and this is the full help text that gets printed:

Usage: dagshub repo [OPTIONS] COMMAND [ARGS]...

  Operations on repo: currently only 'create'

Options:
  --help  Show this message and exit.

Commands:
  create  create a repo and:
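For context on where the truncation likely comes from: the output format matches Click, which shows only the first line of a command's docstring (or an explicit short_help) in a group's command listing. A minimal sketch of that behaviour, with a hypothetical docstring continuation:

import click

@click.group()
def repo():
    """Operations on repo: currently only 'create'"""

@repo.command()
def create():
    """create a repo and:

    (hypothetical continuation) optionally initialize it with a README and .gitignore.
    """
    # Only the first docstring line ("create a repo and:") appears in the group's
    # command listing, which is why the help text looks truncated. Passing
    # short_help=... to @repo.command() or rewording the first line would fix it.

if __name__ == "__main__":
    repo()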

Experiments Card Overflow UI issue

The scroll feature doesn't seem to work. I'm trying to view the setup instructions in the "experiments" section. The card seems to have extra instructions at the bottom, but I can't read them.

(screenshot attached: dagshub-experiment)

Adding `versioning` argument to `repo.upload` when uploading a folder throws an error

Attempting to override the default settings for versioning when uploading a folder returns an error:

TypeError: dagshub.upload.wrapper.DataSet.commit() got multiple values for keyword argument 'versioning'

This is because we pass kwargs, which already contains the versioning argument, alongside our own versioning argument. We need to check for this first.
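A minimal sketch of the pattern that likely causes this (the function names below are illustrative, not the actual client code): if the caller's kwargs already contain versioning and the wrapper also passes its own versioning, Python raises the "multiple values" TypeError; keeping a single source of truth before forwarding avoids it.

def commit(commit_message, versioning="auto", **kwargs):
    print(f"committing with versioning={versioning}")

def upload_folder(commit_message, versioning="auto", **kwargs):
    # Buggy version: if kwargs already contains "versioning", this passes it twice:
    #   return commit(commit_message, versioning=versioning, **kwargs)
    # Safer: merge into a single keyword before forwarding
    kwargs.setdefault("versioning", versioning)
    return commit(commit_message, **kwargs)

upload_folder("add data", versioning="dvc")  # works; no duplicate keyword argument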

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

Patch os.open instead of builtins.open and pathlib

To fully support all cases, it would make more sense to rewrite our patching logic in install_hooks to patch the more low-level os.open function, which maps directly to syscalls and is actually used by both builtins.open and pathlib.
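A minimal sketch of what patching at the os.open level could look like (illustrative only, not the client's actual implementation; ensure_file_is_available is a hypothetical placeholder):

import os

_original_os_open = os.open

def ensure_file_is_available(path):
    # Hypothetical placeholder for the streaming logic that would fetch the file
    # from DagsHub if it isn't present locally.
    pass

def _patched_os_open(path, flags, mode=0o777, *, dir_fd=None):
    ensure_file_is_available(path)
    # Delegate to the real low-level open, which backs both builtins.open and pathlib
    return _original_os_open(path, flags, mode, dir_fd=dir_fd)

os.open = _patched_os_open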

Saving notebooks in non-Colab Jupyter saves the entire execution history

In non-Colab Jupyter, we save the entire execution history, i.e. if you run a cell twice, the saved notebook will include 2 copies of the cell, one for each run.

There should probably be a way to save only the current state, which is what we want.

Support Python 3.11

Currently, Python 3.11 isn't supported by popular ML frameworks such as PyTorch, so it wasn't a high priority to support it ourselves.
@kbolashev might have more information on what's missing to add this support.

Unhelpful Client Errors

All of the traces below were raised during the same YOLOv8 training run, inside the ultralytics DagsHub callback's save_artifacts.

Cause: Expects a list of file objects, but I provided strings (the documentation requests a list of strings).
Error: TypeError: Expected bytes or bytes-like object got: <class 'str'>
(raised from httpx's multipart encoding, reached via dagshub/upload/wrapper.py upload_files())

Cause: I created a repository and triggered DagsHub logging, but did not initialize the repo; if I let DDA initialize it, it sets up a sample commit by default.
Error: dagshub.common.api.repo.BranchNotFoundError: Branch main not found in repo https://dagshub.com/jinensetpal/yolotest
(raised from dagshub/common/api/repo.py get_branch_info(), reached via upload_files() requesting the last commit SHA)
Notes: Maybe we should just give a user warning and initialize the branch?

Cause: This is because it's an existing file, and the 'force' argument is required.
Error: dagshub.common.api.repo.BranchNotFoundError: Branch main not found in repo https://dagshub.com/jinensetpal/yolotest
Notes: The error message shouldn't refer to last_commit; it seems it's only used internally. Also, I don't think force is a good name for the argument, since Git and DVC are both VCSs and it's not abnormal to update an existing file. I read force as some out-of-the-ordinary user flow.

Cause: I am not an owner of this repository, but it gives me a JSONDecodeError.
Error: dagshub.upload.errors.UpdateNotAllowedError: Cannot update existing 'artifacts/P_curve.png' file without specifying last_commit
(raised from dagshub/upload/wrapper.py _log_upload_details())

Upload: path in remote is unclear

dagshub upload noa/Bears-Recognition <local_file_path> <path_in_remote>

I expect the target path to behave like on GitHub: start at the repository root on the server and be relative to that, rather than relative to the location on my local machine.

Add `--versioning` flag for CLI upload command

Sometimes the user wants to override our logic for whether to push a file to Git or DVC. They currently can't do this via the CLI, since dagshub upload doesn't have a versioning argument, and due to #293, uploading via the API doesn't work properly either.

Create Repo with visibility and clone settings

Currently, you can use DagsHub to create a repo in 2 ways:

  1. CLI
  2. Python API

Neither one has all the options, which creates a strange UX when they are needed.

The CLI flow doesn't have the option to create a private repo, but does enable cloning the repo locally after creating it. The Python API has the visibility option, but not the cloning option.

It would be good for either the CLI or the API to have all the options. This can be achieved by adding a clone flag to the Python function and moving the cloning logic from the CLI to the API, and/or by adding a --private flag to the CLI command that uses the API's argument to create a private repo. A rough sketch of the first option follows.
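As an illustration of the proposal (not the current client API), a thin wrapper could combine the two flows; a minimal sketch, with the create_repo import path taken from a traceback elsewhere on this page and the clone URL passed in explicitly:

import subprocess
from dagshub.upload.wrapper import create_repo  # existing Python API; already accepts private=...

def create_repo_with_clone(repo_name, clone_url=None, clone_to=None, **kwargs):
    # Hypothetical wrapper: create the repo via the Python API (kwargs may include
    # private=True), then optionally clone it locally, which today only the CLI does.
    create_repo(repo_name, **kwargs)
    if clone_url and clone_to:
        subprocess.run(["git", "clone", clone_url, clone_to], check=True)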

Save notebook doesn't work properly

Save notebook has an extension issue:
In Colab, the logic for adding the .ipynb extension is backwards. We add it when it already exists, and don't add it when it doesn't.

upload failing on dataset with large files

My dataset (~640 GB) is failing on dagshub upload.
This is apparently due to some extremely large files (~70-80 MB each) being grouped together into a single upload step of ~7 GB.

Add ability to remove dataset items

The DagsHub CLI tool allows uploading datasets.

dagshub upload <repo> data/ data/

However, when we want to remove an item from the dataset, there is no CLI support.
Removing the items locally and then running the upload CLI tool will leave the items in the dataset.

The workaround is to use the DVC CLI tool to update the folder. However, this requires us to update the DVC configuration if that hasn't already been done.

From the user perspective, there might be two ways to handle this:

  1. New subcommand: dagshub sync <repo> data/ data
  2. New flag for upload: dagshub upload <repo> data/ data/ --sync

Additional Thoughts
If there is a datasource that already points to the dataset folder, the rows will persist even after the items are deleted. To remove these rows, we would need to recreate the data source. Another solution would be to add a new method to the datasources that would remove the rows.

ds = datasources.get('<repo>', 'images')
ds.wait_until_ready()
ds.remove('/path/to/deleted')

Automatically upload big files using DVC

When uploading a new directory with the CLI

dagshub upload <repo> <local-dir-path> <remote-dir--path>

The directory is uploaded using DVC.
When uploading a single file using the same command, the file is always uploaded with Git.
It would be nice to have a size threshold (e.g. 5 MB) that automatically decides to upload the file using DVC (a rough sketch follows the directory tree below).

The interesting question is how to prevent the repo from growing into a list of many single DVC-tracked files, and how to make sure the user uses DVC directories to store big files in a manner that makes sense:

.
β”œβ”€β”€ data  <-- dvc
β”‚   β”œβ”€β”€ preprocessed
β”‚   β”‚   └── 003.png <-- single file
β”‚   └── raw
β”œβ”€β”€ models <-- dvc
└── src <-- git
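A rough sketch of the suggested routing logic (hypothetical helper, using the threshold value suggested above):

import os

SIZE_THRESHOLD = 5 * 1024 * 1024  # e.g. 5 MB, as suggested above

def pick_versioning(path):
    # Route big files to DVC, small ones to Git
    return "dvc" if os.path.getsize(path) > SIZE_THRESHOLD else "git"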

Add option to log only every N steps

Relevant for the various autologgers (Keras, etc.).

The scenario is that when using a large dataset, there might be many millions of training steps.
The logged metrics become huge.
When using the vanilla logger, the user can easily wrap the logger with an if clause to only log once every e.g. 100 steps.
However, when using an autologger, this isn't something they can easily modify, so it should be an optional configuration on the logger.
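For the Keras case, one way this could look (a hypothetical callback sketch, logging through MLflow, with every_n as the proposed configuration knob):

import mlflow
from tensorflow import keras

class EveryNStepsLogger(keras.callbacks.Callback):
    """Hypothetical callback that logs metrics only once every `every_n` training batches."""

    def __init__(self, every_n=100):
        super().__init__()
        self.every_n = every_n
        self._step = 0

    def on_train_batch_end(self, batch, logs=None):
        self._step += 1
        if logs and self._step % self.every_n == 0:
            mlflow.log_metrics({k: float(v) for k, v in logs.items()}, step=self._step)

# Usage: model.fit(x, y, callbacks=[EveryNStepsLogger(every_n=100)])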

install_hooks throwing error on windows

I'm trying to use the data streaming capabilities from DagsHub, but when I run the install_hooks function it throws the AttributeError shown in the attached file. Running on Windows 11, dagshub version 0.2.17_1.

error.txt

DDA Client API features

I can upload data to DagsHub through the Python API. Can we use the Python API to remove files and search for files (like the glob package) as well?

Query compositing results in an unexpected bug

If I subquery an already queried Datasource, the subquery is not properly composited with the original query.

For instance:

(screenshot attached: 2023-09-05 at 10:47)

The first query filters the Datasource down to 539 items, but the second-level queries ignore this (there are 1008 items in total in the unfiltered Datasource).

The actual query description looks like this:

(screenshot attached: 2023-09-05 at 11:15)

Workaround:

If you use this notation, it works:

labeled = ds[ds['has_annotation'] == True]

train = labeled[labeled['split'] == 'train']

DagsHubFilesystem won't initialize without `project_root`

I was trying to list files in my data directory, simple enough.
I tried running:

fs = DagsHubFilesystem(repo_url=url)
fs.listdir("data")

and got the following error:

ValueError: Could not find a git repo. Either run the function inside of a git repo, specify project_root with the path to a cloned DagsHub repository, or specify repo_url (url of repo on DagsHub) and project_root (path to the folder where to mount the filesystem) arguments

The solution was to add project_root="." to DagsHubFilesystem(), but this was hard to figure out and required me to look at the code.
I propose making this the default when DagsHubFilesystem is used with a repo_url.

Cannot log in from Mac M1

Hi,
it seems that I cannot log in from a notebook on a Mac M1. In this environment I'm using Python 3.9.12.
I have tried both with a token and without one.

This is the code I'm using; I have tried the same code in a Kaggle notebook (Python 3.7.12) and it works fine.

dagshub_token = os.getenv("DAGSHUB_TOKEN")
dagshub.auth.add_app_token(token=dagshub_token)
dagshub.init(kaggle_competition, dagshub_username, mlflow=True)

and here's the traceback:

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[13], line 3
      1 dagshub_token = os.getenv("DAGSHUB_TOKEN")
      2 dagshub.auth.add_app_token(token="mysupersecrettoken")
----> 3 dagshub.init(kaggle_competition, dagshub_username, mlflow=True)

File ~/.conda/envs/machine-learning/lib/python3.10/site-packages/dagshub/common/init.py:57, in init(repo_name, repo_owner, url, root, host, mlflow, dvc)
     53 res = http_request("GET", urllib.parse.urljoin(host, config.REPO_INFO_URL.format(
     54     owner=repo_owner,
     55     reponame=repo_name)), auth=bearer)
     56 if res.status_code == 404:
---> 57     create_repo(repo_name)
     59 # Configure MLFlow
     60 if mlflow:

File ~/.conda/envs/machine-learning/lib/python3.10/site-packages/dagshub/upload/wrapper.py:108, in create_repo(repo_name, org_name, description, private, auto_init, gitignores, license, readme, template, host)
    105     if token is not None:
    106         auth = HTTPBearerAuth(token)
--> 108 if auth is None:
    109     raise RuntimeError("You can't create a repository without being authenticated.")
    111 if (license != "" or readme != "" or gitignores != "") and template == "none":

UnboundLocalError: local variable 'auth' referenced before assignment

Any ideas?

`upload_files` doesn't work when you send a list of paths to it

Discovered by Jinen

We need to make sure it's usable without users having to figure out what the list elements actually have to be, which is (path, binary_data) tuples.
Right now the workaround is basically map(lambda p: DataSet.get_file(...), paths). Maybe that will be enough to handle most cases.
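Until that's improved, a rough sketch of the workaround (assuming upload_files accepts a list of (path, binary_data) tuples as described above, and that Repo is importable from dagshub.upload and constructed with owner and repo name; the paths and repo below are hypothetical):

from dagshub.upload import Repo

repo = Repo("<owner>", "<repo-name>")                   # hypothetical repo
paths = ["results/args.yaml", "results/metrics.yaml"]   # hypothetical files

files = []
for p in paths:
    with open(p, "rb") as f:
        files.append((p, f.read()))                     # (path, binary_data) tuples

repo.upload_files(files, directory_path=".", commit_message="added arguments")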

Data Engine: Get datapoint via its path

Right now there's no "neat" way to get a datapoint if you know its path, even though we can do query filtering by path.

Suggested syntax:

dp = ds.get_datapoint("file.jpg")
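Until a dedicated method exists, the closest workaround is probably the bracket-filter syntax shown in the query-compositing issue above; whether the field is called "path" is an assumption on my part:

by_path = ds[ds["path"] == "file.jpg"]  # filters the Datasource rather than returning a single datapoint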

Support very large bucket directories

Right now the DagsHubFilesystem offers a listdir method that returns a list. If I'm trying to access a very large bucket directory, I can't expect that list to be arbitrarily big.
Example snippet that will time out:

from dagshub.streaming import DagsHubFilesystem
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/radiant-mlhub-dataset")
fs.listdir("s3://radiant-mlhub/bigearthnet")

I propose that the client implement an fs.walk method that returns a generator with potentially unbounded content.
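A sketch of how the proposed generator API might be used (fs.walk is hypothetical; the repo and bucket path are taken from the snippet above):

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/radiant-mlhub-dataset")
# Hypothetical: yield entries lazily, page by page, instead of building one huge list
for entry in fs.walk("s3://radiant-mlhub/bigearthnet"):
    print(entry)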

Token validation in cache

Right now the cache doesn't check that the tokens in it are valid.
This means you can add a permanent token, then revoke it on the website, but it won't be "revoked" in the client.

Add a download() function into the RepoAPI object

RepoAPI should have a download() function that utilizes the common.download.download_files() function to download the whole repo or a directory in a repo in a parallelized fashion.

repo = RepoAPI("user/repo")
repo.download(outdir="/home/user/my-repo")
# OR
repo.download("data/images", "/home/user/images")

Probably should be recursive by default, with the ability to turn it off.

Add optional retry for data upload operations

They have a tendency to get interrupted, and an optional retry mechanism would really improve the experience.
This will require care, since if the upload has any opened files, they either have to be re-opened or else the retry aborted.
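A rough sketch of what an opt-in retry could look like (the helper and upload_fn are hypothetical, not the client's API); note how the files are re-read on every attempt so a failed try never reuses half-consumed file objects:

import time

def upload_with_retry(upload_fn, paths, retries=3, backoff=2.0):
    # Hypothetical helper: re-open and re-read the files on each attempt
    for attempt in range(retries):
        try:
            files = []
            for p in paths:
                with open(p, "rb") as f:
                    files.append((p, f.read()))
            return upload_fn(files)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))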
