
client's Introduction

DagsHub Client




What is DagsHub?

DagsHub is a platform where machine learning and data science teams can build, manage, and collaborate on their projects. With DagsHub you can:

  1. Version code, data, and models in one place. Use the free DagsHub storage provided, or connect it to your cloud storage
  2. Track experiments using Git, DVC, or MLflow to provide a fully reproducible environment
  3. Visualize pipelines, data, and notebooks in an interactive, diff-able, and dynamic way
  4. Label your data directly on the platform using Label Studio
  5. Share your work with your team members
  6. Stream and upload your data in an intuitive and easy way, while preserving versioning and structure.

DagsHub is built firmly around open, standard formats for your project, so you can work with DagsHub regardless of your chosen programming language or frameworks.

DagsHub Client API & CLI

This client library is meant to help you get started quickly with DagsHub. It consists of experiment tracking and Direct Data Access (DDA), a component that lets you stream and upload your data.

For more details on the different functions of the client, check out the docs sections:

  1. Installation & Setup
  2. Data Streaming
  3. Data Upload
  4. Experiment Tracking
    1. Autologging
  5. Data Engine

Some functionality is supported only in Python.

To read about some of the awesome use cases for Direct Data Access, check out the relevant doc page.

Installation

pip install dagshub

Direct Data Access (DDA) functionality requires authentication, which you can easily do by running the following command in your terminal:

dagshub login
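Alternatively, if you need non-interactive authentication (for example in CI), you can register an app token from Python; a minimal sketch, assuming your token is stored in a DAGSHUB_TOKEN environment variable (that variable name is just a convention used here):

import os
import dagshub.auth

token = os.getenv("DAGSHUB_TOKEN")  # assumed location of your DagsHub app token
if token:
    dagshub.auth.add_app_token(token=token)  # registers the token for subsequent client calls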

Quickstart for Data Streaming

The easiest way to start using DagsHub is via the Python Hooks method. To do this:

  1. Install and authenticate the client (see Installation above),
  2. Copy the following 2 lines of code into the Python code that accesses your data:
    from dagshub.streaming import install_hooks
    install_hooks()
  3. That's it! You now have streaming access to all your project files (see the short sketch below).
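Once the hooks are installed, plain Python file access can read files that only exist on your DagsHub remote; a minimal sketch (the path below is hypothetical, so use a file that actually exists in your repo):

from dagshub.streaming import install_hooks
install_hooks()

# After install_hooks(), ordinary open() calls can read files that aren't on disk yet --
# they are streamed from your DagsHub repo on demand.
with open("data/train.csv") as f:  # hypothetical path
    print(f.readline())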

🀩 Check out this Colab notebook to see an end-to-end example of Data Streaming at work:

Open In Colab

Next Steps

You can dive into the expanded documentation to learn more about data streaming, data upload, and experiment tracking with DagsHub.


Analytics

To improve your experience, we collect analytics on client usage. If you want to disable analytics collection, set the DAGSHUB_DISABLE_ANALYTICS environment variable to any value.
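If you prefer to set this from Python rather than the shell, a minimal sketch (the assumption that it should be set before the client is used is mine):

import os

# Any value disables analytics collection; set it before using the dagshub client
os.environ["DAGSHUB_DISABLE_ANALYTICS"] = "true"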

Made with 🐢 by DagsHub.

client's People

Contributors

arjvik, deanp70, evgenileonti, guysmoilov, idonov8, jacob-zietek, jinensetpal, kbolashev, krishnaduttpanchagnula, martintali, mohithg, nikitha-narendra, nirbarazida, pyup-bot, rabroldan, sdafni, simonlsk, talmalka123, yairl


client's Issues

Truncated command help text for `dagshub repo --help`

Tried running dagshub repo --help and this is the full help text that gets printed:

Usage: dagshub repo [OPTIONS] COMMAND [ARGS]...

  Operations on repo: currently only 'create'

Options:
  --help  Show this message and exit.

Commands:
  create  create a repo and:
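For context on where the truncation likely comes from: the output format matches Click, which shows only the first line of a command's docstring (or an explicit short_help) in a group's command listing. A minimal sketch of that behaviour, with a hypothetical docstring continuation:

import click

@click.group()
def repo():
    """Operations on repo: currently only 'create'"""

@repo.command()
def create():
    """create a repo and:

    (hypothetical continuation) optionally initialize it with a README and .gitignore.
    """
    # Only the first docstring line ("create a repo and:") appears in the group's
    # command listing, which is why the help text looks truncated. Passing
    # short_help=... to @repo.command() or rewording the first line would fix it.

if __name__ == "__main__":
    repo()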

Experiments Card Overflow UI issue

The scroll feature doesn't seem to work. I'm trying to view the setup instructions in the "experiments" section. The card seems to have extra instructions at the bottom, but I can't read them.

(screenshot attached: dagshub-experiment)

Adding `versioning` argument to `repo.upload` when uploading a folder throws an error

Attempting to override the default settings for versioning when uploading a folder returns an error:

TypeError: dagshub.upload.wrapper.DataSet.commit() got multiple values for keyword argument 'versioning'

This is because we pass kwargs, which already contains the versioning argument, alongside our own versioning argument. We need to check for this first.
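A minimal sketch of the pattern that likely causes this (the function names below are illustrative, not the actual client code): if the caller's kwargs already contain versioning and the wrapper also passes its own versioning, Python raises the "multiple values" TypeError; keeping a single source of truth before forwarding avoids it.

def commit(commit_message, versioning="auto", **kwargs):
    print(f"committing with versioning={versioning}")

def upload_folder(commit_message, versioning="auto", **kwargs):
    # Buggy version: if kwargs already contains "versioning", this passes it twice:
    #   return commit(commit_message, versioning=versioning, **kwargs)
    # Safer: merge into a single keyword before forwarding
    kwargs.setdefault("versioning", versioning)
    return commit(commit_message, **kwargs)

upload_folder("add data", versioning="dvc")  # works; no duplicate keyword argument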

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

Patch os.open instead of builtins.open and pathlib

To fully support all cases, it would make more sense to rewrite our patching logic in install_hooks to patch the more low-level os.open function, which maps directly to syscalls and is actually used by both builtins.open and pathlib.
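A minimal sketch of what patching at the os.open level could look like (illustrative only, not the client's actual implementation; ensure_file_is_available is a hypothetical placeholder):

import os

_original_os_open = os.open

def ensure_file_is_available(path):
    # Hypothetical placeholder for the streaming logic that would fetch the file
    # from DagsHub if it isn't present locally.
    pass

def _patched_os_open(path, flags, mode=0o777, *, dir_fd=None):
    ensure_file_is_available(path)
    # Delegate to the real low-level open, which backs both builtins.open and pathlib
    return _original_os_open(path, flags, mode, dir_fd=dir_fd)

os.open = _patched_os_open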

Saving notebooks in non-Colab Jupyter saves the entire execution history

In non-Colab Jupyter, we save the entire execution history, i.e. if you run a cell twice, the saved notebook will include 2 copies of the cell, one for each run.

There should probably be a way to save only the current state, which is what we want.

Support Python 3.11

Currently, Python 3.11 isn't supported by popular ML frameworks such as PyTorch, so it wasn't a high priority to support it ourselves.
@kbolashev might have more information on what's missing to add this support.

Unhelpful Client Errors

All of the traces below were raised during the same YOLOv8 training run, inside the ultralytics DagsHub callback's save_artifacts.

Cause: Expects a list of file objects, but I provided strings (the documentation requests a list of strings).
Error: TypeError: Expected bytes or bytes-like object got: <class 'str'>
(raised from httpx's multipart encoding, reached via dagshub/upload/wrapper.py upload_files())

Cause: I created a repository and triggered DagsHub logging, but did not initialize the repo; if I let DDA initialize it, it sets up a sample commit by default.
Error: dagshub.common.api.repo.BranchNotFoundError: Branch main not found in repo https://dagshub.com/jinensetpal/yolotest
(raised from dagshub/common/api/repo.py get_branch_info(), reached via upload_files() requesting the last commit SHA)
Notes: Maybe we should just give a user warning and initialize the branch?

Cause: This is because it's an existing file, and the 'force' argument is required.
Error: dagshub.common.api.repo.BranchNotFoundError: Branch main not found in repo https://dagshub.com/jinensetpal/yolotest
Notes: The error message shouldn't refer to last_commit; it seems it's only used internally. Also, I don't think force is a good name for the argument, since Git and DVC are both VCSs and it's not abnormal to update an existing file. I read force as some out-of-the-ordinary user flow.

Cause: I am not an owner of this repository, but it gives me a JSONDecodeError.
Error: dagshub.upload.errors.UpdateNotAllowedError: Cannot update existing 'artifacts/P_curve.png' file without specifying last_commit
(raised from dagshub/upload/wrapper.py _log_upload_details())

Upload: path in remote is unclear

dagshub upload noa/Bears-Recognition <local_file_path> <path_in_remote>

I expect the target path to behave like on GitHub: start at the repository root on the server and be relative to that, rather than relative to the location on my local machine.

Add `--versioning` flag for CLI upload command

Sometimes the user wants to override our logic for whether to push a file to Git or DVC. They currently can't do this via the CLI, since dagshub upload doesn't have a versioning argument, and due to #293, uploading via the API doesn't work properly either.

Create Repo with visibility and clone settings

Currently, you can use DagsHub to create a repo in 2 ways:

  1. CLI
  2. Python API

Neither one has all the options, which creates a strange UX when they are needed.

The CLI flow doesn't have the option to create a private repo, but does enable cloning the repo locally after creating it. The Python API has the visibility option, but not the cloning option.

It would be good for either the CLI or the API to have all the options. This can be achieved by adding a clone flag to the Python function and moving the cloning logic from the CLI to the API, and/or by adding a --private flag to the CLI command that uses the API's argument to create a private repo. A rough sketch of the first option follows.
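As an illustration of the proposal (not the current client API), a thin wrapper could combine the two flows; a minimal sketch, with the create_repo import path taken from a traceback elsewhere on this page and the clone URL passed in explicitly:

import subprocess
from dagshub.upload.wrapper import create_repo  # existing Python API; already accepts private=...

def create_repo_with_clone(repo_name, clone_url=None, clone_to=None, **kwargs):
    # Hypothetical wrapper: create the repo via the Python API (kwargs may include
    # private=True), then optionally clone it locally, which today only the CLI does.
    create_repo(repo_name, **kwargs)
    if clone_url and clone_to:
        subprocess.run(["git", "clone", clone_url, clone_to], check=True)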

Save notebook doesn't work properly

Save notebook has an extension issue:
In Colab, the logic for adding the .ipynb extension is backwards. We add it when it already exists, and don't add it when it doesn't.

upload failing on dataset with large files

My dataset (~640 GB) is failing on dagshub upload.
This is apparently due to some extremely large files (~70-80 MB each) being grouped together into a single upload step of ~7 GB.

Add ability to remove dataset items

The DagsHub CLI tool allows uploading datasets.

dagshub upload <repo> data/ data/

However, when we want to remove an item from the dataset, there is no CLI support.
Removing the items locally and then running the upload CLI tool will leave the items in the dataset.

The workaround is to use the DVC CLI tool to update the folder. However, this requires us to update the DVC configuration if that hasn't already been done.

From the user perspective, there might be two ways to handle this:

  1. New subcommand: dagshub sync <repo> data/ data
  2. New flag for upload: dagshub upload <repo> data/ data/ --sync

Additional Thoughts
If there is a datasource that already points to the dataset folder, the rows will persist even after the items are deleted. To remove these rows, we would need to recreate the data source. Another solution would be to add a new method to the datasources that would remove the rows.

ds = datasources.get('<repo>', 'images')
ds.wait_until_ready()
ds.remove('/path/to/deleted')

Automatically upload big files using DVC

When uploading a new directory with the CLI

dagshub upload <repo> <local-dir-path> <remote-dir--path>

The directory is uploaded using DVC.
When uploading a single file using the same command, the file is always uploaded with Git.
It would be nice to have a size threshold (e.g. 5 MB) that automatically decides to upload the file using DVC (a rough sketch follows the directory tree below).

The interesting question is how to prevent the repo from growing into a list of many single DVC-tracked files, and how to make sure the user uses DVC directories to store big files in a manner that makes sense:

.
β”œβ”€β”€ data  <-- dvc
β”‚   β”œβ”€β”€ preprocessed
β”‚   β”‚   └── 003.png <-- single file
β”‚   └── raw
β”œβ”€β”€ models <-- dvc
└── src <-- git
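A rough sketch of the suggested routing logic (hypothetical helper, using the threshold value suggested above):

import os

SIZE_THRESHOLD = 5 * 1024 * 1024  # e.g. 5 MB, as suggested above

def pick_versioning(path):
    # Route big files to DVC, small ones to Git
    return "dvc" if os.path.getsize(path) > SIZE_THRESHOLD else "git"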

Add option to log only every N steps

Relevant for the various autologgers (Keras, etc.).

The scenario is that when using a large dataset, there might be many millions of training steps.
The logged metrics become huge.
When using the vanilla logger, the user can easily wrap the logger with an if clause to only log once every e.g. 100 steps.
However, when using an autologger, this isn't something they can easily modify, so it should be an optional configuration on the logger.
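For the Keras case, one way this could look (a hypothetical callback sketch, logging through MLflow, with every_n as the proposed configuration knob):

import mlflow
from tensorflow import keras

class EveryNStepsLogger(keras.callbacks.Callback):
    """Hypothetical callback that logs metrics only once every `every_n` training batches."""

    def __init__(self, every_n=100):
        super().__init__()
        self.every_n = every_n
        self._step = 0

    def on_train_batch_end(self, batch, logs=None):
        self._step += 1
        if logs and self._step % self.every_n == 0:
            mlflow.log_metrics({k: float(v) for k, v in logs.items()}, step=self._step)

# Usage: model.fit(x, y, callbacks=[EveryNStepsLogger(every_n=100)])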

install_hooks throwing error on windows

I'm trying to use the data streaming capabilities from DagsHub, but when I run the install_hooks function it throws the AttributeError shown in the attached file. Running on Windows 11, dagshub version 0.2.17_1.

error.txt

DDA Client API features

I can upload data to DagsHub through the Python API. Can we use the Python API to remove files and search for files (like the glob package) as well?

Query compositing results in an unexpected bug

If I subquery an already queried Datasource, the subquery is not properly composited with the original query.

For instance:

(screenshot attached: 2023-09-05 at 10:47)

The first query filters the Datasource down to 539 items, but the second-level queries ignore this (there are 1008 items in total in the unfiltered Datasource).

The actual query description looks like this:

(screenshot attached: 2023-09-05 at 11:15)

Workaround:

If you use this notation, it works:

labeled = ds[ds['has_annotation'] == True]

train = labeled[labeled['split'] == 'train']

DagsHubFilesystem won't initialize without `project_root`

I was trying to list files in my data directory, simple enough.
I tried running:

fs = DagsHubFilesystem(repo_url=url)
fs.listdir("data")

and got the following error:

ValueError: Could not find a git repo. Either run the function inside of a git repo, specify project_root with the path to a cloned DagsHub repository, or specify repo_url (url of repo on DagsHub) and project_root (path to the folder where to mount the filesystem) arguments

The solution was to add project_root="." to DagsHubFilesystem(), but this was hard to figure out and required me to look at the code.
I propose making this the default when DagsHubFilesystem is used with a repo_url.

Cannot log in from Mac M1

Hi,
it seems that I cannot log in from a notebook on a Mac M1. In this environment I'm using Python 3.9.12.
I have tried both with a token and without one.

This is the code I'm using; I have tried the same code in a Kaggle notebook (Python 3.7.12) and it works fine.

dagshub_token = os.getenv("DAGSHUB_TOKEN")
dagshub.auth.add_app_token(token=dagshub_token)
dagshub.init(kaggle_competition, dagshub_username, mlflow=True)

and here's the traceback:

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[13], line 3
      1 dagshub_token = os.getenv("DAGSHUB_TOKEN")
      2 dagshub.auth.add_app_token(token="mysupersecrettoken")
----> 3 dagshub.init(kaggle_competition, dagshub_username, mlflow=True)

File ~/.conda/envs/machine-learning/lib/python3.10/site-packages/dagshub/common/init.py:57, in init(repo_name, repo_owner, url, root, host, mlflow, dvc)
     53 res = http_request("GET", urllib.parse.urljoin(host, config.REPO_INFO_URL.format(
     54     owner=repo_owner,
     55     reponame=repo_name)), auth=bearer)
     56 if res.status_code == 404:
---> 57     create_repo(repo_name)
     59 # Configure MLFlow
     60 if mlflow:

File ~/.conda/envs/machine-learning/lib/python3.10/site-packages/dagshub/upload/wrapper.py:108, in create_repo(repo_name, org_name, description, private, auto_init, gitignores, license, readme, template, host)
    105     if token is not None:
    106         auth = HTTPBearerAuth(token)
--> 108 if auth is None:
    109     raise RuntimeError("You can't create a repository without being authenticated.")
    111 if (license != "" or readme != "" or gitignores != "") and template == "none":

UnboundLocalError: local variable 'auth' referenced before assignment

Any ideas?

`upload_files` doesn't work when you send a list of paths to it

Discovered by Jinen

We need to make sure it's usable without users having to figure out what the list elements actually have to be, which is (path, binary_data) tuples.
Right now the workaround is basically map(lambda p: DataSet.get_file(...), paths). Maybe that will be enough to handle most cases.
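Until that's improved, a rough sketch of the workaround (assuming upload_files accepts a list of (path, binary_data) tuples as described above, and that Repo is importable from dagshub.upload and constructed with owner and repo name; the paths and repo below are hypothetical):

from dagshub.upload import Repo

repo = Repo("<owner>", "<repo-name>")                   # hypothetical repo
paths = ["results/args.yaml", "results/metrics.yaml"]   # hypothetical files

files = []
for p in paths:
    with open(p, "rb") as f:
        files.append((p, f.read()))                     # (path, binary_data) tuples

repo.upload_files(files, directory_path=".", commit_message="added arguments")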

Data Engine: Get datapoint via its path

Right now there's no "neat" way to get a datapoint if you know its path, even though we can do query filtering by path.

Suggested syntax:

dp = ds.get_datapoint("file.jpg")
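Until a dedicated method exists, the closest workaround is probably the bracket-filter syntax shown in the query-compositing issue above; whether the field is called "path" is an assumption on my part:

by_path = ds[ds["path"] == "file.jpg"]  # filters the Datasource rather than returning a single datapoint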

Support very large bucket directories

Right now the DagsHubFilesystem offers a listdir method that returns a list. If I'm trying to access a very large bucket directory, I can't expect that list to be arbitrarily big.
Example snippet that will time out:

from dagshub.streaming import DagsHubFilesystem
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/radiant-mlhub-dataset")
fs.listdir("s3://radiant-mlhub/bigearthnet")

I propose that the client implement an fs.walk method that returns a generator with potentially unbounded content.
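A sketch of how the proposed generator API might be used (fs.walk is hypothetical; the repo and bucket path are taken from the snippet above):

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/radiant-mlhub-dataset")
# Hypothetical: yield entries lazily, page by page, instead of building one huge list
for entry in fs.walk("s3://radiant-mlhub/bigearthnet"):
    print(entry)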

Token validation in cache

Right now the cache doesn't check that the tokens in it are valid.
This means you can add a permanent token, then revoke it on the website, but it won't be "revoked" in the client.

Add a download() function into the RepoAPI object

RepoAPI should have a download() function that utilizes the common.download.download_files() function to download the whole repo or a directory in a repo in a parallelized fashion.

repo = RepoAPI("user/repo")
repo.download(outdir="/home/user/my-repo")
# OR
repo.download("data/images", "/home/user/images")

Probably should be recursive by default, with the ability to turn it off.

Add optional retry for data upload operations

They have a tendency to get interrupted, and an optional retry mechanism would really improve the experience.
This will require care, since if the upload has any opened files, they either have to be re-opened or else the retry aborted.
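A rough sketch of what an opt-in retry could look like (the helper and upload_fn are hypothetical, not the client's API); note how the files are re-read on every attempt so a failed try never reuses half-consumed file objects:

import time

def upload_with_retry(upload_fn, paths, retries=3, backoff=2.0):
    # Hypothetical helper: re-open and re-read the files on each attempt
    for attempt in range(retries):
        try:
            files = []
            for p in paths:
                with open(p, "rb") as f:
                    files.append((p, f.read()))
            return upload_fn(files)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))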
