
gpuhunt's People

Contributors

bihan, egor-s, jvstme, peterschmidt85, r4victor, thebits


gpuhunt's Issues

Exception when querying a catalog that contains CPU instances

Hi!

I get an exception on a simple query.

>>> from gpuhunt import Catalog
>>> catalog = Catalog()
>>> catalog.load(version="20231120")
>>> catalog.query(provider='nebius', gpu_name=['H100'])
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    catalog.query(provider='nebius', gpu_name=['H100'])
  File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 142, in query
    items = list(heapq.merge(*[f.result() for f in completed], key=lambda i: i.price))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 142, in <listcomp>
    items = list(heapq.merge(*[f.result() for f in completed], key=lambda i: i.price))
                               ^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 194, in _get_offline_provider_items
    if constraints.matches(item, query_filter):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/constraints.py", line 110, in matches
    if i.gpu_name.lower() not in q.gpu_name:
       ^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'lower'
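A minimal sketch of a possible guard for this crash, using simplified `CatalogItem`/`QueryFilter` stand-ins rather than gpuhunt's actual types: items without a GPU (`gpu_name` is `None`, i.e. CPU instances) are treated as non-matching whenever a `gpu_name` filter is set, instead of calling `.lower()` on `None`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Simplified stand-ins for gpuhunt's internal types (illustrative only)
@dataclass
class CatalogItem:
    gpu_name: Optional[str]
    price: float

@dataclass
class QueryFilter:
    gpu_name: Optional[List[str]] = None

def matches(i: CatalogItem, q: QueryFilter) -> bool:
    if q.gpu_name is not None:
        # Guard against CPU-only items, whose gpu_name is None
        if i.gpu_name is None or i.gpu_name.lower() not in q.gpu_name:
            return False
    return True

print(matches(CatalogItem(gpu_name=None, price=0.1), QueryFilter(gpu_name=["h100"])))  # False, no crash
```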

VastAI offers include maximum available disk_size instead of min_disk_size

When I specify disk as 85GB.., VastAI offers have disk_size set to offer["disk_size"], which is the maximum available disk size:

✗ dstack run . -b vastai
 Configuration          .dstack.yml                   
 Project                main                          
 User                   admin                         
 Pool name              default-pool                  
 Min resources          2..xCPU, 8GB.., 85GB.. (disk) 
 Max price              -                             
 Max duration           6h                            
 Spot policy            auto                          
 Retry policy           yes                           
 Creation policy        reuse-or-create               
 Termination policy     destroy-after-idle            
 Termination idle time  600s                          

 #  BACKEND  REGION     INSTANCE  RESOURCES                    SPOT  PRICE      
 1  vastai   es-spain   10245490  24xCPU, 8GB, 1xRTX4090       no    $0.59542   
                                  (24GB), 293.1GB (disk)                        
 2  vastai   tw-taiwan  9110786   32xCPU, 8GB, 1xRTX4090       no    $0.44132   
                                  (24GB), 713.362GB (disk)                      
 3  vastai   us-utah    10498512  16xCPU, 16GB, 2xRTX3090      no    $0.42361   
                                  (24GB), 357GB (disk)                          
    ...                                                                         
 Shown 3 of 54 offers, $11.2177 max

This is inconsistent with all other providers, which return min_disk_size. It also prevents the user from getting an offer with a disk_size smaller than the maximum.

✗ dstack run . -b runpod
 Configuration          .dstack.yml                   
 Project                main                          
 User                   admin                         
 Pool name              default-pool                  
 Min resources          2..xCPU, 8GB.., 85GB.. (disk) 
 Max price              -                             
 Max duration           6h                            
 Spot policy            auto                          
 Retry policy           yes                           
 Creation policy        reuse-or-create               
 Termination policy     destroy-after-idle            
 Termination idle time  600s                          

 #  BACKEND  REGION   INSTANCE              RESOURCES             SPOT  PRICE   
 1  runpod   EU-SE-1  NVIDIA RTX A4000      9xCPU, 50GB,          yes   $0.19   
                                            1xRTXA4000 (16GB),                  
                                            85GB (disk)                         
 2  runpod   EU-RO-1  NVIDIA RTX 4000 Ada   9xCPU, 50GB,          yes   $0.21   
                      Generation            1xRTX4000 (20GB),                   
                                            85GB (disk)                         
 3  runpod   EU-SE-1  NVIDIA RTX A5000      9xCPU, 43GB,          yes   $0.26   
                                            1xRTXA5000 (24GB),                  
                                            85GB (disk)                         
    ...                                                                         
 Shown 3 of 206 offers, $37.52 max

Needed to make dstackai/dstack#973 work.
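One possible shape of the fix, sketched with an illustrative helper (the function name and signature are hypothetical, not VastAI's or gpuhunt's API): report the requested minimum disk size, clamped by the offer's maximum, so the offer matches the behavior of other providers.

```python
# Hypothetical sketch: expose the smallest disk that satisfies the request,
# rather than the offer's maximum available disk.
def offer_disk_size(requested_min_gb: float, offer_max_gb: float) -> float:
    # An offer can't provide more than its maximum; within that bound,
    # report the requested minimum, as other providers do.
    return min(requested_min_gb, offer_max_gb)

print(offer_disk_size(85, 713.362))  # 85
```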

Sort VastAI offers by GPU count

VastAI offers are not sorted by price because the cheapest instances are unreliable.
However, the default ordering puts large instances (many GPUs) before smaller ones.

The proposed fix is to stable-sort by GPU count, keeping the original order within offers that have the same GPU count.
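A sketch of a stable sort by GPU count, per the issue title (the offer dicts below are illustrative, not gpuhunt's schema). Python's `list.sort` is stable, so a single key preserves the original, reliability-based order within each GPU-count group.

```python
# Illustrative offers in their original, reliability-based order
offers = [
    {"id": "a", "gpu_count": 8},
    {"id": "b", "gpu_count": 1},
    {"id": "c", "gpu_count": 1},
    {"id": "d", "gpu_count": 2},
]

# Stable sort: offers with equal gpu_count keep their relative order
offers.sort(key=lambda o: o["gpu_count"])
print([o["id"] for o in offers])  # ['b', 'c', 'd', 'a']
```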

DataCrunch

Hi!

Datacrunch.io is an interesting modern ML cloud with premium dedicated GPU servers and clusters.
I'll add support for this provider.

FAILED src/integrity_tests/test_all.py::TestAllCatalogs::test_non_zero_cost

test_non_zero_cost fails today.

One of the providers returns 0 prices.

Run pytest src/integrity_tests
============================= test session starts ==============================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0
rootdir: /home/runner/work/gpuhunt/gpuhunt
collected 25 items

src/integrity_tests/test_all.py F                                        [  4%]
src/integrity_tests/test_aws.py ...                                      [ 16%]
src/integrity_tests/test_azure.py .....                                  [ 36%]
src/integrity_tests/test_cudo.py F..                                     [ 48%]
src/integrity_tests/test_datacrunch.py ....                              [ 64%]
src/integrity_tests/test_gcp.py .....                                    [ 84%]
src/integrity_tests/test_nebius.py ....                                  [100%]

=================================== FAILURES ===================================
______________________ TestAllCatalogs.test_non_zero_cost ______________________

self = <integrity_tests.test_all.TestAllCatalogs object at 0x7f5729aaaa90>
catalog_files = [PosixPath('lambdalabs.csv'), PosixPath('cudo.csv'), PosixPath('nebius.csv'), PosixPath('gcp.csv'), PosixPath('aws.csv'), PosixPath('datacrunch.csv'), ...]

    def test_non_zero_cost(self, catalog_files: List[Path]):
        for file in catalog_files:
            with open(file, "r") as f:
                reader = csv.DictReader(f)
                prices = [float(row["price"]) for row in reader]
>           assert 0 not in prices
E           assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]

src/integrity_tests/test_all.py:19: AssertionError
________________________________ test_locations ________________________________

data_rows = [{'cpu': '1', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, {'cpu': '1', 'disk_size': '', 'gpu_count': ...u_count': '1', 'gpu_memory': '16', ...}, {'cpu': '3', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, ...]

    def test_locations(data_rows):
        expected = {
            "no-luster-1",
            "se-smedjebacken-1",
            "se-stockholm-1",
            "us-newyork-1",
            "us-santaclara-1",
        }
        locations = select_row(data_rows, "location")
>       assert set(locations) == expected
E       AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
E         
E         Extra items in the right set:
E         'se-smedjebacken-1'
E         
E         Full diff:
E           {
E               'no-luster-1',
E         -     'se-smedjebacken-1',
E               'se-stockholm-1',
E               'us-newyork-1',
E               'us-santaclara-1',
E           }

src/integrity_tests/test_cudo.py:32: AssertionError
---------------------------- Captured stdout setup -----------------------------
.
=========================== short test summary info ============================
FAILED src/integrity_tests/test_all.py::TestAllCatalogs::test_non_zero_cost - assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]
FAILED src/integrity_tests/test_cudo.py::test_locations - AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
  
  Extra items in the right set:
  'se-smedjebacken-1'
  
  Full diff:
    {
        'no-luster-1',
  -     'se-smedjebacken-1',
        'se-stockholm-1',
        'us-newyork-1',
        'us-santaclara-1',
    }
========================= 2 failed, 23 passed in 0.35s =========================
Error: Process completed with exit code 1.
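A hypothetical mitigation for the `test_non_zero_cost` failure, sketched with an inline CSV (the column name `price` matches the test above; the rows are made up): drop zero-price rows when building a catalog so they never reach the published CSV.

```python
import csv
import io

# Made-up catalog data; real catalogs come from the provider APIs
raw = "price,gpu_name\n0,A100\n1.5,H100\n"

reader = csv.DictReader(io.StringIO(raw))
# Keep only rows with a strictly positive price
rows = [r for r in reader if float(r["price"]) > 0]
print(len(rows))  # 1
```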

Implement public query API

  • Rename the package to gpuhunt
  • Upload to PyPI
  • Rename the S3 bucket to gpuhunt
  • Move the instance type filtering logic to dstack
  • Update the README.md

Cudo Compute: Improve the filtering of offers

The Cudo provider doesn't return any offers if I set both min_gpu_count and max_gpu_count to 0.

❯ python test_case.py
QueryFilter(provider=['cudo']) 3793
QueryFilter(provider=['cudo'], min_gpu_count=0) 458
QueryFilter(provider=['cudo'], min_gpu_count=0, max_gpu_count=0) 0
QueryFilter(provider=['cudo'], min_gpu_count=0, max_gpu_count=1) 29
QueryFilter(provider=['cudo'], min_gpu_count=1, max_gpu_count=1) 29
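A sketch of the expected range semantics (simplified, not gpuhunt's actual constraint code): with `min_gpu_count=0` and `max_gpu_count=0`, CPU-only offers (`gpu_count == 0`) should match, rather than the filter returning nothing.

```python
# Simplified range check; None means unbounded on that side
def in_range(value, lo=None, hi=None):
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    return True

offers = [{"gpu_count": 0}, {"gpu_count": 1}, {"gpu_count": 2}]

# min=0, max=0 should select exactly the CPU-only offer
cpu_only = [o for o in offers if in_range(o["gpu_count"], lo=0, hi=0)]
print(len(cpu_only))  # 1
```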

Add filtering options for `query`

  • Add min/max constraints for CPU, memory, gpu_count, gpu_memory, and gpu_total_memory parameters
  • Return the minimal configuration in clouds with flexible sizing (like TensorDock)
  • Add reasonable minimal constraints for all resources if only one (or a couple) of CPU, memory, or GPU is specified
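The min/max constraints above can be sketched as a small self-contained filter (the offer dicts and field names are illustrative, not gpuhunt's final API):

```python
# Generic min/max check over one offer field; None means unconstrained
def within(offer, key, lo=None, hi=None):
    v = offer[key]
    return (lo is None or v >= lo) and (hi is None or v <= hi)

offers = [
    {"cpu": 2, "memory": 8, "gpu_count": 0},
    {"cpu": 8, "memory": 32, "gpu_count": 1},
]

# e.g. at least 4 CPUs and between 1 and 2 GPUs
hits = [
    o for o in offers
    if within(o, "cpu", lo=4) and within(o, "gpu_count", lo=1, hi=2)
]
print(len(hits))  # 1
```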

Make a pre-defined dictionary with unified regions

Currently, every provider has its own region codes. While this information is useful, it would also be great to support unified region codes (e.g., us-east, europe-central) and map provider-specific region codes to the unified ones.

This will allow users to search for GPUs by region without having to know how regions are coded by each provider.

This will also be supported by https://github.com/dstackai/dstack
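The mapping could be as simple as a pre-defined dictionary keyed by provider and region; the unified names and entries below are hypothetical, not gpuhunt's actual data:

```python
# Hypothetical provider-region -> unified-region mapping
UNIFIED_REGIONS = {
    ("aws", "us-east-1"): "us-east",
    ("gcp", "us-east4"): "us-east",
    ("azure", "eastus"): "us-east",
    ("cudo", "se-stockholm-1"): "europe-north",
}

def unified_region(provider: str, region: str) -> str:
    # Fall back to the provider-specific code if no mapping exists
    return UNIFIED_REGIONS.get((provider, region), region)

print(unified_region("gcp", "us-east4"))  # us-east
```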

FAILED src/integrity_tests/test_cudo.py::test_locations

Test catalog integrity failed today.
https://github.com/dstackai/gpuhunt/actions/runs/8105089998/job/22172772003

Servers in the 'se-smedjebacken-1' location can sometimes be unavailable.

Run pytest src/integrity_tests
============================= test session starts ==============================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0
rootdir: /home/runner/work/gpuhunt/gpuhunt
collected 25 items

src/integrity_tests/test_all.py F                                        [  4%]
src/integrity_tests/test_aws.py ...                                      [ 16%]
src/integrity_tests/test_azure.py .....                                  [ 36%]
src/integrity_tests/test_cudo.py F..                                     [ 48%]
src/integrity_tests/test_datacrunch.py ....                              [ 64%]
src/integrity_tests/test_gcp.py .....                                    [ 84%]
src/integrity_tests/test_nebius.py ....                                  [100%]

=================================== FAILURES ===================================
______________________ TestAllCatalogs.test_non_zero_cost ______________________

self = <integrity_tests.test_all.TestAllCatalogs object at 0x7f5729aaaa90>
catalog_files = [PosixPath('lambdalabs.csv'), PosixPath('cudo.csv'), PosixPath('nebius.csv'), PosixPath('gcp.csv'), PosixPath('aws.csv'), PosixPath('datacrunch.csv'), ...]

    def test_non_zero_cost(self, catalog_files: List[Path]):
        for file in catalog_files:
            with open(file, "r") as f:
                reader = csv.DictReader(f)
                prices = [float(row["price"]) for row in reader]
>           assert 0 not in prices
E           assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]

src/integrity_tests/test_all.py:19: AssertionError
________________________________ test_locations ________________________________

data_rows = [{'cpu': '1', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, {'cpu': '1', 'disk_size': '', 'gpu_count': ...u_count': '1', 'gpu_memory': '16', ...}, {'cpu': '3', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, ...]

    def test_locations(data_rows):
        expected = {
            "no-luster-1",
            "se-smedjebacken-1",
            "se-stockholm-1",
            "us-newyork-1",
            "us-santaclara-1",
        }
        locations = select_row(data_rows, "location")
>       assert set(locations) == expected
E       AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
E         
E         Extra items in the right set:
E         'se-smedjebacken-1'
E         
E         Full diff:
E           {
E               'no-luster-1',
E         -     'se-smedjebacken-1',
E               'se-stockholm-1',
E               'us-newyork-1',
E               'us-santaclara-1',
E           }

src/integrity_tests/test_cudo.py:32: AssertionError
---------------------------- Captured stdout setup -----------------------------
.
=========================== short test summary info ============================
FAILED src/integrity_tests/test_all.py::TestAllCatalogs::test_non_zero_cost - assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]
FAILED src/integrity_tests/test_cudo.py::test_locations - AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
  
  Extra items in the right set:
  'se-smedjebacken-1'
  
  Full diff:
    {
        'no-luster-1',
  -     'se-smedjebacken-1',
        'se-stockholm-1',
        'us-newyork-1',
        'us-santaclara-1',
    }
========================= 2 failed, 23 passed in 0.35s =========================
Error: Process completed with exit code 1.
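One way to make `test_locations` tolerant of temporarily unavailable locations (a sketch of a possible change, not necessarily the fix adopted): assert that the observed locations are a subset of the expected set instead of requiring an exact match.

```python
# Expected set from test_locations above
expected = {
    "no-luster-1",
    "se-smedjebacken-1",
    "se-stockholm-1",
    "us-newyork-1",
    "us-santaclara-1",
}

# Simulated observation with 'se-smedjebacken-1' temporarily missing
observed = {"no-luster-1", "se-stockholm-1", "us-newyork-1", "us-santaclara-1"}

# Subset check passes even when a location is down
assert observed <= expected
print("ok")
```

The trade-off is that the test no longer fails when a location disappears permanently, so a subset check is best paired with a separate freshness alert.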

Temporarily disabling the nebius catalog

There is an error when the catalog is collected.

https://github.com/dstackai/gpuhunt/actions/runs/8731010583/job/23955739082

Run python -m gpuhunt nebius --output ../nebius.csv
2024-04-18 01:27:21,093 INFO Fetching offers for nebius
2024-04-18 01:27:22,629 INFO Fetching SKUs
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/__main__.py", line 89, in <module>
    main()
  File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/__main__.py", line 82, in main
    offers = provider.get()
             ^^^^^^^^^^^^^^
  File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/providers/nebius.py", line 117, in get
    offers = self.get_gpu_platforms(zone, platform_resources)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/providers/nebius.py", line 134, in get_gpu_platforms
    cpu * prices["cpu"]
          ~~~~~~^^^^^^^
KeyError: 'cpu'
Error: Process completed with exit code 1.
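A hypothetical defensive sketch for the `KeyError: 'cpu'` above: skip a GPU platform whose SKU prices are missing a required component instead of crashing the whole collection run. The `prices` dict and field names are illustrative.

```python
# Simulated SKU prices missing the 'cpu' component, as in the failing run
prices = {"gpu": 2.5, "ram": 0.01}

cpu, ram = 8, 64
if "cpu" not in prices or "ram" not in prices:
    # No complete pricing available; skip this platform instead of raising
    offer_price = None
else:
    offer_price = cpu * prices["cpu"] + ram * prices["ram"]
print(offer_price)  # None
```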

Error running example in README.md

There is an error running the example under Advanced usage.
KeyError: "There is no item named 'nebius.csv' in the archive" or KeyError: "There is no item named 'datacrunch.csv' in the archive"

Environment
gpuhunt version 0.0.5
macOS 13.0 Ventura
Python 3.11.7

Collect and publish catalogs more frequently

Currently, the gpuhunt catalog is collected once a day:

- cron: '0 1 * * *' # 01:00 UTC every day

This approach is not well suited for runpod since the runpod API returns only the offers available at the time of the request ( dstackai/dstack#1118).

I suggest we try mitigating the problem with runpod by collecting the catalog more frequently. This won't guarantee that users never get stale or missing offers, but in practice it should happen rarely enough not to be critical.

Since the Collect and publish catalogs workflow usually takes about 5 minutes, it should be fine to collect the catalogs every hour.
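As a GitHub Actions schedule, the change would be a one-line cron edit (the hourly expression below is the suggestion, not a merged change):

```yaml
on:
  schedule:
    - cron: '0 * * * *' # every hour at minute 0
```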
