
gpuhunt's People

Contributors

bihan, egor-s, jvstme, peterschmidt85, r4victor, thebits


gpuhunt's Issues

Exception when querying a catalog that contains CPU instances

Hi!

I get an exception on a simple query.

>>> from gpuhunt import Catalog
>>> catalog = Catalog()
>>> catalog.load(version="20231120")
>>> catalog.query(provider='nebius', gpu_name=['H100'])
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    catalog.query(provider='nebius', gpu_name=['H100'])
  File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 142, in query
    items = list(heapq.merge(*[f.result() for f in completed], key=lambda i: i.price))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 142, in <listcomp>
    items = list(heapq.merge(*[f.result() for f in completed], key=lambda i: i.price))
                               ^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 194, in _get_offline_provider_items
    if constraints.matches(item, query_filter):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/constraints.py", line 110, in matches
    if i.gpu_name.lower() not in q.gpu_name:
       ^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'lower'
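A minimal sketch of a possible guard for this crash, using simplified `CatalogItem`/`QueryFilter` stand-ins rather than gpuhunt's actual types: items without a GPU (`gpu_name` is `None`, i.e. CPU instances) are treated as non-matching whenever a `gpu_name` filter is set, instead of calling `.lower()` on `None`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Simplified stand-ins for gpuhunt's internal types (illustrative only)
@dataclass
class CatalogItem:
    gpu_name: Optional[str]
    price: float

@dataclass
class QueryFilter:
    gpu_name: Optional[List[str]] = None

def matches(i: CatalogItem, q: QueryFilter) -> bool:
    if q.gpu_name is not None:
        # Guard against CPU-only items, whose gpu_name is None
        if i.gpu_name is None or i.gpu_name.lower() not in q.gpu_name:
            return False
    return True

print(matches(CatalogItem(gpu_name=None, price=0.1), QueryFilter(gpu_name=["h100"])))  # False, no crash
```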

VastAI offers include maximum available disk_size instead of min_disk_size

When I specify disk as 85GB.., VastAI offers have disk_size set to offer["disk_size"], which is the maximum available disk size:

✗ dstack run . -b vastai
 Configuration          .dstack.yml                   
 Project                main                          
 User                   admin                         
 Pool name              default-pool                  
 Min resources          2..xCPU, 8GB.., 85GB.. (disk) 
 Max price              -                             
 Max duration           6h                            
 Spot policy            auto                          
 Retry policy           yes                           
 Creation policy        reuse-or-create               
 Termination policy     destroy-after-idle            
 Termination idle time  600s                          

 #  BACKEND  REGION     INSTANCE  RESOURCES                    SPOT  PRICE      
 1  vastai   es-spain   10245490  24xCPU, 8GB, 1xRTX4090       no    $0.59542   
                                  (24GB), 293.1GB (disk)                        
 2  vastai   tw-taiwan  9110786   32xCPU, 8GB, 1xRTX4090       no    $0.44132   
                                  (24GB), 713.362GB (disk)                      
 3  vastai   us-utah    10498512  16xCPU, 16GB, 2xRTX3090      no    $0.42361   
                                  (24GB), 357GB (disk)                          
    ...                                                                         
 Shown 3 of 54 offers, $11.2177 max

This is inconsistent with all other providers, which return min_disk_size. It also prevents the user from getting an offer with a disk_size smaller than the maximum.

✗ dstack run . -b runpod
 Configuration          .dstack.yml                   
 Project                main                          
 User                   admin                         
 Pool name              default-pool                  
 Min resources          2..xCPU, 8GB.., 85GB.. (disk) 
 Max price              -                             
 Max duration           6h                            
 Spot policy            auto                          
 Retry policy           yes                           
 Creation policy        reuse-or-create               
 Termination policy     destroy-after-idle            
 Termination idle time  600s                          

 #  BACKEND  REGION   INSTANCE              RESOURCES             SPOT  PRICE   
 1  runpod   EU-SE-1  NVIDIA RTX A4000      9xCPU, 50GB,          yes   $0.19   
                                            1xRTXA4000 (16GB),                  
                                            85GB (disk)                         
 2  runpod   EU-RO-1  NVIDIA RTX 4000 Ada   9xCPU, 50GB,          yes   $0.21   
                      Generation            1xRTX4000 (20GB),                   
                                            85GB (disk)                         
 3  runpod   EU-SE-1  NVIDIA RTX A5000      9xCPU, 43GB,          yes   $0.26   
                                            1xRTXA5000 (24GB),                  
                                            85GB (disk)                         
    ...                                                                         
 Shown 3 of 206 offers, $37.52 max

Needed to make dstackai/dstack#973 work.
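One possible shape of the fix, sketched with an illustrative helper (the function name and signature are hypothetical, not VastAI's or gpuhunt's API): report the requested minimum disk size, clamped by the offer's maximum, so the offer matches the behavior of other providers.

```python
# Hypothetical sketch: expose the smallest disk that satisfies the request,
# rather than the offer's maximum available disk.
def offer_disk_size(requested_min_gb: float, offer_max_gb: float) -> float:
    # An offer can't provide more than its maximum; within that bound,
    # report the requested minimum, as other providers do.
    return min(requested_min_gb, offer_max_gb)

print(offer_disk_size(85, 713.362))  # 85
```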

Sort VastAI offers by GPU count

VastAI offers are not sorted by price because the cheapest instances are unreliable.
However, the default ordering puts large instances (many GPUs) before smaller ones.

The proposed fix is to stable-sort by GPU count, keeping the original order within offers that have the same GPU count.
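A sketch of a stable sort by GPU count, per the issue title (the offer dicts below are illustrative, not gpuhunt's schema). Python's `list.sort` is stable, so a single key preserves the original, reliability-based order within each GPU-count group.

```python
# Illustrative offers in their original, reliability-based order
offers = [
    {"id": "a", "gpu_count": 8},
    {"id": "b", "gpu_count": 1},
    {"id": "c", "gpu_count": 1},
    {"id": "d", "gpu_count": 2},
]

# Stable sort: offers with equal gpu_count keep their relative order
offers.sort(key=lambda o: o["gpu_count"])
print([o["id"] for o in offers])  # ['b', 'c', 'd', 'a']
```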

DataCrunch

Hi!

Datacrunch.io is an interesting modern ML cloud with premium dedicated GPU servers and clusters.
I'll add support for this provider.

FAILED src/integrity_tests/test_all.py::TestAllCatalogs::test_non_zero_cost

test_non_zero_cost fails today.

One of the providers returns 0 prices.

Run pytest src/integrity_tests
============================= test session starts ==============================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0
rootdir: /home/runner/work/gpuhunt/gpuhunt
collected 25 items

src/integrity_tests/test_all.py F                                        [  4%]
src/integrity_tests/test_aws.py ...                                      [ 16%]
src/integrity_tests/test_azure.py .....                                  [ 36%]
src/integrity_tests/test_cudo.py F..                                     [ 48%]
src/integrity_tests/test_datacrunch.py ....                              [ 64%]
src/integrity_tests/test_gcp.py .....                                    [ 84%]
src/integrity_tests/test_nebius.py ....                                  [100%]

=================================== FAILURES ===================================
______________________ TestAllCatalogs.test_non_zero_cost ______________________

self = <integrity_tests.test_all.TestAllCatalogs object at 0x7f5729aaaa90>
catalog_files = [PosixPath('lambdalabs.csv'), PosixPath('cudo.csv'), PosixPath('nebius.csv'), PosixPath('gcp.csv'), PosixPath('aws.csv'), PosixPath('datacrunch.csv'), ...]

    def test_non_zero_cost(self, catalog_files: List[Path]):
        for file in catalog_files:
            with open(file, "r") as f:
                reader = csv.DictReader(f)
                prices = [float(row["price"]) for row in reader]
>           assert 0 not in prices
E           assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]

src/integrity_tests/test_all.py:19: AssertionError
________________________________ test_locations ________________________________

data_rows = [{'cpu': '1', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, {'cpu': '1', 'disk_size': '', 'gpu_count': ...u_count': '1', 'gpu_memory': '16', ...}, {'cpu': '3', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, ...]

    def test_locations(data_rows):
        expected = {
            "no-luster-1",
            "se-smedjebacken-1",
            "se-stockholm-1",
            "us-newyork-1",
            "us-santaclara-1",
        }
        locations = select_row(data_rows, "location")
>       assert set(locations) == expected
E       AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
E         
E         Extra items in the right set:
E         'se-smedjebacken-1'
E         
E         Full diff:
E           {
E               'no-luster-1',
E         -     'se-smedjebacken-1',
E               'se-stockholm-1',
E               'us-newyork-1',
E               'us-santaclara-1',
E           }

src/integrity_tests/test_cudo.py:32: AssertionError
---------------------------- Captured stdout setup -----------------------------
.
=========================== short test summary info ============================
FAILED src/integrity_tests/test_all.py::TestAllCatalogs::test_non_zero_cost - assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]
FAILED src/integrity_tests/test_cudo.py::test_locations - AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
  
  Extra items in the right set:
  'se-smedjebacken-1'
  
  Full diff:
    {
        'no-luster-1',
  -     'se-smedjebacken-1',
        'se-stockholm-1',
        'us-newyork-1',
        'us-santaclara-1',
    }
========================= 2 failed, 23 passed in 0.35s =========================
Error: Process completed with exit code 1.
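A hypothetical mitigation for the `test_non_zero_cost` failure, sketched with an inline CSV (the column name `price` matches the test above; the rows are made up): drop zero-price rows when building a catalog so they never reach the published CSV.

```python
import csv
import io

# Made-up catalog data; real catalogs come from the provider APIs
raw = "price,gpu_name\n0,A100\n1.5,H100\n"

reader = csv.DictReader(io.StringIO(raw))
# Keep only rows with a strictly positive price
rows = [r for r in reader if float(r["price"]) > 0]
print(len(rows))  # 1
```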

Implement public query API

  • Rename the package to gpuhunt
  • Upload to PyPI
  • Rename the S3 bucket to gpuhunt
  • Move the instance type filtering logic to dstack
  • Update the README.md

Cudo Compute: Improve the filtering of offers

The Cudo provider doesn't return any offers if I set both min_gpu_count and max_gpu_count to 0.

❯ python test_case.py
QueryFilter(provider=['cudo']) 3793
QueryFilter(provider=['cudo'], min_gpu_count=0) 458
QueryFilter(provider=['cudo'], min_gpu_count=0, max_gpu_count=0) 0
QueryFilter(provider=['cudo'], min_gpu_count=0, max_gpu_count=1) 29
QueryFilter(provider=['cudo'], min_gpu_count=1, max_gpu_count=1) 29
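A sketch of the expected range semantics (simplified, not gpuhunt's actual constraint code): with `min_gpu_count=0` and `max_gpu_count=0`, CPU-only offers (`gpu_count == 0`) should match, rather than the filter returning nothing.

```python
# Simplified range check; None means unbounded on that side
def in_range(value, lo=None, hi=None):
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    return True

offers = [{"gpu_count": 0}, {"gpu_count": 1}, {"gpu_count": 2}]

# min=0, max=0 should select exactly the CPU-only offer
cpu_only = [o for o in offers if in_range(o["gpu_count"], lo=0, hi=0)]
print(len(cpu_only))  # 1
```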

Add filtering options for `query`

  • Add min/max constraints for CPU, memory, gpu_count, gpu_memory, and gpu_total_memory parameters
  • Return the minimal configuration in clouds with flexible sizing (like TensorDock)
  • Add reasonable minimal constraints for all resources if only one (or a couple) of CPU, memory, or GPU is specified
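The min/max constraints above can be sketched as a small self-contained filter (the offer dicts and field names are illustrative, not gpuhunt's final API):

```python
# Generic min/max check over one offer field; None means unconstrained
def within(offer, key, lo=None, hi=None):
    v = offer[key]
    return (lo is None or v >= lo) and (hi is None or v <= hi)

offers = [
    {"cpu": 2, "memory": 8, "gpu_count": 0},
    {"cpu": 8, "memory": 32, "gpu_count": 1},
]

# e.g. at least 4 CPUs and between 1 and 2 GPUs
hits = [
    o for o in offers
    if within(o, "cpu", lo=4) and within(o, "gpu_count", lo=1, hi=2)
]
print(len(hits))  # 1
```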

Make a pre-defined dictionary with unified regions

Currently, every provider has its own region codes. While this information is useful, it would also be great to support unified region codes (e.g., us-east, europe-central) and map provider-specific region codes to the unified ones.

This will allow users to search for GPUs by region without having to know how regions are coded by each provider.

This will also be supported by https://github.com/dstackai/dstack
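The mapping could be as simple as a pre-defined dictionary keyed by provider and region; the unified names and entries below are hypothetical, not gpuhunt's actual data:

```python
# Hypothetical provider-region -> unified-region mapping
UNIFIED_REGIONS = {
    ("aws", "us-east-1"): "us-east",
    ("gcp", "us-east4"): "us-east",
    ("azure", "eastus"): "us-east",
    ("cudo", "se-stockholm-1"): "europe-north",
}

def unified_region(provider: str, region: str) -> str:
    # Fall back to the provider-specific code if no mapping exists
    return UNIFIED_REGIONS.get((provider, region), region)

print(unified_region("gcp", "us-east4"))  # us-east
```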

FAILED src/integrity_tests/test_cudo.py::test_locations

Test catalog integrity failed today.
https://github.com/dstackai/gpuhunt/actions/runs/8105089998/job/22172772003

Servers in the 'se-smedjebacken-1' location can sometimes be unavailable.

Run pytest src/integrity_tests
============================= test session starts ==============================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0
rootdir: /home/runner/work/gpuhunt/gpuhunt
collected 25 items

src/integrity_tests/test_all.py F                                        [  4%]
src/integrity_tests/test_aws.py ...                                      [ 16%]
src/integrity_tests/test_azure.py .....                                  [ 36%]
src/integrity_tests/test_cudo.py F..                                     [ 48%]
src/integrity_tests/test_datacrunch.py ....                              [ 64%]
src/integrity_tests/test_gcp.py .....                                    [ 84%]
src/integrity_tests/test_nebius.py ....                                  [100%]

=================================== FAILURES ===================================
______________________ TestAllCatalogs.test_non_zero_cost ______________________

self = <integrity_tests.test_all.TestAllCatalogs object at 0x7f5729aaaa90>
catalog_files = [PosixPath('lambdalabs.csv'), PosixPath('cudo.csv'), PosixPath('nebius.csv'), PosixPath('gcp.csv'), PosixPath('aws.csv'), PosixPath('datacrunch.csv'), ...]

    def test_non_zero_cost(self, catalog_files: List[Path]):
        for file in catalog_files:
            with open(file, "r") as f:
                reader = csv.DictReader(f)
                prices = [float(row["price"]) for row in reader]
>           assert 0 not in prices
E           assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]

src/integrity_tests/test_all.py:19: AssertionError
________________________________ test_locations ________________________________

data_rows = [{'cpu': '1', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, {'cpu': '1', 'disk_size': '', 'gpu_count': ...u_count': '1', 'gpu_memory': '16', ...}, {'cpu': '3', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, ...]

    def test_locations(data_rows):
        expected = {
            "no-luster-1",
            "se-smedjebacken-1",
            "se-stockholm-1",
            "us-newyork-1",
            "us-santaclara-1",
        }
        locations = select_row(data_rows, "location")
>       assert set(locations) == expected
E       AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
E         
E         Extra items in the right set:
E         'se-smedjebacken-1'
E         
E         Full diff:
E           {
E               'no-luster-1',
E         -     'se-smedjebacken-1',
E               'se-stockholm-1',
E               'us-newyork-1',
E               'us-santaclara-1',
E           }

src/integrity_tests/test_cudo.py:32: AssertionError
---------------------------- Captured stdout setup -----------------------------
.
=========================== short test summary info ============================
FAILED src/integrity_tests/test_all.py::TestAllCatalogs::test_non_zero_cost - assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]
FAILED src/integrity_tests/test_cudo.py::test_locations - AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
  
  Extra items in the right set:
  'se-smedjebacken-1'
  
  Full diff:
    {
        'no-luster-1',
  -     'se-smedjebacken-1',
        'se-stockholm-1',
        'us-newyork-1',
        'us-santaclara-1',
    }
========================= 2 failed, 23 passed in 0.35s =========================
Error: Process completed with exit code 1.
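One way to make `test_locations` tolerant of temporarily unavailable locations (a sketch of a possible change, not necessarily the fix adopted): assert that the observed locations are a subset of the expected set instead of requiring an exact match.

```python
# Expected set from test_locations above
expected = {
    "no-luster-1",
    "se-smedjebacken-1",
    "se-stockholm-1",
    "us-newyork-1",
    "us-santaclara-1",
}

# Simulated observation with 'se-smedjebacken-1' temporarily missing
observed = {"no-luster-1", "se-stockholm-1", "us-newyork-1", "us-santaclara-1"}

# Subset check passes even when a location is down
assert observed <= expected
print("ok")
```

The trade-off is that the test no longer fails when a location disappears permanently, so a subset check is best paired with a separate freshness alert.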

Temporarily disabling the nebius catalog

There is an error when the catalog is collected.

https://github.com/dstackai/gpuhunt/actions/runs/8731010583/job/23955739082

Run python -m gpuhunt nebius --output ../nebius.csv
2024-04-18 01:27:21,093 INFO Fetching offers for nebius
2024-04-18 01:27:22,629 INFO Fetching SKUs
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/__main__.py", line 89, in <module>
    main()
  File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/__main__.py", line 82, in main
    offers = provider.get()
             ^^^^^^^^^^^^^^
  File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/providers/nebius.py", line 117, in get
    offers = self.get_gpu_platforms(zone, platform_resources)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/providers/nebius.py", line 134, in get_gpu_platforms
    cpu * prices["cpu"]
          ~~~~~~^^^^^^^
KeyError: 'cpu'
Error: Process completed with exit code 1.
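A hypothetical defensive sketch for the `KeyError: 'cpu'` above: skip a GPU platform whose SKU prices are missing a required component instead of crashing the whole collection run. The `prices` dict and field names are illustrative.

```python
# Simulated SKU prices missing the 'cpu' component, as in the failing run
prices = {"gpu": 2.5, "ram": 0.01}

cpu, ram = 8, 64
if "cpu" not in prices or "ram" not in prices:
    # No complete pricing available; skip this platform instead of raising
    offer_price = None
else:
    offer_price = cpu * prices["cpu"] + ram * prices["ram"]
print(offer_price)  # None
```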

Error running example in README.md

There is an error running the example under Advanced usage.
KeyError: "There is no item named 'nebius.csv' in the archive" or KeyError: "There is no item named 'datacrunch.csv' in the archive"

Environment
gpuhunt version 0.0.5
macOS 13.0 Ventura
Python 3.11.7

Collect and publish catalogs more frequently

Currently, the gpuhunt catalog is collected once a day:

- cron: '0 1 * * *' # 01:00 UTC every day

This approach is not well suited for runpod since the runpod API returns only the offers available at the time of the request ( dstackai/dstack#1118).

I suggest we try mitigating the problem with runpod by collecting the catalog more frequently. This won't guarantee that users never get stale or missing offers, but in practice it should happen rarely enough not to be critical.

Since the Collect and publish catalogs workflow usually takes about 5 minutes, it should be fine to collect the catalogs every hour.
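As a GitHub Actions schedule, the change would be a one-line cron edit (the hourly expression below is the suggestion, not a merged change):

```yaml
on:
  schedule:
    - cron: '0 * * * *' # every hour at minute 0
```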
