dstackai / gpuhunt
GPU prices aggregator for cloud providers
License: Mozilla Public License 2.0
gpu_memory must contain the value per GPU, not the total.
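The convention can be illustrated with a small helper (the function name is illustrative, not part of the gpuhunt API):

```python
def per_gpu_memory(total_memory_gb: float, gpu_count: int) -> float:
    """Convert a total GPU memory figure to the per-GPU value expected in gpu_memory."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return total_memory_gb / gpu_count

# An 8-GPU node with 80 GB cards reports 640 GB in total,
# but gpu_memory must be 80.
assert per_gpu_memory(640, 8) == 80.0
```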
Hi!
I get an exception on a simple query:
>>> from gpuhunt import Catalog
>>> catalog = Catalog()
>>> catalog.load(version="20231120")
>>> catalog.query(provider='nebius', gpu_name=['H100'])
Traceback (most recent call last):
File "<input>", line 1, in <module>
catalog.query(provider='nebius', gpu_name=['H100'])
File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 142, in query
items = list(heapq.merge(*[f.result() for f in completed], key=lambda i: i.price))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 142, in <listcomp>
items = list(heapq.merge(*[f.result() for f in completed], key=lambda i: i.price))
^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/catalog.py", line 194, in _get_offline_provider_items
if constraints.matches(item, query_filter):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/dstack/gpuhunt/.direnv/python-3.11/lib/python3.11/site-packages/gpuhunt/_internal/constraints.py", line 110, in matches
if i.gpu_name.lower() not in q.gpu_name:
^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'lower'
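The crash happens because a CPU-only offer has gpu_name=None while the filter calls .lower() on it unconditionally. A guard would avoid this; a minimal sketch of the failing check (an illustrative function, not the actual constraints.py code):

```python
from typing import List, Optional

def gpu_name_matches(item_gpu_name: Optional[str], query_gpu_names: Optional[List[str]]) -> bool:
    """Mimic the check in constraints.matches, guarding against gpu_name=None.
    Assumes query_gpu_names is already lowercased, as the query filter appears to do."""
    if query_gpu_names is None:
        return True   # no GPU-name constraint
    if item_gpu_name is None:
        return False  # a CPU-only offer cannot match a GPU-name filter
    return item_gpu_name.lower() in query_gpu_names

assert gpu_name_matches(None, ["h100"]) is False  # no AttributeError
assert gpu_name_matches("H100", ["h100"]) is True
```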
When I specify the disk as 85GB.., VastAI offers have disk_size set to offer["disk_size"], which is the maximum available disk size:
✗ dstack run . -b vastai
Configuration .dstack.yml
Project main
User admin
Pool name default-pool
Min resources 2..xCPU, 8GB.., 85GB.. (disk)
Max price -
Max duration 6h
Spot policy auto
Retry policy yes
Creation policy reuse-or-create
Termination policy destroy-after-idle
Termination idle time 600s
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 vastai es-spain 10245490 24xCPU, 8GB, 1xRTX4090 no $0.59542
(24GB), 293.1GB (disk)
2 vastai tw-taiwan 9110786 32xCPU, 8GB, 1xRTX4090 no $0.44132
(24GB), 713.362GB (disk)
3 vastai us-utah 10498512 16xCPU, 16GB, 2xRTX3090 no $0.42361
(24GB), 357GB (disk)
...
Shown 3 of 54 offers, $11.2177 max
This is inconsistent with all other providers, which return the minimum disk size. It also prevents the user from getting an offer with a disk_size smaller than the maximum one.
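One way to make the behavior consistent (an illustrative sketch, not the shipped fix) is to advertise the requested minimum disk size when the host can provide it, rather than the host's maximum:

```python
from typing import Optional

def effective_disk_size(requested_min_gb: float, host_max_gb: float) -> Optional[float]:
    """Return the disk size to advertise for an offer: the requested minimum
    if the host can provide it, otherwise None (the offer is filtered out)."""
    if host_max_gb < requested_min_gb:
        return None
    return requested_min_gb

assert effective_disk_size(85, 293.1) == 85  # advertise 85GB, not 293.1GB
assert effective_disk_size(85, 50) is None   # host too small, no offer
```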
✗ dstack run . -b runpod
Configuration .dstack.yml
Project main
User admin
Pool name default-pool
Min resources 2..xCPU, 8GB.., 85GB.. (disk)
Max price -
Max duration 6h
Spot policy auto
Retry policy yes
Creation policy reuse-or-create
Termination policy destroy-after-idle
Termination idle time 600s
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 runpod EU-SE-1 NVIDIA RTX A4000 9xCPU, 50GB, yes $0.19
1xRTXA4000 (16GB),
85GB (disk)
2 runpod EU-RO-1 NVIDIA RTX 4000 Ada 9xCPU, 50GB, yes $0.21
Generation 1xRTX4000 (20GB),
85GB (disk)
3 runpod EU-SE-1 NVIDIA RTX A5000 9xCPU, 43GB, yes $0.26
1xRTXA5000 (24GB),
85GB (disk)
...
Shown 3 of 206 offers, $37.52 max
Needed to make dstackai/dstack#973 work.
VastAI offers are not sorted by price, because the cheapest instances are unreliable.
However, the default ordering lists large instances (many GPUs) before smaller ones.
The proposed fix is to stable-sort by price, keeping the original order within offers that have the same GPU count.
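Since Python's sorted() is stable, a stable price sort can be expressed directly; a sketch with hypothetical (instance_id, price) tuples rather than the actual provider objects:

```python
# Offers in VastAI's original reliability-based order.
# sorted() is stable, so offers with equal prices keep that original order.
offers = [("a", 0.50), ("b", 0.40), ("c", 0.40), ("d", 0.60)]
by_price = sorted(offers, key=lambda o: o[1])
assert by_price == [("b", 0.40), ("c", 0.40), ("a", 0.50), ("d", 0.60)]
```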
Hi!
Datacrunch.io is an interesting modern ML cloud with premium dedicated GPU servers and clusters.
I'll add support for this provider.
Hi!
I've written a few tests for providers and catalog.
test_non_zero_cost fails today.
One of the providers returns 0 prices.
Run pytest src/integrity_tests
============================= test session starts ==============================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0
rootdir: /home/runner/work/gpuhunt/gpuhunt
collected 25 items
src/integrity_tests/test_all.py F [ 4%]
src/integrity_tests/test_aws.py ... [ 16%]
src/integrity_tests/test_azure.py ..... [ 36%]
src/integrity_tests/test_cudo.py F.. [ 48%]
src/integrity_tests/test_datacrunch.py .... [ 64%]
src/integrity_tests/test_gcp.py ..... [ 84%]
src/integrity_tests/test_nebius.py .... [100%]
=================================== FAILURES ===================================
______________________ TestAllCatalogs.test_non_zero_cost ______________________
self = <integrity_tests.test_all.TestAllCatalogs object at 0x7f5729aaaa90>
catalog_files = [PosixPath('lambdalabs.csv'), PosixPath('cudo.csv'), PosixPath('nebius.csv'), PosixPath('gcp.csv'), PosixPath('aws.csv'), PosixPath('datacrunch.csv'), ...]
def test_non_zero_cost(self, catalog_files: List[Path]):
for file in catalog_files:
with open(file, "r") as f:
reader = csv.DictReader(f)
prices = [float(row["price"]) for row in reader]
> assert 0 not in prices
E assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]
src/integrity_tests/test_all.py:19: AssertionError
________________________________ test_locations ________________________________
data_rows = [{'cpu': '1', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, {'cpu': '1', 'disk_size': '', 'gpu_count': ...u_count': '1', 'gpu_memory': '16', ...}, {'cpu': '3', 'disk_size': '', 'gpu_count': '1', 'gpu_memory': '16', ...}, ...]
def test_locations(data_rows):
expected = {
"no-luster-1",
"se-smedjebacken-1",
"se-stockholm-1",
"us-newyork-1",
"us-santaclara-1",
}
locations = select_row(data_rows, "location")
> assert set(locations) == expected
E AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
E
E Extra items in the right set:
E 'se-smedjebacken-1'
E
E Full diff:
E {
E 'no-luster-1',
E - 'se-smedjebacken-1',
E 'se-stockholm-1',
E 'us-newyork-1',
E 'us-santaclara-1',
E }
src/integrity_tests/test_cudo.py:32: AssertionError
---------------------------- Captured stdout setup -----------------------------
.
=========================== short test summary info ============================
FAILED src/integrity_tests/test_all.py::TestAllCatalogs::test_non_zero_cost - assert 0 not in [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]
FAILED src/integrity_tests/test_cudo.py::test_locations - AssertionError: assert {'no-luster-1...santaclara-1'} == {'no-luster-1...santaclara-1'}
Extra items in the right set:
'se-smedjebacken-1'
Full diff:
{
'no-luster-1',
- 'se-smedjebacken-1',
'se-stockholm-1',
'us-newyork-1',
'us-santaclara-1',
}
========================= 2 failed, 23 passed in 0.35s =========================
Error: Process completed with exit code 1.
There is a problem in dstack when the field gpu_count is greater than zero and gpu_name is None. This raises an error when creating the Gpu class.
The Cudo provider doesn't return any offers if I set min_gpu_count and max_gpu_count to 0.
❯ python test_case.py
QueryFilter(provider=['cudo']) 3793
QueryFilter(provider=['cudo'], min_gpu_count=0) 458
QueryFilter(provider=['cudo'], min_gpu_count=0, max_gpu_count=0) 0
QueryFilter(provider=['cudo'], min_gpu_count=0, max_gpu_count=1) 29
QueryFilter(provider=['cudo'], min_gpu_count=1, max_gpu_count=1) 29
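For reference, the expected range semantics can be sketched as below (illustrative names, not the gpuhunt internals; the actual bug may also be that the provider simply emits no CPU-only configurations):

```python
from typing import Optional

def gpu_count_matches(gpu_count: int, min_count: Optional[int], max_count: Optional[int]) -> bool:
    """True if an offer's GPU count falls within the inclusive query range.
    max_gpu_count=0 should admit CPU-only offers, not reject everything."""
    if min_count is not None and gpu_count < min_count:
        return False
    if max_count is not None and gpu_count > max_count:
        return False
    return True

assert gpu_count_matches(0, 0, 0) is True   # CPU-only offers should match
assert gpu_count_matches(1, 0, 0) is False  # GPU offers excluded by max=0
```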
Providers must return offers in the optimal order, because further steps do not perform any sorting.
Currently, every provider has its own region codes. While this information is useful, it would also be great to support unified region codes (e.g. us-east, europe-central, etc.) and map provider-specific region codes to the unified ones.
This will allow the user to search GPUs by region without having to know how these regions are coded in every provider.
This will also be supported by https://github.com/dstackai/dstack
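A sketch of such a mapping (the unified names and the mapping entries are illustrative, not a proposed final scheme):

```python
from typing import Optional

# Illustrative mapping from (provider, provider_region) to a unified code.
UNIFIED_REGIONS = {
    ("aws", "us-east-1"): "us-east",
    ("gcp", "us-east4"): "us-east",
    ("cudo", "us-newyork-1"): "us-east",
    ("cudo", "se-stockholm-1"): "europe-north",
}

def unified_region(provider: str, region: str) -> Optional[str]:
    """Return the unified region code, or None if the region is unmapped."""
    return UNIFIED_REGIONS.get((provider, region))

assert unified_region("aws", "us-east-1") == "us-east"
assert unified_region("cudo", "se-stockholm-1") == "europe-north"
```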
The Test catalog integrity workflow failed today.
https://github.com/dstackai/gpuhunt/actions/runs/8105089998/job/22172772003
The servers located in the 'se-smedjebacken-1' location can sometimes be unavailable.
(The CI log is identical to the pytest output shown above: test_non_zero_cost and test_locations fail; 2 failed, 23 passed.)
We cannot rely on the docs pages and should hardcode presets until an API becomes available.
The Collect and Publish catalog workflow broke today: there is an error when the catalog is collected.
https://github.com/dstackai/gpuhunt/actions/runs/8731010583/job/23955739082
Run python -m gpuhunt nebius --output ../nebius.csv
2024-04-18 01:27:21,093 INFO Fetching offers for nebius
2024-04-18 01:27:22,629 INFO Fetching SKUs
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/__main__.py", line 89, in <module>
main()
File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/__main__.py", line 82, in main
offers = provider.get()
^^^^^^^^^^^^^^
File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/providers/nebius.py", line 117, in get
offers = self.get_gpu_platforms(zone, platform_resources)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/work/gpuhunt/gpuhunt/src/gpuhunt/providers/nebius.py", line 134, in get_gpu_platforms
cpu * prices["cpu"]
~~~~~~^^^^^^^
KeyError: 'cpu'
Error: Process completed with exit code 1.
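The traceback shows the price calculation assumes a "cpu" SKU is always present. A defensive sketch of the failing lookup (the surrounding code is paraphrased, not the actual nebius.py; the "ram" key and logging are assumptions):

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def platform_price(cpu: int, memory_gb: int, prices: dict) -> Optional[float]:
    """Compute a platform's hourly price. Skip platforms whose SKU prices
    are incomplete instead of raising KeyError: 'cpu'."""
    if "cpu" not in prices or "ram" not in prices:
        logger.warning("Incomplete SKU prices, skipping platform: %s", sorted(prices))
        return None
    return cpu * prices["cpu"] + memory_gb * prices["ram"]

assert platform_price(4, 16, {}) is None  # missing SKUs are skipped, not fatal
assert abs(platform_price(4, 16, {"cpu": 0.01, "ram": 0.002}) - 0.072) < 1e-9
```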
There is an error running the example under Advanced usage:
KeyError: "There is no item named 'nebius.csv' in the archive"
or KeyError: "There is no item named 'datacrunch.csv' in the archive"
Environment
gpuhunt version 0.0.5
macOS 13 Ventura
Python 3.11.7
VastAI hosts report different GPU memory for the same GPU model name.
gpuhunt must normalize memory values according to the GPU model.
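A sketch of such normalization (the table entries are illustrative, not gpuhunt's actual GPU data):

```python
# Illustrative canonical per-GPU memory (GB) by model name.
CANONICAL_GPU_MEMORY = {
    "RTX4090": 24.0,
    "H100": 80.0,
}

def normalize_gpu_memory(gpu_name: str, reported_gb: float) -> float:
    """Replace host-reported memory with the canonical value for the model,
    falling back to the reported figure for unknown GPUs."""
    return CANONICAL_GPU_MEMORY.get(gpu_name, reported_gb)

assert normalize_gpu_memory("RTX4090", 23.6) == 24.0    # host under-reports
assert normalize_gpu_memory("UnknownGPU", 12.0) == 12.0  # unknown: keep as-is
```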
dstackai/dstack#855 requires a list of known GPUs that dstack supports. It's already available in gpuhunt as KNOWN_GPUS, so let's export it for reuse.
Sorting by price is not always the best option; we should support custom ordering and preserve it later during the merge.
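The catalog already merges per-provider results with heapq.merge (see the traceback earlier). heapq.merge is stable, so as long as each provider list is pre-sorted by the merge key, each provider's internal order survives and equal keys prefer the earlier input. A sketch with hypothetical (instance_id, price) tuples:

```python
import heapq

# Two providers, each already sorted by price but otherwise in their own
# preferred (e.g. reliability-based) order.
provider_a = [("a1", 0.40), ("a2", 0.50)]
provider_b = [("b1", 0.40), ("b2", 0.60)]

# heapq.merge is stable: for equal prices, a1 (earlier input) comes first.
merged = list(heapq.merge(provider_a, provider_b, key=lambda o: o[1]))
assert merged == [("a1", 0.40), ("b1", 0.40), ("a2", 0.50), ("b2", 0.60)]
```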
Hi!
I think this is a consistent way to store the requirements.
The workflow Collect and publish catalogs isn't working.
Currently, the gpuhunt catalog is collected once a day:
gpuhunt/.github/workflows/catalogs.yml (line 10 in c8bb9a7)
This approach is not well suited for runpod, since the runpod API returns only the offers available at the time of the request (dstackai/dstack#1118).
I suggest we try mitigating the problem by collecting the catalog more frequently. This won't guarantee that users never get stale or missing offers, but it should be rare enough in practice not to be critical.
Since the Collect and publish catalogs workflow usually takes about 5 minutes, it should be fine to collect the catalogs every hour.
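The change would amount to adjusting the cron expression in the workflow's schedule trigger; a sketch (the actual catalogs.yml may differ):

```yaml
on:
  schedule:
    # every hour at minute 0, instead of once a day
    - cron: "0 * * * *"
```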