hejing / instance_containize
Issues encountered when using bitdeer.ai
Issue:
When initiating distributed training with two instances, InstanceA and InstanceB, each equipped with two GPUs, two significant bottlenecks block further progress:
The total estimated time for this job was initially 2.5 hours with 4 GPUs on a single node. However, with the current 1 Gbps inter-node bandwidth for distributed training, the projected training time has surged to 45 hours.
Our suggestion is to lift the port restrictions inside the intranet, i.e. within the private address ranges 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16.
The terminal connected to the BitDeer server stalls at every checkpoint save.
My training script saves a checkpoint every 30 minutes, and the terminal freezes for almost a minute each time until the save completes.
I can guess two reasons: 1) my saving method may be a memory hog that takes up too much CPU RAM; 2) the target disk relies on network transfer instead of local file writing, so each save consumes a lot of bandwidth. [I strongly suspect this second point.]
import transformers
# is_deepspeed_zero3_enabled lives in transformers' DeepSpeed integration
from transformers.integrations import deepspeed


def safe_save_model_for_hf_trainer(
    trainer: transformers.Trainer, output_dir: str, bias="none"
):
    """Collect the state dict and dump it to disk."""
    # check if ZeRO-3 mode is enabled
    if deepspeed.is_deepspeed_zero3_enabled():
        # consolidate the partitioned weights into a single fp16 state dict
        state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
    else:
        if trainer.args.use_lora:
            # get_peft_state_maybe_zero_3 is defined elsewhere in the script
            state_dict = get_peft_state_maybe_zero_3(
                trainer.model.named_parameters(), bias
            )
        else:
            state_dict = trainer.model.state_dict()
    # only rank 0 writes, so every worker does not hit the disk at once
    if trainer.args.should_save and trainer.args.local_rank == 0:
        trainer._save(output_dir, state_dict=state_dict)
Is RAM overloaded during saving? If yes, please give tips to users who create small-RAM instances.
Does saving go to a remote disk, i.e., is the checkpoint transferred to the Data disk over the network, occupying bandwidth? If yes, please expand the bandwidth or give tips to the user.
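If the second hypothesis is right, one mitigation on the user side is to write the checkpoint to fast local scratch first and copy it to the network-backed disk in the background, so training (and the terminal) only blocks for the local write. A minimal sketch; `save_then_offload` and `save_fn` are hypothetical names, not part of any BitDeer or Hugging Face API:

```python
import shutil
import tempfile
import threading
from pathlib import Path


def save_then_offload(save_fn, final_dir):
    """Write a checkpoint to local scratch, then copy it to the
    (possibly network-backed) destination in a background thread."""
    local_dir = Path(tempfile.mkdtemp(prefix="ckpt_"))
    save_fn(str(local_dir))  # e.g. trainer._save(str(local_dir), state_dict=...)

    def _offload():
        dest = Path(final_dir)
        dest.mkdir(parents=True, exist_ok=True)
        for item in local_dir.iterdir():
            shutil.copy2(item, dest / item.name)
        shutil.rmtree(local_dir)  # free the scratch space once copied

    worker = threading.Thread(target=_offload)
    worker.start()
    return worker  # join() before the next save to bound scratch usage
```

The thread should be joined before the next checkpoint (or before exit) so that a slow network copy cannot pile up scratch directories.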
Issue:
Given two instances, named instanceA and instanceB, their corresponding disks are named diskA and diskB.
There is a bug that can mistakenly delete diskA when you intend to delete instanceB and diskB by asking the system to destroy them immediately.
This is the snapshot of our test instance, you can trace them to identify the root cause.
Reproduce:
Please make sure to check the "Destroy immediately" option shown in the following picture.
We acknowledge that this bug only occurs occasionally, so it may be hard to reproduce.
Dear User:
We would like to inform you that according to the system, your account balance is less than $100.00, and there are on-demand items. This may result in forthcoming account arrears.
.......
I received 12 emails with identical content within 25 minutes, which is a little weird; the email interface may need to be checked.
Lots of AI developers focus on GPU tasks and want to launch their training jobs easily, e.g., by modifying a few lines of code and starting training.
So quick, easy usage is essential. I suggest an end-to-end Docker image that bundles the most popular libraries, so users do not need to install them themselves.
For example, installing the CUDA toolkit and Anaconda took a long time (40 minutes); this could be avoided by integrating them into a Docker image rather than an OS image.
pip install deepspeed
ls /usr/local/cuda   # check whether the CUDA toolkit is present
which nvcc           # check whether nvcc is on the PATH
Collecting deepspeed (from -r requirements.txt (line 5))
Using cached deepspeed-0.14.2.tar.gz (1.3 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [27 lines of output]
[2024-05-27 01:37:52,346] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-27 01:37:52,450] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/setup.py", line 37, in <module>
from op_builder import get_default_compute_capabilities, OpBuilder
File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/__init__.py", line 18, in <module>
import deepspeed.ops.op_builder # noqa: F401 # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/ops/__init__.py", line 15, in <module>
from ..git_version_info import compatible_ops as __compatible_ops__
File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/git_version_info.py", line 29, in <module>
op_compatible = builder.is_compatible()
^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/fp_quantizer.py", line 29, in is_compatible
sys_cuda_major, _ = installed_cuda_version()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/builder.py", line 50, in installed_cuda_version
raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
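The MissingCUDAException above also suggests a workaround without rebuilding the image: export CUDA_HOME before installing, so DeepSpeed's setup can locate nvcc (the path below is an assumption; verify it with `ls /usr/local/cuda` first):

```shell
# Assumed toolkit location: adjust to where CUDA is actually installed.
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
# then retry the failing install:
# which nvcc
# pip install deepspeed
```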
Please build the docker image from nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04.
# for example
docker pull nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04
# with this docker image, we can see nvcc's installed path
which nvcc
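An end-to-end image along these lines could be sketched as follows; the base tag comes from the suggestion above, while the package list and pinned-free versions are assumptions, not a tested build:

```dockerfile
# Sketch of an end-to-end training image (library choices are assumptions).
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip git libaio-dev \
    && rm -rf /var/lib/apt/lists/*

# CUDA_HOME lets DeepSpeed find nvcc when it compiles its CUDA ops
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=${CUDA_HOME}/bin:${PATH}

RUN pip3 install --no-cache-dir torch transformers peft deepspeed
```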
Lots of AI developers are concerned about their training jobs stopping unexpectedly, e.g., from running out of disk space or out of memory.
Can we expand the initial disk to 200 GB instead of the current 100 GB? In the era of LLMs, the model weights take up a lot of disk space before training even starts.
For example, here is the detailed disk consumption from my first training run on the BitDeer platform: Anaconda takes 30 GB, two llama-7b models' weights take 26 GB, and one checkpoint takes 10 GB while keeping some intermediate training state.
After the job starts, it is difficult to mount or add newly purchased disks without stopping it.
So I would suggest initializing instances with a 200 GB disk.
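Plugging in the figures above makes the squeeze concrete; the number of retained checkpoints is an assumption, everything else is from the report:

```python
# Rough disk budget in GB, using the numbers reported above.
anaconda = 30      # Anaconda install
weights = 26       # two llama-7b weight copies
checkpoint = 10    # one checkpoint incl. intermediate state
kept = 4           # assumption: keep the last 4 checkpoints

total = anaconda + weights + checkpoint * kept
print(total)  # 96 -- nearly the whole 100 GB before datasets or logs
```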
A trashed instance is still active and still incurring cost.
You can reproduce this by stopping an on-demand job and archiving it via the delete command: do not delete it immediately, but drop it into the trash instead. The system still treats this archived instance as active and charges money.
I think an archived job is dead, releasing all GPU resources, and should not cost money.
Trashed instances still incur cost.
Can we treat trashing a job as the proper way to release all resources and stop the on-demand billing, so that it cannot cause any further costs?