
instance_containize's People

Contributors: jianwang-ntu

instance_containize's Issues

The limited 1 Gbps bandwidth is a bottleneck for distributed training and causes high latency when transferring gradient data between nodes.

Issue:

When initiating distributed training with two instances, InstanceA and InstanceB, each equipped with two GPUs, two significant bottlenecks block further progress:

  1. Bandwidth limitation: bandwidth is capped at 1 Gbps, which creates a traffic jam during data transfer between nodes. The following screenshot was taken when we measured the maximum bandwidth Bitdeer could provide.
[screenshot: maximum bandwidth measurement]

The total estimated time for this job was 2.5 hours with 4 GPUs on a single node. With the current 1 Gbps bandwidth for distributed training, the projected training time surges to 45 hours.

[screenshot: projected training time]
  2. Port constraints: another challenge arises from port limitations. It took us some time to understand why InstanceA and InstanceB could not communicate over TCP sockets. Without familiarity with Bitdeer's whitelist policy, beginners may spend a long time debugging their distributed training script instead of pinpointing the actual issue, the port constraint.

Our suggestion is to lift the port constraints inside the intranet, i.e. for the private address ranges 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16; a quick way to verify connectivity is sketched below.

[screenshot]
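Before launching the distributed job, a quick connectivity and bandwidth check between the two instances can separate a whitelist/port problem from a bug in the training script. This is only a sketch: the intranet IP 192.168.0.10 and rendezvous port 29500 are hypothetical, and netcat/iperf3 may need to be installed first.

# on InstanceA (hypothetical intranet IP 192.168.0.10): start an iperf3 server on the rendezvous port
iperf3 -s -p 29500

# on InstanceB: first confirm the port is reachable at all, then measure node-to-node bandwidth
nc -zv 192.168.0.10 29500
iperf3 -c 192.168.0.10 -p 29500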

The terminal does not respond while a checkpoint is being saved

Explain the issue

The terminal connected to the BitDeer server hangs every time a checkpoint is saved.
My training script saves a checkpoint every 30 minutes, and each time the terminal freezes for almost a minute until the save completes.

I can guess two reasons: (1) my saving method may be a CPU hog that takes up too much host memory; (2) the target disk relies on network transfer instead of local file writing, so each save consumes much more bandwidth. [strongly agree with this point]

Reproduce script in training

import transformers
from transformers.integrations import deepspeed  # provides is_deepspeed_zero3_enabled


def safe_save_model_for_hf_trainer(
    trainer: transformers.Trainer, output_dir: str, bias="none"
):
    """Collects the state dict and dumps it to disk."""
    # Check whether DeepSpeed ZeRO-3 is enabled; if so, the full 16-bit state
    # dict must first be consolidated from the parameters sharded across ranks.
    if deepspeed.is_deepspeed_zero3_enabled():
        state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
    else:
        if trainer.args.use_lora:
            # get_peft_state_maybe_zero_3 is a LoRA helper defined elsewhere in the training script.
            state_dict = get_peft_state_maybe_zero_3(
                trainer.model.named_parameters(), bias
            )
        else:
            state_dict = trainer.model.state_dict()
    # Only rank 0 writes to disk; this is the step during which the terminal hangs.
    if trainer.args.should_save and trainer.args.local_rank == 0:
        trainer._save(output_dir, state_dict=state_dict)

Recommended resolving methods:

Check whether RAM is overloaded during saving; if YES, please show a tip to the user when they create an instance with little RAM.

Check whether the save goes to a remote disk, i.e. the checkpoint is transferred to the data disk over the network and occupies bandwidth; if YES, please expand the bandwidth or give a tip to the user. A quick way to check both suspicions is sketched below.
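A minimal sketch of how to check both suspicions (the training entry point finetune.py and the /data mount point are assumptions about a typical setup, not BitDeer specifics):

# 1) Peak host RAM of a run that saves a checkpoint:
#    look for "Maximum resident set size" in the GNU time report.
/usr/bin/time -v python finetune.py   # hypothetical training entry point

# 2) Raw sequential write throughput of the data disk: a network-backed
#    volume typically shows far lower numbers than a local NVMe disk.
dd if=/dev/zero of=/data/dd_test.bin bs=1M count=4096 oflag=direct status=progress
rm /data/dd_test.bin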

Instance disk data can be mistakenly deleted

Issue:
Given two instances, instanceA and instanceB, whose corresponding disks are diskA and diskB:

There is a bug that can mistakenly delete diskA when you intend to delete instanceB and diskB and ask the system to destroy them immediately.

This is a snapshot of our test instances; you can trace them to identify the root cause.
[screenshot: test instances]

Reproduce:
Please make sure to check the "Destroy immediately" option shown in the following picture.

[screenshot: "Destroy immediately" option]

We acknowledge that this bug occurs only occasionally, so it may be hard to reproduce and identify.

Duplicate reminder emails with titles like "[Reminder] Balance is running low."

Dear User:

We would like to inform you that according to the system, your account balance is less than $100.00, and there are on-demand items. This may result in forthcoming account arrears.
.......

I have received 12 emails with identical content within 25 minutes, which is a little weird. The email-sending interface may need to be checked.

[screenshot: duplicated reminder emails]

Docker image vs OS image

Explain why the current setup is not satisfactory

Many AI developers focus on GPU tasks and want to start training easily, e.g. by modifying a few lines of code and launching the job.

So quick and easy use is essential. I suggest an end-to-end Docker image that already includes the commonly used libraries, so the user does not need to install them by themselves.

For example, installing the CUDA toolkit and Anaconda took a long time (about 40 minutes); this could be avoided if they were integrated into a Docker image rather than an OS image.

How to reproduce

pip install deepspeed   # fails on the stock OS image; the error is shown below
ls /usr/local/cuda      # check whether the CUDA toolkit directory exists
which nvcc              # check whether nvcc is on the PATH

Error output

Collecting deepspeed (from -r requirements.txt (line 5))
  Using cached deepspeed-0.14.2.tar.gz (1.3 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      [2024-05-27 01:37:52,346] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      [2024-05-27 01:37:52,450] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/setup.py", line 37, in <module>
          from op_builder import get_default_compute_capabilities, OpBuilder
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/__init__.py", line 18, in <module>
          import deepspeed.ops.op_builder  # noqa: F401 # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/__init__.py", line 25, in <module>
          from . import ops
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/ops/__init__.py", line 15, in <module>
          from ..git_version_info import compatible_ops as __compatible_ops__
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/deepspeed/git_version_info.py", line 29, in <module>
          op_compatible = builder.is_compatible()
                          ^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/fp_quantizer.py", line 29, in is_compatible
          sys_cuda_major, _ = installed_cuda_version()
                              ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-install-cup88g1b/deepspeed_0549dc6032cc47d4a5e510880856770f/op_builder/builder.py", line 50, in installed_cuda_version
          raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
      op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
       [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
       [WARNING]  async_io: please install the libaio-dev package with apt
       [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
       [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Recommended resolving idea

Please build the Docker image from nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04.

# for example
docker pull nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04
# with this image, nvcc is already installed and its path can be found with
which nvcc
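Below is a minimal sketch of what such an end-to-end image could look like on top of the CUDA devel base; the package list, the two-step pip install, and the bitdeer-train:sketch tag are my assumptions, not an official BitDeer image.

cat > Dockerfile <<'EOF'
FROM nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04
# System packages: Python, pip, libaio (for DeepSpeed async_io) and git.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip libaio-dev git && \
    rm -rf /var/lib/apt/lists/*
# nvcc ships with the devel base image, so CUDA_HOME can point straight at it.
ENV CUDA_HOME=/usr/local/cuda
# torch must be installed before deepspeed, whose setup.py imports it.
RUN pip3 install --no-cache-dir torch
RUN pip3 install --no-cache-dir transformers deepspeed
EOF
docker build -t bitdeer-train:sketch .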

[Feature] Give more guidance to users when they want to attach new disks

Many AI developers are worried about their training jobs stopping unexpectedly, e.g. due to running out of disk space or out of memory.

Feature request

Can we expand the initial disk to 200 GB instead of the current 100 GB? In the era of LLMs, the model weights alone take a lot of disk space before training even starts.

For example, the detailed disk consumption from my first training run on the BitDeer platform: Anaconda takes 30 GB, the weights of two llama-7b models take 26 GB, and one checkpoint takes about 10 GB including some intermediate training state.

After the job starts, it is difficult to mount or add newly purchased disks without stopping the instance.

So I would suggest initializing instances with a 200 GB disk; see the disk-usage sketch below.
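For reference, a small sketch of how to check the space budget on a fresh instance before training starts; the paths below are my assumptions about a typical layout, matching the numbers above.

df -h /                        # free space on the system disk (100 GB today)
du -sh ~/anaconda3             # Anaconda environment, ~30 GB in my case
du -sh ~/models/llama-7b*      # model weights, ~26 GB for two llama-7b copies
du -sh ~/output/checkpoint-*   # one saved checkpoint, ~10 GB with intermediate training state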

The deleted Instance (VirtualMachine) is still active and costs money.

Explain the issue

A trashed instance is still active and still in a billed state.

You can reproduce this by stopping an on-demand instance and archiving it via the delete command: do not delete it immediately, but drop it into the trash instead. The system still considers this archived instance active and keeps charging for it.

I think an archived instance is dead, releases all its GPU resources, and should not cost money.

Screenshots of the transactions:

Instance in trash
[screenshot: instance in trash]

Trashed instance still incurring cost
[screenshot: trashed instance still billed]

Recommended resolving methods:

Can trashing an instance be treated as the proper way to release all resources and stop the on-demand billing, so that it does not incur any further cost?
