
Comments (3)

PareesaMS commented on May 26, 2024

For the one in the training folder: it looks like the script is given a path to a DeepSpeed config JSON file that it doesn't need. You can see this in the command being executed:
cmd = /opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json

The correct command does not need the "--deepspeed_config ds_config.json" flag.
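To make the difference concrete, here is a minimal Python sketch that drops both the flag and its value (the config path) from the launched command quoted above. `drop_flag` is a hypothetical helper written for illustration; it is not part of DeepSpeed or its launcher.

```python
import shlex

# Command copied verbatim from the launcher output above.
ORIGINAL = (
    "/opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch "
    "--world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 "
    "--master_port=29500 --enable_each_rank_log=None "
    "cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json"
)


def drop_flag(argv, flag):
    """Return a copy of argv without `flag` and the single value that follows it."""
    out = []
    skip_next = False
    for tok in argv:
        if skip_next:
            skip_next = False  # this token was the flag's value; drop it
            continue
        if tok == flag:
            skip_next = True  # drop the flag now and its value on the next pass
            continue
        if tok.startswith(flag + "="):
            continue  # also handle the --flag=value spelling
        out.append(tok)
    return out


fixed = drop_flag(shlex.split(ORIGINAL), "--deepspeed_config")
print(shlex.join(fixed))  # ... cifar10_deepspeed.py --deepspeed
```

Everything before the script name is left untouched; only the two trailing tokens disappear.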

I also see you are not using the latest version of DeepSpeed. Please upgrade to version 10 and try again; I don't see this issue when running version 10:
pip install --upgrade deepspeed
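After upgrading, it may be worth confirming that the interpreter the launcher uses (in the command above, /opt/miniconda/envs/env2/bin/python) actually sees the new package. A small sketch using the standard library's `importlib.metadata`; the helper name `installed_deepspeed_version` is illustrative only.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_deepspeed_version():
    """Return the installed DeepSpeed version string, or None if it is not installed."""
    try:
        return version("deepspeed")
    except PackageNotFoundError:
        return None


# Run this with the same Python interpreter that launches the training script.
print(installed_deepspeed_version() or "deepspeed is not installed")
```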

For the compression one: I see the same issue. Let me dig deeper and report back to you.


Samanthavsilva commented on May 26, 2024

I have just started over in a new environment and upgraded DeepSpeed, but I keep getting this issue:
[2023-10-07 01:37:30,894] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-10-07 01:37:30,894] [INFO] [launch.py:163:main] dist_world_size=2
[2023-10-07 01:37:30,894] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
libnuma: Warning: cpu argument 0-19 is out of range

<0-19> is invalid
usage: numactl [--all | -a] [--interleave= | -i ] [--preferred= | -p ]
[--physcpubind= | -C ] [--cpunodebind= | -N ]
[--membind= | -m ] [--localalloc | -l] command args ...
numactl [--show | -s]
numactl [--hardware | -H]
numactl [--length | -l ] [--offset | -o ] [--shmmode | -M ]
[--strict | -t]
[--shmid | -I ] --shm | -S
[--shmid | -I ] --file | -f
[--huge | -u] [--touch | -T]
memory policy | --dump | -d | --dump-nodes | -D

memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
netdev:DEV the node connected to network device DEV
file:PATH the node the block device of path is connected to
ip:HOST the node of the network device host routes through
block:PATH the node of block device path
pci:[seg:]bus:dev[:func] The node of a PCI device
is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
can have g (GB), m (MB) or k (KB) suffixes
[2023-10-07 01:37:30,949] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 70145
libnuma: Warning: cpu argument 20-39 is out of range

<20-39> is invalid
[2023-10-07 01:37:30,950] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 70148


Blenderama commented on May 26, 2024

Just remove the line "--deepspeed_config ds_config.json \" from run_ds.sh.
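A minimal sketch of that edit in Python. The script contents below are a hypothetical stand-in for run_ds.sh (the real file may differ), and `remove_config_line` is an illustrative helper, not part of the repository.

```python
# Hypothetical run_ds.sh contents for illustration; the real script may differ.
# "\\" in this (non-raw) string yields the single backslash the shell sees.
SAMPLE_RUN_DS = """#!/bin/bash
deepspeed cifar10_deepspeed.py \\
    --deepspeed \\
    --deepspeed_config ds_config.json \\
    $@
"""


def remove_config_line(script):
    """Drop every line that passes --deepspeed_config, keeping all other lines."""
    kept = [
        line
        for line in script.splitlines(keepends=True)
        if not line.lstrip().startswith("--deepspeed_config")
    ]
    return "".join(kept)


print(remove_config_line(SAMPLE_RUN_DS))
```

Because the removed line ends with its own backslash continuation, deleting it whole leaves the remaining continuations intact. If the flag were on the script's last line, the trailing backslash on the line before it would also have to go.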

