slurm-ray-cluster's People

Contributors

howardlau1999, mustafamustafa, pzharrington, sparticlesteve, zekailin00

slurm-ray-cluster's Issues

Not Getting Past ray.init

Hi, I modified submit-ray-cluster.sbatch to run on a single node using a container. When it runs, everything seems okay at first, but then everything stops, with the head, worker, and Python mnist code all being cancelled.

I can tell from some logging statements I added that the mnist code never gets past ray.init, but I can't find any way to learn more about what is going on.

Any suggestions on what to try next?

$ sacct -j 1023
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1023         submit-ra+      batch                    24     FAILED      6:0 
1023.batch        batch                               24     FAILED      6:0 
1023.extern      extern                               24  COMPLETED      0:0 
1023.0         hostname                                8  COMPLETED      0:0 
1023.1          RayHead                                8  CANCELLED     0:15 
1023.2       RayWorker1                                8  CANCELLED     0:15 
1023.3       AiPythonS+                                8  CANCELLED      0:6 
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ set -o pipefail
+ export NCCL_BLOCKING_WAIT=1
+ NCCL_BLOCKING_WAIT=1
++ uuidgen
+ redis_password=adc28f91-2b23-4eb8-851e-33d41694d40c
+ export redis_password
++ scontrol show hostnames compute1
+ nodes=compute1
+ nodes_array=(${nodes[0]} ${nodes[0]})
+ echo NODES ARRAY: compute1 compute1
NODES ARRAY: compute1 compute1
+ node_1=compute1
++ srun --gres=gpu:1 --nodes=1 --ntasks=1 -w compute1 hostname --ip-address
+ ip=10.111.245.102
+ port=6379
+ ip_head=10.111.245.102:6379
+ export ip_head
+ echo 'IP Head: 10.111.245.102:6379'
IP Head: 10.111.245.102:6379
+ echo 'STARTING HEAD at compute1'
STARTING HEAD at compute1
+ sleep 10
+ srun -u -l --job-name=RayHead --gres=gpu:1 --nodes=1 --ntasks=1 -w compute1 --container-mounts=/home/updikca1/slurm/rayDocker/slurm-ray-cluster:/code --container-name=ray-torch /code/start-head.sh 10.111.245.102 adc28f91-2b23-4eb8-851e-33d41694d40c
0: pyxis: reusing existing container filesystem
0: pyxis: starting container ...
0: starting ray head node
0: 2021-07-22 18:41:20,012	INFO scripts.py:560 -- Local node IP: 10.111.245.102
0: 2021-07-22 18:41:20,041	WARNING utils.py:510 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
0: 2021-07-22 18:41:21,712	INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
0: 2021-07-22 18:41:22,730	SUCC scripts.py:592 -- --------------------
0: 2021-07-22 18:41:22,730	SUCC scripts.py:593 -- Ray runtime started.
0: 2021-07-22 18:41:22,730	SUCC scripts.py:594 -- --------------------
0: 2021-07-22 18:41:22,730	INFO scripts.py:596 -- Next steps
0: 2021-07-22 18:41:22,730	INFO scripts.py:597 -- To connect to this Ray runtime from another node, run
0: 2021-07-22 18:41:22,730	INFO scripts.py:601 --   ray start --address='10.111.245.102:6379' --redis-password='adc28f91-2b23-4eb8-851e-33d41694d40c'
0: 2021-07-22 18:41:22,730	INFO scripts.py:606 -- Alternatively, use the following Python code:
0: 2021-07-22 18:41:22,731	INFO scripts.py:609 -- import ray
0: 2021-07-22 18:41:22,731	INFO scripts.py:610 -- ray.init(address='auto', _redis_password='adc28f91-2b23-4eb8-851e-33d41694d40c')
0: 2021-07-22 18:41:22,731	INFO scripts.py:618 -- If connection fails, check your firewall settings and network configuration.
0: 2021-07-22 18:41:22,731	INFO scripts.py:623 -- To terminate the Ray runtime, run
0: 2021-07-22 18:41:22,731	INFO scripts.py:624 --   ray stop
0: 
+ worker_num=1
+ (( i=1 ))
+ (( i<=1 ))
+ node_i=compute1
+ echo 'STARTING WORKER 1 at compute1'
STARTING WORKER 1 at compute1
+ sleep 5
+ srun -u -l --job-name=RayWorker1 --gres=gpu:1 --nodes=1 --ntasks=1 -w compute1 --container-mounts=/home/updikca1/slurm/rayDocker/slurm-ray-cluster:/code --container-name=ray-torch /code/start-worker.sh 10.111.245.102:6379 adc28f91-2b23-4eb8-851e-33d41694d40c
0: pyxis: reusing existing container filesystem
0: pyxis: starting container ...
0: starting ray worker node
0: 2021-07-22 18:41:29,106	INFO scripts.py:670 -- Local node IP: 10.111.245.102
0: 2021-07-22 18:41:29,109	WARNING utils.py:510 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
0: 2021-07-22 18:41:29,121	SUCC scripts.py:683 -- --------------------
0: 2021-07-22 18:41:29,121	SUCC scripts.py:684 -- Ray runtime started.
0: 2021-07-22 18:41:29,122	SUCC scripts.py:685 -- --------------------
0: 2021-07-22 18:41:29,122	INFO scripts.py:687 -- To terminate the Ray runtime, run
0: 2021-07-22 18:41:29,122	INFO scripts.py:688 --   ray stop
0: 
+ (( i++  ))
+ (( i<=1 ))
+ sleep 20
+ HOROVOD_LOG_LEVEL=debug
+ srun -u -l --job-name=AiPythonScript -u --gpus-per-task=0 --nodes=1 --ntasks=1 --container-mounts=/home/updikca1/slurm/rayDocker/slurm-ray-cluster:/code --container-name=ray-torch python -u /code/examples/mnist_pytorch_trainable.py --ray-address 10.111.245.102:6379
0: pyxis: reusing existing container filesystem
0: pyxis: starting container ...
0: inside mnist_pytorch_trainable.py
0: inside __main__
0: 10.111.245.102:6379 adc28f91-2b23-4eb8-851e-33d41694d40c
0: 2021-07-22 18:41:54,938	INFO worker.py:735 -- Connecting to existing Ray cluster at address: 10.111.245.102:6379
srun: error: compute1: task 0: Aborted
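
The last log line shows the script aborting right after "Connecting to existing Ray cluster at address: 10.111.245.102:6379", and sacct reports signal 6 (SIGABRT) for that step. One low-cost check, offered as a sketch rather than a known fix, is to confirm from inside the container that the head's port accepts TCP connections before calling ray.init. The helper name head_reachable and the 5-second default timeout are illustrative assumptions and are not part of the repository's scripts; only the Python standard library is used.

```python
import socket


def head_reachable(address: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to 'host:port' succeeds.

    Hypothetical helper for debugging; not part of slurm-ray-cluster or Ray.
    """
    # Split on the last colon so the host part may itself contain dots.
    host, _, port = address.rpartition(":")
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


# Example call with the address from the log above (left commented out so the
# snippet has no side effects when imported):
# print(head_reachable("10.111.245.102:6379"))
```

If this returns False when run inside the ray-torch container, the problem is more likely network or container isolation than the mnist script. If it returns True, calling the standard library's faulthandler.enable() at the top of the script should make Python print a traceback when the process receives SIGABRT, which may show where inside ray.init the abort happens.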
