Comments (10)

CCorfield commented on August 15, 2024

I have written up notes on using Torch, parallel, MPS, and nccl at:

https://github.com/CCorfield/Torch-parallel-nccl-MPS-Example

Let me know if they are helpful.

from nccl.

nluehr commented on August 15, 2024

In CUDA 8.0, MPS is still required for multiple processes to share a single GPU simultaneously.

CCorfield commented on August 15, 2024

Still having issues:

I am running my test script configured with 2 processes per GPU. The script is essentially the same as the one I attached yesterday, with variations for setting up blocking/non-blocking streams, and either synchronizing or not synchronizing the streams after the call to the nccl operation.

I have tried the following "number of MPS daemons" scenarios:
(A) One MPS daemon per GPU
(B) One MPS daemon for both GPUs
(C) No MPS daemon.
(For the sake of simplicity, I have not run the X Server.)
I have also tried with and without setting exclusive mode (per Section 5.1.1.1 in the MPS documentation).
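
As a rough illustration of scenarios (A) to (C), the sketch below builds the environment each MPS control daemon would be started with. It assumes the documented `CUDA_VISIBLE_DEVICES`, `CUDA_MPS_PIPE_DIRECTORY`, and `CUDA_MPS_LOG_DIRECTORY` variables and the `nvidia-cuda-mps-control -d` command; the helper name `mps_daemon_envs` and the `/tmp/mps` paths are hypothetical, and nothing is actually launched here.

```python
from typing import Dict, List

# Command that starts an MPS control daemon in the background.
MPS_START_COMMAND = ["nvidia-cuda-mps-control", "-d"]


def mps_daemon_envs(gpu_ids: List[int], one_daemon_per_gpu: bool,
                    base_dir: str = "/tmp/mps") -> List[Dict[str, str]]:
    """Return one environment dict per MPS daemon to start.

    base_dir is a hypothetical location; each daemon gets its own
    pipe and log directory so that daemons do not collide.
    """
    if one_daemon_per_gpu:
        groups = [[gpu] for gpu in gpu_ids]   # scenario (A)
    else:
        groups = [list(gpu_ids)]              # scenario (B)
    envs = []
    for index, group in enumerate(groups):
        envs.append({
            "CUDA_VISIBLE_DEVICES": ",".join(str(gpu) for gpu in group),
            "CUDA_MPS_PIPE_DIRECTORY": f"{base_dir}/pipe_{index}",
            "CUDA_MPS_LOG_DIRECTORY": f"{base_dir}/log_{index}",
        })
    return envs


scenario_a = mps_daemon_envs([0, 1], one_daemon_per_gpu=True)
scenario_b = mps_daemon_envs([0, 1], one_daemon_per_gpu=False)
scenario_c: List[Dict[str, str]] = []         # scenario (C): no daemon
```

Each dict could be passed as extra environment to `subprocess.Popen(MPS_START_COMMAND, env=...)`. A client process would then need the same `CUDA_MPS_PIPE_DIRECTORY` as the daemon it is meant to connect to.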

Observed behaviors:

  1. If I set Exclusive Mode when running one or two MPS daemons, the child processes die during initialization with the error "All CUDA-capable device are busy..."
  2. After reverting to Default Mode, the script runs to the end but does not exit. The process remains visible in nvidia-smi's output.
  3. Killing hung processes results in zombie processes. Eventually most of these defunct processes get reaped, but not all. During this time nvidia-smi also hangs. When nvidia-smi resumes working, some defunct processes are still in the process table.
  4. If I set up a non-blocking stream on which to do the nccl operations, the nccl operations return a status of "no error". The receiving buffers do not contain any data (not entirely unexpected). If I build in CPU sleeps to see whether the receiving buffers eventually get filled, nothing happens. If I try to synchronize the (non-blocking) stream, the process hangs.
  5. If I set up a blocking stream, the script hangs after the first nccl operation (an out-of-place reduce).

There are quite a few permutations and combinations of potential settings. However, when all is said and done, I want the following to work: a parent process forks and execs child processes, and each process is assigned a GPU (the parent sends the assignments to the children). Each process then runs cycles of: forwarding training data through its copy of the net, back-propagating to accumulate its own gradient data, calling nccl.AllReduce to share gradient data between all processes, and updating its network parameters. Then rinse and repeat.
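
The cycle described above can be sketched in plain Python to make the data flow concrete. This is a single-process simulation under stated assumptions, not the Torch or nccl code: `assign_gpus`, `all_reduce_sum`, and `training_cycle` are hypothetical names, the round-robin assignment is one possible policy, and `all_reduce_sum` only stands in for what a sum all-reduce would leave in every process's buffer.

```python
from typing import List


def assign_gpus(num_procs: int, num_gpus: int) -> List[int]:
    """Round-robin GPU assignment that the parent would send to each child."""
    return [rank % num_gpus for rank in range(num_procs)]


def all_reduce_sum(per_process_grads: List[List[float]]) -> List[List[float]]:
    """Stand-in for a sum all-reduce: every process ends up with the total."""
    total = [sum(values) for values in zip(*per_process_grads)]
    return [list(total) for _ in per_process_grads]


def training_cycle(params: List[List[float]], grads: List[List[float]],
                   lr: float) -> List[List[float]]:
    """One cycle: share gradients, then each process updates its own copy."""
    reduced = all_reduce_sum(grads)
    return [[p - lr * g for p, g in zip(proc_params, proc_grads)]
            for proc_params, proc_grads in zip(params, reduced)]


# Four processes on two GPUs, i.e. two processes per GPU.
gpu_of_rank = assign_gpus(num_procs=4, num_gpus=2)

# Each process starts from the same parameters but computes its own gradients.
params = [[1.0, 2.0] for _ in range(4)]
grads = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [3.0, 4.0]]
new_params = training_cycle(params, grads, lr=0.125)
```

Because every process applies the same reduced gradient to the same starting parameters, the copies stay identical after the update, which is the property the real AllReduce step is meant to provide.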

Any more insights you can share?

-CC

CCorfield commented on August 15, 2024

I have found out how to make MPS work with Torch/parallel and nccl. Rather than give a long description here, I'll look into posting a "How To" with examples under my own GitHub account. At that point I recommend that someone from the NVIDIA documentation team take a look, because I suspect that others will also find the (current) documentation opaque, and a few points of clarification would save them a lot of churn.

js947 commented on August 15, 2024

Hi @CCorfield, we have a similar problem. Did you post a solution somewhere?

CCorfield commented on August 15, 2024

I have not yet posted my solution, but will do so fairly soon, since it is on my to-do list.

hiyijian commented on August 15, 2024

Hi @CCorfield,
I have also observed the same mysterious hangs with MPI + nccl. My scenario is much simpler than yours: I use a single GPU per process (rank). In issue #37, which discussed the multi-threaded scenario, the hang was resolved by adding a boost::barrier before the nccl call. Accordingly, I added MPI_Barrier() before the nccl call in my case, but it still hangs.

Do you have any suggestions for fixing this? Am I missing something?

nluehr commented on August 15, 2024

@hiyijian If NCCL isn't working at all (rather than intermittent failures), you should check ACSCtl settings (run lspci -vvv | grep ACSCtl as root, see #19). Also, you can try disabling VT-d if your BIOS has such options.
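
To make the ACSCtl check easier to read, the following sketch parses `lspci -vvv` text and lists which ACS control flags are marked enabled (`+`) on each matching line. The function name `enabled_acs_flags` is hypothetical, and `SAMPLE` is illustrative text written for this example in the usual `lspci` layout, not output captured from anyone's machine in this thread.

```python
from typing import List


def enabled_acs_flags(lspci_output: str) -> List[List[str]]:
    """For each ACSCtl line, return the flag names that end in '+'."""
    results = []
    for line in lspci_output.splitlines():
        if "ACSCtl:" not in line:
            continue
        # Everything after "ACSCtl:" is a space-separated list such as
        # "SrcValid+ TransBlk- ...", where '+' means the flag is set.
        flags = line.split("ACSCtl:", 1)[1].split()
        results.append([flag[:-1] for flag in flags if flag.endswith("+")])
    return results


# Illustrative sample only: one bridge with ACS flags set, one without.
SAMPLE = """\
\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
"""

per_line_flags = enabled_acs_flags(SAMPLE)
acs_active_somewhere = any(per_line_flags)
```

On a real system the input would be the text produced by running `lspci -vvv` as root. Any non-empty entry points at a bridge worth examining, on the assumption that a `+` flag means ACS is active there.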

hiyijian commented on August 15, 2024

Hi @nluehr, it fails intermittently.

sjeaugey commented on August 15, 2024

Closing this old issue. If still a problem, please reopen with details on the use case (NCCL hangs can be caused by many very different problems, including invalid usage or CUDA limitations).
