Comments (14)

MatthieuSchaller commented on June 7, 2024

Hi, what MPI version and fabric are you using? We have seen some implementations not behaving correctly.

from swift.

JanKleine commented on June 7, 2024

I'm using OpenMPI 3.1.3 with InfiniBand.
Thanks for the quick response.

MatthieuSchaller commented on June 7, 2024

Ok. So that seems fine. And what version of the OFED driver are you using? The regular Linux-kernel one or the Mellanox-optimised?

MatthieuSchaller commented on June 7, 2024

Also, what transport library are you using in OpenMPI?

We recommend psm and not psm2, that is, running with --mca btl vader,self --mca mtl psm.
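
For anyone scripting their job launch, here is a minimal sketch of how that command line could be assembled. The executable name ./swift_mpi, the parameter file params.yml, and the rank count are placeholders rather than details from this thread; only the two --mca options are the ones recommended above.

```python
import shlex


def build_mpirun_command(executable, args, ranks, btl="vader,self", mtl="psm"):
    """Return an mpirun argument list with explicit OpenMPI MCA settings.

    The defaults mirror the options quoted in this thread; everything else
    (executable, arguments, rank count) is supplied by the caller.
    """
    cmd = ["mpirun", "-np", str(ranks)]
    cmd += ["--mca", "btl", btl]  # byte-transfer layer selection
    cmd += ["--mca", "mtl", mtl]  # matching-transport layer selection
    cmd.append(executable)
    cmd.extend(args)
    return cmd


if __name__ == "__main__":
    # Hypothetical executable and parameter file, for illustration only.
    cmd = build_mpirun_command("./swift_mpi", ["params.yml"], ranks=4)
    print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass straight to a process launcher or to print for a batch script.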

JanKleine commented on June 7, 2024

Ok. So that seems fine. And what version of the OFED driver are you using? The regular Linux-kernel one or the Mellanox-optimised?

I think it is the Mellanox-optimised version.

We recommend psm and not psm2, that is, running with --mca btl vader,self --mca mtl psm.

I will try that.

MatthieuSchaller commented on June 7, 2024

Ok. So that seems fine. And what version of the OFED driver are you using? The regular Linux-kernel one or the Mellanox-optimised?

I think the Mellanox-optimised version

Right, then that is likely the issue. Their current driver hangs if too many asynchronous communications are in flight at a given point in time. SWIFT makes extensive use of this mechanism, so you may be facing this issue here.

JanKleine commented on June 7, 2024

Would removing the Mellanox driver fix the issue?

MatthieuSchaller commented on June 7, 2024

I can only speculate, as I have never seen this issue on machines where we have control over things, but it may help.

Otherwise, trying a different mtl in OpenMPI might help (instead of changing driver).

JanKleine commented on June 7, 2024

Right, then that is likely the issue. Their current driver hangs if too many asynchronous communications are in flight at a given point in time. SWIFT makes extensive use of this mechanism, so you may be facing this issue here.

I seem to get the same problem on a system with the regular driver.

MatthieuSchaller commented on June 7, 2024

Did you try changing the mtl to psm?

MatthieuSchaller commented on June 7, 2024

Hi Jan,

Have you had some luck with the code?

JanKleine commented on June 7, 2024

I'm having some trouble, as the OpenMPI version I'm using apparently doesn't support psm, and installing it with psm is a little problematic at the moment. I'm still on it.
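
One way to check which mtl components an OpenMPI build offers is to scan the output of ompi_info. The sketch below assumes component lines look like "MCA mtl: psm (MCA v2.1.0, ...)", which may differ between OpenMPI versions; the helper names are made up for this example.

```python
import re
import subprocess


def mtl_components(ompi_info_text):
    """Extract MTL component names from `ompi_info` output.

    Assumes component lines of the form "MCA mtl: <name> (MCA v..., ...)".
    """
    pattern = re.compile(r"^\s*MCA mtl:\s*(\S+)", re.MULTILINE)
    return pattern.findall(ompi_info_text)


def has_mtl(name, ompi_info_text):
    """Return True if the named MTL component appears in the output."""
    return name in mtl_components(ompi_info_text)


if __name__ == "__main__":
    try:
        text = subprocess.run(
            ["ompi_info"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        text = ""  # ompi_info not found or failed; report nothing available
    print("available mtl components:", mtl_components(text))
    print("psm available:", has_mtl("psm", text))
```

If psm is missing from the list, the installed OpenMPI was most likely built without it, and requesting it with --mca mtl psm will not work until OpenMPI is rebuilt with that support.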

JanKleine commented on June 7, 2024

I forgot to mention earlier that I'm also using Slurm as the job scheduler. I hope that is not interfering with anything.

JanKleine commented on June 7, 2024

Did you try changing the mtl to psm?

Using psm did not seem to resolve the issue.

Edit: I made another mistake while running. Using psm does seem to resolve the problem.
