
Comments (3)

ZYHowell commented on May 20, 2024

The problem is not only the order among multiple senders/receivers; the order between send and receive also matters, because in our current implementation both use the default stream.
A simple explanation follows.
Suppose a send/recv takes 1 unit of time and a computation takes k=2 units (in a real BERT layer, k >> 1).
Denote microbatches by 1, 2, ..., and the communication between mesh i and mesh j by cij. If a unit of time is occupied by c12 on mesh 1, the same unit is occupied on mesh 2. All communication and computation for the same microbatch share the same color.
Denote meshes by M1, M2, ...

Then, if we follow the principle "for each mesh, always handle the earlier stage first, whether it is a send or a recv", the timeline looks like:
image
That is, to enable computation on Mesh4, c34 must be done, but c34 comes after c23 on Mesh3, and c23 comes after c12 on Mesh2. As a consequence, Mesh i waits for all earlier communications to finish (i units of time).
A better solution looks like:
image
That is, in the first unit we run c12, c34, c56, ..., and in the second unit we run c23, c45, ... In that case, only 2 units are required no matter how many meshes we have.
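To make the comparison concrete, here is a minimal sketch of the two orderings. It is a toy model, not alpa code: each mesh issues its communications in order on a single stream, a communication takes 1 unit and runs only when it is at the head of both endpoints' queues, and computation and data dependencies are ignored. The names `simulate`, `prior_first` and `odd_even` are made up for this illustration.

```python
def simulate(queues):
    """Run per-mesh in-order communication queues; return finish time per link.

    queues maps a mesh id to its ordered list of links (i, j) with i < j.
    Toy assumption: a link runs for 1 unit, and only when it is at the
    head of the queue on both of its endpoints (single blocking stream).
    """
    queues = {m: list(q) for m, q in queues.items()}
    finish = {}
    t = 0
    while any(queues.values()):
        ready = set()
        for m, q in queues.items():
            if not q:
                continue
            i, j = q[0]
            other = j if m == i else i
            if queues[other] and queues[other][0] == (i, j):
                ready.add((i, j))
        if not ready:
            raise RuntimeError("deadlock: no link is at the head of both queues")
        t += 1
        for i, j in ready:
            queues[i].pop(0)
            queues[j].pop(0)
            finish[(i, j)] = t
    return finish


def links_of(mesh, n):
    """Neighbor links of a mesh in a chain of n meshes (1-indexed)."""
    links = []
    if mesh > 1:
        links.append((mesh - 1, mesh))
    if mesh < n:
        links.append((mesh, mesh + 1))
    return links


def prior_first(n):
    """Each mesh handles the link to its earlier stage first."""
    return {m: links_of(m, n) for m in range(1, n + 1)}


def odd_even(n):
    """Links c12, c34, ... go first; links c23, c45, ... go second."""
    return {
        m: sorted(links_of(m, n), key=lambda link: link[0] % 2 == 0)
        for m in range(1, n + 1)
    }


def makespan(queues):
    return max(simulate(queues).values())


if __name__ == "__main__":
    for n in (4, 6, 8):
        print(n, makespan(prior_first(n)), makespan(odd_even(n)))
```

In this toy model the earlier-stage-first ordering serializes the whole chain, so its makespan grows with the number of meshes, while the alternating ordering finishes in 2 units.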

from alpa.

ZYHowell commented on May 20, 2024

In our current case, where only the embedding causes multiple senders/receivers for a mesh, the order of send/recv has little influence once #157 is addressed.
image
In the last column of the timeline above, the top timeline is slightly better than the bottom one because of a tricky communication, c23. It is only one communication faster because, even though the first microbatch finishes much earlier on Mesh3 and later meshes, those meshes eventually have to wait to receive the results for the second microbatch. As a result, only the time of one c23 is saved.


ZYHowell commented on May 20, 2024

In the general case where, for each microbatch, there are communications c_ij with j neither i+1 nor i-1 (which is frequent in U-Net), always doing send/recv for earlier stages first is not enough.
Here is an example using the notation above, but with a communication c14 happening for each microbatch.
There are two policies: the first puts c14 before c12, and the second puts c12 before c14.
image
The result is that putting c14 before c12 gives slightly better performance.

However, when we extend this to 6 stages where (1,6) and (2,5) have extra communications, the situation is totally different because Mesh2 is too busy: it has three communications (c12, c23, c25), so its communication delays other meshes and creates bubbles.
image

As noted in the graph, swapping c25 and c23 makes the pipeline tighter, in contrast with the first example, where it is better to always send/recv with later stages first.
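The "Mesh2 is too busy" observation can be checked with a small count of communications per mesh. This is only a sketch of the 6-stage topology described above; the helper name `comms_per_mesh` is hypothetical and it counts links per microbatch, nothing more.

```python
from collections import Counter


def comms_per_mesh(num_meshes, extra_links=()):
    """Count how many communications each mesh takes part in per microbatch.

    Assumes a chain of neighbor links c12, c23, ... plus the given extra
    (non-adjacent) links, as in the 6-stage example above.
    """
    links = [(i, i + 1) for i in range(1, num_meshes)] + list(extra_links)
    load = Counter()
    for i, j in links:
        load[i] += 1
        load[j] += 1
    return dict(load)


load = comms_per_mesh(6, extra_links=[(1, 6), (2, 5)])
for mesh in sorted(load):
    print(f"Mesh{mesh}: {load[mesh]} communications")
```

Under these assumptions Mesh2 takes part in three communications (c12, c23, c25), and so does Mesh5 (c45, c56, c25), while the other meshes have two each, which is why the relative order of the links on the busy meshes matters most.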

