Comments (3)
Not only is the ordering among multiple senders/receivers a problem; the ordering between send and receive is also a problem, because our current implementation runs both on the default stream.
Below is a simple explanation:
Suppose that a send/recv takes 1 unit of time and a computation takes k = 2 units (in a real BERT layer, k >> 1). Denote microbatches by 1, 2, ..., meshes by M1, M2, ..., and the communication between mesh i and mesh j by cij. If a unit of time is occupied by c12 on Mesh1, the same unit is occupied on Mesh2. All communication and computation for the same microbatch have the same color.
Then, if we use the principle that "for each mesh, always handle the prior stage first, whether send or recv", it looks like:
That is, to enable computation on Mesh4, c34 must be done, but c34 comes after c23 on Mesh3, and c23 comes after c12 on Mesh2. In consequence, Mesh i waits for all of its prior communications to be done (i units of time).
A better solution looks like:
That is, in the first unit we run c12, c34, c56, ...; then in the second we run c23, c45, .... In that case, only 2 units are required no matter how many meshes we have.
from alpa.
In our current case, where only the embedding causes multiple senders/receivers for a mesh, the order of send/recv has only a little influence once #157 is addressed.
In the last column of the timeline above, the top timeline is slightly better than the bottom one because of one tricky communication, c23. It is only one communication faster: even though the first microbatch finishes much earlier on Mesh3 and later meshes, each mesh eventually waits to receive results for the second microbatch. As a result, only the time of one c23 is saved.
In the general case, each microbatch has some communications cij where j is neither i+1 nor i-1 (which is frequent in U-Net), so always handling earlier stages first in send/recv is not enough.
Let me give an example, still with the notation above, but with the communication c14 happening for each microbatch:
There are two policies: the first puts c14 before c12, while the second puts c12 before c14.
The result is that putting c14 before c12 gives slightly better performance.
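The two orderings can be compared with a toy greedy scheduler (an illustrative sketch, not alpa's actual scheduler; it assumes each mesh handles at most one communication per time unit and launches pending communications in priority order):

```python
def greedy_comm_schedule(priority):
    """Toy model: each communication (i, j) takes 1 unit and occupies
    both endpoint meshes; each mesh handles at most one communication
    per unit. At every unit, launch pending communications in the
    given priority order whenever both endpoints are still free."""
    finish, pending, t = {}, list(priority), 0
    while pending:
        t += 1
        busy = set()
        for comm in list(pending):
            i, j = comm
            if i not in busy and j not in busy:
                busy.update((i, j))
                finish[comm] = t
                pending.remove(comm)
    return finish

# One microbatch's communications, including the extra c14 link.
c14_first = greedy_comm_schedule([(1, 4), (1, 2), (2, 3), (3, 4)])
c12_first = greedy_comm_schedule([(1, 2), (1, 4), (2, 3), (3, 4)])
print(c14_first)  # c14 completes in unit 1
print(c12_first)  # c14 is pushed to unit 2
```

In this toy model both orders finish all four communications within 2 units, but prioritizing c14 completes the long-range hop a unit earlier; the "slightly better" end-to-end effect described above comes from how that earlier completion interacts with downstream computation, which this sketch does not model.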
However, when we extend it to 6 stages, with extra communications for (1,6) and (2,5), the situation is totally different because Mesh2 is too busy: it has three communications (c12, c23, c25), so its communication affects other meshes and creates bubbles.
As shown in the graph, swapping c25 and c23 makes the pipeline tighter, in contrast with the first example, where always sending/receiving with later stages first was better.
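Mesh2's congestion can be seen by counting per-mesh communication load (the link set is taken from the example above; illustrative only):

```python
from collections import Counter

# Per-microbatch communications in the 6-stage example:
# adjacent pipeline hops plus the extra (1, 6) and (2, 5) links.
comms = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6), (2, 5)]

# Each communication occupies both endpoint meshes for one unit,
# so a mesh appearing in k links needs at least k units of
# communication time per microbatch.
load = Counter(mesh for link in comms for mesh in link)
print(load[2])  # Mesh2 is in 3 links (c12, c23, c25): >= 3 units
```

By the same count, Mesh5 is also in 3 links (c45, c56, c25), so the send/recv ordering matters there as well.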