Comments (5)
Tutel MoE handles data loaders and models the same way DDP mode does, so you can safely insert the Tutel MoE layer into a forward graph you originally designed for DDP (e.g. https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L79).
There is only one thing you need to pay attention to: expert parameters must not be managed by DDP's allreduce.
Here is how to achieve that:
- Set an attribute as a mask on each expert parameter object (follow the example): https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L68
- Follow the PyTorch DDP convention to add a handler that skips them: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L83-L87
- Just after moving the initialized model to the target device (e.g. model = model.to('cuda:#')), call the handler: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L92-L95
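The three steps above can be sketched roughly as follows. This is a minimal illustration, not a copy of the linked example: the helper names (mark_expert_params, collect_marked_names, wrap_with_ddp), the attribute name skip_allreduce, and the name-based is_expert predicate are all assumptions made for this sketch. It also relies on _ddp_params_and_buffers_to_ignore, a private PyTorch attribute that DDP reads at construction time and that may change between PyTorch versions.

```python
def mark_expert_params(model, is_expert, attr="skip_allreduce"):
    """Step 1: tag each expert parameter object with a marker attribute.

    `is_expert` is a hypothetical predicate on the parameter name.
    """
    for name, param in model.named_parameters():
        if is_expert(name):
            setattr(param, attr, True)


def collect_marked_names(model, attr="skip_allreduce"):
    """Step 2: gather the fully qualified names of all tagged parameters."""
    return [name for name, param in model.named_parameters()
            if hasattr(param, attr)]


def wrap_with_ddp(model, device):
    """Step 3: move the model to its device, then register the names to skip.

    Assumes torch.distributed has already been initialized by the caller.
    """
    import torch  # imported here so the helpers above stay framework-agnostic

    model = model.to(device)
    # Private attribute read by DistributedDataParallel.__init__; parameters
    # listed here are excluded from DDP's gradient allreduce.
    model._ddp_params_and_buffers_to_ignore = collect_marked_names(model)
    return torch.nn.parallel.DistributedDataParallel(model)
```

With this sketch you would call mark_expert_params(model, lambda n: '.experts.' in n) once after building the model (the '.experts.' substring is only a guess at how expert parameters are named in your model), and then wrap_with_ddp(model, 'cuda:0') inside each initialized process.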
from tutel.
Thank you very much, I'll try it out! Also, is there any explanation of the different parallel-type arguments and how they differ from each other?
from tutel.
Whichever parallel type you choose, it doesn't change how you use the MoE layer out of the box. The different types only change which parallelism the MoE layer uses internally; those choices are all transparent to users and mathematically equivalent to each other.
Depending on whether you run at large or small scale, choosing this option well will reduce the execution time of the Tutel MoE layer, since each parallelism has its own network complexity and local memory consumption.
from tutel.
Additionally, you can change the parallel option on every iteration, e.g. https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_switch.py#L88
The value of adaptive_r varies from 0 to max(1, [Total GPU Count / Total Expert Count]).
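As a quick way to see which values that bound allows, here is a small sketch. The function name adaptive_r_values is hypothetical, and it assumes that the square brackets mean integer (floor) division and that adaptive_r takes integer values:

```python
def adaptive_r_values(total_gpus, total_experts):
    """Enumerate candidate adaptive_r values: 0 .. max(1, gpus // experts).

    Assumption: the bracketed ratio is floored to an integer.
    """
    upper = max(1, total_gpus // total_experts)
    return list(range(0, upper + 1))


print(adaptive_r_values(8, 2))  # 8 GPUs, 2 experts
print(adaptive_r_values(2, 8))  # fewer GPUs than experts
```

Under those assumptions, 8 GPUs with 2 experts gives an upper bound of 4, while 2 GPUs with 8 experts falls back to the minimum upper bound of 1.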
from tutel.
Wow, thank you very much, I'll try it out!
from tutel.