Training with Data and Expert Parallelism

microsoft commented on May 23, 2024

How should I prepare my code (data loaders, model, etc.) in order to train in a both

Comments (5)

ghostplant commented on May 23, 2024

For data loaders and models, Tutel MoE works the same way as any other module under DDP, so you can safely insert the Tutel MoE layer into the forward graph you already designed for DDP (e.g. https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L79).

There is only one thing you need to pay attention to: expert parameters must not be managed by DDP's allreduce.
You can achieve this as follows:

1. Set an attribute as a marker on each expert parameter object (follow the example): https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L68
2. Following PyTorch DDP, add a handler that skips the marked parameters: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L83-L87
3. Right after moving the model to the target device (e.g. model = model.to('cuda:#')), call the handler: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L92-L95
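The marking and collecting steps above can be sketched without any GPU or distributed setup. This is only an illustration: the attribute name `skip_allreduce`, the helper `collect_skip_allreduce_names`, and the parameter names are assumptions made here, and plain `SimpleNamespace` objects stand in for real parameter tensors.

```python
from types import SimpleNamespace


def collect_skip_allreduce_names(named_params, flag="skip_allreduce"):
    """Return the names of all parameters that carry a truthy `flag` attribute.

    `named_params` is any iterable of (name, param) pairs, which is the same
    shape that torch.nn.Module.named_parameters() yields.
    """
    return [name for name, param in named_params if getattr(param, flag, False)]


# Stand-ins for parameter objects (hypothetical names, not Tutel's real ones).
gate_weight = SimpleNamespace()
expert_weight = SimpleNamespace()

# Step 1: mark the expert parameter so it can be recognised later.
expert_weight.skip_allreduce = True

named_params = [
    ("moe.gate.weight", gate_weight),
    ("moe.experts.w1", expert_weight),
]

# Steps 2 and 3: gather the names that DDP should leave alone.
ignored = collect_skip_allreduce_names(named_params)
print(ignored)  # ['moe.experts.w1']
```

With a real model you would pass `model.named_parameters()` instead of the stand-in list and hand the resulting names to DDP before wrapping the model, as the linked example does. Check the exact mechanism against the example and your PyTorch version rather than relying on this sketch.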


santurini commented on May 23, 2024

Thank you very much, I'll try it out! Is there also an explanation of the different types of parallel arguments and how they differ from each other?


ghostplant commented on May 23, 2024

Whichever type of parallelism you choose, it doesn't change how you use the MoE layer out of the box. The different types only change the parallelism used inside the MoE layer; these choices are all transparent to users and mathematically equivalent to each other.

At both large and small scales, choosing that option well will improve the execution time of the Tutel MoE layer, since each parallelism has its own network complexity and local memory consumption.


ghostplant commented on May 23, 2024

Additionally, you can change the parallel option at every iteration, e.g. https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_switch.py#L88

The value of adaptive_r ranges from 0 to max(1, [Total GPU Count / Total Expert Count]).
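As a quick illustration of that range, the sketch below lists the candidate values for a given cluster. It assumes that the bracketed division means integer (floor) division, that adaptive_r takes integer values, and that both ends of the range are inclusive. None of this is confirmed in the thread, and `adaptive_r_candidates` is a hypothetical helper, not part of Tutel.

```python
def adaptive_r_candidates(total_gpus, total_experts):
    """List candidate adaptive_r values: 0 .. max(1, total_gpus // total_experts).

    Assumes floor division and an inclusive integer range.
    """
    if total_gpus < 1 or total_experts < 1:
        raise ValueError("total_gpus and total_experts must be positive")
    upper = max(1, total_gpus // total_experts)
    return list(range(0, upper + 1))


# 8 GPUs sharing 2 experts: the upper bound is 8 // 2 = 4.
print(adaptive_r_candidates(8, 2))  # [0, 1, 2, 3, 4]

# More experts than GPUs: the upper bound is clamped to 1.
print(adaptive_r_candidates(4, 8))  # [0, 1]
```

Since the choices are described as mathematically equivalent, one option is to time a few iterations for each candidate and keep the fastest.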


santurini commented on May 23, 2024

Wow, thank you very much, I'll try it out!
