
tutel's Introduction

Tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation.

  • Supported Framework: Pytorch (recommend: >= 1.10)
  • Supported GPUs: CUDA(fp64/fp32/fp16/bfp16), ROCm(fp64/fp32/fp16)
  • Supported CPU: fp64/fp32

What's New:

  • Tutel v0.3.1: Add NCCL all_to_all_v and all_gather_v for arbitrary-length message transfers:
  >> Example:
    # All_to_All_v:
    python3 -m torch.distributed.run --nproc_per_node=2 --master_port=7340 -m tutel.examples.nccl_all_to_all_v
    # All_Gather_v:
    python3 -m torch.distributed.run --nproc_per_node=2 --master_port=7340 -m tutel.examples.nccl_all_gather_v

  >> How to:
    net.batch_all_to_all_v([t_x_cuda, t_y_cuda, ..], common_send_counts)
    net.batch_all_gather_v([t_x_cuda, t_y_cuda, ..])
  • Tutel v0.3: Add the Megablocks solution to improve decoder inference on a single GPU when num_local_experts >= 2:
  >> Example (capacity_factor=0 for dropless-MoE):
    # Using BatchMatmul:
    python3 -m tutel.examples.helloworld --megablocks_size=0 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0
    # Using Megablocks with block_size = 1:
    python3 -m tutel.examples.helloworld --megablocks_size=1 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0
    # Using Megablocks with block_size = 2:
    python3 -m tutel.examples.helloworld --megablocks_size=2 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0

  >> How to:
    self._moe_layer.forward(x, .., megablocks_size=1)         # Control the switch of megablocks_size (0 for disabled)
  • Tutel v0.2: Allow most configurations to be switched dynamically at no extra cost (see the combined sketch after this list):
  >> Example:
    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld_switch --batch_size=16

  >> How to:
    self._moe_layer.forward(x, .., a2a_ffn_overlap_degree=2)  # Control the switch of overlap granularity (1 for no overlapping)
    self._moe_layer.forward(x, .., adaptive_r=1)              # Control the switch of parallelism (0 for DP, 1 for DP + EP, W / E for MP + EP, else for DP + MP + EP)
    self._moe_layer.forward(x, .., capacity_factor=1)         # Control the switch of capacity_volume (positive for padding, negative for no-padding, 0 for dropless)
    self._moe_layer.forward(x, .., top_k=1)                   # Control the switch of top_k sparsity
  • Tutel v0.1: Optimize the Einsum Complexity of Data Dispatch Encoding and Decoding, and add a 2DH option to handle All-to-All at scale:
  >> Example (2DH is suggested only at scale; note that the value of --nproc_per_node MUST equal the total number of physical GPUs per node, e.g. 8 for A100x8):
    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --batch_size=16 --use_2dh
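
Combined example: the following is a minimal, untested sketch (not an official example) that strings the forward-time switches above together on one layer. The construction mirrors the "How to import Tutel-optimized MoE in Pytorch" example later on this page; all keyword names come from the notes above, and a CUDA device with num_local_experts >= 2 is assumed.

    import torch
    from tutel import moe as tutel_moe

    x = torch.ones([6, 1024], device='cuda')
    moe = tutel_moe.moe_layer(
        gate_type={'type': 'top', 'k': 2},
        model_dim=x.shape[-1],
        experts={'count_per_node': 2, 'type': 'ffn', 'hidden_size_per_expert': 2048,
                 'activation_fn': lambda t: torch.nn.functional.relu(t)},
    ).to('cuda')

    y = moe(x)                              # settings chosen at construction time
    y = moe(x, top_k=1)                     # switch top-k sparsity on the fly (v0.2)
    y = moe(x, capacity_factor=0)           # dropless routing (0); positive pads, negative avoids padding
    y = moe(x, a2a_ffn_overlap_degree=2)    # overlap all-to-all with half of the FFN (meaningful under multi-GPU)
    y = moe(x, megablocks_size=1)           # Megablocks path, documented for single-GPU decoder inference (v0.3)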

How to set up Tutel MoE for Pytorch 2 and run examples, or enable fairseq with MoE:

* Prepare the recommended Pytorch >= 2.0.0 (minimum version: 1.8.0):
        #  Windows/Linux Pytorch for NVIDIA CUDA >= 11.7:
        python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
        #  Linux Pytorch for AMD ROCm == 5.4.2:
        python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
        #  Windows/Linux Pytorch for CPU:
        python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

* Install Tutel Online:

        $ python3 -m pip uninstall tutel -y
        $ python3 -m pip install setuptools wheel
        $ python3 -m pip install -v -U --no-build-isolation git+https://github.com/microsoft/tutel@main

* Build Tutel from Source:

        $ git clone https://github.com/microsoft/tutel --branch main

        $ python3 -m pip uninstall tutel -y
        $ python3 ./tutel/setup.py install --user

* Quick Test on Single-GPU:

        $ python3 -m tutel.examples.helloworld --batch_size=16               # Test Tutel-optimized MoE + manual distribution
        $ python3 -m tutel.examples.helloworld_ddp --batch_size=16           # Test Tutel-optimized MoE + Pytorch DDP distribution (requires: Pytorch >= 1.8.0)
        $ python3 -m tutel.examples.helloworld_ddp_tutel --batch_size=16     # Test Tutel-optimized MoE + Tutel DDP distribution (ZeRO on optimizers)
        $ python3 -m tutel.examples.helloworld_amp --batch_size=16           # Test Tutel-optimized MoE with AMP data type + manual distribution
        $ python3 -m tutel.examples.helloworld_demo --batch_size=16          # Test Tutel-optimized MoE + custom defined expert layer
        $ python3 -m tutel.examples.helloworld_from_scratch                  # Test Custom MoE implementation from scratch
        $ python3 -m tutel.examples.moe_mnist                                # Test MoE layer in end-to-end MNIST dataset
        $ python3 -m tutel.examples.moe_cifar10                              # Test MoE layer in end-to-end CIFAR10 dataset

        (If building from source, the following method also works:)
        $ python3 ./tutel/examples/helloworld.py --batch_size=16
        ..

* Run Tutel MoE in Distributed Mode:

        (Method A - Torch launcher for `Multi-Node x Multi-GPU`:)
        $ ssh <node-ip-0> python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=<node-ip-0> -m tutel.examples.helloworld --batch_size=16
        $ ssh <node-ip-1> python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=<node-ip-0> -m tutel.examples.helloworld --batch_size=16

        (Method B - Tutel launcher for `Multi-Node x Multi-GPU`, requiring package `openmpi-bin`:)
        # << Single Node >>
        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld_ddp_tutel --batch_size=16
        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.moe_mnist
        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.moe_cifar10
        ...

        # << Cross Nodes >>
        $ mpiexec -bind-to none -host <node-ip-0>,<node-ip-1>,.. -x MASTER_ADDR=<node-ip-0> -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld --batch_size=16

        # << For CPU-based Launch>>
        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=1 -x OMP_NUM_THREADS=1024 python3 -m tutel.launcher.run -m tutel.examples.helloworld --batch_size=16 --device cpu

How to convert checkpoint files to adapt to different distributed world sizes:

# First, use 2 GPUs to train a model with 16 global experts (each GPU holds 8 local experts), saving checkpoint files at the end:
mpiexec -bind-to none -host localhost -x LOCAL_SIZE=2 python3 -m tutel.launcher.run -m tutel.examples.helloworld --num_local_experts=8 --checkpoint=./states/{rank}-of-{size}.ckpt --device=cuda

# Second, convert the checkpoint files (based on 2 GPUs) into a single checkpoint file containing all parameters:
python3 -m tutel.checkpoint.gather --inputs=./states/{rank}-of-{size}.ckpt --input_size=2 --output ./model-synthesis.ckpt

# Optionally, you can test the synthesized checkpoint on a single CPU device; note that there will be 16 local experts:
python3 -m tutel.examples.helloworld --num_local_experts=16 --checkpoint=./model-synthesis.ckpt --device=cpu --eval

# Next, convert the synthesized checkpoint file to adapt to distributed training on 8 GPUs:
python3 -m tutel.checkpoint.scatter --input=./model-synthesis.ckpt --output_size=8 --outputs=./adapted-for-8-gpus/{rank}-of-{size}.ckpt

# Then, use the generated checkpoint files to train/eval on 8 GPUs; note that there will be 2 local experts this time:
mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld --num_local_experts=2 --checkpoint=./adapted-for-8-gpus/{rank}-of-{size}.ckpt --device=cuda

# Similarly, the conversion tool also supports adapting X global experts to Y GPUs where Y % X == 0, making num_local_experts equal to -Y / X.
python3 -m tutel.checkpoint.scatter --input=./model-synthesis.ckpt --output_size=32 --outputs=./adapted-for-32-gpus/{rank}-of-{size}.ckpt
mpiexec -bind-to none -host localhost -x LOCAL_SIZE=32 python3 -m tutel.launcher.run -m tutel.examples.helloworld --num_local_experts=-2 --checkpoint=./adapted-for-32-gpus/{rank}-of-{size}.ckpt --device=cuda

How to import Tutel-optimized MoE in Pytorch:

# Input Example:
import torch
x = torch.ones([6, 1024], device='cuda:0')

# Create MoE:
from tutel import moe as tutel_moe
moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=x.shape[-1],
    experts={
        'count_per_node': 2,
        'type': 'ffn', 'hidden_size_per_expert': 2048, 'activation_fn': lambda x: torch.nn.functional.relu(x)
    },
    scan_expert_func = lambda name, param: setattr(param, 'skip_allreduce', True),
)

# Cast to GPU
moe_layer = moe_layer.to('cuda:0')

# In distributed mode, you additionally need to skip all-reduce on the parameters that carry the `skip_allreduce` mask (a fuller sketch follows this example),
# e.g.
#    for p in moe_layer.parameters():
#        if hasattr(p, 'skip_allreduce'):
#            continue
#        dist.all_reduce(p.grad)


# Forward MoE:
y = moe_layer(x)

print(y)
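
The `skip_allreduce` comment above can be expanded into a small helper. The following is an illustrative sketch only (not part of the Tutel API), assuming `torch.distributed` has already been initialized (e.g. via `torch.distributed.run`) and that a backward pass has populated `.grad`:

import torch.distributed as dist

def allreduce_shared_grads(module, world_size):
    # Average gradients of shared parameters across ranks, but leave
    # expert-local parameters (marked by scan_expert_func above) untouched.
    for p in module.parameters():
        if p.grad is None:
            continue
        if hasattr(p, 'skip_allreduce'):   # expert parameter: keep it rank-local
            continue
        dist.all_reduce(p.grad)            # sum shared-parameter gradients...
        p.grad.div_(world_size)            # ...then average them

# Usage after loss.backward():
#   allreduce_shared_grads(moe_layer, dist.get_world_size())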

Usage of MOELayer:

* Usage of MOELayer Args:

        gate_type        : dict-type gate description, e.g. {'type': 'top', 'k': 2, 'capacity_factor': -1.5, ..},
                              or a list of dict-type gate descriptions, e.g. [{'type': 'top', 'k': 2}, {'type': 'top', 'k': 2}],
                              the value of k in top-gating can also be negative, e.g. -2, which indicates that each GPU holds 1/(-k) of an expert's parameters;
                              capacity_factor X can be positive (factor = X), zero (factor = max(needed_volumes)) or negative (factor = min(-X, max(needed_volumes))).
        model_dim        : the number of channels for MOE's input tensor
        experts          : a dict-type config for builtin expert network
        scan_expert_func : allow users to specify a lambda function to iterate over each expert parameter, e.g. `scan_expert_func = lambda name, param: setattr(param, 'expert', True)`
        result_func      : allow users to specify a lambda function to format the MoE output and aux_loss, e.g. `result_func = lambda output: (output, output.l_aux)`
        group            : specify the explicit communication group of all_to_all
        seeds            : a tuple of three ints specifying the manual seeds for (shared params, local params, other params after the MoE layer)
        a2a_ffn_overlap_degree : the value to control a2a overlap depth; 1 by default for no overlap, 2 to overlap a2a with half of the gemm, ..
        parallel_type    : the parallel method to compute MoE, valid types: 'auto', 'data', 'model'
        pad_samples      : whether to auto-pad incoming input data to the maximum data size seen so far

* Usage of dict-type Experts Config:

        count_per_node   : the number of local experts per device (by default, the value is 1 if not specified)
        type             : available built-in experts implementation, e.g: ffn
        hidden_size_per_expert : the hidden size between two linear layers for each expert (used for type == 'ffn' only)
        activation_fn    : the custom-defined activation function between two linear layers (used for type == 'ffn' only)
        has_fc1_bias     : If set to False, the expert bias parameter `batched_fc1_bias` is disabled. Default: True
        has_fc2_bias     : If set to False, the expert bias parameter `batched_fc2_bias` is disabled. Default: True
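
For reference, below is a hedged construction sketch that exercises most of the arguments documented above. The values are arbitrary illustrations rather than recommendations, and the overall shape mirrors the import example earlier on this page:

        import torch
        from tutel import moe as tutel_moe

        moe = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 2, 'capacity_factor': -1.5},  # negative X: factor = min(1.5, max needed volume)
            model_dim=1024,
            experts={
                'count_per_node': 2,                       # 2 local experts per device
                'type': 'ffn',
                'hidden_size_per_expert': 2048,
                'activation_fn': torch.nn.functional.gelu,
                'has_fc1_bias': True,
                'has_fc2_bias': False,                     # disable the batched_fc2_bias parameter
            },
            scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
            result_func=lambda output: (output, output.l_aux),
            seeds=(1, 2, 3),                               # (shared params, local params, params after the MoE layer)
            a2a_ffn_overlap_degree=1,                      # 1 = no all-to-all/FFN overlap
            parallel_type='auto',
        )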

Reference

You can consult the paper below for more technical details about Tutel:

@article {tutel,
author = {Changho Hwang and Wei Cui and Yifan Xiong and Ziyue Yang and Ze Liu and Han Hu and Zilong Wang and Rafael Salas and Jithin Jose and Prabhat Ram and Joe Chau and Peng Cheng and Fan Yang and Mao Yang and Yongqiang Xiong},
title = {Tutel: Adaptive Mixture-of-Experts at Scale},
year = {2022},
month = jun,
journal = {CoRR},
volume= {abs/2206.03382},
url = {https://arxiv.org/pdf/2206.03382.pdf},
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

tutel's People

Contributors

abuccts, ericwangcn, foreveronehundred, ghostplant, guoshzhao, harsh-sensei, jspark1105, microsoft-github-operations[bot], microsoftopensource, msftsw, ngoyal2707, vchiley, yzygitzh, zeliu98

tutel's Issues

tutel is slower than the naive p2p using 2DH for small scale

Hello, and thank you for providing this excellent codebase. I have a question regarding our experiment, where we observed that 2DH inter-node communication is slower than the naive peer-to-peer approach, despite both methods transmitting the same amount of data. This observation contradicts our initial intuition, and I am curious about the potential reasons behind it.

Multi-node training is much slower than single-node training

Hi, when I train models using tutel, I find that each step of multi-node training takes much more time than a single-node step (with n nodes, a step takes around n times the single-node step time). As a result, multi-node training takes even longer than single-node training to finish one epoch.
Any debugging suggestions for this issue?
Thanks!!!

Cannot Import JIT optimized kernels?

I have used Tutel for a while and it worked fine. But today I updated my environment and reinstalled tutel, and it crashed while importing the module. Do you have any idea why this happens? Thanks!

>>> from tutel import moe as tutel_moe
Traceback (most recent call last):
  File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 8, in <module>
    import tutel_custom_kernel
ImportError: /mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel_custom_kernel.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/moe.py", line 6, in <module>
    from .jit_kernels.gating import fast_cumsum_sub_one
  File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/jit_kernels/gating.py", line 7, in <module>
    from ..impls.jit_compiler import tutel_custom_kernel
  File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 10, in <module>
    raise Exception("Cannot import JIT optimized kernels. Did you forget to install Custom Kernel Extension?")
Exception: Cannot import JIT optimized kernels. Did you forget to install Custom Kernel Extension?

How can I install this package in a conda environment?

It threw "ERROR: Microsoft Visual C++ 14.0 or greater is required."
I had tried "conda install libpython m2w64-toolchain -c msys2" and installed "Microsoft Visual C++ Build Tools", as some blogs suggest to solve it.
What should I do?

[installation errors] fatal error: nccl.h: No such file or directory

running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.9
creating build/lib.linux-x86_64-3.9/tutel
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/net.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/__init__.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/jit.py -> build/lib.linux-x86_64-3.9/tutel
copying tutel/system.py -> build/lib.linux-x86_64-3.9/tutel
creating build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/gating.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/__init__.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
copying tutel/jit_kernels/sparse.py -> build/lib.linux-x86_64-3.9/tutel/jit_kernels
creating build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_ddp.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/moe_mnist.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_ddp_tutel.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/moe_cifar10.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/__init__.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_amp.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_from_scratch.py -> build/lib.linux-x86_64-3.9/tutel/examples
copying tutel/examples/helloworld_deepspeed.py -> build/lib.linux-x86_64-3.9/tutel/examples
creating build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/execl.py -> build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/__init__.py -> build/lib.linux-x86_64-3.9/tutel/launcher
copying tutel/launcher/run.py -> build/lib.linux-x86_64-3.9/tutel/launcher
creating build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/top.py -> build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/__init__.py -> build/lib.linux-x86_64-3.9/tutel/gates
copying tutel/gates/cosine_top.py -> build/lib.linux-x86_64-3.9/tutel/gates
creating build/lib.linux-x86_64-3.9/tutel/custom
copying tutel/custom/__init__.py -> build/lib.linux-x86_64-3.9/tutel/custom
creating build/lib.linux-x86_64-3.9/tutel/experts
copying tutel/experts/ffn.py -> build/lib.linux-x86_64-3.9/tutel/experts
copying tutel/experts/__init__.py -> build/lib.linux-x86_64-3.9/tutel/experts
creating build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/fast_dispatch.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/moe_layer.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/losses.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/jit_compiler.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/overlap.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/__init__.py -> build/lib.linux-x86_64-3.9/tutel/impls
copying tutel/impls/communicate.py -> build/lib.linux-x86_64-3.9/tutel/impls
creating build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/patterns.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/spmdx.py -> build/lib.linux-x86_64-3.9/tutel/parted
copying tutel/parted/solver.py -> build/lib.linux-x86_64-3.9/tutel/parted
creating build/lib.linux-x86_64-3.9/tutel/parted/backend
copying tutel/parted/backend/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend
creating build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/executor.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/__init__.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
copying tutel/parted/backend/torch/config.py -> build/lib.linux-x86_64-3.9/tutel/parted/backend/torch
running build_ext
building 'tutel_custom_kernel' extension
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel
creating /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom
Emitting ninja build file /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /mnt/lustre/anaconda3/envs/scale/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -I/mnt/lustre/anaconda3/envs/scale/include -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/lustre/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o
c++ -MMD -MF /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o.d -pthread -B /mnt/lustre/anaconda3/envs/scale/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -I/mnt/lustre/anaconda3/envs/scale/include -fPIC -O2 -isystem /mnt/lustre/anaconda3/envs/scale/include -fPIC -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/TH -I/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/lustre/anaconda3/envs/scale/include/python3.9 -c -c /mnt/lustre/tutel/tutel/custom/custom_kernel.cpp -o /mnt/lustre/tutel/build/temp.linux-x86_64-3.9/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
/mnt/lustre/tutel/tutel/custom/custom_kernel.cpp:19:10: fatal error: nccl.h: No such file or directory
   19 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Try installing without NCCL extension..
running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
Try installing without CUDA extension..
running install
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/mnt/lustre/anaconda3/envs/scale/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
reading manifest file 'tutel.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying tutel/moe.py -> build/lib.linux-x86_64-3.9/tutel
error: could not create 'build/lib.linux-x86_64-3.9/tutel/moe.py': No such file or directory

Problem from applying pipeline parallel with Tutel's cumsum

I just got an error when applying Torch pipeline parallelism with Tutel's cumsum. The problem is caused by a non-zero return value of cuLaunchKernel in tutel/tutel/custom/custom_kernel.cpp. To investigate the issue, I added a printf to tutel/tutel/custom/custom_kernel.cpp to print that return value and found that it is CUDA_ERROR_INVALID_HANDLE (== 400).

From NVIDIA's docs, I found the following information about CUDA_ERROR_INVALID_HANDLE:
This indicates that a resource handle passed to the API call was not valid. Resource handles are opaque types like CUstream and CUevent

Here is the sample code to reproduce the problem,

from tutel.jit_kernels.gating import fast_cumsum_sub_one
from torch.distributed.pipeline.sync import Pipe

import torch

def _initialize_pipeline_parallel(num_nodes, nstages):
    '''
    Initialize pipelining framework.
    Args:
      nstages : Total number of pipeline stages.
    '''
    #print("==> inside initialize-pipeline_parallel :")
    assert num_nodes == 1, "Only single node pipeline parallelism is supported right now"
    from torch.distributed import rpc
    import tempfile
    tmpfile = tempfile.NamedTemporaryFile()
    rpc.init_rpc(
        name="grid_worker",
        rank=0,
        world_size=num_nodes,
        rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
            init_method="file://{}".format(tmpfile.name)
        )
    )

class CumsumModule(torch.nn.Module):
    def __init__(self):
        super(CumsumModule, self).__init__()
        self.param = torch.nn.Parameter(torch.ones(5, 5))
    def forward(self, x):
        x = x + self.param
        x_cumsum = fast_cumsum_sub_one(x, dim=0)
        return x_cumsum

_initialize_pipeline_parallel(1, 2)

input = torch.randint(0, 5, (5, 5), device='cuda:0')
cumsum_module0 = CumsumModule().to(device='cuda:0')
cumsum_module1 = CumsumModule().to(device='cuda:1')
model = torch.nn.Sequential(cumsum_module0, cumsum_module1)
model = Pipe(model)
output = model(input)

The following are the error messages:

Traceback (most recent call last):
  File "Reproduce.py", line 42, in <module>
    output = model(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/pipeline/sync/pipe.py", line 362, in forward
    self.pipeline.run(batches)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/pipeline/sync/pipeline.py", line 117, in run
    self.compute(batches, schedule, skip_trackers)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/pipeline/sync/pipeline.py", line 257, in compute
    raise exc_info[0].with_traceback(exc_info[1], exc_info[2])
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/pipeline/sync/worker.py", line 79, in worker
    batch = task.compute()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/pipeline/sync/worker.py", line 60, in compute
    return self._compute()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/pipeline/sync/pipeline.py", line 222, in compute
    return batch.call(partition)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/pipeline/sync/microbatch.py", line 70, in call
    return Batch(function(self.value))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "Reproduce.py", line 32, in forward
    x_cumsum = fast_cumsum_sub_one(x, dim=0)
  File "/home/chutsai/.local/lib/python3.8/site-packages/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
    return get_cumsum_kernel(data.size(0), data.size(1))(data)
  File "/home/chutsai/.local/lib/python3.8/site-packages/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/home/chutsai/.local/lib/python3.8/site-packages/tutel/impls/jit_compiler.py", line 41, in func
    tutel_custom_kernel.invoke(inputs, __ctx__)
RuntimeError: (0) == (cuLaunchKernel(gm.hFunc, gm.blocks.x, gm.blocks.y, gm.blocks.z, gm.threads.x, gm.threads.y, gm.threads.z, 0, nullptr, ppargs.data(), nullptr)) INTERNAL ASSERT FAILED at "/tmp/pip-req-build-u7pfjlvx/tutel/custom/custom_kernel.cpp":146, please report a bug to PyTorch. CHECK_EQ fails.

Please help me solve this issue, or tell me if I did not use Tutel correctly. Thanks.

question about how to set data Parallelism

Thanks for your contributions. Tutel is definitely great work!
But I have difficulty applying Tutel to a new framework.
Suppose I have 8 GPUs and I want to set the total number of experts to 4, with 2 local experts per device (GPU).
The GPUs would then be divided into 4 groups, each of which has 2 GPUs and contains all 4 experts.
I think GPU0, GPU2, GPU4, and GPU6 would hold experts with the same parameters.
How can I implement this setting?

Pretrained MoE model

Hi, thanks for providing such wonderful work. However, I am curious: will you consider providing pretrained MoE models (e.g. ViT on ImageNet or machine translation tasks)?

numpy not in requirements

After following the instructions to build from source, I noticed that numpy is needed for the megablocks example:

$ python3 -m tutel.examples.helloworld --megablocks_size=1 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0
/mnt/localdisk/tutel/tutel/impls/jit_compiler.py:19: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  tutel_custom_kernel.update_sdk_home(torch.tensor([ord(x) for x in SDK_HOME] + [0], dtype=torch.int8, device='cpu'))
CRITICAL:root:Registering device global rank 0: data_rank = 0, model_rank = 0
[Statistics] param count for MoE local_experts = 1074266112, param count for MoE gate = 262144.

ExampleModel(
  (_moe_layer): MOELayer(
    Top-K(s) = ['k=1, noise=0.0'], Total-Experts = 128 [managed by 1 device(s)],
    (experts): FusedExpertsNetwork(model_dim=2048, hidden_size=2048, output_dim=2048, local_experts=128)
    (gates): ModuleList(
      (0): LinearTopKGate(
        (wg): Linear(in_features=2048, out_features=128, bias=False)
      )
    )
  )
)
Traceback (most recent call last):
  File "/home/ubuntu/micromamba/envs/tutel/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/micromamba/envs/tutel/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/localdisk/tutel/tutel/examples/helloworld.py", line 108, in <module>
    x = torch.tensor(torch.randn([batch_size, num_tokens, model_dim], dtype=torch.float32, device='cpu').detach().numpy(), dtype=torch.get_default_dtype(), requires_grad=False, device=device)
RuntimeError: Numpy is not available

but it is not installed by setup.py because it is not listed in install_requires.

Recommendation: add numpy there, and maybe also add ninja as an optional dependency for building (not sure how this could be done in practice).

[Bug] The function func_fwd is computed inconsistently on the CPU and GPU

The tutel/fast_dispatch.py file calls the function func_fwd, but the CPU and CUDA implementations of this function are inconsistent. Here is the code implementation of this function on CPU and CUDA:

// cpu code
// code on tutel/custom/custom_kernel.cpp
    for (int i = 0; i < samples; ++i) {
      if (locations1_s[i] < capacity && indices1_s[i] >= 0) {
        for (int j = 0; j < hidden; ++j) {
          dispatched_input[(indices1_s[i] * capacity + locations1_s[i]) * (hidden) + j] += gates1_s[i] * reshaped_input[i * (hidden) + j];
        }
      }
    }

//cuda code
// code on tutel/jit_kernels/sparse.py
      for (int i = blockIdx.x; i < samples; i += gridDim.x)
          if (locations1_s[i] < capacity && indices1_s[i] >= 0) {
              #pragma unroll
              for (int j = threadIdx.x; j < hidden; j += 1024)
                  dispatched_input[(indices1_s[i] * capacity + locations1_s[i]) * (hidden) + j] = gates1_s[i] * reshaped_input[i * (hidden) + j];
          }

The dispatched_input is computed differently in the two implementations:
on the CPU, dispatched_input += gates1_s * reshaped_input, while on CUDA, dispatched_input = gates1_s * reshaped_input.
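
A small illustration (added here for clarity; not part of the original report): the two formulas diverge whenever two samples map to the same (expert index, capacity location) slot, since accumulation sums both contributions while plain assignment keeps only the last write.

import torch

# Illustration only: accumulate (CPU-style "+=") vs overwrite (CUDA-style "=").
hidden, capacity = 2, 2
indices1_s = torch.tensor([0, 0])        # both samples routed to expert 0
locations1_s = torch.tensor([0, 0])      # ...and to the same capacity slot
gates1_s = torch.tensor([0.5, 0.25])
reshaped_input = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

acc = torch.zeros(1 * capacity, hidden)
ovw = torch.zeros(1 * capacity, hidden)
for i in range(2):
    slot = indices1_s[i] * capacity + locations1_s[i]
    acc[slot] += gates1_s[i] * reshaped_input[i]   # CPU path: accumulate
    ovw[slot] = gates1_s[i] * reshaped_input[i]    # CUDA path: overwrite
print(acc[0])   # tensor([1.2500, 2.0000]) = 0.5*[1, 2] + 0.25*[3, 4]
print(ovw[0])   # tensor([0.7500, 1.0000]) = only the last write survives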

bp of shared parameters and experts

The DDP in pytorch cannot distinguish experts from other shared parameters, so experts may be updated with the shared (all-reduced) gradient.
The TutelDistributedOptimizer seems to be an implementation of ZeRO, which does not affect the gradient. How does tutel deal with this problem?

INTERNAL ASSERT FAILED at custom_kernel.cpp

My branch is v0.1.3, and I sometimes got the following assertion failure when I used fast_cumsum_sub_one:

Traceback (most recent call last):
  File "test_tutel.py", line 5, in <module>
    cumsum_tutel = fast_cumsum_sub_one(matrix, dim=0) + 1
  File "/home/xxx/.local/lib/python3.8/site-packages/tutel-0.1-py3.8-linux-x86_64.egg/tutel/jit_kernels/gating.py", line 81, in fast_cumsum_sub_one
    return get_cumsum_kernel(data.size(0), data.size(1))(data)
  File "/home/xxx/.local/lib/python3.8/site-packages/tutel-0.1-py3.8-linux-x86_64.egg/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/home/xxx/.local/lib/python3.8/site-packages/tutel-0.1-py3.8-linux-x86_64.egg/tutel/impls/jit_compiler.py", line 39, in func
    tutel_custom_kernel.invoke_with_source(inputs, __ctx__, use_nvrtc, source)
RuntimeError: (true) == (fp != nullptr) INTERNAL ASSERT FAILED at "/home/xxx/tutel/tutel/custom/custom_kernel.cpp":33, please report a bug to PyTorch. CHECK_EQ fails.

To reproduce the problem, I created a python file (test_tutel.py):

#test_tutel.py
import torch
from tutel.jit_kernels.gating import fast_cumsum_sub_one

matrix = torch.randint(0, 100, (10000, 100), device='cuda')
cumsum_tutel = fast_cumsum_sub_one(matrix, dim=0) + 1

Please use the following script to run it,

git clone https://github.com/microsoft/tutel --branch v0.1.3
python3 ./tutel/setup.py install --user
mpirun -n 8 python test_tutel.py

Run it repeatedly; I think the error will appear within 20 attempts.

INTERNAL ASSERT FAILED

Hi there,
When I ran a quick test "python3 -m tutel.examples.helloworld --batch_size=16", it showed the following error:
RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp":46, please report a bug to PyTorch. CHECK_EQ fails.
Could you help me fix it? Thanks.

RuntimeError: No such operator tutel_ops::cumsum

Hello, thanks for providing such great work. However, I cannot use tutel successfully. I have followed the library installation steps:

* Install Pytorch for NVIDIA CUDA >= 11.3:
        $ python3 -m pip install --user torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
       

* Install Tutel Online:

        $ python3 -m pip uninstall tutel -y
        $ python3 -m pip install --user --upgrade git+https://github.com/microsoft/tutel@main
        $ python3 ./tutel/setup.py install --user

But when I try the following test:

* Quick Test on Single-GPU:

        $ python3 -m tutel.examples.helloworld --batch_size=16               # Test Tutel-optimized MoE + manual distribution

The following error is reported:

Traceback (most recent call last):
  File "/root/miniconda3/envs/widenet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/widenet/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/tutel-main/tutel/examples/helloworld.py", line 120, in <module>
    output = model(x)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/tutel-main/tutel/examples/helloworld.py", line 85, in forward
    result = self._moe_layer(input)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/tutel-main/tutel/impls/moe_layer.py", line 267, in forward
    logits_dtype, (crit, l_aux) = routing()
  File "/root/tutel-main/tutel/impls/moe_layer.py", line 261, in routing
    inequivalent_tokens = inequivalent_tokens,
  File "/root/tutel-main/tutel/impls/fast_dispatch.py", line 158, in extract_critical
    locations1 = compute_location(masks_se[0])
  File "/root/tutel-main/tutel/jit_kernels/gating.py", line 22, in fast_cumsum_sub_one
    return torch.ops.tutel_ops.cumsum(data)
  File "/root/.local/lib/python3.7/site-packages/torch/_ops.py", line 63, in __getattr__
    op = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator tutel_ops::cumsum

All2All precision always in fp32

This example shows how to use TutelMoE with Torch autocast amp.

Q: Is the All2All precision still meant to be done in FP32?

In general torch autocast amp keeps network master weights in FP32 and downcasts weights before a layers fwd pass.
So within an autocast context (as suggested here) the matmul here will be autocast to fp16
(Note torch.add is done in fp32; list of ops which are autocast; by default ops upcast to the input with highest precision; since batched_fc1_bias is fp32, the add will be done in fp32 and will output an fp32 answer)
So far everything is just standard torch.

My question is about these few lines of code. Since expert weights are in fp32, this will upcast input x to type fp32.
As a result the All2All communication is done using fp32 inputs.
Is this correct or am I missing some other cast?
(Note: in the cast at the end of this line, x has already been cast to fp32).

It looks like the all2all is ALWAYS done using fp32 precision even if we are using an amp autocast context manager. Was this done deliberately or is this a bug? It seems like if the all2all is done using 16 bits we'd save 2x the BW.

Final note: as mentioned, if in the autocast context manager, although the all2all is done using fp32, autocast is still on, and therefore the matmuls here are done using fp16.

Potential Bug: I'm not sure this does anything... That layer should already be in fp32, and when it's run here autocast should still run it in fp16...
I THINK the right way to do this is something like this:

    def forward(self, x):
        if self.fp32_gate:
            x = x.float()
        with torch.autocast(device_type=x.device.type, enabled=not self.fp32_gate):
            out = self.wg(x)
            return out

Autocast is disabled here.
I'm pretty sure the suggested rewrite for gate autocast is correct and more understandable.

tutel/jit_kernels/sparse.py torch.float16 calculation bug: the CUDA result is inconsistent with the CPU result and the array access is out of bounds

Code:

import numpy as np
import torch
from tutel.jit_kernels import sparse as jit_kernel

print(torch.__version__)

def moe_dispatch_bwd_gate():
    samples = 2
    capacity = 2
    hidden = 2
    num_experts = 1
    indices = [0, 0]
    locations = [0, 0]
    input = [0.4946, -0.0043, 0.5386, -0.8354]
    dispatch = [0.7085, 0.8257, -0.1455, -0.1788]
    # int32
    indices_t = np.asarray(indices, dtype=np.int32)
    locations_t = np.asarray(locations, dtype=np.int32)
    # float / half
    input_t = np.asarray(input, dtype=np.float16)
    dispatch_t = np.asarray(dispatch, dtype=np.float16)
    indices_gpu = torch.from_numpy(indices_t).cuda()
    locations_gpu = torch.from_numpy(locations_t).cuda()
    input_gpu = torch.from_numpy(input_t).cuda()
    dispatch_gpu = torch.from_numpy(dispatch_t).cuda()
    print("cuda:")
    print("indices_gpu:", indices_gpu)
    print("locations_gpu:", locations_gpu)
    print("input_gpu:", input_gpu)
    print("dispatch_gpu:", dispatch_gpu)
    # call gpu func
    grad_gates = torch.zeros([samples], dtype=input_gpu.dtype, device=input_gpu.device)
    moe_dispatch_bwd_gate = jit_kernel.create_backward_gate(input_gpu.dtype, input_gpu.is_cuda)
    moe_dispatch_bwd_gate(grad_gates, indices_gpu, locations_gpu, input_gpu, dispatch_gpu, extra=[samples, hidden, capacity])
    print("grad_gates:", grad_gates)
    # call cpu func
    input_t = np.asarray(input, dtype=np.float32)
    dispatch_t = np.asarray(dispatch, dtype=np.float32)
    indices_cpu = torch.from_numpy(indices_t)
    locations_cpu = torch.from_numpy(locations_t)
    input_cpu = torch.from_numpy(input_t)
    print("cpu:")
    # print("input_cpu:", input_cpu)
    dispatch_cpu = torch.from_numpy(dispatch_t)
    grad_gates_cpu = torch.zeros([samples], dtype=input_cpu.dtype, device=input_cpu.device)
    moe_dispatch_bwd_gate = jit_kernel.create_backward_gate(input_cpu.dtype, input_cpu.is_cuda)
    moe_dispatch_bwd_gate(grad_gates_cpu, indices_cpu, locations_cpu, input_cpu, dispatch_cpu, extra=[samples, hidden, capacity])
    print("grad_gates_cpu:", grad_gates_cpu)

if __name__ == '__main__':
    moe_dispatch_bwd_gate()

Problem: cuda calculation result is inconsistent with CPU calculation result:
cuda:[0.4180, 0.0000]
cpu:[ 0.3469, -0.3082]

CUDA calculation process analysis:

When index=0, the gradient of the first gate is computed.

Since dispatched_input and reshaped_input are of type half2, which is equivalent to a float pointer, when i=0 the dispatch subscript index * (hidden) + i = 0 and the input subscript index * (hidden) + i = 0 take the first two half values, and the result of the calculation is accumulated into grad_gates1_s_rf.

Values read: dispatch=[0.7085, 0.8257], input=[0.4946, -0.0043]

i=0 result: grad_gates1_s_rf = 0.7085 * 0.4946 + 0.8257 * (-0.0043) = 0.34687359

When i=1, the dispatch subscript index * (hidden) + i = 1 and the input subscript index * (hidden) + i = 1 take the last two half values, which are also added to the first gate's gradient.

Values read: dispatch=[-0.1455, -0.1788], input=[0.5386, -0.8354]

i=1 result: grad_gates1_s_rf += 0.5386 * (-0.1455) + (-0.8354) * (-0.1788) = 0.07100322

Final result: grad_gates1_s_rf = 0.34687359 + 0.07100322 = 0.41787681

When index=1, the gradient of the second gate is computed. The initial input subscript is 2, so the array access goes out of bounds; the illegal address may read 0, which yields a second gradient of 0.

Example on saving experts to one model when using distributed training

Hi, thanks for providing such a wonderful codebase.

I have seen and used the save & load of MoE on multiple GPUs, and now I can save them on different ranks. But is there a way to convert them into one model?

Say I trained an 8-expert MoE on 8 GPUs, and now I want to do next-stage inference on 1 GPU.

Would you consider providing an example of doing so, or could you share some ideas on how to implement it myself?

Cannot compile tutel kernels and got runtime error

I have installed tutel on my machine and have set up the related environment variables, such as $CUDA_HOME and $CFLAGS.
However, when I tried to run examples/helloworld.py, I got the following error:

[E custom_kernel.cpp:124] default_program(1): catastrophic error: cannot open source file "cuda_runtime.h"

1 catastrophic error detected in the compilation of "default_program".
Compilation terminated.
Failed to use NVRTC for JIT compilation in this Pytorch version, try another approach using CUDA compiler.. (To always disable NVRTC, please: export USE_NVRTC=0)

File "/private/home/hyhuang/.local/lib/python3.9/site-packages/tutel/impls/jit_compiler.py", line 26, in func
tutel_custom_kernel.invoke(inputs, ctx)
RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/tmp/pip-req-build-pcbbciia/tutel/custom/custom_kernel.cpp":40, please report a bug to PyTorch. CHECK_EQ fails.

I am using PyTorch 1.10.1 + CUDA 11.3. Is there any other parameter I should fix to use tutel?

fast_cumsum_sub_one fails when the module is wrapped by ORTModule

As the title says, I got the following errors when my module is wrapped with ORTModule:

[E custom_kernel.cpp:123] default_program(14): error: identifier "tensor" is undefined

1 error detected in the compilation of "default_program".
 Failed to use NVRTC for JIT compilation in this Pytorch version, try another approach using CUDA compiler.. (To always disable NVRTC, please: export USE_NVRTC=0)
/tmp/torch-tutel-o0geuH.cu(14): error: identifier "tensor" is undefined

1 error detected in the compilation of "/tmp/torch-tutel-o0geuH.cu".
/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_training_manager.py:224: UserWarning: Fast path enabled - skipping checks. Rebuild graph: True, Execution agent: True, Device check: True
  f" Device check: {self._skip_check.is_set(_SkipCheck.SKIP_CHECK_DEVICE)}", UserWarning)
RuntimeError: There was an error while exporting the PyTorch model to ONNX:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 389, in _get_exported_model
    **self._export_extra_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/__init__.py", line 280, in export
    custom_opsets, enable_onnx_checker, use_external_data_format)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 94, in export
    use_external_data_format=use_external_data_format)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 695, in _export
    dynamic_axes=dynamic_axes)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 459, in _model_to_graph
    _retain_param_name)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 422, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 373, in _trace_and_get_graph_from_model
    torch.jit._get_trace_graph(model, args, strict=False, _force_outplace=False, _return_inputs_states=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 1160, in _get_trace_graph
    outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 132, in forward
    self._force_outplace,
  File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 118, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_io.py", line 430, in forward
    return _transform_output_to_flat_tuple(self._original_module(*new_args, **new_kwargs))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "test_ort.py", line 17, in forward
    x_cumsum = fast_cumsum_sub_one(x, dim=0)
  File "/opt/conda/lib/python3.7/site-packages/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
    return get_cumsum_kernel(data.size(0), data.size(1))(data)
  File "/opt/conda/lib/python3.7/site-packages/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/opt/conda/lib/python3.7/site-packages/tutel/impls/jit_compiler.py", line 26, in func
    tutel_custom_kernel.invoke(inputs, __ctx__)
RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/tmp/pip-req-build-qjgbz25n/tutel/custom/custom_kernel.cpp":39, please report a bug to PyTorch. CHECK_EQ fails.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_ort.py", line 24, in <module>
    output = cumsum_module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_utils.py", line 309, in _forward
    return ortmodule._torch_module.forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_utils.py", line 289, in _forward
    torch_module_ort.is_training()).forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 292, in forward
    log_level=self._debug_options.logging.log_level)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 151, in handle_exception
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 231, in forward
    build_gradient_graph = self._export_model(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 322, in _export_model
    schema, *inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 392, in _get_exported_model
    RuntimeError(f'There was an error while exporting the PyTorch model to ONNX: '
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_fallback_exceptions.py", line 72, in wrap_exception
    raise new_exception(raised_exception) from raised_exception
onnxruntime.training.ortmodule._fallback_exceptions.ORTModuleONNXModelException: There was an error while exporting the PyTorch model to ONNX:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 389, in _get_exported_model
    **self._export_extra_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/__init__.py", line 280, in export
    custom_opsets, enable_onnx_checker, use_external_data_format)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 94, in export
    use_external_data_format=use_external_data_format)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 695, in _export
    dynamic_axes=dynamic_axes)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 459, in _model_to_graph
    _retain_param_name)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 422, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 373, in _trace_and_get_graph_from_model
    torch.jit._get_trace_graph(model, args, strict=False, _force_outplace=False, _return_inputs_states=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 1160, in _get_trace_graph
    outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 132, in forward
    self._force_outplace,
  File "/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py", line 118, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule/_io.py", line 430, in forward
    return _transform_output_to_flat_tuple(self._original_module(*new_args, **new_kwargs))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1039, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "test_ort.py", line 17, in forward
    x_cumsum = fast_cumsum_sub_one(x, dim=0)
  File "/opt/conda/lib/python3.7/site-packages/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
    return get_cumsum_kernel(data.size(0), data.size(1))(data)
  File "/opt/conda/lib/python3.7/site-packages/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/opt/conda/lib/python3.7/site-packages/tutel/impls/jit_compiler.py", line 26, in func
    tutel_custom_kernel.invoke(inputs, __ctx__)
RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/tmp/pip-req-build-qjgbz25n/tutel/custom/custom_kernel.cpp":39, please report a bug to PyTorch. CHECK_EQ fails.

To reproduce the problem, please try the following code, thanks.

from torch_ort import ORTModule
from onnxruntime.training import ortmodule
ortmodule.ONNX_OPSET_VERSION=12
from onnxruntime.training.ortmodule._custom_autograd_function import enable_custom_autograd_support
enable_custom_autograd_support()

from tutel.jit_kernels.gating import fast_cumsum_sub_one

import torch

class CumsumModule(torch.nn.Module):
    def __init__(self):
        super(CumsumModule, self).__init__()
        self.param = torch.nn.Parameter(torch.ones(5, 5))
    def forward(self, x):
        x = x + self.param
        x_cumsum = fast_cumsum_sub_one(x, dim=0)
        return x_cumsum


input = torch.randint(0, 5, (5, 5), device='cuda:0')
cumsum_module = CumsumModule().to(device='cuda:0')
cumsum_module = ORTModule(cumsum_module)
output = cumsum_module(input)

module 'tutel_custom_kernel' has no attribute 'inject_source'

My CUDA version is 11.4 and my Python version is 3.6.5.
Following the requirements, my torch and torchvision versions are torch==1.10.0+cu113 and torchvision==0.11.1+cu113.
Then I run
git clone https://github.com/microsoft/tutel --branch v0.1.x
python ./tutel/setup.py install --user
and then run the tutorial:
python ./tutel/examples/helloworld.py --batch_size=16
but I hit the following error:

Traceback (most recent call last):
  File "./tutel/examples/helloworld.py", line 118, in <module>
    output = model(x)
  File "/home/fanj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "./tutel/examples/helloworld.py", line 85, in forward
    result = self._moe_layer(input)
  File "/home/fanj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fanj/tutel/tutel/impls/moe_layer.py", line 424, in forward
    result_output, l_aux = self.gates[gate_index].apply_on_expert_fn(reshaped_input, self)
  File "/home/fanj/tutel/tutel/impls/moe_layer.py", line 73, in apply_on_expert_fn
    critical_data, l_loss = extract_critical(gates, self.top_k, self.capacity_factor, self.fp32_gate, self.batch_prioritized_routing)
  File "/home/fanj/tutel/tutel/impls/fast_dispatch.py", line 163, in extract_critical
    locations1 = compute_location(masks_se[0])
  File "/home/fanj/tutel/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
    return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
  File "/home/fanj/tutel/tutel/jit_kernels/gating.py", line 68, in get_cumsum_kernel
    ''')
  File "/home/fanj/tutel/tutel/impls/jit_compiler.py", line 31, in generate_kernel
    return JitCompiler.create_raw(template)
  File "/home/fanj/tutel/tutel/impls/jit_compiler.py", line 21, in create_raw
    __ctx__ = tutel_custom_kernel.inject_source(source)
AttributeError: module 'tutel_custom_kernel' has no attribute 'inject_source'

Do you know how to solve this problem?
Thank you very much!

Question about multi-gate for multi-task learning

Thanks for your contribution and the excellent work on Tutel!
I am wondering whether I can use Tutel to implement multiple gates on top of a shared set of experts for multi-task learning, as in the picture below.
(screenshot of the multi-gate architecture omitted)

Currently, I can't find a similar solution in the example files.
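For concreteness, this is roughly what I have in mind: several task-specific gates routing into one shared pool of experts. The sketch below is my own guess at the API; in particular, passing a list for gate_type and selecting a gate via gate_index in forward are assumptions I would like to confirm, not documented usage.

import torch
from tutel import moe as tutel_moe

# Hypothetical multi-gate usage (gate_type as a list and the gate_index
# argument are my assumptions, not confirmed API).
moe_layer = tutel_moe.moe_layer(
    gate_type=[{'type': 'top', 'k': 2}, {'type': 'top', 'k': 2}],  # one gate per task?
    model_dim=1024,
    experts={'type': 'ffn', 'count_per_node': 4, 'hidden_size_per_expert': 2048},
).cuda()

x = torch.randn(8, 1024, device='cuda')
y_task_a = moe_layer(x, gate_index=0)  # route task-A tokens with gate 0
y_task_b = moe_layer(x, gate_index=1)  # route task-B tokens with gate 1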

New Tutel checkpoint loading is incompatible with old models

Hi, I have been using Swin-MoE pre-trained models with Tutel. However, after the recent update to the Tutel library's model-loading format, the pre-trained models have a different dict structure than the currently required expert model, which results in a loading error.
Could you please create compatible versions of these released pre-trained models, or release a script to convert them?
Any help would be highly appreciated.

Error when doing deepcopy of the model

Hi, thanks for this awesome project!

I built my transformer model based on the MoeMlp layer and use EMA for better performance. However, when I try to initialize my EMA model with ema_model = copy.deepcopy(my_transformer_model), I encounter this error:

File "/opt/conda/lib/python3.8/copy.py", line 296, in _reconstruct
    value = deepcopy(value, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroupNCCL' object

Could you help me with this? How can I use EMA with Tutel? Thanks!
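For now, the workaround I am considering is to avoid deep-copying the live module altogether and keep the EMA as a plain dictionary of tensors; a minimal sketch of that idea (my own code, not part of Tutel) is below. Does this look reasonable for MoE parameters?

import torch

class EMA:
    # Track an exponential moving average of a model's parameters without
    # deep-copying the module, so no NCCL process group has to be pickled.
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1 - self.decay)
            else:
                self.shadow[k].copy_(v)

    def copy_to(self, model: torch.nn.Module):
        model.load_state_dict(self.shadow, strict=True)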

NCCL Asynchronous update timeout crash with Tutel MoE

Hi, I am using the Tutel library with the MMAction framework to replicate the Swin-V2 MoE performance described in the paper. However, I hit the error below when I try to train the MoE in a DDP setting.
Can someone please help me resolve this error?
Alternatively, could you release the object detection code that was used in the Tutel paper?

[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'

  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6056 closing signal SIGTERM

How are the experts' gradients handled under data parallelism?

When count_per_node is set to a negative value, one expert should be parallelized across multiple GPUs (similar to ZeRO), with each GPU holding a slice of the expert's parameters. There are also all_gather operations performed within ffn_zero_group in the expert's forward pass.

My question is how the expert's gradients and parameter updates are handled in Tutel under DP. The examples seem to indicate that no extra effort is required from users to handle this manually. However, I cannot find the corresponding implementation in either moe_layer or TutelDistributedOptimizer.

Any help will be appreciated!

Error in load_importance_loss

Hi, I hit the error below when using load_importance_loss (the code works fine when using gshard_loss). Does anyone have an idea what is going on?

The error log (from one rank/node) is below:

[4]:
  time      : 2022-07-06_11:47:24
  host      : SG-IDC1-10-51-2-36
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 55010)
  error_file: /tmp/torchelastic_kuhg0qco/none_62gucqgc/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
      return forward_call(*input, **kwargs)
    File "/mnt/lustre/bli/projects/Pretraining-DG/mae/models_moe_mae.py", line 75, in forward
      x_temp = self.mlp(self.norm2(x))
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
      return forward_call(*input, **kwargs)
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 231, in forward
      logits_dtype, (crit, l_aux) = routing()
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 218, in routing
      return logits.dtype, extract_critical(scores,
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/fast_dispatch.py", line 150, in extract_critical
      l_loss = loss_fn(scores, topk_indices) if loss_fn is not None else None
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 215, in <lambda>
      _loss_fn = lambda gates, topk_ids: losses.load_importance_loss(
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 41, in load_importance_loss
      l_load = load_loss(scores_wo_noise, topk_logits, num_global_experts, gate_noise)
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 23, in load_loss
      normal = Normal(
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/normal.py", line 54, in __init__
      super(Normal, self).__init__(batch_shape, validate_args=validate_args)
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
      raise ValueError(
  ValueError: Expected parameter scale (Tensor of shape (1,)) of distribution Normal(loc: tensor([0.], device='cuda:4'), scale: tensor([0.], device='cuda:4')) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
  tensor([0.], device='cuda:4')

Tutel with the PyTorch automatic mixed precision package

Hi, thank you for your excellent package. I wonder whether Tutel can be used seamlessly together with PyTorch's automatic mixed precision (AMP) package. If so, could you provide some hints on how to use it, or an example Python script?
Thank you very much!
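For reference, what I am hoping to do is the standard autocast/GradScaler recipe with the Tutel MoE layer treated like any other module; a minimal sketch of the training step I have in mind (my own code, using the regular PyTorch AMP API) is:

import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, criterion, x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():      # forward (incl. the MoE layer) in mixed precision
        out = model(x)
        loss = criterion(out, y)
    scaler.scale(loss).backward()        # backward on the scaled loss
    scaler.step(optimizer)               # unscales gradients, then optimizer.step()
    scaler.update()
    return loss.detach()

I am mainly unsure whether the gating computation and the custom kernels are safe under fp16, or whether parts of the layer should be kept in fp32.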

How to implement Fairseq-MoE training checkpoint like Swin-MoE?

First, I want to thank the Tutel team for open-sourcing this work; it's a very good and practical framework.
I want to use Tutel's MoE in Fairseq NLP tasks, but I ran into a problem: Fairseq's original checkpoint logic can't save and load expert parameters that are distributed across different GPUs. How should I modify the Fairseq model to support checkpoints the way Swin-MoE does?
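To make the question concrete, the direction I am currently considering is to keep the shared (non-expert) parameters in Fairseq's regular checkpoint and write the expert parameters to a per-rank file, roughly like the sketch below (my own code; the assumption that expert parameter names contain '.experts.' may not hold for every model):

import torch
import torch.distributed as dist

def save_moe_checkpoint(model, path_prefix):
    rank = dist.get_rank()
    state = model.state_dict()
    # Expert weights exist only on this rank; shared weights are replicated.
    expert_state = {k: v for k, v in state.items() if '.experts.' in k}
    shared_state = {k: v for k, v in state.items() if '.experts.' not in k}
    torch.save(expert_state, f'{path_prefix}.rank{rank}.experts.pt')
    if rank == 0:
        torch.save(shared_state, f'{path_prefix}.shared.pt')

def load_moe_checkpoint(model, path_prefix):
    rank = dist.get_rank()
    state = torch.load(f'{path_prefix}.shared.pt', map_location='cpu')
    state.update(torch.load(f'{path_prefix}.rank{rank}.experts.pt', map_location='cpu'))
    model.load_state_dict(state, strict=False)

Is this the right direction, or does Tutel/Fairseq already provide a hook for this?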

What is the purpose of the "use_2dh" option?

Hi Tutel authors, thank you for this great framework.

I have a question about commit 901a65c. What is the purpose of the use_2dh option, and what problem does PrimAllToAll2D intend to solve? It would be great if you could provide more context. Thanks.

Error when using multiple nodes

Dear contributors,
I hit an error with Tutel's MoE layer.
The error occurs when I run tutel/examples/helloworld_ddp.py in torch distributed mode with more than one GPU node (i.e., 16 GPUs across 2 machines).
However, it runs fine when I use 8 GPUs or fewer.

The error log is as follows:


[Benchmark] world_size = 16, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 65536, num_local_experts = 2, topK = 1, device = `cuda:0`
Traceback (most recent call last):
  File "tutel/examples/helloworld_ddp.py", line 154, in <module>
    output = model(x)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "tutel/examples/helloworld_ddp.py", line 119, in forward
    result = self._moe_layer(input)
  File "/mnt/cache/zhujinguo/anaconda3/envs/xmodaler/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 387, in forward
    result_output, l_aux = self.gate.apply_on_expert_fn(reshaped_input, self.expert_fn, self.group)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/moe_layer.py", line 103, in apply_on_expert_fn
    locations1 = self.compute_location(masks_se[0])
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 81, in fast_cumsum_sub_one
    return get_cumsum_kernel(data.size(0), data.size(1))(data)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/mnt/lustre/zhujinguo/codes/tutel/tutel/impls/jit_compiler.py", line 39, in func
    tutel_custom_kernel.invoke_with_source(inputs, __ctx__, no_nvrtc, source)
RuntimeError: (0) == (cuModuleGetFunction(&gm.hFunc, gm.hMod, std::string(pos, tail - pos).c_str())) INTERNAL ASSERT FAILED at "/mnt/lustre/zhujinguo/codes/tutel/tutel/custom/custom_kernel.cpp":208, please report a bug to PyTorch. CHECK_EQ fails.

Also, I use the Tutel MoE layer in another project, where the same thing happens.

Can this package support a single-GPU machine?

Hi, dear Tutel team.

I have run the script with some small modifications:
python -u main_moe.py --cfg configs/swinmoe/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.yaml --data-path /data/user1/junyan/datasets/ImageNet/ImageNet_Val --batch-size 128 --resume checkpoints/swin_moe_small_patch4_window12_192_32expert_32gpu_22k/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth

However, I received the following error message:

File "main_moe.py", line 374, in
main(config)
File "main_moe.py", line 141, in main
max_accuracy = load_checkpoint(config, model_without_ddp, optimizer, lr_scheduler, loss_scaler, logger)
File "/data/user1/junyan/adv_training/Swin-Transformer/utils_moe.py", line 45, in load_checkpoint
msg = model.load_state_dict(checkpoint['model'], strict=False)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1039, in load_state_dict
load(self)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load
load(child, prefix + name + '.')
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load
load(child, prefix + name + '.')
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load
load(child, prefix + name + '.')
[Previous line repeated 3 more times]
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1034, in load
state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
File "/root/.local/lib/python3.6/site-packages/tutel/impls/moe_layer.py", line 54, in _load_from_state_dict
assert buff_name in state_dict, "Could not find parameter %s in state_dict." % buff_name
AssertionError: Could not find parameter layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_bias in state_dict.

I have only one GPU, and I am not sure whether multiple GPUs are essential for this task. Is it possible to run it on a single GPU? Furthermore, how can I resolve this error?

I am looking forward to your response.

Thanks a lot.

Best Regards!

AttributeError: module 'tutel_custom_kernel' has no attribute 'inject_source'

Library Installation steps.
git clone https://github.com/microsoft/tutel --branch main
python3 -m pip uninstall tutel -y
python3 ./setup.py install --user

Running the example code
python3 -m tutel.examples.helloworld --batch_size=16

Results in the following error

[Statistics] param count for MoE local_experts = 16785408, param count for MoE gate = 4096.

ExampleModel(
  (_moe_layer): MOELayer(
    Top-K(s) = ['k=2, noise=0.0'], Total-Experts = 2 [managed by 1 device(s)],
    (experts): FusedExpertsNetwork(model_dim=2048, hidden_size=2048, output_dim=2048, local_experts=2)
    (gates): ModuleList(
      (0): LinearTopKGate(
        (wg): Linear(in_features=2048, out_features=2, bias=False)
      )
    )
  )
)
[Benchmark] world_size = 1, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 16384, num_local_experts = 2, topK = 2, a2a_ffn_overlap_degree = 1, parallel_type = `auto`, device = `cuda:0`
Traceback (most recent call last):
  File "/home/skodge/anaconda3/envs/moe/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/skodge/anaconda3/envs/moe/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/skodge/tutel/tutel/examples/helloworld.py", line 120, in <module>
    output = model(x)
  File "/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/skodge/tutel/tutel/examples/helloworld.py", line 85, in forward
    result = self._moe_layer(input)
  File "/home/skodge/anaconda3/envs/moe/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/skodge/tutel/tutel/impls/moe_layer.py", line 220, in forward
    logits_dtype, (crit, l_aux) = routing()
  File "/home/skodge/tutel/tutel/impls/moe_layer.py", line 208, in routing
    return logits.dtype, extract_critical(scores,
  File "/home/skodge/tutel/tutel/impls/fast_dispatch.py", line 158, in extract_critical
    locations1 = compute_location(masks_se[0])
  File "/home/skodge/tutel/tutel/jit_kernels/gating.py", line 85, in fast_cumsum_sub_one
    return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
  File "/home/skodge/tutel/tutel/jit_kernels/gating.py", line 29, in get_cumsum_kernel
    base_kernel = JitCompiler.generate_kernel({'batch_num': global_experts, 'num_samples': samples}, '''
  File "/home/skodge/tutel/tutel/impls/jit_compiler.py", line 31, in generate_kernel
    return JitCompiler.create_raw(template)
  File "/home/skodge/tutel/tutel/impls/jit_compiler.py", line 21, in create_raw
    __ctx__ = tutel_custom_kernel.inject_source(source)
AttributeError: module 'tutel_custom_kernel' has no attribute 'inject_source'

About compute_location and locations

Thanks for your excellent work on Tutel.
I would like to know what the following function does (in fast_dispatch.py):

def compute_sorted_location(x, importance_scores):
    sorted_x = x[importance_scores.argsort(dim=0)]
    sorted_cumsum = fast_cumsum_sub_one(sorted_x) * sorted_x
    return sorted_cumsum[importance_scores.argsort(dim=0).argsort(dim=0)]

and also the meaning of locations_s, which is a return value of the function extract_critical (also in fast_dispatch.py).
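For reference, my current understanding (please correct me if it is wrong) is that fast_cumsum_sub_one(x) is numerically equivalent to torch.cumsum(x, dim=0) - 1, so compute_sorted_location yields, for each selected token, its position within its expert after tokens are re-ordered by importance. A small reference version I wrote to check this (my own code, not Tutel's kernel):

import torch

def cumsum_sub_one_ref(x):
    # Position of each token within its expert column: cumulative count minus one.
    return torch.cumsum(x, dim=0) - 1

def compute_sorted_location_ref(x, importance_scores):
    order = importance_scores.argsort(dim=0)
    sorted_x = x[order]
    sorted_cumsum = cumsum_sub_one_ref(sorted_x) * sorted_x
    # Undo the sort so locations line up with the original token order.
    return sorted_cumsum[order.argsort(dim=0)]

mask = torch.randint(0, 2, (8, 4))   # [tokens, experts] selection mask
scores = torch.rand(8)               # per-token importance scores
print(compute_sorted_location_ref(mask, scores))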

100x slower when using 4 nodes than 1 node to run the helloworld_ddp example

Hello, I have hit a problem where running the helloworld_ddp example on multiple nodes is about 100x slower than on a single node.
I compiled Tutel with CUDA 11.3, PyTorch 1.11 and NCCL 2.9.9 on an NVIDIA A100 GPU cluster with 100G IB.
When I run tutel.examples.helloworld_ddp on a single node with 8 GPUs and batch size 16, the speed matches the results in your table (step_time = 0.012315). But when I test with 4 nodes, the step time becomes about 1 second, which is roughly 100x slower.
Other multi-node tasks run normally on my cluster, so I suspect something is wrong with the environment I used to build the project. It would be very nice if you could share detailed environment information, such as the PyTorch version, CUDA version, g++ version, etc.
Thanks.

Training with Data and Expert Parallelism

How should I prepare my code (data loaders, model, etc.) in order to train in a combined Data and Expert Parallel mode?
And what changes between the "auto", "model" and "data" values of --parallel_type?

In my current setup I'm training with DDP, wrapping the model in torch DistributedDataParallel and using the distributed sampler in the loaders.
Now I want to insert an MoE layer into the model with 2 experts (I have 2 GPUs, so 1 local expert each), i.e., using both Data and Expert Parallelism; a rough sketch of what I have in mind follows below.
Some help would be appreciated.
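For concreteness, this is roughly the setup I have in mind (simplified). The scan_expert_func / skip_allreduce part is copied from my reading of the helloworld_ddp example and may be wrong; how the marked expert parameters should be excluded from DDP's gradient averaging is exactly the part I am unsure about.

# Launch with: python3 -m torch.distributed.run --nproc_per_node=2 this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from tutel import moe as tutel_moe

dist.init_process_group('nccl')
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    tutel_moe.moe_layer(
        gate_type={'type': 'top', 'k': 1},
        model_dim=1024,
        # 1 local expert per GPU -> 2 global experts on 2 GPUs (expert parallelism).
        experts={'type': 'ffn', 'count_per_node': 1, 'hidden_size_per_expert': 2048},
        # Mark expert params; they should NOT be all-reduced across data-parallel ranks.
        scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
    ),
).cuda()

model = DDP(model, device_ids=[local_rank])  # dense params replicated/averaged by DDP
# Question: what is the recommended way to make DDP skip the params marked above?

x = torch.randn(16, 1024, device='cuda')
model(x).sum().backward()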

Non-surface function utilities only work for contiguous input data

According to the paper, when the number of experts is set to 1, the score (scores = F.softmax(logits_w_noise, dim=1)) should always equal 1. Consequently, the output variable "y" (y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype), in moe_layer.py at line 304) should be equal to the input variable "x". However, in my experiments, "x" and "y" sometimes differ. The difference first shows up in ctx.config.func_fwd(g, i, l, reshaped_input, dispatched_input, extra=[ctx.config.indices_[0].size(0), ctx.config.aligned_dim, ctx.config.capacity]) (fast_dispatch.py, line 28), and the root source is tutel_custom_kernel.invoke(inputs, extra, blocks, ctx) (jit_compiler.py, line 33). How can I fix this problem?

My code seems to hang when skip_remainder_batch=False.

Describe the bug
Hi, Authors. My code seems to hang when skip_remainder_batch=False.

To Reproduce
Steps to reproduce the behavior:

git clone https://github.com/microsoft/tutel --branch main
python3 -m pip uninstall tutel -y
python3 ./tutel/setup.py

cd ./tutel/tutel/examples/fairseq_moe
git clone https://github.com/facebookresearch/fairseq --branch main
cd fairseq/ && git checkout b5e7b250913120409b872a940fbafec4d43c7b13
# This patch is an example to train Fairseq MoE transformers.
# Note that the current patch only works for `legacy_ddp` backend, and `--checkpoint-activations` must be disabled.
git apply ../fairseq_patch.diff
python3 -m pip install omegaconf==2.0.5 hydra-core==1.0.7
python3 -m pip install --no-deps --editable .

#fix bug in https://github.com/facebookresearch/fairseq/blob/main/fairseq/tasks/translation.py#L441-L442
#get dataset followed by https://github.com/facebookresearch/fairseq/tree/main/examples/translation


 CUDA_VISIBLE_DEVICES=0,1  MOE=3 L_AUX_WT=0.01 SKIP_EXPERT=1 fairseq-train  fairseq/data-bin/iwslt14.tokenized.de-en     --arch transformer_iwslt_de_en --share-decoder-input-output-embed     --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0     --lr 10e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000     --dropout 0.3 --weight-decay 0.0001     --criterion label_smoothed_cross_entropy --label-smoothing 0.1     --max-tokens 4096 --eval-bleu     --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}'     --eval-bleu-detok moses     --eval-bleu-remove-bpe  --eval-bleu-print-samples     --best-checkpoint-metric bleu --maximize-best-checkpoint-metric  --ddp-backend legacy_ddp --max-update 100000

Logs

2022-08-09 10:51:01 | INFO | fairseq.utils | rank   0: capabilities =  7.5  ; total memory = 10.761 GB ; name = NVIDIA GeForce RTX 2080 Ti
2022-08-09 10:51:01 | INFO | fairseq.utils | rank   1: capabilities =  7.5  ; total memory = 10.761 GB ; name = NVIDIA GeForce RTX 2080 Ti
2022-08-09 10:51:01 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2022-08-09 10:51:01 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2022-08-09 10:51:01 | INFO | fairseq_cli.train | max tokens per device = 4096 and max sentences per device = None
2022-08-09 10:51:01 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last.pt
2022-08-09 10:51:01 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last.pt
2022-08-09 10:51:01 | INFO | fairseq.trainer | loading train data for epoch 1
2022-08-09 10:51:01 | INFO | fairseq.data.data_utils | loaded 160,239 examples from: fairseq/data-bin/iwslt14.tokenized.de-en/train.de-en.de
2022-08-09 10:51:01 | INFO | fairseq.data.data_utils | loaded 160,239 examples from: fairseq/data-bin/iwslt14.tokenized.de-en/train.de-en.en
2022-08-09 10:51:01 | INFO | fairseq.tasks.translation | fairseq/data-bin/iwslt14.tokenized.de-en train de-en 160239 examples
2022-08-09 10:51:01 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --amp
2022-08-09 10:51:01 | INFO | fairseq.data.iterators | grouped total_num_itrs = 551
epoch 001:   0%|                                                          | 0/551 [00:00<?, ?it/s]
2022-08-09 10:51:01 | INFO | fairseq.trainer | begin training epoch 1
2022-08-09 10:51:01 | INFO | fairseq_cli.train | Start iterating over samples
/home/xinglinpan/tutel/tutel/examples/fairseq_moe/fairseq/fairseq/utils.py:374: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  warnings.warn(
/home/xinglinpan/tutel/tutel/examples/fairseq_moe/fairseq/fairseq/utils.py:374: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  warnings.warn(
epoch 001: 100%|▉| 550/551 [02:12<00:00,  4.54it/s, loss=9.244, nll_loss=8.59, ppl=385.3, wps=3146
2022-08-09 10:53:14 | INFO | fairseq_cli.train | begin validation on "valid" subset
2022-08-09 10:53:19 | INFO | fairseq.tasks.translation | example hypothesis: they don't don't don't don't don't don't don't't't't
2022-08-09 10:53:19 | INFO | fairseq.tasks.translation | example reference: they're just not moving.

The possible problem is that not all devices are given data in the last iteration over the valid subset, so the all-to-all keeps waiting for the other processes. With SKIP_MOE=1, this phenomenon does not occur.

Is simple_all_reduce also required for capacity_factor > 0 cases?

My code seems to hang when unbalanced workloads exist on two different GPUs (i.e., scores.size(0) differs between GPUs, for example at the end of a dataset). This further leads to unequal values of capacity at Line 178 on different GPUs. Is simple_all_reduce also required for capacity_factor > 0 cases? (A sketch of what I mean follows after the snippet below.)

if capacity_factor > 0:
    capacity = top_k * int(capacity_factor * samples_per_expert)
else:
    capacity = torch.max(torch.cat(locations_s, dim=0))
    capacity = int(simple_all_reduce(capacity, group=group, op=torch.distributed.ReduceOp.MAX)) + 1
    if capacity_factor < 0:
        capacity = min(capacity, top_k * int(-capacity_factor * ((int(scores.size(0)) + num_global_experts - 1) // num_global_experts)))
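For what it's worth, the fix I am experimenting with is to synchronize the locally computed capacity across ranks with a MAX all-reduce in the capacity_factor > 0 branch as well, so every rank enters the all-to-all with the same shape. A minimal sketch of that idea (my own code, using plain torch.distributed rather than Tutel's simple_all_reduce):

import torch
import torch.distributed as dist

def synced_capacity(local_capacity, group=None):
    # Agree on max(capacity) across ranks so tensor shapes match in the all-to-all.
    t = torch.tensor(int(local_capacity), dtype=torch.int64, device='cuda')
    dist.all_reduce(t, op=dist.ReduceOp.MAX, group=group)
    return int(t.item())

Is this safe to do, or is there a reason capacity is intentionally kept local when capacity_factor > 0?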

RuntimeError: (0) == (cuModuleLoadDataEx(&hMod, image.c_str(), sizeof(options) / sizeof(*options), options, values)) INTERNAL ASSERT FAILED

Hi,

I installed tutel via python3 -m pip install --user --upgrade git+https://github.com/microsoft/tutel@main
I am running this test script:

import torch
from tutel.jit_kernels.gating import fast_cumsum_sub_one

matrix = torch.randint(0, 100, (10000, 100), device='cuda')
cumsum_tutel = fast_cumsum_sub_one(matrix, dim=0) + 1

and I am facing this error:

[W custom_kernel.cpp:149] nvrtc: error: invalid value for --gpu-architecture (-arch)
 Failed to use NVRTC for JIT compilation in this Pytorch version, try another approach using CUDA compiler.. (To always disable NVRTC, please: export USE_NVRTC=0)
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    cumsum_tutel = fast_cumsum_sub_one(matrix, dim=0) + 1
  File "/home/jdhwang/.local/lib/python3.8/site-packages/tutel/jit_kernels/gating.py", line 22, in fast_cumsum_sub_one
    return torch.ops.tutel_ops.cumsum(data)
  File "/home/jdhwang/conda/envs/cl/lib/python3.8/site-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: (0) == (cuModuleLoadDataEx(&hMod, image.c_str(), sizeof(options) / sizeof(*options), options, values)) INTERNAL ASSERT FAILED at "/tmp/pip-req-build-c9h2prbs/tutel/custom/custom_kernel.cpp":205, please report a bug to PyTorch. CHECK_EQ fails.

Following #203, I exported USE_NVRTC=1. I am using an RTX 4090 with torch 2.0.0+cu117 and CUDA 11.7 (nvcc as well).

[installation errors] fatal error: nccl.h: No such file or directory

Hello, thanks for providing such great work. However, I cannot install Tutel successfully; compilation fails.
I have exported the path of nccl_2.7.8-1-cuda10.1/include/nccl.h into LD_LIBRARY_PATH, but the error log suggests it still cannot find the NCCL headers.
Do you have any idea how to solve this error? Thanks!

running install_lib
running build_py
running build_ext
building 'tutel_custom_kernel' extension
Emitting ninja build file /mnt/lustre/chengguangliang/zhouqianyu/segdgformer/cvpr_2023/tutel-main/build/temp.linux-x86_64-3.7/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /mnt/cache/share/gcc/gcc-7.3.0/bin/g++ -MMD -MF /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/tutel/custom/custom_kernel.o.d -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/TH -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/THC -I/mnt/lustre/share/cuda-10.1/include -I/mnt/lustre/chengguangliang/miniconda3/include/python3.7m -c -c /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/tutel/custom/custom_kernel.cpp -o /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/tutel/custom/custom_kernel.o
/mnt/cache/share/gcc/gcc-7.3.0/bin/g++ -MMD -MF /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/tutel/custom/custom_kernel.o.d -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/TH -I/mnt/lustre/chengguangliang/miniconda3/lib/python3.7/site-packages/torch/include/THC -I/mnt/lustre/share/cuda-10.1/include -I/mnt/lustre/chengguangliang/miniconda3/include/python3.7m -c -c /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/tutel/custom/custom_kernel.cpp -o /mnt/lustre/chengguangliang/zhouqianyu/tutel-main/build/temp.linux-x86_64-3.7/tutel/custom/custom_kernel.o -Wno-sign-compare -Wno-unused-but-set-variable -Wno-terminate -Wno-unused-function -Wno-strict-aliasing -DUSE_GPU -DUSE_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=tutel_custom_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/mnt/lustre/chengguangliang/zhouqianyu/tutel-main/tutel/custom/custom_kernel.cpp:19:10: fatal error: nccl.h: No such file or directory
#include <nccl.h>
^~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Try installing without NCCL extension..

My machine details are as follows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

I used PyTorch 1.8.1 with CUDA 10.1. I wonder whether Tutel can be installed with CUDA 10.1?

I used the following commands for the installation:

export PATH=/mnt/lustre/share/gcc/gcc-5.4/bin/:$PATH
export LD_LIBRARY_PATH=/mnt/lustre/share/polaris/dep/nccl_2.7.8-1-cuda10.1/include/nccl.h:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/mnt/lustre/share/gcc/gmp-4.3.2/lib:/mnt/lustre/share/gcc/mpfr-2.4.2/lib:/mnt/lustre/share/gcc/mpc-0.8.1/lib:$LD_LIBRARY_PATH
export TORCH_CUDA_ARCH_LIST='3.5;5.0+PTX;6.0;7.0'

python setup.py install --user

The output of nccl_all_to_all_scatter_async may be incomplete when num_local_experts>1.

Describe the bug
The output of nccl_all_to_all_scatter_async may be incomplete.

To Reproduce
Steps to reproduce the behavior:

on host0(master): SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=1
on host1: SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=1

Log
The value of

y = y.view(self.world_size, -1, y.size(2))

tensor([[[ 1.5410, -0.2934],
         [-1.0845, -1.3986]],
        [[ 1.5410, -0.2934],
         [ 0.4033,  0.8380]],
        [[-2.1788,  0.5684],
         [-1.0845, -1.3986]],
        [[ 0.4033,  0.8380],
         [-2.1788,  0.5684]]], device='cuda:0')

The value of

y = C.all_to_all(y, 0, 1, use_2dh=self.use_2dh, group=self.group)

tensor([[[ 1.5410, -0.2934],
         [-1.0845, -1.3986]],
        [[ 1.5410, -0.2934],
         [ 0.4033,  0.8380]],
        [[-2.1788,  0.5684],
         [-1.0845, -1.3986]],
        [[ 0.4033,  0.8380],
         [-2.1788,  0.5684]]], device='cuda:0')

This is the result I expect. However, when I run:
on host0(master): SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=2
on host1: SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=2

The value of

y = y.view(self.world_size, -1, y.size(2))

tensor([[[ 1.5410, -0.2934],
         [-1.0845, -1.3986]],
        [[ 1.5410, -0.2934],
         [ 0.4033,  0.8380]],
        [[-2.1788,  0.5684],
         [-1.0845, -1.3986]],
        [[ 0.4033,  0.8380],
         [-2.1788,  0.5684]]], device='cuda:0')

The value of

y = a2a_ffn_overlap_forward(y, expert_fn=expert_fn, a2a_ffn_overlap_degree=a2a_ffn_overlap_degree, use_2dh=self.use_2dh, group=self.group)

tensor([[[ 0.0000,  0.0000],
         [ 0.0000,  0.0000]],
        [[ 1.5410, -0.2934],
         [ 0.4033,  0.8380]],
        [[ 0.0000,  0.0000],
         [ 0.0000,  0.0000]],
        [[ 0.4033,  0.8380],
         [-2.1788,  0.5684]]], device='cuda:0')

It seems incomplete.

The code that is possibly responsible is:

CHECK_EQ(0, ncclGroupStart());
for (int j = 0; j < num_slices_per_split; j++) {
  CHECK_EQ(0, ncclSend(
    ((char*)input.data_ptr()) + (j * num_split + calc_idx) * slice_size,
    slice_size,
    ncclInt8,
    g_world_size * j / num_slices_per_split,
    g_nccl_comm,
    get_nccl_stream().stream()));
  CHECK_EQ(0, ncclRecv(
    ((char*)output_list[calc_idx].data_ptr()) + j * slice_size,
    slice_size,
    ncclInt8,
    g_world_size * j / num_slices_per_split,
    g_nccl_comm,
    get_nccl_stream().stream()));
}
CHECK_EQ(0, ncclGroupEnd());

It looks like the NCCL group keeps only the last send-recv pair for each peer.
The same problem does not occur when num_local_experts=1.
