
Comments (4)

marib00 commented on September 2, 2024

Progress: disabling gradient clipping and the fused AdamW kernel actually makes it work (even with torch.compile!) 🎉🎉🎉
Next step: tensor-parallel attention 👍
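
For anyone reproducing this, here's a minimal sketch of the two changes described above (a stand-in model and illustrative hyperparameters, not the exact lines from train_gpt2_tp.py):

import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for the GPT model

# Change 1: construct AdamW without the fused kernel.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1,
    fused=False,  # fused=True was part of the combination that broke under tensor parallelism
)

# Change 2: inside the training loop, gradient clipping is simply skipped:
# norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # disabled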


marib00 commented on September 2, 2024

Almost there, but I've discovered a weird behaviour of torch.distributed.tensor.parallel.SequenceParallel(). It should be sharding across the sequence dimension, i.e. [B, T, C] -> [B, T//_world_size, C], but it seems to be tiling instead, i.e. [B, T, C] -> [B, T*_world_size, C], which obviously doesn't sit well with the loss function, which now gets _world_size times too many logits. I didn't realise how much I didn't know about parallelism! Investigation continues... 🧐
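
One possible explanation (from my reading of the PyTorch docs, not verified against this repo): when SequenceParallel() receives a plain torch.Tensor, it assumes that tensor is already the local sequence shard, so a fully replicated [B, T, C] input gets interpreted as _world_size stacked shards, i.e. [B, T*_world_size, C] globally. Below is a minimal sketch of the tutorial-style plan this refers to, with each rank fed its own sequence slice (assumes PyTorch >= 2.4 for the torch.distributed.tensor import path, 2 GPUs, launched via torchrun; module names are illustrative, not the ones in train_gpt2_tp.py):

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, SequenceParallel, parallelize_module,
)

class Block(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, 4 * dim)
        self.proj = nn.Linear(4 * dim, dim)
    def forward(self, x):
        return self.proj(F.gelu(self.fc(self.ln(x))))

world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)
mesh = init_device_mesh("cuda", (world_size,))
model = Block().cuda()

# ln runs on sequence-sharded activations; fc/proj shard the weights instead.
parallelize_module(model, mesh, {
    "ln": SequenceParallel(),                          # expects input sharded on dim 1 (T)
    "fc": ColwiseParallel(input_layouts=Shard(1)),     # gathers the sequence shards internally
    "proj": RowwiseParallel(output_layouts=Shard(1)),  # hands back a sequence-sharded output
})

B, T, C = 8, 1024, 768
torch.manual_seed(0)  # same full tensor on every rank
x = torch.randn(B, T, C, device="cuda")
x_local = x.chunk(world_size, dim=1)[rank]  # feed each rank its own [B, T//world_size, C] slice
y = model(x_local)
print(rank, y.shape)  # expected: [B, T//world_size, C] on every rank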

On the bright side, I'm getting just shy of 150k tok/sec on a 2x 3090 configuration and can now work with a model that wouldn't fit on a single 3090. Bad news for 8x A100 users: you have too many GPUs to shard the 12 attention heads 🤣
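
The head-count constraint boils down to a divisibility check: tensor-parallel attention splits heads across ranks, so the 12 heads of the GPT-2 (124M) config only divide evenly over 2, 3, 4, 6 or 12 GPUs:

n_head = 12  # GPT-2 (124M) attention heads
for tp_size in (2, 4, 8):
    ok = n_head % tp_size == 0
    print(f"tp_size={tp_size}: {'OK' if ok else 'cannot split 12 heads evenly'}")
# tp_size=2: OK / tp_size=4: OK / tp_size=8: cannot split 12 heads evenly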

#protip: I didn't know it was possible to use the VS Code debugger with distributed workloads; turns out you can, and all the [conditional] breakpoints, stepping into libraries etc. work like a dream! Here's my launch.json file for that - hope people find it useful:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Distributed train_gpt2_tp.py",
            "type": "debugpy",
            "request": "launch",
            "purpose": ["debug-in-terminal"],
            "console": "integratedTerminal",
            "module": "torch.distributed.run",
            "args": ["--standalone", "--nproc_per_node=2", "train_gpt2_tp.py"],
            "justMyCode": false
        }
    ]
}
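
For reference, this configuration makes the debugger launch the script the same way as running torchrun --standalone --nproc_per_node=2 train_gpt2_tp.py from a terminal (torchrun is the console entry point for torch.distributed.run); debugpy attaches to subprocesses by default, so breakpoints should hit in each rank's process.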


marib00 commented on September 2, 2024

Turns out RowwiseParallel(use_local_output=True) is the default, so x should already be a torch.Tensor... 🤔
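
(For context, a quick illustration of that default; this is generic ParallelStyle behaviour rather than anything specific to train_gpt2_tp.py:)

from torch.distributed.tensor.parallel import RowwiseParallel

# use_local_output=True (the default) converts the style's DTensor output back into
# a plain per-rank torch.Tensor, so downstream code expecting a regular tensor keeps working.
style_local = RowwiseParallel()                           # same as use_local_output=True
style_dtensor = RowwiseParallel(use_local_output=False)   # keep the DTensor instead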


marib00 commented on September 2, 2024

Ok, all done, except it's disappointingly slow! 😮 Quite possibly I've messed something up, so if anybody notices anything, please do let me know.

The repo is available at https://github.com/marib00/build-nanogpt and the file of interest is train_gpt2_tp.py - I didn't touch any of the other files.

I have added some benchmarks to README.md.

