
Comments (4)

marib00 commented on September 2, 2024

Progress: disabling gradient clipping and the fused AdamW kernel actually makes it work (even with torch.compile!) 🎉🎉🎉
Next step: tensor-parallel attention 👍
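
For anyone reproducing this, here's a minimal sketch of the two changes described above (a stand-in model and illustrative hyperparameters, not the exact lines from train_gpt2_tp.py):

import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for the GPT model

# Change 1: construct AdamW without the fused kernel.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1,
    fused=False,  # fused=True was part of the combination that broke under tensor parallelism
)

# Change 2: inside the training loop, gradient clipping is simply skipped:
# norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # disabled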


marib00 commented on September 2, 2024

Almost there, but I've discovered a weird behaviour of torch.distributed.tensor.parallel.SequenceParallel(). It should be sharding across the sequence dimension, i.e. [B, T, C] -> [B, T//_world_size, C], but it seems to be tiling instead, i.e. [B, T, C] -> [B, T*_world_size, C], which obviously doesn't sit well with the loss function, which now gets _world_size times too many logits. I didn't realise how much I didn't know about parallelism! Investigation continues... 🧐
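
One possible explanation (from my reading of the PyTorch docs, not verified against this repo): when SequenceParallel() receives a plain torch.Tensor, it assumes that tensor is already the local sequence shard, so a fully replicated [B, T, C] input gets interpreted as _world_size stacked shards, i.e. [B, T*_world_size, C] globally. Below is a minimal sketch of the tutorial-style plan this refers to, with each rank fed its own sequence slice (assumes PyTorch >= 2.4 for the torch.distributed.tensor import path, 2 GPUs, launched via torchrun; module names are illustrative, not the ones in train_gpt2_tp.py):

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, SequenceParallel, parallelize_module,
)

class Block(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, 4 * dim)
        self.proj = nn.Linear(4 * dim, dim)
    def forward(self, x):
        return self.proj(F.gelu(self.fc(self.ln(x))))

world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)
mesh = init_device_mesh("cuda", (world_size,))
model = Block().cuda()

# ln runs on sequence-sharded activations; fc/proj shard the weights instead.
parallelize_module(model, mesh, {
    "ln": SequenceParallel(),                          # expects input sharded on dim 1 (T)
    "fc": ColwiseParallel(input_layouts=Shard(1)),     # gathers the sequence shards internally
    "proj": RowwiseParallel(output_layouts=Shard(1)),  # hands back a sequence-sharded output
})

B, T, C = 8, 1024, 768
torch.manual_seed(0)  # same full tensor on every rank
x = torch.randn(B, T, C, device="cuda")
x_local = x.chunk(world_size, dim=1)[rank]  # feed each rank its own [B, T//world_size, C] slice
y = model(x_local)
print(rank, y.shape)  # expected: [B, T//world_size, C] on every rank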

On the bright side, I'm getting just shy of 150k tok/sec on a 2x 3090 configuration and can now work with a model that wouldn't fit on a single 3090. Bad news for 8x A100 users: you have too many GPUs to shard the 12 attention heads 🤣
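
The head-count constraint boils down to a divisibility check: tensor-parallel attention splits heads across ranks, so the 12 heads of the GPT-2 (124M) config only divide evenly over 2, 3, 4, 6 or 12 GPUs:

n_head = 12  # GPT-2 (124M) attention heads
for tp_size in (2, 4, 8):
    ok = n_head % tp_size == 0
    print(f"tp_size={tp_size}: {'OK' if ok else 'cannot split 12 heads evenly'}")
# tp_size=2: OK / tp_size=4: OK / tp_size=8: cannot split 12 heads evenly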

#protip: I didn't know it was possible to use the VS Code debugger with distributed workloads; turns out you can, and all the [conditional] breakpoints, stepping into libraries etc. work like a dream! Here's my launch.json file for that - hope people find it useful:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Distributed train_gpt2_tp.py",
            "type": "debugpy",
            "request": "launch",
            "purpose": ["debug-in-terminal"],
            "console": "integratedTerminal",
            "module": "torch.distributed.run",
            "args": ["--standalone", "--nproc_per_node=2", "train_gpt2_tp.py"],
            "justMyCode": false
        }
    ]
}
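
For reference, this configuration makes the debugger launch the script the same way as running torchrun --standalone --nproc_per_node=2 train_gpt2_tp.py from a terminal (torchrun is the console entry point for torch.distributed.run); debugpy attaches to subprocesses by default, so breakpoints should hit in each rank's process.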


marib00 commented on September 2, 2024

Turns out RowwiseParallel(use_local_output=True) is the default, so x should already be a torch.Tensor... 🤔
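
(For context, a quick illustration of that default; this is generic ParallelStyle behaviour rather than anything specific to train_gpt2_tp.py:)

from torch.distributed.tensor.parallel import RowwiseParallel

# use_local_output=True (the default) converts the style's DTensor output back into
# a plain per-rank torch.Tensor, so downstream code expecting a regular tensor keeps working.
style_local = RowwiseParallel()                           # same as use_local_output=True
style_dtensor = RowwiseParallel(use_local_output=False)   # keep the DTensor instead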


marib00 commented on September 2, 2024

Ok, all done, except it's disappointingly slow! 😮 Quite possibly I've messed something up, so if anybody notices anything, please do let me know.

The repo is available at https://github.com/marib00/build-nanogpt and the file of interest is train_gpt2_tp.py - I didn't touch any of the other files.

I have added some benchmarks to README.md.

