
Comments (6)

kiya00 avatar kiya00 commented on July 17, 2024 1

This seems to be fixed already. I tried it on today's container on an H100 machine:

root@c6f2bbeb93de:/opt/pytorch/lightning-thunder# torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name dolly-v2-3b --compile thunder_inductor_cat_cudnn --distributed_mode fsdp --shard_mode zero2
Model name: dolly-v2-3b
Seq Length: 2048
Micro BS: 1
Global BS: 8
Number of Layers: 32
Number of parameters: 0.35B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: none
Compiler: thunder_inductor_cat_cudnn
Average iter time: 113.18 ms
Memory used: 17.74 GB
Tokens/s: 144833.85
Tokens/s/GPU: 18104.23
TFLOP/s: 2590.36
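
As a rough sanity check, the throughput lines in this log can be related to each other with simple arithmetic. The sketch below assumes that the global batch size is the micro batch size times the number of GPUs and that tokens/s is tokens per iteration divided by the average iteration time. Both are assumptions about how the benchmark script reports its numbers, not something confirmed in this thread.

```python
# Values copied from the benchmark log above.
seq_length = 2048
micro_batch_size = 1
num_gpus = 8                      # from --nproc-per-node=8 on a single node
avg_iter_time_s = 113.18e-3       # "Average iter time: 113.18 ms"
logged_tokens_per_s = 144833.85
logged_tokens_per_s_per_gpu = 18104.23

# Assumption: global batch = micro batch * number of GPUs (no gradient accumulation).
global_batch_size = micro_batch_size * num_gpus

# Assumption: throughput = tokens processed per iteration / average iteration time.
tokens_per_iter = global_batch_size * seq_length
est_tokens_per_s = tokens_per_iter / avg_iter_time_s

# Per-GPU throughput derived from the logged total.
est_tokens_per_s_per_gpu = logged_tokens_per_s / num_gpus

rel_diff = abs(est_tokens_per_s - logged_tokens_per_s) / logged_tokens_per_s
print(f"global batch size: {global_batch_size}")
print(f"estimated tokens/s: {est_tokens_per_s:.2f} (relative difference to log: {rel_diff:.4%})")
print(f"tokens/s/GPU from logged total: {est_tokens_per_s_per_gpu:.2f}")
```

The estimate lands close to the logged total but does not match it exactly, which would be expected if the script computes throughput from an unrounded iteration time.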

from lightning-thunder.

kiya00 avatar kiya00 commented on July 17, 2024 1

Thank you @wprazuch, I can reproduce it with 2 nodes:

[viking-prod-230:1]:Time to instantiate model: 0.02 seconds.
[viking-prod-230:0]:An error occurred: KeyError – 't5905'

wprazuch avatar wprazuch commented on July 17, 2024

@IvanYashchuk this is related to the past issue.
Only this KeyError persists in the new container version :)
All the best,
WP

IvanYashchuk avatar IvanYashchuk commented on July 17, 2024

Running python thunder/benchmarks/benchmark_litgpt.py --model_name dolly-v2-3b --compile thunder_inductor_cat_cudnn --n_layers=1 doesn't reproduce the problem. We need to check on a DGX when one is available.

IvanYashchuk avatar IvanYashchuk commented on July 17, 2024

Thank you, Yan, for checking this!
The reported error is also not reproducible with the provided tag pjnl-20240621 on a single node. Is the tag correct?

@wprazuch, in the issue description you write that the error is seen when running on two nodes, but the reproducing command doesn't have the --nnodes= option. I don't know how the two-node run could differ from a single-node run, but anything is possible; we'll check.

wprazuch avatar wprazuch commented on July 17, 2024

@IvanYashchuk Yes, you are right about the --nnodes option; it should be set to 2. The issue only occurs in a multi-node setup. We observed it with:

  • nodes = 2
  • gpu_per_node = 8
  • fsdp
  • zero2 and zero3

I forgot to add the option to the reproduction command. The commands have to be adjusted manually for this repo, and I missed it in the process. I have updated the command in the issue description. Sorry for the confusion.
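
For reference, a two-node launch matching the configuration above might be assembled as in the following sketch. The corrected command from the issue description is not quoted in this thread, so this is only an illustration: the helper name `build_torchrun_command` and the rendezvous endpoint `node0.example:29500` are hypothetical, the script path and benchmark flags are copied from the single-node command earlier in the thread, and the c10d rendezvous flags are one common way to connect nodes with torchrun.

```python
import shlex


def build_torchrun_command(rdzv_endpoint, nnodes=2, nproc_per_node=8, shard_mode="zero2"):
    """Build a torchrun argument list for a multi-node FSDP benchmark run.

    `rdzv_endpoint` is a placeholder "host:port" reachable from every node;
    the same command would be started on each node.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # the option missing from the original report
        f"--nproc-per-node={nproc_per_node}",
        "--rdzv-backend=c10d",
        f"--rdzv-endpoint={rdzv_endpoint}",
        # Script path and flags below are copied from the single-node command in this thread.
        "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py",
        "--model_name", "dolly-v2-3b",
        "--compile", "thunder_inductor_cat_cudnn",
        "--distributed_mode", "fsdp",
        "--shard_mode", shard_mode,
    ]


# Hypothetical endpoint; replace with the address of one of the two nodes.
cmd = build_torchrun_command("node0.example:29500")
print(shlex.join(cmd))
```

Passing `shard_mode="zero3"` would cover the other sharding mode listed above.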
