Comments (6)
This already seems to be fixed. I tried it with today's container on H100:
root@c6f2bbeb93de:/opt/pytorch/lightning-thunder# torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name dolly-v2-3b --compile thunder_inductor_cat_cudnn --distributed_mode fsdp --shard_mode zero2
Model name: dolly-v2-3b
Seq Length: 2048
Micro BS: 1
Global BS: 8
Number of Layers: 32
Number of parameters: 0.35B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: none
Compiler: thunder_inductor_cat_cudnn
Average iter time: 113.18 ms
Memory used: 17.74 GB
Tokens/s: 144833.85
Tokens/s/GPU: 18104.23
TFLOP/s: 2590.36
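As a quick sanity check on the log above, the per-GPU throughput is consistent with the total divided by the number of ranks (8, from `--nproc-per-node=8`). This is a minimal sketch; the helper name `tokens_per_second_per_gpu` is hypothetical and not part of the benchmark script.

```python
def tokens_per_second_per_gpu(total_tokens_per_s: float, world_size: int) -> float:
    """Split the aggregate throughput evenly across all ranks."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    return total_tokens_per_s / world_size


# Values copied from the log above: Tokens/s and the 8 processes on one node.
per_gpu = tokens_per_second_per_gpu(144833.85, 8)
print(round(per_gpu, 2))  # 18104.23, matching the reported Tokens/s/GPU
```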
from lightning-thunder.
Thank you @wprazuch, I can reproduce it with 2 nodes:
[viking-prod-230:1]:Time to instantiate model: 0.02 seconds.
[viking-prod-230:0]:An error occurred: KeyError – 't5905'
@IvanYashchuk this is related to the past issue.
Only this KeyError persists in the new container version :)
All the best,
WP
Running `python thunder/benchmarks/benchmark_litgpt.py --model_name dolly-v2-3b --compile thunder_inductor_cat_cudnn --n_layers=1` doesn't reproduce the problem. I need to check on DGX when it's available.
Thank you, Yan, for checking this!
The reported error is also not reproducible on a single node with the provided tag `pjnl-20240621`. Is the tag correct?
@wprazuch, in the issue description you write that the error is seen when running on two nodes, but the reproducing command doesn't have the `--nnodes=` option. I don't know how the two-node run could differ from one node, but anything is possible; we'll check.
@IvanYashchuk Yes, you are right about the `--nnodes` option; it should be set to 2. The issue only occurs in a multi-node setup. We saw it with:
- nodes = 2
- gpu_per_node = 8
- fsdp
- zero2 and zero3
I forgot to add the option to the reproduction command. The commands have to be adjusted manually for this repo, and I missed it. I have updated the command in the issue description. Sorry for the confusion.
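For anyone trying to reproduce this, here is a minimal sketch of how the per-node launch commands for the configuration above (2 nodes, 8 GPUs per node, fsdp, zero2 or zero3) could be assembled. It is not the exact command from the issue description. The helper `build_torchrun_cmd`, the host name `node0.example`, and port `29500` are placeholders I made up; the `torchrun` options used (`--nnodes`, `--nproc-per-node`, `--node-rank`, `--master-addr`, `--master-port`) are standard ones, and the benchmark flags are copied from the single-node command earlier in the thread.

```python
import shlex


def build_torchrun_cmd(nnodes, node_rank, master_addr, master_port=29500,
                       nproc_per_node=8, shard_mode="zero2"):
    """Assemble the torchrun argument list for one node of a multi-node run.

    master_addr/master_port are placeholders for the rendezvous host;
    replace them with the address of the rank-0 node in your cluster.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        f"--node-rank={node_rank}",
        f"--master-addr={master_addr}",
        f"--master-port={master_port}",
        "thunder/benchmarks/benchmark_litgpt.py",
        "--model_name", "dolly-v2-3b",
        "--compile", "thunder_inductor_cat_cudnn",
        "--distributed_mode", "fsdp",
        "--shard_mode", shard_mode,
    ]


if __name__ == "__main__":
    # Print one command per node; each has to be started on its own machine.
    for rank in range(2):
        print(shlex.join(build_torchrun_cmd(2, rank, "node0.example")))
```

The script only prints the commands, so it can be run anywhere to check the flags before launching on the cluster. Pass `shard_mode="zero3"` to get the other failing configuration.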