Comments (4)
Hm, I am not sure why it's slowing down so much in multi-GPU settings. It's speculation, but if the GPUs have a slow interconnect, the communication overhead could be causing this. Btw, we are also (potentially) adding a --skip_validation flag
via #1228 to make validation optional (but yeah, you probably still want to validate and are more concerned about the slowdown). Sorry, I don't have a good explanation at the moment.
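If that flag lands, calling it would presumably look something like the sketch below (hypothetical: #1228 isn't merged yet, so the flag name and behavior could still change; the other flags mirror the generate example further down):

litgpt finetune lora \
  --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1 \
  --skip_validation true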
from litgpt.
Thanks for reporting, and huh, that's a weird one; I haven't seen this before. As a sanity check, I wonder what happens if you use the generate function to emulate the inference step of the finetuning run:
litgpt generate base \
--prompt "Recommend a movie for me to watch during the weekend and explain the reason." \
--checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
from litgpt.
@rasbt - That is completely fine FYI
(base) ubuntu@ip-10-0-0-185:~/sky_workdir$ litgpt generate base --prompt "Recommend a movie for me to watch during the weekend and explain the reason." --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
Loading model 'checkpoints/mistralai/Mistral-7B-Instruct-v0.1/lit_model.pth' with {'name': 'Mistral-7B-Instruct-v0.1', 'hf_config': {'name': 'Mistral-7B-Instruct-v0.1', 'org': 'mistralai'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 512, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 8, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 14336, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.11 seconds.
Time to load the model weights: 6.73 seconds.
Seed set to 1234
<s>[INST] Recommend a movie for me to watch during the weekend and explain the reason. [/INST] One great movie that I recommend you to watch during the weekend is "The Shawshank Redemption" released in 1994, directed by Frank Darabont and starring Tim Robbins and Morgan Freeman.
Time for inference 1: 2.38 sec total, 21.01 tokens/sec
Memory used: 14.54 GB
from litgpt.
(task, pid=9078) distributed_backend=nccl
(task, pid=9078) All distributed processes registered. Starting with 4 processes
(task, pid=9078) ----------------------------------------------------------------------------------------------------
(task, pid=9078)
(task, pid=9078) [rank: 1] Seed set to 1337
(task, pid=9078) [rank: 3] Seed set to 1337
(task, pid=9078) [rank: 2] Seed set to 1337
(task, pid=9078) [rank: 0] Seed set to 1337
(task, pid=9078) Number of trainable parameters: 76,652,544
(task, pid=9078) Number of non-trainable parameters: 7,241,732,096
(task, pid=9078) The longest sequence length in the train data is 535, the model's maximum sequence length is 535 and context length is 4096
(task, pid=9078) Validating ...
(task, pid=9078) Recommend a movie for me to watch during the weekend and explain the reason.
(task, pid=9078) Below is an instruction that describes a task. Write a response that appropriately completes the request.
(task, pid=9078)
(task, pid=9078) ### Instruction:
(task, pid=9078) Recommend a movie for me to watch during the weekend and explain the reason.
(task, pid=9078)
(task, pid=9078) ### Response:
(task, pid=9078) I recommend the movie "The Shawshank Redemption". It's a classic that is widely loved by many people. It's an excellent story with great characters, and it's sure to keep you engaged from start to finish. Additionally, it's a timeless film that has themes that are still relevant today. It's perfect to watch during the weekend because it's a long movie that will give you plenty of time to get absorbed in the story.
A few observations related to this:
- The validation call at the start of the fit method takes much longer than on a single GPU.
- Setting eval.interval to a low number like 10 only fixes the validation slowdown momentarily (see the sketch below).
It feels like the issue relates to something cumulative. Maybe memory or bandwidth? @rasbt
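For reference, a hedged sketch of how a run like the one above could be launched with a shorter evaluation interval; --devices, --eval.interval, and --checkpoint_dir follow litgpt's finetuning CLI conventions, but the exact option names may differ between versions:

litgpt finetune lora \
  --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1 \
  --devices 4 \
  --eval.interval 10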
from litgpt.