Composer crashes when attempting to load sharded checkpoint (llm-foundry, OPEN)

growlix commented on June 10, 2024

Composer crashes when attempting to load sharded checkpoint

from llm-foundry.

Comments (3)

hanlint commented on June 10, 2024

Hello @growlix, are you running this in fp8?

If so, this issue was fixed in mosaicml/composer#2907 and released in v0.19.0, so you should upgrade your composer version.
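To confirm that the environment actually picks up a release containing the fix, a small check along these lines may help. It is only a sketch: it assumes Composer is installed under the distribution name `mosaicml`, and the helper names (`parse_version`, `is_at_least`, `installed_composer_version`) are made up for illustration and are not part of Composer.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.19.0' into (0, 19, 0).

    Only the leading digits of each dot-separated piece are used, so
    pre-release suffixes such as 'rc1' or '.dev0' are ignored.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '0.19' compares equal to '0.19.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(installed, required="0.19.0"):
    """True if the installed version is the required release or newer."""
    return parse_version(installed) >= parse_version(required)


def installed_composer_version(dist_name="mosaicml"):
    """Return the installed version string, or None if not found.

    'mosaicml' as the distribution name is an assumption.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_composer_version()
    if version is None:
        print("Composer distribution not found in this environment")
    else:
        print(version, "ok" if is_at_least(version) else "older than 0.19.0")
```

Running this with the same interpreter that launches training guards against the case where a different, older install is the one being imported.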

growlix commented on June 10, 2024

Thank you so much, @hanlint! We are running in fp8. We'll update to v0.19.0 and give it a whirl!

prigoyal commented on June 10, 2024

@hanlint, we tried composer 0.19.0 but we are still hitting the issue. Is there any change to the config we need to make?
We are specifying the load path as the shard prefix following this

    trainer = Trainer(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1493, in __init__
    self._rng_state = checkpoint.load_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 366, in load_checkpoint
    rng_state_dicts = load_sharded_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 558, in load_sharded_checkpoint
    optim_state = load_sharded_optimizer_state_dict(model_state_dict=state.state_dict()['model'],
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 264, in load_sharded_optimizer_state_dict
    layout_specs, dp_pg = _get_state_dict_2d_layout(model_state_dict)
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 128, in _get_state_dict_2d_layout
    specs[key] = (None, value.size())
AttributeError: '_io.BytesIO' object has no attribute 'size'
Traceback (most recent call last):
  File "/fsx/users/prigoyal/experiments/prigoyal/science/20240227-16-10-13_bump-composer/bench-MPT1b-RPJ-fp8-noactckpt-noaccum-bs160-v5docker-flash-noqknorm-sharded-resume15ba/science/tools/train_llms.py", line 632, in <module>
    main(cfg)
  File "/fsx/users/prigoyal/experiments/prigoyal/science/20240227-16-10-13_bump-composer/bench-MPT1b-RPJ-fp8-noactckpt-noaccum-bs160-v5docker-flash-noqknorm-sharded-resume15ba/science/tools/train_llms.py", line 564, in main
    trainer = Trainer(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1493, in __init__
    self._rng_state = checkpoint.load_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 366, in load_checkpoint
    rng_state_dicts = load_sharded_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 558, in load_sharded_checkpoint
    optim_state = load_sharded_optimizer_state_dict(model_state_dict=state.state_dict()['model'],
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 264, in load_sharded_optimizer_state_dict
    layout_specs, dp_pg = _get_state_dict_2d_layout(model_state_dict)
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 128, in _get_state_dict_2d_layout
    specs[key] = (None, value.size())
AttributeError: '_io.BytesIO' object has no attribute 'size'
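The last frame fails because `value.size()` is called on an `io.BytesIO` entry of the model state dict. As a debugging aid, something like the sketch below could list which entries lack a `.size()` method. The helper `find_sizeless_entries`, the `FakeTensor` stand-in, and the key names are hypothetical and used only for illustration; with a real model you would pass `state.state_dict()['model']` instead.

```python
import io


class FakeTensor:
    """Stand-in for a tensor so the example runs without torch."""

    def __init__(self, shape):
        self._shape = tuple(shape)

    def size(self):
        return self._shape


def find_sizeless_entries(state_dict, prefix=""):
    """Return (key, type name) pairs for values without a callable .size().

    Nested dicts are walked recursively, with keys joined by dots.
    """
    sizeless = []
    for key, value in state_dict.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            sizeless.extend(find_sizeless_entries(value, prefix=full_key + "."))
        elif not callable(getattr(value, "size", None)):
            sizeless.append((full_key, type(value).__name__))
    return sizeless


if __name__ == "__main__":
    # Hypothetical state dict: one tensor-like entry, one serialized blob.
    example = {
        "layer.weight": FakeTensor((4, 4)),
        "layer.extra_blob": io.BytesIO(b"serialized metadata"),
    }
    for key, type_name in find_sizeless_entries(example):
        print(f"{key}: {type_name} has no .size()")
```

Knowing which keys hold non-tensor values should make it easier to tell whether the remaining failure comes from the same fp8-related entries or from something else in the config.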
