
Comments (16)

germanjke avatar germanjke commented on August 20, 2024 1

Hi, it looks like I had some problems with load at my S3.
Today everything is fine; I can save and resume training from
Also, thanks for the advice about the Lion optimizer: we can train with it without sharding as well.

from composer.

mvpatel2000 avatar mvpatel2000 commented on August 20, 2024

How much RAM are you using in this scenario? If you are using Adam as an optimizer, you should expect roughly 12 bytes per parameter, i.e. 12 × (parameter count in billions) gigabytes, to be used for saving checkpoints. If you are doing multi-node training, using sharded checkpoints would help with this: https://docs.mosaicml.com/projects/composer/en/latest/notes/distributed_training.html#saving-and-loading-sharded-checkpoints-with-fsdp
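
As a rough illustration of that estimate, here is a minimal sketch. It assumes the factor of 12 means about 12 bytes per parameter (an assumption on my part: fp32 weights plus two fp32 Adam moment buffers). The function name and the 7-billion-parameter example are hypothetical and do not come from this thread.

```python
def adam_checkpoint_gb(num_params: int, bytes_per_param: int = 12) -> float:
    """Rough size in GB of an unsharded checkpoint with Adam state.

    bytes_per_param=12 is an assumed breakdown: 4 bytes for the fp32 weight
    plus 2 x 4 bytes for Adam's first and second moment estimates.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Hypothetical 7B-parameter model, used only as an example.
    print(adam_checkpoint_gb(7_000_000_000))
```

With sharded checkpoints each rank only materializes its own slice, so the per-node memory needed at save time should be a fraction of this total.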


germanjke avatar germanjke commented on August 20, 2024

more than 7 TB
Thanks, I'm using multi-node training, so I will try to use

fsdp_config:
      state_dict_type: sharded

I will let you know the results.


germanjke avatar germanjke commented on August 20, 2024

Right now I'm using

load_path: s3://path/ep1-ba4000-rank0.pt to resume training. How should I name them if they are sharded?


germanjke avatar germanjke commented on August 20, 2024

I see this process creates a folder; I guess that folder is the input to load_path.


germanjke avatar germanjke commented on August 20, 2024

I have this error

2023-11-15 16:14:28,863: rank7[277][MainThread]: DEBUG: composer.utils.checkpoint: State dict created.
2023-11-15 16:14:28,887: rank7[277][MainThread]: DEBUG: composer.utils.checkpoint: Saving sharded checkpoints to mypath/ep0-ba20/__7_0.distcp...
2023-11-15 16:15:07,819: rank7[277][MainThread]: DEBUG: composer.callbacks.checkpoint_saver: Checkpoint locally saved to mypath/ep0-ba20/__7_0.distcp
2023-11-15 16:15:07,820: rank7[277][MainThread]: DEBUG: composer.callbacks.checkpoint_saver: Uploading checkpoint to mypath/ep0-ba20/__7_0.distcp
FileNotFoundError: [Errno 2] No such file or directory: 
'mypath/ep0-ba20/__7_0.distcp'

But this happens only on some nodes; the other ones crash because of this one.


germanjke avatar germanjke commented on August 20, 2024

And yes, on rank 0 they are not saved locally; the folder is empty.

for example, rank 1

-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __10_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __11_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __12_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __13_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __14_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __15_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __8_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __9_0.distcp

rank 2

-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __16_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __17_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __18_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __19_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __20_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __21_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __22_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __23_0.distcp

...

rank 7

-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __56_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __57_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __58_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __59_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __60_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __61_0.distcp
-rw-r--r-- 1 root root 2495601003 Nov 15 16:14 __62_0.distcp
-rw-r--r-- 1 root root 2491668843 Nov 15 16:14 __63_0.distcp

Why are they not saved (or why are they deleted) locally on rank 0?


germanjke avatar germanjke commented on August 20, 2024
save_interval: 20ba
save_num_checkpoints_to_keep: 0  # Important, this cleans up checkpoints saved to DISK
save_folder: s3://path/{run_name}/
save_filename: ep{epoch}-ba{batch}-rank{rank}.pt
save_overwrite: true


germanjke avatar germanjke commented on August 20, 2024

And my final folder on S3 is missing some indices (for example, 1, 7, etc.):

2023-11-15 19:15:46 2502278697 __0_0.distcp
2023-11-15 19:15:54 2495601003 __10_0.distcp
2023-11-15 19:15:53 2495601003 __11_0.distcp
2023-11-15 19:15:55 2495601003 __12_0.distcp
2023-11-15 19:15:55 2495601003 __13_0.distcp
2023-11-15 19:15:53 2495601003 __14_0.distcp
2023-11-15 19:15:52 2495601003 __15_0.distcp
2023-11-15 19:15:52 2495601003 __16_0.distcp
2023-11-15 19:15:53 2495601003 __17_0.distcp
2023-11-15 19:15:52 2495601003 __18_0.distcp
2023-11-15 19:15:52 2495601003 __19_0.distcp
2023-11-15 19:15:53 2495601003 __20_0.distcp
2023-11-15 19:15:55 2495601003 __21_0.distcp
2023-11-15 19:15:52 2495601003 __22_0.distcp
2023-11-15 19:15:54 2495601003 __23_0.distcp
2023-11-15 19:15:59 2495601003 __24_0.distcp
2023-11-15 19:15:55 2495601003 __25_0.distcp
2023-11-15 19:15:58 2495601003 __26_0.distcp
2023-11-15 19:15:57 2495601003 __27_0.distcp
2023-11-15 19:15:58 2495601003 __28_0.distcp
2023-11-15 19:15:59 2495601003 __29_0.distcp
2023-11-15 19:15:45 2495601003 __2_0.distcp
2023-11-15 19:15:51 2495601003 __30_0.distcp
2023-11-15 19:15:54 2495601003 __31_0.distcp
2023-11-15 19:15:59 2495601003 __32_0.distcp
2023-11-15 19:15:58 2495601003 __33_0.distcp
2023-11-15 19:15:58 2495601003 __34_0.distcp
2023-11-15 19:15:58 2495601003 __35_0.distcp
2023-11-15 19:15:58 2495601003 __36_0.distcp
2023-11-15 19:15:55 2495601003 __37_0.distcp
2023-11-15 19:15:58 2495601003 __38_0.distcp
2023-11-15 19:15:57 2495601003 __39_0.distcp
2023-11-15 19:15:44 2495601003 __3_0.distcp
2023-11-15 19:15:59 2495601003 __40_0.distcp
2023-11-15 19:15:57 2495601003 __41_0.distcp
2023-11-15 19:15:57 2495601003 __42_0.distcp
2023-11-15 19:15:58 2495601003 __43_0.distcp
2023-11-15 19:15:59 2495601003 __44_0.distcp
2023-11-15 19:15:55 2495601003 __45_0.distcp
2023-11-15 19:15:59 2495601003 __46_0.distcp
2023-11-15 19:15:59 2495601003 __47_0.distcp
2023-11-15 19:15:58 2495601003 __48_0.distcp
2023-11-15 19:15:58 2495601003 __49_0.distcp
2023-11-15 19:15:44 2495601003 __4_0.distcp
2023-11-15 19:15:55 2495601003 __50_0.distcp
2023-11-15 19:15:57 2495601003 __51_0.distcp
2023-11-15 19:15:57 2495601003 __52_0.distcp
2023-11-15 19:15:54 2495601003 __53_0.distcp
2023-11-15 19:15:59 2495601003 __54_0.distcp
2023-11-15 19:15:59 2495601003 __55_0.distcp
2023-11-15 19:15:55 2495601003 __56_0.distcp
2023-11-15 19:15:57 2495601003 __57_0.distcp
2023-11-15 19:15:55 2495601003 __58_0.distcp
2023-11-15 19:15:55 2495601003 __59_0.distcp
2023-11-15 19:15:44 2495601003 __5_0.distcp
2023-11-15 19:15:55 2495601003 __60_0.distcp
2023-11-15 19:15:55 2495601003 __61_0.distcp
2023-11-15 19:15:54 2495601003 __62_0.distcp
2023-11-15 19:15:55 2491668843 __63_0.distcp
2023-11-15 19:15:45 2495601003 __6_0.distcp
2023-11-15 19:15:54 2495601003 __8_0.distcp
2023-11-15 19:15:55 2495601003 __9_0.distcp
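
A quick way to confirm which shards are absent from a listing like the one above is to parse the file names and compare them with the expected rank range. This is a minimal sketch, assuming one `__<rank>_0.distcp` file per rank (the pattern seen in the logs above) and a world size of 64 (inferred from the highest index, 63). The helper name `missing_shards` is hypothetical and is not part of Composer.

```python
import re

# Matches shard files such as "__7_0.distcp" and captures the rank index.
SHARD_RE = re.compile(r"__(\d+)_0\.distcp$")


def missing_shards(filenames, world_size):
    """Return the sorted rank indices that have no shard file in `filenames`."""
    found = set()
    for name in filenames:
        match = SHARD_RE.search(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(world_size)) - found)


if __name__ == "__main__":
    # Rebuild the S3 listing from this comment: indices 0..63 except 1 and 7.
    listing = [f"__{i}_0.distcp" for i in range(64) if i not in (1, 7)]
    print(missing_shards(listing, world_size=64))
```

The function accepts any iterable of names or full paths, so the output of an S3 listing command can be passed in line by line.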


mvpatel2000 avatar mvpatel2000 commented on August 20, 2024

@germanjke can you please summarize what you are seeing in one message? I'm having a hard time finding the latest error, since it seems like some of the issues were fixed.


germanjke avatar germanjke commented on August 20, 2024

@mvpatel2000 I get a "No such file or directory" error: some shards are not saved, but most are saved successfully. It is not possible to assemble a whole checkpoint because several shards are missing.


mvpatel2000 avatar mvpatel2000 commented on August 20, 2024

Can you please verify these shards are produced locally?


germanjke avatar germanjke commented on August 20, 2024

They are not saved locally on node 0, but they are saved on all other nodes @mvpatel2000


mvpatel2000 avatar mvpatel2000 commented on August 20, 2024

Can you please specify your entire FSDP config (and, in general, try to provide a reproducible example)? We've never seen this on our end, so we're trying to debug.


germanjke avatar germanjke commented on August 20, 2024
# FSDP
fsdp_config:
  state_dict_type: sharded
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false


mvpatel2000 avatar mvpatel2000 commented on August 20, 2024

Can you please try to give a reproducible example? That looks correct to me, and there is nothing special in the code for node 0 about this :(

