
Comments (5)

dptam commented on July 30, 2024

Sorry, I think the config might be slightly off, as it was meant for the 3B and not the 11B versions. For the 11B variants, to fit into memory, we used a smaller batch size but still had an effective batch size of 8. Our hyperparameters were batch_size=1 grad_accum_factor=8 eval_batch_size=2. Let us know if it still runs out of memory.
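For concreteness, the effective batch size is the per-step batch size times the gradient-accumulation factor; a quick check (the -k override style in the comment mirrors the invocation shown later in the thread):

```python
# Per-step batch size and accumulation factor from the comment above.
batch_size = 1
grad_accum_factor = 8

# Gradients are accumulated over grad_accum_factor steps before each
# optimizer update, so the effective batch size is their product.
effective_batch_size = batch_size * grad_accum_factor
print(effective_batch_size)  # 8

# These would be passed as overrides, e.g.:
#   -k batch_size=1 grad_accum_factor=8 eval_batch_size=2
```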

from t-few.

HaokunLiu commented on July 30, 2024

Thanks for your interest in our work!

It's hard to tell from the surface. Could you share with me the full log?
And if you are familiar with pytorch lightning, would you mind adding something like print("Memory usage at line [add something here]", torch.cuda.memory_allocated(device=None)) at the start and end of training_step in EncoderDecoder.py?
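A small, hypothetical helper can make those printouts easier to read, since torch.cuda.memory_allocated returns raw bytes; only fmt_bytes is runnable below, and the training_step placement is sketched in comments under the assumption it goes in the repo's EncoderDecoder.py:

```python
def fmt_bytes(num_bytes: int) -> str:
    """Render a byte count in GiB, matching the units in CUDA OOM messages."""
    return f"{num_bytes / 2**30:.2f} GiB"

# Hypothetical placement inside EncoderDecoder.training_step:
#
#     import torch
#     print("Memory at start of training_step:",
#           fmt_bytes(torch.cuda.memory_allocated()))
#     ...  # forward / backward
#     print("Memory at end of training_step:",
#           fmt_bytes(torch.cuda.memory_allocated()))

print(fmt_bytes(80 * 2**20))  # the 80 MiB failed allocation, in GiB
```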


danielkorat commented on July 30, 2024

Hi @HaokunLiu
We added the prints and attached the logs here. Looks like it runs out of memory before starting the training.


(tfew3.7) unso@hf-paris-dgx-station-1:~/t-few$ python -m src.pl_train -c t011b.json+ia3.json+rte.json -k load_weight="pretrained_checkpoints/t011b_ia3_finish.pt" exp_name=t011b_rte_seed42_ia3_pretrained few_shot_random_seed=42 seed=42 > logfile
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Reusing dataset super_glue (/home/unso/.cache/huggingface/datasets/super_glue/rte/1.0.2/d040c658e2ddef6934fdd97deb45c777b6ff50c524781ea434e7219b56a428a7)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Missing logger folder: /home/unso/t-few/exp_out/t011b_rte_seed42_ia3_pretrained/log
 | Name | Type            | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 11.1 B
-----------------------------------------------------
1.1 M   Trainable params
11.1 B  Non-trainable params
11.1 B  Total params
44,548.801 Total estimated model params size (MB)
Traceback (most recent call last):
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
  "__main__", mod_spec)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/runpy.py", line 85, in _run_code
  exec(code, run_globals)
 File "/home/unso/t-few/src/pl_train.py", line 86, in <module>
  main(config)
 File "/home/unso/t-few/src/pl_train.py", line 57, in main
  trainer.fit(model, datamodule)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
  self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
  return trainer_fn(*args, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
  self._run(model, ckpt_path=ckpt_path)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
  self._dispatch()
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
  self.training_type_plugin.start_training(self)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
  self._results = trainer.run_stage()
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
  return self._run_train()
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
  self.fit_loop.run()
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
  self.advance(*args, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
  self.epoch_loop.run(data_fetcher)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
  self.advance(*args, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
  batch_output = self.batch_loop.run(batch, batch_idx)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
  self.advance(*args, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
  outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
  self.advance(*args, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 219, in advance
  self.optimizer_idx,
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
  self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 386, in _optimizer_step
  using_lbfgs=is_lbfgs,
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
  optimizer.step(closure=optimizer_closure)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 164, in step
  trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in optimizer_step
  self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 80, in optimizer_step
  return super().optimizer_step(model, optimizer, optimizer_idx, closure, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
  optimizer.step(closure=closure, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
  return wrapped(*args, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/optim/optimizer.py", line 109, in wrapper
  return func(*args, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/transformers/optimization.py", line 528, in step
  loss = closure()
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 148, in _wrap_closure
  closure_result = closure()
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 160, in __call__
  self._result = self.closure(*args, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 142, in closure
  step_output = self._step_fn()
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 435, in _training_step
  training_step_output = self.trainer.accelerator.training_step(step_kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 216, in training_step
  return self.training_type_plugin.training_step(*step_kwargs.values())
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 213, in training_step
  return self.model.training_step(*args, **kwargs)
 File "/home/unso/t-few/src/models/EncoderDecoder.py", line 62, in training_step
  decoder_attention_mask=decoder_attention_mask,
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  return forward_call(*input, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 1623, in forward
  return_dict=return_dict,
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  return forward_call(*input, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 1020, in forward
  output_attentions=output_attentions,
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  return forward_call(*input, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 696, in forward
  hidden_states = self.layer[-1](hidden_states)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  return forward_call(*input, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 306, in forward
  forwarded_states = self.DenseReluDense(forwarded_states)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  return forward_call(*input, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 285, in forward
  hidden_states = self.wo(hidden_states)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  return forward_call(*input, **kwargs)
 File "/home/unso/miniconda3/envs/tfew3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
  return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 79.35 GiB total capacity; 78.13 GiB already allocated; 3.62 MiB free; 78.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
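As an aside, the allocator hint at the end of the message can be tried by setting an environment variable before CUDA is first used; a sketch, where 128 is only an illustrative starting value (and note that reserved memory, 78.20 GiB, is barely above allocated, 78.13 GiB, so fragmentation is unlikely to be the main problem here):

```python
import os

# Must be set before the first CUDA allocation; the shell equivalent is
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Caps the size of split blocks in PyTorch's caching allocator, which can
# reduce fragmentation; the value is a tunable guess.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```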


HaokunLiu commented on July 30, 2024

I remember the code prints out all the args at the beginning. Could you share that with me?


eunseojo commented on July 30, 2024

Thanks!

