🐛 Bug
I am running the asr_wsj recipe. It has been training the word_lm (stage 6) since last night but has produced no output, logging or otherwise.
When I run nvtop or nvidia-smi, the GPUs appear busy with my jobs. I am running 4 GPUs in parallel. Early on there were some OOM errors that it tried to recover from. Is it possible it is stuck in some sort of weird infinite loop but doing nothing?
Attached is the screen output; at the top you can see nvidia-smi being run, along with the early OOM messages.
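In case it helps narrow things down: a generic way to see where the trainer processes are blocked is to dump their Python stacks with py-spy. This is just a diagnostic sketch, not part of the recipe; the pgrep pattern is an assumption about how the trainer shows up in the process list.

```sh
# Hypothetical diagnostic: dump the Python stack of every trainer process
# to see whether they are stuck in NCCL collectives, data loading, etc.
# Requires: pip install py-spy
for pid in $(pgrep -f fairseq_cli.train); do
    py-spy dump --pid "$pid"
done
```

If every rank shows the same frame (e.g., all waiting in an all-reduce), that would suggest one worker fell out of sync after the OOM while the others wait forever.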
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/condabin/conda
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/bin/conda
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/bin/conda-env
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/bin/activate
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/bin/deactivate
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/etc/profile.d/conda.sh
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/etc/fish/conf.d/conda.fish
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/shell/condabin/Conda.psm1
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/shell/condabin/conda-hook.ps1
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/lib/python3.7/site-packages/xontrib/conda.xsh
no change /misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espnet-may142020/etc/profile.d/conda.csh
no change /home/map22/.bashrc
No action taken.
Tue Dec 8 22:30:53 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.36 Driver Version: 440.36 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 00000000:02:00.0 Off | N/A |
| 23% 18C P8 9W / 250W | 1MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 00000000:03:00.0 Off | N/A |
| 23% 21C P8 9W / 250W | 1MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... On | 00000000:82:00.0 Off | N/A |
| 23% 22C P8 8W / 250W | 1MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... On | 00000000:83:00.0 Off | N/A |
| 23% 22C P8 8W / 250W | 1MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Stage 3: Text Binarization for LM Training
./run.sh: binarizing word text...
Unable to get 4 GPUs
Stage 6: word LM Training
2020-12-08 22:32:29 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:19801
2020-12-08 22:32:29 | INFO | fairseq.distributed_utils | distributed init (rank 2): tcp://localhost:19801
2020-12-08 22:32:29 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:19801
2020-12-08 22:32:29 | INFO | fairseq.distributed_utils | distributed init (rank 3): tcp://localhost:19801
2020-12-08 22:32:39 | INFO | fairseq.distributed_utils | initialized host lion6.cs.nyu.edu as rank 3
2020-12-08 22:32:39 | INFO | fairseq.distributed_utils | initialized host lion6.cs.nyu.edu as rank 2
2020-12-08 22:32:39 | INFO | fairseq.distributed_utils | initialized host lion6.cs.nyu.edu as rank 0
2020-12-08 22:32:39 | INFO | fairseq.distributed_utils | initialized host lion6.cs.nyu.edu as rank 1
2020-12-08 22:32:39 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 1000, 'log_format': 'simple', 'tensorboard_logdir': None, 'wandb_project': None, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': True}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 4, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:19801', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'c10d', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'broadcast_buffers': False, 'distributed_wrapper': 'DDP', 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 4, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'tpu': False, 'distributed_num_procs': 4}, 'dataset': {'_name': None, 'num_workers': 0, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 6400, 'batch_size': 256, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 6400, 'batch_size_valid': 512, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 25, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.001], 'min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'exp/wordlm_lstm', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 1000, 'keep_interval_updates': 5, 'keep_last_epochs': 5, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'model_parallel_size': 1, 'distributed_rank': 0}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 4}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': 
False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': False, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False, 'eos_factor': None, 'subwordlm_weight': 0.8, 'oov_penalty': 0.0001, 'disable_open_vocab': False, 'apply_log_softmax': False, 'state_prior_file': None}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='lstm_wordlm_wsj', adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_softmax_cutoff=None, add_bos_token=False, all_gather_list_size=16384, arch='lstm_wordlm_wsj', batch_size=256, batch_size_valid='512', best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', curriculum=0, data='data/wordlm_text', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_dropout_in=0.35, decoder_dropout_out=0.35, decoder_embed_dim=1200, decoder_embed_path=None, decoder_freeze_embed=False, decoder_hidden_size=1200, decoder_layers=3, decoder_out_embed_dim=1200, decoder_rnn_residual=False, device_id=0, dict='data/lang/wordlist_65000.txt', disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, distributed_wrapper='DDP', dropout=0.35, empty_cache_freq=0, eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, future_target=False, gen_subset='test', is_wordlm=True, keep_best_checkpoints=-1, keep_interval_updates=5, keep_last_epochs=5, localsgd_frequency=3, log_format='simple', log_interval=1000, lr=[0.001], lr_patience=0, lr_scheduler='reduce_lr_on_plateau', lr_shrink=0.5, lr_threshold=0.0001, max_epoch=25, max_target_positions=None, max_tokens=6400, max_tokens_valid=6400, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1.0, model_parallel_size=1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, nprocs_per_node=4, num_shards=1, num_workers=0, optimizer='adam', optimizer_overrides='{}', output_dictionary_size=-1, pad=1, past_target=False, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, profile=False, quantization_config_path=None, required_batch_size_multiple=8, 
required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_break_mode='eos', save_dir='exp/wordlm_lstm', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, self_target=False, sentence_avg=False, shard_id=0, share_embed=True, shorten_data_split_list='', shorten_method='none', skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, stop_time_hours=0, task='language_modeling_for_asr', tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, tpu=False, train_subset='train', unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=-1, warmup_updates=0, weight_decay=0.0, zero_sharding='none'), 'task': {'_name': 'language_modeling_for_asr', 'data': 'data/wordlm_text', 'sample_break_mode': 'eos', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': None, 'shorten_method': 'none', 'shorten_data_split_list': '', 'seed': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'dict': 'data/lang/wordlist_65000.txt'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': False}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.999)', 'adam_eps': 1e-08, 'weight_decay': 0.0, 'use_old_adam': False, 'tpu': False, 'lr': [0.001]}, 'lr_scheduler': {'_name': 'reduce_lr_on_plateau', 'lr_shrink': 0.5, 'lr_threshold': 0.0001, 'lr_patience': 0, 'warmup_updates': 0, 'warmup_init_lr': -1.0, 'lr': [0.001], 'maximize_best_checkpoint_metric': False}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
2020-12-08 22:32:39 | INFO | espresso.tasks.language_modeling_for_asr | dictionary: 65003 types
2020-12-08 22:32:39 | INFO | fairseq.data.data_utils | loaded 503 examples from: data/wordlm_text/valid
2020-12-08 22:32:42 | INFO | fairseq_cli.train | LSTMLanguageModelEspresso(
(decoder): SpeechLSTMDecoder(
(dropout_in_module): FairseqDropout()
(dropout_out_module): FairseqDropout()
(embed_tokens): Embedding(65003, 1200, padding_idx=0)
(layers): ModuleList(
(0): LSTMCell(1200, 1200)
(1): LSTMCell(1200, 1200)
(2): LSTMCell(1200, 1200)
)
)
)
2020-12-08 22:32:42 | INFO | fairseq_cli.train | task: LanguageModelingForASRTask
2020-12-08 22:32:42 | INFO | fairseq_cli.train | model: LSTMLanguageModelEspresso
2020-12-08 22:32:42 | INFO | fairseq_cli.train | criterion: CrossEntropyCriterion)
2020-12-08 22:32:42 | INFO | fairseq_cli.train | num. model params: 112592400 (num. trained: 112592400)
2020-12-08 22:32:43 | INFO | fairseq.utils | CUDA enviroments for all 4 workers
2020-12-08 22:32:43 | INFO | fairseq.utils | rank 0: capabilities = 6.1 ; total memory = 10.917 GB ; name = GeForce GTX 1080 Ti
2020-12-08 22:32:43 | INFO | fairseq.utils | rank 1: capabilities = 6.1 ; total memory = 10.917 GB ; name = GeForce GTX 1080 Ti
2020-12-08 22:32:43 | INFO | fairseq.utils | rank 2: capabilities = 6.1 ; total memory = 10.917 GB ; name = GeForce GTX 1080 Ti
2020-12-08 22:32:43 | INFO | fairseq.utils | rank 3: capabilities = 6.1 ; total memory = 10.917 GB ; name = GeForce GTX 1080 Ti
2020-12-08 22:32:43 | INFO | fairseq.utils | CUDA enviroments for all 4 workers
2020-12-08 22:32:43 | INFO | fairseq_cli.train | training on 4 devices (GPUs/TPUs)
2020-12-08 22:32:43 | INFO | fairseq_cli.train | max tokens per GPU = 6400 and batch size per GPU = 256
2020-12-08 22:32:43 | INFO | fairseq.trainer | no existing checkpoint found exp/wordlm_lstm/checkpoint_last.pt
2020-12-08 22:32:43 | INFO | fairseq.trainer | loading train data for epoch 1
/misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espresso-dec082020/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:398: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
  "The `check_reduction` argument in `DistributedDataParallel` "
2020-12-08 22:41:58 | INFO | fairseq.data.data_utils | loaded 1662964 examples from: data/wordlm_text/train
/misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espresso-dec082020/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:398: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
  "The `check_reduction` argument in `DistributedDataParallel` "
/misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espresso-dec082020/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:398: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
  "The `check_reduction` argument in `DistributedDataParallel` "
/misc/vlgscratch4/PichenyGroup/picheny/anaconda3/envs/espresso-dec082020/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:398: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
  "The `check_reduction` argument in `DistributedDataParallel` "
2020-12-08 22:42:06 | INFO | fairseq.trainer | begin training epoch 1
/misc/vlgscratch5/PichenyGroup/picheny/espresso/fairseq/utils.py:347: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/misc/vlgscratch5/PichenyGroup/picheny/espresso/fairseq/utils.py:347: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/misc/vlgscratch5/PichenyGroup/picheny/espresso/fairseq/utils.py:347: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/misc/vlgscratch5/PichenyGroup/picheny/espresso/fairseq/utils.py:347: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2020-12-08 22:42:08 | INFO | root | Reducer buckets have been rebuilt in this iteration.
2020-12-08 22:42:14 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 1; 10.92 GiB total capacity; 7.68 GiB already allocated; 1.37 GiB free; 8.91 GiB reserved in total by PyTorch)
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary, device ID 0 (CUDA OOMs: 0); per-metric values were lost in the paste]
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary, device ID 1 (CUDA OOMs: 1); per-metric values were lost in the paste]
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary, device ID 2 (CUDA OOMs: 0); per-metric values were lost in the paste]
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary, device ID 3 (CUDA OOMs: 0); per-metric values were lost in the paste]
2020-12-08 22:42:14 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
2020-12-08 22:42:14 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 2; 10.92 GiB total capacity; 7.66 GiB already allocated; 945.06 MiB free; 9.36 GiB reserved in total by PyTorch)
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary, device ID 0 (CUDA OOMs: 0); per-metric values were lost in the paste]
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary, device ID 1 (CUDA OOMs: 0); per-metric values were lost in the paste]
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary, device ID 2 (CUDA OOMs: 1); per-metric values were lost in the paste]
2020-12-08 22:42:14 | WARNING | fairseq.trainer | [PyTorch CUDA memory summary, device ID 3 (CUDA OOMs: 0); per-metric values were lost in the paste]
2020-12-08 22:42:14 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
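Given the repeated OOM recoveries above, one workaround I could try is shrinking the per-GPU batch and compensating with gradient accumulation. A hedged sketch, assuming stage 6 ultimately calls fairseq-train with roughly the options visible in the config dump (max_tokens=6400 and batch_size=256 there; --dict is guessed from the dumped 'dict' key, and the exact invocation inside run.sh may differ):

```sh
# Halve the per-GPU batch and accumulate gradients over 2 steps, so the
# effective batch size stays the same while peak memory roughly halves.
fairseq-train data/wordlm_text \
  --task language_modeling_for_asr --dict data/lang/wordlist_65000.txt \
  --arch lstm_wordlm_wsj --optimizer adam --lr 0.001 \
  --max-tokens 3200 --batch-size 128 --update-freq 2 \
  --save-dir exp/wordlm_lstm --max-epoch 25
```

That said, the OOMs themselves look recoverable; what worries me more is the silence after this last "attempting to recover" line, which is consistent with the workers having desynchronized during recovery.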