I followed the guide on training a model in Azure Machine Learning and got an error after 8 hours of training. From the logs I can see that it completed the first 50 epochs and crashed when it tried to start epoch 51/60 (which may be the point where training switches phases, given the fine_tune_epochs/unfrozen_epochs flags visible in the stack trace). The error messages are listed below.
Any ideas on how to solve this?
Do we need to clear some kind of memory cache?
Do I need a bigger machine with a better GPU?
How long should the training take in your tests?
Should I just retrain?
Summary:
Run failed: User program failed with ResourceExhaustedError: OOM when allocating tensor with shape[32,128,52,52] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node conv2d_12/convolution}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. [[{{node Mean_1}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
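For context, here is my own back-of-the-envelope calculation (not from the logs): the single tensor the allocator failed on is only about 42 MiB, assuming "type float" means float32, so the GPU memory must already be nearly exhausted by earlier allocations when this one fails.

```python
# Size of the tensor the allocator failed on: shape [32, 128, 52, 52].
# Assumption: "type float" in the error means float32, i.e. 4 bytes per element.
batch, channels, h, w = 32, 128, 52, 52
bytes_per_float = 4

tensor_bytes = batch * channels * h * w * bytes_per_float
print(tensor_bytes)            # 44302336 bytes
print(tensor_bytes / 2**20)    # 42.25 MiB
```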
Detailed:
Session ID: 4e001867-167d-49da-9618-6dcfc4053686
{"error":{"code":"UserError","message":"User program failed with ResourceExhaustedError: OOM when allocating tensor with shape[32,128,52,52] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc\n\t [[{{node conv2d_12/convolution}}]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.\n\n\t [[{{node Mean_1}}]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.\n","detailsUri":"https://aka.ms/azureml-known-errors","target":null,"details":[],"innerError":null,"debugInfo":{"type":"ResourceExhaustedError","message":"OOM when allocating tensor with shape[32,128,52,52] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc\n\t [[{{node conv2d_12/convolution}}]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.\n\n\t [[{{node Mean_1}}]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.\n","stackTrace":" File "/mnt/batch/tasks/shared/LS_root/jobs/aml-cviotedge-prod/azureml/yolov3_1576506500_05614ed0/mounts/workspaceblobstore/azureml/yolov3_1576506500_05614ed0/azureml-setup/context_manager_injector.py", line 115, in execute_with_context\n runpy.run_path(sys.argv[0], globals(), run_name="main")\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/runpy.py", line 263, in run_path\n pkg_name=pkg_name, script_name=fname)\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/runpy.py", line 96, in _run_module_code\n mod_name, mod_spec, pkg_name, script_name)\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/runpy.py", line 85, in 
_run_code\n exec(code, run_globals)\n File "train.py", line 237, in \n _main(FLAGS.model, FLAGS.fine_tune_epochs, FLAGS.unfrozen_epochs, FLAGS.learning_rate)\n File "train.py", line 110, in _main\n callbacks=[logging, checkpoint, reduce_lr, early_stopping, lossHistory])\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper\n return func(*args, **kwargs)\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/site-packages/keras/engine/training.py", line 1732, in fit_generator\n initial_epoch=initial_epoch)\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/site-packages/keras/engine/training_generator.py", line 220, in fit_generator\n reset_metrics=False)\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/site-packages/keras/engine/training.py", line 1514, in train_on_batch\n outputs = self.train_function(ins)\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3076, in call\n run_metadata=self.run_metadata)\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in call\n run_metadata_ptr)\n File "/azureml-envs/azureml_6536bbd782c8c80c6934351febe412f9/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in exit\n c_api.TF_GetCode(self.status.status))\n","innerException":null,"data":null,"errorResponse":null}},"correlation":null,"environment":null,"location":null,"time":"0001-01-01T00:00:00+00:00"}