I first reported this on [link truncated]

CUDA out of memory on one system but not another (joeyballentine/esrgan)

Comments (12)

joeyballentine commented on July 17, 2024

That's really strange, as I have code specifically to prevent out-of-memory errors. Can you copy/paste or take a screenshot of the exact error you get? It's possible that what I'm checking for breaks in certain cases.


2haloes commented on July 17, 2024

I've put the script output below. I've tried taking a quick look myself, but this is far outside my area of expertise.

Upscaling ----------------------------------------   0% -:--:--
Traceback (most recent call last):
  File "C:\Users\2haloes\Documents\esrgan\utils\dataops.py", line 44, in auto_split_upscale
    result = upscale_function(lr_img)
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 532, in upscale
    output = self.process(img)
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 289, in process
    output = self.model(img_LR).data.squeeze(0).float().cpu().clamp_(0, 1).numpy()
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\architecture.py", line 118, in forward
    x = self.model(x)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 119, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 92, in forward
    output = x + self.sub(x)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 119, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 317, in forward
    out = self.RDB2(out)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 432, in forward
    x5 = self.conv5(torch.cat((x, x1, x2, x3, x4), 1))
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 119, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Python39\lib\site-packages\torch\nn\modules\conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 658, in <module>
    app()
  File "C:\Python39\lib\site-packages\typer\main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "C:\Python39\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\Python39\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\Python39\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Python39\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\Python39\lib\site-packages\typer\main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 654, in main
    upscale.run()
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 237, in run
    rlt, depth = ops.auto_split_upscale(
  File "C:\Users\2haloes\Documents\esrgan\utils\dataops.py", line 54, in auto_split_upscale
    raise RuntimeError(e)
RuntimeError: CUDA error: out of memory


joeyballentine commented on July 17, 2024

So my assumption was correct. The error message you're getting is different from the usual one, so the check I do fails. I can update it to handle this alternate error as well.

Out of curiosity, what version of PyTorch are you running?
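
A minimal sketch of the kind of check being discussed, assuming the handler decides what to do by matching the exception text. The helper name `is_oom_error` and the `OOM_MARKERS` tuple are hypothetical and are not taken from the esrgan repository; the two substrings are the wording seen in this thread and the more common PyTorch wording.

```python
# Hypothetical helper, not the repository's actual check.
# The markers are matched case-insensitively against the exception message.
OOM_MARKERS = (
    "cuda out of memory",         # common wording: "CUDA out of memory. Tried to allocate ..."
    "cuda error: out of memory",  # wording from the traceback in this issue
)


def is_oom_error(exc: BaseException) -> bool:
    """Return True if the exception message looks like a CUDA out-of-memory error."""
    message = str(exc).lower()
    return any(marker in message for marker in OOM_MARKERS)


if __name__ == "__main__":
    print(is_oom_error(RuntimeError("CUDA error: out of memory")))
    print(is_oom_error(RuntimeError("size mismatch")))
```

Matching on substrings is fragile across PyTorch versions, so any new wording would have to be added to the tuple by hand.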


joeyballentine commented on July 17, 2024

Let me know if that fixes it for you


2haloes commented on July 17, 2024

Unfortunately it does not fix the issue. I've set up a debugger on my machine and found that the exception is properly caught, but execution jumps from line 51 to line 55 (from clearing the VRAM to re-raising the exception). I can take more of a look later though.

As for the PyTorch version, I'm currently running 1.8.1+cu111.


joeyballentine commented on July 17, 2024

So it seems like it's crashing again while it's in the process of clearing the VRAM. Does it give a different error for the second one?
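
To make the failure mode concrete, here is a rough sketch of a split-and-retry loop in which a failing cleanup step cannot abort the retry. It is an illustration only, not the repository's `auto_split_upscale`: the names `split_upscale` and `safe_cleanup` are hypothetical, the image is a plain list of rows, the split is a simple top/bottom halving with no tile overlap, and `free_cache` stands in for a call such as `torch.cuda.empty_cache()`.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # toy stand-in for an image tensor: rows of pixel values


def safe_cleanup(free_cache: Callable[[], None]) -> bool:
    """Run the cache-clearing callback; report failure instead of raising."""
    try:
        free_cache()
        return True
    except RuntimeError:
        return False


def split_upscale(
    img: Image,
    upscale: Callable[[Image], Image],
    free_cache: Callable[[], None],
    depth: int = 0,
    max_depth: int = 4,
) -> Tuple[Image, int]:
    """Try to upscale; on an out-of-memory error, halve the image and retry."""
    try:
        return upscale(img), depth
    except RuntimeError as exc:
        out_of_memory = "out of memory" in str(exc).lower()
        if not out_of_memory or depth >= max_depth or len(img) < 2:
            raise
    # Cleanup runs after the except block has ended, so the caught exception
    # (and the frames its traceback references) has already been released.
    safe_cleanup(free_cache)
    mid = len(img) // 2
    top, d_top = split_upscale(img[:mid], upscale, free_cache, depth + 1, max_depth)
    bottom, d_bottom = split_upscale(img[mid:], upscale, free_cache, depth + 1, max_depth)
    return top + bottom, max(d_top, d_bottom)
```

In this sketch a second error raised by the cleanup call is swallowed and the smaller tiles are still attempted. Whether that is the right policy for a real GPU in a bad state is a judgment call.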


2haloes commented on July 17, 2024

The stack trace is different, so it looks like the fix did work, but the second error has the same exception message.

I also captured a memory summary after the first out-of-memory exception was thrown, which I've attached below the exception message.

  File "C:\Users\2haloes\Documents\esrgan\utils\dataops.py", line 45, in auto_split_upscale
    result = upscale_function(lr_img)
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 532, in upscale
    output = self.process(img)
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 289, in process
    output = self.model(img_LR).data.squeeze(0).float().cpu().clamp_(0, 1).numpy()
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\architecture.py", line 118, in forward
    x = self.model(x)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 119, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 92, in forward
    output = x + self.sub(x)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 119, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 317, in forward
    out = self.RDB2(out)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 432, in forward
    x5 = self.conv5(torch.cat((x, x1, x2, x3, x4), 1))
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 119, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Python39\lib\site-packages\torch\nn\modules\conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 658, in <module>
    app()
  File "C:\Python39\lib\site-packages\typer\main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "C:\Python39\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\Python39\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\Python39\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Python39\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\Python39\lib\site-packages\typer\main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 654, in main
    upscale.run()
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 237, in run
    rlt, depth = ops.auto_split_upscale(
  File "C:\Users\2haloes\Documents\esrgan\utils\dataops.py", line 51, in auto_split_upscale
    torch.cuda.empty_cache()
  File "C:\Python39\lib\site-packages\torch\cuda\memory.py", line 114, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: out of memory

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    3633 MB |    5153 MB |   15281 MB |   11648 MB |
|       from large pool |    3569 MB |    5089 MB |   15218 MB |   11648 MB |
|       from small pool |      63 MB |      63 MB |      63 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |    3633 MB |    5153 MB |   15281 MB |   11648 MB |
|       from large pool |    3569 MB |    5089 MB |   15218 MB |   11648 MB |
|       from small pool |      63 MB |      63 MB |      63 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    7186 MB |    7186 MB |    7186 MB |       0 B  |
|       from large pool |    7120 MB |    7120 MB |    7120 MB |       0 B  |
|       from small pool |      66 MB |      66 MB |      66 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |    1778 MB |    1778 MB |    5385 MB |    3606 MB |
|       from large pool |    1776 MB |    1776 MB |    5327 MB |    3550 MB |
|       from small pool |       2 MB |       3 MB |      57 MB |      55 MB |
|---------------------------------------------------------------------------|
| Allocations           |     712    |     713    |     735    |      23    |
|       from large pool |       8    |       9    |      23    |      15    |
|       from small pool |     704    |     705    |     712    |       8    |
|---------------------------------------------------------------------------|
| Active allocs         |     712    |     713    |     735    |      23    |
|       from large pool |       8    |       9    |      23    |      15    |
|       from small pool |     704    |     705    |     712    |       8    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      42    |      42    |      42    |       0    |
|       from large pool |       9    |       9    |       9    |       0    |
|       from small pool |      33    |      33    |      33    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      33    |      33    |      46    |      13    |
|       from large pool |       5    |       5    |      13    |       8    |
|       from small pool |      28    |      28    |      33    |       5    |
|===========================================================================|
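
For anyone wanting to capture the same kind of report: the table above has the layout that `torch.cuda.memory_summary()` prints. The wrapper below is a small sketch. The function name `cuda_memory_report` is made up for this example, and the guards are there only so the snippet also runs on machines without PyTorch or without a CUDA device.

```python
def cuda_memory_report(device=None) -> str:
    """Return PyTorch's CUDA memory summary, or a short note if it is unavailable."""
    try:
        import torch  # imported lazily so the helper works without PyTorch installed
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "CUDA is not available"
    # memory_summary() returns the allocator statistics as a formatted table.
    return torch.cuda.memory_summary(device=device, abbreviated=False)


if __name__ == "__main__":
    print(cuda_memory_report())
```

Calling it from the exception handler right after the first out-of-memory error would give a snapshot comparable to the one posted here.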


joeyballentine commented on July 17, 2024

Wtf, it's running out of memory when clearing the memory... Could you try updating PyTorch?


2haloes commented on July 17, 2024

I've updated PyTorch and the issue still occurs. However, the error now also comes with a suggestion to set an env variable (CUDA_LAUNCH_BLOCKING=1), so I did that and got this exception, which looks a lot more useful:

Traceback (most recent call last):
  File "C:\Users\2haloes\Documents\esrgan\utils\dataops.py", line 46, in auto_split_upscale
    result = upscale_function(lr_img)
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 538, in upscale
    output = self.process(img)
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 295, in process
    output = self.model(img_LR).data.squeeze(0).float().cpu().clamp_(0, 1).numpy()
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\architecture.py", line 118, in forward
    x = self.model(x)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 92, in forward
    output = x + self.sub(x)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 317, in forward
    out = self.RDB2(out)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\2haloes\Documents\esrgan\utils\block.py", line 432, in forward
    x5 = self.conv5(torch.cat((x, x1, x2, x3, x4), 1))
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
    input = module(input)
  File "C:\Python39\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Python39\lib\site-packages\torch\nn\modules\conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Python39\lib\site-packages\torch\nn\modules\conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 192, 1080, 1920], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(192, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0000015321CA4A30
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 192, 1080, 1920,
    strideA = 398131200, 2073600, 1920, 1,
output: TensorDescriptor 0000015321CA58A0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 1080, 1920,
    strideA = 132710400, 2073600, 1920, 1,
weight: FilterDescriptor 0000015369D70F00
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 192, 3, 3,
Pointer addresses:
    input: 0000000CBB400000
    output: 0000000C3C960000
    weight: 0000000B0CF6B000
Forward algorithm: 1


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 664, in <module>
    app()
  File "C:\Python39\lib\site-packages\typer\main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "C:\Python39\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\Python39\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\Python39\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Python39\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\Python39\lib\site-packages\typer\main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 660, in main
    upscale.run()
  File "C:\Users\2haloes\Documents\esrgan\upscale.py", line 243, in run
    rlt, depth = ops.auto_split_upscale(
  File "C:\Users\2haloes\Documents\esrgan\utils\dataops.py", line 58, in auto_split_upscale
    raise RuntimeError(e)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 192, 1080, 1920], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(192, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0000015321CA4A30
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 192, 1080, 1920,
    strideA = 398131200, 2073600, 1920, 1,
output: TensorDescriptor 0000015321CA58A0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 1080, 1920,
    strideA = 132710400, 2073600, 1920, 1,
weight: FilterDescriptor 0000015369D70F00
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 192, 3, 3,
Pointer addresses:
    input: 0000000CBB400000
    output: 0000000C3C960000
    weight: 0000000B0CF6B000
Forward algorithm: 1
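
As a back-of-the-envelope check on the tensor shapes in the cuDNN dump above, the snippet below computes how much memory the input and output of that single convolution need as float32. The helper name `tensor_mib` is only for this example, and the figures cover just these two tensors, not intermediate activations or any cuDNN workspace.

```python
from math import prod

BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in MiB of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element / MIB


# Shapes taken from the dimA lines of the dump.
input_mib = tensor_mib((1, 192, 1080, 1920))
output_mib = tensor_mib((1, 64, 1080, 1920))

print(f"input:  {input_mib:.2f} MiB")   # 1518.75 MiB
print(f"output: {output_mib:.2f} MiB")  # 506.25 MiB
print(f"total:  {input_mib + output_mib:.2f} MiB")  # 2025.00 MiB
```

So one full-frame 1080p convolution at this point in the network already needs roughly 2 GiB for its input and output alone, which makes it plausible that an unsplit image exhausts the card.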


joeyballentine commented on July 17, 2024

This is super weird, I've never seen this happen for inference before.

My next suggestion would just be to update your drivers and restart your computer and see if it still happens. I've had a similar error when training before and I just needed to restart my PC.


2haloes commented on July 17, 2024

Looks like updating to CUDA 11.5 has resolved the issue

The error can still occur if things go too far over the top (64x on a 1080p image), but enabling the env variable seems to keep things from getting out of hand badly enough to stop the script.

UPDATE: After a couple of hours, the 64x run with the env variable set produced a 16 GB image file, so it may be worth adding a toggle that disables async execution.
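
A toggle like the one suggested could be as small as the sketch below. The function name `set_cuda_launch_blocking` and its `env` parameter are hypothetical and not part of the esrgan script; only the `CUDA_LAUNCH_BLOCKING` variable itself comes from this thread.

```python
import os
from typing import MutableMapping, Optional


def set_cuda_launch_blocking(
    enabled: bool, env: Optional[MutableMapping[str, str]] = None
) -> MutableMapping[str, str]:
    """Set or clear CUDA_LAUNCH_BLOCKING in the given environment mapping.

    Defaults to os.environ; a plain dict can be passed in for testing.
    """
    target = os.environ if env is None else env
    if enabled:
        target["CUDA_LAUNCH_BLOCKING"] = "1"
    else:
        target.pop("CUDA_LAUNCH_BLOCKING", None)
    return target


if __name__ == "__main__":
    # Call this before the first CUDA operation, ideally before importing torch.
    set_cuda_launch_blocking(True)
    print(os.environ.get("CUDA_LAUNCH_BLOCKING"))
```

The call needs to happen early, since to my knowledge the variable is read when CUDA is initialized and changing it afterwards has no effect on the running process. Synchronous launches are also slower, which fits the multi-hour run reported here.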


joeyballentine commented on July 17, 2024

64x on a 1080p image? That's a bit overkill lol. Glad it works now though

