Comments (13)
That does the trick. It appears the default batch size is around 100. I set it to 50 and now it trains without memory issues. Thanks!
from digits.
After I changed the batch size I still hit the same issue. Then I switched the mode from GPU to CPU, and it worked. The root of my issue is that my GPU can't hold the parameters of my net.
Thanks! Will first try tiling the data set into smaller FOVs.
Hi, @nullterminated. The standard models were designed to fit on GPUs with 4GB of memory or more - that is why you're running out of memory. If you decrease the batch size, you should be able to run just about any network you want with 3GB of GPU memory. It will just take a bit longer.
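To see why batch size dominates GPU memory use, note that activation memory scales linearly with it. A back-of-the-envelope sketch (the 96x55x55 feature-map shape is an illustrative AlexNet-conv1-like example, not the actual model):

```python
def blob_bytes(n, c, h, w, dtype_bytes=4):
    """Approximate memory for one float32 activation blob of shape NxCxHxW."""
    return n * c * h * w * dtype_bytes

# Hypothetical conv feature map: 96 channels at 55x55 spatial resolution.
print(blob_bytes(100, 96, 55, 55) / 2**20)  # ~110.8 MiB at batch size 100
print(blob_bytes(50, 96, 55, 55) / 2**20)   # ~55.4 MiB at batch size 50
```

Halving the batch size halves the activation memory for every layer, which is why it is the first knob to turn when a model doesn't fit.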
I've added a feature that should help avoid these issues in the future. I'm going to go ahead and close this issue on the assumption that changing the batch size fixes your problem for now. Please re-open it if the issue persists.
Same problem...
I set the batch size to 1 manually, but it didn't work.
Are there any other solutions?
Hardware: Tesla K80 (#0)
Memory: 7.56 GB / 11.2 GB (67.2%)
GPU Utilization: 98%
Temperature: 41 °C

Process #159052
CPU Utilization: 113.0%
Memory: 1.4 GB (0.6%)
from the output.log:
I1110 10:41:47.261068 158591 net.cpp:761] Ignoring source layer upscore_21classes
I1110 10:41:47.261715 158591 caffe.cpp:251] Starting Optimization
I1110 10:41:47.261730 158591 solver.cpp:279] Solving
I1110 10:41:47.261734 158591 solver.cpp:280] Learning Rate Policy: step
I1110 10:41:47.263772 158591 solver.cpp:337] Iteration 0, Testing net (#0)
I1110 10:41:55.251052 158591 solver.cpp:404] Test net output #0: accuracy = 0
I1110 10:41:55.251092 158591 solver.cpp:404] Test net output #1: loss = 3.04452 (* 1 = 3.04452 loss)
I1110 10:41:55.762611 158591 solver.cpp:228] Iteration 0, loss = 3.04452
I1110 10:41:55.762641 158591 solver.cpp:244] Train net output #0: loss = 3.04452 (* 1 = 3.04452 loss)
I1110 10:41:55.762686 158591 sgd_solver.cpp:106] Iteration 0, lr = 0.0001
I1110 10:42:03.353044 158591 solver.cpp:228] Iteration 4, loss = 2.76947
I1110 10:42:03.353076 158591 solver.cpp:244] Train net output #0: loss = 2.76947 (* 1 = 2.76947 loss)
I1110 10:42:03.353085 158591 sgd_solver.cpp:106] Iteration 4, lr = 0.0001
I1110 10:42:08.067147 158591 solver.cpp:228] Iteration 8, loss = 2.11253
I1110 10:42:08.067178 158591 solver.cpp:244] Train net output #0: loss = 2.11253 (* 1 = 2.11253 loss)
I1110 10:42:08.067185 158591 sgd_solver.cpp:106] Iteration 8, lr = 0.0001
I1110 10:42:13.411054 158591 solver.cpp:228] Iteration 12, loss = 1.45452
I1110 10:42:13.411083 158591 solver.cpp:244] Train net output #0: loss = 1.45452 (* 1 = 1.45452 loss)
I1110 10:42:13.411092 158591 sgd_solver.cpp:106] Iteration 12, lr = 0.0001
F1110 10:42:19.409853 158591 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x7ffff1b519fd google::LogMessage::Fail()
@ 0x7ffff1b537cc google::LogMessage::SendToLog()
@ 0x7ffff1b515ec google::LogMessage::Flush()
@ 0x7ffff1b540de google::LogMessageFatal::~LogMessageFatal()
@ 0x7ffff742b821 caffe::SyncedMemory::to_gpu()
@ 0x7ffff742ab89 caffe::SyncedMemory::mutable_gpu_data()
@ 0x7ffff72a6642 caffe::Blob<>::mutable_gpu_data()
@ 0x7ffff7409926 caffe::BaseConvolutionLayer<>::backward_gpu_gemm()
@ 0x7ffff745c27b caffe::DeconvolutionLayer<>::Forward_gpu()
@ 0x7ffff73070f5 caffe::Net<>::ForwardFromTo()
@ 0x7ffff7307467 caffe::Net<>::Forward()
@ 0x7ffff741f737 caffe::Solver<>::Step()
@ 0x7ffff741fff9 caffe::Solver<>::Solve()
@ 0x40a47b train()
@ 0x40752c main
@ 0x7fffe9ec1b15 __libc_start_main
@ 0x407d9d (unknown)
What is your input image size?
The biggest one is 20.5 MB (3008x3952).
I'm using variable sizes.
Those are pretty large images, and fully convolutional networks are quite memory hungry. Note that if you are using images of variable sizes, you need to set the batch size to 1 anyway (this is already the case in FCN-Alexnet from the semantic segmentation example).
You could try resizing your images to a smaller size, if that does not destroy too much information. That is the first thing I would try.
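As a minimal sketch of that resizing step, the following computes a downscaled size that preserves the aspect ratio by capping the longer side (the 1024-pixel cap is an arbitrary example, and `fit_within` is a hypothetical helper name):

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side equals max_side,
    keeping the aspect ratio. Leaves small images untouched."""
    if max(width, height) <= max_side:
        return width, height
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)

# The 3008x3952 image mentioned above, capped at 1024 on the long side:
print(fit_within(3008, 3952, 1024))  # (779, 1024)
```

The resulting size would then be passed to whatever image library does the actual resampling; the label maps must be resized with nearest-neighbor interpolation so class indices are not blended.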
Another solution is to take random crops from the bigger images. The labels need to be cropped in the same way though, so the usual crop parameter in Caffe's data layer cannot be used. A Python layer would be suitable for performing the cropping in the context of a quick experiment.
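The core of such a paired crop is framework-independent; here is a minimal sketch of what a Python layer's forward pass might do internally (function name and the nested-list representation are illustrative; in practice the arrays would be NumPy blobs):

```python
import random

def paired_random_crop(image, label, crop_h, crop_w, rng=random):
    """Take the SAME random crop from an image and its label map.
    image and label are nested lists indexed [row][col] with equal
    spatial dimensions, so the pixel/label correspondence is preserved."""
    h, w = len(image), len(image[0])
    y = rng.randrange(h - crop_h + 1)  # one random offset, reused for both
    x = rng.randrange(w - crop_w + 1)

    def crop(a):
        return [row[x:x + crop_w] for row in a[y:y + crop_h]]

    return crop(image), crop(label)
```

The key point is that a single random offset is drawn and applied to both arrays, which is exactly what Caffe's built-in per-blob crop parameter cannot guarantee.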
Another solution would be to increase the stride of the first convolutional layer to reduce the size of its output feature map. However, you then need to make corresponding changes in the deconvolutional layer of the network, and you need to calculate the new offset to apply in the final Crop layer, which isn't difficult but is a bit tedious.
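The arithmetic behind that adjustment is just the standard Caffe output-size formulas. A sketch (the 227-input/11x11-kernel numbers are illustrative AlexNet-like values; the actual offsets depend on the specific network):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Caffe convolution/pooling output size (floor division)."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride=1, pad=0):
    """Caffe deconvolution output size (inverse of conv_out)."""
    return stride * (size - 1) + kernel - 2 * pad

# Doubling conv1's stride roughly halves its output feature map:
print(conv_out(227, kernel=11, stride=4, pad=0))  # 55
print(conv_out(227, kernel=11, stride=8, pad=0))  # 28
```

Chaining `conv_out` through every layer, then `deconv_out` back up, gives the upsampled size; the Crop layer's offset is the difference between that size and the original input, split according to the network's padding.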
Smaller tiles solved the issue.
Note that during inference you should be able to use the original image size (as demonstrated in the binary segmentation example), up to a limit of course: inference needs about a third of the GPU memory required for training.
Thanks! Will keep that in mind.
I want to train the GoogLeNet model on 1024 x 1024 images, but it runs out of memory. If I resize the images to 800 x 800 with a batch size of 10, it works, but the accuracy is only about 80%. If I resize them to 680 x 680 with a batch size of 20, it still works, and the accuracy reaches 90%. It seems the batch size influences the accuracy. Is that right?