Comments (13)
That does the trick. It appears the default batch size is around 100. I set it to 50 and now it trains without memory issues. Thanks!
from digits.
After I changed the batch size I still hit the same issue. Then I switched the mode from GPU to CPU, and it worked. The root of my issue is that my GPU can't hold the parameters of my net.
Thanks! Will first try tiling the data set into smaller FOVs.
Hi, @nullterminated. The standard models were designed to fit on GPUs with 4GB of memory or more - that is why you're running out of memory. If you decrease the batch size, you should be able to run just about any network you want with 3GB of GPU memory. It will just take a bit longer.
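To see why batch size dominates GPU memory use, note that activation memory scales linearly with it. A back-of-the-envelope sketch (the 96x55x55 feature-map shape is an illustrative AlexNet-conv1-like example, not the actual model):

```python
def blob_bytes(n, c, h, w, dtype_bytes=4):
    """Approximate memory for one float32 activation blob of shape NxCxHxW."""
    return n * c * h * w * dtype_bytes

# Hypothetical conv feature map: 96 channels at 55x55 spatial resolution.
print(blob_bytes(100, 96, 55, 55) / 2**20)  # ~110.8 MiB at batch size 100
print(blob_bytes(50, 96, 55, 55) / 2**20)   # ~55.4 MiB at batch size 50
```

Halving the batch size halves the activation memory for every layer, which is why it is the first knob to turn when a model doesn't fit.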
I've added a feature that should help avoid these issues in the future. I'm going to go ahead and close this issue on the assumption that changing the batch size fixes your problem for now. Please re-open it if the issue persists.
Same problem...
I set the batch size to 1 manually, but it didn't work.
Are there any other solutions?
Hardware: Tesla K80 (#0)
Memory: 7.56 GB / 11.2 GB (67.2%)
GPU Utilization: 98%
Temperature: 41 °C

Process #159052
CPU Utilization: 113.0%
Memory: 1.4 GB (0.6%)
from the output.log:
I1110 10:41:47.261068 158591 net.cpp:761] Ignoring source layer upscore_21classes
I1110 10:41:47.261715 158591 caffe.cpp:251] Starting Optimization
I1110 10:41:47.261730 158591 solver.cpp:279] Solving
I1110 10:41:47.261734 158591 solver.cpp:280] Learning Rate Policy: step
I1110 10:41:47.263772 158591 solver.cpp:337] Iteration 0, Testing net (#0)
I1110 10:41:55.251052 158591 solver.cpp:404] Test net output #0: accuracy = 0
I1110 10:41:55.251092 158591 solver.cpp:404] Test net output #1: loss = 3.04452 (* 1 = 3.04452 loss)
I1110 10:41:55.762611 158591 solver.cpp:228] Iteration 0, loss = 3.04452
I1110 10:41:55.762641 158591 solver.cpp:244] Train net output #0: loss = 3.04452 (* 1 = 3.04452 loss)
I1110 10:41:55.762686 158591 sgd_solver.cpp:106] Iteration 0, lr = 0.0001
I1110 10:42:03.353044 158591 solver.cpp:228] Iteration 4, loss = 2.76947
I1110 10:42:03.353076 158591 solver.cpp:244] Train net output #0: loss = 2.76947 (* 1 = 2.76947 loss)
I1110 10:42:03.353085 158591 sgd_solver.cpp:106] Iteration 4, lr = 0.0001
I1110 10:42:08.067147 158591 solver.cpp:228] Iteration 8, loss = 2.11253
I1110 10:42:08.067178 158591 solver.cpp:244] Train net output #0: loss = 2.11253 (* 1 = 2.11253 loss)
I1110 10:42:08.067185 158591 sgd_solver.cpp:106] Iteration 8, lr = 0.0001
I1110 10:42:13.411054 158591 solver.cpp:228] Iteration 12, loss = 1.45452
I1110 10:42:13.411083 158591 solver.cpp:244] Train net output #0: loss = 1.45452 (* 1 = 1.45452 loss)
I1110 10:42:13.411092 158591 sgd_solver.cpp:106] Iteration 12, lr = 0.0001
F1110 10:42:19.409853 158591 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x7ffff1b519fd google::LogMessage::Fail()
@ 0x7ffff1b537cc google::LogMessage::SendToLog()
@ 0x7ffff1b515ec google::LogMessage::Flush()
@ 0x7ffff1b540de google::LogMessageFatal::~LogMessageFatal()
@ 0x7ffff742b821 caffe::SyncedMemory::to_gpu()
@ 0x7ffff742ab89 caffe::SyncedMemory::mutable_gpu_data()
@ 0x7ffff72a6642 caffe::Blob<>::mutable_gpu_data()
@ 0x7ffff7409926 caffe::BaseConvolutionLayer<>::backward_gpu_gemm()
@ 0x7ffff745c27b caffe::DeconvolutionLayer<>::Forward_gpu()
@ 0x7ffff73070f5 caffe::Net<>::ForwardFromTo()
@ 0x7ffff7307467 caffe::Net<>::Forward()
@ 0x7ffff741f737 caffe::Solver<>::Step()
@ 0x7ffff741fff9 caffe::Solver<>::Solve()
@ 0x40a47b train()
@ 0x40752c main
@ 0x7fffe9ec1b15 __libc_start_main
@ 0x407d9d (unknown)
What is your input image size?
The biggest one is 20.5 MB (3008x3952).
I'm using variable sizes.
Those are pretty large images, and fully convolutional networks are quite memory hungry. Note that if you are using images of variable sizes, you need to set the batch size to 1 anyway (this is already the case in FCN-Alexnet from the semantic segmentation example).
You could try resizing your images to a smaller size, if that does not destroy too much information. That is the first thing I would try.
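As a minimal sketch of that resizing step, the following computes a downscaled size that preserves the aspect ratio by capping the longer side (the 1024-pixel cap is an arbitrary example, and `fit_within` is a hypothetical helper name):

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side equals max_side,
    keeping the aspect ratio. Leaves small images untouched."""
    if max(width, height) <= max_side:
        return width, height
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)

# The 3008x3952 image mentioned above, capped at 1024 on the long side:
print(fit_within(3008, 3952, 1024))  # (779, 1024)
```

The resulting size would then be passed to whatever image library does the actual resampling; the label maps must be resized with nearest-neighbor interpolation so class indices are not blended.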
Another solution is to take random crops from the bigger images. The labels need to be cropped in the same way though, so the usual crop parameter in Caffe's data layer cannot be used. A Python layer would be suitable for performing the cropping in the context of a quick experiment.
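The core of such a paired crop is framework-independent; here is a minimal sketch of what a Python layer's forward pass might do internally (function name and the nested-list representation are illustrative; in practice the arrays would be NumPy blobs):

```python
import random

def paired_random_crop(image, label, crop_h, crop_w, rng=random):
    """Take the SAME random crop from an image and its label map.
    image and label are nested lists indexed [row][col] with equal
    spatial dimensions, so the pixel/label correspondence is preserved."""
    h, w = len(image), len(image[0])
    y = rng.randrange(h - crop_h + 1)  # one random offset, reused for both
    x = rng.randrange(w - crop_w + 1)

    def crop(a):
        return [row[x:x + crop_w] for row in a[y:y + crop_h]]

    return crop(image), crop(label)
```

The key point is that a single random offset is drawn and applied to both arrays, which is exactly what Caffe's built-in per-blob crop parameter cannot guarantee.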
Another solution would be to increase the stride of the first convolutional layer to reduce the size of its output feature map. However, you then need to make corresponding changes in the deconvolutional layer of the network, and you need to calculate the new offset to apply in the final Crop layer, which isn't difficult but is a bit tedious.
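The arithmetic behind that adjustment is just the standard Caffe output-size formulas. A sketch (the 227-input/11x11-kernel numbers are illustrative AlexNet-like values; the actual offsets depend on the specific network):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Caffe convolution/pooling output size (floor division)."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride=1, pad=0):
    """Caffe deconvolution output size (inverse of conv_out)."""
    return stride * (size - 1) + kernel - 2 * pad

# Doubling conv1's stride roughly halves its output feature map:
print(conv_out(227, kernel=11, stride=4, pad=0))  # 55
print(conv_out(227, kernel=11, stride=8, pad=0))  # 28
```

Chaining `conv_out` through every layer, then `deconv_out` back up, gives the upsampled size; the Crop layer's offset is the difference between that size and the original input, split according to the network's padding.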
Smaller tiles solved the issue.
Note that during inference you should be able to use the original image size (as demonstrated in the binary segmentation example), up to a limit of course: inference needs about a third of the GPU memory required for training.
Thanks! Will keep that in mind.
I want to train the GoogLeNet model on 1024 x 1024 images, but it runs out of memory. If I resize the images to 800 x 800 with a batch size of 10, it works, but the accuracy is only about 80%. If I resize them to 680 x 680 with a batch size of 20, it still works, and the accuracy reaches 90%. It seems the batch size influences the accuracy. Is that right?