Hi, I'm trying to train pix2pix on a particular image labeling task.

luajit out of memory during training (pix2pix, closed)

rdaniel commented on May 22, 2024

luajit out of memory during training

from pix2pix.

Comments (16)

phillipi commented on May 22, 2024

I would check to make sure the GPU is really not being used at all. You can use nvidia-smi to check how much GPU memory is being used. I haven't tested CPU mode carefully, so there's a chance some memory is still allocated on the GPU.
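
A minimal sketch of that check, assuming an NVIDIA driver with nvidia-smi on the PATH. It uses nvidia-smi's CSV query options to list per-GPU usage and the processes holding GPU memory. The helper name gpu_memory_report is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show GPU memory use and which processes hold it.
gpu_memory_report() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: no NVIDIA driver visible to this shell"
    return 0
  fi
  # Per-GPU usage and capacity (MiB).
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
  # Per-process usage; a run that is truly CPU-only should not be listed here.
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
}

gpu_memory_report
```

If the training process shows up in the second table while running in CPU mode, something is still allocating on the GPU.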

If that's not the issue, then it's a bit surprising you would run out of memory. Can you profile how much memory your system is using for the process?
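
One way to answer the profiling question on Linux is to read the kernel's per-process counters in /proc/PID/status: VmRSS is the current resident memory and VmHWM is the peak ("high water mark"). This is a sketch assuming a Linux /proc filesystem; proc_mem is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print current (VmRSS) and peak (VmHWM) resident memory
# of a process, read from /proc/<pid>/status (Linux only).
proc_mem() {
  local status="/proc/$1/status"
  if [[ ! -r "$status" ]]; then
    echo "cannot read $status (Linux only; the process must exist)" >&2
    return 1
  fi
  grep -E '^(VmRSS|VmHWM):' "$status"
}

# Example on the current shell; for training, pass the PID of the th process.
proc_mem "$$"
```

For a running job, something like `proc_mem "$(pgrep -f 'th train.lua' | head -n 1)"` should work, assuming training was started with `th train.lua`.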

General strategies for reducing memory usage would be to reduce the batch size (try setting it to 1) and to reduce the image size (this will require modifying the network architectures, e.g., by removing the first and last layers of netG and the first layer of netD).
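
Why removing layers is needed for smaller images can be seen with a little arithmetic. Assuming the U-Net generator halves height and width at every encoder layer down to a 1×1 bottleneck (as the default 256×256 generator does), the number of encoder layers is log2 of the input side. The helper halving_layers below is hypothetical and only does that arithmetic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the stride-2 (halving) layers that take a square
# input of the given side length down to 1x1. The size must be a power of two.
halving_layers() {
  local size="$1" n=0
  while (( size > 1 )); do
    if (( size % 2 != 0 )); then
      echo "size must be a power of two" >&2
      return 1
    fi
    size=$(( size / 2 ))
    n=$(( n + 1 ))
  done
  echo "$n"
}

for s in 256 128; do
  echo "input ${s}x${s}: $(halving_layers "$s") halving layers"
done
```

Going from 256×256 to 128×128 drops the count from 8 to 7, i.e. one encoder layer plus its mirrored decoder layer, and each feature map has a quarter as many elements as at the same depth in the 256×256 network.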

rdaniel commented on May 22, 2024

I turned off the display and the code ran fine. On the CPU it is pretty slow, of course, so I'm going to run it on an AWS GPU instance now.

b4zz4 commented on May 22, 2024

I have 1.4GB free. Can I limit how much GPU memory the training uses?

...
Epoch: [1][     399 /      400]	 Time: 3.719  DataTime: 0.001    Err_G: 3.4267  Err_D: 0.0299  ErrL1: 0.3935	
End of epoch 1 / 200 	 Time Taken: 938.975	
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-4014/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
...

khaulahzia commented on May 22, 2024

How do I turn off the display?

phillipi commented on May 22, 2024

On the command line, you can pass display=0
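
In shell terms, a `name=value` written before a command (as in the training command later in this thread) sets an environment variable for that one command only. The sketch below shows that scoping with `sh -c` standing in for the training script; that train.lua picks up `display` from its environment is an assumption based on how it is invoked in this thread, and the dataset path is a placeholder.

```shell
#!/usr/bin/env bash
# A name=value prefix puts the variable into the environment of the single
# command that follows; the calling shell and later commands do not see it.
display=0 sh -c 'echo "child sees display=${display:-unset}"'
sh -c 'echo "next command sees display=${display:-unset}"'
echo "calling shell sees display=${display:-unset}"

# The same pattern applied to training (not executed here; placeholder path):
#   DATA_ROOT=/path/to/dataset display=0 th train.lua
```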

khaulahzia commented on May 22, 2024

I am facing the following error when running the training command:
transferring to gpu...
done
THCudaCheck FAIL file=/home/admink/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/admink/torch/install/bin/luajit: /home/admink/torch/install/share/lua/5.1/nn/Module.lua:309: cuda runtime error (2) : out of memory at /home/admink/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'Tensor'
/home/admink/torch/install/share/lua/5.1/nn/Module.lua:309: in function 'flatten'
/home/admink/torch/install/share/lua/5.1/nn/Module.lua:326: in function 'getParameters'
train.lua:445: in main chunk
[C]: in function 'dofile'
...mink/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
How could this issue be resolved?

khaulahzia commented on May 22, 2024

Because I still get the same error again and again, even with the display switched off.
I could train fine with just 3 images of size 600×600, and with at most 10 images of size 128×128.
With more than that, training fails with the error mentioned above. How could I train the model on more images, at least 500?

phillipi commented on May 22, 2024

Some ideas are here: #107

Other quick fixes:

1. Run in CPU mode
2. Run on a cloud service that has more GPU memory, e.g., Amazon EC2
3. Make models smaller (e.g., set ngf and ndf to be 32 rather than 64)
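
The quick fixes above can be combined into one launch command. This is a sketch only: low_mem_train is a hypothetical helper, and the option names (batchSize, ngf, ndf, gpu, display, name, DATA_ROOT) are assumed to be read by train.lua from the environment. Check the exact spelling and capitalization of the batch-size option in train.lua, and note that gpu=0 is assumed to select CPU mode, as in the command later in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: launch training with smaller models and batch size 1.
# Usage: low_mem_train DATA_ROOT NAME GPU   (GPU=0 is assumed to mean CPU mode)
# Set DRY_RUN=1 to print the command instead of running it.
low_mem_train() {
  if (( $# < 3 )); then
    echo "usage: low_mem_train DATA_ROOT NAME GPU" >&2
    return 2
  fi
  local data_root="$1" name="$2" gpu="$3"
  # batchSize=1: smallest batch; ngf/ndf=32: half of the default 64 filters;
  # display=0: no display server.
  local -a cmd=(env "DATA_ROOT=$data_root" "name=$name" "gpu=$gpu"
                batchSize=1 ngf=32 ndf=32 display=0 th train.lua)
  if [[ "${DRY_RUN:-0}" == 1 ]]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Dry run with a placeholder dataset path and experiment name.
DRY_RUN=1 low_mem_train ./datasets/mydata mem_test 0
```

Building the command as an array keeps paths with spaces intact when it is actually executed.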

Really it should be possible to avoid running out of memory as you scale the dataset size. The memory should be constant for any sized dataset as images are only loaded on demand. There may be a leak or some inefficiency that causes more memory to be used for larger datasets. I don't have a quick fix, but I would look at the behavior at the end of an epoch, where temporary variables are cleared and reallocated, since this is where you are getting the memory error.
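
To check whether memory steps up at the end of an epoch, one option is to sample it at a fixed interval while training runs and line the samples up with the "End of epoch" messages. A sketch under these assumptions: watch_mem is a hypothetical helper, resident memory comes from `ps`, and total GPU memory is added only if nvidia-smi is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: log a process's resident memory (KiB) and total GPU
# memory used (MiB, per GPU) every INTERVAL seconds while the process exists.
# Usage: watch_mem PID [INTERVAL] [MAX_SAMPLES]   (MAX_SAMPLES=0: no limit)
watch_mem() {
  local pid="$1" interval="${2:-10}" max="${3:-0}" n=0 rss gpu
  while kill -0 "$pid" 2>/dev/null; do
    rss=$(ps -o rss= -p "$pid" | tr -d ' ')
    gpu="n/a"
    if command -v nvidia-smi >/dev/null 2>&1; then
      gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | paste -sd, -)
    fi
    printf '%s rss_kib=%s gpu_used_mib=%s\n' "$(date +%H:%M:%S)" "$rss" "$gpu"
    n=$(( n + 1 ))
    if (( max > 0 && n >= max )); then break; fi
    sleep "$interval"
  done
}
```

For example, `watch_mem "$(pgrep -f 'th train.lua' | head -n 1)" 30 | tee mem.log` in a second terminal; a jump in either column right after an "End of epoch" line would point at the reallocation step.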

khaulahzia commented on May 22, 2024

Will running in CPU mode affect the quality of training in any way?

(Sent by email in reply to Phillip Isola, 4 Oct 2017 12:31 a.m.)

phillipi commented on May 22, 2024

It shouldn't.

khaulahzia commented on May 22, 2024

So what would be the difference between running on the GPU and on the CPU?

(Sent by email in reply to Phillip Isola, 4 Oct 2017 1:15 a.m.)

phillipi commented on May 22, 2024

The GPU is much faster :) Functionality should be the same, although I haven't thoroughly tested CPU mode myself (seems like others have had it work).

khaulahzia commented on May 22, 2024

So you are suggesting the slower way for the sake of memory 😏

(Sent by email in reply to Phillip Isola, 4 Oct 2017 1:39 a.m.)

phillipi commented on May 22, 2024

haha yeah it's an unfortunate tradeoff

khaulahzia commented on May 22, 2024

DATA_ROOT=./datasets/alphabet1 name=blob_placement2 which_direction=AtoB display=0 ngf=16 ndf=16 gpu=0 cudnn=0 batchsize=10 save_epoch_freq=5 th train.lua worked for me. Thanks.

(Sent by email in reply to Phillip Isola, Tue, Oct 3, 2017 at 10:25 PM.)
