
Comments (9)

adeshpande3 commented on September 13, 2024

Is it giving you this error immediately when you start the training loop, or is it maybe after a particular number of iterations?

from facebook-messenger-bot.

Hunterwolf88 commented on September 13, 2024

The first time, after I run
$ python Seq2Seq.py
I got

/home/hunterwolf/anaconda3/envs/tensorflow27/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Since we cant find an embedding matrix, how many dimensions do you want your word vectors to be?:

After entering a number (I tried 256, 512, and 1000, with no real difference) I got the error after 30-40 minutes. The files Seq2SeqXTrain.npy and Seq2SeqYTrain.npy are created, but nothing else.

If I try again, the error occurs after 5-10 seconds; if I delete those files and try again, it takes about half an hour to fail again. During this time there is no output and the terminal looks frozen, but I can use the computer normally.


adeshpande3 commented on September 13, 2024

If the error occurs after 5-10 seconds when you run it again, I'm assuming that most of the 30-40 minutes is spent in the createTrainingMatrices function, which means the out-of-memory error is probably happening at the very beginning of the training loop (correct me if I'm wrong). So it seems like something is off with the amount of memory the initial model takes up. Maybe you could post the sizes of Seq2SeqXTrain, Seq2SeqYTrain, batchSize, numLSTMUnits, etc., and then we can try to see if some value is out of line.
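The timing pattern described (slow first run, near-instant failure on reruns) is consistent with a cache-on-first-run setup: the preprocessing result is saved to disk the first time and simply reloaded afterwards, so reruns skip straight to the part that crashes. A minimal stdlib-only sketch of that pattern (function and file names here are illustrative, not the script's actual ones):

```python
import os
import pickle

CACHE = "training_matrices.cache"  # the real script writes Seq2SeqXTrain.npy / Seq2SeqYTrain.npy

def create_training_matrices():
    # Stand-in for the expensive preprocessing step (the 30-40 minute part).
    x = list(range(10))
    y = list(range(10, 20))
    return x, y

def load_or_create():
    if os.path.exists(CACHE):
        # Fast path on reruns: load the cached result instead of recomputing.
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    # Slow path on the first run: compute, then cache for next time.
    x, y = create_training_matrices()
    with open(CACHE, "wb") as f:
        pickle.dump((x, y), f)
    return x, y
```

This is why deleting the .npy files restores the half-hour delay: it empties the cache, forcing the slow path again, while the actual crash happens after the cache check either way.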


Hunterwolf88 commented on September 13, 2024

Hello, thank you so much, I'm still trying to understand what is happening!

I tried different batch sizes (even 10, 5, 2) but nothing changes.

I noticed from nvidia-smi that the GPU memory (4 GB total) is almost full, but GPU utilization stays at 0%:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 00000000:01:00.0  On |                  N/A |
|  0%   45C    P0    48W / 196W |   3670MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I can also see a compute process appear in the process list below while training.

Seq2SeqXTrain.npy and Seq2SeqYTrain.npy are both 4.1 MB.

Thanks for your help!


adeshpande3 commented on September 13, 2024

What I think is happening is that something in the model initialization is too large: basically, something in the code below.

[screenshot of the model initialization code]

I don't believe anything in the training procedure (like the batch size) is wrong, because the program fails almost immediately after the training loop starts. If it's running out of memory at the very beginning, something in the initial model itself must be too large.

I'd recommend starting with a very simple network architecture (without the call to tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq), seeing whether training runs without problems, and then slowly working your way back to the original code.
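To get a feel for whether the initial model alone could plausibly strain a 4 GB card, a rough back-of-the-envelope parameter count can help. The sketch below is a simplification (one-layer encoder and decoder, standard LSTM cells); the vocabulary size, embedding dimension, and unit count are made-up example values, so substitute the actual ones from Seq2Seq.py:

```python
def lstm_params(input_dim, units):
    # A standard LSTM cell has 4 gates, each with a weight matrix of shape
    # (input_dim + units, units) plus a bias vector of length `units`.
    return 4 * ((input_dim + units) * units + units)

def seq2seq_param_bytes(vocab_size, embed_dim, units, bytes_per_float=4):
    embedding = vocab_size * embed_dim             # embedding lookup table
    encoder = lstm_params(embed_dim, units)        # one-layer encoder LSTM
    decoder = lstm_params(embed_dim, units)        # one-layer decoder LSTM
    projection = units * vocab_size + vocab_size   # output projection to vocab
    return (embedding + encoder + decoder + projection) * bytes_per_float

# Hypothetical example: 50k-word vocabulary, 512-dim embeddings, 512 LSTM units
mb = seq2seq_param_bytes(50_000, 512, 512) / (1024 ** 2)
print(f"~{mb:.0f} MB of parameters")
```

Note that parameters are only part of the footprint: gradients, optimizer state, and activations multiply this several times over, and TF 1.x by default reserves nearly all free GPU memory when the session is created, which is why nvidia-smi can show ~3.7 GB in use even at 0% utilization. The vocabulary-sized terms (embedding and output projection) usually dominate, so an oversized vocabulary is a likely culprit.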


Hunterwolf88 commented on September 13, 2024

Thank you, it worked.
It had nothing to do with the batch size.

I also replaced my GTX 980 with a GTX 1080 Ti, and now it looks like I can run Seq2Seq with the default settings as well.


adeshpande3 commented on September 13, 2024

Hmm, so was the fix somewhere in the code, or was the fix changing the GPU?


Hunterwolf88 commented on September 13, 2024

The problem on the GTX 980 was an actual OOM, although I initially thought it was a computational issue caused by some misconfiguration in my Python environment or in the training script, since the GPU looked unused during the whole process.
As you suggested, this let me trace the problem on my low-memory GPU to the network architecture itself. I've solved similar situations this way before (so I think it can work here too), but I'm still too much of a beginner to suggest a proper fix (sorry).
If it helps, I'll investigate further; for example, at this point I'd like to know whether the size of Seq2SeqXTrain and Seq2SeqYTrain has anything to do with the OOM in this scenario.


adeshpande3 commented on September 13, 2024

Yeah, my initial thinking is that the size of those two matrices probably doesn't affect it; it depends more on the size of your initial network and the computations in it. Anyway, thanks, and glad your problem got solved :)
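The numbers from earlier in the thread support this. A quick check (file sizes and the nvidia-smi reading taken from the comments above) shows the training matrices are a negligible fraction of the memory in use, so the model itself has to account for almost all of it:

```python
# Figures reported earlier in this thread.
data_mb = 2 * 4.1       # Seq2SeqXTrain.npy + Seq2SeqYTrain.npy, both 4.1 MB
gpu_used_mb = 3670      # memory in use per the nvidia-smi output

fraction = 100 * data_mb / gpu_used_mb
print(f"training data accounts for ~{fraction:.1f}% of the GPU memory in use")
```

Even if both matrices were resident on the GPU in full, they would explain well under 1% of the 3670 MiB reported, which is consistent with the model initialization being the real consumer.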

