
Comments (9)

adeshpande3 commented on September 13, 2024

Is it giving you this error immediately when you start the training loop, or is it maybe after a particular number of iterations?

from facebook-messenger-bot.

Hunterwolf88 commented on September 13, 2024

The first time, after I run
$ python Seq2Seq.py
I got

/home/hunterwolf/anaconda3/envs/tensorflow27/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Since we cant find an embedding matrix, how many dimensions do you want your word vectors to be?:

After entering a number (I tried 256, 512, and 1000, with no real difference) I got the error after 30-40 minutes. The files Seq2SeqXTrain.npy and Seq2SeqYTrain.npy are created, but nothing else.

If I try again, the error occurs after 5-10 seconds; if I delete those files and try again, it takes about half an hour to fail again. During this time there is no output and the terminal looks frozen, but I can use the computer normally.


adeshpande3 commented on September 13, 2024

If the error occurs after 5-10 seconds when you run it again, I'm assuming that most of the 30-40 minutes is spent in the createTrainingMatrices function, which means the out-of-memory error is probably happening at the very beginning of the training loop (correct me if I'm wrong). So it seems like something is off with the amount of memory the initial model takes up. Maybe you could post the sizes of Seq2SeqXTrain, Seq2SeqYTrain, batchSize, numLSTMUnits, etc., and then we can try to see if some value is out of line.
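The timing pattern described (slow first run, near-instant failure on reruns) is consistent with a cache-on-first-run setup: the preprocessing result is saved to disk the first time and simply reloaded afterwards, so reruns skip straight to the part that crashes. A minimal stdlib-only sketch of that pattern (function and file names here are illustrative, not the script's actual ones):

```python
import os
import pickle

CACHE = "training_matrices.cache"  # the real script writes Seq2SeqXTrain.npy / Seq2SeqYTrain.npy

def create_training_matrices():
    # Stand-in for the expensive preprocessing step (the 30-40 minute part).
    x = list(range(10))
    y = list(range(10, 20))
    return x, y

def load_or_create():
    if os.path.exists(CACHE):
        # Fast path on reruns: load the cached result instead of recomputing.
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    # Slow path on the first run: compute, then cache for next time.
    x, y = create_training_matrices()
    with open(CACHE, "wb") as f:
        pickle.dump((x, y), f)
    return x, y
```

This is why deleting the .npy files restores the half-hour delay: it empties the cache, forcing the slow path again, while the actual crash happens after the cache check either way.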


Hunterwolf88 commented on September 13, 2024

Hello, thank you so much, I'm still trying to understand what is happening!

I tried different batch sizes (even 10, 5, 2) but nothing changes.

I noticed from nvidia-smi that the GPU memory (4 GB total) is almost full, but GPU utilization stays at 0%:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 00000000:01:00.0  On |                  N/A |
|  0%   45C    P0    48W / 196W |   3670MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I can also see a compute process appear in the process list below while training.

Seq2SeqXTrain.npy and Seq2SeqYTrain.npy are both 4.1 MB.

Thanks for your help!


adeshpande3 commented on September 13, 2024

What I think is happening is that something in the model initialization is too large: basically, something in the code below.

[screenshot of the model initialization code]

I don't believe anything in the training procedure (like the batch size) is wrong, because the program fails almost immediately after the training loop starts. If it's running out of memory at the very beginning, something in the initial model itself must be too large.

I'd recommend starting with a very simple network architecture (without the call to tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq), seeing whether training runs without problems, and then slowly working your way back to the original code.
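To get a feel for whether the initial model alone could plausibly strain a 4 GB card, a rough back-of-the-envelope parameter count can help. The sketch below is a simplification (one-layer encoder and decoder, standard LSTM cells); the vocabulary size, embedding dimension, and unit count are made-up example values, so substitute the actual ones from Seq2Seq.py:

```python
def lstm_params(input_dim, units):
    # A standard LSTM cell has 4 gates, each with a weight matrix of shape
    # (input_dim + units, units) plus a bias vector of length `units`.
    return 4 * ((input_dim + units) * units + units)

def seq2seq_param_bytes(vocab_size, embed_dim, units, bytes_per_float=4):
    embedding = vocab_size * embed_dim             # embedding lookup table
    encoder = lstm_params(embed_dim, units)        # one-layer encoder LSTM
    decoder = lstm_params(embed_dim, units)        # one-layer decoder LSTM
    projection = units * vocab_size + vocab_size   # output projection to vocab
    return (embedding + encoder + decoder + projection) * bytes_per_float

# Hypothetical example: 50k-word vocabulary, 512-dim embeddings, 512 LSTM units
mb = seq2seq_param_bytes(50_000, 512, 512) / (1024 ** 2)
print(f"~{mb:.0f} MB of parameters")
```

Note that parameters are only part of the footprint: gradients, optimizer state, and activations multiply this several times over, and TF 1.x by default reserves nearly all free GPU memory when the session is created, which is why nvidia-smi can show ~3.7 GB in use even at 0% utilization. The vocabulary-sized terms (embedding and output projection) usually dominate, so an oversized vocabulary is a likely culprit.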


Hunterwolf88 commented on September 13, 2024

Thank you, it worked.
It had nothing to do with the batch size.

I also replaced my GTX 980 with a GTX 1080 Ti, and now it looks like I can run Seq2Seq with the default settings as well.


adeshpande3 commented on September 13, 2024

Hmm, so was the fix somewhere in the code, or was the fix changing the GPU?


Hunterwolf88 commented on September 13, 2024

The problem on the GTX 980 was an actual OOM, although I initially thought it was a computational issue caused by some misconfiguration in my Python environment or in the training script, since the GPU looked unused during the whole process.
As you suggested, this let me trace the problem on my low-memory GPU to the network architecture itself. I've solved similar situations this way before (so I think it can work here too), but I'm still too much of a beginner to suggest a proper fix (sorry).
If it helps, I'll investigate further; for example, at this point I'd like to know whether the size of Seq2SeqXTrain and Seq2SeqYTrain has anything to do with the OOM in this scenario.


adeshpande3 commented on September 13, 2024

Yeah, my initial thinking is that the size of those two matrices probably doesn't affect it; it depends more on the size of your initial network and the computations in it. Anyway, thanks, and glad your problem got solved :)
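The numbers from earlier in the thread support this. A quick check (file sizes and the nvidia-smi reading taken from the comments above) shows the training matrices are a negligible fraction of the memory in use, so the model itself has to account for almost all of it:

```python
# Figures reported earlier in this thread.
data_mb = 2 * 4.1       # Seq2SeqXTrain.npy + Seq2SeqYTrain.npy, both 4.1 MB
gpu_used_mb = 3670      # memory in use per the nvidia-smi output

fraction = 100 * data_mb / gpu_used_mb
print(f"training data accounts for ~{fraction:.1f}% of the GPU memory in use")
```

Even if both matrices were resident on the GPU in full, they would explain well under 1% of the 3670 MiB reported, which is consistent with the model initialization being the real consumer.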

