def run_epoch_generator(self, sess, model, data_generator, return_output=False, traini

Memory leak about dcrnn HOT 6 OPEN

liyaguang commented on July 26, 2024 1

Memory leak


Comments (6)

tanwimallick commented on July 26, 2024 3

It is better to define the loss node in the graph during DCRNNModel initialization. Then model.loss and model.mae can be used inside run_epoch_generator.

For a quick fix, I initialized the training and testing loss separately during the initialization of DCRNNSupervisor.

preds = self._train_model.outputs
labels = self._train_model.labels[..., :output_dim]

self.preds_test = self._test_model.outputs
self.labels_test = self._test_model.labels[..., :output_dim]

self._train_loss = self._loss_fn(preds=preds, labels=labels)
self._test_loss = self._loss_fn(preds=self.preds_test, labels=self.labels_test)

Inside run_epoch_generator:

if training:
    fetches = {
        'loss': self._train_loss,
        'mae': self._train_loss,
        'global_step': tf.train.get_or_create_global_step()
    }
else:
    fetches = {
        'loss': self._test_loss,
        'mae': self._test_loss,
        'global_step': tf.train.get_or_create_global_step()
    }

In the paper, how did you plot the learned localized filters centered at different nodes (Figure 7 in the paper)? Is that code available?


ivechan commented on July 26, 2024 1

Is there any solution or suggestion? :)


ivechan commented on July 26, 2024 1

It seems that the following code adds new nodes to the computation graph every epoch.
Because the nodes are created again on each call, the graph keeps growing.

labels = model.labels[..., :output_dim]
loss = self._loss_fn(preds=preds, labels=labels)

A possible solution is to create the loss node in the graph during DCRNNModel initialization instead of
in the function run_epoch_generator.
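
The difference between the two patterns can be sketched without TensorFlow. The snippet below is a toy illustration only: `ToyGraph`, `leaky_epoch`, and `Supervisor` are hypothetical stand-ins invented for this example and are not part of the dcrnn code base. It assumes that building a loss is equivalent to appending a node to a graph, which is how static-graph frameworks behave.

```python
class ToyGraph:
    """Toy stand-in for a static computation graph; it only records node names."""

    def __init__(self):
        self.nodes = []
        self.finalized = False

    def add_node(self, name):
        # A finalized graph rejects new nodes, which exposes accidental growth.
        if self.finalized:
            raise RuntimeError("graph is finalized; cannot add " + name)
        self.nodes.append(name)
        return name

    def finalize(self):
        self.finalized = True


def leaky_epoch(graph):
    # Mirrors the reported pattern: the loss node is rebuilt on every call.
    return graph.add_node("loss")


class Supervisor:
    """Hypothetical supervisor that builds its loss nodes once, at construction."""

    def __init__(self, graph):
        self.train_loss = graph.add_node("train_loss")
        self.test_loss = graph.add_node("test_loss")

    def run_epoch(self, training):
        # Reuse the existing nodes; nothing is added to the graph here.
        return self.train_loss if training else self.test_loss


# Leaky pattern: one extra node per epoch.
leaky_graph = ToyGraph()
for _ in range(30):
    leaky_epoch(leaky_graph)

# Fixed pattern: nodes are created once, then the graph is frozen.
fixed_graph = ToyGraph()
supervisor = Supervisor(fixed_graph)
fixed_graph.finalize()
for _ in range(30):
    supervisor.run_epoch(training=True)
    supervisor.run_epoch(training=False)

print(len(leaky_graph.nodes))  # 30
print(len(fixed_graph.nodes))  # 2
```

In TensorFlow 1.x, calling `finalize()` on the `tf.Graph` after model construction plays the same role as `ToyGraph.finalize()` here: the graph becomes read-only, so any code that tries to add ops inside the training loop raises an error instead of silently growing the graph.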


liyaguang commented on July 26, 2024

Thanks for the information. I will investigate this issue. It would also be appreciated if you could provide more details, e.g., the error message, log, parameters, etc.


tanwimallick commented on July 26, 2024

The error message is:
2019-06-06 20:04:31.386792: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 43.75MiB. Current allocation summary follows.
2019-06-06 20:04:31.386936: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 664, Chunks in use: 664. 166.0KiB allocated for chunks. 166.0KiB in use in bin. 8.9KiB client-requested in use in bin.

2019-06-06 20:04:31.396827: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[44800,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I plotted the memory consumption after each epoch and got the following plot:

[Figure: memory consumption after each epoch]
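
The comment does not say how the memory was measured. As one possible approach, the sketch below records Python-heap usage after each epoch with the standard-library `tracemalloc` module. `track_epoch_memory` and `leaky_epoch` are hypothetical names for this illustration, and `tracemalloc` only sees allocations made through Python's allocator, so it would not report the GPU memory shown in the log above.

```python
import tracemalloc


def track_epoch_memory(run_epoch, num_epochs):
    """Return the traced Python-heap size in bytes after each epoch."""
    tracemalloc.start()
    usage = []
    try:
        for _ in range(num_epochs):
            run_epoch()
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
    finally:
        tracemalloc.stop()
    return usage


# Hypothetical leaky epoch: keeps a reference to a new buffer on every call.
leaked = []


def leaky_epoch():
    leaked.append(bytearray(100_000))


usage = track_epoch_memory(leaky_epoch, num_epochs=5)
for epoch, used in enumerate(usage, start=1):
    print(f"epoch {epoch}: {used} bytes traced")
```

A per-epoch series that keeps climbing, as it does for `leaky_epoch`, is the kind of signal that points to state accumulating across epochs.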

The hyperparameter configuration was:

batch_size: 256, cl_decay_steps: 2000, filter_type: 'laplacian', horizon: 12, input_dim: 2, l1_decay: 0, max_diffusion_step: 1, num_nodes: 175, num_rnn_layers: 2, output_dim: 1, rnn_units: 64, seq_len: 12,
use_curriculum_learning: True, base_lr: 0.01, epochs: 62, epsilon: 0.001, global_step: 0, lr_decay_ratio: 0.05, max_grad_norm: 9, max_to_keep: 100, min_learning_rate: 2e-06, optimizer: adagrad, patience: 50, steps: [20, 30, 40, 50], test_every_n_epochs: 10

I got the error after 30 epochs.


parkitny commented on July 26, 2024

Any further updates on when this fix will be added?

