
Comments (9)

ChrisCummins commented on July 17, 2024

Notes on diagnosing the issue

I added a //third_party/py/memory_profiler package and a --run_with_memory_profiler flag to help with diagnosis. Test run:

# Build the binary:
$ bazel build //deeplearning/ml4pl/models/ggnn
# Run the script:
$ mprof run bazel-bin/deeplearning/ml4pl/models/ggnn/ggnn \
    --graph_db="$DB?programl_reachability" \
    --log_db='sqlite:////tmp/deleteme.db' \
    --run_with_memory_profiler \
    --epoch_count=5 \
    --test_on=none \
    --max_node_count_limit_handler=skip \
    --max_train_per_epoch=1000 \
    --max_val_per_epoch=100 \
    --graph_batch_node_count=15000 \
    --detailed_batch_types=train,val,test \
    --vmodule='*'=5
# Plot the profile data:
$ mprof plot
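The flag-gated profiling described above can be sketched as follows. This is illustrative plumbing, not the project's actual code; only the `memory_profiler.profile` decorator is the real library API, and the flag name is taken from this thread.

```python
# Hypothetical sketch: gate per-function memory profiling behind a flag so
# the decorator adds zero overhead when --run_with_memory_profiler is unset.
def maybe_profile(fn, enabled):
    """Return fn wrapped with memory_profiler.profile when enabled,
    otherwise return fn unchanged."""
    if not enabled:
        return fn
    import memory_profiler  # deferred import: dep only needed when profiling
    return memory_profiler.profile(fn)


def RunOneEpoch():
    return "epoch done"


# With profiling disabled the function is returned untouched:
RunOneEpoch = maybe_profile(RunOneEpoch, enabled=False)
```

When the flag is enabled, `mprof run` records the per-line memory table shown later in this thread.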

from programl.

ChrisCummins commented on July 17, 2024

Initial results: we have a definite leak!

Figure_1

What's interesting is that there is a significant memory bump in the training epoch, but not in the validation epoch:

Filename: /home/cec/phd/bazel-bin/deeplearning/ml4pl/models/ggnn/ggnn.runfiles/phd/deeplearning/ml4pl/models/run.py

Line #    Mem usage    Increment   Line Contents
================================================
   199   2294.8 MiB   2294.8 MiB     @memory_profiler.profile
   200                               def RunOneEpoch(self, test_on: str, save_on):
   201                                 # Create the batch iterators ahead of time so that they can asynchronously
   202                                 # start reading from the graph database.
   203   2426.5 MiB      0.0 MiB       batch_iterators = {
   204                                   epoch_type: batch_iterator_lib.MakeBatchIterator(
   205                                     model=self.model,
   206                                     graph_db=self.graph_db,
   207                                     splits=self.splits,
   208                                     epoch_type=epoch_type,
   209                                     ctx=self.ctx,
   210                                   )
   211   2426.5 MiB     70.1 MiB         for epoch_type in [epoch.Type.TRAIN, epoch.Type.VAL, epoch.Type.TEST]
   212                                 }
   213
   214   2580.4 MiB    153.9 MiB       train_results, _ = self.RunEpoch(epoch.Type.TRAIN, batch_iterators)
   215
   216   2580.6 MiB      0.2 MiB       val_results, val_improved = self.RunEpoch(epoch.Type.VAL, batch_iterators)
   217
   218   2580.6 MiB      0.0 MiB       if val_improved and (
   219   2580.6 MiB      0.0 MiB         test_on == "improvement" or test_on == "improvement_and_last"
   220                                 ):
   221                                   self.RunEpoch(epoch.Type.TEST, batch_iterators)
   222   2580.6 MiB      0.0 MiB       elif test_on == "improvement_and_last" and self.ctx.i == self.ctx.n - 1:
   223                                   self.RunEpoch(epoch.Type.TEST, batch_iterators)
   224
   225                                 # Determine whether to make a checkpoint.
   226   2580.6 MiB      0.0 MiB       if save_on == schedules.SaveOn.EVERY_EPOCH or (
   227                                   save_on == schedules.SaveOn.VAL_IMPROVED and val_improved
   228                                 ):
   229   2580.6 MiB      0.0 MiB         self.model.SaveCheckpoint()
   230
   231   2580.6 MiB      0.0 MiB       if test_on == "every":
   232                                   self.RunEpoch(epoch.Type.TEST, batch_iterators)

It would be worth making the train/val epochs the same size to see if this still holds.

Edit: Yes, this appears to hold.


ChrisCummins commented on July 17, 2024

Next step is to reproduce using the Zero-R model to see if the issue is model-specific:

# Build the binary:
$ bazel build //deeplearning/ml4pl/models/zero_r
# Run the script:
$ mprof run bazel-bin/deeplearning/ml4pl/models/zero_r/zero_r \
    --graph_db="$DB?programl_reachability" \
    --log_db='sqlite:////tmp/deleteme.db' \
    --run_with_memory_profiler \
    --epoch_count=5 \
    --test_on=none \
    --max_train_per_epoch=1000 \
    --max_val_per_epoch=1000 \
    --graph_reader_order=in_order \
    --vmodule='*'=5
# Plot the profile data:
$ mprof plot

Figure_1

Findings: the issue is reproducible with the Zero-R model, so it is not specific to the GGNN.


ChrisCummins commented on July 17, 2024

Running with verbose logging revealed something interesting: even after the script has 'completed', the batch iterators are still reading graphs from the database and producing batches.

I added some extra logging and found an error in my logic: we were instantiating iterators for all epoch types (train, val, test), but were only using the test iterator under certain conditions (e.g. on improvement). So with the default queue size of 10 and 64 MB reader chunks, we were leaking > 640 MB of data for every unused test iterator. I fixed this, and now the memory usage of the Zero-R looks more stable:

Figure_1
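The shape of the fix can be sketched as lazy iterator construction: an epoch type's iterator (and its background reader threads and prefetch queue) is only created on first use. The class and factory names below are illustrative, not the real `batch_iterator_lib` API.

```python
# Minimal sketch: construct each epoch's batch iterator on demand, so an
# unused test iterator never allocates its prefetch queue.
class LazyBatchIterators:
    def __init__(self, make_iterator):
        self._make_iterator = make_iterator  # epoch_type -> iterator
        self._cache = {}

    def __getitem__(self, epoch_type):
        if epoch_type not in self._cache:
            self._cache[epoch_type] = self._make_iterator(epoch_type)
        return self._cache[epoch_type]
```

With this, `RunOneEpoch` can keep indexing `batch_iterators["test"]` in its conditional branches, and the test iterator's memory is only paid when a test epoch actually runs.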


ChrisCummins commented on July 17, 2024

Worth noting that this Zero-R run took 50 minutes to process 50k graphs, which is unnecessarily slow. There seem to be some scaling issues when loading from large datasets, likely in the graph_database_reader, that may be worth investigating.
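One common cause of this kind of slowdown is OFFSET-based pagination, whose cost grows with the offset on large tables; keyset pagination (windowing on the last seen id) stays constant per chunk. The sketch below is a generic illustration under that assumption, not the actual graph_database_reader implementation, and `fetch_after` is a hypothetical callback.

```python
# Sketch of keyset-paginated reading: fetch fixed-size windows of rows
# ordered by id, resuming each query from the last id seen rather than
# using a growing OFFSET.
def read_in_chunks(fetch_after, chunk_size):
    """fetch_after(last_id, n) returns up to n (id, payload) rows with
    id > last_id, ordered by id. Yields every row until exhaustion."""
    last_id = 0
    while True:
        rows = fetch_after(last_id, chunk_size)
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # resume after the last id in this chunk
```

Whether this matches the reader's actual bottleneck would need profiling of the query plans, but it is a cheap pattern to check first.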


ChrisCummins commented on July 17, 2024

A long-running LSTM job died after running out of memory:

$ mprof run bazel-bin/deeplearning/ml4pl/models/lstm/lstm \
    --graph_db="$DB?programl_reachability" \
    --proto_db="$DB?programl_graph_protos" \
    --log_db="$DB?programl_dataflow_logs" \
    --epoch_count=100 \
    --test_on=none \
    --batch_size=64 \
    --max_train_per_epoch=10000 \
    --max_val_per_epoch=2000 \
    --padded_sequence_length=5000 \
    --detailed_batch_types=val,test \
    --vmodule='*'=5 \
    --batch_scores_averaging_method=binary

# <snip...>
NodeLstm:191216223015:cc1 train [ 38 / 100] accuracy=82.92%, precision=0.547, recall=0.307, loss=0.243806 in 1m 37s 705ms

# <snip...>
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Figure_1

I'm not sure if the exception coming from C++ is relevant, or just the "straw that broke the camel's back".

I reworked the graph2seq caching mechanism and will re-run the job to see if the problem persists.


Zacharias030 commented on July 17, 2024

resolved?


ChrisCummins commented on July 17, 2024

This doesn't appear to affect the GGNN any more, but the LSTM still needs ironing out. Not going to close this just yet.


ChrisCummins commented on July 17, 2024

I'm willing to consider this resolved: the last of the LSTM issues seems to have been addressed by limiting the size of the in-memory string caches used during graph encoding.
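A size-bounded cache like the one described can be sketched with an `OrderedDict` used as an LRU: entries beyond a fixed limit are evicted oldest-first. This is a generic illustration of the technique, assuming the fix caps cache entries; it is not the project's graph2seq code.

```python
# Minimal LRU-bounded cache: memoizes compute(key) but never holds more
# than max_entries values, evicting the least recently used entry.
from collections import OrderedDict


class BoundedCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, compute):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = compute(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
        return value
```

The same effect can often be had with `functools.lru_cache(maxsize=...)` when the cached function is pure; the explicit class is useful when eviction needs to be observable or tuned at runtime.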

