
Comments (9)

ChrisCummins commented on July 17, 2024

Notes on diagnosing the issue

I added a //third_party/py/memory_profiler package and a --run_with_memory_profiler flag to help with diagnosis. Test run:

# Build the binary:
$ bazel build //deeplearning/ml4pl/models/ggnn
# Run the script:
$ mprof run bazel-bin/deeplearning/ml4pl/models/ggnn/ggnn \
    --graph_db="$DB?programl_reachability" \
    --log_db='sqlite:////tmp/deleteme.db' \
    --run_with_memory_profiler \
    --epoch_count=5 \
    --test_on=none \
    --max_node_count_limit_handler=skip \
    --max_train_per_epoch=1000 \
    --max_val_per_epoch=100 \
    --graph_batch_node_count=15000 \
    --detailed_batch_types=train,val,test \
    --vmodule='*'=5
# Plot the profile data:
$ mprof plot
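The flag-gated profiling described above can be sketched as follows. This is illustrative plumbing, not the project's actual code; only the `memory_profiler.profile` decorator is the real library API, and the flag name is taken from this thread.

```python
# Hypothetical sketch: gate per-function memory profiling behind a flag so
# the decorator adds zero overhead when --run_with_memory_profiler is unset.
def maybe_profile(fn, enabled):
    """Return fn wrapped with memory_profiler.profile when enabled,
    otherwise return fn unchanged."""
    if not enabled:
        return fn
    import memory_profiler  # deferred import: dep only needed when profiling
    return memory_profiler.profile(fn)


def RunOneEpoch():
    return "epoch done"


# With profiling disabled the function is returned untouched:
RunOneEpoch = maybe_profile(RunOneEpoch, enabled=False)
```

When the flag is enabled, `mprof run` records the per-line memory table shown later in this thread.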

from programl.

ChrisCummins commented on July 17, 2024

Initial results: we have a definite leak!

Figure_1

What's interesting is that there is a significant memory bump in the training epoch, but not in the validation epoch:

Filename: /home/cec/phd/bazel-bin/deeplearning/ml4pl/models/ggnn/ggnn.runfiles/phd/deeplearning/ml4pl/models/run.py

Line #    Mem usage    Increment   Line Contents
================================================
   199   2294.8 MiB   2294.8 MiB     @memory_profiler.profile
   200                               def RunOneEpoch(self, test_on: str, save_on):
   201                                 # Create the batch iterators ahead of time so that they can asynchronously
   202                                 # start reading from the graph database.
   203   2426.5 MiB      0.0 MiB       batch_iterators = {
   204                                   epoch_type: batch_iterator_lib.MakeBatchIterator(
   205                                     model=self.model,
   206                                     graph_db=self.graph_db,
   207                                     splits=self.splits,
   208                                     epoch_type=epoch_type,
   209                                     ctx=self.ctx,
   210                                   )
   211   2426.5 MiB     70.1 MiB         for epoch_type in [epoch.Type.TRAIN, epoch.Type.VAL, epoch.Type.TEST]
   212                                 }
   213
   214   2580.4 MiB    153.9 MiB       train_results, _ = self.RunEpoch(epoch.Type.TRAIN, batch_iterators)
   215
   216   2580.6 MiB      0.2 MiB       val_results, val_improved = self.RunEpoch(epoch.Type.VAL, batch_iterators)
   217
   218   2580.6 MiB      0.0 MiB       if val_improved and (
   219   2580.6 MiB      0.0 MiB         test_on == "improvement" or test_on == "improvement_and_last"
   220                                 ):
   221                                   self.RunEpoch(epoch.Type.TEST, batch_iterators)
   222   2580.6 MiB      0.0 MiB       elif test_on == "improvement_and_last" and self.ctx.i == self.ctx.n - 1:
   223                                   self.RunEpoch(epoch.Type.TEST, batch_iterators)
   224
   225                                 # Determine whether to make a checkpoint.
   226   2580.6 MiB      0.0 MiB       if save_on == schedules.SaveOn.EVERY_EPOCH or (
   227                                   save_on == schedules.SaveOn.VAL_IMPROVED and val_improved
   228                                 ):
   229   2580.6 MiB      0.0 MiB         self.model.SaveCheckpoint()
   230
   231   2580.6 MiB      0.0 MiB       if test_on == "every":
   232                                   self.RunEpoch(epoch.Type.TEST, batch_iterators)

It would be worth making the train/val epochs the same size to see if this still holds.

Edit: Yes, this appears to hold.


ChrisCummins commented on July 17, 2024

Next step is to reproduce using the Zero-R model to see if the issue is model-specific:

# Build the binary:
$ bazel build //deeplearning/ml4pl/models/zero_r
# Run the script:
$ mprof run bazel-bin/deeplearning/ml4pl/models/zero_r/zero_r \
    --graph_db="$DB?programl_reachability" \
    --log_db='sqlite:////tmp/deleteme.db' \
    --run_with_memory_profiler \
    --epoch_count=5 \
    --test_on=none \
    --max_train_per_epoch=1000 \
    --max_val_per_epoch=1000 \
    --graph_reader_order=in_order \
    --vmodule='*'=5
# Plot the profile data:
$ mprof plot

Figure_1

Findings: the issue is reproducible with the Zero-R model, so it is not specific to the GGNN.


ChrisCummins commented on July 17, 2024

Running with verbose logging revealed something interesting: even after the script has 'completed', the batch iterators are still reading graphs from the database and producing batches.

I added some extra logging and found an error in my logic: we were instantiating iterators for all epoch types (train, val, test), but were only using the test iterator under certain conditions (e.g. on improvement). So with the default queue size of 10 and 64 MB reader chunks, we were leaking > 640 MB of data for every unused test iterator. I fixed this, and now the memory usage of the Zero-R looks more stable:

Figure_1
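The shape of the fix can be sketched as lazy iterator construction: an epoch type's iterator (and its background reader threads and prefetch queue) is only created on first use. The class and factory names below are illustrative, not the real `batch_iterator_lib` API.

```python
# Minimal sketch: construct each epoch's batch iterator on demand, so an
# unused test iterator never allocates its prefetch queue.
class LazyBatchIterators:
    def __init__(self, make_iterator):
        self._make_iterator = make_iterator  # epoch_type -> iterator
        self._cache = {}

    def __getitem__(self, epoch_type):
        if epoch_type not in self._cache:
            self._cache[epoch_type] = self._make_iterator(epoch_type)
        return self._cache[epoch_type]
```

With this, `RunOneEpoch` can keep indexing `batch_iterators["test"]` in its conditional branches, and the test iterator's memory is only paid when a test epoch actually runs.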


ChrisCummins commented on July 17, 2024

Worth noting that this Zero-R run took 50 minutes to process 50k graphs, which is unnecessarily slow. There seem to be some scaling issues when loading from large datasets, likely in the graph_database_reader, that may be worth investigating.
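One common cause of this kind of slowdown is OFFSET-based pagination, whose cost grows with the offset on large tables; keyset pagination (windowing on the last seen id) stays constant per chunk. The sketch below is a generic illustration under that assumption, not the actual graph_database_reader implementation, and `fetch_after` is a hypothetical callback.

```python
# Sketch of keyset-paginated reading: fetch fixed-size windows of rows
# ordered by id, resuming each query from the last id seen rather than
# using a growing OFFSET.
def read_in_chunks(fetch_after, chunk_size):
    """fetch_after(last_id, n) returns up to n (id, payload) rows with
    id > last_id, ordered by id. Yields every row until exhaustion."""
    last_id = 0
    while True:
        rows = fetch_after(last_id, chunk_size)
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # resume after the last id in this chunk
```

Whether this matches the reader's actual bottleneck would need profiling of the query plans, but it is a cheap pattern to check first.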


ChrisCummins commented on July 17, 2024

A long-running LSTM job died after running out of memory:

$ mprof run bazel-bin/deeplearning/ml4pl/models/lstm/lstm \
    --graph_db="$DB?programl_reachability" \
    --proto_db="$DB?programl_graph_protos" \
    --log_db="$DB?programl_dataflow_logs" \
    --epoch_count=100 \
    --test_on=none \
    --batch_size=64 \
    --max_train_per_epoch=10000 \
    --max_val_per_epoch=2000 \
    --padded_sequence_length=5000 \
    --detailed_batch_types=val,test \
    --vmodule='*'=5 \
    --batch_scores_averaging_method=binary

# <snip...>
NodeLstm:191216223015:cc1 train [ 38 / 100] accuracy=82.92%, precision=0.547, recall=0.307, loss=0.243806 in 1m 37s 705ms

# <snip...>
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Figure_1

I'm not sure if the exception coming from C++ is relevant, or just the "straw that broke the camel's back".

I reworked the graph2seq caching mechanism and will re-run the job to see if the problem persists.


Zacharias030 commented on July 17, 2024

resolved?


ChrisCummins commented on July 17, 2024

This doesn't appear to affect the GGNN any more, but the LSTM still needs ironing out. Not going to close this just yet.


ChrisCummins commented on July 17, 2024

I'm willing to consider this resolved: the last of the LSTM issues seems to have been addressed by limiting the size of the in-memory string caches used during graph encoding.
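A size-bounded cache like the one described can be sketched with an `OrderedDict` used as an LRU: entries beyond a fixed limit are evicted oldest-first. This is a generic illustration of the technique, assuming the fix caps cache entries; it is not the project's graph2seq code.

```python
# Minimal LRU-bounded cache: memoizes compute(key) but never holds more
# than max_entries values, evicting the least recently used entry.
from collections import OrderedDict


class BoundedCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, compute):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = compute(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
        return value
```

The same effect can often be had with `functools.lru_cache(maxsize=...)` when the cached function is pure; the explicit class is useful when eviction needs to be observable or tuned at runtime.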

