Comments (9)
Notes on diagnosing the issue
I added the //third_party/py/memory_profiler
package and a --run_with_memory_profiler
flag to help with diagnosis. Test run:
# Build the binary:
$ bazel build //deeplearning/ml4pl/models/ggnn
# Run the script:
$ mprof run bazel-bin/deeplearning/ml4pl/models/ggnn/ggnn \
--graph_db="$DB?programl_reachability" \
--log_db='sqlite:////tmp/deleteme.db' \
--run_with_memory_profiler \
--epoch_count=5 \
--test_on=none \
--max_node_count_limit_handler=skip \
--max_train_per_epoch=1000 \
--max_val_per_epoch=100 \
--graph_batch_node_count=15000 \
--detailed_batch_types=train,val,test \
--vmodule='*'=5
# Plot the profile data:
$ mprof plot
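For reference, a minimal sketch (not the repo's actual implementation) of how a flag like --run_with_memory_profiler can gate memory_profiler's line-by-line profiler; the argparse flag and function names here are illustrative:

import argparse
import memory_profiler

parser = argparse.ArgumentParser()
parser.add_argument("--run_with_memory_profiler", action="store_true")
args, _ = parser.parse_known_args()

def maybe_profile(fn):
  # Apply memory_profiler's line-by-line profiler only when requested,
  # since the tracing overhead is significant.
  return memory_profiler.profile(fn) if args.run_with_memory_profiler else fn

@maybe_profile
def run_one_epoch():
  ...  # training / validation work to be profiled goes here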
Initial results: we have a definite leak!
What's interesting is that there is a significant memory bump in the training epoch, but not in the validation epoch:
Filename: /home/cec/phd/bazel-bin/deeplearning/ml4pl/models/ggnn/ggnn.runfiles/phd/deeplearning/ml4pl/models/run.py
Line # Mem usage Increment Line Contents
================================================
199 2294.8 MiB 2294.8 MiB @memory_profiler.profile
200 def RunOneEpoch(self, test_on: str, save_on):
201 # Create the batch iterators ahead of time so that they can asynchronously
202 # start reading from the graph database.
203 2426.5 MiB 0.0 MiB batch_iterators = {
204 epoch_type: batch_iterator_lib.MakeBatchIterator(
205 model=self.model,
206 graph_db=self.graph_db,
207 splits=self.splits,
208 epoch_type=epoch_type,
209 ctx=self.ctx,
210 )
211 2426.5 MiB 70.1 MiB for epoch_type in [epoch.Type.TRAIN, epoch.Type.VAL, epoch.Type.TEST]
212 }
213
214 2580.4 MiB 153.9 MiB train_results, _ = self.RunEpoch(epoch.Type.TRAIN, batch_iterators)
215
216 2580.6 MiB 0.2 MiB val_results, val_improved = self.RunEpoch(epoch.Type.VAL, batch_iterators)
217
218 2580.6 MiB 0.0 MiB if val_improved and (
219 2580.6 MiB 0.0 MiB test_on == "improvement" or test_on == "improvement_and_last"
220 ):
221 self.RunEpoch(epoch.Type.TEST, batch_iterators)
222 2580.6 MiB 0.0 MiB elif test_on == "improvement_and_last" and self.ctx.i == self.ctx.n - 1:
223 self.RunEpoch(epoch.Type.TEST, batch_iterators)
224
225 # Determine whether to make a checkpoint.
226 2580.6 MiB 0.0 MiB if save_on == schedules.SaveOn.EVERY_EPOCH or (
227 save_on == schedules.SaveOn.VAL_IMPROVED and val_improved
228 ):
229 2580.6 MiB 0.0 MiB self.model.SaveCheckpoint()
230
231 2580.6 MiB 0.0 MiB if test_on == "every":
232 self.RunEpoch(epoch.Type.TEST, batch_iterators)
It would be worth making the train/val epochs the same size to see if this still holds.
Edit: Yes, this appears to hold.
Next step is to reproduce using the Zero-R model to see whether the issue is model-specific:
# Build the binary:
$ bazel build //deeplearning/ml4pl/models/zero_r
# Run the script:
$ mprof run bazel-bin/deeplearning/ml4pl/models/zero_r/zero_r \
--graph_db="$DB?programl_reachability" \
--log_db='sqlite:////tmp/deleteme.db' \
--run_with_memory_profiler \
--epoch_count=5 \
--test_on=none \
--max_train_per_epoch=1000 \
--max_val_per_epoch=1000 \
--graph_reader_order=in_order \
--vmodule='*'=5
# Plot the profile data:
$ mprof plot
Findings: the issue is reproducible with the Zero-R, so it is not specific to the GGNN.
Running with verbose logging revealed something interesting - even after the script has 'completed', the batch iterators are still reading graphs from the database and producing batches.
I added some extra logging and found an error in my logic: we were instantiating iterators for all epoch types (train, val, test), but only using the test iterator under certain conditions (e.g. on improvement). So with the default queue size of 10 and 64 MB reader chunks, we were leaking > 640 MB of data for every unused test iterator. I fixed this, and now the memory usage of the Zero-R looks more stable.
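The shape of the fix, sketched against the names in the profile listing above (illustrative, not the exact commit): store thunks rather than live iterators, so the background reader (and its ~10 x 64 MB chunk queue) only starts for epochs that actually run.

# Inside RunOneEpoch: build thunks instead of live iterators. The
# lambda's default argument pins epoch_type at definition time.
batch_iterators = {
  epoch_type: lambda epoch_type=epoch_type: batch_iterator_lib.MakeBatchIterator(
    model=self.model,
    graph_db=self.graph_db,
    splits=self.splits,
    epoch_type=epoch_type,
    ctx=self.ctx,
  )
  for epoch_type in [epoch.Type.TRAIN, epoch.Type.VAL, epoch.Type.TEST]
}
# RunEpoch then calls batch_iterators[epoch_type]() on demand, so the
# TEST iterator is only constructed when a test epoch actually runs.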
Worth noting that this Zero-R run took 50 minutes to process 50k graphs, which is unnecessarily slow. There seem to be scaling issues when loading from large datasets, likely in the graph_database_reader, that may be worth investigating.
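If anyone picks this up, a quick way to test the scaling hypothesis (the reader interface here is assumed, not checked against the repo) is to log throughput as the reader pages through the table:

import time

def log_read_throughput(reader, log_every=1000):
  # `reader` is any iterable of graphs. If the per-interval rate drops
  # as i grows, the reader's paging strategy likely scales with table
  # depth (e.g. OFFSET-based pagination re-scans skipped rows per query).
  start = interval_start = time.time()
  for i, _ in enumerate(reader, start=1):
    if i % log_every == 0:
      now = time.time()
      print(f"{i} graphs: overall {i / (now - start):.1f}/s, "
            f"last {log_every}: {log_every / (now - interval_start):.1f}/s")
      interval_start = now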
A long-running LSTM job died after running out of memory:
$ mprof run bazel-bin/deeplearning/ml4pl/models/lstm/lstm \
--graph_db="$DB?programl_reachability" \
--proto_db="$DB?programl_graph_protos" \
--log_db="$DB?programl_dataflow_logs" \
--epoch_count=100 \
--test_on=none \
--batch_size=64 \
--max_train_per_epoch=10000 \
--max_val_per_epoch=2000 \
--padded_sequence_length=5000 \
--detailed_batch_types=val,test \
--vmodule='*'=5 \
--batch_scores_averaging_method=binary
# <snip...>
NodeLstm:191216223015:cc1 train [ 38 / 100] accuracy=82.92%, precision=0.547, recall=0.307, loss=0.243806 in 1m 37s 705ms
# <snip...>
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
I'm not sure if the exception coming from C++ is relevant, or just the "straw that broke the camel's back".
I reworked the graph2seq caching mechanism and will re-run the job to see if the problem persists.
resolved?
This doesn't appear to affect the GGNN, but the LSTM still needs ironing out. Not going to close this just yet.
I'm willing to consider this resolved - the last of the LSTM issues seems to have been addressed by limiting the size of in-memory string caches during graph encoding.
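For the record, a minimal sketch of that kind of mitigation, assuming nothing about the actual graph2seq code: replace an unbounded dict cache with an LRU cache whose size is capped, so encoding arbitrarily many distinct strings keeps memory flat.

import functools

@functools.lru_cache(maxsize=100_000)
def encode_string(s: str):
  # Hypothetical stand-in for the real graph2seq string encoder. With an
  # unbounded cache, every distinct string ever encoded stays resident;
  # lru_cache evicts least-recently-used entries once maxsize is reached.
  return tuple(ord(c) for c in s)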