How to prevent GPU OOM? (alphazero.jl, closed)

pepa65 commented on July 20, 2024

How to prevent GPU OOM?

from alphazero.jl.

Comments (9)

smart-fr commented on July 20, 2024

Did you have a look here?
Reducing the mem_buffer_size might work, even though it's not clear why.
If not, you could share your params.jl file.

jonathan-laurent commented on July 20, 2024

From your stack trace, AlphaZero.jl crashes during the gradient-update phase, not during self-play. So my guess is that you should also lower your batch_size in LearningParams.

smart-fr commented on July 20, 2024

I'm not sure whether it makes any difference in Julia, or whether it would cause a crash during the gradient-update phase, but in your params.jl a comma seems to have been replaced by a period in the definition of mem_buffer_size: "[0]. [80_000]".
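
For comparison, here is the intended two-argument form as a fragment of games/connect-four/params.jl. It is a config fragment rather than a standalone script, and the comments on what each vector means are a reading of the original schedule in the diff, not something confirmed in this thread:

```julia
# Fragment of games/connect-four/params.jl (normally a keyword argument
# of Params(...)). PLSchedule is given two vectors separated by a comma.
mem_buffer_size=PLSchedule(
  [0],        # assumed: iteration at which the value applies
  [80_000])   # assumed: memory buffer size from that iteration on
```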

smart-fr commented on July 20, 2024

Delete your sessions/connect-four folder and restart.
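
A likely reason for this advice, judging from the `same_json(Network.hyperparams(env.bestnn), e.netparams)` assertion reported in this thread, is that the saved session still holds a network built with the old hyperparameters. A minimal sketch of clearing it from the Julia REPL, assuming the session lives in `sessions/connect-four` relative to the current working directory:

```julia
# Path taken from the thread; adjust it if your session is stored elsewhere.
session_dir = joinpath("sessions", "connect-four")

# recursive=true removes the directory and everything inside it.
# force=true makes the call a no-op if the directory does not exist.
rm(session_dir; recursive=true, force=true)
```

Note that this discards all saved training progress for that session, so training starts again from iteration 0.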

pepa65 commented on July 20, 2024

I missed batch_size in LearningParams, so I quartered that too. I'm now running with the 2 batch_sizes quartered and with mem_buffer_size=PLSchedule([0], [80_000]). It keeps running now! I also decided to delete sessions/connect-four and see what happens. Thanks for all of your help so far!

Update: It's still going, now on iteration 4 (won 32% on iteration 3).
Update: It finished after a few days, and is playable!


jonathan-laurent commented on July 20, 2024

Yes, you probably want smaller networks (e.g. fewer filters, fewer layers) and smaller batch sizes.
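
To see roughly why both knobs matter, here is a back-of-the-envelope sketch. It assumes each convolutional layer keeps one Float32 activation tensor of size board height × board width × num_filters per sample, on a 6×7 Connect Four board. It ignores weights, gradients, optimizer state and framework workspace, and the helper `activation_bytes` and the before/after values are hypothetical, chosen only for illustration:

```julia
# Rough size in bytes of one convolutional activation tensor stored as Float32.
# Assumption: one board_height × board_width × num_filters feature map per sample.
activation_bytes(batch_size, num_filters; board_height=6, board_width=7) =
    batch_size * num_filters * board_height * board_width * sizeof(Float32)

# Hypothetical settings, not values taken from the repository.
before = activation_bytes(256, 128)
after  = activation_bytes(64, 32)

println("before: ", before / 2^20, " MiB per layer")
println("after:  ", after / 2^20, " MiB per layer")
println("ratio:  ", before ÷ after)
```

Under this simplified model, memory grows linearly in both batch size and filter count, so quartering both cuts the per-layer activation size by a factor of 16. Actual GPU usage will differ.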


pepa65 commented on July 20, 2024

In ./games/connect-four/params.jl I halved NetLib.ResNetHP(num_filters) to start with. I didn't find any reference to layers, but when I halved every parameter with batch_size in its name, it crashed. Even if I only modify num_filters, it doesn't run. What would be an example of a working set of parameters for smaller GPUs? (I have an RTX 3050.)


pepa65 commented on July 20, 2024

Yes, I had a look at #174, but I can't seem to make any modifications that work. Right now I have the current repo's games/connect-four/params.jl, except for this diff:

--- a/games/connect-four/params.jl
+++ b/games/connect-four/params.jl
@@ -5,7 +5,7 @@
 Network = NetLib.ResNet
 
 netparams = NetLib.ResNetHP(
-  num_filters=128,
+  num_filters=32, #128,
   num_blocks=5,
   conv_kernel_size=(3, 3),
   num_policy_head_filters=32,
@@ -66,8 +66,9 @@ params = Params(
   use_symmetries=true,
   memory_analysis=nothing,
   mem_buffer_size=PLSchedule(
-  [      0,        15],
-  [400_000, 1_000_000]))
+#  [      0,        15],
+#  [400_000, 1_000_000]))
+  [0]. [80_000]))
 
 #####
 ##### Evaluation benchmark
@@ -93,7 +94,8 @@ benchmark_sim = SimParams(
   arena.sim;
   num_games=256,
   num_workers=256,
-  batch_size=256,
+  #batch_size=256,
+  batch_size=16,
   alternate_colors=false)
 
 benchmark = [

(It crashes...)


pepa65 commented on July 20, 2024

Thanks, replaced the dot with a comma (old eyes...)
When I run this, I get:

[ Info: Using the Flux implementation of AlphaZero.NetLib.

Loading environment from: sessions/connect-four

[ Info: Using modified parameters
ERROR: AssertionError: same_json(Network.hyperparams(env.bestnn), e.netparams)
Stacktrace:
 [1] Session(e::Experiment; dir::Nothing, autosave::Bool, nostdout::Bool, save_intermediate::Bool)
   @ AlphaZero.UserInterface ~/git/AlphaZero.jl/src/ui/session.jl:288
 [2] Session
   @ ~/git/AlphaZero.jl/src/ui/session.jl:273 [inlined]
 [3] train(e::Experiment; args::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ AlphaZero.Scripts ~/git/AlphaZero.jl/src/scripts/scripts.jl:26
 [4] train
   @ ~/git/AlphaZero.jl/src/scripts/scripts.jl:26 [inlined]
 [5] #train#15
   @ ~/git/AlphaZero.jl/src/scripts/scripts.jl:28 [inlined]
 [6] train(s::String)
   @ AlphaZero.Scripts ~/git/AlphaZero.jl/src/scripts/scripts.jl:28
 [7] top-level scope
   @ none:1
