Comments (16)

PWhiddy commented on August 18, 2024

I used https://vast.ai/
Their servers aren't quite as reliable or secure as regular cloud providers but they are much cheaper.
Also, I'm working on a version of the training script that will use fewer resources! Stay tuned.

from pokemonredexperiments.

setomage commented on August 18, 2024

100GB of RAM? Those be some rookie numbers. :P


PWhiddy commented on August 18, 2024

For everyone in this thread: the new script run_baseline_parallel_fast.py trains much faster and only uses 15-20GB of memory!

jsuarez5341 commented on August 18, 2024

Okay, I got it training with 8GB of RAM. Will see if it learns anything

cptmiche commented on August 18, 2024

Piggy-backing on this, are there any hardware requirements beyond a large chunk of RAM? I ask because this seems like a lot of fun to run through, and I run a rather beefy server architecture at home (I work in IT, and I'd rather run my own cloud than use someone else's, for labs and messing around with stuff).

My ESXi server has 28 physical cores (56 threads) and 192GB of RAM, but no discrete GPU. Is that workable, or do I need a dedicated GPU to run the training?

RussellMaggs commented on August 18, 2024

I could be completely wrong, but I think no GPU is currently required to run this training.

Those specs are better than what I'm running on, so you should be able to run it no problem.

Lawbayly commented on August 18, 2024

Piggy-backing on this, are there any hardware requirements beyond a large chunk of RAM? I ask because this seems like a lot of fun to run through, and I run a rather beefy server architecture at home (I work in IT, and I'd rather run my own cloud than use someone else's, for labs and messing around with stuff).

The more CPUs you have, the more training instances you can run (by adjusting num_cpu in run_baseline_parallel.py). I managed twice my thread count, but your mileage may vary; raising or lowering num_cpu makes RAM usage go up or down respectively. I'm running fine with num_cpu at 24 on a 12-thread CPU with 32GB of RAM.

setomage commented on August 18, 2024

Piggy-backing on this, are there any hardware requirements beyond a large chunk of RAM? I ask because this seems like a lot of fun to run through, and I run a rather beefy server architecture at home (I work in IT, and I'd rather run my own cloud than use someone else's, for labs and messing around with stuff).

My ESXi server has 28 physical cores (56 threads) and 192GB of RAM, but no discrete GPU. Is that workable, or do I need a dedicated GPU to run the training?

As the code stands, CPU and RAM are what's needed; it doesn't touch the GPU. So you could set your num_cpu to 28 and be golden. With num_cpu at 24 I only use about 90GB of RAM once things are going; setting it to 30 maxes out my CPUs but only went to about 105GB of RAM used. So you should be golden.
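For a rough sense of scale, the two data points above suggest a fairly linear relationship between num_cpu and RAM. A minimal sketch, assuming usage really does scale linearly between the reported figures:

```python
# Rough per-instance RAM estimate from the figures reported above.
# Assumption: RAM usage scales roughly linearly with num_cpu.
ram_at_24 = 90    # GB observed with num_cpu = 24
ram_at_30 = 105   # GB observed with num_cpu = 30

per_instance = (ram_at_30 - ram_at_24) / (30 - 24)
print(per_instance)   # 2.5 GB of extra RAM per additional instance

# Estimated usage for the suggested num_cpu = 28:
est_28 = ram_at_24 + per_instance * (28 - 24)
print(est_28)         # 100.0 GB, comfortably inside 192GB
```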

cptmiche commented on August 18, 2024

Piggy-backing on this, are there any hardware requirements beyond a large chunk of RAM? I ask because this seems like a lot of fun to run through, and I run a rather beefy server architecture at home (I work in IT, and I'd rather run my own cloud than use someone else's, for labs and messing around with stuff).
My ESXi server has 28 physical cores (56 threads) and 192GB of RAM, but no discrete GPU. Is that workable, or do I need a dedicated GPU to run the training?

As the code stands, CPU and RAM are what's needed; it doesn't touch the GPU. So you could set your num_cpu to 28 and be golden. With num_cpu at 24 I only use about 90GB of RAM once things are going; setting it to 30 maxes out my CPUs but only went to about 105GB of RAM used. So you should be golden.

Thank you!

Max-We commented on August 18, 2024

Thanks for sharing! I'm wondering whether you used Docker to train your model in the cloud, or whether you took another route?

jsuarez5341 commented on August 18, 2024

I can almost guarantee that I can bring the RAM usage way down... porting it now; it will need a bit of time. CPU cores will still be non-negotiable, since the env is slow: 50 steps/second/core, more or less.
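To put that throughput figure in context, here's a back-of-the-envelope sketch; the core count is a hypothetical for illustration, and the episode length is borrowed from elsewhere in this thread:

```python
# Back-of-the-envelope throughput from ~50 env steps/sec/core.
steps_per_sec_per_core = 50
cores = 24                      # hypothetical core count
total_steps_per_sec = steps_per_sec_per_core * cores
print(total_steps_per_sec)      # 1200 aggregate env steps/sec

# Wall-clock time for one episode per environment at ep_length = 2048 * 10:
ep_length = 2048 * 10
minutes_per_episode = ep_length / steps_per_sec_per_core / 60
print(round(minutes_per_episode, 1))   # 6.8 minutes per episode
```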

jsuarez5341 commented on August 18, 2024

Found it!

model = PPO('CnnPolicy', env, verbose=1, n_steps=ep_length, batch_size=512, n_epochs=1, gamma=0.999)

n_steps is the number of frames per environment that you are keeping in memory. So 2048*8 for each of 44 environments... 720,896. Napkin math says 44 GB of observations without any optimizations. That batch size is not unheard of in RL, particularly for long games, but it can probably be made lower.
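The napkin math above can be spelled out. The observation shape below is an assumption for scale only (a full 144x160 RGB Game Boy frame), not necessarily the script's exact, possibly downsampled shape:

```python
# Rollout buffer size implied by the PPO call quoted above.
n_steps = 2048 * 8        # frames kept in memory per environment
num_envs = 44             # parallel environments
total_obs = n_steps * num_envs
print(total_obs)          # 720896 observations held at once

# Hypothetical uint8 frame of 144x160x3 (assumed, for scale only):
obs_bytes = 144 * 160 * 3
gb = round(total_obs * obs_bytes / 1e9, 1)
print(gb)                 # 49.8 GB, the same order as the ~44 GB estimate
```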

minermartijn commented on August 18, 2024

How long does a training session usually take? Or do you just stop it and try run_pretrained after? (Sorry, new to this, but I want to join in the fun!!)

setomage commented on August 18, 2024

How long does a training session usually take? Or do you just stop it and try run_pretrained after? (Sorry, new to this, but I want to join in the fun!!)

This is a bit of a trick question.

(Please note I'm training my AI differently than most.)
Running 24 cores (instances of the game) with ep_length at 8192 * 10, it takes just over an hour to finish a session. From there you can let the AI keep running another session, or press Ctrl+C to stop it. Then you have to edit the watch-a-run script to point at your session folder and your step file.

Normally, starting a new training session with 2048 * 10, you should see your AI start to get the first gym badge. From there you let it keep training, and it gets better at doing it.

The current hangup is Mount Moon, but everyone is trying to find the missing key for this.

fangyuan-ksgk commented on August 18, 2024

In the parallel_fast.py file, what are the tricks that speed up the training?

I noticed that the batch_size shrank by 4x and num_cpu shrank by 3x; are there other tricks being used?

setomage commented on August 18, 2024

In the parallel_fast.py file, what are the tricks that speed up the training?

I noticed that the batch_size shrank by 4x and num_cpu shrank by 3x; are there other tricks being used?

That's actually the mini-batch size. The full batch size is computed by the system, which also uses the ep_length.

With the smaller mini-batch size, it can work through the data faster than with the larger 512.
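A sketch of that relationship, assuming the Stable Baselines 3 convention where batch_size is the mini-batch carved out of the full rollout of n_steps * n_envs samples; the concrete numbers here are illustrative, not the script's exact settings:

```python
# How PPO (SB3 convention) splits one rollout into mini-batches.
n_steps = 2048 * 10     # derived from ep_length (illustrative)
n_envs = 16             # parallel environments (illustrative)
batch_size = 128        # the smaller mini-batch discussed above

rollout = n_steps * n_envs            # samples collected per update
minibatches = rollout // batch_size   # gradient steps per epoch
print(rollout)        # 327680 samples per update
print(minibatches)    # 2560 mini-batches of 128, vs 640 of 512
```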
