Comments (16)

PWhiddy commented on August 18, 2024

I used https://vast.ai/
Their servers aren't quite as reliable or secure as regular cloud providers but they are much cheaper.
Also, I'm working on a version of the training script that will use fewer resources! Stay tuned.

from pokemonredexperiments.

setomage commented on August 18, 2024

100GB of RAM? Those be some rookie numbers. :P


PWhiddy commented on August 18, 2024

For everyone in this thread: the new script run_baseline_parallel_fast.py trains much faster and only uses 15-20GB of memory!

jsuarez5341 commented on August 18, 2024

Okay, I got it training with 8GB of RAM. Will see if it learns anything

cptmiche commented on August 18, 2024

Piggy-backing on this, are there any hardware requirements beyond a large chunk of RAM? I ask because this seems like a lot of fun to run through, and I run a rather beefy server architecture at home (I work in IT, and I'd rather run my own cloud than use someone else's, for labs and messing around with stuff).

My ESXi server has 28 physical cores (56 threads) and 192GB of RAM, but no discrete GPU. Is that workable, or do I need a dedicated GPU to run the training?

RussellMaggs commented on August 18, 2024

I could be completely wrong, but I think no GPU is currently required to run this training.

Those specs are better than what I'm running on, so you should be able to run it no problem.

Lawbayly commented on August 18, 2024

Piggy-backing on this, are there any hardware requirements beyond a large chunk of RAM? I ask because this seems like a lot of fun to run through, and I run a rather beefy server architecture at home (I work in IT, and I'd rather run my own cloud than use someone else's, for labs and messing around with stuff).

The more CPUs you have, the more training instances you can run (by adjusting num_cpu in run_baseline_parallel.py). I managed twice my thread count, but your mileage may vary; raising or lowering num_cpu makes RAM usage go up or down respectively. I'm running fine with num_cpu at 24 on a 12-thread CPU with 32GB of RAM.

setomage commented on August 18, 2024

Piggy-backing on this, are there any hardware requirements beyond a large chunk of RAM? I ask because this seems like a lot of fun to run through, and I run a rather beefy server architecture at home (I work in IT, and I'd rather run my own cloud than use someone else's, for labs and messing around with stuff).

My ESXi server has 28 physical cores (56 threads) and 192GB of RAM, but no discrete GPU. Is that workable, or do I need a dedicated GPU to run the training?

As the code stands, CPU and RAM are what's needed; it doesn't touch the GPU. So you could set your num_cpu to 28 and be golden. With num_cpu at 24 I only use about 90GB of RAM once things are going; setting it to 30 maxes out my CPUs but only went to about 105GB of RAM used. So you should be golden.
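For a rough sense of scale, the two data points above suggest a fairly linear relationship between num_cpu and RAM. A minimal sketch, assuming usage really does scale linearly between the reported figures:

```python
# Rough per-instance RAM estimate from the figures reported above.
# Assumption: RAM usage scales roughly linearly with num_cpu.
ram_at_24 = 90    # GB observed with num_cpu = 24
ram_at_30 = 105   # GB observed with num_cpu = 30

per_instance = (ram_at_30 - ram_at_24) / (30 - 24)
print(per_instance)   # 2.5 GB of extra RAM per additional instance

# Estimated usage for the suggested num_cpu = 28:
est_28 = ram_at_24 + per_instance * (28 - 24)
print(est_28)         # 100.0 GB, comfortably inside 192GB
```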

cptmiche commented on August 18, 2024

Piggy-backing on this, are there any hardware requirements beyond a large chunk of RAM? I ask because this seems like a lot of fun to run through, and I run a rather beefy server architecture at home (I work in IT, and I'd rather run my own cloud than use someone else's, for labs and messing around with stuff).
My ESXi server has 28 physical cores (56 threads) and 192GB of RAM, but no discrete GPU. Is that workable, or do I need a dedicated GPU to run the training?

As the code stands, CPU and RAM are what's needed; it doesn't touch the GPU. So you could set your num_cpu to 28 and be golden. With num_cpu at 24 I only use about 90GB of RAM once things are going; setting it to 30 maxes out my CPUs but only went to about 105GB of RAM used. So you should be golden.

Thank you!

Max-We commented on August 18, 2024

Thanks for sharing! I'm wondering whether you used Docker to train your model in the cloud, or whether you took another route?

jsuarez5341 commented on August 18, 2024

I can almost guarantee that I can bring the RAM usage way down... porting it now; it will need a bit of time. CPU cores will still be non-negotiable, since the env is slow: 50 steps/second/core, more or less.
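To put that throughput figure in context, here's a back-of-the-envelope sketch; the core count is a hypothetical for illustration, and the episode length is borrowed from elsewhere in this thread:

```python
# Back-of-the-envelope throughput from ~50 env steps/sec/core.
steps_per_sec_per_core = 50
cores = 24                      # hypothetical core count
total_steps_per_sec = steps_per_sec_per_core * cores
print(total_steps_per_sec)      # 1200 aggregate env steps/sec

# Wall-clock time for one episode per environment at ep_length = 2048 * 10:
ep_length = 2048 * 10
minutes_per_episode = ep_length / steps_per_sec_per_core / 60
print(round(minutes_per_episode, 1))   # 6.8 minutes per episode
```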

jsuarez5341 commented on August 18, 2024

Found it!

model = PPO('CnnPolicy', env, verbose=1, n_steps=ep_length, batch_size=512, n_epochs=1, gamma=0.999)

n_steps is the number of frames per environment that you are keeping in memory. So 2048*8 for each of 44 environments... 720,896. Napkin math says 44 GB of observations without any optimizations. That batch size is not unheard of in RL, particularly for long games, but it can probably be made lower.
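The napkin math above can be spelled out. The observation shape below is an assumption for scale only (a full 144x160 RGB Game Boy frame), not necessarily the script's exact, possibly downsampled shape:

```python
# Rollout buffer size implied by the PPO call quoted above.
n_steps = 2048 * 8        # frames kept in memory per environment
num_envs = 44             # parallel environments
total_obs = n_steps * num_envs
print(total_obs)          # 720896 observations held at once

# Hypothetical uint8 frame of 144x160x3 (assumed, for scale only):
obs_bytes = 144 * 160 * 3
gb = round(total_obs * obs_bytes / 1e9, 1)
print(gb)                 # 49.8 GB, the same order as the ~44 GB estimate
```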

minermartijn commented on August 18, 2024

How long does a training session usually take? Or do you just stop it and try run_pretrained after? (Sorry, new to this, but I want to join in the fun!!)

setomage commented on August 18, 2024

How long does a training session usually take? Or do you just stop it and try run_pretrained after? (Sorry, new to this, but I want to join in the fun!!)

This is a bit of a trick question.

(Please note I'm training my AI differently than most.)
Running 24 cores (instances of the game) with ep_length at 8192 * 10, it takes just over an hour to finish a session. From there you can let the AI keep running another session, or press Ctrl+C to stop it. Then you have to edit the watch-a-run script to point at your session folder and your step file.

Normally, starting a new training session with 2048 * 10, you should see your AI start to get the first gym badge. From there you let it keep training, and it gets better at doing it.

The current hangup is Mount Moon, but everyone is trying to find the missing key for this.

fangyuan-ksgk commented on August 18, 2024

In the parallel_fast.py file, what are the tricks that speed up the training?

I noticed that the batch_size shrank by 4x and num_cpu shrank by 3x; are there other tricks being used?

setomage commented on August 18, 2024

In the parallel_fast.py file, what are the tricks that speed up the training?

I noticed that the batch_size shrank by 4x and num_cpu shrank by 3x; are there other tricks being used?

That's actually the mini-batch size. The full batch size is computed by the system, which also uses the ep_length.

With the smaller mini-batch size, it can work through the data faster than with the larger 512.
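A sketch of that relationship, assuming the Stable Baselines 3 convention where batch_size is the mini-batch carved out of the full rollout of n_steps * n_envs samples; the concrete numbers here are illustrative, not the script's exact settings:

```python
# How PPO (SB3 convention) splits one rollout into mini-batches.
n_steps = 2048 * 10     # derived from ep_length (illustrative)
n_envs = 16             # parallel environments (illustrative)
batch_size = 128        # the smaller mini-batch discussed above

rollout = n_steps * n_envs            # samples collected per update
minibatches = rollout // batch_size   # gradient steps per epoch
print(rollout)        # 327680 samples per update
print(minibatches)    # 2560 mini-batches of 128, vs 640 of 512
```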
