Single-Agent RL Atari Pong

Single-agent classic reinforcement learning (no Deep RL) on Atari Pong, developed as the course project for Distributed Artificial Intelligence at the University of Modena and Reggio Emilia, Italy.

Observation preprocessing

The screen pixel observation is downsampled by a factor of 3 on rows and 2 on columns, reaching a shape of 53 x 80. I consider only the pixels from 35 to 92, i.e. I cut out the side walls and the scores to reduce the number of pixels.
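The cropping and downsampling can be sketched with NumPy slicing. This is a minimal sketch, not the project's actual code: the function name `preprocess`, the 210 x 160 RGB input frame, and the crop rows 35 to 194 are assumptions, chosen only so that the output matches the 53 x 80 shape stated above (the text gives the crop range as 35 to 92, which may refer to a different axis or scale).

```python
import numpy as np

# Assumed crop bounds (hypothetical): rows 35..193 keep the playing field
# and drop the score area; they are chosen to reproduce the 53 x 80 shape.
CROP_TOP, CROP_BOTTOM = 35, 194
ROW_STEP, COL_STEP = 3, 2  # downsampling factors from the text


def preprocess(frame: np.ndarray) -> np.ndarray:
    """Crop an assumed 210 x 160 (x3) frame and downsample it to 53 x 80."""
    gray = frame[..., 0] if frame.ndim == 3 else frame  # keep a single channel
    cropped = gray[CROP_TOP:CROP_BOTTOM]
    return cropped[::ROW_STEP, ::COL_STEP]


frame = np.zeros((210, 160, 3), dtype=np.uint8)  # dummy frame for illustration
print(preprocess(frame).shape)  # (53, 80)
```

Strided slicing returns a view, so this step copies no pixel data.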

The number of states is computed from the resized screen dimensions (described in the previous section) as:

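$$\underbrace{53 \times 80}_{\text{pos\_ball}} \times \underbrace{53}_{\text{pos\_agent}} \times \underbrace{6}_{\text{n\_actions}} = 1\,348\,320 \ \text{(states)}, \qquad 1\,348\,320 \times 4 \ \text{bytes} \approx 5.4 \ \text{MB}$$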

I assumed that I don't need to know the position of the competitor in order to win the game, so I counted the states only for agent_0. This assumption makes the game partially observable.
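The count above can be checked with a few lines of arithmetic. The constant names are illustrative, and the 4 bytes per entry are taken from the formula (for example a 32-bit float).

```python
from math import prod

BALL_ROWS, BALL_COLS = 53, 80  # possible ball positions on the resized screen
AGENT_POSITIONS = 53           # possible racket rows for agent_0
N_ACTIONS = 6                  # number of actions
BYTES_PER_ENTRY = 4            # e.g. one 32-bit float per qtable entry

qtable_shape = (BALL_ROWS, BALL_COLS, AGENT_POSITIONS, N_ACTIONS)
n_entries = prod(qtable_shape)
size_mb = n_entries * BYTES_PER_ENTRY / 1e6

print(n_entries)          # 1348320
print(round(size_mb, 1))  # 5.4
```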

Learning

In this project I investigated the potential of Q-Learning (RL) for extracting smart behaviours. I focused mainly on the hard convergence problem caused by sparsity, i.e. the qtables are big. To tackle this problem I experimented with the effects of a gaussian reward (a smoother reward) and of qtable initialization.
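For reference, the standard tabular Q-learning update can be sketched as follows. This is a generic illustration, not the project's code: the learning rate, the discount factor, the dictionary-based table, and the state tuple layout are all assumptions.

```python
from collections import defaultdict

N_ACTIONS = 6
ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.99  # discount factor (assumed value)

# Q-table as a mapping state -> list of action values, zero-initialized.
q = defaultdict(lambda: [0.0] * N_ACTIONS)


def q_update(state, action, reward, next_state, done):
    """Apply one tabular Q-learning update and return the new Q-value."""
    target = reward if done else reward + GAMMA * max(q[next_state])
    q[state][action] += ALPHA * (target - q[state][action])
    return q[state][action]


# Hypothetical state layout: (ball_row, ball_col, agent_row)
s, s_next = (10, 40, 12), (11, 41, 12)
q_update(s, action=2, reward=1.0, next_state=s_next, done=False)
```

Because rewards in Pong arrive only when a point is scored, most entries of such a table are rarely updated, which is the sparsity issue discussed here.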

Qtable Initialization

At first I was convinced that initializing the qtable with non-zero values could be a good solution, as happens in neural networks. I soon realized that random initialization wasn't actually good: it introduced noise into the q-learning convergence, since the update relies on the qtable values.

The image above shows this behaviour: the random initialization performs worse than the zero initialization.
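The two initialization schemes can be sketched as below. This is only an illustration: the function name `init_qtable`, the table layout, and the uniform range given by `scale` are assumptions, since the range of the random values is not documented here.

```python
import numpy as np

QTABLE_SHAPE = (53, 80, 53, 6)  # assumed layout: ball row, ball col, racket position, action


def init_qtable(mode: str = "zeros", scale: float = 0.01, seed: int = 0) -> np.ndarray:
    """Return a qtable initialized with zeros or with small random noise."""
    if mode == "zeros":
        return np.zeros(QTABLE_SHAPE, dtype=np.float32)
    if mode == "random":
        rng = np.random.default_rng(seed)
        # `scale` is an assumed value; the original range is not documented.
        return rng.uniform(-scale, scale, size=QTABLE_SHAPE).astype(np.float32)
    raise ValueError(f"unknown mode: {mode}")


q_zero = init_qtable("zeros")
q_rand = init_qtable("random")
# With zeros, all actions tie until a reward is observed; with random values,
# the greedy action in an unvisited state is decided by the noise.
```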

Gaussian Rewards

To address the sparsity problem, I implemented a gaussian smoothing on the reward signal. Since there is a close relationship between the states and the screen's pixels, it makes sense to spread the reward spatially by smoothing (e.g. if a specific pixel is a great location to catch the ball, then it's reasonable that the nearby ones are good positions too).
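One way to spread a reward over neighbouring positions is sketched below. It is a sketch under stated assumptions, not the project's implementation: the function names, `sigma`, and the choice of a kernel with center weight 1 (rather than one that sums to 1) are all hypothetical.

```python
import math


def gaussian_kernel(size: int, sigma: float = 1.0):
    """Return a size x size Gaussian kernel whose center weight is 1."""
    half = size // 2
    return [
        [math.exp(-(dr * dr + dc * dc) / (2 * sigma * sigma))
         for dc in range(-half, half + 1)]
        for dr in range(-half, half + 1)
    ]


def spread_reward(reward, row, col, kernel, n_rows=53, n_cols=80):
    """Map each in-bounds neighbour of (row, col) to its weighted reward."""
    half = len(kernel) // 2
    spread = {}
    for i, kernel_row in enumerate(kernel):
        for j, weight in enumerate(kernel_row):
            r, c = row + i - half, col + j - half
            if 0 <= r < n_rows and 0 <= c < n_cols:  # skip off-screen cells
                spread[(r, c)] = reward * weight
    return spread


rewards = spread_reward(1.0, row=20, col=40, kernel=gaussian_kernel(3))
print(rewards[(20, 40)])  # 1.0
```

Each entry of the returned mapping could then be used as the reward for the update of the corresponding neighbouring state, so that one observed reward informs several nearby table entries.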

The comparison shows that the gaussian reward converges faster to a defined threshold. mCR10 is the mean of the cumulative reward signal over the last 10 steps.
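Read literally, the mCR10 definition can be sketched as below. The function name `mcr` is hypothetical, and whether a "step" is an environment step or an episode is not specified here, so the input is simply treated as a sequence of rewards.

```python
from itertools import accumulate


def mcr(rewards, window: int = 10) -> float:
    """Mean of the last `window` values of the cumulative reward signal."""
    cumulative = list(accumulate(rewards))  # running sum of the rewards
    tail = cumulative[-window:]
    return sum(tail) / len(tail)  # assumes at least one reward


print(mcr([1] * 10))  # 5.5
```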

Reward kernel: 3x3 vs 5x5

The comparison shows that the 5x5 reward kernel converges faster than the 3x3 one. mCR10 is the mean of the cumulative reward signal over the last 10 steps.

3x3 Kernel

The following images show the qtable state (in the 3x3 smoothed reward setting) for each action of the racket.

The title of each subplot gives the coordinate position of the racket when the action is performed. The subplot itself shows the ball position. In short, it tells whether it is good (white) or bad (black) for the racket to be in that position (the number in the subplot title) while performing that action.
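Such a subplot corresponds to one 2D slice of the qtable. The sketch below extracts a slice and rescales it to grayscale; the table layout (ball row, ball column, racket position, action) and the function name `ball_position_map` are assumptions, and the table here is filled with random values purely for illustration.

```python
import numpy as np


def ball_position_map(qtable: np.ndarray, racket_pos: int, action: int) -> np.ndarray:
    """Return a 2D map over ball positions scaled to [0, 1] (1 = white = good)."""
    values = qtable[:, :, racket_pos, action].astype(np.float64)
    lo, hi = values.min(), values.max()
    if hi == lo:  # flat slice: nothing learned yet, return mid-gray
        return np.full_like(values, 0.5)
    return (values - lo) / (hi - lo)


# Dummy qtable for illustration only (random values, not trained).
qtable = np.random.default_rng(0).normal(size=(53, 80, 53, 6))
image = ball_position_map(qtable, racket_pos=26, action=2)
print(image.shape)  # (53, 80)
```

The resulting array can be passed to any image plotting routine with a gray colormap.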

5x5 Kernel

The following images show the qtable state for each action of the Pong racket after training with a 5x5 smoothed reward. The images are read in the same way as described in the 3x3 Kernel section.
