This module contains the code necessary to implement a Thompson sampling strategy in a multi-armed bandit setting.
The module was developed on Python 3.7.3.
Install the requirements with:
pip3 install -r requirements.txt
Import the class with:
from thompson_sampling.thompson_sampling import Thompson
To initialize the class:
ts = Thompson(success_probs)
where
success_probs
: an array of floats in the [0, 1] interval, the success probability of each bandit;
plus the following optional parameters:
steps
: the number of steps to iterate over, defaults to 1000

alpha_damping
: reduces the tendency of TS towards exploitation (defaults to 1, no damping)

beta_damping
: reduces the tendency of TS towards exploration (defaults to 1, no damping)

alpha_init
: an array of non-negative floats, sets the initial conditions for the prior distribution of each bandit (defaults to 0, uniform priors)

beta_init
: an array of non-negative floats, sets the initial conditions for the prior distribution of each bandit (defaults to 0, uniform priors)

optimistic
: whether to use an optimistic TS strategy, whereby a lower bound is put on sampling (defaults to False)

optimistic_threshold
: the lower bound for sampling in an optimistic strategy (defaults to 1.e-6)
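To make the parameters concrete, here is a minimal standard-library sketch of a single Thompson draw, not the module's actual code: the reading of the optimistic option as clipping each posterior draw from below at optimistic_threshold is an assumption based on the description above, and the damping parameters are not modelled.

```python
import random

random.seed(1)

# Sketch only: assumed semantics, not the module's implementation.
success_probs = [0.1, 0.5, 0.8]   # as would be passed to Thompson(success_probs)
alpha_init = [0.0] * 3            # documented defaults: 0, i.e. uniform priors
beta_init = [0.0] * 3
optimistic_threshold = 1.e-6

# random.betavariate needs strictly positive parameters, so an init of 0 is
# read here as the uniform Beta(1, 1) prior.
draws = [max(random.betavariate(a + 1, b + 1), optimistic_threshold)
         for a, b in zip(alpha_init, beta_init)]

# Play the bandit whose posterior draw is highest.
chosen = max(range(3), key=lambda i: draws[i])
```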
The multi-armed bandit problem is an unsupervised-learning problem in which a fixed set of limited resources must be allocated between competing choices without prior knowledge of the rewards offered by each of them, which must be instead learned on the go.
The name refers to the hypothetical situation in which a player must choose between a set of slot machines ("one-armed bandits"), each paying out with a different, unknown probability, so as to maximize the total winnings over a sequence of plays.
Practically, the issue of the trade-off between exploitation and exploration is one faced in many online reinforcement-learning problems, such as offering a (new or relatively new) app user the optimal UX flow (success being continued user engagement), picking between different banner ads that can be displayed on a website (success being a click or a conversion), etc.
Thompson sampling (TS) is an approach to solving the aforementioned exploitation-exploration problem that attempts to converge towards the optimal solution by finding a balance between exploiting what is known to maximize immediate performance on the one hand, and investing to accumulate new information that may improve future performance on the other. The second aspect is what distinguishes this class of solutions from so-called greedy approaches, which instead only maximize (based on the historical data at hand) the immediate reward. Greedy approaches have a higher chance of getting stuck in local maxima, and cannot thoroughly explore the available parameter space, that is, try out options with larger uncertainties over their average outcomes.
While a greedy approach would assess at each round the expected success probability of each bandit from its average observed reward and always play the current best, TS instead draws a random sample from each bandit's posterior probability distribution of rewards and plays the bandit whose sample is highest.
The posterior probability distribution is then updated based on the observed result, so becoming the prior distribution for the next round. This update rule, mixing prior assumptions about the probability distribution and the empirical observations, is what makes TS a Bayesian approach: in Bayesian statistics, the probability of an event is based both on data (the fresh empirical observations, which in our case are the results of each round of lever pulling) and on prior information or beliefs about the event (in our case, the rewards expected from each machine, initially guessed at random and then adjusted after every round of pulling). This second aspect is what distinguishes Bayesian methods from purely frequentist ones, which are only based on the observed data and do not incorporate the concept of priors.
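The whole sample-play-update cycle can be sketched in a few lines of standard-library Python; this is a toy re-implementation for illustration, not the module's code, and the success probabilities and step count are arbitrary.

```python
import random

random.seed(42)
success_probs = [0.2, 0.8]        # true (hidden) success probability of each bandit
steps = 2000
alpha = [1] * len(success_probs)  # Beta posterior parameters; Beta(1, 1) = uniform prior
beta = [1] * len(success_probs)
pulls = [0] * len(success_probs)

for _ in range(steps):
    # Sample an expected reward from each bandit's current posterior...
    draws = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    # ...and play the bandit with the highest draw.
    i = max(range(len(draws)), key=lambda k: draws[k])
    reward = 1 if random.random() < success_probs[i] else 0
    # The updated posterior becomes the prior for the next round.
    alpha[i] += reward
    beta[i] += 1 - reward
    pulls[i] += 1

# After enough rounds, the better bandit is pulled far more often than the other.
```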
In the Beta-Bernoulli schematization of the bandit problem, each bandit pays a reward of 1 with a fixed but unknown probability and 0 otherwise, so that each pull is a Bernoulli trial; each time an action is taken, the observed success or failure is used to update our estimate of that bandit's success probability. Statistically, the conjugate prior distribution of expected rewards for a Bernoulli random variable is the so-called Beta distribution, defined by two shape parameters, alpha and beta. A useful property of the distributions so built is that each posterior is again a Beta distribution, obtained from the prior simply by adding 1 to alpha after a success and 1 to beta after a failure.
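This conjugate update can be checked numerically; the prior and the observation counts below are arbitrary.

```python
from math import isclose

# Conjugate Beta-Bernoulli update: a Beta(alpha, beta) prior, after observing
# k successes in n Bernoulli trials, yields a Beta(alpha + k, beta + n - k) posterior.
alpha0, beta0 = 1.0, 1.0   # Beta(1, 1): the uniform prior
k, n = 30, 40              # say, 30 successes in 40 pulls

alpha1, beta1 = alpha0 + k, beta0 + (n - k)

def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

prior_mean = beta_mean(alpha0, beta0)      # 0.5
posterior_mean = beta_mean(alpha1, beta1)  # 31/42, pulled toward the empirical 30/40
assert isclose(posterior_mean, 31 / 42)
# The posterior is also more concentrated than the prior:
assert beta_var(alpha1, beta1) < beta_var(alpha0, beta0)
```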
As we play and gather evidence, our posterior distributions become more concentrated (the more so, the more often a bandit is played, which favors the more rewarding ones), increasing our chance of maximizing the sum of the collective rewards: exploration reduces over time, and exploitation is increasingly privileged as the uncertainties are reduced. Meanwhile the per-round regret, or the reward expected under an optimal strategy minus the actually collected reward, asymptotically converges to zero (so that the cumulative regret grows only sublinearly).
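The vanishing per-round regret can be observed in a small simulation; again this is a standard-library sketch with arbitrary success probabilities, not the module's code.

```python
import random

random.seed(7)
success_probs = [0.3, 0.5, 0.7]
best = max(success_probs)
steps = 3000
alpha = [1] * len(success_probs)  # Beta(1, 1) uniform priors
beta = [1] * len(success_probs)
regret = []                       # expected per-round regret of each choice

for _ in range(steps):
    draws = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    i = max(range(len(draws)), key=lambda k: draws[k])
    regret.append(best - success_probs[i])  # 0 whenever the best bandit is played
    reward = 1 if random.random() < success_probs[i] else 0
    alpha[i] += reward
    beta[i] += 1 - reward

early = sum(regret[:300]) / 300    # average regret over the first 300 rounds
late = sum(regret[-300:]) / 300    # average regret over the last 300 rounds
# late should be much smaller than early: exploration has paid off.
```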
Tullio Bagnoli
This project is licensed under the MIT license.