Portfolio Optimization with Deep Bayesian Bandits

AI Portfolio Manager: optimizing asset allocation
by means of reinforcement learning.

Implementation of Linear Full Posterior Bandits for portfolio optimization

This implementation follows the paper "Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling", published at ICLR 2018.

@article{riquelme2018deep,
  title={Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling},
  author={Riquelme, Carlos and Tucker, George and Snoek, Jasper},
  journal={International Conference on Learning Representations, ICLR.},
  year={2018}
}

Installation

WIP

Usage

WIP

Contextual Bandits

Contextual bandits are a rich decision-making framework where an algorithm has to choose among a set of k actions at every time step t, after observing a context (or side-information) denoted by X_t. The general pseudocode for the process if we use algorithm A is as follows:

At time t = 1, ..., T:
  1. Observe new context: X_t
  2. Choose action: a_t = A.action(X_t)
  3. Observe reward: r_t
  4. Update internal state of the algorithm: A.update((X_t, a_t, r_t))

The goal is to maximize the total sum of rewards: ∑_t r_t
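
The loop above can be sketched in Python as follows. This is only an illustration of the interface: the EpsilonGreedy class, the run_bandit helper, and the toy reward function are hypothetical names introduced here and are not part of this repository. The stand-in algorithm ignores the context entirely; it is there only to show where action() and update() are called.

```python
import random


class EpsilonGreedy:
    """Hypothetical stand-in for algorithm A (context-free epsilon-greedy)."""

    def __init__(self, num_actions, epsilon=0.1, seed=0):
        self.num_actions = num_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * num_actions   # pulls per action
        self.means = [0.0] * num_actions  # running mean reward per action

    def action(self, context):
        # Explore uniformly with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_actions)
        return max(range(self.num_actions), key=lambda a: self.means[a])

    def update(self, datapoint):
        _context, a, r = datapoint
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]


def run_bandit(algorithm, contexts, reward_fn):
    """Run the contextual bandit loop and return the total reward."""
    total = 0.0
    for x_t in contexts:                   # 1. observe new context
        a_t = algorithm.action(x_t)        # 2. choose action
        r_t = reward_fn(x_t, a_t)          # 3. observe reward
        algorithm.update((x_t, a_t, r_t))  # 4. update internal state
        total += r_t
    return total


# Toy environment (an assumption for illustration): only action 2 pays off.
algo = EpsilonGreedy(num_actions=3)
total = run_bandit(algo, contexts=[None] * 500,
                   reward_fn=lambda x, a: 1.0 if a == 2 else 0.0)
```

Any object exposing action(context) and update((context, action, reward)) can be passed to run_bandit in place of the stand-in.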

Thompson Sampling

Thompson Sampling is a meta-algorithm that chooses an action for the contextual bandit in a statistically efficient manner, simultaneously finding the best arm while attempting to incur low cost. Informally speaking, we assume the expected reward is given by some function E[r_t | X_t, a_t] = f(X_t, a_t). Unfortunately, the function f is unknown; otherwise we could simply choose the action with the highest expected value: a_t^* = arg max_i f(X_t, a_i).

The idea behind Thompson Sampling is based on keeping a posterior distribution π_t over functions in some family f ∈ F after observing the first t-1 datapoints. Then, at time t, we sample one potential explanation of the underlying process: f_t ∼ π_t, and act optimally (i.e., greedily) according to f_t. In other words, we choose a_t = arg max_i f_t(X_t, a_i). Finally, we update our posterior distribution with the new collected datapoint (X_t, a_t, r_t).
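
To make the sample-then-act-greedily step concrete, here is a minimal sketch for the special case where the family F is finite, so the posterior π_t is just a vector of weights over candidate functions and can be updated exactly. The function names, the toy family, and the Gaussian reward likelihood are all assumptions made for illustration; they do not come from this repository.

```python
import math
import random


def thompson_step(candidates, log_weights, context, num_actions, rng):
    """Sample f_t ~ pi_t, then act greedily with respect to f_t."""
    # Turn log-weights into sampling weights (shift by the max for stability).
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    idx = rng.choices(range(len(candidates)), weights=weights)[0]
    f_t = candidates[idx]
    return max(range(num_actions), key=lambda a: f_t(context, a))


def posterior_update(candidates, log_weights, datapoint, noise_std=1.0):
    """Exact Bayes update, assuming Gaussian reward noise with known std."""
    x, a, r = datapoint
    return [lw - 0.5 * ((r - f(x, a)) / noise_std) ** 2
            for f, lw in zip(candidates, log_weights)]


# Toy family F (an assumption): candidate k says "action k pays x, others 0".
rng = random.Random(0)
candidates = [lambda x, a, k=k: x if a == k else 0.0 for k in range(3)]
true_f = candidates[1]   # the unknown f, hidden from the algorithm
log_w = [0.0] * 3        # uniform prior over the three candidates

for t in range(200):
    x_t = rng.uniform(0.5, 1.5)
    a_t = thompson_step(candidates, log_w, x_t, 3, rng)
    r_t = true_f(x_t, a_t) + rng.gauss(0.0, 0.1)
    log_w = posterior_update(candidates, log_w, (x_t, a_t, r_t), noise_std=0.1)

best_candidate = max(range(3), key=lambda k: log_w[k])
```

Exploration here comes only from the randomness of the posterior sample: while several candidates remain plausible, different ones get sampled and different actions get tried.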

The main issue is that keeping an updated posterior π_t (or even sampling from it) is often intractable for highly parameterized models like deep neural networks. The algorithms compared in the paper cited above provide tractable approximations that can be used in combination with Thompson Sampling to solve the contextual bandit problem.
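
For a linear reward model with Gaussian noise, by contrast, the posterior is available in closed form, which is what makes a linear full-posterior approach tractable. The sketch below is a simplified illustration, not this repository's implementation: it assumes one independent Bayesian linear regression per action, a zero-mean isotropic Gaussian prior, and a known noise variance. The class name, parameters, and synthetic data are hypothetical.

```python
import numpy as np


class LinearGaussianThompson:
    """Thompson Sampling with one Bayesian linear regression per action.

    Assumes r = w_a . x + noise, noise ~ N(0, noise_var), with noise_var known.
    """

    def __init__(self, num_actions, context_dim,
                 prior_var=1.0, noise_var=0.25, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise_var = noise_var
        # Posterior for action a is N(mean_a, inv(precision[a])).
        self.precision = np.stack(
            [np.eye(context_dim) / prior_var for _ in range(num_actions)])
        # b[a] accumulates sum of x * r / noise_var for action a.
        self.b = np.zeros((num_actions, context_dim))

    def posterior(self, a):
        cov = np.linalg.inv(self.precision[a])
        return cov @ self.b[a], cov

    def action(self, context):
        sampled_values = []
        for a in range(len(self.b)):
            mean, cov = self.posterior(a)
            # Draw w ~ N(mean, cov) through a Cholesky factor of cov.
            z = self.rng.standard_normal(len(mean))
            w = mean + np.linalg.cholesky(cov) @ z
            sampled_values.append(w @ context)
        return int(np.argmax(sampled_values))

    def update(self, datapoint):
        x, a, r = datapoint
        x = np.asarray(x, dtype=float)
        self.precision[a] += np.outer(x, x) / self.noise_var
        self.b[a] += x * r / self.noise_var


# Synthetic environment (an assumption for illustration only).
env_rng = np.random.default_rng(1)
true_w = np.array([[0.1, 0.0], [0.5, 0.2], [-0.3, 0.4]])
agent = LinearGaussianThompson(num_actions=3, context_dim=2)
total_reward = 0.0
for t in range(500):
    x_t = env_rng.normal(size=2)
    a_t = agent.action(x_t)
    r_t = true_w[a_t] @ x_t + env_rng.normal(scale=0.5)
    agent.update((x_t, a_t, r_t))
    total_reward += r_t
```

Treating the noise variance as known keeps the update to two lines. A model that also places a prior on the noise variance would need a correspondingly richer posterior.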
