
# mdp-reinforcement-learning

Implementing RL algorithms (Value Iteration and Q-Learning) on three Markov Decision Process data sets. The goal is to find the best policy for each MDP.

## Data Sets

We are given three data sets containing sample data from different Markov Decision Processes. Based on the sampled states, actions, next states, and rewards, we were tasked with finding the policy (an action for each state) that maximizes the total expected reward. An online scoring function provides the value of each policy.

## RL Algorithms

I used two methods to learn the expected utility of each state – Value Iteration and Q-learning. I chose to use Value Iteration for the small and medium data sets because we had a known model for the MDP and there was a relatively high ratio of data points to existing states. On the other hand, Q-Learning was better for the large data set as the model was unknown and we had to perform some exploration. Knowing the expected utility of actions for each state, I was able to extract the policy for each MDP.

### Model-Based Reinforcement Learning: Value Iteration

For the first two data sets, I used Maximum Likelihood Estimation to learn the Transition and Reward Functions, and then implemented Value Iteration to find the expected value of each state. From this, I extracted the best policy by taking the action that corresponded to the highest expected utility for each state.
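A minimal sketch of this pipeline, assuming the sample data is a list of `(s, a, r, s')` tuples with integer state and action indices; the function names and the discount factor `gamma` are illustrative assumptions, not the original code:

```python
import numpy as np
from collections import defaultdict

def estimate_model(samples):
    """Maximum likelihood estimates of T(s' | s, a) and R(s, a) from (s, a, r, s') samples."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    reward_sum = defaultdict(float)                  # (s, a) -> summed reward
    visits = defaultdict(int)                        # (s, a) -> number of visits
    for s, a, r, sp in samples:
        counts[(s, a)][sp] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    T = {sa: {sp: c / visits[sa] for sp, c in nexts.items()} for sa, nexts in counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return T, R

def value_iteration(T, R, n_states, n_actions, gamma=0.95, tol=1e-6):
    """Apply the Bellman optimality update until the state utilities converge."""
    U = np.zeros(n_states)
    while True:
        U_new = np.zeros(n_states)
        for s in range(n_states):
            qs = [R[(s, a)] + gamma * sum(p * U[sp] for sp, p in T[(s, a)].items())
                  for a in range(n_actions) if (s, a) in R]
            U_new[s] = max(qs) if qs else 0.0        # unvisited states keep utility 0
        if np.max(np.abs(U_new - U)) < tol:
            return U_new
        U = U_new

def extract_policy(T, R, U, n_states, n_actions, gamma=0.95, default_action=0):
    """Greedy policy: in each state, take the action with the highest expected utility.
    States with no observed (s, a) data fall back to a default (or random) action."""
    policy = np.full(n_states, default_action, dtype=int)
    for s in range(n_states):
        qs = {a: R[(s, a)] + gamma * sum(p * U[sp] for sp, p in T[(s, a)].items())
              for a in range(n_actions) if (s, a) in R}
        if qs:
            policy[s] = max(qs, key=qs.get)
    return policy
```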

**Table 1: Results for Value Iteration on the Small and Medium Data Sets**

| Data Set | # States | Actions | Score      | Time (s) | % of states with optimal policy found |
|----------|----------|---------|------------|----------|----------------------------------------|
| Small    | 100      | 5       | 52971.6335 | 6.389    | 100                                    |
| Medium   | 251,001  | 2       | 6339.6766  | 59.069   | 50.16                                  |

#### Small Dataset – Grid World

Value Iteration worked well for the small data set because every state had been visited in the sample data. Hence, the estimated transition and reward functions were quite accurate and produced reliable utility values for the different states.

#### Medium Dataset – Inverted Cart Pendulum

The second data set had many more states than the first. As can be seen in Table 1, only about half of all the states were present in the sample. Hence, our transition and reward functions were less accurate and could not predict the exact utility of each state. States where no best action could be determined from the calculated utilities were assigned a random action. Overall, this method was effective at finding a policy for states where we had nearly complete information about the rewards, but it fell short when the data was sparse and there was no known model of the world.

### Model-Free Reinforcement Learning: Q-Learning

For the large data set, Q-learning seemed to be the most appropriate method since the model of the MDP was unknown and many of the states were not present in the sample data. Q-learning is a learning algorithm in which we search through the state space and apply incremental updates to the Q-value (the expected utility) of each state-action pair as rewards are observed. After searching through the space for a certain period of time, we use the approximated Q-values to determine the best action for each state.
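A rough sketch of this update rule; the learning rate `alpha`, discount `gamma`, and the replay over the sample data are illustrative assumptions, not details from the original code:

```python
import numpy as np

def q_learning(samples, n_states, n_actions, gamma=0.95, alpha=0.1, n_passes=10):
    """Incrementally update Q(s, a) from observed (s, a, r, s') transitions."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_passes):                       # replay the sample data several times
        for s, a, r, sp in samples:
            target = r + gamma * np.max(Q[sp])      # bootstrap from the best successor value
            Q[s, a] += alpha * (target - Q[s, a])   # move Q(s, a) toward the target
    policy = np.argmax(Q, axis=1)                   # greedy policy over the learned Q-values
    return Q, policy
```

For a state space as large as the third data set, a sparse dictionary Q-table would likely be used in place of the dense array shown here.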

**Table 2: Results for Q-Learning on the Large Data Set**

| Data Set | # States | Actions | Score     | Time (s) | % of states with optimal policy found |
|----------|----------|---------|-----------|----------|----------------------------------------|
| Large    | 531,441  | 157     | -366.0502 | 122.089  | 0.0129                                 |

#### Large Dataset – Unknown Model

As can be seen in Table 2, the majority of states were not given an optimal action. This can partially be attributed to constraints on the amount of data provided. The exploration implemented in the algorithm was limited to the (state, action, successor state) tuples already observed in the sample data, and so did not represent true exploration of the state space.

To improve this algorithm further, I would allow exploration of the state space by choosing a random start state and proceeding through successor states, instead of only following the data points given in the sample data. Another option is to use eligibility traces, which propagate observed rewards back to previously visited states so that rewards observed later are reflected in the utilities of states visited earlier; a sketch of this idea is shown below.
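A simplified sketch of the eligibility-trace idea (a Watkins-style Q(λ) update without the usual trace reset on non-greedy actions; `lam`, `alpha`, and `gamma` are illustrative assumptions):

```python
from collections import defaultdict

def q_lambda(samples, n_actions, gamma=0.95, alpha=0.1, lam=0.9):
    """Q-learning with eligibility traces: each TD error is also credited to
    recently visited state-action pairs in proportion to a decaying trace."""
    Q = defaultdict(float)        # sparse Q-table: (s, a) -> value
    trace = defaultdict(float)    # eligibility traces: (s, a) -> trace weight
    for s, a, r, sp in samples:
        best_next = max(Q.get((sp, b), 0.0) for b in range(n_actions))
        delta = r + gamma * best_next - Q[(s, a)]    # TD error for this transition
        trace[(s, a)] += 1.0                         # mark (s, a) as just visited
        for sa in list(trace):
            Q[sa] += alpha * delta * trace[sa]       # propagate the error backwards
            trace[sa] *= gamma * lam                 # older visits get less credit
            if trace[sa] < 1e-4:
                del trace[sa]                        # drop negligible traces
    return Q
```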

## Notes

Code written for a Stanford CS 238 class project.
