This example is drawn from “A Painless Q-Learning Tutorial” at http://mnemstudio.org/path-finding-q-learning-tutorial.htm which explains how to manually calculate iterations using the updating equation for Q-Learning, based on the Bellman Equation (image from https://www.is.uni-freiburg.de/ressourcen/business-analytics/13_reinforcementlearning.pdf):

The example explores path-finding through a house:

The question to be answered here is: **What’s the best way to get from Room 2 to Room 5 (outside)?** Notice that by answering this question using reinforcement learning, we will also know how to find optimal routes from any room to outside. And if we run the iterative algorithm again for a new target state, we can find out the optimal route from any room to that new target state.

Since Q-Learning is model-free, we don’t need to know how likely it is that our agent will move between any room and any other room (the transition probabilities). If you had observed the behavior in this system over time, you *might* be able to find that information, but it many cases it just isn’t available. So the key for this problem is to construct a Rewards Matrix that explains the benefit (or penalty!) of attempting to go from one state (room) to another.

Assigning the rewards is somewhat arbitrary, but you should give a large positive value to your target state and negative values to states that are impossible or highly undesirable. Here’s the guideline we’ll use for this problem:

- -1 if “you can’t get there from here”
- 0 if the destination is not the target state
- 100 if the destination
*is*the target state

We’ll start constructing our rewards matrix by listing the states we’ll come **FROM down the rows**, and the states we’ll go **TO in the columns**. First, let’s fill the diagonal with -1 rewards, because we don’t want our agent to stay in the same place (that is, move from Room 1 to Room 1, or from Room 2 to Room 2, and so forth). The final one gets a 100 because if we’re already in Room 5, we want to stay there.

Next, let’s move across the first row. Starting in Room 0, we only have one choice: go to Room 4. All other possibilities are blocked (-1):

Now let’s fill in the row labeled 1. From Room 1, you have two choices: go to Room 3 (which is not great but permissible, so give it a 0) or go to Room 5 (the target, worth 100)!

Continue moving row by row, determining if you can’t get there from here (-1), you can but it’s not the target (0), or it’s the target(100). You’ll end up with a final rewards matrix that looks like this:

Now create this rewards matrix in R:

R <- matrix(c(-1, -1, -1, -1, 0, -1, -1, -1, -1, 0, -1, 100, -1, -1, -1, 0, -1, -1, -1, 0, 0, -1, 0, -1, 0, -1, -1, 0, -1, 100, -1, 0, -1, -1, 0, 100), nrow=6, ncol=6, byrow=TRUE)

And run the code. Notice that we’re calling the target state 6 instead of 5 because even though we have a room labeled with a zero, our matrix starts with a 1s so we have to adjust:

source("https://raw.githubusercontent.com/NicoleRadziwill/R-Functions/master/qlearn.R") results <- q.learn(R,10000,alpha=0.1,gamma=0.8,tgt.state=6) > round(results) [,1] [,2] [,3] [,4] [,5] [,6] [1,] 0 0 0 0 80 0 [2,] 0 0 0 64 0 100 [3,] 0 0 0 64 0 0 [4,] 0 80 51 0 80 0 [5,] 64 0 0 64 0 100 [6,] 0 80 0 0 80 100

You can read this table of average value to obtain policies. A policy is a “path” through the states of the system:

- Start at Room 0 (first row, labeled 1): Choose Room 4 (80), then from Room 4 choose Room 5 (100)
- Start at Room 1: Choose Room 5 (100)
- Start at Room 2: Choose Room 3 (64), from Room 3 choose Room 1 or Room 4 (80); from 1 or 4 choose 5 (100)
- Start at Room 3: Choose Room 1 or Room 4 (80), then Room 5 (100)
- Start at Room 4: Choose Room 5 (100)
- Start at Room 5: Stay at Room 5 (100)

**To answer the original problem, we would take route 2-3-1-5 or 2-3-4-5 to get out the quickest if we started in Room 2.** This is easy to see with a simple map, but is much more complicated when the maps get bigger.

Categories: Data Science, R

NOTE: The definition of the R matrix on the R-Bloggers page is wrong and **different** than it is on this page. The one on THIS page is correct, as can be seen by comparing the image of the rewards matrix to the numbers that are typed in below. Something odd happened with the code editor on WordPress that jumbled the numbers as it was being picked up by R-Bloggers.