
Sutton & Barto Gridworld example in C#


Lately, I've been exploring various examples from Sutton and Barto's "Reinforcement Learning: An Introduction" book using C#, and I've already shared a few of them on this blog.

Today I'll be focusing on the gridworld example from chapter 3 of the book. The code is available in the existing repo as a new project. Gridworld is a simple example used to illustrate the Bellman equations and iterative policy evaluation. An excerpt from the book describes the environment:

The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of -1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A'. From state B, all actions yield a reward of +5 and take the agent to B'.

— Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed., Chapter 3.

The value function for each state is updated using the Bellman expectation equation for policy evaluation:

v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma v_{\pi}(s')], \quad \forall s \in S

The components of the equation are:

  • v_{\pi}(s): the value of state s under policy \pi; this is what we want to compute.
  • \pi(a|s): the probability of taking action a in state s. This is called the policy.
  • p(s',r|s,a): the probability of transitioning to state s' and receiving reward r after taking action a in state s.
  • \gamma: the discount rate, which determines the importance of future rewards; it is a value between 0 and 1, and in our case it is set to 0.9.

Now the example proceeds by giving us the policy: the agent selects each action with equal probability \pi(a|s) = \frac{1}{4}, so we can simplify the equation:

v_{\pi}(s) = \frac{1}{4} \sum_{a} \sum_{s',r} p(s',r|s,a) [r + \gamma v_{\pi}(s')], \quad \forall s \in S

Because the environment is deterministic, for each state-action pair there is exactly one next state s' and reward (with probability 1). Therefore the update simplifies to:

v_{\pi}(s) = \frac{1}{4} \sum_{a} [r + \gamma v_{\pi}(s')]
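
As a concrete example (my own illustration, not from the book), consider the top-left corner cell: moving north or west would take the agent off the grid, so it stays in place and receives -1, while moving east or south lands on the neighbouring cells with a reward of 0:

v_{\pi}(s) = \frac{1}{4} \left[ (-1 + \gamma v_{\pi}(s)) + (-1 + \gamma v_{\pi}(s)) + (0 + \gamma v_{\pi}(s_{\text{east}})) + (0 + \gamma v_{\pi}(s_{\text{south}})) \right]

where s is the corner state and s_{\text{east}}, s_{\text{south}} are the cells to its right and below it.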

Using this formula we iteratively update the value function for each state until the values converge within a given tolerance.

The implementation

Regarding the implementation, I mostly followed the sample Lisp code provided by the authors at http://incompleteideas.net/book/code/gridworld5x5.lisp. However, I used clearer variable names, an enum for the actions, better next-state and full-backup functions, and other minor improvements. The original full-backup also computes the next state for a subset of cases; I decided to handle all cases in my NextState method and use FullBackup only to compute the value of a given state-action pair.

Part of the Lisp code's complexity, which I preserved in the C# port, is the mapping between state indices (0–24) and grid coordinates (row and column). It's not clear why the original maps states to indices this way; I kept the mapping for fidelity to the original implementation.

As a side note, I ran the Lisp code with SBCL to validate the results and the methods I ported to C#. A function turned out to be missing, so I added it and provided an updated Lisp version in my repo here.

I decided to use a GridWorld class to hold the global state and the required functions.
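
The post only shows individual methods, so here is a rough skeleton of how such a class could look. The field names are taken from the snippets below; the special-state indices are inferred from the grid diagram at the end of the post, \gamma = 0.9 comes from the earlier discussion, and GridSize/StateCount are names I chose for this sketch (the repo version may differ):

class GridWorld
{
    // 5x5 grid, states 0..24 in row-major order (see the grid diagram at the end of the post)
    private const int GridSize = 5;
    private const int StateCount = GridSize * GridSize;

    // Special states and their target states (indices inferred from the grid diagram)
    private readonly int _specialStateA = 1;        // A  at row 0, col 1
    private readonly int _specialStateAPrime = 21;  // A' at row 4, col 1
    private readonly int _specialStateB = 3;        // B  at row 0, col 3
    private readonly int _specialStateBPrime = 13;  // B' at row 2, col 3

    private readonly double _gamma = 0.9;                  // discount rate
    private readonly double[] _v = new double[StateCount]; // current value estimates

    // NextState, FullBackup, ValueFunction, ... shown below
}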

Looking at the simplified Bellman equation, we can see that we need to compute s' given a starting state and an action; this is implemented in the NextState method of the GridWorld class:

int NextState(int state, Action action)
{
    if (state == _specialStateA)
    {
        return _specialStateAPrime;
    }

    if (state == _specialStateB)
    {
        return _specialStateBPrime;
    }

    // OffGrid returns true if the action would take the agent off the grid
    if (OffGrid(state, action))
    {
        return state;
    }

    var (row, col) = CoordinatesFromState(state);
    return action switch
    {
        Action.East => StateFromCoordinates(row, col + 1),
        Action.South => StateFromCoordinates(row + 1, col),
        Action.West => StateFromCoordinates(row, col - 1),
        Action.North => StateFromCoordinates(row - 1, col),
        _ => throw new ArgumentOutOfRangeException(nameof(action), "Invalid action"),
    };
}
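
NextState relies on three small helpers that are not shown in the post: OffGrid, CoordinatesFromState and StateFromCoordinates. Here is a minimal sketch of them, assuming the row-major 0–24 indexing from the grid diagram at the end of the post and the GridSize constant from the skeleton above (the repo version may be organized differently):

// Maps a state index to (row, col), assuming row-major order on the 5x5 grid
(int row, int col) CoordinatesFromState(int state) => (state / GridSize, state % GridSize);

// Maps (row, col) back to a state index
int StateFromCoordinates(int row, int col) => row * GridSize + col;

// True if taking the action from this state would leave the grid
bool OffGrid(int state, Action action)
{
    var (row, col) = CoordinatesFromState(state);
    return action switch
    {
        Action.North => row == 0,
        Action.South => row == GridSize - 1,
        Action.West => col == 0,
        Action.East => col == GridSize - 1,
        _ => throw new ArgumentOutOfRangeException(nameof(action), "Invalid action"),
    };
}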

The sum in the formula adds one term per action; the term for a given state and action is computed in the FullBackup method:

double FullBackup(int state, Action a)
{
    var nextState = NextState(state, a);
    double reward = state switch
    {
        _ when state == _specialStateA => 10,
        _ when state == _specialStateB => 5,
        // implicitly handles off-grid moves
        _ when nextState == state => -1,
        _ => 0
    };

    return reward + (_gamma * _v[nextState]);
}

The implementation of the value function is the following. Since there are four actions, taking the average divides the sum by 4:

private double ValueFunction(int state)
    => Enum.GetValues<Action>()
        .Select(a => FullBackup(state, a))
        .Average();

The rest of the implementation is almost a one-to-one mapping from the Lisp code to C#. The value function is updated in a loop until convergence.
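
That loop is not reproduced in this post, so here is a sketch of how such a sweep could look. The tolerance parameter and the in-place update (reusing values already updated in the current sweep, which also converges for policy evaluation) are my own choices; the repo version follows the Lisp original more closely:

// Repeatedly sweep all states until the largest change falls below the tolerance
void Evaluate(double tolerance = 1e-4)
{
    double delta;
    do
    {
        delta = 0;
        for (var state = 0; state < StateCount; state++)
        {
            var oldValue = _v[state];
            _v[state] = ValueFunction(state);
            delta = Math.Max(delta, Math.Abs(oldValue - _v[state]));
        }
    } while (delta >= tolerance);
}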

How to run the sample
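
Assuming the project is a standard .NET console app like the other examples in the repo, cloning the repo and running dotnet run from the gridworld project directory (with a recent .NET SDK installed) should be enough to build and run the sample.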

Grid diagram and state mapping

To make the indexing clear, here's the 5x5 grid used in the example (rows increase downward, columns increase to the right). Special states A and B and their primes A' and B' are shown in the grid where applicable.

|       | Col 0 | Col 1   | Col 2 | Col 3   | Col 4 |
|-------|-------|---------|-------|---------|-------|
| Row 0 | 0     | 1 (A)   | 2     | 3 (B)   | 4     |
| Row 1 | 5     | 6       | 7     | 8       | 9     |
| Row 2 | 10    | 11      | 12    | 13 (B') | 14    |
| Row 3 | 15    | 16      | 17    | 18      | 19    |
| Row 4 | 20    | 21 (A') | 22    | 23      | 24    |