
4 posts tagged with "reinforcement-learning"


Sutton & Barto Gridworld example in C#

· 6 min read

Lately, I've been exploring various examples from Sutton and Barto's "Reinforcement Learning: An Introduction" book using C#, and I've already shared a few of them on this blog.

Today I'll be focusing on the gridworld example from chapter 3 of the book. The code is available in the existing repo as a new project. Gridworld is a simple example used to illustrate the Bellman equations and iterative policy evaluation. An excerpt from the book describes the environment:

The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of -1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A'. From state B, all actions yield a reward of +5 and take the agent to B'.

— Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed., Chapter 3.
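To make the link to iterative policy evaluation concrete, here is a minimal C# sketch of a value-function sweep for this gridworld. It assumes the book's setup from Example 3.5 (a 5×5 grid, discount $\gamma = 0.9$, the equiprobable random policy, A at row 0 column 1 mapping to A' at row 4 column 1, B at row 0 column 3 mapping to B' at row 2 column 3); the project in the repo may be organized differently.

```csharp
// Minimal sketch of iterative policy evaluation on the gridworld.
// Assumptions (not in the excerpt above): 5x5 grid, gamma = 0.9,
// equiprobable random policy, A/B positions as in the book's Example 3.5.
const int Size = 5;
const double Gamma = 0.9;
var value = new double[Size, Size];
var moves = new (int dr, int dc)[] { (-1, 0), (1, 0), (0, -1), (0, 1) }; // N, S, W, E

for (var sweep = 0; sweep < 1000; sweep++)
{
    var next = new double[Size, Size];
    for (var r = 0; r < Size; r++)
    for (var c = 0; c < Size; c++)
    foreach (var (dr, dc) in moves)
    {
        double reward;
        int nr, nc;
        if (r == 0 && c == 1) { reward = 10; (nr, nc) = (4, 1); }      // special state A -> A'
        else if (r == 0 && c == 3) { reward = 5; (nr, nc) = (2, 3); }  // special state B -> B'
        else
        {
            (nr, nc) = (r + dr, c + dc);
            if (nr < 0 || nr >= Size || nc < 0 || nc >= Size)
            { reward = -1; (nr, nc) = (r, c); }                        // off the grid: stay put, reward -1
            else reward = 0;
        }
        // Each of the four actions contributes with probability 0.25 under the random policy.
        next[r, c] += 0.25 * (reward + Gamma * value[nr, nc]);
    }
    value = next;
}
```

Repeating the sweep until the values stop changing yields the state-value function of the equiprobable random policy.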

Multi armed bandit exercise 2.5 with C#

· 6 min read

Recently I tried to code the 10-armed testbed example from chapter 2 of Sutton and Barto's Reinforcement Learning: An Introduction.

The chapter goes on to introduce new theory and strategies that improve on the approach shown in the 10-armed example. In particular, one of the topics is non-stationary problems.

The 10-armed testbed was a stationary problem: the probability distributions of the different actions don't change over time. If you remember the example, at the beginning of a round we compute 10 random values, which are then used as the means of the normal distributions from which the rewards are drawn at each step. The constant part is that these normal distributions don't change from one step to the next; they stay the same for the whole round.
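As a refresher, here is a minimal sketch of that stationary reward generation; the helper names are illustrative and not taken from the post's repository.

```csharp
using System;
using System.Linq;

var rng = new Random();

// Box-Muller transform: .NET's Random has no built-in normal sampler.
double Gaussian(double mean, double stdDev)
{
    var u1 = 1.0 - rng.NextDouble();
    var u2 = rng.NextDouble();
    return mean + stdDev * Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
}

// Drawn once at the beginning of the round and never touched again: the problem is stationary.
var trueValues = Enumerable.Range(0, 10).Select(_ => Gaussian(0, 1)).ToArray();

// At every step the reward for action a comes from the same fixed distribution N(q*(a), 1).
double Reward(int a) => Gaussian(trueValues[a], 1);

Console.WriteLine($"Sample reward for arm 3: {Reward(3)}");
```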

The focus of the exercise is to understand how the estimated-reward computation impacts the performance of the $\epsilon$-greedy strategy. In the 10-armed testbed, the estimated reward was computed by averaging the rewards obtained from each action when it was selected. Note that this approach gives every reward the same weight; in a non-stationary problem, where the probability distributions change over time, we would rather give more weight to recent rewards, because they better represent the current distribution the rewards are drawn from.
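Concretely, the two estimate updates the exercise asks us to compare look like this (a sketch with illustrative method names, not the post's actual code):

```csharp
// Sample average, incrementally computed: Q(a) <- Q(a) + (1/n) * (R - Q(a)).
// Every past reward ends up with exactly the same weight.
void UpdateSampleAverage(double[] q, int[] counts, int action, double reward)
{
    counts[action]++;
    q[action] += (reward - q[action]) / counts[action];
}

// Constant step size: Q(a) <- Q(a) + alpha * (R - Q(a)).
// Recent rewards weigh exponentially more, which suits non-stationary problems.
void UpdateConstantStepSize(double[] q, int action, double reward, double alpha = 0.1)
{
    q[action] += alpha * (reward - q[action]);
}
```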

The text of the exercise is:

Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for nonstationary problems. Use a modified version of the 10-armed testbed in which all the $q_*(a)$ start out equal and then take independent random walks (say by adding a normally distributed increment with mean 0 and standard deviation 0.01 to all the $q_*(a)$ on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, $\alpha = 0.1$. Use $\epsilon = 0.1$ and longer runs, say of 10,000 steps.

Figure 2.2 refers to the average reward graph and the best-arm selection rate graph, the same graphs I produced in the previous post. The $\epsilon = 0.1$ refers to the $\epsilon$-greedy strategy to be used, both with the sample averages and with the constant step-size parameter.
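The non-stationary part of the modified testbed then reduces to letting the true action values drift at every step. A minimal sketch, reusing the illustrative Gaussian helper from the snippet above:

```csharp
// Exercise 2.5 setup: all q*(a) start out equal and take independent random walks.
var trueValues = new double[10]; // all equal (zero) at the start of the round

// Called once per step: add an independent increment with mean 0 and
// standard deviation 0.01 to every q*(a).
void DriftTrueValues()
{
    for (var a = 0; a < trueValues.Length; a++)
        trueValues[a] += Gaussian(0, 0.01);
}
```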

Ten armed testbed for the Bandit problem with C#

· 10 min read

I'm continuing my attempt to reproduce examples from the Reinforcement Learning: An Introduction book using C#.

In a previous post I reproduced the tic-tac-toe example with some improvements and clarifications with respect to the original text. I think it's worth taking a look at it.

Today I'm reproducing the ten-armed testbed for the Bandit problem; in particular, I want to reproduce the two graphs showing the average reward improvement and the best-arm selection rate.

The problem, as stated in the book, is the following:

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
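Both graphs come from running an $\epsilon$-greedy agent on that testbed. As a minimal sketch (with illustrative names, not the repo's actual types), the action-selection step can look like this:

```csharp
// Epsilon-greedy selection: with probability epsilon pick a random arm (explore),
// otherwise pick the arm with the highest current estimate (exploit).
int SelectAction(double[] estimates, double epsilon, Random rng)
{
    if (rng.NextDouble() < epsilon)
        return rng.Next(estimates.Length);

    var best = 0;
    for (var a = 1; a < estimates.Length; a++)
        if (estimates[a] > estimates[best]) best = a;
    return best;
}
```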

Tic-tac-toe reinforcement learning with C#

· 9 min read

A couple of weeks ago I wanted to take a look at reinforcement learning and possibly work on a very simple sample in C#. In search of a book to learn some basics, I found Reinforcement Learning: An Introduction suggested in multiple places. The book is available for free as a PDF on the linked website, so I thought it would be a good starting point.

The book offers, in the very first chapter, a tic-tac-toe example where an algorithm is described, albeit without too much detail. I decided to try to implement a C# version of it. Plenty of implementations are available online, and the authors offer a Lisp version on their website, so there is a wide range of options to explore and evaluate.