2 posts tagged with "k-armed bandit problem"

Multi armed bandit exercise 2.5 with C#

May 1, 2025 · 6 min read

Recently I tried to code the 10 armed testbed example from chapter 2 of Sutton and Barto Reinforcement Learning: an introduction book.

The chapter continues introducing new theory elements and strategies to improve the approach shown in the 10 armed example. In particular, one of the points is about non-stationary problems.

The 10 armed testbed was a stationary problem, the probability distributions of the different actions don't change over time. If you remember the sample, at the beginning of the round we computed 10 random values, those values are then used to be the mean of a normal distribution from which we will pick the rewards at each step. The constant part is that this normal distributions don't change from a step to another, they stay the same for the whole round execution.

The focus of the exercise is to understand how the estimated reward computation impacts the performance of the $\epsilon$ -greedy strategy. In the 10 armed testbed, the estimate reward was computed averaging the rewards obtained from each action when selected. Note that this approach consider each reward with the same relative value, however in a non-stationary problem, where probability distributions change over time we would like to give more weight or importance to more recent rewards because they represent more realistically the current distribution the reward is generated from.

The text of the exercise is

Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for nonstationary problems. Use a modified version of the 10-armed testbed in which all the $q_{*}(a)$ start out equal and then take independent random walks (say by adding a normally distributed increment with mean 0 and standard deviation 0.01 to all the $q_{*}(a)$ on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, $\alpha$ = 0.1. Use $\epsilon = 0.1$ and longer runs, say of 10,000 steps.

Figure 2.2 refers to the average reward graph and the best arm selection rate graph, the same graphs I produced in the previous post. The $\epsilon=0.1$ refers to the $\epsilon$ -greedy strategy to be used, both in the case of sample averages and in the constant step-size parameter.

Ten armed testbed for the Bandit problem with C#

April 27, 2025 · 10 min read

I'm continuing my attempt to reproduce examples from Reinforcement Learning: An Introduction book using C#.

In a previous post I reproduced the tic-tac-toe example with some improvements and clarification with respect to the original text. I think it's worth taking a look at it.

Today I'm reproducing the ten armed testbed for the Bandit problem, in particular I want to reproduce the two graphs showing the average reward improvements and the selection rate of the best arm.

The problem, as stated in the book is the following:

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.