Reinforcement Learning (Part 2) - K-arm Bandit Problem
This is the second part of the Reinforcement Learning series: here I discuss how to dynamically balance exploration and exploitation to reach an optimal return in the long run.
Introduction
This is a continuation of my previous post, where I explored 𝜖-greedy methods to solve the k-arm bandit problem. If you are not yet familiar with these terms, please check out my previous post first.
In the last post, we saw how the 𝜖-greedy method, which mixes a bit of exploration with exploitation, can outperform the greedy approach in the long run. We also saw how this exploration parameter 𝜖 can be decreased over time to obtain an even better result. But we ended up with a question: what kind of function of the round number should we choose? Should it be 10/t, or 0.1/t, or maybe even (1000 - t), or something else?
A Bit of Change in the Problem
Previously, we considered 10 arms, where each arm gave a random reward according to a normally distributed random variable. If you want to learn more about the normal distribution, check out this awesome video by 3Blue1Brown. One problem with the normal distribution is that the generated random variable may even be negative. This is a bit weird for a reward distribution, as a negative reward looks more like a punishment. Another problem is that the reward can potentially be extremely large and unbounded; however, almost all practical resources are limited, hence bounded.
For this post, we make the following changes:
- We will have 5 arms instead of 10.
- Each arm “a” has a coin inside it, which turns up heads with probability pₐ.
- When you pull the arm, it tosses the coin. If it comes up heads, you get a reward of 1; if it comes up tails, there is no reward.
- We will still have 1000 such rounds.
- The goal is still the same: to maximize the total reward.
Here is a bit of Python code that now implements this reward function.
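A minimal sketch of such a machine is below: each of the 5 arms stores a hidden success probability pₐ (drawn uniformly at random here), and pulling an arm tosses the corresponding coin. The class name BernoulliBandit is only illustrative.

```python
import numpy as np

class BernoulliBandit:
    """A k-arm bandit where arm a hides a coin that turns up heads with probability p[a]."""

    def __init__(self, k=5, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.k = k
        # Hidden success probabilities of the arms, drawn uniformly at random
        self.p = self.rng.uniform(0.0, 1.0, size=k)

    def pull(self, a):
        """Toss the coin inside arm `a`: reward 1 if heads, 0 if tails."""
        return int(self.rng.random() < self.p[a])
```

For example, `bandit = BernoulliBandit()` creates one such 5-arm machine, and `bandit.pull(2)` tosses the coin inside arm 2.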
Upper Confidence Bound Algorithm
The Upper Confidence Bound (UCB) algorithm is one solution to this problem; it lets us dynamically adjust the exploration-exploitation balance to get the best result. It was proposed independently by Agrawal and by Katehakis and Robbins, following the seminal work of Lai and Robbins, which proved optimality properties of algorithms that maintain such a score and showed that they can provide the best return in the long run.
I won't go more mathematical than this, but if you want to know more, let me know in the comments and I'll make a post about it.
Assume that you are currently at round t. Till this round, you have been maintaining your best estimate of pₐ for each arm; let us call that p̂ₐ(t). Also, you have chosen arm “a” Nₐ(t) times so far. Then, there are three things at play here.
1. If p̂ₐ(t) is high for an arm “a”, then that arm probably gives a higher reward. So, you would like to exploit it more.
2. If Nₐ(t) is very low for an arm “a”, then there is not much certainty about the estimate p̂ₐ(t), since you have fewer observations from the reward distribution of that arm. So, you would like to explore it more.
3. If the index of the current round t is high, it means you are nearing the end of your chances. In that case, you should explore less and exploit more. As an extreme example, imagine there are 1000 rounds and you are at the last round. It clearly does not make sense to explore in this round, as you cannot use any newly gained information in future rounds.
Combining these, we have an algorithm that maintains an "optimistic" score of the reward for each arm and chooses the arm that maximizes this optimistic UCB score.
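In its most common (UCB1-style) form, this score for arm “a” at round t adds an exploration bonus to the current estimate:

$$ \mathrm{UCB}_a(t) \;=\; \hat{p}_a(t) \;+\; \sqrt{\frac{2 \ln t}{N_a(t)}}. $$

The exact constant inside the square root differs between presentations, but the shape of the bonus is always the same: it grows slowly with the round index t and shrinks as the arm gets pulled more often.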
Note that, if p̂ₐ(t) is high, we are more likely to choose that arm. However, the score can also become high when Nₐ(t) is low, leading us to explore that particular arm precisely because it has fewer observations.
Here, we write Python code as before to simulate the performance of the UCB algorithm on 2000 such randomly simulated bandit problems.
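A sketch of that experiment, reusing the BernoulliBandit class from above together with the UCB1-style score written earlier, might look like this:

```python
import numpy as np

def run_ucb(bandit, n_rounds=1000):
    """Run the UCB algorithm on one bandit instance; return the reward at each round."""
    k = bandit.k
    counts = np.zeros(k)          # N_a(t): number of times each arm has been pulled
    estimates = np.zeros(k)       # p_hat_a(t): running mean reward of each arm
    rewards = np.zeros(n_rounds)

    for t in range(n_rounds):
        if t < k:
            a = t                 # pull every arm once to initialize the estimates
        else:
            bonus = np.sqrt(2 * np.log(t + 1) / counts)
            a = int(np.argmax(estimates + bonus))

        r = bandit.pull(a)
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]   # incremental mean update
        rewards[t] = r
    return rewards

# Average the reward curve over 2000 randomly generated bandit problems
avg_reward = np.mean([run_ucb(BernoulliBandit(k=5)) for _ in range(2000)], axis=0)
```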
Again, we plot the average reward obtained as a function of the number of rounds, for both the 10/t-greedy algorithm and the UCB algorithm. The UCB algorithm learns more slowly than the 10/t-greedy algorithm, but in the long run it reaches the same level of performance (or maybe slightly better).
UCB Variant - KL-UCB Algorithm
Many variants of the UCB algorithm have been proposed over the last two decades. Here, we shall describe one of them, the KL-UCB algorithm, proposed by Garivier and Cappé, in which the upper confidence bound (UCB) score is given by
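$$ \mathrm{UCB}^{\mathrm{KL}}_a(t) \;=\; \max\Big\{ q \in [0, 1] \;:\; N_a(t)\, \mathrm{KL}\big(\hat{p}_a(t),\, q\big) \;\le\; \log t + c \log \log t \Big\} $$

(with c a small non-negative constant, often simply taken to be 0 in practice),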
where KL(p, q) denotes the Kullback-Leibler divergence between Bernoulli distributions with success probabilities p and q, which is KL(p, q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)). Note that, unlike in the UCB algorithm, the upper confidence bound here is defined implicitly and must be obtained through some numerical procedure. Once this UCB score is obtained, we again choose the arm “a” with the maximum UCB score, as before.
The following code implements this KL-UCB algorithm.
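A sketch of one possible implementation is below. Since the KL-UCB score is defined only implicitly, it is computed here by a simple bisection search over q; the constant c is set to 0, as is common in practice, and the BernoulliBandit class from earlier is reused.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_score(p_hat, n, t, c=0.0, tol=1e-6):
    """Largest q in [p_hat, 1] with n * KL(p_hat, q) <= log(t) + c * log(log(t))."""
    target = (np.log(t) + c * np.log(max(np.log(t), 1.0))) / n
    lo, hi = p_hat, 1.0
    while hi - lo > tol:          # bisection: KL(p_hat, q) is increasing in q >= p_hat
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

def run_kl_ucb(bandit, n_rounds=1000):
    k = bandit.k
    counts = np.zeros(k)
    estimates = np.zeros(k)
    rewards = np.zeros(n_rounds)
    for t in range(n_rounds):
        if t < k:
            a = t                 # initialize by pulling each arm once
        else:
            scores = [kl_ucb_score(estimates[i], counts[i], t + 1) for i in range(k)]
            a = int(np.argmax(scores))
        r = bandit.pull(a)
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]
        rewards[t] = r
    return rewards
```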
Again we plot its performance over 1000 rounds.
Clearly, it is much better than the simple UCB algorithm; it is quite close to the 10/t-greedy approach even during the initial rounds, which means this algorithm learns the optimal action very quickly.
Thompson Sampling - A Bayesian Touch
The Mathematical Concepts
If you carefully look at what we have been doing so far, there are two main things.
1. Based on whatever knowledge we have, we take some action, like pulling a particular arm.
2. That arm gives us a reward; we observe it and then update our knowledge (or estimates) to incorporate that new information about the reward.
This is very similar to the Bayesian paradigm in statistics, where one starts with a prior belief about the unknown values (called parameters), observes the data, and then updates that belief about these values in the light of the data; the updated belief is called the posterior belief. This entire field is based on Bayes' theorem, which came from an essay by Thomas Bayes published in 1763. Here's a quick video by StatQuest which provides a decent introduction to Bayes' theorem.
Thompson sampling is an algorithm based on Bayes' theorem, tuned to the particular case of reinforcement learning. Here, we start by assuming that every unknown pₐ follows a prior distribution with density
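$$ f(p_a) \;=\; \frac{p_a^{\alpha_a - 1}\,(1 - p_a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)}, \qquad 0 < p_a < 1, $$

where B(𝛼ₐ, 𝛽ₐ) is the usual normalizing constant.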
This is actually a Beta distribution with parameters 𝛼ₐ, 𝛽ₐ. Here's the core idea.
A higher value of 𝛼ₐ means it is more likely that pₐ is above 1/2, while a higher value of 𝛽ₐ means pₐ will usually be below 1/2.
The expectation (i.e., the mean) of this distribution is 𝛼ₐ/(𝛼ₐ + 𝛽ₐ).
As 𝛼ₐ or 𝛽ₐ increases, the distribution becomes more and more concentrated near its mean (or expectation).
Now, for every arm “a” that we pull, we observe a reward rₐ(t) at round t. This rₐ(t) actually has a Bernoulli(pₐ) distribution, which has the probability mass function
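$$ P\big(r_a(t) = r\big) \;=\; p_a^{\,r}\,(1 - p_a)^{1 - r}, \qquad r \in \{0, 1\}. $$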
To verify this, you can simply plug in the values rₐ(t) = 0 or rₐ(t) = 1 and check that the probability that rₐ(t) = 1 turns out to be exactly pₐ, as expected.
The Algorithm
Now we are ready with all the mathematics that we need to understand Thompson sampling. Let's say we are at the t-th round. We pull an arm “a”.
If the arm gave a reward of 1, then we should increase our estimate p̂ₐ(t) a bit. Since 𝔼(pₐ) = 𝛼ₐ / (𝛼ₐ + 𝛽ₐ), this can be achieved by increasing 𝛼ₐ, so we add 1 to the value of 𝛼ₐ for this particular arm.
If the arm gave a reward of 0, then we need to decrease our estimate. Similarly, we increase 𝛽ₐ by 1 this time.
In both cases, either 𝛼ₐ or 𝛽ₐ increases, so in a way the uncertainty decreases, as the Beta distribution becomes a little bit more concentrated around 𝛼ₐ / (𝛼ₐ + 𝛽ₐ).
One can also show that this updated Beta distribution is exactly the posterior distribution of pₐ given the observed reward rₐ(t). Applying Bayes' theorem, the posterior becomes
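$$ f\big(p_a \mid r_a(t)\big) \;\propto\; \underbrace{p_a^{\,r_a(t)}\,(1 - p_a)^{1 - r_a(t)}}_{\text{likelihood}} \times \underbrace{p_a^{\alpha_a - 1}\,(1 - p_a)^{\beta_a - 1}}_{\text{prior}} \;=\; p_a^{\,r_a(t) + \alpha_a - 1}\,(1 - p_a)^{(1 - r_a(t)) + \beta_a - 1}, $$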
which is again a Beta distribution but with new parameters rₐ(t) + 𝛼ₐ and (1 - rₐ(t)) + 𝛽ₐ as the alpha and beta parameters.
Finally, once we have the posterior distribution, we can use it to simulate a plausible set of pₐ values. This means simulating one possible k-arm bandit game that is consistent with the rewards observed so far. Once we have the simulated pₐ values, clearly the best possible action would be to take the arm with the highest pₐ value (i.e., the highest probability of giving a reward).
This entire algorithm is called Thompson sampling, proposed by William R. Thompson in 1933. Note that, in contrast to the greedy method, where we pick the best arm based on whatever knowledge we have so far, in Thompson sampling we try to predict the multi-arm bandit game itself using that knowledge and then play the best arm in the predicted game. You may want to check out this nice tutorial paper by Russo et al. on Thompson sampling.
Here's a bit of Python code that implements this Thompson sampling algorithm.
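A minimal sketch, again reusing the BernoulliBandit class from earlier and starting every arm from a Beta(1, 1) (i.e., uniform) prior:

```python
import numpy as np

def run_thompson(bandit, n_rounds=1000, rng=None):
    """Thompson sampling with a Beta(1, 1) prior on each arm's success probability."""
    rng = rng if rng is not None else np.random.default_rng()
    k = bandit.k
    alpha = np.ones(k)            # alpha_a: 1 + number of rewards equal to 1 from arm a
    beta = np.ones(k)             # beta_a: 1 + number of rewards equal to 0 from arm a
    rewards = np.zeros(n_rounds)

    for t in range(n_rounds):
        # Simulate one plausible bandit game from the current posterior ...
        p_sampled = rng.beta(alpha, beta)
        # ... and play the best arm of that simulated game
        a = int(np.argmax(p_sampled))

        r = bandit.pull(a)
        alpha[a] += r             # reward 1: add 1 to alpha_a
        beta[a] += 1 - r          # reward 0: add 1 to beta_a
        rewards[t] = r
    return rewards
```

The Beta(1, 1) starting point simply encodes complete ignorance about each pₐ, and the update inside the loop is exactly the "add 1 to 𝛼ₐ or 𝛽ₐ" rule described above.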
And here's a plot of the rewards experienced by this strategy.
Wow! It looks like Thompson sampling is the clear winner here. This is because instead of focusing on the short run and the immediate next reward, Thompson sampling focuses on the long run by guessing the multi-arm bandit game itself rather than just the next possible reward. Since the reward distributions of the arms do not change from round to round, Thompson sampling can leverage this by sticking to the best arm of the most probable k-arm bandit game given the information collected so far.
Some Questions to Think About
Here are some questions to ponder.
1. Thompson sampling gives good performance since the reward distributions do not change from time to time. What happens if the reward distributions change? For instance, if a slot machine gives you a jackpot reward at round t, it is unlikely that it will also give you a jackpot reward at round (t+1), since most of its rewards have been taken out by the previous jackpot.
2. How do we model this situation mathematically, i.e., when the reward itself is changing the attributes of the slot machine?
I shall answer both of these questions in my next post of this series. Till then, feel free to explore (or exploit) the answers to these questions and let me know in the comments.