Reinforcement Learning (Part 4) - Temporal Difference Algorithms
4th part of the Reinforcement Learning Series: I discuss how the ideas of Bellman equations and Monte Carlo methods can be combined into a single algorithm called the Temporal Difference algorithm.
Introduction
In my previous post (check below), we learned two contrasting methods of value function estimation: first, the Monte Carlo method, which accumulates experiences; and second, the dynamic programming (sometimes also called bootstrapping) method, which uses the Bellman equation. Both of these methods have their strengths and weaknesses.
The Monte Carlo method accumulates experiences, so it does not rely on your knowledge of how the environment behaves. This is particularly useful when you are playing a two-person board game where your opponent is a human, and you don't know exactly which move they are going to make. However, Monte Carlo requires you to reach the end of the game before you can update your estimates. In other words, as you play the game, you gain knowledge about how the game works, but you do not put it to use immediately; you store it to be used later. This means you have large memory requirements for games that run for a long time. Moreover, sometimes there is no end or terminal state (think of games like Temple Run or Subway Surfers, which have no end unless you lose), in which case the Monte Carlo method never updates its estimates.
On the other end of the spectrum, we have the dynamic programming method, which uses its current estimates along with the Bellman equation to refine and adjust its own estimates, thus enabling us to work with these infinitely long games. However, the problem is that it is not model-free, i.e., it requires knowledge of how the environment works, and often that is a big problem in itself.
This is the fourth part of the Reinforcement Learning series of blog posts.
Temporal Difference Learning
So in 1988, Richard Sutton came up with an algorithm that combines the benefits of these two approaches. He named the method Temporal Difference; we shall see why such a name is appropriate in a moment. He started by rewriting the Bellman equation as a two-step expectation as follows:
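$$
v_{\pi}(S_t) \;=\; \sum_{A_t} \sum_{S_{t+1}} \pi(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t) \left[ R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \right]
$$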
Here the quantity 𝜋(Aₜ∣Sₜ)p(Sₜ₊₁∣Sₜ, Aₜ) denotes the probability of a two-step process: from the current state Sₜ, we generate an action Aₜ according to the policy 𝜋, and then the environment gives back a state Sₜ₊₁. In a model-free design, we do not have access to the probability p(Sₜ₊₁∣Sₜ, Aₜ); the Monte Carlo method avoids it by simulating the action, receiving the resulting state, and accumulating that experience. In the above equation, we can similarly remove the expectation step and replace it with a simulation step, so we believe that the following relation should hold approximately:
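$$
v_{\pi}(S_t) \;\approx\; R_{t+1} + \gamma\, v_{\pi}(S_{t+1}),
$$

where Rₜ₊₁ and Sₜ₊₁ are the reward and next state obtained from one simulated step.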
Update Rule
Now we look at the equation that incrementally updates the value estimates in Monte Carlo. Let the current state whose value is being updated be s, and suppose it has been visited n times so far. Then,
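$$
v_{\pi}^{\text{new}}(s) \;=\; v_{\pi}^{\text{old}}(s) + \frac{1}{n} \left( G_t - v_{\pi}^{\text{old}}(s) \right)
$$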
It means that the new estimate is the old estimate plus a multiple of the error Gₜ − v𝜋_{old}(s) in the current estimate. It is as if Gₜ is the target that we are trying to estimate using the value function. However, we have established that v𝜋(Sₜ) ≈ Rₜ₊₁ + 𝛾v𝜋(Sₜ₊₁), so we can replace Gₜ by Rₜ₊₁ + 𝛾v𝜋(Sₜ₊₁). But we do not know the value v𝜋(Sₜ₊₁) yet, so we again approximate it by our current estimate of the value function. Hence we can consider an update equation of the form
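$$
v_{\pi}^{\text{new}}(S_t) \;=\; v_{\pi}^{\text{old}}(S_t) + \alpha \left( R_{t+1} + \gamma\, v_{\pi}^{\text{old}}(S_{t+1}) - v_{\pi}^{\text{old}}(S_t) \right)
$$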
This method is called Temporal Difference (TD), as the update equation now involves the incremental change in value estimates between two successive time points (𝛾v𝜋_{old}(Sₜ₊₁) − v𝜋_{old}(Sₜ)). The quantity 𝛼 is called the learning rate and is typically chosen to be a small value.
In essence, the TD algorithm proceeds with the following steps:
Start with an initial estimate of the value function and a starting state S₀.
For t = 0, 1, 2, …:
Starting from Sₜ, simulate one step of your policy to obtain an action Aₜ, and receive the reward Rₜ₊₁ and the next state Sₜ₊₁.
Use the update equation as shown above.
Keep doing this until convergence, i.e., the value estimates are not changing much.
TD Algorithm for Maze Game
Now we will try to apply the TD algorithm to the same maze game from the last post. Just to recap, the maze is shown in the following figure: starting at the red corner, we want to reach the blue corner while avoiding the obstacles shown in gray. The reward for reaching the end goal is 100, while hitting any obstacle gives a reward of (-1).
Here is code that performs the TD algorithm on the maze game.
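The sketch below is a minimal Python version of such an implementation. The grid size, the `OBSTACLES` layout, the uniformly random policy, and the learning rate `ALPHA` are illustrative assumptions rather than the exact setup of the previous post; only the rewards (100 at the goal, -1 for hitting an obstacle) and the discount factor 𝛾 = 0.9 follow the maze game described above.

```python
import numpy as np

# Illustrative maze setup (swap in the actual grid from the previous post)
ROWS, COLS = 5, 5
START, GOAL = (0, 0), (4, 4)                  # red corner and blue corner
OBSTACLES = {(1, 1), (1, 3), (3, 2)}          # gray cells (hypothetical layout)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GAMMA, ALPHA = 0.9, 0.1                       # discount factor and learning rate

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) in OBSTACLES:
        return state, -1.0, False             # bump into a wall/obstacle: stay put
    if (r, c) == GOAL:
        return (r, c), 100.0, True            # reached the end goal
    return (r, c), 0.0, False

def td_zero(episodes=5000, seed=0):
    """1-step TD value estimation under a uniformly random policy."""
    V = np.zeros((ROWS, COLS))
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s, done = START, False
        while not done:
            a = ACTIONS[rng.integers(len(ACTIONS))]        # random policy
            s_next, reward, done = step(s, a)
            target = reward + GAMMA * V[s_next] * (not done)
            V[s] += ALPHA * (target - V[s])                # TD(0) update
            s = s_next
    return V

print(np.round(td_zero(), 2))   # value estimates, one per maze cell
```

Plotting the returned array `V` as a heatmap gives the kind of value map discussed next.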
Although the estimates differ from the Monte Carlo and dynamic programming estimates we calculated in the previous post, the heatmap shows that the relative ordering of the value estimates is maintained. This is the main point of having a value estimate: it enables you to compare two states of the game, just as a chess grandmaster often knows which of two board positions is more advantageous with just a glance.
n-step TD Variant
The Temporal Difference algorithm that we have discussed so far is the 1-step version: we simulate the response of the environment for only 1 step ahead and then use the Bellman backup equation to approximate the gain used to update the estimate. We can generalize the same principle to create an n-step version, for any n. As you may have guessed, it simulates the game for n steps ahead of the current state and then approximates the gain by a discounted sum of the rewards from these n steps. The update equation becomes
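$$
v_{\pi}^{\text{new}}(S_t) \;=\; v_{\pi}^{\text{old}}(S_t) + \alpha \left( R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, v_{\pi}^{\text{old}}(S_{t+n}) - v_{\pi}^{\text{old}}(S_t) \right)
$$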
This now gives you control over how much information you want to propagate back. (Doesn't it look like a deep neural network, with the flexibility to control how far back you propagate the gradients for parameter estimation!)
Another example for better understanding: think of 1-step TD as the strategy that a novice chess player like me would employ. I often think only 1 step ahead, just spotting a fork, an x-ray kind of move, or a 1-move immediate checkmate. (Though I might see only 0 steps ahead if you play too well 😞.) n-step TD simply extends the capability of the RL agent to play like a grandmaster, who can see several, maybe 15-20 (read n) moves ahead. And if you want to create a chess world champion like AlphaZero, you might try your hand at 100-step TD! 😎
Let's see how the 5-step TD algorithm does in estimating the value for the maze game.
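Reusing the illustrative setup from the earlier sketch (the `step` function, `ACTIONS`, `GAMMA`, `ALPHA`, and the hypothetical maze layout), an n-step TD estimator might look roughly as follows; this is a sketch of the standard n-step TD scheme, not the post's exact code.

```python
def td_n_step(n=5, episodes=5000, seed=0):
    """n-step TD value estimation under the same uniformly random policy."""
    V = np.zeros((ROWS, COLS))
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        states, rewards = [START], [0.0]   # rewards[t] is the reward received at time t
        T, t = float("inf"), 0
        while True:
            if t < T:
                a = ACTIONS[rng.integers(len(ACTIONS))]
                s_next, reward, done = step(states[t], a)
                states.append(s_next)
                rewards.append(reward)
                if done:
                    T = t + 1              # episode ends at time T
            tau = t - n + 1                # the time whose state gets updated
            if tau >= 0:
                # n-step gain: discounted rewards plus the bootstrapped tail
                G = sum(GAMMA ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += GAMMA ** n * V[states[tau + n]]
                V[states[tau]] += ALPHA * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

print(np.round(td_n_step(n=5), 2))
```

The bookkeeping with the `tau` index simply delays each update by n steps, so that all n rewards and the bootstrap state are available when the update is made.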
One interesting thing to note here is that there are no negative values in the estimates for the 5-step temporal difference method. This is probably because as soon as the end goal is within 5 steps of a state, the state becomes valuable. With the discount factor 𝛾 = 0.9, the ultimate reward of 100 (i.e., the reward of reaching the end goal) becomes 100𝛾⁵ ≈ 59, which is clearly still much larger than the punishment of (-1) for hitting an obstacle.
TD(𝜆) Variant
Although Temporal Difference algorithms have their strengths, there are still a few ideas we can use to improve them.
The n-step TD method uses the rewards from up to n future steps. However, since we are already waiting for n steps, we also have the information needed to perform the (n-1)-step TD update, the (n-2)-step TD update, and so on. (You get the pattern!) Yet we rely only on the n-step look-ahead, as if the intermediate look-aheads were not meaningful.
In dynamic programming, the Bellman equation consisted of a theoretical expectation of the gain, i.e., it was
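$$
v_{\pi}(S_t) \;=\; \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \,\middle|\, S_t \right]
$$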
Here the term inside the expectation is the gain we obtain by stepping from state Sₜ to Sₜ₊₁. In Monte Carlo, we approximate this theoretical expectation by simulating a single path. However, this is like approximating the probability of a coin turning up heads by tossing the coin once. As you might guess, that won't be accurate; we need to toss the coin hundreds of times to get a reasonable approximation. (This idea can also be shown theoretically using a result called the Law of Large Numbers.) Therefore, the principle is to aggregate multiple estimates together to reduce the uncertainty in estimation.
So, Richard Sutton came up with the idea that instead of using the n-step gain
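$$
G_t^{(n)} \;=\; R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, v_{\pi}(S_{t+n}),
$$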
we can use the average of 1-step gain, 2-step gain, and so on until n-step gain, i.e.,
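$$
\frac{1}{n} \left( G_t^{(1)} + G_t^{(2)} + \dots + G_t^{(n)} \right),
$$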
as the target in the TD update equation, where G_t^{(k)} denotes the k-step gain defined as above. In general, we can take any weighted average of these temporal-difference gains, and that can be regarded as a valid estimate of the discounted sum of future rewards.
Relating back to the chess example: using the average of the n-step gains is like consulting multiple chess players, ranging from a novice who sees only 1 step ahead to a grandmaster who sees 20 steps ahead, and accumulating their suggestions. This is certainly better than relying on only one chess player, who might miss some obvious moves.
However, since there is again a trade-off between a long-term look-ahead and a short-term look-ahead, we can use a discounting process to take the average. Thus, our new target becomes
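$$
G_t^{\lambda} \;=\; (1 - \lambda) \sum_{k=1}^{n-1} \lambda^{k-1}\, G_t^{(k)} \;+\; \lambda^{n-1}\, G_t^{(n)}
$$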
for some 0 < 𝜆 < 1. And we have the corresponding update equation
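$$
v_{\pi}^{\text{new}}(S_t) \;=\; v_{\pi}^{\text{old}}(S_t) + \alpha \left( G_t^{\lambda} - v_{\pi}^{\text{old}}(S_t) \right)
$$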
This is called the n-step TD(𝜆) algorithm for value estimation. In principle, it is just reducing the uncertainty by averaging the estimates from multiple k-step TD targets, for k = 1, …, n.
I won't demonstrate the code for the n-step TD(𝜆) algorithm here, but I encourage enthusiastic readers to try it out for the maze game and share their solution in the comments below. Unfortunately, there is no prize for getting it right, since I'm also on a tight budget here 😅!
Next in Queue
In my next post in the RL blog series, we will dive deep into how we can use these value estimates to find an optimal strategy for an RL agent. We will use Python to code the algorithm, and then use that code to successfully land a space shuttle on the lunar surface inside a simulator. This should give a sneak peek at the amazing achievements of the ISRO scientists who successfully landed Chandrayaan-3 on the surface of the moon yesterday, on the 23rd of August, 2023.