Reinforcement Learning Algorithms: Expected SARSA

In this post, we’ll extend our toolset for Reinforcement Learning by considering a new temporal difference (TD) method called Expected SARSA.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about SARSA and Q-Learning, two popular TD methods. We’ll see how Expected SARSA unifies the two.



Review of SARSA and Q-Learning

Let’s begin by reviewing the regular TD methods covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state \( S_t \) and decides what action \( A_t \) to perform based on the state. This is called the policy and can be characterized by a probability distribution, \( \pi( A_t | S_t) \).

As the agent does this action, it changes the environment which results in the next state \( S_{t+1} \). A reward signal \( R_{t+1} \) is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, \( G_t \). The discounted return is defined as:

$$ G_t \dot{=} R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T $$

One important relationship we can infer from the definition of the return is:

$$ G_t = R_{t + 1} + \gamma G_{t+1} $$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

Taking the expected value of both sides and conditioning on the state \( S_t \), we arrive at the Bellman Equation:

$$ V_\pi(s) = E_\pi[ R_{t + 1} + \gamma V_\pi(S_{t+1}) | S_t = s] $$

If we condition on both the state \( S_t \) and the action \( A_t \), we get the Bellman Equation for the action-value:

$$ Q_\pi(s, a) = E_\pi \left[ R_{t + 1} + \gamma \sum_{a’} \pi(a’ |S_{t+1}) Q_\pi(S_{t+1}, a’) | S_t = s, A_t = a \right] $$

This will be the important relationship to consider when we learn about Expected SARSA.

Let’s go back a few steps.

The expected return given that the agent is in state \( S_t \) and performs action \( A_t \) at time \( t \) is given by the Q-table. Specifically:

$$ Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a] $$

The Q-table can be used to determine what the best action will be since we can just choose whichever action \( a \) maximizes \( Q(s,a) \).

The problem is, we do not know \( Q(s,a) \)! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution \( p(s’, r | s, a) \).

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The full Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a) $$

As you recall, it’s possible to convert the formula for the sample mean into a recursive update – something that looks a bit like gradient descent.

This is convenient so that we can update \(Q\) each time we receive a new sample without having to sum up all \(N\) values each time. If we did that, our computations would get longer and longer as \(N\) grows!

The recursive update looks like this:

$$ Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ G^{(N)}(s,a) – Q_\pi^{(N-1)}(s, a) \right] $$

This shows us how to get the \(N\)th estimate from the \(N-1\)th estimate. By setting \( \alpha = \frac{1}{N} \), we get exactly the sample mean (this was derived in the course).

The Temporal Difference approach uses “bootstrapping”, where we use existing values in \(Q_\pi(s, a)\) to estimate the expected return.

The SARSA update looks as follows:

$$ Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ r + \gamma Q_\pi^{(N-1)}(s’, a’) – Q_\pi^{(N-1)}(s, a) \right] $$

Essentially, we have replaced the old “target”:

$$ G(s,a) $$

with a new “target”:

$$ r + \gamma Q(s’, a’) $$

This should remind you of the right-hand side of the Bellman Equation for Q, with all the expected values removed.

What is the significance of this?

With Monte Carlo, we would have to wait until the entire episode is complete in order to compute the return (since the return is the sum of all future rewards).

For very long episodes, this means that it will take a long time before any updates are made to your agent.

For infinitely long episodes, it’s not possible to compute the return at all! (You’d have to wait an infinite amount of time to get an answer.)

TD learning uses the immediate reward \( r \) and “estimates the rest” using the existing value of \( Q \).

Therefore, your agent learns on each step, rather than waiting until the end of an episode.

The Q-Learning “target” is:

$$ r + \gamma \max_a’ Q(s’, a’) $$

The main difference between SARSA and Q-Learning is that SARSA is on-policy while Q-Learning is off-policy.

This is because Q-Learning always uses the max, even when that action wasn’t taken. SARSA uses action \( a’ \) – whichever action was actually chosen for the next step. Q-Learning learns the value of the policy in which you take the max (even though your behavior policy may be different – e.g. epsilon-greedy). SARSA learns the value of the behavior policy.


Expected SARSA

For Expected SARSA, the target is:

$$ r + \gamma \sum_{a’} \pi( a’ | s’) Q(s’, a’) $$

This can also be written as:

$$ r + \gamma E_\pi [Q(s’, a’) | s’] $$

Thus, Expected SARSA uses the expected action-value for the next state \( s’ \), over all actions \( a’ \), weighted according to the policy distribution.

Like regular SARSA, this should remind you of the Bellman Equation for Q (even more so than regular SARSA since it now properly sums over the policy distribution).

You can think of regular SARSA as merely drawing samples from the Expected SARSA target distribution. Put a different way, SARSA does what Expected SARSA does, in expectation.

So how does Expected SARSA unify SARSA and Q-Learning?

In fact, Q-Learning is merely a special case of Expected SARSA.

First, recognize that Expected SARSA can be on-policy or off-policy. In the off-policy case, one could act according to some behavior policy \( b(a | s) \), while the target is computed according to the target policy \( \pi( a | s) \).

If \( b(a | s) \) is epsilon-greedy while \( \pi(a | s) \) is greedy, then Expected SARSA is equivalent to Q-Learning.

On the other hand, Expected SARSA generalizes SARSA as well, as discussed above.

Practical Considerations

1) Aside from all the theory above, the update itself is actually quite simple to implement. In some cases, you may find that it works better than SARSA or Q-Learning. Thus, it is a worthy addition to your RL toolbox and could form a part of your strategy for building agents to solve RL problems.

2) It’s slower than SARSA and Q-Learning since it requires summing over the action space on each update. So, you may have to decide whether the additional computation requirements are worth whatever increase in performance you observe.