Monte Carlo with Importance Sampling for Reinforcement Learning

In this post, we’ll extend our toolset for Reinforcement Learning by considering the Monte Carlo method with importance sampling.

In my course, “Artificial Intelligence: Reinforcement Learning in Python”, you learn about the Monte Carlo method. But that’s just the beginning. There is still more that can be done to improve the agent’s learning capabilities.

Review of Monte Carlo for Reinforcement Learning

Let’s begin by reviewing the regular Monte Carlo method covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state \( S_t \) and decides what action \( A_t \) to perform based on the state. This is called the policy and can be characterized by a probability distribution, \( \pi( A_t | S_t) \).

As the agent performs this action, it changes the environment, which results in the next state \( S_{t+1} \). A reward signal \( R_{t+1} \) is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, \( G_t \). The discounted return is defined as:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T - t - 1} R_T $$
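To make the definition concrete, here is a minimal sketch that computes the discounted return for every time step of one finished episode. The function name `discounted_returns` and the convention that `rewards[t]` stores \( R_{t+1} \) are assumptions made for this illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return [G_0, ..., G_{T-1}] given rewards [R_1, ..., R_T].

    Assumed convention: rewards[t] holds R_{t+1}, the reward received
    after the action taken at time t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T = 0: nothing is collected after the episode ends
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Looping backwards means each return reuses the one after it, so the whole list costs a single pass over the episode.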

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

The expected return given that the agent is in state \( S_t \) and performs action \( A_t \) at time \( t \) is given by the Q-table. Specifically:

$$ Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a] $$

The Q-table can be used to determine what the best action will be since we can just choose whichever action \( a \) maximizes \( Q(s,a) \).

The problem is, we do not know \( Q(s,a) \)! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution \( p(s', r | s, a) \).

Generally speaking, this distribution is unknown. For example, imagine trying to write down the exact transition probabilities for a self-driving car.

The Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a) $$

Where \( G^{(i)}(s,a) \) was the sample return when the agent was in state \( s \) and performed action \( a \) during the \( i \)’th episode.

Put simply: play a bunch of episodes, collect all the state-action-reward sequences, calculate the returns (by summing up the rewards), and then compute the average return for each state-action pair.
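As a rough sketch of that recipe, the function below averages returns per state-action pair. It assumes each episode is stored as a list of `(state, action, reward)` tuples, and it counts every visit to a pair (the "every-visit" variant), which is one choice among several. The names here are hypothetical, not taken from the course code.

```python
from collections import defaultdict


def mc_q_estimate(episodes, gamma=0.9):
    """Estimate Q(s, a) as the average sampled return per (s, a) pair.

    Assumed format: each episode is a list of (state, action, reward)
    tuples, where reward is the one received after taking the action.
    """
    totals = defaultdict(float)  # sum of returns seen for each (s, a)
    counts = defaultdict(int)    # number of returns seen for each (s, a)
    for episode in episodes:
        g = 0.0
        # Loop backwards so the return can be built up incrementally.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            totals[(state, action)] += g
            counts[(state, action)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}


episodes = [
    [("s0", "a", 1.0), ("s1", "a", 2.0)],
    [("s0", "a", 5.0)],
]
print(mc_q_estimate(episodes, gamma=1.0))
```

With `gamma=1.0`, the pair `("s0", "a")` sees returns 3.0 and 5.0, so its estimate is their average.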


How can we ensure that we visit every state-action pair so that the whole Q-table is filled up with a sufficient number of samples?

Practically, we usually employ some kind of exploration strategy, such as epsilon-greedy. With epsilon-greedy, we perform the greedy action (the one that looks best under our current Q-table) with probability \( 1-\varepsilon \), and we pick a random action with probability \( \varepsilon \). So \( \varepsilon \) is the probability of exploration.

The problem with this approach is that it leads to a suboptimal policy. Why? It means that approximately \( \varepsilon \) of the time, you are going to do something suboptimal!

(Note: it’s not exactly \( \varepsilon \) since choosing an action randomly can still lead to choosing the optimal action by chance.)
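A minimal sketch of epsilon-greedy action selection follows. It assumes the Q-values for the current state are held in a dict mapping action to estimate, and that the random branch picks uniformly over all actions. The helper `greedy_action_probability` is a hypothetical name that makes the note above precise under that assumption.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from q_values (dict: action -> estimated Q(s, a))."""
    actions = list(q_values)
    if rng.random() < epsilon:
        # Explore: uniform over ALL actions, so the greedy one can still win.
        return rng.choice(actions)
    # Exploit: the action with the largest current estimate.
    return max(actions, key=q_values.get)


def greedy_action_probability(epsilon, n_actions):
    """Chance of taking the greedy action under uniform exploration."""
    return (1.0 - epsilon) + epsilon / n_actions


q_s = {"left": 0.1, "right": 0.7}
print(epsilon_greedy(q_s, epsilon=0.1))
print(greedy_action_probability(0.1, n_actions=2))
```

These per-action probabilities are exactly what a behavior policy \( b(a | s) \) needs to report when we compute importance sampling ratios later.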

This is where we transition to the main topic of this article, which is how Importance Sampling can help us overcome this problem.

Monte Carlo with epsilon-greedy exploration is called an on-policy control method, because the action-value (Q-table) being estimated corresponds to the policy that the agent is following.

On the other hand, off-policy methods allow the agent to act according to one policy (called the behavior policy), while the action-value is computed for a different, eventually optimal policy (called the target policy).

Henceforth we will denote the target policy as \( \pi( a | s) \) and the behavior policy as \( b(a | s) \).

Importance Sampling

Suppose that we would like to estimate the expected value of some function \( s \) of a random variable \( X \) under some distribution \( \pi \). We write this as:

$$ E_\pi[ s(X) ] $$

If we know \( \pi \), then (assuming \( X \) is a discrete random variable) this can be computed as:

$$ E_\pi[ s(X) ] = \sum_x \pi(x)s(x) $$

If we don’t know \( \pi \), but \( X \) is sampled according to \( \pi \), then this expectation can be estimated by using the sample mean:

$$ E_\pi[ s(X) ] \approx \frac{1}{N}\sum_{i=1}^N s(X_i) $$

Now suppose that something prevents us from gathering samples according to the distribution \( \pi \), but it’s still possible to gather samples from a different distribution \( b \). i.e. We want \( X \sim \pi \) but we have \( X \sim b \) instead.

In this case, the above sample average does not give us the desired expectation. We would be estimating \( E_b[ s(X)] \) instead of \( E_\pi[ s(X) ] \).

The importance sampling solution is found by recognizing the following equalities:

$$ \begin{aligned}
E_\pi \left[ s(X) \right] &= \sum_x \pi(x)s(x) \\
&= \sum_x \pi(x)s(x)\frac{b(x)}{b(x)} \\
&= \sum_x b(x)s(x)\frac{\pi(x)}{b(x)} \\
&= E_b \left[ s(X)\frac{\pi(X)}{b(X)} \right]
\end{aligned} $$

This tells us that it’s possible to estimate the expectation under \( \pi \) even when the samples are drawn according to \( b \). All we have to do is multiply by the importance sampling ratio, \( \frac{\pi(X)}{b(X)} \).

The only requirement is that \( b(x) \) is not 0 when \( \pi(x) \) is not 0. This is called “coverage”.
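Here is a small numerical check of this identity. The two distributions, the function `s`, and the sample size are made-up values chosen only for illustration. We draw samples from \( b \) but recover the expectation under \( \pi \).

```python
import random

# Hypothetical example distributions over X in {0, 1}.
pi = {0: 0.9, 1: 0.1}  # target: the distribution we care about
b = {0: 0.5, 1: 0.5}   # behavior: the distribution we can sample from


def s(x):
    """Arbitrary function of X, chosen for this example."""
    return 10.0 * x


def importance_sampling_estimate(n, seed=0):
    """Estimate E_pi[s(X)] using n samples drawn from b."""
    rng = random.Random(seed)
    xs = rng.choices(list(b), weights=list(b.values()), k=n)
    # Reweight each sample by pi(x) / b(x) before averaging.
    return sum(s(x) * pi[x] / b[x] for x in xs) / n


exact = sum(pi[x] * s(x) for x in pi)    # true E_pi[s(X)]
naive_target = sum(b[x] * s(x) for x in b)  # what an unweighted mean estimates
print(exact, naive_target, importance_sampling_estimate(100_000))
```

Coverage holds here because \( b(x) > 0 \) wherever \( \pi(x) > 0 \). The reweighted average should land close to `exact`, whereas an unweighted average of the same samples would drift toward `naive_target`.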

Applying Importance Sampling to Reinforcement Learning

In reinforcement learning, the return \( G_t \) is generated by acting according to the behavior policy \( b(a | s) \) with transition dynamics \( p(s', r | s, a) \). But we would like to know the expectation of \( G_t \) under the target policy \( \pi(a | s) \) with the same transition dynamics.

In this case, the importance sampling ratio is a bit more complicated but can still be derived. \( G_t \) is a sample from the distribution of \( p(A_t, S_{t+1}, A_{t+1}, \dots, S_T | S_t, A_\tau \sim b \mspace{5mu} \forall \tau) \).

Basically this says: “the distribution of all the actions and states that happened after arriving in state \( S_t \), following the policy \( b \)”.

The distribution we want the expectation with respect to is the same thing, but with actions drawn according to \( \pi \). i.e. \( p(A_t, S_{t+1}, A_{t+1}, \dots, S_T | S_t, A_\tau \sim \pi \mspace{5mu} \forall \tau) \).

Thanks to the Markov property these distributions are easy to expand.

$$ p(A_t, S_{t+1}, A_{t+1}, \dots, S_T | S_t, A_\tau \sim b) = \prod_{\tau=t}^{T-1} b(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

$$ p(A_t, S_{t+1}, A_{t+1}, \dots, S_T | S_t, A_\tau \sim \pi) = \prod_{\tau=t}^{T-1} \pi(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

The importance sampling ratio is then just:

$$ \frac{p(A_t, S_{t+1}, A_{t+1}, \dots, S_T | S_t, A_\tau \sim \pi)}{p(A_t, S_{t+1}, A_{t+1}, \dots, S_T | S_t, A_\tau \sim b)} = \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

The transition dynamics cancel out because they are the same on both top and bottom.

This is convenient, because we know \( \pi \) and we know \( b \), but we do not know \( p \) (which is why we have to use Monte Carlo in the first place).

Let’s define this importance sampling ratio using the symbol \( \rho \).

$$ \rho_{t:T-1} \dot{=} \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

Using this importance sampling ratio, we can estimate \( Q_\pi(s,a) \) even though we are acting according to a different policy \( b \) and using the returns generated from this other policy.

$$ Q_\pi(s,a) \approx \frac{1}{N}\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a) $$

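Where \( G^{(i)}(s,a) \) was the sample return when the agent was in state \( s \) and performed action \( a \) during the \( i \)’th episode, and \( \rho^{(i)}(s,a) \) was the corresponding importance sampling ratio. One subtlety: when estimating \( Q_\pi(s,a) \) the first action \( a \) is already given, so for a visit at time \( t \) the ratio only needs to cover the actions taken after it, i.e. \( \rho_{t+1:T-1} \).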
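The following is a minimal sketch of both pieces: the product of policy ratios along a stretch of an episode, and the importance-weighted average of returns. It assumes both policies are stored as nested dicts `policy[state][action]` giving probabilities. The names and the toy policies are hypothetical.

```python
def importance_ratio(steps, pi, b):
    """Product of pi(a|s) / b(a|s) over the given (state, action) steps.

    The caller decides which time steps to pass in; an empty list
    gives the empty product, 1.0.
    """
    rho = 1.0
    for state, action in steps:
        rho *= pi[state][action] / b[state][action]
    return rho


def ordinary_is_estimate(samples):
    """samples: list of (rho, G) pairs collected for one (s, a) pair."""
    return sum(rho * g for rho, g in samples) / len(samples)


# Toy policies over two states and two actions (illustrative values).
pi = {"s0": {"a": 1.0, "b": 0.0}, "s1": {"a": 0.5, "b": 0.5}}
b = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 0.5, "b": 0.5}}

print(importance_ratio([("s0", "a"), ("s1", "b")], pi, b))  # 2.0
print(importance_ratio([("s0", "b"), ("s1", "b")], pi, b))  # 0.0
print(ordinary_is_estimate([(2.0, 1.0), (0.0, 5.0)]))       # 1.0
```

Notice that a single action the target policy would never take drives the whole ratio to zero, so that episode contributes nothing to the estimate from that point backwards.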

Importance Sampling for Monte Carlo Implementation

At this point, you know all the theory. All that you have to do now is plug in the above importance sampling ratio in the appropriate places in your existing Monte Carlo code, and you’ll be doing Monte Carlo with importance sampling.

Here are some important considerations.

Like the return \( G_t \), the importance sampling ratio is defined in terms of future values. That is, the importance sampling ratio at time \( t \) depends on the probabilities of the behavior and target policies at time \( t+1, t+2, \dots \).

Therefore, it would make sense to loop backwards in time to compute this ratio, just like we loop backwards in time to compute the return.

Just like the return, the importance sampling ratio can be computed recursively.
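One way to sketch the backward loop is shown below. It assumes an episode is a list of `(state, action, reward)` tuples and that the policies are nested dicts `policy[state][action]`. The function name and data layout are assumptions for illustration. It stores \( \rho_{t:T-1} \) in `rho[t]`, so the ratio that starts one step later is simply `rho[t + 1]`.

```python
def returns_and_ratios(episode, pi, b, gamma=0.9):
    """Compute returns and importance ratios in one backward pass.

    Returns (G, rho), both of length T + 1, where G[t] = G_t and
    rho[t] = rho_{t:T-1}. The final entries are G[T] = 0 and
    rho[T] = 1 (an empty product).
    """
    T = len(episode)
    G = [0.0] * (T + 1)
    rho = [1.0] * (T + 1)
    for t in reversed(range(T)):
        state, action, reward = episode[t]
        # Recursions: G_t = R_{t+1} + gamma * G_{t+1}
        #             rho_{t:T-1} = (pi / b at time t) * rho_{t+1:T-1}
        G[t] = reward + gamma * G[t + 1]
        rho[t] = pi[state][action] / b[state][action] * rho[t + 1]
    return G, rho


pi = {"s0": {"a": 1.0, "b": 0.0}, "s1": {"a": 0.5, "b": 0.5}}
b = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 0.5, "b": 0.5}}
episode = [("s0", "a", 1.0), ("s1", "b", 2.0)]
print(returns_and_ratios(episode, pi, b, gamma=1.0))
```

In a real control loop you could stop early as soon as the running ratio hits zero, since every earlier time step would then also get weight zero.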

Finally, you’ll recall that for the regular unweighted sample mean, it’s possible to perform constant-time updates every time we collect a new sample, instead of naively summing up all the samples and dividing by N.

$$ Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) + \frac{1}{i}(G^{(i)}(s,a) - Q^{(i-1)}(s,a)) $$
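As a quick sanity check, here is a minimal sketch of the constant-time running mean for the unweighted case, compared against the naive average. The function name `update_mean` and the sample values are made up for this example.

```python
def update_mean(q, g, i):
    """Mean of the first i samples, given the mean q of the first i - 1.

    i starts at 1; the new sample g moves the mean by 1/i of the error.
    """
    return q + (g - q) / i


samples = [2.0, 4.0, 9.0]
q = 0.0  # initial value is irrelevant: the first update overwrites it
for i, g in enumerate(samples, start=1):
    q = update_mean(q, g, i)

print(q, sum(samples) / len(samples))  # both 5.0
```

Only the current mean and the sample count need to be stored, no matter how many episodes have been played.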

Similarly, it’s possible to express the weighted sample mean update using a similar constant-time operation. i.e.

$$ Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) + \alpha^{(i)}(G^{(i)}(s,a) - Q^{(i-1)}(s,a)) $$

As an exercise, try to derive what \( \alpha^{(i)} \) should be.

Last point: I haven’t discussed weighted importance sampling, which can be used to reduce the variance of the estimate. The weighted importance sampling estimate looks like this:

$$ Q_\pi(s,a) \approx \frac{\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)}{ \sum_{i=1}^N \rho^{(i)}(s,a) } $$
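To see how the two estimators differ, here is a minimal batch sketch that assumes the `(rho, G)` pairs for one state-action pair have already been collected. The function names, the sample values, and the fallback to 0.0 when all weights are zero are assumptions for illustration.

```python
def ordinary_is_estimate(samples):
    """Divide the weighted sum of returns by the number of samples."""
    return sum(rho * g for rho, g in samples) / len(samples)


def weighted_is_estimate(samples):
    """Divide the weighted sum of returns by the sum of the weights."""
    total_weight = sum(rho for rho, _ in samples)
    if total_weight == 0.0:
        # Assumption: no usable sample yet, so fall back to 0.0.
        return 0.0
    return sum(rho * g for rho, g in samples) / total_weight


samples = [(3.0, 1.0), (1.0, 5.0)]  # made-up (rho, G) pairs
print(ordinary_is_estimate(samples))  # 4.0
print(weighted_is_estimate(samples))  # 2.0
```

The weighted estimate is a weighted average of the observed returns, so it always stays between the smallest and largest return seen (here 1.0 and 5.0), while the ordinary estimate can fall outside that range when the ratios are large.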


Let’s review why this is different from regular Monte Carlo.

Regular Monte Carlo (what I covered in my Reinforcement Learning course) is an on-policy control method.

We use epsilon-greedy because exploration is required to collect enough samples for all state-action pairs. Epsilon-greedy is suboptimal by definition. Our Q-table and final policy will thus be suboptimal.

What might be better is an off-policy control method, where we act according to a behavior policy which allows for exploration, but compute the Q-table according to the target greedy policy (the optimal policy).