What is the difference between epsilon-greedy and epsilon-soft policies?

Learn more about Reinforcement Learning in my course (75% off when you use the following link):

Artificial Intelligence: Reinforcement Learning in Python

A common question I get in my Reinforcement Learning class is:

“What is the difference between epsilon-greedy and epsilon-soft policies?”

At first glance, it may seem that these are the same thing, and at times Sutton & Barto appear to use the two terms interchangeably.

Here’s the difference.

An epsilon-soft (\( \varepsilon \)-soft) policy is any policy that, for every state \(s\), gives every available action at least some minimum probability, specifically:

$$ \pi(a | s) \ge \frac{\varepsilon}{| \mathcal{A}(s) |} , \forall a \in \mathcal{A}(s) $$
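As a quick illustration, this condition can be checked numerically for a single state. The following is only a minimal sketch: `is_epsilon_soft` is a hypothetical helper name, and it assumes the policy for one state is given as a vector of action probabilities.

```python
import numpy as np

def is_epsilon_soft(probs, epsilon, tol=1e-12):
  """Check the epsilon-soft condition for one state's action probabilities.

  probs: 1-D sequence holding pi(a | s) for every action a in A(s)
         (an assumed input format, chosen for illustration).
  """
  probs = np.asarray(probs, dtype=float)
  floor = epsilon / len(probs)  # epsilon / |A(s)|
  # tol absorbs floating-point rounding when a probability sits exactly on the floor
  return bool(np.all(probs >= floor - tol))

print(is_epsilon_soft([0.7, 0.2, 0.1], epsilon=0.3))  # floor is 0.1
print(is_epsilon_soft([0.9, 0.1, 0.0], epsilon=0.3))  # one action has probability 0
```

The first policy keeps every action at or above the floor of 0.1, so it passes. The second assigns zero probability to one action, so it fails.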

The epsilon-greedy (\( \varepsilon \)-greedy) policy is a specific instance of an epsilon-soft policy.

Specifically, an epsilon-greedy policy is always defined with respect to some action-value function \( Q(s,a) \).

Let \( a^* \) be the greedy action with respect to \( Q(s,a) \), so that:

$$ a^* = \arg\max_a Q(s,a) $$

Then the epsilon-greedy policy assigns the following probabilities to the actions in \( \mathcal{A}(s) \):

$$
\pi(a | s) = \begin{cases}
1 - \varepsilon + \frac{\varepsilon}{| \mathcal{A}(s) |}, & a = a^* \\
\frac{\varepsilon}{| \mathcal{A}(s) |}, & a \ne a^*
\end{cases}
$$
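To make these probabilities concrete, here is a minimal sketch that builds the full probability vector for one state. The name `epsilon_greedy_probs` and the example action values are assumptions made for illustration, not something from Sutton & Barto.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
  """Return pi(a | s) for every action, given one state's action values."""
  q_values = np.asarray(q_values, dtype=float)
  n_actions = len(q_values)
  # every action starts with the exploration share epsilon / |A(s)|
  probs = np.full(n_actions, epsilon / n_actions)
  # the greedy action additionally receives the remaining 1 - epsilon
  # (np.argmax breaks ties by taking the first maximal action)
  probs[np.argmax(q_values)] += 1.0 - epsilon
  return probs

probs = epsilon_greedy_probs([1.0, 3.0, 2.0], epsilon=0.2)
print(probs)        # the greedy action is index 1
print(probs.sum())  # the probabilities sum to 1, up to rounding
```

Note that every entry is at least \( \varepsilon / | \mathcal{A}(s) | \), which is exactly the epsilon-soft condition.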

Sampling an action from this policy can be accomplished using the following code:

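import numpy as np

def epsilon_greedy(Q, s, epsilon):
  if np.random.random() < epsilon:
    # explore: pick uniformly among all actions (the greedy one included)
    return np.random.choice(Q.shape[1])
  else:
    # exploit: pick the greedy action for state s
    return np.argmax(Q[s, :])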

This assumes that Q is a 2-dimensional NumPy-like array indexed by state and then by action, so that Q[s, :] holds the action values for state s.
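It may help to spell out why this sampling procedure produces the probabilities given earlier. Assuming the exploration branch samples uniformly over all of \( \mathcal{A}(s) \), the greedy action can be chosen in two ways: directly in the exploitation branch (probability \( 1 - \varepsilon \)), or by chance in the exploration branch (probability \( \varepsilon \) times \( 1 / | \mathcal{A}(s) | \)):

$$ P(a^*) = (1 - \varepsilon) + \varepsilon \cdot \frac{1}{| \mathcal{A}(s) |} $$

Any other action can only be chosen in the exploration branch:

$$ P(a) = \varepsilon \cdot \frac{1}{| \mathcal{A}(s) |} , \quad a \ne a^* $$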

Based on this, we can see that all epsilon-greedy policies are epsilon-soft policies, but not all epsilon-soft policies are epsilon-greedy policies.
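Here is a small numerical sketch of that last point. The action values and the second policy are made-up numbers chosen for illustration only.

```python
import numpy as np

epsilon = 0.3
q_values = np.array([1.0, 3.0, 2.0])  # hypothetical action values for one state
n_actions = len(q_values)
floor = epsilon / n_actions            # epsilon / |A(s)|, i.e. 0.1 here

# the epsilon-greedy policy for these values: approximately [0.1, 0.8, 0.1]
greedy_probs = np.full(n_actions, floor)
greedy_probs[np.argmax(q_values)] += 1.0 - epsilon

# a different policy that also keeps every action at or above the floor
other_probs = np.array([0.2, 0.5, 0.3])

def is_soft(p):
  # small tolerance for floating-point rounding at the floor
  return bool(np.all(p >= floor - 1e-12))

print(is_soft(greedy_probs), is_soft(other_probs))  # both satisfy the epsilon-soft condition
print(np.allclose(other_probs, greedy_probs))       # but they are not the same policy
```

Both policies are epsilon-soft for \( \varepsilon = 0.3 \), but only the first is epsilon-greedy. The second gives its two non-greedy actions different probabilities (0.2 and 0.3), so it cannot be epsilon-greedy for any value of \( \varepsilon \).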
