# What is the difference between epsilon-greedy and epsilon-soft policies?

February 27, 2020

Artificial Intelligence: Reinforcement Learning in Python

A common question I get in my Reinforcement Learning class is:

“What is the difference between epsilon-greedy and epsilon-soft policies?”

At first glance, it may seem that these are the same thing. At times, Sutton & Barto appear to use the two terms interchangeably.

Here’s the difference.

An epsilon-soft ($$\varepsilon$$-soft) policy is any policy in which every action available in a state $$s$$ has a probability of at least some minimum value, specifically:

$$\pi(a | s) \ge \frac{\varepsilon}{| \mathcal{A}(s) |} , \forall a \in \mathcal{A}(s)$$
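To make the condition concrete, here is a minimal sketch that checks it for a tabular policy. The function name `is_epsilon_soft`, the array layout (one row per state, one column per action), and the numbers are my own assumptions for illustration, and the sketch assumes every state has the same number of actions.

```python
import numpy as np

def is_epsilon_soft(policy_probs, epsilon, tol=1e-12):
    """Return True if every action probability is at least epsilon / |A(s)|.

    policy_probs: array of shape (n_states, n_actions), rows summing to 1.
    Assumes every state has the same number of actions.
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    n_actions = policy_probs.shape[1]
    # tol absorbs floating-point rounding in epsilon / n_actions.
    return bool(np.all(policy_probs >= epsilon / n_actions - tol))

# Two states, three actions each (made-up numbers).
pi = np.array([
    [0.8, 0.1, 0.1],
    [0.5, 0.4, 0.1],
])
print(is_epsilon_soft(pi, epsilon=0.3))  # threshold is 0.3 / 3 = 0.1
print(is_epsilon_soft(pi, epsilon=0.6))  # threshold is 0.6 / 3 = 0.2
```

The same policy passes for the smaller epsilon and fails for the larger one, because the smallest probability in the table is 0.1.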

The epsilon-greedy ($$\varepsilon$$-greedy) policy is a specific instance of an epsilon-soft policy.

Specifically, an epsilon-greedy policy is always defined with respect to some action-value function $$Q(s,a)$$.

Let $$a^*$$ be the greedy action with respect to $$Q(s,a)$$, so that:

$$a^* = \arg\max_a Q(s,a)$$

Then the epsilon-greedy policy assigns the following probabilities to the actions in $$\mathcal{A}(s)$$:

$$\pi(a | s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{| \mathcal{A}(s) |}, & a = a^* \\ \frac{\varepsilon}{| \mathcal{A}(s) |}, & a \ne a^* \end{cases}$$
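As a quick sanity check on these probabilities, the sketch below builds the probability vector for a single state. The helper name `epsilon_greedy_probs` and the Q-values are hypothetical, and ties are broken by taking the first maximizing action, which is simply what `np.argmax` does.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state.

    q_values: 1-D array holding Q(s, a) for each action a in the state.
    Ties are broken by taking the first maximizing action (np.argmax).
    """
    q_values = np.asarray(q_values, dtype=float)
    n_actions = q_values.shape[0]
    # Every action starts with the exploration share epsilon / |A(s)|.
    probs = np.full(n_actions, epsilon / n_actions)
    # The greedy action receives the remaining 1 - epsilon on top.
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

probs = epsilon_greedy_probs([1.0, 3.0, 2.0], epsilon=0.3)
print(probs)        # the greedy action (index 1) gets 1 - 0.3 + 0.3 / 3
print(probs.sum())  # the probabilities should sum to 1
```

The probabilities sum to one because the greedy action's extra $$1 - \varepsilon$$ plus the $$| \mathcal{A}(s) |$$ exploration shares of $$\varepsilon / | \mathcal{A}(s) |$$ each add up to $$(1 - \varepsilon) + \varepsilon = 1$$.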

This can be accomplished using the following code:

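    import numpy as np

    def epsilon_greedy(Q, s, epsilon):
        # With probability epsilon, explore: pick uniformly among all actions.
        if np.random.random() < epsilon:
            return np.random.choice(Q.shape[1])
        # Otherwise, exploit: pick the greedy action for state s.
        return np.argmax(Q[s, :])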


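This assumes that Q is a two-dimensional NumPy-like array indexed first by state and then by action. Note that the exploratory branch can also pick the greedy action, which is why $$a^*$$ ends up with probability $$1 - \varepsilon + \varepsilon / | \mathcal{A}(s) |$$ rather than just $$1 - \varepsilon$$.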
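If you want to convince yourself that sampling this way really produces the probabilities in the formula, a rough empirical check might look like the following. This is only a sketch: it swaps in a seeded `np.random.default_rng` generator so the run is repeatable, and the Q-table values and sample count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable

def epsilon_greedy(Q, s, epsilon):
    # Explore with probability epsilon, otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s, :]))

Q = np.array([[1.0, 3.0, 2.0]])  # one state, three actions (made-up values)
epsilon = 0.3
n_samples = 100_000

# Sample many actions in state 0 and compare frequencies to the formula.
actions = [epsilon_greedy(Q, 0, epsilon) for _ in range(n_samples)]
freqs = np.bincount(actions, minlength=Q.shape[1]) / n_samples
print(freqs)  # expected to be close to [0.1, 0.8, 0.1]
```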

Based on this, we can see that all epsilon-greedy policies are epsilon-soft policies, but not all epsilon-soft policies are epsilon-greedy policies.
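Here is a small made-up example of the second half of that statement: a policy for one state that is epsilon-soft but not epsilon-greedy. The numbers and variable names are assumptions chosen for illustration. The test for "epsilon-greedy shaped" relies on the fact that an epsilon-greedy policy puts every action except one exactly at the floor $$\varepsilon / | \mathcal{A}(s) |$$.

```python
import numpy as np

epsilon = 0.3
probs = np.array([0.5, 0.4, 0.1])  # made-up policy for one state, three actions
floor = epsilon / probs.size       # epsilon / |A(s)| = 0.1

# Epsilon-soft: every action has probability at least epsilon / |A(s)|.
soft = bool(np.all(probs >= floor - 1e-12))

# Epsilon-greedy: all actions except (at most) one sit exactly at the floor.
n_at_floor = int(np.sum(np.isclose(probs, floor)))
greedy_shaped = n_at_floor >= probs.size - 1

print(soft, greedy_shaped)
```

Only one of the three actions sits at the floor here, so no choice of greedy action could make this an epsilon-greedy policy for $$\varepsilon = 0.3$$, even though every probability clears the epsilon-soft minimum.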