Learn more about Reinforcement Learning in my course (75% off when you use the following link):
Artificial Intelligence: Reinforcement Learning in Python
A common question I get in my Reinforcement Learning class is:
“What is the difference between epsilon-greedy and epsilon-soft policies?”
At first glance, they may seem like the same thing, and at times Sutton & Barto appear to use the two terms interchangeably.
Here’s the difference.
An epsilon-soft (\( \varepsilon \)-soft) policy is any policy where the probability of every action, given a state \( s \), is at least some minimum value. Specifically:
$$ \pi(a | s) \ge \frac{\varepsilon}{| \mathcal{A}(s) |}, \quad \forall a \in \mathcal{A}(s) $$
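As a quick sanity check of this condition, here is a minimal sketch in Python (the function name and the representation of the policy as a probability vector over actions are my own, just for illustration):

import numpy as np

def is_epsilon_soft(action_probs, epsilon):
    # The condition: every action's probability is at least epsilon / |A(s)|
    n_actions = len(action_probs)
    return bool(np.all(action_probs >= epsilon / n_actions))

# With epsilon = 0.2 and 4 actions, the minimum per-action probability is 0.05
print(is_epsilon_soft(np.array([0.85, 0.05, 0.05, 0.05]), 0.2))  # True
print(is_epsilon_soft(np.array([0.97, 0.01, 0.01, 0.01]), 0.2))  # False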
The epsilon-greedy (\( \varepsilon \)-greedy) policy is a specific instance of an epsilon-soft policy.
Specifically, an epsilon-greedy policy is defined with respect to an action-value function \( Q(s,a) \): with probability \( 1 - \varepsilon \) it takes the greedy action, and with probability \( \varepsilon \) it picks an action uniformly at random.
Let \( a^* \) be the greedy action with respect to \( Q(s,a) \), so that:
$$ a^* = \arg\max_a Q(s,a) $$
Then the epsilon-greedy policy assigns the following probabilities to the actions in \( \mathcal{A}(s) \):
$$ \pi(a | s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{| \mathcal{A}(s) |}, & a = a^* \\ \frac{\varepsilon}{| \mathcal{A}(s) |}, & a \ne a^* \end{cases} $$
This can be accomplished using the following code:
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    # With probability epsilon, explore: pick an action uniformly at random
    if np.random.random() < epsilon:
        return np.random.choice(Q.shape[1])
    # Otherwise, exploit: pick the greedy action for state s
    else:
        return np.argmax(Q[s, :])
This assumes that Q is a NumPy array with 2 dimensions, corresponding to the possible states and actions.
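Equivalently, instead of sampling an action we can write out the full distribution \( \pi(\cdot | s) \) that this sampling procedure induces. The sketch below (the function name is my own, just for illustration) gives every action the baseline probability \( \varepsilon / | \mathcal{A}(s) | \) and adds the remaining \( 1 - \varepsilon \) to the greedy action, matching the two cases above:

import numpy as np

def epsilon_greedy_probs(Q, s, epsilon):
    # Baseline probability epsilon / |A(s)| for every action
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    # The greedy action gets the remaining 1 - epsilon on top of the baseline
    a_star = np.argmax(Q[s, :])
    probs[a_star] += 1.0 - epsilon
    return probs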
Based on this, we can see that all epsilon-greedy policies are epsilon-soft policies, but not all epsilon-soft policies are epsilon-greedy policies. For example, with \( \varepsilon = 0.2 \) and 4 actions, the distribution \( (0.4, 0.3, 0.2, 0.1) \) is epsilon-soft (every probability is at least \( 0.05 \)), but it is not epsilon-greedy, since no action receives probability \( 1 - \varepsilon + \varepsilon / 4 = 0.85 \).
Learn more about Reinforcement Learning in my course (75% off when you use the following link):