Bayesian Bandit Tutorial


To explain and provide code for the Bayesian bandit problem, without too much math.


Also known as the “multi-armed bandit”, the problem is as follows:

Suppose you are at a casino and have a choice between N slot machines. Each of the N slot machines (bandits) has an unknown probability of letting you win. i.e. Bandit 1 may have P(win) = 0.9. Bandit 2 may have P(win) = 0.3. We wish to maximize our winnings by playing the machine which has the highest probability of winning. The problem is determining which machine this is.




What is the inherent difficulty in determining the machine probabilities?

We know that to accurately determine the probability of an event, we have to take a large number of samples. The more samples we take, the more accurate that probability is.

To achieve the best possible accuracy, we should collect an infinite number of samples.


This should remind you of confidence intervals:


As n, the number of samples, increases, the confidence interval shrinks. Thus, to achieve maximum confidence, we should collect n = infinity samples.

Of course, that would take literally forever, so maybe we can stop at 1 million.

1 million sounds like a good number, right?

Then, we can be extremely confident in our prediction of the win rate of each bandit and that we have found the best bandit.

But wait… if one bandit is the best, that means the others are worse.

We have then played worse bandits millions of times!

We have not maximized our winnings. In fact, we may have just needlessly lost millions of times.


Explore vs. Exploit

This problem is known as explore vs. exploit. Should we “explore” – i.e. try a new machine in hopes that its probability of winning is higher? Or should we “exploit” – i.e. the machine we are using now is pretty good, so let’s keep going?

Our desire to obtain the most accurate estimate of the bandit win rates is fundamentally at odds with our desire to play only the bandit which has the highest win rate.

There are real-world applications to this dilemma. For example, suppose you are an online advertising platform and you want to show ads to users that have a high probability of being clicked. Do you show an ad that has a reasonably high probability so far, or do you choose a new ad that could have an even higher click-through rate (CTR)? Suppose you are the New York Times and you want to increase the click-through of articles on your home page. Should you post the same articles that are already known to be popular, or should you show a fresh article that you predict might be even more popular?

The trick is to use Bayes’ rule.


$$ p(x | y) = \frac{p(y | x)p(x)}{p(y)} $$


The Bayesian Bandit Solution

The idea: let’s not pull each arm 1000 times to get an accurate estimate of its probability of winning. Instead, let’s use the data we’ve collected so far to determine which arm to pull. If an arm doesn’t win that often, but we haven’t sampled it too much, then its probability of winning is low, but our confidence in that estimate is also low – so let’s give it a small chance in the future. However, if we’ve pulled an arm many times and it wins often, then its probability of winning is high, and our confidence in that estimate is also high – let’s give that arm a higher chance of being pulled.

Let’s rewrite Bayes’ rule for our problem:

$$ p(\theta | x) = \frac{p(x | \theta)p(\theta)}{p(x)} $$

Where \( \theta \) is the parameter we’re trying to estimate – win rate, and \( x \) is the data we’ve collected so far (# wins and # losses).


  • \( p(x | \theta) \) is the “likelihood” (i.e. the probability of observing the data x given our current belief about the parameters theta)
  • \( p(\theta) \) is the “prior” – our current assumption about the parameters
  • \( p(\theta | x) \) is the “posterior” – our updated assumption about the parameters after observing the data x

The “trick” in Bayesian bandits has to do with something mathematicians call “conjugate priors”. It saves us from having to do the actual calculation for Bayes’ rule.

We choose the Beta distribution for some convenient reasons (even though there is no deductive reason for us to do so):

  • It can have values between 0 and 1, which is nice for things like click-through rate, or in other words, for representing the probability of winning.
  • When the prior is a Beta distribution and the likelihood is a Bernoulli distribution, the posterior is also a Beta distribution.
  • The update formula for determining the posterior only involves addition and subtraction, whereas for other distributions it can be more complicated (I will show how the parameters are related below).

The Beta distribution has 2 parameters, a and b, which govern its shape. We refer to a specific distribution as \( Beta(a,b) \)



Initially, we set a=1 and b=1. This results in a uniform distribution, i.e. we assume nothing about our chances of winning – all probabilities of winning are equally likely. This is called a minimally informative prior.

To update the posterior given current data, we do the following:

  • Let our prior parameters be a, b.
  • Let x be the result of pulling an arm (either 0 or 1).
  • Then our posterior parameters are: a’ = a + x, b’ = b + 1 – x

Thus, our updated distribution is  \( Beta(a + x, b + 1 – x) \).

The algorithm for Bayesian Bandits

for number_of_trials:
  take random sample from each bandit with its current (a, b)
  for the bandit with the largest sample, pull its arm
  update that bandit with the data from the last pull

A visual representation of Bayesian Bandits

Below I plot the distributions of each Beta after a certain number of trials.







What is happening here? Notice that the red distribution, which has the highest probability of winning, gets sharper and sharper as the number of trials increases.

The key is in the sampling stage. When we sample from each distribution, the red distribution is much more likely to give us a number around 0.75, whereas the green and blue distributions will give us a number in a much wider range.

So once the red bandit has “proved itself” we are much more likely to choose it in the future, and we just don’t bother to choose the lower performing bandits, although there is a small chance we could.

A fat distribution means more “exploration”. A sharp distribution means more “exploitation” (if it has a relative high win rate). Note that as the programmer, you don’t choose whether or not to explore or exploit. You sample from the distributions, meaning it is randomly decided whether you should explore or exploit. Essentially the decision is random, with a bias toward bandits who have proven themselves to win more.

See the code here:…/

Or check out the full video course here (85% off coupon automatically applied):

Bayesian Machine Learning in Python: A/B Testing

These same concepts are explored more in-depth in my Reinforcement Learning course (89% off coupon automatically applied):

Artificial Intelligence: Reinforcement Learning in Python