The goal of this post is to explain the Bayesian bandit problem and provide code for it, without too much math.
Motivation
Also known as the “multi-armed bandit”, the problem is as follows:
Suppose you are at a casino and have a choice between N slot machines. Each of the N slot machines (bandits) has an unknown probability of letting you win. i.e. Bandit 1 may have P(win) = 0.9. Bandit 2 may have P(win) = 0.3. We wish to maximize our winnings by playing the machine which has the highest probability of winning. The problem is determining which machine this is.
What is the inherent difficulty in determining the machine probabilities?
We know that to accurately determine the probability of an event, we have to take a large number of samples. The more samples we take, the more accurate that probability is.
To achieve the best possible accuracy, we should collect an infinite number of samples.
This should remind you of confidence intervals:
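For example, the usual normal-approximation confidence interval for a win rate estimated from \( n \) plays is:
$$ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $$
Where \( \hat{p} \) is the observed win rate.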
As n, the number of samples, increases, the confidence interval shrinks. Thus, to achieve maximum confidence, we should collect n = infinity samples.
Of course, that would take literally forever, so maybe we can stop at 1 million.
1 million sounds like a good number, right?
Then, we can be extremely confident in our prediction of the win rate of each bandit and that we have found the best bandit.
But wait… if one bandit is the best, that means the others are worse.
We have then played worse bandits millions of times!
We have not maximized our winnings. In fact, we may have just needlessly lost millions of times.
Explore vs. Exploit
This problem is known as explore vs. exploit. Should we “explore” – i.e. try a new machine in hopes that its probability of winning is higher? Or should we “exploit” – i.e. the machine we are using now is pretty good, so let’s keep going?
Our desire to obtain the most accurate estimate of the bandit win rates is fundamentally at odds with our desire to play only the bandit which has the highest win rate.
There are real-world applications to this dilemma. For example, suppose you are an online advertising platform and you want to show ads to users that have a high probability of being clicked. Do you show an ad that has a reasonably high probability so far, or do you choose a new ad that could have an even higher click-through rate (CTR)? Suppose you are the New York Times and you want to increase the click-through of articles on your home page. Should you post the same articles that are already known to be popular, or should you show a fresh article that you predict might be even more popular?
The trick is to use Bayes’ rule.
$$ p(x | y) = \frac{p(y | x)p(x)}{p(y)} $$
The Bayesian Bandit Solution
The idea: let’s not pull each arm 1000 times to get an accurate estimate of its probability of winning. Instead, let’s use the data we’ve collected so far to decide which arm to pull. If an arm hasn’t won often, but we also haven’t sampled it much, then our estimate of its win rate is low, but so is our confidence in that estimate, so let’s still give it some chance of being chosen in the future. However, if we’ve pulled an arm many times and it wins often, then our estimate of its win rate is high, and our confidence in that estimate is also high, so let’s give that arm a greater chance of being pulled.
In our problem, Bayes’ rule becomes:
$$ p(\theta | x) = \frac{p(x | \theta)p(\theta)}{p(x)} $$
Where \( \theta \) is the parameter we’re trying to estimate (the win rate), and \( x \) is the data we’ve collected so far (# wins and # losses).
Recall:
\( p(x | \theta) \) is the “likelihood” (i.e. the probability of observing the data x given the parameter theta)
\( p(\theta) \) is the “prior” – our current assumption about the parameters
\( p(\theta | x) \) is the “posterior” – our updated assumption about the parameters after observing the data x
The “trick” in Bayesian bandits has to do with something mathematicians call “conjugate priors”. It saves us from having to do the actual calculation for Bayes’ rule.
We choose the Beta distribution for some convenient reasons (even though there is no deductive reason for us to do so):
Its values lie between 0 and 1, which is exactly what we need to represent a probability, such as a win rate or a click-through rate.
When the prior is a Beta distribution and the likelihood is a Bernoulli distribution, the posterior is also a Beta distribution.
The update formula for determining the posterior only involves addition and subtraction, whereas for other distributions it can be more complicated (I will show how the parameters are related below).
The Beta distribution has 2 parameters, a and b, which govern its shape. We refer to a specific distribution as \( Beta(a,b) \).
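For reference, its density over win rates \( \theta \in [0, 1] \) is:
$$ p(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)} $$
Where \( B(a, b) \) is just a normalizing constant.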
Initially, we set a=1 and b=1. This results in a uniform distribution, i.e. we assume nothing about our chances of winning – all probabilities of winning are equally likely. This is called a minimally informative prior.
To update the posterior given current data, we do the following:
Let our prior parameters be a, b.
Let x be the result of pulling an arm (either 0 or 1).
Then our posterior parameters are: \( a' = a + x \), \( b' = b + 1 - x \)
Thus, our updated distribution is \( Beta(a + x, b + 1 - x) \).
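As a minimal sketch in Python (the function name here is my own), the update is just:

```python
def update_posterior(a, b, x):
    """Conjugate Beta-Bernoulli update: x is 1 for a win, 0 for a loss."""
    return a + x, b + 1 - x

# Starting from the uniform prior Beta(1, 1):
a, b = update_posterior(1, 1, 1)  # win  -> Beta(2, 1)
a, b = update_posterior(a, b, 0)  # loss -> Beta(2, 2)
```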
The algorithm for Bayesian Bandits
for number_of_trials:
take random sample from each bandit with its current (a, b)
for the bandit with the largest sample, pull its arm
update that bandit with the data from the last pull
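Here is a minimal runnable sketch of this loop in Python using numpy. The three win probabilities and the number of trials are made up purely for illustration; they are not part of the algorithm.

```python
import numpy as np

np.random.seed(0)

true_win_rates = [0.3, 0.5, 0.75]    # unknown to the algorithm, used only to simulate pulls
a = np.ones(len(true_win_rates))     # Beta "a" parameter for each bandit
b = np.ones(len(true_win_rates))     # Beta "b" parameter for each bandit

for trial in range(2000):
    # take a random sample from each bandit's current Beta(a, b)
    samples = np.random.beta(a, b)
    # for the bandit with the largest sample, pull its arm
    j = np.argmax(samples)
    x = float(np.random.rand() < true_win_rates[j])  # 1.0 if we win, 0.0 if we lose
    # update that bandit with the data from the last pull
    a[j] += x
    b[j] += 1 - x

print("posterior means:", a / (a + b))
print("times each arm was pulled:", (a + b - 2).astype(int))
```

If you run this, you should see that the arm with the highest true win rate collects the vast majority of the pulls.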
A visual representation of Bayesian Bandits
Below I plot the distributions of each Beta after a certain number of trials.
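A sketch of how such a plot could be generated with scipy and matplotlib, using illustrative (a, b) values, might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# illustrative posterior parameters after some number of trials
posteriors = [("blue bandit", 3, 7), ("green bandit", 12, 14), ("red bandit", 150, 50)]

theta = np.linspace(0, 1, 200)
for label, a, b in posteriors:
    color = label.split()[0]
    plt.plot(theta, stats.beta.pdf(theta, a, b), color=color, label=f"{label}: Beta({a}, {b})")
plt.xlabel("win rate")
plt.ylabel("density")
plt.legend()
plt.show()
```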
What is happening here? Notice that the red distribution, which has the highest probability of winning, gets sharper and sharper as the number of trials increases.
The key is in the sampling stage. When we sample from each distribution, the red distribution is much more likely to give us a number around 0.75, whereas the green and blue distributions will give us a number in a much wider range.
So once the red bandit has “proved itself”, we are much more likely to choose it in the future, and we rarely bother to choose the lower-performing bandits, although there is still a small chance we will.
A fat distribution means more “exploration”. A sharp distribution means more “exploitation” (if it has a relatively high win rate). Note that as the programmer, you don’t choose whether to explore or exploit. You sample from the distributions, so whether you explore or exploit is decided randomly, with a bias toward bandits that have proven themselves to win more.
Multiple Linear Regression
Today we will continue our discussion of linear regression by extending the ideas from simple linear regression to multiple linear regression.
Recall that in simple linear regression, the input is 1-D. In multiple linear regression, the input is N-dimensional (any number of dimensions). The output is still just a scalar (1-D).
So now our input data looks like this:
(X1, Y1), (X2, Y2), …, (Xm, Ym)
Where X is a vector and Y is a scalar.
But now instead of our hypothesis, h(), looking like this:
h(X) = aX + b
It looks like this:
$$ h(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n $$
Where each subscripted \( x \) is a scalar.
\( \beta_0 \) is also known as the “bias term”.
Another, more compact way of writing this is:
$$ h(x) = \beta^T x $$
Where \( \beta \) and \( x \) are vectors. When we transpose the first vector and multiply, this is also called a “dot product” or “inner product”.
In this representation, we introduce a dummy variable \( x_0 = 1 \), so that \( \beta \) and \( x \) both contain the same number of elements (n+1).
In the case where the dimensionality of the input data is 2, we can still visualize our model, which is no longer a line, but a “plane of best fit”.
To solve for the beta vector, we do the same thing we did for simple linear regression: define an error function J (we’ll use the sum of squared errors again), take the derivative of J with respect to each parameter (\( \beta_0, \beta_1, \ldots \)), and set the derivatives to 0 to solve for each beta.
This is a lot more tedious than in the 1-D case, but I would suggest as an exercise attempting at least the 2-D case.
As before, there is a “closed form” solution for \( \beta \):
$$ \beta = \left( \sum_{i=1}^{m} X_i X_i^T \right)^{-1} \sum_{i=1}^{m} Y_i X_i $$
Here, each \( (X_i, Y_i) \) is a “sample” from the data.
Notice that in the first term we transpose the second \( X_i \). This is an “outer product” and the result is an (n+1) x (n+1) matrix.
The superscript -1 denotes a matrix inverse.
An even more compact form of this equation arises when we consider all the samples of X together in an m x (n+1) matrix, and all the samples of Y together in an m x 1 matrix:
$$ \beta = (X^T X)^{-1} X^T Y $$
As in the 1-D case, we use the R-square to measure how well the model fits the actual data (the formula is exactly the same).
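As a sketch, the matrix form translates directly into numpy; the data below is randomly generated just to illustrate the shapes involved:

```python
import numpy as np

np.random.seed(0)

# fake data: m samples, n-dimensional inputs
m, n = 100, 3
X = np.random.randn(m, n)
Y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * np.random.randn(m)

# prepend the dummy column x0 = 1 so that X is m x (n+1)
X = np.column_stack([np.ones(m), X])

# solve (X^T X) beta = X^T Y rather than explicitly inverting X^T X
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# R-square: 1 - SS(residual) / SS(total)
Y_hat = X @ beta
r2 = 1 - np.sum((Y - Y_hat) ** 2) / np.sum((Y - Y.mean()) ** 2)
print("beta:", beta)
print("R-square:", r2)
```

Using np.linalg.solve is numerically preferable to forming the inverse explicitly, but both correspond to the same closed-form solution.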
Simple Linear Regression
Linear regression is one of the simplest machine learning techniques you can use. It is often useful as a baseline relative to more powerful techniques.
To start, we will look at a simple 1-D case.
Like all regressions, we wish to map some input X to some output Y.
i.e.
Y = f(X)
With linear regression:
Y = aX + b
Or we can say:
h(X) = aX + b
Where “h” is our “hypothesis”.
You may recall from your high school studies that this is just the equation for a straight line.
When X is 1-D, or when “Y has one explanatory variable”, we call this “simple linear regression”.
When we use linear regression, we are using it to model linear relationships, or what we think may be linear relationships.
As with all supervised machine learning problems, we are given labeled data points:
(X1, Y1), (X2, Y2), (X3, Y3), …, (Xn, Yn)
And we will try to fit the line (aX + b) as best we can to these data points.
This means we have to optimize the parameters “a” and “b”.
How do we do this?
We will define an error function and then find the “a” and “b” that will make the error as small as possible.
You will see that many regression problems work this way.
What is our error function?
We could use the difference between the predicted Y and the actual Y, like so:
$$ E = \sum_{i=1}^{n} (y_i - \hat{y}_i) $$
Where \( \hat{y}_i = h(x_i) \) is our prediction for the i-th data point.
But if we had equal amounts of error where Y was bigger than the prediction and where Y was smaller than the prediction, the errors would cancel out, even though the individual errors might be large.
Typically in machine learning, the squared error is a good place to start:
$$ J = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
Now, whether the difference between the actual and predicted output is positive or negative, its contribution to the total error is still positive.
We call this sum the “sum of squared errors”.
Recall that we want to minimize it.
Recall from calculus that to minimize something, you want to take its derivative.
Because there are two parameters, we have to take the derivatives both with respect to a and with respect to b, set them to 0, and solve for a and b.
Luckily, because the error function is quadratic, it has a single minimum: the error only increases as (a, b) get further and further away from that minimum, so setting the derivatives to 0 finds it.
As an exercise I will let you calculate the derivatives.
You will get 2 equations (the derivatives) and 2 unknowns (a, b). From high school math you should know how to solve this by rearranging the terms.
Note that these equations can be solved analytically. Meaning you can just plug and chug the values of your inputs and get the final value of a and b by blindly using a formula.
Note that this method is also called “ordinary least squares”.
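As a sketch of the result, here is the closed-form solution applied in numpy; the data is made up, and the derivation of the formulas for a and b is still left as the exercise above:

```python
import numpy as np

np.random.seed(0)

# fake 1-D data, roughly following Y = 2X + 1
X = np.random.rand(50)
Y = 2 * X + 1 + 0.1 * np.random.randn(50)

# closed-form ordinary least squares solution for Y = aX + b
a = (X.dot(Y) - X.mean() * Y.sum()) / (X.dot(X) - X.mean() * X.sum())
b = Y.mean() - a * X.mean()

print("a:", a, "b:", b)
```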
Measuring the error (R-squared)
To determine how well our model fits the data, we need a measure called the “R-square”.
Note that in classification problems, we can simply use the “classification rate”, which is the number of correctly classified inputs divided by the total number of inputs. With the real-valued outputs we have in regression, this is not possible.
Here are the equations we use to compute the R-square:
$$ R^2 = 1 - \frac{SS_{residual}}{SS_{total}} $$
$$ SS_{residual} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad SS_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$
SS(residual) is the sum of squared error between the actual and predicted output. This is the same as the error we were trying to minimize before!
SS(total) is the sum of squared error between each sample output and the mean of all the sample outputs, i.e. what the residual error would be if we just predicted the average output every time.
So the R-square, then, is just a measure of how much better our model is than predicting the mean every time. If we just predicted the mean each time, then SS(residual) = SS(total), so the R-square would be 1 - 1 = 0. If our model is perfect, then SS(residual) = 0, so the R-square would be 1 - 0 = 1.
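Continuing the numpy sketch from above, the R-square is computed directly from those definitions:

```python
Y_hat = a * X + b                        # predictions from the fitted line
ss_residual = np.sum((Y - Y_hat) ** 2)   # error our model makes
ss_total = np.sum((Y - Y.mean()) ** 2)   # error from always predicting the mean
r2 = 1 - ss_residual / ss_total
print("R-square:", r2)
```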
Something to think about: If our model performs worse than predicting the mean each time, what would be the R-square value?
Limitations of Linear Regression
It only models linear relationships. You can model higher-order polynomials, e.g. \( h(x) = \beta_0 + \beta_1 x + \beta_2 x^2 \) (covered in a later post), but the model is still linear in its parameters.
It is sensitive to outliers, meaning that a single data point very far away from all the others can “pull” the regression line in its direction, away from all the other data points, just to minimize the squared error.