Previously we looked at the Bayes classifier for MNIST data, using a multivariate Gaussian to model each class.

We use the same dimensionality reduced dataset here.

The K-Nearest Neighbor (KNN) classifier is also often used as a “simple baseline” classifier, but there are a couple distinctions from the Bayes classifier that are interesting.

1) KNN does not use probability distributions to model data. In fact, many powerful classifiers do not assume any probability distribution on the data.

2) KNN is a “lazy” classifier. No work is actually done to train the model. It just saves the input points X and all the labels Y.

At classification time, the predicted class/label is chosen by looking at the “k nearest neighbors” of the input test point.

We choose “k” beforehand.

Here is a visual example for k = 3:

So for 3-NN, we go through all the training points, find the 3 closest ones, and choose the label that shows up the most.

i.e. If we get 2 whites and 1 black, we choose white. This is called “voting”.

Here is another example, where we might get a different answer depending on whether k = 3 or k = 5.

The idea is, our data should be clustered in “space”. Points from the same class should be near each other. In fact, this is an idea that prevails for pretty much all machine learning algorithms, so you should understand it well.

If data points that are near each other are from the same class, then it follows that a test point can be classified by looking at its neighbors’ classes.

In pseudocode, it might look like this:

function predict(x’):
Q = [] // we’ll use this to store the k-nearest neighbors
for x,y in training data:
if Q.length < k or dist(x,x’) < max distance in Q:
add dist(x,x’) and y to Q
if Q.length > k: remove max distance in Q
return the y that shows up most in Q

A naive implementation might look through every training point, which is O(N), and then to find the max in Q it would be O(k), for a total of O(kN).

You could use a sorted data structure for Q making the search and insert O(logk).

The Naive Bayes classifier is a simple classifier that is often used as a baseline for comparison with more complex classifiers.

It is also conceptually very simple and as you’ll see it is just a fancy application of Bayes rule from your probability class.

We will use the famous MNIST data set for this tutorial.

The MNIST dataset is a set of handwritten digits, and our job is to build a computer program that takes as input an image of a digit, and outputs what digit it is.

Recall Bayes rule:

$$ P(c | x) = \frac{P(x | c)P(c)}{P(x)} $$

If you’re like me, you may have found this notation a little confusing at first.

Here \( x \) represents the image, or more precisely, the pixel values of the image formatted as a vector, and \( c \) represents the digit, which can be 0, 1, …, 9.

Sidenote: Images are grayscale and of size 28×28, which, if flattened, yields a vector of length 28×28=784. If you look at the code (linked below) you can see how this is done, and also that we scale the pixel values to be between 0…1. Scaling isn’t strictly necessary, but can be useful for many machine learning algorithms.

We can read the left side \( P(c | x) \) as “the probability that the class is \( c \) given the data \( x \)”. (this is called the “posterior”)

We can read the right side \( P(x | c) \) as “the probability that the data \( x \) belongs to the class \( c \)”. (this is called the “likelihood”)

One little efficiency trick we can do:

We don’t actually care about the value of \( P(c | x) \).

We care about the value of \( c \) itself. That tells us “which digit” the image belongs to.

The class chosen is simply the one that yields the highest probability for that data:

You will notice that \( P(x) \) is constant for all values of \( c \) in \( P(c | x) \).

So when I take the argmax over \( \frac{ P(x | c)P(c) }{ P(x) } \) I can ignore \( P(x) \).

As a simple example, suppose I have that \( A > B \), or numerically, say, \( 10 > 5 \).

If I multiply or divide these numbers by a positive constant (\( P(x) \) will always be positive) then the relationship still holds.

Ex. \( 2A > 2B \), or \( 20 > 10 \).

Using this information, we can simplify our problem so that, in order to choose “which digit” given an image, all we need to do is calculate this argmax (notice \( P(x) \) is removed):

$$ c^* = argmax_{c}{ P(x | c)P(c) }$$

The next step we can take is to think about how to calculate \( P(c) \).

This is just counting.

If I have 100 students in my class, and I want to figure out the probability that a student is born in January, how can I do that?

I simply count up all the students born in January, and divide by the total number of students.

If I want to know the probability of getting heads when I flip a coin, I flip the coin a bunch of times, and divide the number of heads by the total number of coin flips.

The challenge is choosing a model that accurately fits the data for \( P(x | c) \).

As a thought-exercise, think about how you’d do this naively.

Each image is of size 28×28, which means there are 784 (=28×28) pixels per image. Each pixel can take on integer values in the range 0..255 inclusive.

So, if you modeled this as a discrete probability distribution, you’d have \( 255^{784} \) different possibilities. That’s way more than the number of images you have (~50, 000), and hence, you’d never be able to use the “counting method” (used above for calculating \( P(c) \)) to accurately measure those probabilities.

To make the problem tractable and easily computable, we recall that pixels represent light intensity, and light intensity is actually continuous. It’s only discrete inside a computer because computers are discrete.

A reasonable first-guess for modeling continuous data is the multivariate Gaussian or the multivariate Normal.

Note that because the data are continuous, we are not actually calculating probabilities, but probability densities, on the right for \( P(x | c) \). Luckily, Bayes rule still holds for probability densities.

Another thing to note is that because probabilities are very small when dimensionality is high, we’ll work with log-likelihood instead of likelihood. Then instead of getting numbers very close to 0, which is inaccurate when using a computer to represent them, we’ll just get negative numbers.

Which you should try to derive yourself. (D is the dimensionality)

By the way, to calculate \( \mu \) and \( \Sigma \), you can use the sample mean and covariance: https://en.wikipedia.org/wiki/Sample_mean_and_covariance. Note that it’s \( P(x | c) \), not just \( P(x) \). So, if we want to calculate the mean and covariance for all the images of the digit 9, then we’d first grab only the images of the digit 9 (ignoring the rest of the images), and calculate the sample mean and covariance from this subset.

Earlier, we wanted the argmax over \( P( x | c)P(c) \). Since \( log(AB) = log(A) + log(B) \), then using log probabilities, we can choose the digit class using:

This works since the \( log() \) function is monotonically increasing. If \( A > B \) then \( log(A) > log(B) \). Try any 2 numbers on your calculator if you don’t believe me.

Now this problem is tractable.

Training the classifier would work as follows:

For each class = 0..9
get all x’s (images) for the class
save the mean and covariance of those x’s with the class

Prediction using the classifier would work as follows:

Given a test point x:
Calculate the probability that x is in each class c
Return the class c that yields the highest posterior probability

What makes a Bayes classifier a Naive Bayes classifier?

So far we have only discussed general Bayes classifiers.

A Naive Bayes classifier is one that assumes all input dimensions of \( x \) are independent.

Recall that when 2 random variables \( A \) and \( B \) are independent, their joint probability can be expressed as the product of their individual probabilities:

$$ P(A, B) = P(A)P(B) $$

For the Gaussian case, this means that instead of having a joint Gaussian with a full covariance matrix \( \Sigma \), we can instead express it as a product of 784 individual univariate Gaussians:

One advantage of Naive Bayes is that we don’t have to worry about any interactions between any \( x_i \) and \( x_j \) if \( i \neq j \).

In more practical terms, before we had \( \Sigma \), which is of size \( D \times D = 784^2 \).

Now, we only have \( \sigma_i^2, i=1…784 \).

That’s 784 times less numbers you have to store.

You can express the joint distribution of 784 individual univariate Gaussians as one big multivariate Gaussian, it just means that the covariance matrix \( \Sigma \) will have zeros everywhere except along the diagonal, which just stores the 784 univariate variances.

Having 784 individual variances means we don’t have to invert \( \Sigma \) to calculate the PDF or log PDF, which leads to even more savings.

The downside of Naive Bayes is that the Naive assumption (that all input dimensions are independent) is most often incorrect.

So what do we get after training our model?

Visually, we can “see” what the model has learned by plotting the mean \( \mu \) for each class.

Here are the plots you’d get:

As you’d expect, the mean of each class very closely captures what that digit typically looks like.

We get about 94% accuracy on the test set, which is pretty good!

Notice how we achieve only about 80% on the test set with Naive Bayes, due to the fact that the Naive assumption is pretty obviously not correct. E.g. if we’re looking at a black pixel at one of the corners, are the pixels around it also not very likely to be black?

A problem that often arises when you’re dealing with lots of data is that it takes forever to process.

So you SSH into your Amazon EC2 machine, start your script, and go do other things while it’s running.

You check back a few hours later to see that your SSH session seems to be frozen or you’ve gotten a “broken pipe” error.

You log back in to your EC2 machine, only to discover your script has terminated.

What to do…

Screen for Linux

Screen is like having “tabs” for your command line. You can have multiple screens running at any time. They stay active, so even if you exit from a screen, or exit your entire SSH session, whatever you were doing inside that screen will continue.

This is great if you have a script that takes longer than EC2’s allowed session duration.

Start Screen

To start screen, just enter:

screen

Hit enter to exit out of this info screen.

How Screen Works

Once you’ve started screen, there are a few things you can do:

1. Create a screen

2. Detach from a screen (i.e. go back to your “regular” terminal)

3. Re-attach to a screen (that you’ve previously detached from).

4. List the screens you have open

This is all you need for the use case described above.

Enter commands in Screen

Once inside Screen, you can tell screen you’re about to enter a command by pressing:

Ctrl + A

Now Screen knows you’re about to enter a command.

Create a Screen

After hitting “Ctrl+A”, hit “C”.

Detach from a Screen

After hitting “Ctrl+A”, hit “D”.

Re-attach to a Screen

screen -r

List the Screens you have open

The above only works if you have only one screen open. Otherwise, you’ll see this:

If you want to re-attach to a particular screen, enter:

screen -r 14366.pts-0.affinity-proto

(Obviously you would choose the screen you want to go back to).