New course! Bayesian Machine Learning in Python: A/B Testing

November 17, 2016


[If you already know you want to sign up for my Bayesian machine learning course, just scroll to the bottom to get your $10 coupon!]

Boy, do I have some exciting news today!

You guys have already been keeping up with my deep learning series.

Hopefully, you’ve noticed that I’ve been releasing non-deep learning machine learning courses as well, in parallel (and they often tie into the deep learning series quite nicely).

Well today, I am announcing the start of a BRAND NEW series on Bayesian machine learning.

Bayesian methods require an entirely new way of thinking – a paradigm shift.

But don’t worry, it’s not just all theory.

In fact, the first course I’m releasing in the series is VERY practical – it’s on A/B testing.

Every online advertiser, e-commerce store, and marketing team does A/B testing.

But did you know that traditional A/B testing is both horribly confusing and inefficient?

Did you know that there are cool, new adaptive methods inspired by reinforcement learning that improve on those old crusty tests?

(Those old methods, and the way they are traditionally taught, are probably the reason you cringe when you hear the word “statistics”)

Well, Bayesian methods not only represent a state-of-the-art solution to many A/B testing challenges, they are also surprisingly theoretically simpler!

You’ll end the course by doing your own simulation – comparing and contrasting the various adaptive A/B testing algorithms (including the final Bayesian method).

This is VERY practical stuff and any digital media, newsfeed, or advertising startup will be EXTREMELY IMPRESSED if you know this stuff.

This WILL advance your career, and any company would be lucky to have someone with these skills on their team.

Awesome coincidence #1: As I mentioned above, a lot of these techniques cross-over with reinforcement learning, so if you are itching for a preview of my upcoming deep reinforcement learning course, this will be very interesting for you.

Awesome coincidence #2: Bayesian learning also crosses over with deep learning, one example being the variational autoencoder, which I may incorporate into a more advanced deep learning course in the future. They heavily rely on concepts from both Bayesian learning AND deep learning, and are very powerful state-of-the-art algorithms.

Due to all the Black Friday madness going on, I am going to do a ONE-TIME ONLY $10 special for this course. With my coupons, the price will remain at $10, even if Udemy’s site-wide sale price goes up (which it will).

See you in class!

As promised, here is the coupon:

UPDATE: The Black Friday sale is over, but the early bird coupon is still up for grabs:

LAST THING: Udemy is currently having an awesome Black Friday sale. $10 for ANY course starting Nov 15, but the price goes up by $1 every 2 days, so you need to ACT FAST.

I was going to tell you earlier but I was hard at work on my course. =)

Just click this link to get ANY course on Udemy for $10 (+$1 every 2 days):

#bayesian #data science #machine learning #statistics

Go to comments

Announcing Data Science: Supervised Machine Learning in Python (Less Math, More Action!)

September 16, 2016


If you don’t want to read about the course and just want the 88% OFF coupon code, skip to the bottom.

In recent years, we’ve seen a resurgence in AI, or artificial intelligence, and machine learning.

Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on par with human experts.

Google’s AlphaGo program was able to beat a world champion at the strategy game Go using deep reinforcement learning.

Machine learning is even being used to program self-driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.

Google famously announced that they are now “machine learning first”, meaning that machine learning is going to get a lot more attention now, and this is what’s going to drive innovation in the coming years.

Machine learning is used in many industries, like finance, online advertising, medicine, and robotics.

It is a widely applicable tool that will benefit you no matter what industry you’re in, and it will also open up a ton of career opportunities once you get good.

Machine learning also raises some philosophical questions. Are we building a machine that can think? What does it mean to be conscious? Will computers one day take over the world?

The best part about this course is that it requires WAY less math than my usual courses; just some basic probability and geometry, no calculus!

In this course, we are first going to discuss the K-Nearest Neighbor algorithm. It’s extremely simple and intuitive, and it’s a great first classification algorithm to learn. After we discuss the concepts and implement it in code, we’ll look at some ways in which KNN can fail.

It’s important to know both the advantages and disadvantages of each algorithm we look at.

Next we’ll look at the Naive Bayes Classifier and the General Bayes Classifier. This is a very interesting algorithm to look at because it is grounded in probability.

We’ll see how we can transform the Bayes Classifier into a linear and quadratic classifier to speed up our calculations.

Next we’ll look at the famous Decision Tree algorithm. This is the most complex of the algorithms we’ll study, and most courses you’ll look at won’t implement them. We will, since I believe implementation is good practice.

The last algorithm we’ll look at is the Perceptron algorithm. Perceptrons are the ancestor of neural networks and deep learning, so they are important to study in the context of machine learning.

Once we’ve studied these algorithms, we’ll move on to more practical machine learning topics: hyperparameters, cross-validation, feature extraction, feature selection, and multiclass classification.

We’ll do a comparison with deep learning so you understand the pros and cons of each approach.

We’ll discuss the Scikit-Learn library, because even though implementing your own algorithms is fun and educational, you should use optimized and well-tested code in your actual work.

We’ll cap things off with a very practical, real-world example by writing a web service that runs a machine learning model and makes predictions. This is something that real companies do and make money from.

All the materials for this course are FREE. You can download and install Python, Numpy, and Scipy with simple commands on Windows, Linux, or Mac.

UPDATE: New coupon if the above is sold out:

#data science #machine learning #matplotlib #numpy #pandas #python


New course – Deep Learning part 5: Recurrent Neural Networks in Python

July 14, 2016


New course out today – Recurrent Neural Networks in Python: Deep Learning part 5.

If you already know what the course is about (recurrent units, GRU, LSTM), grab your 50% OFF coupon and go!:

Like the course I just released on Hidden Markov Models, Recurrent Neural Networks are all about learning sequences. But whereas Markov Models are limited by the Markov assumption, Recurrent Neural Networks are not; as a result, they are more expressive and more powerful, and have driven progress on tasks that had been stuck for decades.

Sequences appear everywhere – stock prices, language, credit scoring, and webpage visits.

Recurrent neural networks have a history of being very hard to train. It hasn’t been until recently that we’ve found ways around what is called the vanishing gradient problem, and since then, recurrent neural networks have become one of the most popular methods in deep learning.

If you took my course on Hidden Markov Models, we are going to go through a lot of the same examples in this class, except that our results are going to be a lot better.

Our classification accuracies will increase, and we’ll be able to create vectors of words, or word embeddings, that allow us to visualize how words are related on a graph.

We’ll see some pretty interesting results, like that our neural network seems to have learned that all religions and languages and numbers are related, and that cities and countries have hierarchical relationships.

If you’re interested in discovering how modern deep learning has propelled machine learning and data science to new heights, this course is for you.

I’ll see you in class.

Click here for 50% OFF:

#data science #deep learning #gru #lstm #machine learning #word vectors


New Course: Unsupervised Machine Learning – Hidden Markov Models in Python

June 13, 2016



Hidden Markov Models are all about learning sequences.

A lot of the data that would be very useful for us to model is in sequences. Stock prices are sequences of prices. Language is a sequence of words. Credit scoring involves sequences of borrowing and repaying money, and we can use those sequences to predict whether or not you’re going to default. In short, sequences are everywhere, and being able to analyze them is an important skill in your data science toolbox.

The easiest way to appreciate the kind of information you get from a sequence is to consider what you are reading right now. If I had written the previous sentence backwards, it wouldn’t make much sense to you, even though it contained all the same words. So order is important.

While the current fad in deep learning is to use recurrent neural networks to model sequences, I want to first introduce you guys to a machine learning algorithm that has been around for several decades now – the Hidden Markov Model.

This course follows directly from my first course in Unsupervised Machine Learning for Cluster Analysis, where you learned how to measure the probability distribution of a random variable. In this course, you’ll learn to measure the probability distribution of a sequence of random variables.

You guys know how much I love deep learning, so there is a little twist in this course. We’ve already covered gradient descent and you know how central it is for solving deep learning problems. I claimed that gradient descent could be used to optimize any objective function. In this course I will show you how you can use gradient descent to solve for the optimal parameters of an HMM, as an alternative to the popular expectation-maximization algorithm.

We’re going to do it in Theano, which is a popular library for deep learning. This is also going to teach you how to work with sequences in Theano, which will be very useful when we cover recurrent neural networks and LSTMs.

This course is also going to go through the many practical applications of Markov models and hidden Markov models. We’re going to look at a model of sickness and health, and calculate how to predict how long you’ll stay sick, if you get sick. We’re going to talk about how Markov models can be used to analyze how people interact with your website, and fix problem areas like high bounce rate, which could be affecting your SEO. We’ll build language models that can be used to identify a writer and even generate text – imagine a machine doing your writing for you.

We’ll look at what is possibly the most recent and prolific application of Markov models – Google’s PageRank algorithm. And finally we’ll discuss even more practical applications of Markov models, including generating images, smartphone autosuggestions, and using HMMs to answer one of the most fundamental questions in biology – how is DNA, the code of life, translated into physical or behavioral attributes of an organism?

All of the materials of this course can be downloaded and installed for FREE. We will do most of our work in Numpy and Matplotlib, along with a little bit of Theano. I am always available to answer your questions and help you along your data science journey.

Sign up now and get 50% off by clicking HERE

#data science #deep learning #hidden markov models #machine learning #recurrent neural networks #theano


New machine learning course! Cluster Analysis and Unsupervised Machine Learning in Python

April 20, 2016


[Scroll to the bottom if you want to jump straight to the coupon]

Cluster analysis is a staple of unsupervised machine learning and data science.

It is very useful for data mining and big data because it automatically finds patterns in the data, without the need for labels, unlike supervised machine learning.

In a real-world environment, you can imagine that a robot or an artificial intelligence won’t always have access to the optimal answer, or maybe there isn’t an optimal correct answer. You’d want that robot to be able to explore the world on its own, and learn things just by looking for patterns.

Do you ever wonder how we get the data that we use in our supervised machine learning algorithms?

We always seem to have a nice CSV or a table, complete with Xs and corresponding Ys.

If you haven’t been involved in acquiring data yourself, you might not have thought about this, but someone has to make this data!

Those “Y”s have to come from somewhere, and a lot of the time that involves manual labor.

Sometimes, you don’t have access to this kind of information or it is infeasible or costly to acquire.

But you still want to have some idea of the structure of the data. If you’re doing data analytics, automating pattern recognition in your data would be invaluable.

This is where unsupervised machine learning comes into play.

In this course we are first going to talk about clustering. This is where instead of training on labels, we try to create our own labels! We’ll do this by grouping together data that looks alike.

There are 2 methods of clustering we’ll talk about: k-means clustering and hierarchical clustering.

Next, because in machine learning we like to talk about probability distributions, we’ll go into Gaussian mixture models and kernel density estimation, where we talk about how to “learn” the probability distribution of a set of data.

One interesting fact is that under certain conditions, Gaussian mixture models and k-means clustering are exactly the same! You can think of GMMs as a “souped up” version of k-means. We’ll prove how this is the case.

All the algorithms we’ll talk about in this course are staples in machine learning and data science, so if you want to know how to automatically find patterns in your data with data mining and pattern extraction, without needing someone to put in manual work to label that data, then this course is for you.

All the materials for this course are FREE. You can download and install Python, Numpy, and Scipy with simple commands on Windows, Linux, or Mac.


#agglomerative clustering #cluster analysis #data mining #data science #expectation-maximization #gaussian mixture model #hierarchical clustering #k-means clustering #kernel density estimation #pattern recognition #udemy #unsupervised machine learning


Principal Components Analysis in Theano

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

  • Are interested in deep learning (this tutorial uses gradient descent)
  • Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
  • Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
  • Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
Q = T.matrix('Q')
Z =, Q)

transform = theano.function(inputs=[X,Q], outputs=Z)

X_val = np.random.randn(100,10)
Q_val = np.random.randn(10,10)

Z_val = transform(X_val, Q_val)

I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP, you are probably familiar with the idea of creating “variable” objects and passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$ J = \sum_{n=1}^{N} \|x_n - \hat{x}_n\|^2 $$

Where \( \hat{X} \) is the reconstruction of \( X \). If there is only one eigenvector, let’s call this \( v \), then this becomes:

$$ J = \sum_{n=1}^{N} \|x_n - x_n vv^T\|^2 $$

Summed over all samples, this is the squared Frobenius norm, so we can write:

$$ J = \|X - Xvv^T\|_F^2 $$

One identity of the Frobenius norm is:

$$ \|A\|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij}^2 } = \sqrt{ Tr(A^T A ) } $$

Which means we can rewrite the loss function as:

$$ J = Tr( (X - Xvv^T)^T(X - Xvv^T) ) $$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on \( v \), you can use matrix algebra to rearrange this to get:

$$ v^* = argmin\{-Tr(X^TXvv^T) \} $$

Which again using reordering would be equivalent to maximizing:

$$ v^* = argmax\{ v^TX^TXv \} $$

The corresponding eigenvalue would then be:

$$ \lambda = v^TX^TXv $$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$ v \leftarrow v + \eta \nabla_v(v^TX^TXv) $$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$ v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i ) $$

Next, note that to implement this algorithm you never need to actually calculate the covariance \( X^T X \). If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get \( Xv \), which is only of size \( N \times 1 \). You can then “dot” this with itself to get a scalar, which is only an \( O(N) \) operation.

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v")
Xv =, v)
cost =, Xv) - sum(evals[j] *[j], v) *[j], v) for j in xrange(i))

gv = T.grad(cost, v)

Note that we re-normalize the eigenvector on each step, so that \( v^T v = 1 \).

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv
update_expression = y / y.norm(2)

train = theano.function(
  inputs=[X],
  outputs=[cost],
  updates=[(v, update_expression)],
)

Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find):
  ... initialize variables and expressions ...
  ... initialize theano train function ...
  while t < max_iterations and change in v > tol:
    outputs = train(data)

... return eigenvalues and eigenvectors ...

This is not really trivial but at the same time it’s a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course “Unsupervised Deep Learning in Python” and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano


Logistic Regression in Python video course

November 11, 2015

Hi all!

Do you ever get tired of reading walls of text, and just want a nice video or 10 to explain to you the magic of logistic regression and how to program it with Python?

Look no further, that video course is here.

UPDATE: Please go here for the latest and fullest version of this course:

#big data #data science #logistic regression #neural networks #numpy #python


How to run distributed machine learning jobs using Apache Spark and EC2 (and Python)

April 5, 2015

This is the age of big data.

Sometimes scikit-learn doesn’t cut it.

In order to make your operations and data-driven decisions scalable – you need to distribute the processing of your data.

Two popular libraries that do such distributed machine learning are Mahout (which uses MapReduce) and MLlib (which uses Spark, which is sometimes considered as a successor to MapReduce).

What I want to do with this tutorial is to show you how easy it is to do distributed machine learning using Spark and EC2.

When I started a recent project of mine, I was distraught at how complicated a Mahout setup could be.

I am not an ops person. I hate installing and configuring things. For something like running distributed k-means clustering, 90% of the work could go into just setting up a Hadoop cluster, installing all the libraries your code needs to run, making sure they are the right versions, etc…

The Hadoop ecosystem is very sensitive to these things, and sometimes MapReduce jobs can be very hard to debug.

With Spark, everything is super easy. Installing Spark and Hadoop is tedious but do-able. Spinning up a cluster is very easy. Running a job is very easy. We will use Python, but you can also use Scala or Java.

Outline of this tutorial:

  1. Install Spark on a driver machine.
  2. Create a cluster.
  3. Run a job.


1. Install Spark

I used an Ubuntu instance on EC2. I’m assuming you already know how to set up a security group, get your PEM, and SSH into the machine.


Once you’ve spun up your AMI, we can begin installing all the stuff we’ll need.

To make this even easier you can probably do this on your local machine, but if for some reason you’re using Windows or you don’t want to mess up your local machine, then you’ll want to do this.

First, set your AWS ID and secret environment variables.


Now install Java:

sudo apt-get update
sudo apt-get install default-jdk maven
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

For the last line, we will need this RAM available to build Spark, if I remember correctly.

Now download and build Spark:

tar -xf spark-1.3.0.tgz
cd spark-1.3.0
mvn -DskipTests clean package

By the time you read this a new version of Spark may be available. You should check.


2. Create a Cluster

Assuming you are in the Spark folder now, it is very easy to create a cluster to run your jobs:

./ec2/spark-ec2 -k "Your Key Pair Name" -i /path/to/key.pem -s <number of slaves> launch <cluster name> --copy-aws-credentials -z us-east-1b

I set my zone as “us-east-1b” but you can set it to a zone of your choice.

When you’re finished, don’t forget to tear down your cluster! On-demand machines are expensive.

./ec2/spark-ec2 destroy <cluster name>

For some reason, numpy isn’t installed when you create a cluster, and the default Python distribution on the m1.large machines is 2.6, while Spark installs its own 2.7. So, even if you easy_install numpy on each of the machines in the cluster, it won’t work for Spark.

You can instead copy the library over to each cluster machine from your driver machine:

scp -i /path/to/key.pem /usr/lib/python2.7/dist-packages/numpy* root@<cluster-machine>:/usr/lib/python2.7/dist-packages/
scp -r -i /path/to/key.pem /usr/lib/python2.7/dist-packages/numpy root@<cluster-machine>:/usr/lib/python2.7/dist-packages/

You can easily write a script to automatically copy all this stuff over (get the machine URLs from the EC2 console).


3. Run a Job

Spark gives you a Python shell.

First, go to your EC2 console and find the URL for your cluster master. SSH into that machine (username is root).

cd spark
MASTER=spark://<cluster-master-ip>:7077 ./pyspark

Import libraries:

from pyspark.mllib.clustering import KMeans
from numpy import array

Load your data:

data = sc.textFile("s3://<my-bucket>/<path>/*.csv")

Note 1: You can use a wildcard to grab multiple files into one variable, called an RDD (resilient distributed dataset).

Note 2: Spark gives you a variable called ‘sc’, which is an object of type SparkContext. It specifies the master node, among other things.

Maybe filter out some bad lines:

data = data.filter(lambda line: 'ERROR' not in line)

Turn each row into an array / vector observation:

data = line: array([float(x) for x in line.split()]))

Train your model:

clusters = KMeans.train(data, 2, maxIterations=20,
  runs=1, initializationMode="k-means||")

Save some output:


You can also run a standalone Python script using spark-submit instead of the shell.

./bin/spark-submit --master spark://<master-ip>:7077

Remember you’ll have to instantiate your own SparkContext in this case.


Future Improvements

The goal of this tutorial is to make things easy.

There are many areas for improvement – for instance – on-demand machines on Amazon are the most expensive.

Spark still spins up “m1.large” instances, even though EC2’s current documentation recommends using the better, faster, AND cheaper “m3.large” instance instead.

At the same time, that custom configuration could mean we can’t use the spark-ec2 script to spin up the cluster automatically. There might be an option there to choose. I didn’t really look.

One major reason I wrote this tutorial is because all the information in it is out there in some form, but it is disparate and some of it can be hard to find without knowing what to search for.

So that’s it. The easiest possible way to run distributed machine learning.

How do you do distributed machine learning?

#apache #aws #big data #data science #ec2 #emr #machine learning #python #spark


Tutorial: K-Nearest Neighbor classifier for MNIST

March 20, 2015

Previously we looked at the Bayes classifier for MNIST data, using a multivariate Gaussian to model each class.

We use the same dimensionality reduced dataset here.

The K-Nearest Neighbor (KNN) classifier is also often used as a “simple baseline” classifier, but there are a couple distinctions from the Bayes classifier that are interesting.

1) KNN does not use probability distributions to model data. In fact, many powerful classifiers do not assume any probability distribution on the data.

2) KNN is a “lazy” classifier. No work is actually done to train the model. It just saves the input points X and all the labels Y.

At classification time, the predicted class/label is chosen by looking at the “k nearest neighbors” of the input test point.

We choose “k” beforehand.

Here is a visual example for k = 3:


So for 3-NN, we go through all the training points, find the 3 closest ones, and choose the label that shows up the most.

i.e. If we get 2 whites and 1 black, we choose white. This is called “voting”.

Here is another example, where we might get a different answer depending on whether k = 3 or k = 5.


The idea is, our data should be clustered in “space”. Points from the same class should be near each other. In fact, this is an idea that prevails for pretty much all machine learning algorithms, so you should understand it well.

If data points that are near each other are from the same class, then it follows that a test point can be classified by looking at its neighbors’ classes.

In pseudocode, it might look like this:

function predict(x'):
  Q = [] // we'll use this to store the k-nearest neighbors
  for x,y in training data:
    if Q.length < k or dist(x,x') < max distance in Q:
      add dist(x,x') and y to Q
      if Q.length > k: remove max distance in Q
  return the y that shows up most in Q

A naive implementation might look through every training point, which is O(N), and then to find the max in Q it would be O(k), for a total of O(kN).

You could use a sorted data structure for Q, making the search and insert O(log k).

Here is the pseudocode that does this:
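A minimal runnable sketch of this bounded-queue idea, using Python's heapq as a max-heap via negated distances (the function and variable names here are illustrative, not from the course code):

```python
import heapq
from collections import Counter

def predict(X, Y, x_new, k=3):
    # Q is a max-heap of size <= k keyed on negated distance, so the
    # root (-d, y) always holds the farthest of the current k nearest
    Q = []
    for x, y in zip(X, Y):
        d = sum((a - b) ** 2 for a, b in zip(x, x_new))  # squared Euclidean
        if len(Q) < k:
            heapq.heappush(Q, (-d, y))
        elif -d > Q[0][0]:  # closer than the current farthest neighbor
            heapq.heapreplace(Q, (-d, y))
    # "voting": return the label that shows up most among the k nearest
    return Counter(y for _, y in Q).most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
Y = [0, 0, 0, 1, 1, 1]
print(predict(X, Y, (0.5, 0.5)))  # 0
print(predict(X, Y, (5.5, 5.5)))  # 1
```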

k = 1 achieves a test classification rate of 94.8%!

Learn about KNN and more in the course Data Science: Supervised Machine Learning in Python.

#bigdata #data science #k-nearest neighbors #knn #machine learning


Bayes classifier and Naive Bayes tutorial (using the MNIST dataset)

March 19, 2015

The Naive Bayes classifier is a simple classifier that is often used as a baseline for comparison with more complex classifiers.

It is also conceptually very simple and as you’ll see it is just a fancy application of Bayes rule from your probability class.

We will use the famous MNIST data set for this tutorial.

The MNIST dataset is a set of handwritten digits, and our job is to build a computer program that takes as input an image of a digit, and outputs what digit it is.




Recall Bayes rule:

$$ P(c | x) = \frac{P(x | c)P(c)}{P(x)} $$

If you’re like me, you may have found this notation a little confusing at first.

Here \( x \) represents the image, or more precisely, the pixel values of the image formatted as a vector, and \( c \) represents the digit, which can be 0, 1, …, 9.

Sidenote: Images are grayscale and of size 28×28, which, if flattened, yields a vector of length 28×28=784. If you look at the code (linked below) you can see how this is done, and also that we scale the pixel values to be between 0 and 1. Scaling isn’t strictly necessary, but can be useful for many machine learning algorithms.

We can read the left side \( P(c | x) \) as “the probability that the class is \( c \) given the data \( x \)”. (this is called the “posterior”)

We can read the right side \( P(x | c) \) as “the probability of observing the data \( x \), given that the class is \( c \)”. (this is called the “likelihood”)



One little efficiency trick we can do:

We don’t actually care about the value of  \( P(c | x) \).

We care about the value of \( c \) itself. That tells us “which digit” the image belongs to.

The class chosen is simply the one that yields the highest probability for that data:

$$ c^* = argmax_{c}{ P(c | x) } = argmax_{c}{ \frac{ P(x | c)P(c) }{ P(x) } } $$


You will notice that \( P(x) \) is constant for all values of \( c \) in \( P(c | x) \).

So when I take the argmax over \( \frac{ P(x | c)P(c) }{ P(x) } \) I can ignore \( P(x) \).

As a simple example, suppose I have that \( A > B \), or numerically, say, \( 10 > 5 \).

If I multiply or divide these numbers by a positive constant (\( P(x) \) will always be positive) then the relationship still holds.

Ex. \( 2A > 2B \), or \( 20 > 10 \).


Using this information, we can simplify our problem so that, in order to choose “which digit” given an image, all we need to do is calculate this argmax (notice \( P(x) \) is removed):


$$ c^* = argmax_{c}{ P(x | c)P(c) }$$



The next step we can take is to think about how to calculate \( P(c) \).

This is just counting.

If I have 100 students in my class, and I want to figure out the probability that a student is born in January, how can I do that?

I simply count up all the students born in January, and divide by the total number of students.

If I want to know the probability of getting heads when I flip a coin, I flip the coin a bunch of times, and divide the number of heads by the total number of coin flips.


$$ P(c) = \frac{ count(number\, of\, times\, images\, of\, the\, digit\, c\, appear) }{ count(total\, number\, of\, images) } $$
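The counting method is one line of NumPy. Here the labels are hypothetical, just to illustrate the idea:

```python
import numpy as np

# Hypothetical labels for 10 training images
y = np.array([0, 1, 1, 2, 9, 9, 9, 3, 1, 0])

# P(c) = (number of images of digit c) / (total number of images)
classes, counts = np.unique(y, return_counts=True)
priors = dict(zip(classes.tolist(), (counts / len(y)).tolist()))

print(priors[1])  # 3 of the 10 labels are 1, so P(1) = 0.3
```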



The challenge is choosing a model that accurately fits the data for \( P(x | c) \).

As a thought-exercise, think about how you’d do this naively.

Each image is of size 28×28, which means there are 784 (=28×28) pixels per image. Each pixel can take on integer values in the range 0..255 inclusive.

So, if you modeled this as a discrete probability distribution, you’d have \( 256^{784} \) different possibilities. That’s way more than the number of images you have (~50,000), and hence, you’d never be able to use the “counting method” (used above for calculating \( P(c) \)) to accurately measure those probabilities.


To make the problem tractable and easily computable, we recall that pixels represent light intensity, and light intensity is actually continuous. It’s only discrete inside a computer because computers are discrete.

A reasonable first-guess for modeling continuous data is the multivariate Gaussian or the multivariate Normal.

We can say that:

$$ P(x | c) = \frac{1}{\sqrt{ (2\pi)^D |\Sigma| }} \exp\left( -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right) $$

Note that because the data are continuous, we are not actually calculating probabilities, but probability densities, on the right for \( P(x | c) \). Luckily, Bayes rule still holds for probability densities.

Another thing to note is that because probabilities are very small when dimensionality is high, we’ll work with log-likelihood instead of likelihood. Then instead of getting numbers very close to 0, which is inaccurate when using a computer to represent them, we’ll just get negative numbers.

The log-likelihood can be represented as:

$$ \log P(x | c) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) $$

You should try to derive this yourself. (\( D \) is the dimensionality.)
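The log-likelihood formula translates directly into NumPy. This is a minimal sketch: `slogdet` computes \( \log|\Sigma| \) in a numerically stable way, and `solve` computes \( \Sigma^{-1}(x - \mu) \) without explicitly forming the inverse:

```python
import numpy as np

def log_gaussian_pdf(x, mu, Sigma):
    """log P(x | c) for a multivariate Gaussian with mean mu, covariance Sigma."""
    D = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)        # log|Sigma|, numerically stable
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^-1 (x - mu)
    return -0.5 * D * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

# Sanity check against the 1-D standard normal log-density at x = 0,
# which is -0.5*log(2*pi)
print(log_gaussian_pdf(np.zeros(1), np.zeros(1), np.eye(1)))  # ~ -0.9189
```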


By the way, to calculate \( \mu \) and \( \Sigma \), you can use the sample mean and covariance: Note that it’s \( P(x | c) \), not just \( P(x) \). So, if we want to calculate the mean and covariance for all the images of the digit 9, then we’d first grab only the images of the digit 9 (ignoring the rest of the images), and calculate the sample mean and covariance from this subset.


Earlier, we wanted the argmax over \( P( x | c)P(c) \). Since \( log(AB) = log(A) + log(B) \), then using log probabilities, we can choose the digit class using:


$$ c^* = argmax_{c} {\left( logP(x | c) + logP(c) \right)} $$


This works since the \( log() \) function is monotonically increasing. If \( A > B \) then \( log(A) > log(B) \). Try any 2 numbers on your calculator if you don’t believe me.



Now this problem is tractable.

Training the classifier would work as follows:

For each class = 0..9
    get all x’s (images) for the class
    save the mean and covariance of those x’s with the class

Prediction using the classifier would work as follows:

Given a test point x:
    Calculate the probability that x is in each class c
    Return the class c that yields the highest posterior probability
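The train/predict loop above can be sketched end-to-end. This uses small synthetic data standing in for MNIST, and the `1e-3` smoothing term added to each covariance is my own assumption to keep the matrix invertible (with real image data, many pixels have near-zero variance):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 3                                      # dimensionality, number of classes
y = rng.integers(0, K, size=300)
X = rng.normal(size=(300, D)) + 3 * y[:, None]   # class c centered at (3c, ..., 3c)

# Training: per class, store the mean, smoothed covariance, and log prior
model = {}
for c in range(K):
    Xc = X[y == c]
    cov = np.cov(Xc, rowvar=False) + 1e-3 * np.eye(D)  # smoothing keeps cov invertible
    model[c] = (Xc.mean(axis=0), cov, np.log(len(Xc) / len(X)))

def predict(x):
    """Return argmax_c [ log P(x|c) + log P(c) ]."""
    best, best_score = None, -np.inf
    for c, (mu, Sigma, log_prior) in model.items():
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        ll = -0.5 * (D * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))
        if ll + log_prior > best_score:
            best, best_score = c, ll + log_prior
    return best

acc = np.mean([predict(x) == c for x, c in zip(X, y)])
print(acc)  # near 1.0 on this well-separated synthetic data
```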


What makes a Bayes classifier a Naive Bayes classifier?

So far we have only discussed general Bayes classifiers.

A Naive Bayes classifier is one that assumes all input dimensions of \( x \) are independent.

Recall that when 2 random variables \( A \) and \( B \) are independent, their joint probability can be expressed as the product of their individual probabilities:


$$ P(A, B) = P(A)P(B) $$


For the Gaussian case, this means that instead of having a joint Gaussian with a full covariance matrix \( \Sigma \), we can instead express it as a product of 784 individual univariate Gaussians:


$$ P(x | c) = \prod_{i=1}^{784} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{1}{2} \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right) $$


One advantage of Naive Bayes is that we don’t have to worry about any interactions between any \( x_i \) and \( x_j \) if \( i \neq j \).

In more practical terms, before we had \( \Sigma \), which is of size \( D \times D = 784^2 \).

Now, we only have \( \sigma_i^2, i=1…784 \).

That’s 784 times fewer numbers to store.

You can still express the joint distribution of 784 individual univariate Gaussians as one big multivariate Gaussian; it just means that the covariance matrix \( \Sigma \) will have zeros everywhere except along the diagonal, which stores the 784 univariate variances.


Having 784 individual variances means we don’t have to invert \( \Sigma \) to calculate the PDF or log PDF, which leads to even more savings.
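A sketch of the Naive version: the log-likelihood is just a sum of univariate Gaussian log-densities, so only elementwise operations over the \( D \) variances are needed. The check at the end confirms it matches the full multivariate formula when \( \Sigma \) is diagonal:

```python
import numpy as np

def naive_log_likelihood(x, mu, var):
    """log P(x|c) under the naive assumption: a sum of univariate Gaussian log-pdfs."""
    # No matrix inversion needed -- just elementwise ops over the D variances
    return np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var)

# With a diagonal covariance, this equals the full multivariate Gaussian formula
x, mu, var = np.array([0.5, -1.0]), np.zeros(2), np.array([1.0, 2.0])
Sigma = np.diag(var)
diff = x - mu
full = -0.5 * (2 * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma))
               + diff @ np.linalg.solve(Sigma, diff))

print(np.allclose(naive_log_likelihood(x, mu, var), full))  # True
```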

The downside of Naive Bayes is that the Naive assumption (that all input dimensions are independent) is most often incorrect.



So what do we get after training our model?

Visually, we can “see” what the model has learned by plotting the mean \( \mu \) for each class.

Here are the plots you’d get:


As you’d expect, the mean of each class very closely captures what that digit typically looks like.

The code for this tutorial can be found here:

Non-Naive Bayes:

Naive Bayes:


We get about 94% accuracy on the test set, which is pretty good!

Notice how we achieve only about 80% on the test set with Naive Bayes, because the Naive assumption is pretty obviously incorrect. E.g. if we’re looking at a black pixel in one of the corners, aren’t the pixels around it also very likely to be black?

Learn about KNN and more in the course Data Science: Supervised Machine Learning in Python.

#bayes #big data #data science #machine learning #naivebayes
