Lazy Programmer

Your source for the latest in deep learning, big data, data science, and artificial intelligence.

New course! Bayesian Machine Learning in Python: A/B Testing

November 17, 2016


[If you already know you want to sign up for my Bayesian machine learning course, just scroll to the bottom to get your $10 coupon!]

Boy, do I have some exciting news today!

You guys have already been keeping up with my deep learning series.

Hopefully, you’ve noticed that I’ve been releasing non-deep learning machine learning courses as well, in parallel (and they often tie into the deep learning series quite nicely).

Well today, I am announcing the start of a BRAND NEW series on Bayesian machine learning.

Bayesian methods require an entirely new way of thinking – a paradigm shift.

But don’t worry, it’s not just all theory.

In fact, the first course I’m releasing in the series is VERY practical – it’s on A/B testing.

Every online advertiser, e-commerce store, and marketing team does A/B testing.

But did you know that traditional A/B testing is both horribly confusing and inefficient?

Did you know that there are cool, new adaptive methods inspired by reinforcement learning that improve on those old crusty tests?

(Those old methods, and the way they are traditionally taught, are probably the reason you cringe when you hear the word “statistics”)

Well, Bayesian methods not only represent a state-of-the-art solution to many A/B testing challenges, they are also surprisingly theoretically simpler!

You’ll end the course by doing your own simulation – comparing and contrasting the various adaptive A/B testing algorithms (including the final Bayesian method).

This is VERY practical stuff and any digital media, newsfeed, or advertising startup will be EXTREMELY IMPRESSED if you know this stuff.

This WILL advance your career, and any company would be lucky to have someone that knows this stuff on their team.

Awesome coincidence #1: As I mentioned above, a lot of these techniques cross-over with reinforcement learning, so if you are itching for a preview of my upcoming deep reinforcement learning course, this will be very interesting for you.

Awesome coincidence #2: Bayesian learning also crosses over with deep learning, one example being the variational autoencoder, which I may incorporate into a more advanced deep learning course in the future. It relies heavily on concepts from both Bayesian learning AND deep learning, and is a very powerful state-of-the-art algorithm.

Due to all the black Friday madness going on, I am going to do a ONE-TIME ONLY $10 special for this course. With my coupons, the price will remain at $10, even if Udemy’s site-wide sale price goes up (which it will).

See you in class!

As promised, here is the coupon:

UPDATE: The Black Friday sale is over, but the early bird coupon is still up for grabs:

LAST THING: Udemy is currently having an awesome Black Friday sale. $10 for ANY course starting Nov 15, but the price goes up by $1 every 2 days, so you need to ACT FAST.

I was going to tell you earlier but I was hard at work on my course. =)

Just click this link to get ANY course on Udemy for $10 (+$1 every 2 days):

#bayesian #data science #machine learning #statistics


Principal Components Analysis in Theano

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

  • Are interested in deep learning (this tutorial uses gradient descent)
  • Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
  • Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
  • Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
Q = T.matrix('Q')
Z =, Q)

transform = theano.function(inputs=[X,Q], outputs=Z)

X_val = np.random.randn(100,10)
Q_val = np.random.randn(10,10)

Z_val = transform(X_val, Q_val)

I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP you are probably familiar with the idea of “variable” objects and then passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$ J = \sum_{n=1}^{N} |x_n - \hat{x}_n|^2 $$

Where \( \hat{X} \) is the reconstruction of \( X \). If there is only one eigenvector, let’s call this \( v \), then this becomes:

$$ J = \sum_{n=1}^{N} |x_n - x_nvv^T|^2 $$

This sum of squared norms is exactly the squared Frobenius norm, so we can write:

$$ J = |X - Xvv^T|_F^2 $$

One identity of the Frobenius norm is:

$$ |A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij}^2 } = \sqrt{ Tr(A^T A ) } $$

Which means we can rewrite the loss function as:

$$ J = Tr( (X - Xvv^T)^T(X - Xvv^T) ) $$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on \( v \), you can use matrix algebra to rearrange this to get:

$$ v^* = \arg\min_v \{-Tr(X^TXvv^T) \} $$

Which again using reordering would be equivalent to maximizing:

$$ v^* = \arg\max_v \{ v^TX^TXv \} $$

The corresponding eigenvalue would then be:

$$ \lambda = v^TX^TXv $$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$ v \leftarrow v + \eta \nabla_v(v^TX^TXv) $$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$ v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i ) $$

Next, note that to implement this algorithm you never need to actually calculate the covariance \( X^T X \). If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get \( Xv \), which is only of size \( N \times 1 \). You can then “dot” this with itself to get a scalar, which is only an \( O(N) \) operation.
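This reordering is easy to verify numerically. Here is a quick numpy sketch with a hypothetical random data matrix (D is kept small here only so that the direct product can still be formed for comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 50                  # D small only so we can compare both ways
X = rng.standard_normal((N, D))
v = rng.standard_normal(D)

# Naive: form the D x D matrix X^T X first -- O(D^2) memory
naive = v @ (X.T @ X) @ v

# Memory-friendly: compute the N-vector Xv, then dot it with itself
Xv = X @ v
cheap = Xv @ Xv

print(np.allclose(naive, cheap))
```

The two results agree to floating-point precision, but the second version never allocates anything larger than an N-vector.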

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v")   # init_v: a random, unit-length numpy array
Xv =, v)                    # X is the Theano matrix variable from before
# evals/evecs hold the eigenvalue/eigenvector pairs found so far (deflation)
cost =, Xv) - sum(evals[j]*[j], v)*[j], v) for j in range(i))

gv = T.grad(cost, v)

Note that we re-normalize the eigenvector on each step, so that \( v^T v = 1 \).

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv
update_expression = y / y.norm(2)   # re-normalize so that v^T v = 1

train = theano.function(
  inputs=[X],                       # the Theano matrix variable for the data
  outputs=[cost],
  updates=[(v, update_expression)],
)

Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in range(number of eigenvalues you want to find):
  ... initialize variables and expressions ...
  ... initialize theano train function ...
  while t < max_iterations and change in v > tol:
    outputs = train(data)

... return eigenvalues and eigenvectors ...

This is not really trivial but at the same time it's a great exercise in both (a) linear algebra and (b) Theano coding.
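For reference, here is a plain-numpy sketch of the whole procedure (gradient ascent plus deflation, never materializing the D × D matrix X^T X). This is my own illustrative implementation, not the Theano code from the post; the function name and constants are made up:

```python
import numpy as np

def pca_gradient_ascent(X, n_components, learning_rate=1e-3, max_iter=2000, tol=1e-10):
    """Leading eigenvalues/eigenvectors of X^T X via gradient ascent + deflation.

    Maximizes v^T (X^T X - sum_j lambda_j v_j v_j^T) v, re-normalizing v on
    each step, without ever forming the D x D matrix X^T X.
    """
    D = X.shape[1]
    rng = np.random.default_rng(0)
    evals, evecs = [], []
    for _ in range(n_components):
        v = rng.standard_normal(D)
        v /= np.linalg.norm(v)
        for _ in range(max_iter):
            g = 2 * (X.T @ (X @ v))              # gradient of v^T X^T X v
            for lam, w in zip(evals, evecs):
                g -= 2 * lam * w * (w @ v)       # subtract found components
            v_new = v + learning_rate * g
            v_new /= np.linalg.norm(v_new)       # keep v^T v = 1
            done = np.linalg.norm(v_new - v) < tol
            v = v_new
            if done:
                break
        evals.append(v @ (X.T @ (X @ v)))        # lambda = v^T X^T X v
        evecs.append(v)
    return np.array(evals), np.array(evecs)

# Sanity check against numpy's exact eigendecomposition
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
evals, evecs = pca_gradient_ascent(X, 2)
exact = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
print(evals, exact[:2])
```

The recovered leading eigenvalues match numpy's to several digits, and the two eigenvectors come out orthogonal, as the deflation step intends.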

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course on Udemy "Data Science: Deep Learning in Python" and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano


Covariance Matrix: Divide by N or N-1?

February 14, 2016

This is a statistics post. It’s probably very boring. I am posting it for my own reference, because I seem to forget how this is derived every time I need it.

Sometimes you see the sample variance defined as:

$$ \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (X_n - \hat{\mu})^2 $$

But you might also see it defined as:

$$ \hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (X_n - \hat{\mu})^2 $$

Where, as usual, the “hat” symbol means it is an estimate.

Why do statisticians sometimes divide by N, and sometimes divide by N-1?

The same question arises for the calculation of the sample covariance matrix, and this is what we will work with in this post.


This has to do with whether you want your estimate to be a biased estimate or an unbiased estimate.

For any parameter \( \theta \), our estimate \( \hat{ \theta } \) is unbiased if:

$$ E\{ \hat{ \theta } \} = \theta $$

In this tutorial we will calculate the bias of the sample covariance on the multivariate Gaussian, which is defined as:

$$ p(x | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) ) $$

The maximum likelihood estimates of \( \mu \) and \( \Sigma \) can be found by taking the derivative of the log-likelihood and setting it to 0.

The likelihood is:

$$ p(X | \mu, \Sigma) = \prod_{n=1}^{N} p(x_n | \mu, \Sigma) $$

So the log-likelihood (expanded) is:

$$ L(X | \mu, \Sigma) = -\frac{ND}{2} \log(2\pi) - \frac{N}{2} \log(|\Sigma|) - \sum_{n=1}^{N} \frac{1}{2}(x_n - \mu)^T \Sigma^{-1} (x_n - \mu) $$

To take “vector derivatives” and “matrix derivatives” you’ll want to consult the Matrix Cookbook.

If you made it this far it is almost trivial to calculate \( \frac{ \partial L }{ \partial \mu } = 0 \) to get the usual result:

$$ \hat{ \mu } = \frac{1}{N} \sum_{n=1}^{N} x_n $$

To get the sample covariance, we calculate:

$$ \frac{ \partial L }{ \partial \Sigma } = -\frac{N}{2} (\Sigma^{-1})^T + \frac{1}{2} \sum_{n=1}^{N} \Sigma^{-T} (x_n - \mu) (x_n - \mu)^T \Sigma^{-T} $$

Set that to 0 and solve for \( \Sigma \) to get:

$$ \hat{ \Sigma } = \frac{1}{N} \sum_{n=1}^{N} (x_n – \hat{\mu}) (x_n – \hat{\mu})^T $$

Note that we are assuming we don’t have the “true mean”, so we are estimating the mean using the maximum likelihood estimate before calculating the maximum likelihood estimate for the covariance.

Now we will show that \( E\{ \hat{ \Sigma } \}  \neq \Sigma \). By definition:

$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} (x_n – \hat{\mu})(x_n – \hat{\mu})^T \} $$


$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} ((x_n - \mu) - (\hat{\mu} - \mu))((x_n - \mu) - (\hat{\mu} - \mu))^T \} $$

Expand that:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \frac{1}{N} E\{ \sum_{n} (\hat{\mu} - \mu) (\hat{\mu} - \mu)^T - (x_n - \mu)(\hat{\mu} - \mu)^T - (\hat{\mu} - \mu)(x_n - \mu)^T \} $$

Multiply out the terms:

$$ E\{ \hat{ \Sigma } \} = \Sigma  + E\{ \hat{\mu}\hat{\mu}^T \} – \frac{1}{N} \sum_{n} E\{ x_n\hat{\mu}^T \} – \frac{1}{N} \sum_{n} E\{ \hat{\mu} x_n^T \} + \mu\mu^T $$

We can combine the expected values to get:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T – E\{ \hat{\mu}\hat{\mu}^T \} $$

Now the exercise becomes finding the expected value.

We need some identities.

First, for \( m \neq n \):

$$ E\{ x_m x_n^T \} = E\{ x_m \} E\{ x_n^T \} =  \mu\mu^T $$

Because each sample is IID: independent and identically distributed.

Next, the definition of covariance:

$$ \Sigma = E\{ (x – \mu)(x – \mu)^T \} = E\{ xx^T \} – \mu\mu^T $$

We can rearrange this to get:

$$ E\{ xx^T \} = \Sigma + \mu\mu^T $$

The term \( E\{\hat{\mu}\hat{\mu}^T \} \) can be expanded as:

$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N^2} E\{ (x_1 + x_2 + … + x_N)(x_1 + x_2 + … + x_N)^T \} $$

When expanding the multiplication, there are \( N \) terms that are the same, so that would be a \( N E\{ x_n x_n^T \} \) contribution. There are \( N(N-1) \) terms that are different, and since different terms are independent that is a \( N(N-1)\mu\mu^T \) contribution.

So in total:

$$ E\{\hat{\mu}\hat{\mu}^T \} =\frac{1}{N^2}(N(\Sigma + \mu\mu^T) + N(N-1)\mu\mu^T) $$


$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N}\Sigma + \mu\mu^T $$

Plugging this back into the expression for the bias:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T – \frac{1}{N}\Sigma – \mu\mu^T $$


$$ E\{ \hat{ \Sigma } \} = \frac{N-1}{N} \Sigma \neq \Sigma $$

So, if you want the unbiased estimator, multiply the (biased) maximum likelihood estimator by \( \frac{N}{N-1} \); this turns the \( \frac{1}{N} \) into the familiar \( \frac{1}{N-1} \).
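numpy exposes both conventions through the ddof argument, so the rescaling above is easy to check (the data values here are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)

biased = np.var(x)            # divides by N     (the maximum likelihood estimate)
unbiased = np.var(x, ddof=1)  # divides by N - 1 (the unbiased estimate)

# Rescaling the MLE by N / (N - 1) recovers the unbiased version
print(biased, unbiased, biased * N / (N - 1))

# np.cov has the same knob: it divides by N - 1 by default, by N with bias=True
X = np.column_stack([x, x ** 2])
print(np.allclose(np.cov(X, rowvar=False, bias=True) * N / (N - 1),
                  np.cov(X, rowvar=False)))
```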


#covariance #maximum likelihood #MLE #multivariate Gaussian #multivariate normal #sample variance #statistics #unbiased estimator


What the hell is a “z-score”?

December 4, 2015

Was having this conversation recently about what a z-score is.

What is a z-score? Why is it mentioned so often in statistics? You’ve probably heard of it in the context of the Student t-test.

If you are a machine learning guy more than a statistics guy, like me, you’ve probably heard you should “standardize” or “normalize” your variables before putting them into a machine learning model.

For example, if you’re doing linear regression and \( X_1 \) varies from \( 0..1 \) and \( X_2 \) varies from \( 0..1000 \), the weights \( \beta_1 \) and \( \beta_2 \) will give an inaccurate picture as to their importance.

Well, the z-score actually is the standardized variable.

I would avoid this terminology if possible, because in machine learning we usually think of a “score” as the output of a model, i.e. I’m getting the “score” of a set of links because I want to know in what order to rank them.

$$ z = \frac{x – \mu}{\sigma} $$

So instead of thinking of “z-scores”, think of “I need to standardize my variables before using them in my machine learning model so that the effect of any one variable is on the same scale as the others”.
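That standardization is a one-liner in numpy. A minimal sketch, assuming a made-up two-column data matrix with wildly different scales (like the regression example above):

```python
import numpy as np

# Hypothetical features: column 0 ranges over [0, 1], column 1 over [0, 1000]
rng = np.random.default_rng(0)
X = np.column_stack([rng.random(500), rng.random(500) * 1000])

# The z-score of each entry: subtract the column mean, divide by the column std
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # both columns now have mean ~0
print(Z.std(axis=0))   # and standard deviation 1
```

After this transformation, the weights a model learns for each column are on a comparable scale.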

Do you find statistics terminology confusing? “Z score”? “F statistic”? Share your thoughts in the comments!

#frequentist statistics #machine learning #statistics #z score


Tutorial: Principal Components Analysis (PCA)

November 20, 2015

I remember learning about principal components analysis for the very first time. I remember thinking it was very confusing, and that I didn’t know what it had to do with eigenvalues and eigenvectors (I’m not even sure I remembered what eigenvalues and eigenvectors were at the time).

I assure you that in hindsight, understanding PCA, despite its very scientific-sounding name, is not that difficult to understand at all. There are two “categories” I think it makes sense to split the discussion of PCA into.

1) What does it do? A written and visual description of how PCA is used.

2) The math behind PCA. A few equations showing that the things mentioned in (1) are true.

What does PCA do?

Linear transformation

PCA finds a matrix \( Q \) that, when multiplied by your original data matrix \( X \), gives you a linearly transformed data matrix \( Z \), where:

$$ Z = XQ $$

In other words, for a single sample vector \( x \), we can obtain its transformation \( z = Q^Tx \).

The \( Q \) and \( x \) appear in a different order (and \( Q \) is transposed) because when we load a data matrix, it’s usually \( N \times D \), so each sample is a \( 1 \times D \) row vector, but when we talk about vectors by themselves we usually think of them as \( D \times 1 \) columns.

The interesting thing about PCA is how it chooses \( Q \).

Dimensionality reduction

When we think of machine learning models we often study them in the context of 1-3 dimensions. But if you take a dataset like MNIST, where each image is 28×28 pixels, then each input is 784 dimensions. 28×28 is a tiny image. The new iPhone 6 Plus has a resolution of 1080 × 1920 pixels. Each image of that size would thus have 2073600 \( \approx \) 2 million dimensions.

PCA reduces dimensionality by moving as much “information” as possible into as few dimensions as possible. We measure information via unpredictability, i.e. variance. The end result is that our transformed matrix \( Z \) has most of its variance in the first column, less variance in the second column, even less variance in the third column, etc.

You will see later how our choice of \( Q \) accomplishes this.


\( D \approx 2000000 \) is a much bigger problem than the ones you learn about in the classroom. But often there are correlations in the data that make many of the dimensions redundant. This is related to the problem of linear regression – if we can accurately predict one dimension of the inputs, let’s say, \( x_3 \), in terms of another, \( x_5 \), then one of those dimensions is redundant.

The final result of removing these correlations is that every dimension of \( Z \) is uncorrelated with the others.


The image above shows a scatterplot of the original data on the original x-y axis. The arrows on the plot itself show that the new v1-v2 axis is a) rotated (matrix multiplication rotates a vector) b) still perpendicular, and c) the data relative to the new axis is normally distributed and the distributions on each dimension are independent and uncorrelated.


Once we’ve transformed \( X \) into \( Z \), noting that most of the “information” from \( X \) is in the first few dimensions of \( Z \) – let’s say… 90% of the variance is in the first 2 dimensions – we can do a 2-D scatterplot of Z to see the structure of our data. Humans have a hard time visualizing anything greater than 2 dimensions.

We can consider the first 2 dimensions to represent the real underlying trend in the data, and the other dimensions just small perturbations or noise.


You may have heard that too many parameters in your model can lead to overfitting. So if you are planning to use a logistic regression model, and the dimensionality of each input is 2 million, then the number of weights in your model will also be 2 million. This is not good news if you have only a few thousand samples. One rule of thumb is we would like to have 10x the number of samples compared to the number of parameters in our model.

So instead of going out and finding 20 million samples, we can use PCA to reduce the dimensionality of our data to say, 20, and then we only need 200 samples for our model.

You can also use PCA to pre-process data before using an unsupervised learning algorithm, like k-means clustering. PCA, by the way, is also an unsupervised algorithm.
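With sklearn (which we’ll also use below), keeping a fixed number of components is one line; the sizes here are made-up stand-ins for the scenario above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))  # stand-in for a higher-dimensional dataset

pca = PCA(n_components=20)           # keep only the 20 leading components
Z = pca.fit_transform(X)

print(Z.shape)                              # (1000, 20)
print(pca.explained_variance_ratio_.sum())  # fraction of total variance retained
```

explained_variance_ratio_ tells you how much of the original variance the kept components account for, which is how you decide whether 20 was enough.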

The math behind PCA

Everybody’s favorite part!

Ok, so this requires both (a) some statistics knowledge (knowing how to find the covariance matrix), and (b) some linear algebra knowledge (knowing what eigenvalues and eigenvectors are, and basic matrix manipulation).


I stated above that in the rotated v1-v2 coordinate system, the data on the v1 axis was not correlated with the data on the v2 axis.

Intuitively, when you have a 2-D Gaussian distribution, the data looks like an ellipse. If that ellipse is perpendicular to the x-y grid, then the x and y components are independent and uncorrelated.

The image below shows what happens with different values of the correlation between \( X_i \) and \( X_j \) (called ‘c’ in the image):


(If you really know your statistics, then you will recall that independence implies 0 correlation, but 0 correlation does not imply independence, unless the distribution is a joint Gaussian.)

So, in the first image, since the ellipse is not perpendicular to the x-y axis, the distributions p(x) and p(y) are not independent. But in the rotated v1-v2 axis, the distributions p(v1) and p(v2) are independent and uncorrelated.

If this all sounds crazy, just remember this one fact: if 2 variables \( X_i \) and \( X_j \) are uncorrelated, their corresponding element in the covariance matrix, \( \Sigma_{ij} = 0 \). And since the covariance matrix is symmetric (for our purposes), \( \Sigma_{ji} = 0 \) also.

For every variable to be uncorrelated to every other variable, all the elements of \( \Sigma \) will be 0, except those along the diagonal, \( \Sigma_{ii} = \sigma_i^2 \), which is just the regular variance for each dimension.

How do we calculate the covariance matrix from the data? Simple!

$$ \Sigma_X = \frac{1}{N} (X – \mu_X)^T(X – \mu_X) \tag{1}$$

In non-matrix form this is:

$$ \Sigma_{ij} = \frac{1}{N} \sum_{n=1}^{N} (x_i^{(n)} - \mu_i)(x_j^{(n)} - \mu_j) $$

I’ve labeled the equation because we’re going to return to it. You can verify that \( \Sigma \) is \( D \times D \) since those are the outer dimensions of the matrix multiply.

Note the mathematical sleight of hand I used above. \( X \) is \( N \times D \) and the mean \( \mu_X \) is \( 1 \times D \), so technically you can’t subtract them under the rules of matrix algebra, but let’s assume that the above notation means each row of \( X \) has \( \mu_X \) subtracted from it.
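Equation \( (1) \), including that broadcasting “sleight of hand”, can be checked against numpy’s built-in estimator (the data here is random, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 4
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))  # correlated columns

# numpy broadcasting subtracts the 1 x D mean from every row of the N x D matrix
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / N   # equation (1), the divide-by-N version

# np.cov with bias=True also divides by N; rowvar=False means rows are samples
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))
```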

Eigenvalues and Eigenvectors

This is where the real magic of PCA happens.

For no particular reason at all, suppose we compute the eigenvalues and eigenvectors of \( \Sigma_X \).

If you don’t know what eigenvalues/eigenvectors are: remember that usually, when we multiply a vector by a matrix, it changes the direction of the vector.

So for example, the vector \( (1,2) \) is pointing in the same direction as \( (2,4) = 2(1,2) \). But if we were to multiply \( (1,2) \) by a matrix:

$$ \begin{pmatrix} 5 \\ 11 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} $$

The result is a vector in a different direction. Eigenvalues \( \lambda \) and eigenvectors \( v \) have the property that, if you multiply \( \Sigma_X v \), the result is the same as multiplying the eigenvector by a constant – the eigenvalue. In other words, the eigenvector does not change direction; it only gets shorter or longer by a factor of \( \lambda \):

$$ \Sigma_Xv = \lambda v $$

How many pairs of these eigenvectors and eigenvalues can we find? It just so happens that there are \( D \) such pairs. We’ll say, each eigenvalue \( \lambda_i \) has a corresponding eigenvector \( v_i \).
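You can see both the eigenvalue property and (as we’ll use shortly) the orthonormality of the eigenvectors numerically with numpy (random data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Sigma = np.cov(X, rowvar=False)   # a symmetric D x D covariance matrix

# np.linalg.eigh is for symmetric matrices; eigenvalues come back ascending
lams, V = np.linalg.eigh(Sigma)

# Each eigenvector is only scaled by Sigma, never rotated: Sigma v = lambda v
for lam, v in zip(lams, V.T):
    print(np.allclose(Sigma @ v, lam * v))

# The eigenvectors are orthonormal: V^T V = I, i.e. V^T = V^{-1}
print(np.allclose(V.T @ V, np.eye(3)))
```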

Now again, for no particular reason at all, we are going to combine all the \( v_i \) into a matrix, let’s call it \( V \).

We could line up the \( v_i \) in any random order, but instead we are going to choose a very particular order – we are going to line them up such that the corresponding eigenvalues \( \lambda_i \) are in descending order. In other words:

$$ \lambda_1 > \lambda_2 > \lambda_3 > … > \lambda_D $$


$$ V = \begin{pmatrix}
| & | & | & \dots & | \\
v_1 & v_2 & v_3 & \dots & v_D \\
| & | & | & \dots & |
\end{pmatrix} $$

We are also going to combine the eigenvalues into a diagonal matrix like this:

$$ \Lambda = \begin{pmatrix}
\lambda_1 & 0 & \dots & 0 \\
0 & \lambda_2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \lambda_D
\end{pmatrix} $$

As a sidenote, don’t worry about how we find the eigenvalues and eigenvectors. That is a huge problem in and of itself. The method you used in high school – solving a polynomial to get the eigenvalues, plugging the eigenvalues into the eigenvalue equation to get the eigenvectors, etc. doesn’t really translate to computer code. There exist some very involved algorithms to solve this problem, but we’ll just use numpy.

In any case, it is easy to prove to yourself, with a simple 2-D example, that:

$$ \Sigma_X V = V \Lambda \tag{2}$$

Just one more ingredient. We want to show that \( V \) is orthonormal. In terms of the individual eigenvectors this means that \( {v_i}^T v_j = 0 \) if \( i \ne j \), and \( {v_i}^T v_i = 1 \). In matrix form this means \( V^T V = I \), or in other words, \( V^T = V^{-1} \).

Proof of orthogonality: Consider \( \lambda_j {v_i}^T v_j \) for \( i \ne j \). \( \lambda_j {v_i}^T v_j = {v_i}^T (\lambda_j v_j) = {v_i}^T \Sigma_X v_j = (\Sigma_X^T v_i)^T v_j \). But we know that \( \Sigma_X \) is symmetric so this is \( (\Sigma_X v_i)^T v_j = (\lambda_i v_i)^T v_j = \lambda_i {v_i}^T v_j \). Since \( \lambda_i \ne \lambda_j \), this can only be true if \( {v_i}^T v_j = 0 \).

Normalizing eigenvectors is easy since they are not unique – just choose values so that their length is 1.

Finally, using \( (1) \), consider the covariance of the transformed data, \( Z \).

$$ \begin{align*}
\Sigma_Z &= \frac{1}{N} (Z - \mu_Z)^T (Z - \mu_Z) \\
&= \frac{1}{N} (XQ - \mu_XQ)^T (XQ - \mu_XQ) \\
&= \frac{1}{N} Q^T(X - \mu_X)^T (X - \mu_X)Q \\
&= Q^T \Sigma_X Q
\end{align*} $$

Now, suppose we choose \( Q = V \). Then, by plugging in \( (2) \), we get:

$$ \begin{align*}
\Sigma_Z &= V^T \Sigma_X V \\
&= V^T V \Lambda \\
&= \Lambda
\end{align*} $$

So what does this tell us? Since \( \Lambda \) is a diagonal matrix, there are no correlations in the transformed data, and the variance of each dimension of \( Z \) is equal to the corresponding eigenvalue. In addition, because we sorted the eigenvalues in descending order, the first dimension of \( Z \) has the most variance, the second dimension has the second-most variance, and so on. So most of the information is kept in the leading columns, as promised.

You can verify this in Python:

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA()
X = np.random.random((100,10)) # generate an N = 100, D = 10 random data matrix
Z = pca.fit_transform(X)

# visualize the covariance of Z
plt.imshow(np.cov(Z, rowvar=False))
plt.colorbar()

(Image: covariance of the output of PCA)

Observe that the off-diagonals are 0 and that the variances are in decreasing order.

You can also confirm that \( Q^T Q = I \).


Q = pca.components_.T  # sklearn stores the eigenvectors as the rows of components_
QTQ =, Q)
print(np.around(QTQ, decimals=2))


Remember to leave a comment below if you have any questions on the above material. If you want to go more in depth on this and other data science topics (both with the math and the code), check out some of my data science video courses online.

#anomaly detection #covariance #dimensionality reduction #eigenvalues #eigenvectors #linear algebra #machine learning #pca #principal components analysis #statistics


Linear regression video course in Python

November 11, 2015

Hi all!

Do you ever get tired of reading walls of text, and just want a nice video or 10 to explain to you the magic of linear regression and how to program it with Python and numpy?

Look no further, that video course is here.

#linear regression #numpy #python #statistics


Bayesian Bandit Tutorial

July 31, 2014


To explain and provide code for the Bayesian bandit problem, without too much math.


Also known as the “multi-armed bandit”, the problem is as follows:

Suppose you are at a casino and have a choice between N slot machines. Each of the N slot machines (bandits) has an unknown probability of letting you win. i.e. Bandit 1 may have P(win) = 0.9. Bandit 2 may have P(win) = 0.3. We wish to maximize our winnings by playing the machine which has the highest probability of winning. The problem is determining which machine this is.

What is the inherent difficulty in determining the machine probabilities?

We know that to accurately determine the probability of an event, we have to take a large number of samples. The more samples we take, the more accurate that probability is.

Unfortunately for us, if we take 1000 samples from a machine with low probability of winning, then we have not maximized our winnings. In fact, we may have just needlessly lost 1000 times.

Explore vs. Exploit

This problem is known as explore vs. exploit. Should we “explore” – i.e. try a new machine in hopes that its probability of winning is higher? Or should we “exploit” – i.e. the machine we are using now is pretty good, so let’s keep going?

There are real-world applications to this dilemma. For example, suppose you are an online advertising platform and you want to show ads to users that have a high probability of being clicked. Do you show an ad that has a reasonably high probability so far, or do you choose a new ad that could have an even higher click-through rate (CTR)? Suppose you are the New York Times and you want to increase the click-through of articles on your home page. Should you post the same articles that are already known to be popular, or should you show a fresh article that you predict might be even more popular?

The trick is to use Bayes’ rule.

P(X|Y) = P(Y|X)P(X)/P(Y)

The Bayesian Bandit Solution

The idea: let’s not pull each arm 1000 times to get an accurate estimate of its probability of winning. Instead, let’s use the data we’ve collected so far to determine which arm to pull. If an arm doesn’t win that often, but we haven’t sampled it too much, then its probability of winning is low, but our confidence in that estimate is also low – so let’s give it a small chance in the future. However, if we’ve pulled an arm many times and it wins often, then its probability of winning is high, and our confidence in that estimate is also high – let’s give that arm a higher chance of being pulled.

Let’s rewrite Bayes’ rule for our problem:

P(theta | x) = P(x | theta)P(theta) / P(x)
Where theta is the parameter we’re trying to estimate – win rate, and x is the data we’ve collected so far (# wins and # losses).


  • P(x | theta) is the “likelihood” (i.e. the probability of observing the data x given our current belief about the parameters theta)
  • P(theta) is the “prior” – our current assumption about the parameters
  • P(theta | x) is the “posterior” – our updated assumption about the parameters after observing the data x

The “trick” in Bayesian bandits has to do with something mathematicians call “conjugate priors”. It saves us from having to do the actual calculation for Bayes’ rule.

We choose the Beta distribution for some convenient reasons (even though there is no deductive reason for us to do so):

  • It can have values between 0 and 1, which is nice for things like click-through rate, or in other words, for representing the probability of winning.
  • When the prior is a Beta distribution and the likelihood is a Bernoulli distribution, the posterior is also a Beta distribution.
  • The update formula for determining the posterior only involves addition and subtraction, whereas for other distributions it can be more complicated (I will show how the parameters are related below).

The Beta distribution has 2 parameters, a and b, which govern its shape. We refer to a specific distribution as Beta(a,b).

Initially, we set a=1 and b=1. This results in a uniform distribution, i.e. we assume nothing about our chances of winning – all probabilities of winning are equally likely. This is called a minimally informative prior.

To update the posterior given current data, we do the following:

  • Let our prior parameters be a, b.
  • Let x be the result of pulling an arm (either 0 or 1).
  • Then our posterior parameters are: a’ = a + x, b’ = b + 1 – x

Thus, our updated distribution is Beta(a + x, b + 1 – x).

The algorithm for Bayesian Bandits

for number_of_trials:
  take random sample from each bandit with its current (a, b)
  for the bandit with the largest sample, pull its arm
  update that bandit with the data from the last pull
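That loop fits in a few lines of numpy. A minimal simulation sketch (the win rates and trial count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.75])   # hypothetical win rates; arm 2 is best
a = np.ones(3)                             # Beta parameters, starting at a = b = 1
b = np.ones(3)
pulls = np.zeros(3, dtype=int)

for _ in range(5000):
    samples = rng.beta(a, b)               # one draw from each bandit's Beta(a, b)
    j = np.argmax(samples)                 # pull the arm with the largest sample
    x = rng.random() < true_probs[j]       # observe a win (1) or a loss (0)
    a[j] += x                              # posterior update: a' = a + x
    b[j] += 1 - x                          #                   b' = b + 1 - x
    pulls[j] += 1

print(pulls)            # the best arm receives the large majority of pulls
print(a / (a + b))      # posterior mean estimates of each win rate
```

Notice that nothing in the loop explicitly decides to “explore” or “exploit” – the sampling step does that on its own, exactly as described above.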

A visual representation of Bayesian Bandits

Below I plot the distributions of each Beta after a certain number of trials.

What is happening here? Notice that the red distribution, which has the highest probability of winning, gets sharper and sharper as the number of trials increases.

The key is in the sampling stage. When we sample from each distribution, the red distribution is much more likely to give us a number around 0.75, whereas the green and blue distributions will give us a number in a much wider range.

So once the red bandit has “proved itself,” we are much more likely to choose it in the future, and we mostly stop bothering with the lower-performing bandits, although there is always a small chance we could still choose them.

A fat distribution means more “exploration”. A sharp distribution means more “exploitation” (if it has a relatively high win rate). Note that as the programmer, you don’t choose whether to explore or exploit. You sample from the distributions, so whether you explore or exploit is decided randomly, with a bias toward the bandits that have proven themselves to win more.

See the code here:

#a/b testing #bayesian bandit #online marketing #statistics #tutorial

Go to comments

Install all your statistics and numerical computation libraries for Python in one go on Ubuntu

July 26, 2014

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

#numpy #python #scientific computing #scipy #statistics

Go to comments

Multiple Linear Regression

July 26, 2014

Code for this tutorial is here:


Today we will continue our discussion of linear regression by extending the ideas from simple linear regression to multiple linear regression.

Recall that in simple linear regression, the input is 1-D. In multiple linear regression, the input is N-dimensional (any number of dimensions). The output is still just a scalar (1-D).

So now our input data looks like this:

(X1, Y1), (X2, Y2), …, (Xm, Ym)

Where X is a vector and Y is a scalar.

But now instead of our hypothesis, h(), looking like this:

h(X) = aX + b

It looks like this:

h(x) = beta0 + beta1 x1 + beta2 x2 + … + betan xn

Where each subscripted x is a scalar.

beta0 is also known as the “bias term”.

Another, more compact way of writing this is:

h(x) = beta^T x

Where beta and x are vectors. When we transpose the first vector, this is also called a “dot product” or “inner product”.

In this representation, we introduce a dummy variable x0 = 1, so that beta and x both contain the same number of elements (n+1).

To solve for the beta vector, we do the same thing we did for simple linear regression: define an error function (we’ll use the sum of squared error again), take the derivative of J with respect to each parameter (beta0, beta1, …), set them to 0, and solve for each beta.

J = sum over i of (Yi – beta^T Xi)^2

This is a lot more tedious than in the 1-D case, but I would suggest as an exercise attempting at least the 2-D case.

As before, there is a “closed form” solution for beta:

beta = [ sum over i of (Xi Xi^T) ]^-1 [ sum over i of (Yi Xi) ]

Here, each (Xi, Yi) is a “sample” from the data.

Notice that in the first term we transpose the second Xi. This is an “outer product” and the result is an (n+1) x (n+1) matrix.

The superscript -1 denotes a matrix inverse.

An even more compact form of this equation arises when we consider all the samples of X together in an m x (n+1) matrix, and all the samples of Y together in an m x 1 matrix:

beta = (X^T X)^-1 X^T Y
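As a sketch, the closed-form solution beta = (X^T X)^-1 X^T Y can be computed with numpy on some made-up data (solving the normal equations directly, which is numerically nicer than explicitly inverting the matrix):

```python
import numpy as np

# Made-up data: m = 4 samples, n = 2 input dimensions.
# The first column is the dummy variable x0 = 1 for the bias term,
# so X is m x (n+1).
X = np.array([[1, 0.0, 1.0],
              [1, 1.0, 2.0],
              [1, 2.0, 0.5],
              [1, 3.0, 1.5]])
Y = np.array([1.0, 4.0, 3.0, 6.0])

# Solve (X^T X) beta = X^T Y instead of computing the inverse explicitly
beta = np.linalg.solve(X.T.dot(X), X.T.dot(Y))
Yhat = X.dot(beta)
```

A useful property to check: at the solution, the residuals Y − Yhat are orthogonal to every column of X, which is exactly what setting the derivatives to 0 demands.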
As in the 1-D case, we use the R-square to measure how well the model fits the actual data (the formula is exactly the same).

#linear regression #machine learning #multiple linear regression #statistics

Go to comments

Linear Regression

July 25, 2014

Code for this tutorial is here:

Prerequisites for understanding this material:

  • calculus (taking partial derivatives)

Linear regression is one of the simplest machine learning techniques you can use. It is often useful as a baseline relative to more powerful techniques.

To start, we will look at a simple 1-D case.

Like all regressions, we wish to map some input X to some output Y.


Y = f(X)

With linear regression:

Y = aX + b

More accurately we can say:

h(X) = aX + b

Where “h” is our “hypothesis”.

You may recall from your high school studies that this is just the equation for a straight line.


When X is 1-D, or when “Y has one explanatory variable”, we call this “simple linear regression”.

When we use linear regression, we are using it to model linear relationships, or what we think may be linear relationships.

As with all supervised machine learning problems, we are given labeled data points:

(X1, Y1), (X2, Y2), (X3, Y3), …, (Xn, Yn)

And we will try to fit the line (aX + b) as best we can to these data points.


This means we have to optimize the parameters “a” and “b”.

How do we do this?

We will define an error function and then find the “a” and “b” that will make the error as small as possible.

You will see that many regression problems work this way.

What is our error function?

We could use the difference between the predicted Y and the actual Y, like so:

sum over i of [Yi – (a Xi + b)]

But if we had equal amounts of error where Y was bigger than the prediction and where Y was smaller than the prediction, the errors would cancel out, even though the individual errors might be large.

Typically in machine learning, the squared error is a good place to start. We will call the total error J:

J = sum over i of [Yi – (a Xi + b)]^2
Now, whether or not the difference in the actual and predicted output is positive or negative, its contribution to the total error is still positive.

We call this sum the “sum of squared errors”.

Recall that we want to minimize it.

Recall from calculus that to minimize something, you want to take its derivative and set it to 0. Calling our error J, that means:

dJ/da = 0
dJ/db = 0

Because there are two parameters, we have to take the derivatives both with respect to a and with respect to b, set them to 0, and solve for a and b.

Luckily, because the error function is quadratic, it has a single minimum: the error only increases as (a, b) get further and further away from it, so setting the derivatives to 0 finds that minimum.

As an exercise I will let you calculate the derivatives.

You will get 2 equations (the derivatives) and 2 unknowns (a, b). From high school math you should know how to solve this by rearranging the terms.

Note that these equations can be solved analytically. Meaning you can just plug and chug the values of your inputs and get the final value of a and b by blindly using a formula.

Note that this method is also called “ordinary least squares”.
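As a sketch of the “plug and chug” step, here is the closed-form solution for a and b on some made-up data (the formulas below are the rearranged solutions of the two equations, which you can verify against your own derivation):

```python
import numpy as np

# Made-up data, roughly following Y = 2X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form ordinary least squares for Y = aX + b
denominator = X.dot(X) - X.mean() * X.sum()
a = (X.dot(Y) - Y.mean() * X.sum()) / denominator
b = Y.mean() - a * X.mean()

print(a, b)  # close to the "true" slope 2 and intercept 0
```

The formula for b is just the statement that the fitted line passes through the point of means (mean of X, mean of Y), which falls out of the derivative with respect to b.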

Measuring the error (R-squared)

To determine how well our model fits the data, we need a measure called the “R-square”.

Note that in classification problems, we can simply use the “classification rate”, which is the number of correctly classified inputs divided by the total number of inputs. With the real-valued outputs we have in regression, this is not possible.

Here is the equation we use to compute the R-square:

R^2 = 1 – SS(residual) / SS(total)
SS(residual) is the sum of squared error between the actual and predicted output. This is the same as the error we were trying to minimize before!

SS(total) is the sum of squared error between each sample output and the mean of all the sample outputs, i.e. what the residual error would be if we just predicted the average output every time.

So the R-square then, is just how much better our model is compared to predicting the mean each time. If we just predicted the mean each time, the R-square would be 1-1=0. If our model is perfect, then the R-square would be 1-0=1.
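The R-square computation itself is only a few lines; here is a sketch on made-up actual and predicted values (Yhat would come from your fitted model):

```python
import numpy as np

Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])       # actual outputs
Yhat = np.array([2.1, 4.06, 6.02, 7.98, 9.94])  # model predictions (made up)

ss_residual = ((Y - Yhat) ** 2).sum()          # error of our model
ss_total = ((Y - Y.mean()) ** 2).sum()         # error of predicting the mean
r_squared = 1 - ss_residual / ss_total         # close to 1 for a good fit
```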

Something to think about: If our model performs worse than predicting the mean each time, what would be the R-square value?

Limitations of Linear Regression

  • It only models linear equations. You can model higher order polynomials (link to later post) but the model is still linear in its parameters.
  • It is sensitive to outliers. Meaning if we have one data point very far away from all the others, it could “pull” the regression line in its direction, away from all the other data points, just to minimize the error.
#calculus #linear regression #machine learning #statistics

Go to comments