New Deep Learning course on Udemy

February 26, 2016


Shut up and gimme the link!:

This course continues where my first course, Deep Learning in Python, left off. You already know how to build an artificial neural network in Python, and you have a plug-and-play script that you can use for TensorFlow.

You learned about backpropagation (and because of that, this course contains basically NO MATH), but there were a lot of unanswered questions. How can you modify it to improve training speed? In this course you will learn about batch and stochastic gradient descent, two commonly used techniques that allow you to train on just a small sample of the data at each iteration, greatly speeding up training time.

You will also learn about momentum, which can be helpful for carrying you through local minima and prevent you from having to be too conservative with your learning rate. You will also learn aboutadaptive learning rate techniques like AdaGrad and RMSprop which can also help speed up your training.

In my last course, I just wanted to give you a little sneak peak at TensorFlow. In this course we are going to start from the basics so you understand exactly what’s going on – what are TensorFlow variables and expressions and how can you use these building blocks to create a neural network? We are also going to look at a library that’s been around much longer and is very popular for deep learning – Theano. With this library we will also examine the basic building blocks – variables, expressions, and functions – so that you can build neural networks in Theano with confidence.

Because one of the main advantages of TensorFlow and Theano is the ability to use the GPU to speed up training, I will show you how to set up a GPU-instance on AWS and compare the speed of CPU vs GPU for training a deep neural network.

With all this extra speed, we are going to look at a real dataset – the famous MNIST dataset (images of handwritten digits) and compare against various known benchmarks.

#adagrad #aws #batch gradient descent #deep learning #ec2 #gpu #machine learning #nesterov momentum #numpy #nvidia #python #rmsprop #stochastic gradient descent #tensorflow #theano

Go to comments

Principal Components Analysis in Theano

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

  • Are interested in deep learning (this tutorial uses gradient descent)
  • Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
  • Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
  • Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
Q = T.matrix('Q')
Z =, Q)

transform = theano.function(inputs=[X,Q], outputs=Z)

X_val = np.random.randn(100,10)
Q_val = np.random.randn(10,10)

Z_val = transform(X_val, Q_val)

I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP you are probably familiar with the idea of “variable” objects and them passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$ J = \sum_{n=1}^{N} |x_n – \hat{x}_n|^2 $$

Where \( \hat{X} \) is the reconstruction of \( X \). If there is only one eigenvector, let’s call this \( v \), then this becomes:

$$ J = \sum_{n=1}^{N} |x_n – x_nvv^T|^2 $$

This is equivalent to the Frobenius norm, so we can write:

$$ J = |X – Xvv^T|_F $$

One identity of the Frobenius norm is:

$$ |A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij} } = \sqrt{ Tr(A^T A ) } $$

Which means we can rewrite the loss function as:

$$ J = Tr( (X – Xvv^T)^T(X – Xvv^T) ) $$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on \( v \), you can use matrix algebra to rearrange this to get:

$$ v^* = argmin\{-Tr(X^TXvv^T) \} $$

Which again using reordering would be equivalent to maximizing:

$$ v^* = argmax\{ v^TX^TXv \} $$

The corresponding eigenvalue would then be:

$$ \lambda = v^TX^TXv $$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$ v \leftarrow v + \eta \nabla_v(v^TX^TXv) $$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$ v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i ) $$

Next, note that to implement this algorithm you never need to actually calculate the covariance \( X^T X \). If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get \( Xv \), which is only of size \( N \times 1 \). You can then “dot” this with itself to get a scalar, which is only an \( O(N) \) operation.

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v")
Xv =, v)
cost =, Xv) - np.sum(evals[j]*[j], v)*[j], v) for j in xrange(i))

gv = T.grad(cost, v)

Note that we re-normalize the eigenvector on each step, so that \( v^T v = 1 \).

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv
update_expression = y / y.norm(2)

train = theano.function(
  outputs=[your outputs],
  updates=((v, update_expression),)

Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find):
  ... initialize variables and expressions ...
  ... initialize theano train function ...
  while t < max_iterations and change in v < tol:
    outputs = train(data)

... return eigenvalues and eigenvectors ...

This is not really trivial but at the same time it’s a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course “Unsupervised Deep Learning in Python” and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano

Go to comments

Covariance Matrix: Divide by N or N-1?

February 14, 2016

This is a statistics post. It’s probably very boring. I am posting it for my own reference, because I seem to forget how this is derived every time I need it.

Sometimes you see the sample variance defined as:

$$ \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (X_n – \mu)^2 $$

But you might also see it defined as:

$$ \hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (X_n – \hat{\mu})^2 $$

Where as usual the “hat” symbol means that is our prediction.

Why do statisticians sometimes divide by N, and sometimes divide by N-1?

The same question arises for the calculation of the sample covariance matrix, and this is what we will work with in this post.


This has to do with whether you want your estimate to be a biased estimate or an unbiased estimate.

For any parameter \( \theta \), our estimate \( \hat{ \theta } \) is unbiased if:

$$ E\{ \hat{ \theta } – \theta \} = 0 $$

In this tutorial we will calculate the bias of the sample covariance on the multivariate Gaussian, which is defined as:

$$ p(x | \mu, \sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} exp( -\frac{1}{2} (x – \mu)^T \Sigma^{-1} (x – \mu) ) $$

The maximum likelihood estimates of \( \mu \) and \( \Sigma \) can be found by taking the derivative of the log-likelihood and setting it to 0.

The likelihood is:

$$ p(X | \mu, \Sigma) = \prod_{n=1}^{N} p(x_n | \mu, \Sigma) $$

So the log-likelihood (expanded) is:

$$ L(X | \mu, \Sigma) = -\frac{ND}{2} log(2\pi)  -\frac{N}{2} log(|\Sigma|)  -\sum_{n=1}^{N} \frac{1}{2}(x_n – \mu)^T \Sigma^{-1} (x_n – \mu) $$

To take “vector derivatives” and “matrix derivatives” you’ll want to consult the Matrix Cookbook.

If you made it this far it is almost trivial to calculate \( \frac{ \partial L }{ \partial \mu } = 0 \) to get the usual result:

$$ \hat{ \mu } = \frac{1}{N} \sum_{n=1}^{N} x_n $$

To get the sample covariance, we calculate:

$$ \frac{ \partial L }{ \partial \Sigma } = -\frac{N}{2} (\Sigma^{-1})^T – \frac{1}{2} \sum_{n=1}^{N} – \Sigma^{-T} (x_n – \mu) (x_n – \mu)^T \Sigma^{-T} $$

Set that to 0 and solve for \( \Sigma \) to get:

$$ \hat{ \Sigma } = \frac{1}{N} \sum_{n=1}^{N} (x_n – \hat{\mu}) (x_n – \hat{\mu})^T $$

Note that we are assuming we don’t have the “true mean”, so we are estimating the mean using the maximum likelihood estimate before calculating the maximum likelihood estimate for the covariance.

Now we will show that \( E\{ \hat{ \Sigma } \}  \neq \Sigma \). By definition:

$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} (x_n – \hat{\mu})(x_n – \hat{\mu})^T \} $$


$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} ((x_n – \mu) – (\hat{\mu} – \mu))((x_n – \mu) – (\hat{\mu} – \mu)))^T \} $$

Expand that:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \frac{1}{N} E\{ \sum_{n} (\hat{\mu} – \mu)) (\hat{\mu} – \mu))^T – (x_n – \mu)(\hat{\mu} – \mu))^T – (\hat{\mu} – \mu))(x_n – \mu)^T \} $$

Multiply out the terms:

$$ E\{ \hat{ \Sigma } \} = \Sigma  + E\{ \hat{\mu}\hat{\mu}^T \} – \frac{1}{N} \sum_{n} E\{ x_n\hat{\mu}^T \} – \frac{1}{N} \sum_{n} E\{ \hat{\mu} x_n^T \} + \mu\mu^T $$

We can combine the expected values to get:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T – E\{ \hat{\mu}\hat{\mu}^T \} $$

Now the exercise becomes finding the expected value on the right side.

We need some identities.

First, for \( m \neq n \):

$$ E\{ x_m x_n^T \} = E\{ x_m \} E\{ x_n^T \} =  \mu\mu^T $$

Because each sample is IID: independent and identically distributed.

Next, the definition of covariance:

$$ \Sigma = E\{ (x – \mu)(x – \mu)^T \} = E\{ xx^T \} – \mu\mu^T $$

We can rearrange this to get:

$$ E\{ xx^T \} = \Sigma + \mu\mu^T $$

The term \( E\{\hat{\mu}\hat{\mu}^T \} \) can be expanded as:

$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N^2} E\{ (x_1 + x_2 + … + x_N)(x_1 + x_2 + … + x_N)^T \} $$

When expanding the multiplication, there are \( N \) terms that are the same, so that would be a \( N E\{ x_n x_n^T \} \) contribution. There are \( N(N-1) \) terms that are different, and since different terms are independent that is a \( N(N-1)\mu\mu^T \) contribution.

So in total:

$$ E\{\hat{\mu}\hat{\mu}^T \} =\frac{1}{N^2}(N(\Sigma + \mu\mu^T) + N(N-1)\mu\mu^T) $$


$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N}\Sigma + \mu\mu^T $$

Plugging this back into the expression for the bias:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T – \frac{1}{N}\Sigma – \mu\mu^T $$


$$ E\{ \hat{ \Sigma } \} = \frac{N-1}{N} \Sigma \neq \Sigma $$

So, if you want the unbiased estimator, you can multiply the biased maximum likelihood estimator by \( \frac{N}{N-1} \), which gives the expected unbiased formula.


#covariance #maximum likelihood #MLE #multivariate Gaussian #multivariate normal #sample variance #statistics #unbiased estimator

Go to comments

Data Science: Natural Language Processing in Python

February 11, 2016

Do you want to learn natural language processing from the ground-up?

If you hate math and want to jump into purely practical coding examples, my new course is for you.

You can check it out here:

I am posting the course summary here also for convenience:


In this course you will build MULTIPLE practical systems using natural language processing, or NLP. This course is not part of my deep learning series, so there are no mathematical prerequisites – just straight up coding in Python. All the materials for this course are FREE.

After a brief discussion about what NLP is and what it can do, we will begin building very useful stuff. The first thing we’ll build is a spam detector. You likely get very little spam these days, compared to say, the early 2000s, because of systems like these.

Next we’ll build a model for sentiment analysis in Python. This is something that allows us to assign a score to a block of text that tells us how positive or negative it is. People have used sentiment analysis on Twitter to predict the stock market.

We’ll go over some practical tools and techniques like the NLTK (natural language toolkit) library and latent semantic analysis or LSA.

Finally, we end the course by building an article spinner. This is a very hard problem and even the most popular products out there these days don’t get it right. These lectures are designed to just get you started and to give you ideas for how you might improve on them yourself. Once mastered, you can use it as an SEO, or search engine optimization tool. Internet marketers everywhere will love you if you can do this for them!

As a thank you for visiting this site, I’ve created a coupon that gets you 85% off! Just click here:

#article spinner #latent semantic analysis #latent semantic indexing #machine learning #natural language processing #nlp #pca #python #spam detection #svd

Go to comments

Deep Learning and Artificial Intelligence Newsletter

Get discount coupons, free machine learning material, and new course announcements