Lazy Programmer

Your source for the latest in deep learning, big data, data science, and artificial intelligence.

Principal Components Analysis in Theano

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

  • Are interested in deep learning (this tutorial uses gradient descent)
  • Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
  • Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
  • Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
Q = T.matrix('Q')
Z = T.dot(X, Q)

transform = theano.function(inputs=[X,Q], outputs=Z)

X_val = np.random.randn(100,10)
Q_val = np.random.randn(10,10)

Z_val = transform(X_val, Q_val)

I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP, you are probably familiar with the idea of “variable” objects and passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$ J = \sum_{n=1}^{N} |x_n - \hat{x}_n|^2 $$

Where \( \hat{x}_n \) is the reconstruction of \( x_n \). If we keep only one eigenvector, let’s call it \( v \), then this becomes:

$$ J = \sum_{n=1}^{N} |x_n - x_nvv^T|^2 $$

This is just the squared Frobenius norm of the matrix \( X - Xvv^T \), so we can write:

$$ J = |X - Xvv^T|_F^2 $$

One identity of the Frobenius norm is:

$$ |A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij}^2 } = \sqrt{ Tr(A^T A ) } $$
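As a quick sanity check of this identity, here is a small numpy snippet (mine, not from the original post):

import numpy as np

A = np.random.randn(5, 3)
lhs = np.sqrt((A ** 2).sum())         # sqrt of the sum of squared entries
rhs = np.sqrt(np.trace(A.T.dot(A)))   # sqrt of Tr(A^T A)
print(np.allclose(lhs, rhs))                       # True
print(np.allclose(lhs, np.linalg.norm(A, 'fro')))  # True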

Which means we can rewrite the loss function as:

$$ J = Tr( (X - Xvv^T)^T(X - Xvv^T) ) $$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on \( v \), you can use matrix algebra to rearrange this to get:

$$ v^* = \arg\min_v \{ -Tr(X^TXvv^T) \} $$

Which, again using the trace reordering property, is equivalent to maximizing (subject to the constraint \( v^Tv = 1 \), since otherwise the maximum is unbounded):

$$ v^* = \arg\max_v \{ v^TX^TXv \} $$

The corresponding eigenvalue would then be:

$$ \lambda = v^TX^TXv $$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$ v \leftarrow v + \eta \nabla_v(v^TX^TXv) $$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$ v_i \leftarrow v_i + \eta \nabla_{v_i}\left(v_i^T\left( X^TX - \sum_{j=1}^{i-1} \lambda_j v_j v_j^T \right)v_i\right) $$

Next, note that to implement this algorithm you never need to actually calculate the covariance \( X^T X \). If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get \( Xv \), which is only of size \( N \times 1 \). You can then “dot” this with itself to get a scalar, which is only an \( O(N) \) operation.
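Here is a minimal numpy sketch (my own, not from the post) of the update rule for the leading eigenvector only, using the \( Xv \) trick so the \( D \times D \) covariance is never formed. The function name and hyperparameters are placeholders:

import numpy as np

def leading_eigenpair(X, learning_rate=0.01, n_iters=1000):
    # X is an N x D data matrix, assumed already mean-centered
    N, D = X.shape
    v = np.random.randn(D)
    v /= np.linalg.norm(v)
    for _ in xrange(n_iters):
        Xv = X.dot(v)                 # length N; the D x D covariance is never formed
        grad = 2.0 * X.T.dot(Xv)      # gradient of v^T X^T X v
        v = v + learning_rate * grad
        v /= np.linalg.norm(v)        # keep v^T v = 1
    Xv = X.dot(v)
    lam = Xv.dot(Xv)                  # eigenvalue: v^T X^T X v, computed as (Xv) . (Xv)
    return lam, v

On small data you can compare the result against np.linalg.eigh(X.T.dot(X)) to convince yourself it converges to the leading eigenpair (up to the sign of the eigenvector).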

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v")  # init_v is your initial (unit-length) guess for the eigenvector
Xv = T.dot(X, v)
# v^T X^T X v, minus the contributions of the eigenpairs already found
cost = T.dot(Xv.T, Xv) - sum(evals[j]*T.dot(evecs[j], v)*T.dot(evecs[j], v) for j in xrange(i))

gv = T.grad(cost, v)

Note that we re-normalize the eigenvector on each step, so that \( v^T v = 1 \).

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv
update_expression = y / y.norm(2)  # re-normalize so that v^T v = 1

train = theano.function(
  inputs=[X],
  outputs=[cost],  # return whatever you want to monitor, e.g. the cost
  updates=((v, update_expression),)
)

Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find):
  ... initialize variables and expressions ...
  ... initialize theano train function ...
  while t < max_iterations and change in v > tol:
    outputs = train(data)

... return eigenvalues and eigenvectors ...
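Putting the pieces together, here is a hedged end-to-end sketch of that loop. The function and variable names (find_eigenpairs, n_eigs, and so on) are my own, not from the original post, and the learning rate and iteration counts are placeholders you would tune:

import numpy as np
import theano
import theano.tensor as T

def find_eigenpairs(data, n_eigs=3, learning_rate=0.001, max_iterations=500, tol=1e-8):
    # data: N x D numpy array, assumed already mean-centered,
    # with dtype matching theano.config.floatX
    N, D = data.shape
    evals = []
    evecs = []
    for i in xrange(n_eigs):
        X = T.matrix('X')
        init_v = np.random.randn(D).astype(theano.config.floatX)
        init_v /= np.linalg.norm(init_v)
        v = theano.shared(init_v, name='v')

        # v^T X^T X v minus the contributions of the eigenpairs already found
        Xv = T.dot(X, v)
        cost = T.dot(Xv, Xv) - sum(
            evals[j] * T.dot(evecs[j], v) * T.dot(evecs[j], v) for j in xrange(i)
        )
        gv = T.grad(cost, v)

        # gradient ascent step, then re-normalize so that v^T v = 1
        y = v + learning_rate * gv
        update_expression = y / y.norm(2)

        train = theano.function(
            inputs=[X],
            outputs=[cost, update_expression],
            updates=[(v, update_expression)],
        )

        last_v = init_v
        for t in xrange(max_iterations):
            cost_val, new_v = train(data)
            if np.abs(new_v - last_v).sum() < tol:
                break
            last_v = new_v

        v_val = v.get_value()
        Xv_val = data.dot(v_val)
        evals.append(Xv_val.dot(Xv_val))  # lambda = v^T X^T X v, no D x D matrix needed
        evecs.append(v_val)
    return np.array(evals), np.array(evecs)

On a small dataset you can check the output against np.linalg.eigh of the covariance; the leading eigenvalues should agree, and the eigenvectors should agree up to sign.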

This is not really trivial but at the same time it's a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course on Udemy "Data Science: Deep Learning in Python" and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano

Go to comments


Data Science: Natural Language Processing in Python

February 11, 2016

Do you want to learn natural language processing from the ground-up?

If you hate math and want to jump into purely practical coding examples, my new course is for you.

You can check it out at Udemy: https://www.udemy.com/data-science-natural-language-processing-in-python

I am posting the course summary here also for convenience:


In this course you will build MULTIPLE practical systems using natural language processing, or NLP. This course is not part of my deep learning series, so there are no mathematical prerequisites – just straight up coding in Python. All the materials for this course are FREE.

After a brief discussion about what NLP is and what it can do, we will begin building very useful stuff. The first thing we’ll build is a spam detector. You likely get very little spam these days, compared to say, the early 2000s, because of systems like these.

Next we’ll build a model for sentiment analysis in Python. This is something that allows us to assign a score to a block of text that tells us how positive or negative it is. People have used sentiment analysis on Twitter to predict the stock market.

We’ll go over some practical tools and techniques like the NLTK (natural language toolkit) library and latent semantic analysis or LSA.

Finally, we end the course by building an article spinner. This is a very hard problem and even the most popular products out there these days don’t get it right. These lectures are designed to just get you started and to give you ideas for how you might improve on them yourself. Once mastered, you can use it as an SEO (search engine optimization) tool. Internet marketers everywhere will love you if you can do this for them!

As a thank you for visiting this site, I’ve created a coupon that gets you 70% off.

Click here to get the course for only $15.

#article spinner #latent semantic analysis #latent semantic indexing #machine learning #natural language processing #nlp #pca #python #spam detection #svd

Go to comments


A Tutorial on Autoencoders for Deep Learning

December 31, 2015

Despite its somewhat cryptic-sounding name, the autoencoder is a fairly basic machine learning model (and the name is not cryptic at all once you know what it does).

Autoencoders belong to the neural network family, but they are also closely related to PCA (principal components analysis).

Some facts about the autoencoder:

  • It is an unsupervised learning algorithm (like PCA)
  • It minimizes the same objective function as PCA
  • It is a neural network
  • The neural network’s target output is its input

The last point is key here. This is the architecture of an autoencoder:

[Image: autoencoder network architecture]

So the dimensionality of the input is the same as the dimensionality of the output, and essentially what we want is x’ = x.

It can be shown that the objective function for PCA is:

$$ J = \sum_{n=1}^{N} |x(n) - \hat{x}(n)|^2 $$

Where the prediction is \( \hat{x}(n) = QQ^Tx(n) \), with \( Q \) the matrix of eigenvectors (\( Q^T \) maps \( x \) into the principal-component space, and \( Q \) maps it back).

Q can be the full \( D \times D \) transformation matrix (which would result in getting exactly the old x back), or it can be a “rank k” matrix (i.e. keeping only the k most relevant eigenvectors), which would then result in only an approximation of x.

So the objective function can be written as:

$$ J = \sum_{n=1}^{N} |x(n) - QQ^Tx(n)|^2 $$

Now let’s return to autoencoders.

Recall that to get the value at the hidden layer, we simply multiply the input->hidden weights by the input.

Like so:

$$ z = f(Wx) $$

And to get the value at the output, we multiply the hidden->output weights by the hidden layer values, like so:

$$ y = g(Vz) $$

The choice of \( f \) and \( g \) is up to us; we just have to know how to take the derivative for backpropagation.

We are of course free to make them “identity” functions, such that:

$$ y = g(V f(Wx)) = VWx $$

This gives us the objective:

$$ J = \sum_{n=1}^{N} |x(n) - VWx(n)|^2 $$

Which is the same as PCA!
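To make the correspondence concrete, here is a tiny numpy forward pass (my own illustration, not from the post) with \( f \) and \( g \) as identity functions, where \( W \) plays the role of \( Q^T \) and \( V \) plays the role of \( Q \):

import numpy as np

D, K = 10, 3
x = np.random.randn(D)
W = np.random.randn(K, D)   # input -> hidden weights (the "encoder")
V = np.random.randn(D, K)   # hidden -> output weights (the "decoder")

z = W.dot(x)                # hidden representation, f = identity
y = V.dot(z)                # reconstruction,        g = identity
print(np.allclose(y, V.dot(W).dot(x)))  # True: y = VWx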

 

If autoencoders are similar to PCA, why do we need autoencoders?

Autoencoders are much more flexible than PCA.

Recall that with neural networks we have an activation function – this can be a “ReLU” (aka. rectifier), “tanh” (hyperbolic tangent), or sigmoid.

This introduces nonlinearities in our encoding, whereas PCA can only represent linear transformations.

The network representation also means you can stack autoencoders to form a deep network.

 

Cool theory bro, but what can autoencoders actually do for me?

Good question!

Similar to PCA – autoencoders can be used for finding a low-dimensional representation of your input data. Why is this useful?

Some of your features may be redundant or correlated, resulting in wasted processing time and overfitting in your model (too many parameters).

It is thus ideal to only include the features we need.

If your “reconstruction” of x is very accurate, that means your low-dimensional representation is good.

You can then use this transformation as input into another model.

 

Training an autoencoder

Since autoencoders are really just neural networks where the target output is the input, you actually don’t need any new code.

Suppose we’re working with a scikit-learn-like interface.

Instead of:

model.fit(X, Y)

You would just have:

model.fit(X, X)

Pretty simple, huh?

All the usual neural network training strategies work with autoencoders too:

  • backpropagation
  • regularization
  • dropout

If you want to get good with autoencoders – I would recommend trying to take some data and an existing neural network package you’re comfortable with – and see what low-dimensional representation you can come up with. How many dimensions are there?
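For example, here is a rough sketch (mine, not from the post) using scikit-learn’s MLPRegressor as a makeshift autoencoder; the hidden layer size of 2 and the fake data are arbitrary choices:

import numpy as np
from sklearn.neural_network import MLPRegressor

# fake data that really does live on a 2-D subspace of a 10-D space
X = np.random.randn(500, 2).dot(np.random.randn(2, 10))

# target output = input
model = MLPRegressor(hidden_layer_sizes=(2,), activation='tanh', max_iter=2000)
model.fit(X, X)

# the low-dimensional "code" is just the hidden-layer activation
Z = np.tanh(X.dot(model.coefs_[0]) + model.intercepts_[0])  # N x 2
print(Z.shape)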

 

Where can I learn more?

Autoencoders are part of a family of unsupervised deep learning methods, which I cover in-depth in my course, Unsupervised Deep Learning in Python. We discuss how to stack autoencoders to build deep belief networks, and compare them to RBMs which can be used for the same purpose. We derive all the equations and write all the code from scratch – no shortcuts. Ask me for a coupon so I can give you a discount!

P.S. “Autoencoders” means “encodes itself”. Not so cryptic now, right?

Leave a comment!

#autoencoders #deep learning #machine learning #pca #principal components analysis #unsupervised learning

Go to comments


Tutorial: Principal Components Analysis (PCA)

November 20, 2015

I remember learning about principal components analysis for the very first time. I remember thinking it was very confusing, and that I didn’t know what it had to do with eigenvalues and eigenvectors (I’m not even sure I remembered what eigenvalues and eigenvectors were at the time).

I assure you that in hindsight, PCA, despite its very scientific-sounding name, is not that difficult to understand at all. There are two “categories” I think it makes sense to split the discussion of PCA into.

1) What does it do? A written and visual description of how PCA is used.

2) The math behind PCA. A few equations showing that the things mentioned in (1) are true.

What does PCA do?

Linear transformation

PCA finds a matrix \( Q \) that, when multiplied by your original data matrix \( X \), gives you a linearly transformed data matrix \( Z \), where:

$$ Z = XQ $$

In other words, for a single sample vector \( x \), we can obtain its transformation \( z = Q^Tx \).

The \( Q \) and \( x \) appear in a different order here because when we load a data matrix, it’s usually \( N \times D \), so each sample vector is \( 1 \times D \), but when we talk about vectors by themselves we usually think of them as \( D \times 1 \).
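In numpy terms (a small illustration of my own), the two conventions agree row by row:

import numpy as np

N, D = 5, 3
X = np.random.randn(N, D)
Q = np.random.randn(D, D)

Z = X.dot(Q)                  # transform the whole data matrix at once: Z = XQ
z0 = Q.T.dot(X[0])            # transform the first sample as a column vector: z = Q^T x
print(np.allclose(Z[0], z0))  # True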

The interesting thing about PCA is how it chooses \( Q \).

Dimensionality reduction

When we think of machine learning models we often study them in the context of 1-3 dimensions. But if you take a dataset like MNIST, where each image is 28×28 pixels, then each input is 784 dimensions. 28×28 is a tiny image. The new iPhone 6 has a resolution of 1080 × 1920 pixels. Each image would thus have 2073600 \( \approx \) 2 million dimensions.

PCA reduces dimensionality by moving as much “information” as possible into as few dimensions as possible. We measure information via unpredictability, i.e. variance. The end result is that our transformed matrix \( Z \) has most of its variance in the first column, less variance in the second column, even less variance in the third column, etc.

You will see later how our choice of \( Q \) accomplishes this.

De-correlation

\( D \approx 2000000 \) is a much bigger problem than the ones you learn about in the classroom. But often there are correlations in the data that make many of the dimensions redundant. This is related to the problem of linear regression – if we can accurately predict one dimension of the inputs, let’s say, \( x_3 \), in terms of another, \( x_5 \), then one of those dimensions is redundant.

The final result of removing these correlations is that every dimension of \( Z \) is uncorrelated with the others.

[Image: scatterplot of the data showing the original x-y axes and the rotated v1-v2 axes]

The image above shows a scatterplot of the original data on the original x-y axes. The arrows on the plot show that the new v1-v2 axes are a) rotated (multiplying by an orthogonal matrix rotates a vector), b) still perpendicular to each other, and c) relative to the new axes, the data is normally distributed and the distributions along each dimension are independent and uncorrelated.

Visualization

Once we’ve transformed \( X \) into \( Z \), most of the “information” from \( X \) is in the first few dimensions of \( Z \). If, say, 90% of the variance is in the first 2 dimensions, we can do a 2-D scatterplot of \( Z \) to see the structure of our data. Humans have a hard time visualizing anything in more than 2 dimensions.

We can consider the first 2 dimensions to represent the real underlying trend in the data, and the other dimensions just small perturbations or noise.

Pre-processing

You may have heard that too many parameters in your model can lead to overfitting. So if you are planning to use a logistic regression model, and the dimensionality of each input is 2 million, then the number of weights in your model will also be 2 million. This is not good news if you have only a few thousand samples. One rule of thumb is we would like to have 10x the number of samples compared to the number of parameters in our model.

So instead of going out and finding 20 million samples, we can use PCA to reduce the dimensionality of our data to say, 20, and then we only need 200 samples for our model.

You can also use PCA to pre-process data before using an unsupervised learning algorithm, like k-means clustering. PCA, by the way, is also an unsupervised algorithm.
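As a sketch of what this pre-processing step looks like in scikit-learn (my own example, with made-up data and an arbitrary choice of 20 components):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.random.randn(2000, 500)                # pretend high-dimensional data
y = (np.random.randn(2000) > 0).astype(int)   # pretend binary labels

# reduce to 20 dimensions, then fit the classifier on the reduced data
model = make_pipeline(PCA(n_components=20), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))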

The math behind PCA

Everybody’s favorite part!

Ok, so this requires both (a) some statistics knowledge (knowing how to find the covariance matrix), and (b) some linear algebra knowledge (knowing what eigenvalues and eigenvectors are, and basic matrix manipulation).

Covariance

I stated above that in the rotated v1-v2 coordinate system, the data on the v1 axis was not correlated with the data on the v2 axis.

Intuitively, when you have a 2-D Gaussian distribution, the data looks like an ellipse. If that ellipse is perpendicular to the x-y grid, then the x and y components are independent and uncorrelated.

The image below shows what happens with different values of the correlation between \( X_i \) and \( X_j \) (called ‘c’ in the image):

[Image: scatterplots of 2-D Gaussian data for several values of the correlation c]

(If you really know your statistics, then you will recall that independence implies 0 correlation, but 0 correlation does not imply independence, unless the distribution is a joint Gaussian.)

So, in the first image, since the ellipse is not perpendicular to the x-y axis, the distributions p(x) and p(y) are not independent. But in the rotated v1-v2 axis, the distributions p(v1) and p(v2) are independent and uncorrelated.

If this all sounds crazy, just remember this one fact: if 2 variables \( X_i \) and \( X_j \) are uncorrelated, their corresponding element in the covariance matrix, \( \Sigma_{ij} = 0 \). And since the covariance matrix is symmetric (for our purposes), \( \Sigma_{ji} = 0 \) also.

For every variable to be uncorrelated to every other variable, all the elements of \( \Sigma \) will be 0, except those along the diagonal, \( \Sigma_{ii} = \sigma_i^2 \), which is just the regular variance for each dimension.

How do we calculate the covariance matrix from the data? Simple!

$$ \Sigma_X = \frac{1}{N} (X - \mu_X)^T(X - \mu_X) \tag{1}$$

In non-matrix form this is:

$$ \Sigma_{ij} = \frac{1}{N} \sum_{n=1}^{N} (x(n)_i - \mu_i)(x(n)_j - \mu_j) $$

I’ve labeled the equation because we’re going to return to it. You can verify that \( \Sigma \) is \( D \times D \) since those are the outer dimensions of the matrix multiply.

Note the mathematical sleight of hand I used above. \( X \) is \( N \times D \) and the mean \( \mu_X \) is \( 1 \times D \), so technically you can’t subtract them under the rules of matrix algebra, but let’s assume that the above notation means each row of \( X \) has \( \mu_X \) subtracted from it.
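You can check equation \( (1) \) against numpy’s built-in covariance; note that np.cov divides by \( N-1 \) by default, so pass bias=True to match the \( 1/N \) convention used here (this snippet is my own):

import numpy as np

N, D = 100, 4
X = np.random.randn(N, D)

mu = X.mean(axis=0)                     # 1 x D mean, broadcast-subtracted from each row
Sigma = (X - mu).T.dot(X - mu) / N      # equation (1)
print(np.allclose(Sigma, np.cov(X.T, bias=True)))  # True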

Eigenvalues and Eigenvectors

This is where the real magic of PCA happens.

For no particular reason at all, suppose we compute the eigenvalues and eigenvectors of \( \Sigma_X \).

If you don’t know what eigenvalues/eigenvectors are: remember that usually, when we multiply a vector by a matrix, it changes the direction of the vector.

So for example, the vector \( (1,2) \) is pointing in the same direction as \( (2,4) = 2(1,2) \). But if we were to multiply \( (1,2) \) by a matrix:

$$ \begin{pmatrix} 5 \\ 11 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} $$

The result is a vector pointing in a different direction. Eigenvalues \( \lambda \) and eigenvectors \( v \) have the property that, if you multiply \( \Sigma_Xv \), the result is the same as multiplying the eigenvector by a constant, the eigenvalue. In other words, the eigenvector does not change direction; it only gets shorter or longer by a factor of \( \lambda \):

$$ \Sigma_Xv = \lambda v $$

How many pairs of these eigenvectors and eigenvalues can we find? It just so happens that there are \( D \) such pairs. We’ll say, each eigenvalue \( \lambda_i \) has a corresponding eigenvector \( v_i \).

Now again, for no particular reason at all, we are going to combine all the \( v_i \) into a matrix, let’s call it \( V \).

We could line up the \( v_i \) in any random order, but instead we are going to choose a very particular order – we are going to line them up such that the corresponding eigenvalues \( \lambda_i \) are in descending order. In other words:

$$ \lambda_1 > \lambda_2 > \lambda_3 > … > \lambda_D $$

So:

$$ V = \begin{pmatrix}
| & | & | & … & | \\
v_1 & v_2 & v_3 & … & v_D \\
| & | & | & … & |
\end{pmatrix}
$$

We are also going to combine the eigenvalues into a diagonal matrix like this:

$$ \Lambda = \begin{pmatrix}
\lambda_1 & 0 & … & 0 \\
0 & \lambda_2 & … & 0 \\
… & … & … & … \\
0 & 0 & … & \lambda_D
\end{pmatrix}
$$

As a sidenote, don’t worry about how we find the eigenvalues and eigenvectors. That is a huge problem in and of itself. The method you used in high school – solving a polynomial to get the eigenvalues, plugging the eigenvalues into the eigenvalue equation to get the eigenvectors, etc. doesn’t really translate to computer code. There exist some very involved algorithms to solve this problem, but we’ll just use numpy.

In any case, it is easy to prove to yourself, with a simple 2-D example, that:

$$ \Sigma_X V = V \Lambda \tag{2}$$
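A quick numerical check of \( (2) \) with numpy (my own snippet): np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order, so we flip them to get the descending order used above.

import numpy as np

X = np.random.randn(200, 5)
X = X - X.mean(axis=0)
Sigma = X.T.dot(X) / len(X)

lambdas, V = np.linalg.eigh(Sigma)   # ascending order
lambdas = lambdas[::-1]              # flip to descending order
V = V[:, ::-1]                       # reorder the eigenvector columns to match
Lambda = np.diag(lambdas)

print(np.allclose(Sigma.dot(V), V.dot(Lambda)))  # True: Sigma_X V = V Lambda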

Just one more ingredient. We want to show that \( V \) is orthonormal. In terms of the individual eigenvectors this means that \( {v_i}^T v_j = 0 \) if \( i \ne j \), and \( {v_i}^T v_i = 1 \). In matrix form this means \( V^T V = I \), or in other words, \( V^T = V^{-1} \).

Proof of orthogonality: Consider \( \lambda_j {v_i}^T v_j \) for \( i \ne j \). \( \lambda_j {v_i}^T v_j = {v_i}^T (\lambda_j v_j) = {v_i}^T \Sigma_X v_j = (\Sigma_X^T v_i)^T v_j \). But we know that \( \Sigma_X \) is symmetric so this is \( (\Sigma_X v_i)^T v_j = (\lambda_i v_i)^T v_j = \lambda_i {v_i}^T v_j \). Since \( \lambda_i \ne \lambda_j \) (assuming the eigenvalues are distinct), this can only be true if \( {v_i}^T v_j = 0 \).

Normalizing eigenvectors is easy since they are not unique – just choose values so that their length is 1.

Finally, using \( (1) \), consider the covariance of the transformed data, \( Z \).

$$ \begin{align*}
\Sigma_Z &= \frac{1}{N} (Z - \mu_Z)^T (Z - \mu_Z) \\
&= \frac{1}{N} (XQ - \mu_XQ)^T (XQ - \mu_XQ) \\
&= \frac{1}{N} Q^T(X - \mu_X)^T (X - \mu_X)Q \\
&= Q^T \Sigma_X Q
\end{align*}
$$

Now, suppose we choose \( Q = V \). Then, by plugging in \( (2) \), we get:

$$ \begin{align*}
\Sigma_Z &= V^T \Sigma_X V \\
&= V^T V \Lambda \\
&= \Lambda
\end{align*}
$$

So what does this tell us? Since \( \Lambda \) is a diagonal matrix, there are no correlations in the transformed data. The variance of each dimension of \( Z \) is equal to the corresponding eigenvalue. In addition, because we sorted the eigenvalues in descending order, the first dimension of \( Z \) has the most variance, the second dimension has the second-most variance, etc. So most of the information is kept in the leading columns, as promised.

You can verify this in Python:

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA()
X = np.random.random((100,10)) # generate an N = 100, D = 10 random data matrix
Z = pca.fit_transform(X)

# visualize the covariance of Z
plt.imshow(np.cov(Z.T))
plt.show()

[Image: covariance matrix of \( Z \), which is diagonal with decreasing values along the diagonal]

Observe that the off-diagonals are 0 and that the variances are in decreasing order.

You can also confirm that \( Q^T Q = I \).

QTQ = pca.components_.T.dot(pca.components_)
plt.imshow(QTQ)
plt.show()

print np.around(QTQ, decimals=2)

[Image: plot of \( Q^TQ \), approximately the identity matrix]

Remember to leave a comment below if you have any questions on the above material. If you want to go more in depth on this and other data science topics (both with the math and the code), check out some of my data science video courses online.

#anomaly detection #covariance #dimensionality reduction #eigenvalues #eigenvectors #linear algebra #machine learning #pca #principal components analysis #statistics

Go to comments