February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

- Are interested in deep learning (this tutorial uses gradient descent)
- Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
- Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
- Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np import theano import theano.tensor as T X = T.matrix('X') Q = T.matrix('Q') Z = T.dot(X, Q) transform = theano.function(inputs=[X,Q], outputs=Z) X_val = np.random.randn(100,10) Q_val = np.random.randn(10,10) Z_val = transform(X_val, Q_val)

I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP you are probably familiar with the idea of “variable” objects and them passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$ J = \sum_{n=1}^{N} |x_n – \hat{x}_n|^2 $$

Where \( \hat{X} \) is the reconstruction of \( X \). If there is only one eigenvector, let’s call this \( v \), then this becomes:

$$ J = \sum_{n=1}^{N} |x_n – x_nvv^T|^2 $$

This is equivalent to the Frobenius norm, so we can write:

$$ J = |X – Xvv^T|_F $$

One identity of the Frobenius norm is:

$$ |A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij} } = \sqrt{ Tr(A^T A ) } $$

Which means we can rewrite the loss function as:

$$ J = Tr( (X – Xvv^T)^T(X – Xvv^T) ) $$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on \( v \), you can use matrix algebra to rearrange this to get:

$$ v^* = argmin\{-Tr(X^TXvv^T) \} $$

Which again using reordering would be equivalent to maximizing:

$$ v^* = argmax\{ v^TX^TXv \} $$

The corresponding eigenvalue would then be:

$$ \lambda = v^TX^TXv $$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$ v \leftarrow v + \eta \nabla_v(v^TX^TXv) $$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$ v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i ) $$

Next, note that to implement this algorithm you **never **need to actually calculate the covariance \( X^T X \). If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get \( Xv \), which is only of size \( N \times 1 \). You can then “dot” this with itself to get a scalar, which is only an \( O(N) \) operation.

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v") Xv = T.dot(X, v) cost = T.dot(Xv.T, Xv) - np.sum(evals[j]*T.dot(evecs[j], v)*T.dot(evecs[j], v) for j in xrange(i)) gv = T.grad(cost, v)

Note that we re-normalize the eigenvector on each step, so that \( v^T v = 1 \).

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv update_expression = y / y.norm(2) train = theano.function( inputs=[X], outputs=[your outputs], updates=((v, update_expression),) )

Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find): ... initialize variables and expressions ... ... initialize theano train function ... while t < max_iterations and change in v < tol: outputs = train(data) ... return eigenvalues and eigenvectors ...

This is not really trivial but at the same time it’s a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course “Unsupervised Deep Learning in Python” and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano