# New course! Bayesian Machine Learning in Python: A/B Testing

November 17, 2016

[If you already know you want to sign up for my Bayesian machine learning course, just scroll to the bottom to get your $10 coupon!] Boy, do I have some exciting news today! You guys have already been keeping up with my deep learning series. Hopefully, you’ve noticed that I’ve been releasing non-deep learning machine learning courses as well, in parallel (and they often tie into the deep learning series quite nicely). Well today, I am announcing the start of a BRAND NEW series on Bayesian machine learning. Bayesian methods require an entirely new way of thinking – a paradigm shift. But don’t worry, it’s not just all theory. In fact, the first course I’m releasing in the series is VERY practical – it’s on A/B testing. Every online advertiser, e-commerce store, marketing team, etc etc etc. does A/B testing. But did you know that traditional A/B testing is both horribly confusing and inefficient? Did you know that there are cool, new adaptive methods inspired by reinforcement learning that improve on those old crusty tests? (Those old methods, and the way they are traditionally taught, are probably the reason you cringe when you hear the word “statistics”) Well, Bayesian methods not only represent a state-of-the-art solution to many A/B testing challenges, they are also surprisingly theoretically simpler! You’ll end the course by doing your own simulation – comparing and contrasting the various adaptive A/B testing algorithms (including the final Bayesian method). This is VERY practical stuff and any digital media, newsfeed, or advertising startup will be EXTREMELY IMPRESSED if you know this stuff. This WILL advance your career, and any company would be lucky to have someone that knows this stuff on their team. Awesome coincidence #1: As I mentioned above, a lot of these techniques cross-over with reinforcement learning, so if you are itching for a preview of my upcoming deep reinforcement learning course, this will be very interesting for you. Awesome coincidence #2: Bayesian learning also crosses over with deep learning, one example being the variational autoencoder, which I may incorporate into a more advanced deep learning course in the future. They heavily rely on concepts from both Bayesian learning AND deep learning, and are very powerful state-of-the-art algorithms. Due to all the black Friday madness going on, I am going to do a ONE-TIME ONLY$10 special for this course. With my coupons, the price will remain at $10, even if Udemy’s site-wide sale price goes up (which it will). See you in class! As promised, here is the coupon: https://www.udemy.com/bayesian-machine-learning-in-python-ab-testing/?couponCode=EARLYBIRDSITE2 UPDATE: The Black Friday sale is over, but the early bird coupon is still up for grabs: https://www.udemy.com/bayesian-machine-learning-in-python-ab-testing/?couponCode=EARLYBIRDSITE LAST THING: Udemy is currently having an awesome Black Friday sale.$10 for ANY course starting Nov 15, but the price goes up by $1 every 2 days, so you need to ACT FAST. I was going to tell you earlier but I was hard at work on my course. =) Just click this link to get ANY course on Udemy for$10 (+\$1 every 2 days): http://bit.ly/2fY3y5M

#bayesian #data science #machine learning #statistics

# Announcing Data Science: Supervised Machine Learning in Python (Less Math, More Action!)

September 16, 2016

In recent years, we’ve seen a resurgence in AI, or artificial intelligence, and machine learning.

Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on-par with human experts.

Google’s AlphaGo program was able to beat a world champion in the strategy game go using deep reinforcement learning.

Machine learning is even being used to program self driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.

Google famously announced that they are now “machine learning first”, meaning that machine learning is going to get a lot more attention now, and this is what’s going to drive innovation in the coming years.

Machine learning is used in many industries, like finance, online advertising, medicine, and robotics.

It is a widely applicable tool that will benefit you no matter what industry you’re in, and it will also open up a ton of career opportunities once you get good.

Machine learning also raises some philosophical questions. Are we building a machine that can think? What does it mean to be conscious? Will computers one day take over the world?

The best part about this course is that it requires WAY less math than my usual courses; just some basic probability and geometry, no calculus!

In this course, we are first going to discuss the K-Nearest Neighbor algorithm. It’s extremely simple and intuitive, and it’s a great first classification algorithm to learn. After we discuss the concepts and implement it in code, we’ll look at some ways in which KNN can fail.

It’s important to know both the advantages and disadvantages of each algorithm we look at.

Next we’ll look at the Naive Bayes Classifier and the General Bayes Classifier. This is a very interesting algorithm to look at because it is grounded in probability.

We’ll see how we can transform the Bayes Classifier into a linear and quadratic classifier to speed up our calculations.

Next we’ll look at the famous Decision Tree algorithm. This is the most complex of the algorithms we’ll study, and most courses you’ll look at won’t implement them. We will, since I believe implementation is good practice.

The last algorithm we’ll look at is the Perceptron algorithm. Perceptrons are the ancestor of neural networks and deep learning, so they are important to study in the context of machine learning.

One we’ve studied these algorithms, we’ll move to more practical machine learning topics. Hyperparameters, cross-validation, feature extraction, feature selection, and multiclass classification.

We’ll do a comparison with deep learning so you understand the pros and cons of each approach.

We’ll discuss the Sci-Kit Learn library, because even though implementing your own algorithms is fun and educational, you should use optimized and well-tested code in your actual work.

We’ll cap things off with a very practical, real-world example by writing a web service that runs a machine learning model and makes predictions. This is something that real companies do and make money from.

All the materials for this course are FREE. You can download and install Python, Numpy, and Scipy with simple commands on Windows, Linux, or Mac.

https://www.udemy.com/data-science-supervised-machine-learning-in-python/?couponCode=EARLYBIRDSITE

UPDATE: New coupon if the above is sold out:

https://www.udemy.com/data-science-supervised-machine-learning-in-python/?couponCode=SLOWBIRD_SITE

#data science #machine learning #matplotlib #numpy #pandas #python

# New course – Deep Learning part 5: Recurrent Neural Networks in Python

July 14, 2016

New course out today – Recurrent Neural Networks in Python: Deep Learning part 5.

If you already know what the course is about (recurrent units, GRU, LSTM), grab your 50% OFF coupon and go!:

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=WEBSITE

Like the course I just released on Hidden Markov Models, Recurrent Neural Networks are all about learning sequences – but whereas Markov Models are limited by the Markov assumption, Recurrent Neural Networks are not – and as a result, they are more expressive, and more powerful than anything we’ve seen on tasks that we haven’t made progress on in decades.

Sequences appear everywhere – stock prices, language, credit scoring, and webpage visits.

Recurrent neural networks have a history of being very hard to train. It hasn’t been until recently that we’ve found ways around what is called the vanishing gradient problem, and since then, recurrent neural networks have become one of the most popular methods in deep learning.

If you took my course on Hidden Markov Models, we are going to go through a lot of the same examples in this class, except that our results are going to be a lot better.

Our classification accuracies will increase, and we’ll be able to create vectors of words, or word embeddings, that allow us to visualize how words are related on a graph.

We’ll see some pretty interesting results, like that our neural network seems to have learned that all religions and languages and numbers are related, and that cities and countries have hierarchical relationships.

If you’re interested in discovering how modern deep learning has propelled machine learning and data science to new heights, this course is for you.

I’ll see you in class.

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=WEBSITE

#data science #deep learning #gru #lstm #machine learning #word vectors

# New Course: Unsupervised Machine Learning – Hidden Markov Models in Python

June 13, 2016

Hidden Markov Models are all about learning sequences.

A lot of the data that would be very useful for us to model is in sequences. Stock prices are sequences of prices. Language is a sequence of words. Credit scoring involves sequences of borrowing and repaying money, and we can use those sequences to predict whether or not you’re going to default. In short, sequences are everywhere, and being able to analyze them is an important skill in your data science toolbox.

The easiest way to appreciate the kind of information you get from a sequence is to consider what you are reading right now. If I had written the previous sentence backwards, it wouldn’t make much sense to you, even though it contained all the same words. So order is important.

While the current fad in deep learning is to use recurrent neural networks to model sequences, I want to first introduce you guys to a machine learning algorithm that has been around for several decades now – the Hidden Markov Model.

This course follows directly from my first course in Unsupervised Machine Learning for Cluster Analysis, where you learned how to measure the probability distribution of a random variable. In this course, you’ll learn to measure the probability distribution of a sequence of random variables.

You guys know how much I love deep learning, so there is a little twist in this course. We’ve already covered gradient descent and you know how central it is for solving deep learning problems. I claimed that gradient descent could be used to optimize any objective function. In this course I will show you how you can use gradient descent to solve for the optimal parameters of an HMM, as an alternative to the popular expectation-maximization algorithm.

We’re going to do it in Theano, which is a popular library for deep learning. This is also going to teach you how to work with sequences in Theano, which will be very useful when we cover recurrent neural networks and LSTMs.

This course is also going to go through the many practical applications of Markov models and hidden Markov models. We’re going to look at a model of sickness and health, and calculate how to predict how long you’ll stay sick, if you get sick. We’re going to talk about how Markov models can be used to analyze how people interact with your website, and fix problem areas like high bounce rate, which could be affecting your SEO. We’ll build language models that can be used to identify a writer and even generate text – imagine a machine doing your writing for you.

We’ll look at what is possibly the most recent and prolific application of Markov models – Google’s PageRank algorithm. And finally we’ll discuss even more practical applications of Markov models, including generating images, smartphone autosuggestions, and using HMMs to answer one of the most fundamental questions in biology – how is DNA, the code of life, translated into physical or behavioral attributes of an organism?

#data science #deep learning #hidden markov models #machine learning #recurrent neural networks #theano

# New Deep Learning course on Udemy

February 26, 2016

Shut up and gimme the link!: https://deeplearningcourses.com/c/data-science-deep-learning-in-theano-tensorflow/

This course continues where my first course, Deep Learning in Python, left off. You already know how to build an artificial neural network in Python, and you have a plug-and-play script that you can use for TensorFlow.

You learned about backpropagation (and because of that, this course contains basically NO MATH), but there were a lot of unanswered questions. How can you modify it to improve training speed? In this course you will learn about batch and stochastic gradient descent, two commonly used techniques that allow you to train on just a small sample of the data at each iteration, greatly speeding up training time.

You will also learn about momentum, which can be helpful for carrying you through local minima and prevent you from having to be too conservative with your learning rate. You will also learn aboutadaptive learning rate techniques like AdaGrad and RMSprop which can also help speed up your training.

In my last course, I just wanted to give you a little sneak peak at TensorFlow. In this course we are going to start from the basics so you understand exactly what’s going on – what are TensorFlow variables and expressions and how can you use these building blocks to create a neural network? We are also going to look at a library that’s been around much longer and is very popular for deep learning – Theano. With this library we will also examine the basic building blocks – variables, expressions, and functions – so that you can build neural networks in Theano with confidence.

Because one of the main advantages of TensorFlow and Theano is the ability to use the GPU to speed up training, I will show you how to set up a GPU-instance on AWS and compare the speed of CPU vs GPU for training a deep neural network.

With all this extra speed, we are going to look at a real dataset – the famous MNIST dataset (images of handwritten digits) and compare against various known benchmarks.

# Principal Components Analysis in Theano

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

• Are interested in deep learning (this tutorial uses gradient descent)
• Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
• Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
• Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
Q = T.matrix('Q')
Z = T.dot(X, Q)

transform = theano.function(inputs=[X,Q], outputs=Z)

X_val = np.random.randn(100,10)
Q_val = np.random.randn(10,10)

Z_val = transform(X_val, Q_val)


I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP you are probably familiar with the idea of “variable” objects and them passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$J = \sum_{n=1}^{N} |x_n – \hat{x}_n|^2$$

Where $$\hat{X}$$ is the reconstruction of $$X$$. If there is only one eigenvector, let’s call this $$v$$, then this becomes:

$$J = \sum_{n=1}^{N} |x_n – x_nvv^T|^2$$

This is equivalent to the Frobenius norm, so we can write:

$$J = |X – Xvv^T|_F$$

One identity of the Frobenius norm is:

$$|A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij} } = \sqrt{ Tr(A^T A ) }$$

Which means we can rewrite the loss function as:

$$J = Tr( (X – Xvv^T)^T(X – Xvv^T) )$$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on $$v$$, you can use matrix algebra to rearrange this to get:

$$v^* = argmin\{-Tr(X^TXvv^T) \}$$

Which again using reordering would be equivalent to maximizing:

$$v^* = argmax\{ v^TX^TXv \}$$

The corresponding eigenvalue would then be:

$$\lambda = v^TX^TXv$$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$v \leftarrow v + \eta \nabla_v(v^TX^TXv)$$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i )$$

Next, note that to implement this algorithm you never need to actually calculate the covariance $$X^T X$$. If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get $$Xv$$, which is only of size $$N \times 1$$. You can then “dot” this with itself to get a scalar, which is only an $$O(N)$$ operation.

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v")
Xv = T.dot(X, v)
cost = T.dot(Xv.T, Xv) - np.sum(evals[j]*T.dot(evecs[j], v)*T.dot(evecs[j], v) for j in xrange(i))



Note that we re-normalize the eigenvector on each step, so that $$v^T v = 1$$.

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv
update_expression = y / y.norm(2)

train = theano.function(
inputs=[X],
)


Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find):
... initialize variables and expressions ...
... initialize theano train function ...
while t < max_iterations and change in v < tol:
outputs = train(data)

... return eigenvalues and eigenvectors ...


This is not really trivial but at the same time it’s a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course “Unsupervised Deep Learning in Python” and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano

# Data Science: Natural Language Processing in Python

February 11, 2016

Do you want to learn natural language processing from the ground-up?

If you hate math and want to jump into purely practical coding examples, my new course is for you.

You can check it out here: https://deeplearningcourses.com/c/data-science-natural-language-processing-in-python/

I am posting the course summary here also for convenience:

In this course you will build MULTIPLE practical systems using natural language processing, or NLP. This course is not part of my deep learning series, so there are no mathematical prerequisites – just straight up coding in Python. All the materials for this course are FREE.

After a brief discussion about what NLP is and what it can do, we will begin building very useful stuff. The first thing we’ll build is a spam detector. You likely get very little spam these days, compared to say, the early 2000s, because of systems like these.

Next we’ll build a model for sentiment analysis in Python. This is something that allows us to assign a score to a block of text that tells us how positive or negative it is. People have used sentiment analysis on Twitter to predict the stock market.

We’ll go over some practical tools and techniques like the NLTK (natural language toolkit) library and latent semantic analysis or LSA.

Finally, we end the course by building an article spinner. This is a very hard problem and even the most popular products out there these days don’t get it right. These lectures are designed to just get you started and to give you ideas for how you might improve on them yourself. Once mastered, you can use it as an SEO, or search engine optimization tool. Internet marketers everywhere will love you if you can do this for them!

As a thank you for visiting this site, I’ve created a coupon that gets you 85% off! Just click here:

https://deeplearningcourses.com/c/data-science-natural-language-processing-in-python/

#article spinner #latent semantic analysis #latent semantic indexing #machine learning #natural language processing #nlp #pca #python #spam detection #svd

# A Tutorial on Autoencoders for Deep Learning

December 31, 2015

Despite its somewhat initially-sounding cryptic name, autoencoders are a fairly basic machine learning model (and the name is not cryptic at all when you know what it does).

Autoencoders belong to the neural network family, but they are also closely related to PCA (principal components analysis).

• It is an unsupervised learning algorithm (like PCA)
• It minimizes the same objective function as PCA
• It is a neural network
• The neural network’s target output is its input

The last point is key here. This is the architecture of an autoencoder:

So the dimensionality of the input is the same as the dimensionality of the output, and essentially what we want is x’ = x.

It can be shown that the objective function for PCA is:

$$J = \sum_{n=1}^{N} |x(n) – \hat{x}(n)|^2$$

Where the prediction $$\hat{x}(n) = Q^{-1}Qx(n)$$.

Q can be the full transformation matrix (which would result in getting exactly the old x back), or it can be a “rank k” matrix (i.e. keeping the k-most relevant eigenvectors), which would then result in only an approximation of x.

So the objective function can be written as:

$$J = \sum_{n=1}^{N} |x(n) – Q^{-1}Qx(n)|^2$$

Recall that to get the value at the hidden layer, we simply multiply the input->hidden weights by the input.

Like so:

$$z = f(Wx)$$

And to get the value at the output, we multiply the hidden->output weights by the hidden layer values, like so:

$$y = g(Vz)$$

The choice of $$f$$ and $$g$$ is up to us, we just have to know how to take the derivative for backpropagation.

We are of course free to make them “identity” functions, such that:

$$y = g(V f(Wx)) = VWx$$

This gives us the objective:

$$J = \sum_{n=1}^{N} |x(n) – VWx(n)|^2$$

Which is the same as PCA!

## If autoencoders are similar to PCA, why do we need autoencoders?

Autoencoders are much more flexible than PCA.

Recall that with neural networks we have an activation function – this can be a “ReLU” (aka. rectifier), “tanh” (hyperbolic tangent), or sigmoid.

This introduces nonlinearities in our encoding, whereas PCA can only represent linear transformations.

The network representation also means you can stack autoencoders to form a deep network.

## Cool theory bro, but what can autoencoders actually do for me?

Good question!

Similar to PCA – autoencoders can be used for finding a low-dimensional representation of your input data. Why is this useful?

Some of your features may be redundant or correlated, resulting in wasted processing time and overfitting in your model (too many parameters).

It is thus ideal to only include the features we need.

If your “reconstruction” of x is very accurate, that means your low-dimensional representation is good.

You can then use this transformation as input into another model.

## Training an autoencoder

Since autoencoders are really just neural networks where the target output is the input, you actually don’t need any new code.

Suppose we’re working with a sci-kit learn-like interface.

model.fit(X, Y)


You would just have:

model.fit(X, X)


Pretty simple, huh?

All the usual neural network training strategies work with autoencoders too:

• backpropagation
• regularization
• dropout

If you want to get good with autoencoders – I would recommend trying to take some data and an existing neural network package you’re comfortable with – and see what low-dimensional representation you can come up with. How many dimensions are there?

Autoencoders are part of a family of unsupervised deep learning methods, which I cover in-depth in my course, Unsupervised Deep Learning in Python. We discuss how to stack autoencoders to build deep belief networks, and compare them to RBMs which can be used for the same purpose. We derive all the equations and write all the code from scratch – no shortcuts. Ask me for a coupon so I can give you a discount!

P.S. “Autoencoders” means “encodes itself”. Not so cryptic now, right?

#autoencoders #deep learning #machine learning #pca #principal components analysis #unsupervised learning

# What the hell is a “z-score”?

December 4, 2015

Was having this conversation recently about what a z-score is.

What is a z-score? Why is it mentioned so often in statistics? You’ve probably heard of it in the context of the Student t-test.

If you are a machine learning guy more than a statistics guy, like me, you’ve probably heard you should “standardize” or “normalize” your variables before putting them into a machine learning model.

For example, if you’re doing linear regression and $$X_1$$ varies from $$0..1$$ and $$X_2$$ varies from $$0..1000$$, the weights $$\beta_1$$ and $$\beta_2$$ will give an inaccurate picture as to their importance.

Well, the z-score actually is the standardized variable.

I would avoid this terminology if possible, because in machine learning we usually think of a “score” as the output of a model, i.e. I’m getting the “score” of a set of links because I want to know in what order to rank them.

$$z = \frac{x – \mu}{\sigma}$$

So instead of thinking of “z-scores”, think of “I need to standardize my variables before using them in my machine learning model so that the effect of any one variable is on the same scale as the others”.

Do you find statistics terminology confusing? “Z score”? “F statistic”? Share your thoughts in the comments!

#frequentist statistics #machine learning #statistics #z score

# Tutorial: Principal Components Analysis (PCA)

November 20, 2015

I remember learning about principal components analysis for the very first time. I remember thinking it was very confusing, and that I didn’t know what it had to do with eigenvalues and eigenvectors (I’m not even sure I remembered what eigenvalues and eigenvectors were at the time).

I assure you that in hindsight, understanding PCA, despite its very scientific-sounding name, is not that difficult to understand at all. There are two “categories” I think it makes sense to split the discussion of PCA into.

1) What does it do? A written and visual description of how PCA is used.

2) The math behind PCA. A few equations showing that the things mentioned in (1) are true.

### What does PCA do?

Linear transformation

PCA finds a matrix $$Q$$ that, when multiplied by your original data matrix $$X$$, gives you a linearly transformed data matrix $$Z$$, where:

$$Z = XQ$$

In other words, for a single sample vector $$x$$, we can obtain its transformation $$z = Q^Tx$$.

The $$Q$$ and $$x$$ appear in a different order here because when we load a data matrix, it’s usually $$N \times D$$, so each sample vector is $$1 \times D$$, but when we talk about vectors by themselves we usually think of them as $$D \times 1$$.

The interesting thing about PCA is how it chooses $$Q$$.

Dimensionality reduction

When we think of machine learning models we often study them in the context of 1-3 dimensions. But if you take a dataset like MNIST, where each image is 28×28 pixels, then each input is 784 dimensions. 28×28 is a tiny image. The new iPhone 6 has a resolution of 1080 × 1920 pixels. Each image would thus have 2073600 $$\approx$$ 2 million dimensions.

PCA reduces dimensionality by moving as much “information” as possible into as few dimensions as possible. We measure information via unpredictability, i.e. variance. The end result is that our transformed matrix $$Z$$ has most of its variance in the first column, less variance in the second column, even less variance in the third column, etc.

You will see later how our choice of $$Q$$ accomplishes this.

De-correlation

$$D \approx 2000000$$ is a much bigger problem than the ones you learn about in the classroom. But often there are correlations in the data that make many of the dimensions redundant. This is related to the problem of linear regression – if we can accurately predict one dimension of the inputs, let’s say, $$x_3$$, in terms of another, $$x_5$$, then one of those dimensions is redundant.

The final result of removing these correlations is that every dimension of $$Z$$ is uncorrelated with the other.

The image above shows a scatterplot of the original data on the original x-y axis. The arrows on the plot itself show that the new v1-v2 axis is a) rotated (matrix multiplication rotates a vector) b) still perpendicular, and c) the data relative to the new axis is normally distributed and the distributions on each dimension are independent and uncorrelated.

Visualization

Once we’ve transformed $$X$$ into $$Z$$, noting that most of the “information” from $$X$$ is in the first few dimensions of $$Z$$ – let’s say… 90% of the variance is in the first 2 dimensions – we can do a 2-D scatterplot of Z to see the structure of our data. Humans have a hard time visualization anything greater than 2 dimensions.

We can consider the first 2 dimensions to represent the real underlying trend in the data, and the other dimensions just small perturbations or noise.

Pre-processing

You may have heard that too many parameters in your model can lead to overfitting. So if you are planning to use a logistic regression model, and the dimensionality of each input is 2 million, then the number of weights in your model will also be 2 million. This is not good news if you have only a few thousand samples. One rule of thumb is we would like to have 10x the number of samples compared to the number of parameters in our model.

So instead of going out and finding 20 million samples, we can use PCA to reduce the dimensionality of our data to say, 20, and then we only need 200 samples for our model.

You can also use PCA to pre-process data before using an unsupervised learning algorithm, like k-means clustering. PCA, by the way, is also an unsupervised algorithm.

### The math behind PCA

Everybody’s favorite part!

Ok, so this requires both (a) some statistics knowledge (knowing how to find the covariance matrix), and (b) some linear algebra knowledge (knowing what eigenvalues and eigenvectors are, and basic matrix manipulation).

Covariance

I stated above that in the rotated v1-v2 coordinate system, the data on the v1 axis was not correlated with the data on the v2 axis.

Intuitively, when you have a 2-D Gaussian distribution, the data looks like an ellipse. If that ellipse is perpendicular to the x-y grid, then the x and y components are independent and uncorrelated.

The image below shows what happens with different values of the correlation between $$X_i$$ and $$X_j$$ (called ‘c’ in the image):

(If you really know your statistics, then you will recall that independence implies 0 correlation, but 0 correlation does not imply independence, unless the distribution is a joint Gaussian.)

So, in the first image, since the ellipse is not perpendicular to the x-y axis, the distributions p(x) and p(y) are not independent. But in the rotated v1-v2 axis, the distributions p(v1) and p(v2) are independent and uncorrelated.

If this all sounds crazy, just remember this one fact: if 2 variables $$X_i$$ and $$X_j$$ are uncorrelated, their corresponding element in the covariance matrix, $$\Sigma_{ij} = 0$$. And since the covariance matrix is symmetric (for our purposes), $$\Sigma_{ji} = 0$$ also.

For every variable to be uncorrelated to every other variable, all the elements of $$\Sigma$$ will be 0, except those along the diagonal, $$\Sigma_{ii} = \sigma_i^2$$, which is just the regular variance for each dimension.

How do we calculate the covariance matrix from the data? Simple!

$$\Sigma_X = \frac{1}{N} (X – \mu_X)^T(X – \mu_X) \tag{1}$$

In non-matrix form this is:

$$\Sigma_{ij} = \frac{1}{N} \sum_{n=1}^{N} (x(n)_i – \mu_i)(x(n)_j – \mu_j)$$

I’ve labeled the equation because we’re going to return to it. You can verify that $$\Sigma$$ is $$D \times D$$ since those are the outer dimensions of the matrix multiply.

Note the mathematical sleight of hand I used above. $$X$$ is $$N \times D$$ and the mean $$\mu_X$$ is $$1 \times D$$, so technically you can’t subtract them under the rules of matrix algebra, but let’s assume that the above notation means each row of $$X$$ has $$\mu_X$$ subtracted from it.

Eigenvalues and Eigenvectors

This is where the real magic of PCA happens.

For no particular reason at all, suppose we compute the eigenvalues and eigenvectors of $$\Sigma_X$$.

If you don’t know what eigenvalues/eigenvectors are: remember that usually, when we multiply a vector by a matrix, it changes the direction of the vector.

So for example, the vector $$(1,2)$$ is pointing in the same direction as $$(2,4) = 2(1,2)$$. But if we were to multiply $$(1,2)$$ by a matrix:

$$\begin{pmatrix} 5 \\ 11 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

The result is a vector in a different direction. Eigenvalues $$\lambda$$ and eigenvectors $$v$$ have the property that, if you multiply $$\Sigma_Xv$$, this is the same by multiplying the eigenvector by a constant – the eigenvalue, i.e. the eigenvector does not change direction, it only gets shorter or longer by a factor of $$\lambda$$. In other words:

$$\Sigma_Xv = \lambda v$$

How many pairs of these eigenvectors and eigenvalues can we find? It just so happens that there are $$D$$ such pairs. We’ll say, each eigenvalue $$\lambda_i$$ has a corresponding eigenvector $$v_i$$.

Now again, for no particular reason at all, we are going to combine all the $$v_i$$ into a matrix, let’s call it $$V$$.

We could line up the $$v_i$$ in any random order, but instead we are going to choose a very particular order – we are going to line them up such that the corresponding eigenvalues $$\lambda_i$$ are in descending order. In other words:

$$\lambda_1 > \lambda_2 > \lambda_3 > … > \lambda_D$$

So:

$$V = \begin{pmatrix} | & | & | & … & | \\ v_1 & v_2 & v_3 & … & v_D \\ | & | & | & … & | \end{pmatrix}$$

We are also going to combine the eigenvalues into a diagonal matrix like this:

$$\Lambda = \begin{pmatrix} \lambda_1 & 0 & … & 0 \\ 0 & \lambda_2 & … & 0 \\ … & … & … & … \\ 0 & 0 & … & \lambda_D \end{pmatrix}$$

As a sidenote, don’t worry about how we find the eigenvalues and eigenvectors. That is a huge problem in and of itself. The method you used in high school – solving a polynomial to get the eigenvalues, plugging the eigenvalues into the eigenvalue equation to get the eigenvectors, etc. doesn’t really translate to computer code. There exist some very involved algorithms to solve this problem, but we’ll just use numpy.

In any case, it is easy to prove to yourself, with a simple 2-D example, that:

$$\Sigma_X V = V \Lambda \tag{2}$$

Just one more ingredient. We want to show that $$V$$ is orthonormal. In terms of the individual eigenvectors this means that $${v_i}^T v_j = 0$$ if $$i \ne j$$, and $${v_i}^T v_i = 1$$. In matrix form this means $$V^T V = I$$, or in other words, $$V^T = V^{-1}$$.

Proof of orthogonality: Consider $$\lambda_j {v_i}^T v_j$$ for $$i \ne j$$. $$\lambda_j {v_i}^T v_j = {v_i}^T (\lambda_j v_j) = {v_i}^T \Sigma_X v_j = (\Sigma_X^T v_i)^T v_j$$. But we know that $$\Sigma_X$$ is symmetric so this is $$(\Sigma_X v_i)^T v_j = (\lambda_i v_i)^T v_j = \lambda_i {v_i}^T v_j$$. Since $$\lambda_i \ne \lambda_j$$, this can only be true if $${v_i}^T v_j = 0$$.

Normalizing eigenvectors is easy since they are not unique – just choose values so that their length is 1.

Finally, using $$(1)$$, consider the covariance of the transformed data, $$Z$$.

\begin{align*} \Sigma_Z &= \frac{1}{N} (Z – \mu_Z)^T (Z – \mu_Z) \\ &= \frac{1}{N} (XQ – \mu_XQ)^T (XQ – \mu_XQ) \\ &= \frac{1}{N} Q^T(X – \mu_X)^T (X – \mu_X)Q \\ &= Q^T \Sigma_X Q \end{align*}

Now, suppose we choose $$Q = V$$. Then, by plugging in $$(2)$$, we get:

\begin{align*} \Sigma_Z &= V^T \Sigma_X V \\ &= V^T V \Lambda \\ &= \Lambda \end{align*}

So what does this tell us? Since $$\Lambda$$ is a diagonal matrix, there are no correlations in the transformed data. The variance of each dimension of $$Z$$ is equal to the eigenvalues. In addition, because we sorted the eigenvalues in descending order, the first dimension of $$Z$$ has the most variance, the second dimension has the second-most variance, etc. So most of the information is kept in the leading columns, as promised.

You can verify this in Python:

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA()
X = np.random.random((100,10)) # generate an N = 100, D = 10 random data matrix
Z = pca.fit_transform(X)

# visualize the covariance of Z
plt.imshow(np.cov(Z.T))
plt.show()


Observe that the off-diagonals are 0 and that the variances are in decreasing order.

You can also confirm that $$Q^T Q = I$$.

QTQ = pca.components_.T.dot(pca.components_)
plt.imshow(QTQ)
plt.show()

print np.around(QTQ, decimals=2)


Remember to leave a comment below if you have any questions on the above material. If you want to go more in depth on this and other data science topics (both with the math and the code), check out some of my data science video courses online.

#anomaly detection #covariance #dimensionality reduction #eigenvalues #eigenvectors #linear algebra #machine learning #pca #principal components analysis #statistics