Lazy Programmer

Your source for the latest in deep learning, big data, data science, and artificial intelligence. Sign up now

Covariance Matrix: Divide by N or N-1?

February 14, 2016

This is a statistics post. It’s probably very boring. I am posting it for my own reference, because I seem to forget how this is derived every time I need it.

Sometimes you see the sample variance defined as:

$$ \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (X_n – \mu)^2 $$

But you might also see it defined as:

$$ \hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (X_n – \hat{\mu})^2 $$

Where as usual the “hat” symbol means that is our prediction.

Why do statisticians sometimes divide by N, and sometimes divide by N-1?

The same question arises for the calculation of the sample covariance matrix, and this is what we will work with in this post.

covariance

This has to do with whether you want your estimate to be a biased estimate or an unbiased estimate.

For any parameter \( \theta \), our estimate \( \hat{ \theta } \) is unbiased if:

$$ E\{ \hat{ \theta } \} = 0 $$

In this tutorial we will calculate the bias of the sample covariance on the multivariate Gaussian, which is defined as:

$$ p(x | \mu, \sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} exp( -\frac{1}{2} (x – \mu)^T \Sigma^{-1} (x – \mu) ) $$

The maximum likelihood estimates of \( \mu \) and \( \Sigma \) can be found by taking the derivative of the log-likelihood and setting it to 0.

The likelihood is:

$$ p(X | \mu, \Sigma) = \prod_{n=1}^{N} p(x_n | \mu, \Sigma) $$

So the log-likelihood (expanded) is:

$$ L(X | \mu, \Sigma) = -\frac{ND}{2} log(2\pi)  -\frac{N}{2} log(|\Sigma|)  -\sum_{n=1}^{N} \frac{1}{2}(x_n – \mu)^T \Sigma^{-1} (x_n – \mu) $$

To take “vector derivatives” and “matrix derivatives” you’ll want to consult the Matrix Cookbook.

If you made it this far it is almost trivial to calculate \( \frac{ \partial L }{ \partial \mu } = 0 \) to get the usual result:

$$ \hat{ \mu } = \frac{1}{N} \sum_{n=1}^{N} x_n $$

To get the sample covariance, we calculate:

$$ \frac{ \partial L }{ \partial \Sigma } = -\frac{N}{2} (\Sigma^{-1})^T – \frac{1}{2} \sum_{n=1}^{N} – \Sigma^{-T} (x_n – \mu) (x_n – \mu)^T \Sigma^{-T} $$

Set that t0 0 and solve for \( \Sigma \) to get:

$$ \hat{ \Sigma } = \frac{1}{N} \sum_{n=1}^{N} (x_n – \hat{\mu}) (x_n – \hat{\mu})^T $$

Note that we are assuming we don’t have the “true mean”, so we are estimating the mean using the maximum likelihood estimate before calculating the maximum likelihood estimate for the covariance.

Now we will show that \( E\{ \hat{ \Sigma } \}  \neq 0 \). By definition:

$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} (x_n – \hat{\mu})(x_n – \hat{\mu})^T \} $$

Expand:

$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} ((x_n – \mu) – (\hat{\mu} – \mu))((x_n – \mu) – (\hat{\mu} – \mu)))^T \} $$

Expand that:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \frac{1}{N} E\{ \sum_{n} (\hat{\mu} – \mu)) (\hat{\mu} – \mu))^T – (x_n – \mu)(\hat{\mu} – \mu))^T – (\hat{\mu} – \mu))(x_n – \mu)^T \} $$

Multiply out the terms:

$$ E\{ \hat{ \Sigma } \} = \Sigma  + E\{ \hat{\mu}\hat{\mu}^T \} – \frac{1}{N} \sum_{n} E\{ x_n\hat{\mu}^T \} – \frac{1}{N} \sum_{n} E\{ \hat{\mu} x_n^T \} + \mu\mu^T $$

We can combine the expected values to get:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T – E\{ \hat{\mu}\hat{\mu}^T \} $$

Now the exercise becomes finding the expected value.

We need some identities.

First, for \( m \neq n \):

$$ E\{ x_m x_n^T \} = E\{ x_m \} E\{ x_n^T \} =  \mu\mu^T $$

Because each sample is IID: independent and identically distributed.

Next, the definition of covariance:

$$ \Sigma = E\{ (x – \mu)(x – \mu)^T \} = E\{ xx^T \} – \mu\mu^T $$

We can rearrange this to get:

$$ E\{ xx^T \} = \Sigma + \mu\mu^T $$

The term \( E\{\hat{\mu}\hat{\mu}^T \} \) can be expanded as:

$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N^2} E\{ (x_1 + x_2 + … + x_N)(x_1 + x_2 + … + x_N)^T \} $$

When expanding the multiplication, there are \( N \) terms that are the same, so that would be a \( N E\{ x_n x_n^T \} \) contribution. There are \( N(N-1) \) terms that are different, and since different terms are independent that is a \( N(N-1)\mu\mu^T \) contribution.

So in total:

$$ E\{\hat{\mu}\hat{\mu}^T \} =\frac{1}{N^2}(N(\Sigma + \mu\mu^T) + N(N-1)\mu\mu^T) $$

Or:

$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N}\Sigma + \mu\mu^T $$

Plugging this back into the expression for the bias:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T – \frac{1}{N}\Sigma – \mu\mu^T $$

Or:

$$ E\{ \hat{ \Sigma } \} = \frac{N-1}{N} \Sigma \neq \Sigma $$

So, if you want the unbiased estimator, you can multiply the biased maximum likelihood estimator by \( \frac{N}{N-1} \), which gives the expected unbiased formula.

 

#covariance #maximum likelihood #MLE #multivariate Gaussian #multivariate normal #sample variance #statistics #unbiased estimator

Go to comments


Tutorial: Principal Components Analysis (PCA)

November 20, 2015

I remember learning about principal components analysis for the very first time. I remember thinking it was very confusing, and that I didn’t know what it had to do with eigenvalues and eigenvectors (I’m not even sure I remembered what eigenvalues and eigenvectors were at the time).

I assure you that in hindsight, understanding PCA, despite its very scientific-sounding name, is not that difficult to understand at all. There are two “categories” I think it makes sense to split the discussion of PCA into.

1) What does it do? A written and visual description of how PCA is used.

2) The math behind PCA. A few equations showing that the things mentioned in (1) are true.

What does PCA do?

Linear transformation

PCA finds a matrix \( Q \) that, when multiplied by your original data matrix \( X \), gives you a linearly transformed data matrix \( Z \), where:

$$ Z = XQ $$

In other words, for a single sample vector \( x \), we can obtain its transformation \( z = Q^Tx \).

The \( Q \) and \( x \) appear in a different order here because when we load a data matrix, it’s usually \( N \times D \), so each sample vector is \( 1 \times D \), but when we talk about vectors by themselves we usually think of them as \( D \times 1 \).

The interesting thing about PCA is how it chooses \( Q \).

Dimensionality reduction

When we think of machine learning models we often study them in the context of 1-3 dimensions. But if you take a dataset like MNIST, where each image is 28×28 pixels, then each input is 784 dimensions. 28×28 is a tiny image. The new iPhone 6 has a resolution of 1080 × 1920 pixels. Each image would thus have 2073600 \( \approx \) 2 million dimensions.

PCA reduces dimensionality by moving as much “information” as possible into as few dimensions as possible. We measure information via unpredictability, i.e. variance. The end result is that our transformed matrix \( Z \) has most of its variance in the first column, less variance in the second column, even less variance in the third column, etc.

You will see later how our choice of \( Q \) accomplishes this.

De-correlation

\( D \approx 2000000 \) is a much bigger problem than the ones you learn about in the classroom. But often there are correlations in the data that make many of the dimensions redundant. This is related to the problem of linear regression – if we can accurately predict one dimension of the inputs, let’s say, \( x_3 \), in terms of another, \( x_5 \), then one of those dimensions is redundant.

The final result of removing these correlations is that every dimension of \( Z \) is uncorrelated with the other.

PCA

The image above shows a scatterplot of the original data on the original x-y axis. The arrows on the plot itself show that the new v1-v2 axis is a) rotated (matrix multiplication rotates a vector) b) still perpendicular, and c) the data relative to the new axis is normally distributed and the distributions on each dimension are independent and uncorrelated.

Visualization

Once we’ve transformed \( X \) into \( Z \), noting that most of the “information” from \( X \) is in the first few dimensions of \( Z \) – let’s say… 90% of the variance is in the first 2 dimensions – we can do a 2-D scatterplot of Z to see the structure of our data. Humans have a hard time visualization anything greater than 2 dimensions.

We can consider the first 2 dimensions to represent the real underlying trend in the data, and the other dimensions just small perturbations or noise.

Pre-processing

You may have heard that too many parameters in your model can lead to overfitting. So if you are planning to use a logistic regression model, and the dimensionality of each input is 2 million, then the number of weights in your model will also be 2 million. This is not good news if you have only a few thousand samples. One rule of thumb is we would like to have 10x the number of samples compared to the number of parameters in our model.

So instead of going out and finding 20 million samples, we can use PCA to reduce the dimensionality of our data to say, 20, and then we only need 200 samples for our model.

You can also use PCA to pre-process data before using an unsupervised learning algorithm, like k-means clustering. PCA, by the way, is also an unsupervised algorithm.

The math behind PCA

Everybody’s favorite part!

Ok, so this requires both (a) some statistics knowledge (knowing how to find the covariance matrix), and (b) some linear algebra knowledge (knowing what eigenvalues and eigenvectors are, and basic matrix manipulation).

Covariance

I stated above that in the rotated v1-v2 coordinate system, the data on the v1 axis was not correlated with the data on the v2 axis.

Intuitively, when you have a 2-D Gaussian distribution, the data looks like an ellipse. If that ellipse is perpendicular to the x-y grid, then the x and y components are independent and uncorrelated.

The image below shows what happens with different values of the correlation between \( X_i \) and \( X_j \) (called ‘c’ in the image):

ellipses

(If you really know your statistics, then you will recall that independence implies 0 correlation, but 0 correlation does not imply independence, unless the distribution is a joint Gaussian.)

So, in the first image, since the ellipse is not perpendicular to the x-y axis, the distributions p(x) and p(y) are not independent. But in the rotated v1-v2 axis, the distributions p(v1) and p(v2) are independent and uncorrelated.

If this all sounds crazy, just remember this one fact: if 2 variables \( X_i \) and \( X_j \) are uncorrelated, their corresponding element in the covariance matrix, \( \Sigma_{ij} = 0 \). And since the covariance matrix is symmetric (for our purposes), \( \Sigma_{ji} = 0 \) also.

For every variable to be uncorrelated to every other variable, all the elements of \( \Sigma \) will be 0, except those along the diagonal, \( \Sigma_{ii} = \sigma_i^2 \), which is just the regular variance for each dimension.

How do we calculate the covariance matrix from the data? Simple!

$$ \Sigma_X = \frac{1}{N} (X – \mu_X)^T(X – \mu_X) \tag{1}$$

In non-matrix form this is:

$$ \Sigma_{ij} = \frac{1}{N} \sum_{n=1}^{N} (x(n)_i – \mu_i)(x(n)_j – \mu_j) $$

I’ve labeled the equation because we’re going to return to it. You can verify that \( \Sigma \) is \( D \times D \) since those are the outer dimensions of the matrix multiply.

Note the mathematical sleight of hand I used above. \( X \) is \( N \times D \) and the mean \( \mu_X \) is \( 1 \times D \), so technically you can’t subtract them under the rules of matrix algebra, but let’s assume that the above notation means each row of \( X \) has \( \mu_X \) subtracted from it.

Eigenvalues and Eigenvectors

This is where the real magic of PCA happens.

For no particular reason at all, suppose we compute the eigenvalues and eigenvectors of \( \Sigma_X \).

If you don’t know what eigenvalues/eigenvectors are: remember that usually, when we multiply a vector by a matrix, it changes the direction of the vector.

So for example, the vector \( (1,2) \) is pointing in the same direction as \( (2,4) = 2(1,2) \). But if we were to multiply \( (1,2) \) by a matrix:

$$ \begin{pmatrix} 5 \\ 11 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} $$

The result is a vector in a different direction. Eigenvalues \( \lambda \) and eigenvectors \( v \) have the property that, if you multiply \( \Sigma_Xv \), this is the same by multiplying the eigenvector by a constant – the eigenvalue, i.e. the eigenvector does not change direction, it only gets shorter or longer by a factor of \( \lambda \). In other words:

$$ \Sigma_Xv = \lambda v $$

How many pairs of these eigenvectors and eigenvalues can we find? It just so happens that there are \( D \) such pairs. We’ll say, each eigenvalue \( \lambda_i \) has a corresponding eigenvector \( v_i \).

Now again, for no particular reason at all, we are going to combine all the \( v_i \) into a matrix, let’s call it \( V \).

We could line up the \( v_i \) in any random order, but instead we are going to choose a very particular order – we are going to line them up such that the corresponding eigenvalues \( \lambda_i \) are in descending order. In other words:

$$ \lambda_1 > \lambda_2 > \lambda_3 > … > \lambda_D $$

So:

$$ V = \begin{pmatrix}
| & | & | & … & | \\
v_1 & v_2 & v_3 & … & v_D \\
| & | & | & … & |
\end{pmatrix}
$$

We are also going to combine the eigenvalues into a diagonal matrix like this:

$$ \Lambda = \begin{pmatrix}
\lambda_1 & 0 & … & 0 \\
0 & \lambda_2 & … & 0 \\
… & … & … & … \\
0 & 0 & … & \lambda_D
\end{pmatrix}
$$

As a sidenote, don’t worry about how we find the eigenvalues and eigenvectors. That is a huge problem in and of itself. The method you used in high school – solving a polynomial to get the eigenvalues, plugging the eigenvalues into the eigenvalue equation to get the eigenvectors, etc. doesn’t really translate to computer code. There exist some very involved algorithms to solve this problem, but we’ll just use numpy.

In any case, it is easy to prove to yourself, with a simple 2-D example, that:

$$ \Sigma_X V = V \Lambda \tag{2}$$

Just one more ingredient. We want to show that \( V \) is orthonormal. In terms of the individual eigenvectors this means that \( {v_i}^T v_j = 0 \) if \( i \ne j \), and \( {v_i}^T v_i = 1 \). In matrix form this means \( V^T V = I \), or in other words, \( V^T = V^{-1} \).

Proof of orthogonality: Consider \( \lambda_j {v_i}^T v_j \) for \( i \ne j \). \( \lambda_j {v_i}^T v_j = {v_i}^T (\lambda_j v_j) = {v_i}^T \Sigma_X v_j = (\Sigma_X^T v_i)^T v_j \). But we know that \( \Sigma_X \) is symmetric so this is \( (\Sigma_X v_i)^T v_j = (\lambda_i v_i)^T v_j = \lambda_i {v_i}^T v_j \). Since \( \lambda_i \ne \lambda_j \), this can only be true if \( {v_i}^T v_j = 0 \).

Normalizing eigenvectors is easy since they are not unique – just choose values so that their length is 1.

Finally, using \( (1) \), consider the covariance of the transformed data, \( Z \).

$$ \begin{align*}
\Sigma_Z &= \frac{1}{N} (Z – \mu_Z)^T (Z – \mu_Z) \\
&= \frac{1}{N} (XQ – \mu_XQ)^T (XQ – \mu_XQ) \\
&= \frac{1}{N} Q^T(X – \mu_X)^T (X – \mu_X)Q \\
&= Q^T \Sigma_X Q
\end{align*}
$$

Now, suppose we choose \( Q = V \). Then, by plugging in \( (2) \), we get:

$$ \begin{align*}
\Sigma_Z &= V^T \Sigma_X V \\
&= V^T V \Lambda \\
&= \Lambda
\end{align*}
$$

So what does this tell us? Since \( \Lambda \) is a diagonal matrix, there are no correlations in the transformed data. The variance of each dimension of \( Z \) is equal to the eigenvalues. In addition, because we sorted the eigenvalues in descending order, the first dimension of \( Z \) has the most variance, the second dimension has the second-most variance, etc. So most of the information is kept in the leading columns, as promised.

You can verify this in Python:

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA()
X = np.random.random((100,10)) # generate an N = 100, D = 10 random data matrix
Z = pca.fit_transform(X)

# visualize the covariance of Z
plt.imshow(np.cov(Z.T))
plt.show()

covariance of output of PCA

Observe that the off-diagonals are 0 and that the variances are in decreasing order.

You can also confirm that \( Q^T Q = I \).

QTQ = pca.components_.T.dot(pca.components_)
plt.imshow(QTQ)
plt.show()

print np.around(QTQ, decimals=2)

qtq

Remember to leave a comment below if you have any questions on the above material. If you want to go more in depth on this and other data science topics (both with the math and the code), check out some of my data science video courses online.

#anomaly detection #covariance #dimensionality reduction #eigenvalues #eigenvectors #linear algebra #machine learning #pca #principal components analysis #statistics

Go to comments