Covariance Matrix: Divide by N or N-1?

This is a statistics post. It’s probably very boring. I am posting it for my own reference, because I seem to forget how this is derived every time I need it.

Sometimes you see the sample variance defined as:

$$ \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (X_n - \hat{\mu})^2 $$

But you might also see it defined as:

$$ \hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (X_n - \hat{\mu})^2 $$

where, as usual, the "hat" symbol denotes our estimate of the quantity underneath it.

Why do statisticians sometimes divide by N, and sometimes divide by N-1?

The same question arises for the calculation of the sample covariance matrix, and this is what we will work with in this post.
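Before getting into the math, here is a small sketch (assuming NumPy) of the two conventions in code. Note that `np.var` divides by N by default, while `np.cov` divides by N-1 by default; both let you switch via `ddof` or `bias`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10)

# Divide by N (the maximum likelihood estimate).
var_N = np.sum((x - x.mean()) ** 2) / len(x)
print(np.isclose(var_N, np.var(x)))            # np.var defaults to ddof=0

# Divide by N-1 (the unbiased estimate).
var_N1 = np.sum((x - x.mean()) ** 2) / (len(x) - 1)
print(np.isclose(var_N1, np.var(x, ddof=1)))

# np.cov does the opposite by default: it divides by N-1 unless bias=True.
X = rng.normal(size=(100, 3))
S_unbiased = np.cov(X, rowvar=False)
S_biased = np.cov(X, rowvar=False, bias=True)
print(np.allclose(S_biased * 100 / 99, S_unbiased))
```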


This has to do with whether you want your estimate to be a biased estimate or an unbiased estimate.

For any parameter \( \theta \), our estimate \( \hat{ \theta } \) is unbiased if:

$$ E\{ \hat{ \theta } - \theta \} = 0 $$

In this tutorial we will calculate the bias of the sample covariance on the multivariate Gaussian, which is defined as:

$$ p(x | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) $$
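As a quick sanity check on this density, here is a minimal sketch (assuming NumPy and SciPy) that evaluates the formula directly and compares it against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density at a single point x."""
    D = len(mu)
    diff = x - mu
    # Use solve() instead of forming an explicit inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, -1.0])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```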

The maximum likelihood estimates of \( \mu \) and \( \Sigma \) can be found by taking the derivative of the log-likelihood and setting it to 0.

The likelihood is:

$$ p(X | \mu, \Sigma) = \prod_{n=1}^{N} p(x_n | \mu, \Sigma) $$

So the log-likelihood (expanded) is:

$$ L(X | \mu, \Sigma) = -\frac{ND}{2} \log(2\pi) - \frac{N}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) $$
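Here is a short sketch (again assuming NumPy and SciPy) that evaluates this expanded log-likelihood on a dataset and checks it against summing SciPy's `logpdf` over the samples:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, Sigma):
    """Expanded log-likelihood: -ND/2 log(2 pi) - N/2 log|Sigma| - 1/2 sum of quadratic forms."""
    N, D = X.shape
    diffs = X - mu
    # Quadratic form (x_n - mu)^T Sigma^{-1} (x_n - mu) for every sample.
    quad = np.einsum('nd,nd->n', diffs @ np.linalg.inv(Sigma), diffs)
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * N * D * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad.sum()

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=500)

print(log_likelihood(X, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum())  # should match
```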

To take “vector derivatives” and “matrix derivatives” you’ll want to consult the Matrix Cookbook.

If you've made it this far, it is almost trivial to set \( \frac{ \partial L }{ \partial \mu } = 0 \) and solve to get the usual result:

$$ \hat{ \mu } = \frac{1}{N} \sum_{n=1}^{N} x_n $$

To get the sample covariance, we calculate:

$$ \frac{ \partial L }{ \partial \Sigma } = -\frac{N}{2} (\Sigma^{-1})^T - \frac{1}{2} \sum_{n=1}^{N} \left[ -\Sigma^{-T} (x_n - \mu) (x_n - \mu)^T \Sigma^{-T} \right] $$

Set that to 0 and solve for \( \Sigma \) to get:

$$ \hat{ \Sigma } = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu}) (x_n - \hat{\mu})^T $$
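In code this estimator is just an average of outer products. Here is a sketch (NumPy assumed) that also confirms it matches `np.cov` with `bias=True`:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 5.0], [[4.0, 1.0], [1.0, 2.0]], size=1000)
N = X.shape[0]

mu_hat = X.mean(axis=0)                  # maximum likelihood mean
diffs = X - mu_hat
Sigma_hat = (diffs.T @ diffs) / N        # divide by N: the ML estimate

print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))       # True
print(np.allclose(Sigma_hat * N / (N - 1), np.cov(X, rowvar=False)))    # True
```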

Note that we are assuming we don’t have the “true mean”, so we are estimating the mean using the maximum likelihood estimate before calculating the maximum likelihood estimate for the covariance.

Now we will show that \( E\{ \hat{ \Sigma } \}  \neq \Sigma \). By definition:

$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} (x_n - \hat{\mu})(x_n - \hat{\mu})^T \} $$

Expand:

$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} ((x_n - \mu) - (\hat{\mu} - \mu))((x_n - \mu) - (\hat{\mu} - \mu))^T \} $$

Expand that:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \frac{1}{N} E\{ \sum_{n} (\hat{\mu} - \mu)(\hat{\mu} - \mu)^T - (x_n - \mu)(\hat{\mu} - \mu)^T - (\hat{\mu} - \mu)(x_n - \mu)^T \} $$

Multiply out the terms:

$$ E\{ \hat{ \Sigma } \} = \Sigma + E\{ \hat{\mu}\hat{\mu}^T \} - \frac{1}{N} \sum_{n} E\{ x_n\hat{\mu}^T \} - \frac{1}{N} \sum_{n} E\{ \hat{\mu} x_n^T \} + \mu\mu^T $$

Since \( \hat{\mu} = \frac{1}{N} \sum_{n} x_n \), both \( \frac{1}{N} \sum_{n} E\{ x_n \hat{\mu}^T \} \) and \( \frac{1}{N} \sum_{n} E\{ \hat{\mu} x_n^T \} \) are just \( E\{ \hat{\mu}\hat{\mu}^T \} \), so the expected values combine to give:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T - E\{ \hat{\mu}\hat{\mu}^T \} $$

Now the exercise becomes finding the expected value on the right side.

We need some identities.

First, for \( m \neq n \):

$$ E\{ x_m x_n^T \} = E\{ x_m \} E\{ x_n^T \} =  \mu\mu^T $$

This holds because the samples are IID (independent and identically distributed), so \( x_m \) and \( x_n \) are independent whenever \( m \neq n \).

Next, the definition of covariance:

$$ \Sigma = E\{ (x - \mu)(x - \mu)^T \} = E\{ xx^T \} - \mu\mu^T $$

We can rearrange this to get:

$$ E\{ xx^T \} = \Sigma + \mu\mu^T $$
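Both identities are easy to spot-check by simulation. Here is a quick sketch (NumPy assumed) in which the sample averages should approach \( \mu\mu^T \) and \( \Sigma + \mu\mu^T \) respectively:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
trials = 200_000

# Draw pairs (x_m, x_n) of independent samples.
xm = rng.multivariate_normal(mu, Sigma, size=trials)
xn = rng.multivariate_normal(mu, Sigma, size=trials)

E_xm_xn = np.einsum('ti,tj->ij', xm, xn) / trials   # approx mu mu^T
E_xx = np.einsum('ti,tj->ij', xm, xm) / trials      # approx Sigma + mu mu^T

print(E_xm_xn, np.outer(mu, mu), sep="\n")
print(E_xx, Sigma + np.outer(mu, mu), sep="\n")
```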

The term \( E\{\hat{\mu}\hat{\mu}^T \} \) can be expanded as:

$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N^2} E\{ (x_1 + x_2 + \dots + x_N)(x_1 + x_2 + \dots + x_N)^T \} $$

When expanding the multiplication, there are \( N \) "diagonal" terms of the form \( x_n x_n^T \), which contribute \( N E\{ x_n x_n^T \} = N(\Sigma + \mu\mu^T) \). There are \( N(N-1) \) cross terms with \( m \neq n \), and since those samples are independent they contribute \( N(N-1)\mu\mu^T \).

So in total:

$$ E\{\hat{\mu}\hat{\mu}^T \} =\frac{1}{N^2}(N(\Sigma + \mu\mu^T) + N(N-1)\mu\mu^T) $$

Or:

$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N}\Sigma + \mu\mu^T $$
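This identity can also be checked numerically. Here is a sketch (NumPy assumed) that averages \( \hat{\mu}\hat{\mu}^T \) over many datasets of size N and compares it to \( \frac{1}{N}\Sigma + \mu\mu^T \):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
N, trials = 10, 100_000

# For each trial, draw N samples and form the sample mean.
X = rng.multivariate_normal(mu, Sigma, size=(trials, N))
mu_hats = X.mean(axis=1)                                  # shape (trials, D)

E_mumu = np.einsum('ti,tj->ij', mu_hats, mu_hats) / trials
print(E_mumu)
print(Sigma / N + np.outer(mu, mu))   # should be close
```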

Plugging this back into the expression for the bias:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T - \frac{1}{N}\Sigma - \mu\mu^T $$

Or:

$$ E\{ \hat{ \Sigma } \} = \frac{N-1}{N} \Sigma \neq \Sigma $$

So, if you want an unbiased estimator, multiply the biased maximum likelihood estimate by \( \frac{N}{N-1} \), which gives the familiar formula that divides by \( N-1 \).
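Finally, here is a Monte Carlo sketch (NumPy assumed) of the whole result: averaging the divide-by-N estimator over many small datasets recovers roughly \( \frac{N-1}{N} \Sigma \), and the \( \frac{N}{N-1} \) correction removes the bias.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
N, trials = 5, 200_000

# Draw `trials` independent datasets of size N and form the divide-by-N estimate for each.
X = rng.multivariate_normal(mu, Sigma, size=(trials, N))
d = X - X.mean(axis=1, keepdims=True)
Sigma_hats = np.einsum('tni,tnj->tij', d, d) / N

E_Sigma_hat = Sigma_hats.mean(axis=0)
print(E_Sigma_hat)                       # approximately (N-1)/N * Sigma
print((N - 1) / N * Sigma)
print(E_Sigma_hat * N / (N - 1))         # approximately Sigma after the correction
```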