July 3, 2013

Why the variance estimator is biased when it uses n as the denominator

Given \( X_1, \dots, X_n \), a set of independent and identically distributed random variables with fixed, finite and unknown mean \( \mu \) and variance \(\sigma^2\), a common problem is to estimate the population variance \(\sigma^2\). One well-known estimator averages the squared deviations from the sample mean, using \(n\) as denominator. Its expected value, which we will denote by \(S_n^2\), is given by the following formula.
$$ S_n^2 = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X_n})^2\right] $$
where \(\bar{X_n} = \frac{1}{n}\sum_{i=1}^{n}X_i \) is the sample mean, the usual estimator of the population mean.
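To make the quantity inside the expectation concrete, here is a minimal Python sketch that computes it for one observed sample. The helper name `variance_n` and the sample values are assumptions chosen for illustration; the standard library function `statistics.pvariance` uses the same \(n\) denominator and serves as a cross-check.

```python
from statistics import pvariance


def variance_n(sample):
    """Average squared deviation from the sample mean (denominator n)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


# Hypothetical sample, used only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(variance_n(sample))
print(pvariance(sample))  # same denominator n, so the two values should agree
```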

It is a well-known fact that this estimator of the population variance is biased when it uses \(n\) as denominator. By biased, we mean that its expected value \(S_n^2\) does not equal the theoretical value of the population variance \(\sigma^2\). This fact is stated in almost all statistics books, and it is the reason why this estimator is not considered THE population variance estimator. Even so, I have seen no complete and clear proof of this fact. Wikipedia has two proofs, but neither convinced me enough to accept it, and this problem bothered me for quite a few weeks.

Finally I arrived at a simpler (in my humble opinion) proof that this population variance estimator is biased.

To proceed with the proof we first note that, for a random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\), the expected value of \(X^2\) equals the squared mean plus the variance. The cross term below vanishes because \(E(X-\mu) = 0\).
$$ EX^2 = E(X-\mu+\mu)^2 = E(X-\mu)^2 + 2\mu E(X-\mu) + E\mu^2 = \sigma^2+\mu^2 $$
The second result we need is that the expected value of the product of two independent random variables equals the product of their expected values. $$ EXY = EX \cdot EY $$
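Both preliminary results can be checked exactly on a small example. The sketch below assumes a fair six-sided die as the distribution (an illustrative choice, not part of the proof) and uses exact fractions, so the comparisons involve no rounding.

```python
from fractions import Fraction
from itertools import product

# Assumed example distribution: a fair six-sided die.
values = range(1, 7)
p = Fraction(1, 6)

mu = sum(p * x for x in values)                   # mean of one roll
sigma2 = sum(p * (x - mu) ** 2 for x in values)   # variance of one roll
e_x2 = sum(p * x * x for x in values)             # E[X^2]

# First result: E[X^2] = sigma^2 + mu^2
print(e_x2, sigma2 + mu ** 2)

# Second result: for two independent rolls X and Y, E[XY] = E[X] * E[Y]
e_xy = sum(p * p * x * y for x, y in product(values, repeat=2))
print(e_xy, mu * mu)
```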

Now we expand the definition of \(S_n^2\).

$$ S_n^2 = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X_n})^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i^2 - 2X_i\bar{X_n} + \bar{X_n}^2)\right] = \frac{1}{n}\sum_{i=1}^{n}EX_i^2 - \frac{2}{n}\sum_{i=1}^{n}EX_i\bar{X_n} + \frac{1}{n}\sum_{i=1}^{n}E\bar{X_n}^2 $$

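We now use the preliminary results. The first one gives \(EX_i^2 = \mu^2 + \sigma^2\) for each \(i\). Applied to \(\bar{X_n}\), which has mean \(\mu\) and, by independence, variance \(\frac{\sigma^2}{n}\), it also gives \(E\bar{X_n}^2 = \mu^2 + \frac{\sigma^2}{n}\). Substituting these into the previous expression, we have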

$$ = \mu^2 + \sigma^2 - \frac{2}{n}\sum_{i=1}^{n}EX_i\left(\frac{1}{n}\sum_{j=1}^{n}X_j\right) + \mu^2 + \frac{\sigma^2}{n} = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}EX_iX_j $$

In the last equation we study the double sum of products \(\sum_{i=1}^{n}\sum_{j=1}^{n}EX_iX_j\). This expression has \(n^2\) terms. Among them there are \(n\) terms with equal indices (\(i = j\)), to which we apply the first preliminary result: \(EX_i^2 = \mu^2 + \sigma^2\). The remaining \(n^2-n\) terms have different indices, and there we apply the second preliminary result: \(EX_iX_j = EX_i \cdot EX_j = \mu^2\). Thus we can develop the equation of \(S_n^2\) further.

$$ S_n^2 = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\left[n(\mu^2+\sigma^2) + (n^2-n)\mu^2\right]
= 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\left[n\mu^2+n\sigma^2 + n^2\mu^2 - n\mu^2\right] $$
$$ S_n^2 = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - 2\mu^2 - \frac{2\sigma^2}{n} = \sigma^2 - \frac{\sigma^2}{n} $$
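The counting argument for the double sum \(\sum_{i}\sum_{j}EX_iX_j\) can be verified by brute force. The following sketch assumes three independent rolls of a fair die (an illustrative choice), enumerates all equally likely outcomes, and compares the exact double sum with \(n(\mu^2+\sigma^2) + (n^2-n)\mu^2\).

```python
from fractions import Fraction
from itertools import product

# Assumed example: n = 3 independent rolls of a fair six-sided die.
values = range(1, 7)
n = 3
outcomes = list(product(values, repeat=n))   # all 6**n equally likely samples
prob = Fraction(1, len(outcomes))

mu = Fraction(sum(values), len(values))
sigma2 = sum((x - mu) ** 2 for x in values) / len(values)

# Exact value of sum_i sum_j E[X_i X_j], term by term.
double_sum = sum(
    prob * xs[i] * xs[j]
    for xs in outcomes
    for i in range(n)
    for j in range(n)
)

# n terms with i == j, and n^2 - n terms with i != j.
predicted = n * (mu ** 2 + sigma2) + (n * n - n) * mu ** 2

print(double_sum, predicted)
```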

Finally, we find that the estimator is biased, i.e. its expected value is not equal to the theoretical population variance \(\sigma^2\).
$$ S_n^2 = \sigma^2\frac{n-1}{n} \leq \sigma^2 $$
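The final result can also be confirmed exactly on a small case. This sketch assumes a fair six-sided die as the distribution and a hypothetical helper `expected_variance_n`; it averages the \(n\)-denominator estimator over every possible sample of size \(n\) and compares the result with \(\sigma^2\frac{n-1}{n}\).

```python
from fractions import Fraction
from itertools import product


def expected_variance_n(values, n):
    """Exact expected value of the n-denominator variance estimator
    for n independent draws, each uniform over `values`."""
    outcomes = list(product(values, repeat=n))
    total = Fraction(0)
    for xs in outcomes:
        mean = Fraction(sum(xs), n)
        total += sum((x - mean) ** 2 for x in xs) / n
    return total / len(outcomes)


# Assumed example distribution: a fair six-sided die.
values = range(1, 7)
mu = Fraction(sum(values), len(values))
sigma2 = sum((x - mu) ** 2 for x in values) / len(values)

for n in (2, 3, 4):
    # Left: exact expectation by enumeration. Right: sigma^2 * (n - 1) / n.
    print(n, expected_variance_n(values, n), sigma2 * (n - 1) / n)
```

Because the arithmetic is done with fractions, the two columns can be compared for exact equality rather than up to sampling noise, which a random simulation would not allow.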