Broken thoughts: July 2013

Given $ X_1, \ldots, X_n $, a set of independent and identically distributed random variables with fixed, finite but unknown mean $ \mu $ and variance $\sigma^2$, a common problem is to estimate the population variance $\sigma^2$. One well-known estimator, which we will denote by $S_n^2$, is defined by the following formula.
$$ S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X_n})^2 $$
where $\bar{X_n} = \frac{1}{n}\sum_{i=1}^{n}X_i $ is the sample mean, the usual estimator of the population mean. What interests us is the expected value $E[S_n^2]$ of this estimator.
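As a small illustration of the formula, here is a minimal Python sketch of the estimator with $n$ in the denominator. The helper name `variance_n` and the sample values are made up for this example, and the code assumes integer data so that the arithmetic stays exact.

```python
from fractions import Fraction


def variance_n(xs):
    """Variance estimator with n in the denominator (hypothetical helper name).

    Assumes xs is a non-empty list of integers so the result is an exact fraction.
    """
    n = len(xs)
    mean = Fraction(sum(xs), n)  # sample mean, kept exact
    return sum((x - mean) ** 2 for x in xs) / n


# Illustrative sample, not taken from any real data set.
sample = [2, 4, 4, 6]
print(variance_n(sample))
```

For this sample the mean is 4 and the squared deviations are 4, 0, 0, 4, so the estimate is 8/4 = 2.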

It is a well-known fact that the estimator $S_n^2$ of the population variance is biased when it uses $n$ as the denominator. By biased, we mean that its expected value does not equal the true population variance $\sigma^2$. This fact is stated in almost all statistics books, and it is the reason why this estimator is not considered THE population variance estimator. Even so, I have not seen a complete and clear proof of it. Wikipedia gives two proofs, but neither convinced me enough to accept the result, and the problem bothered me for quite a few weeks.

Finally I arrived at a simpler (in my humble opinion) proof of that fact: this population variance estimator is biased.

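To proceed with the proof, we first need the fact that, for a given random variable $X$ with mean $\mu$ and variance $\sigma^2$, the expected value of $X^2$ equals the squared mean plus the variance. The middle term below vanishes because $E(X-\mu) = 0$.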
$$ EX^2 = E(X-\mu+\mu)^2 = E(X-\mu)^2 + 2\mu E(X-\mu) + E\mu^2 = \sigma^2+\mu^2 $$
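A quick exact check of this identity can be sketched in Python. The choice of a fair six-sided die for $X$ is an assumption made only for illustration; any distribution with finite variance would do.

```python
from fractions import Fraction

# Assumption for illustration: X is a fair six-sided die.
faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face

mu = sum(p * x for x in faces)                   # E[X]
sigma2 = sum(p * (x - mu) ** 2 for x in faces)   # E[(X - mu)^2]
ex2 = sum(p * x * x for x in faces)              # E[X^2]

print(mu, sigma2, ex2)        # mean, variance, second moment
print(ex2 == sigma2 + mu**2)  # the identity E[X^2] = sigma^2 + mu^2
```

Here $\mu = 7/2$, $\sigma^2 = 35/12$ and $EX^2 = 91/6$, and indeed $35/12 + 49/4 = 91/6$.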
The second fact we need is that the expected value of the product of two independent random variables equals the product of their expected values.
$$ EXY = EXEY $$
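The product rule can be checked exactly in the same spirit. The sketch below assumes, purely for illustration, that $X$ and $Y$ are two independent fair dice, so that each of the 36 pairs is equally likely.

```python
from fractions import Fraction
from itertools import product

# Assumption for illustration: X and Y are independent fair six-sided dice.
faces = range(1, 7)
p_pair = Fraction(1, 36)  # independence makes every (x, y) pair equally likely

e_xy = sum(p_pair * x * y for x, y in product(faces, repeat=2))  # E[XY]
e_x = Fraction(sum(faces), 6)                                    # E[X] = E[Y]

print(e_xy, e_x * e_x)
print(e_xy == e_x * e_x)
```

Both sides come out as $49/4$. Note that the rule fails in general without independence: taking $Y = X$ gives $EX^2 = \sigma^2 + \mu^2$, not $\mu^2$.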

Now we expand the expected value of $S_n^2$, using the linearity of expectation.

$$ E[S_n^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X_n})^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i^2 - 2X_i\bar{X_n} + \bar{X_n}^2)\right] = \frac{1}{n}\sum_{i=1}^{n}EX_i^2 - \frac{2}{n}\sum_{i=1}^{n}EX_i\bar{X_n} + \frac{1}{n}\sum_{i=1}^{n}E\bar{X_n}^2 $$

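We now use the preliminary results. The first one gives $EX_i^2 = \mu^2 + \sigma^2$. It also applies to $\bar{X_n}$, which has mean $\mu$ and variance $\frac{\sigma^2}{n}$ (the variance of a sum of independent variables is the sum of their variances), so $E\bar{X_n}^2 = \mu^2 + \frac{\sigma^2}{n}$. Hence

$$ E[S_n^2] = \mu^2 + \sigma^2 - \frac{2}{n}\sum_{i=1}^{n}EX_i\left(\frac{1}{n}\sum_{j=1}^{n}X_j\right) + \mu^2 + \frac{\sigma^2}{n} = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}EX_iX_j $$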

In the last equation we study the double sum of products $\sum_{i=1}^{n}\sum_{j=1}^{n}EX_iX_j$. This expression has $n^2$ terms. Among them, there are $n$ terms with equal indices ($i = j$), to which we apply the first preliminary result, so each equals $\mu^2+\sigma^2$. The remaining $n^2-n$ terms have different indices ($i \neq j$), so the second preliminary result applies and each equals $\mu^2$. Thus we can develop the expression further.
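To make the counting concrete, take $n = 2$ (a value chosen only for illustration) and write the double sum out in full:

$$ \sum_{i=1}^{2}\sum_{j=1}^{2}EX_iX_j = EX_1^2 + EX_1X_2 + EX_2X_1 + EX_2^2 = 2(\mu^2+\sigma^2) + 2\mu^2 $$

There are $2$ terms with equal indices and $2^2 - 2 = 2$ terms with different indices, which agrees with the general count $n(\mu^2+\sigma^2) + (n^2-n)\mu^2$ at $n = 2$.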

$$ E[S_n^2] = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\left[n(\mu^2+\sigma^2) + (n^2-n)\mu^2\right]
= 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\left[n\mu^2+n\sigma^2 + n^2\mu^2 - n\mu^2\right] $$
$$ E[S_n^2] = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - 2\mu^2 - \frac{2\sigma^2}{n} = \sigma^2 - \frac{\sigma^2}{n} $$

Finally, we find that the estimator is biased, i.e. its expected value is not equal to the true population variance $\sigma^2$ (whenever $\sigma^2 > 0$).
$$ E[S_n^2] = \sigma^2\frac{n-1}{n} \leq \sigma^2 $$
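The result can be verified exactly for small samples by enumerating every possible outcome. The sketch below is only an illustration under stated assumptions: the $X_i$ are taken to be independent fair six-sided dice (so $\sigma^2 = 35/12$), and the helper names `variance_n` and `expected_variance_n` are invented for this example.

```python
from fractions import Fraction
from itertools import product


def variance_n(xs):
    """Variance estimator with n in the denominator (hypothetical helper name)."""
    n = len(xs)
    mean = Fraction(sum(xs), n)  # exact sample mean for integer data
    return sum((x - mean) ** 2 for x in xs) / n


def expected_variance_n(n, faces=range(1, 7)):
    """Exact expected value of the estimator for n i.i.d. fair dice.

    Every outcome (x_1, ..., x_n) is equally likely, so the expected value
    is the plain average of the estimator over all len(faces)**n outcomes.
    """
    outcomes = list(product(list(faces), repeat=n))
    return sum(variance_n(xs) for xs in outcomes) / len(outcomes)


sigma2 = Fraction(35, 12)  # variance of a single fair die (assumed distribution)
for n in (2, 3, 4):
    # Left: exact expected value by enumeration. Right: sigma^2 * (n - 1) / n.
    print(n, expected_variance_n(n), sigma2 * (n - 1) / n)
```

For $n = 2, 3, 4$ both columns give $35/24$, $35/18$ and $35/16$, each strictly below $\sigma^2 = 35/12$. Exact fractions are used instead of a random simulation so that the agreement is an equality and not an approximation.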

July 3, 2013

Why the variance estimator is biased when it uses n as the denominator.