July 3, 2013

Why the variance estimator is biased when it uses n as the denominator

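Given $X_1, \dots, X_n$, a set of independent and identically distributed random variables with a finite but unknown mean $\mu$ and variance $\sigma^2$, a common problem is to estimate the population variance $\sigma^2$. One well-known estimator for it, which we will denote $S_n^2$, is defined by the following formula.
$$S_n^2 = E\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2\right]$$
where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the population mean estimator. Strictly speaking, the estimator is the random quantity inside the brackets; throughout this post $S_n^2$ is written with the expectation already applied, since the expected value is what the proof computes.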

It is a well-known fact that the estimator for the population variance $S_n^2$ is biased when it uses $n$ as the denominator. By biased, we mean that its expected value does not equal the theoretical value of the population variance $\sigma^2$. This fact is stated in almost all statistics books, and it is the reason why this estimator is not considered THE population variance estimator. Even so, I have not seen a complete and clear proof of it. Wikipedia has two proofs, but neither convinced me, and this problem bothered me for quite a few weeks.

Finally I arrived at a simpler (in my humble opinion) proof that this population variance estimator is biased.

To proceed with the proof, we first note that for a given random variable $X$ with mean $\mu$ and variance $\sigma^2$, the expected value of $X^2$ equals the squared mean plus the variance.
$$E X^2 = E(X-\mu+\mu)^2 = E(X-\mu)^2 + 2\mu E(X-\mu) + E\mu^2 = \sigma^2 + \mu^2$$
The second fact we need is that the expected value of the product of two independent random variables equals the product of their expected values: $E XY = E X \cdot E Y$.
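Both preliminary results can be checked exactly on a small example. The sketch below assumes a fair six-sided die as the distribution (my choice for illustration, not part of the proof) and uses exact rational arithmetic, so the comparisons are equalities rather than approximations. The helper name `expectation` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed example distribution: a fair six-sided die.
faces = [Fraction(k) for k in range(1, 7)]

def expectation(f, outcomes):
    """Exact expectation of f over equally likely outcomes."""
    outcomes = list(outcomes)
    return sum(f(o) for o in outcomes) / len(outcomes)

mu = expectation(lambda x: x, faces)
sigma2 = expectation(lambda x: (x - mu) ** 2, faces)

# First result: E[X^2] = mu^2 + sigma^2
second_moment = expectation(lambda x: x * x, faces)
print(second_moment == mu ** 2 + sigma2)

# Second result: E[XY] = E[X] * E[Y] for two independent rolls
pairs = product(faces, repeat=2)
product_moment = expectation(lambda xy: xy[0] * xy[1], pairs)
print(product_moment == mu * mu)
```

Enumerating all 36 pairs treats the two rolls as independent by construction, which is exactly the condition the second result needs.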

Now we proceed to expand the definition of $S_n^2$.

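$$S_n^2 = E\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_i^2 - 2X_i\bar{X}_n + \bar{X}_n^2\right)\right] = \frac{1}{n}\sum_{i=1}^{n} E X_i^2 - \frac{2}{n}\sum_{i=1}^{n} E X_i\bar{X}_n + \frac{1}{n}\sum_{i=1}^{n} E\bar{X}_n^2$$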

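Using the preliminary results, together with the fact that $\bar{X}_n$ has mean $\mu$ and variance $\sigma^2/n$ (so that $E\bar{X}_n^2 = \mu^2 + \sigma^2/n$), we have

$$S_n^2 = \mu^2 + \sigma^2 - \frac{2}{n}\sum_{i=1}^{n} E\left[X_i\left(\frac{1}{n}\sum_{j=1}^{n} X_j\right)\right] + \mu^2 + \frac{\sigma^2}{n} = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E X_i X_j$$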

In the last equation we study the double sum of products $\sum_{i=1}^{n}\sum_{j=1}^{n} E X_i X_j$. This expression has $n^2$ terms. Among them, there are $n$ terms with equal indexes, to which we apply the first preliminary result, so each equals $\mu^2 + \sigma^2$. The remaining $n^2 - n$ terms have different indexes, so we can apply the second preliminary result, and each equals $\mu^2$. Thus we can develop the equation for $S_n^2$ further.

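$$S_n^2 = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\left[n(\mu^2 + \sigma^2) + (n^2 - n)\mu^2\right] = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n^2}\left[n\mu^2 + n\sigma^2 + n^2\mu^2 - n\mu^2\right]$$
$$S_n^2 = 2\mu^2 + \sigma^2 + \frac{\sigma^2}{n} - 2\mu^2 - \frac{2\sigma^2}{n} = \sigma^2 - \frac{\sigma^2}{n}$$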
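The term count used for the double sum can be checked by hand in the smallest non-trivial case. Taking $n = 2$ (a value chosen here only for illustration), the sum has $n^2 = 4$ terms, of which $n = 2$ have equal indexes and $n^2 - n = 2$ have different indexes:

$$\sum_{i=1}^{2}\sum_{j=1}^{2} E X_i X_j = E X_1^2 + E X_1 X_2 + E X_2 X_1 + E X_2^2 = 2(\mu^2 + \sigma^2) + 2\mu^2$$

This agrees with the general expression $n(\mu^2 + \sigma^2) + (n^2 - n)\mu^2$ evaluated at $n = 2$.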

Finally, we find that the estimator is biased, i.e. its expected value is not equal to the theoretical population variance $\sigma^2$.
$$S_n^2 = \frac{n-1}{n}\sigma^2$$
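The factor $\frac{n-1}{n}$ can also be confirmed exactly on a small example. The sketch below again assumes a fair six-sided die (an illustrative choice, not part of the proof), enumerates every possible sample of size $n$, and averages the estimator with $n$ in the denominator over all of them. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def population_variance(values):
    """Exact variance of a uniform distribution over `values`."""
    values = [Fraction(v) for v in values]
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def expected_biased_variance(values, n):
    """Exact E[(1/n) * sum((X_i - mean)^2)] for n i.i.d. uniform draws from `values`."""
    values = [Fraction(v) for v in values]
    total = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):   # every equally likely sample
        mean = sum(sample) / n
        total += sum((x - mean) ** 2 for x in sample) / n
        count += 1
    return total / count

die = range(1, 7)                      # assumed example distribution
sigma2 = population_variance(die)
for n in (2, 3, 4):
    lhs = expected_biased_variance(die, n)
    rhs = Fraction(n - 1, n) * sigma2
    print(n, lhs, rhs, lhs == rhs)
```

Because the arithmetic is done with `Fraction`, the two sides are compared exactly rather than up to sampling noise, at the cost of only being practical for small $n$.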