Variance is a statistic that describes the spread of values around their mean. Suppose we have a set of observed values \( x = (x_1,x_2,...,x_n) \) with mean \( \bar{x} \). The classical formula is self-explanatory:
$$ Var(x) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n} $$
Expanding the square in this first formula gives:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
= \frac{1}{n}\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)
= \frac{1}{n}\sum_{i=1}^{n}\bigg(x_i^2-2x_i\frac{\sum_{j=1}^{n}x_j}{n}+\bigg[\frac{\sum_{j=1}^{n}x_j}{n}\bigg]^2\bigg) \\
= \frac{\sum_{i=1}^{n}x_i^2}{n} -\frac{2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j}{n^2}+\frac{1}{n^3}\sum_{i=1}^{n}\bigg[\sum_{j=1}^{n}x_j\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{2}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2+\frac{n}{n^3}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
$$
This leads to another well-known formula for variance:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2 $$
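As a quick sanity check, the two formulas so far can be compared numerically. This is only a minimal sketch in Python; the function names and the sample data are made up for illustration.

```python
def variance_by_definition(xs):
    """Mean of squared deviations from the mean (the classical formula)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_by_sums(xs):
    """(1/n) * sum of squares minus (1/n^2) * square of the sum."""
    n = len(xs)
    return sum(x * x for x in xs) / n - sum(xs) ** 2 / n ** 2


# Illustrative data: mean is 5, squared deviations add up to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_by_definition(data))  # 4.0
print(variance_by_sums(data))        # 4.0
```

With floating-point data whose mean is large compared to its spread, the second form can lose precision because it subtracts two nearly equal numbers, so the agreement is exact only in well-behaved cases like this one.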
There are other formulas for variance as well, all more or less self-explanatory. While playing with those terms one evening I also found a formula for variance that I had not seen described elsewhere (at least not that I know of, and I have the usual excuse of a freshman in this field). So:
$$ Var(x)=\frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{2n^2}\bigg[n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{i=1}^{n}x_i+ n\sum_{i=1}^{n}x_i^2 \bigg]
= \frac{1}{2n^2}\bigg[n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+ n\sum_{j=1}^{n}x_j^2 \bigg] \\
= \frac{1}{2n^2}\bigg[\sum_{i=1}^{n}\sum_{j=1}^{n}x_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n}x_ix_j+ \sum_{i=1}^{n}\sum_{j=1}^{n}x_j^2 \bigg]
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i^2 - 2x_ix_j+ x_j^2)
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
Finally, the promised new formula is:
$$ Var(x) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
or, since each term is symmetric in \( i \) and \( j \) and vanishes when \( i = j \), simplified a little more:
$$ Var(x) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=i}^{n}(x_i-x_j)^2 $$
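Both pairwise forms can be checked against a standard implementation. The sketch below assumes Python's standard library `statistics.pvariance` (the population variance, dividing by \( n \)) as the reference; the function names and data are illustrative only.

```python
from itertools import combinations
from statistics import pvariance


def variance_all_pairs(xs):
    """Double sum over all ordered pairs (i, j), divided by 2 * n^2."""
    n = len(xs)
    total = sum((a - b) ** 2 for a in xs for b in xs)
    return total / (2 * n ** 2)


def variance_unordered_pairs(xs):
    """Sum over pairs with i < j only, divided by n^2.

    The i == j terms are zero, so skipping them changes nothing.
    """
    n = len(xs)
    total = sum((a - b) ** 2 for a, b in combinations(xs, 2))
    return total / n ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_all_pairs(data))        # 4.0
print(variance_unordered_pairs(data))  # 4.0
print(pvariance(data))                 # reference value from the standard library
```

Note that neither function ever computes the mean: the spread is measured purely from differences between values.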
This formula is clearly not practical from a computational point of view: evaluating it takes \(O(n^2)\) time, whereas the previous formulas need only \(O(n)\). What I really find interesting about this formula is its form. In plain English it can be read as "variance is half the average of the squared differences between all pairs of values".
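The "average over pairs" reading can be made concrete. Because of the factor \( \frac{1}{2} \) in front of the double sum, the plain average of squared differences over all \( n^2 \) ordered pairs comes out as twice the variance. A small sketch, with a hypothetical helper name and made-up data:

```python
from itertools import product
from statistics import pvariance


def mean_squared_difference(xs):
    """Average of (a - b)^2 over all n^2 ordered pairs, including a paired with itself."""
    pairs = list(product(xs, repeat=2))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


data = [1, 2, 3, 4]
msd = mean_squared_difference(data)
print(msd)              # 2.5
print(msd / 2)          # 1.25
print(pvariance(data))  # population variance, for comparison
```

The list of pairs has \( n^2 \) entries, which is where the quadratic running time comes from.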
The formula also gives a nice geometrical, or spatial, view of variance. Another idea it suggests is that the sample variance tends to be closer to the population variance when the sample contains more values, and the formula makes this idea clearer to me. I picture the trust one can put in the sample variance as a predictor of the population variance as the density of links between the sample values, where the links are the squared differences between those values.
I have a friend who knows statistics much better than I do, and he found this formula interesting, if only because he first looked at the final formula and declared it incorrect. Well, it seems he was wrong this time.