Variance is the statistic which describes the spread of values around the mean. Suppose we have a sample \( x = (x_1,x_2,...,x_n) \). The classical formula is self-explanatory:
$$ Var(x) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n} $$
Expanding the first formula:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
= \frac{1}{n}\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)
= \frac{1}{n}\sum_{i=1}^{n}\bigg(x_i^2-2x_i\frac{\sum_{j=1}^{n}x_j}{n}+\bigg[\frac{\sum_{j=1}^{n}x_j}{n}\bigg]^2\bigg) \\
= \frac{1}{n}\sum_{i=1}^{n}x_i^2 -\frac{2}{n^2}\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+\frac{1}{n^3}\sum_{i=1}^{n}\bigg[\sum_{j=1}^{n}x_j\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{2}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2+\frac{n}{n^3}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
$$
This leads to another well-known formula for variance:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2 $$
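As a quick sanity check, here is a small Python snippet (my addition, not part of the original post) that evaluates both formulas on the same random data; the two results agree up to floating-point rounding.

```python
import numpy as np

# Quick numerical check (my addition): both classical variance formulas
# produce the same value on the same data.
rng = np.random.default_rng(42)
x = rng.normal(size=1000)
n = len(x)

var_centered = np.sum((x - x.mean()) ** 2) / n            # first formula
var_moments = np.sum(x ** 2) / n - (np.sum(x) / n) ** 2   # second formula

print(var_centered, var_moments)  # equal up to floating-point rounding
```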
There are also other formulas for variance, more or less self-explanatory. While playing with these terms one evening, I also found a formula for variance which I have not seen described elsewhere (at least not to my knowledge, and I have the good excuse of being a freshman in this field). So:
$$ Var(x)=\frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{2n^2}\bigg[n\sum_{i=1}^{n}x_i^2 - 2\bigg[\sum_{i=1}^{n}x_i\bigg]^2+ n\sum_{i=1}^{n}x_i^2 \bigg]
= \frac{1}{2n^2}\bigg[n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+ n\sum_{j=1}^{n}x_j^2 \bigg] \\
= \frac{1}{2n^2}\bigg[\sum_{i=1}^{n}\sum_{j=1}^{n}x_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n}x_ix_j+ \sum_{i=1}^{n}\sum_{j=1}^{n}x_j^2 \bigg]
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i^2 - 2x_ix_j+ x_j^2)
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
Finally, the promised new formula is:
$$ Var(x) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
or, simplified a little further:
$$ Var(x) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=i}^{n}(x_i-x_j)^2 $$
This formula is clearly not feasible from a computational point of view: calculating it takes \(O(n^2)\) running time. What I really find interesting about this formula is its form. In plain English it can be read as "variance is half the average squared difference between all pairs of values".
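Here is a short Python sketch (my addition) checking that the pairwise formula matches the classical one, at the expected \(O(n^2)\) cost.

```python
import numpy as np

# Sketch (my addition): the pairwise formula gives the same value as the
# classical formula, at O(n^2) cost.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
n = len(x)

var_classic = np.sum((x - x.mean()) ** 2) / n

diffs = x[:, None] - x[None, :]               # n x n matrix of x_i - x_j
var_pairwise = np.sum(diffs ** 2) / (2 * n ** 2)

print(var_classic, var_pairwise)              # equal up to rounding error
```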
That formula also gives a nice geometrical, or spatial, view of variance. Another idea which can be drawn from it is that the sample variance gets closer to the population variance as the sample contains more values; seen through this formula, the idea becomes more intuitive. I imagine the trust one can put in the sample variance as a predictor of the population variance as the density of links between the sample values (where the links are the squared differences between those values).
I have a friend who knows statistics much better than I do, and he found this formula interesting, not least because when he first looked at the final formula he declared it incorrect. Well, it seems he was wrong this time.
January 16, 2013
January 10, 2013
Approximate with a constant value
When we do regression with a decision tree like CART, part of the learning bias is the fact that we choose to approximate the values from a region with a single constant value. Basically, what decision trees do in the case of regression is to recursively split the p-dimensional space into regions. Then, for each region, the tree selects the corresponding instances and, using the target values from that subset, calculates a constant value used later for prediction.
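To make the idea concrete, here is a toy Python sketch (my own illustration, not the actual CART splitting procedure): a one-dimensional feature space is split at a fixed threshold into two regions, and each region is predicted with a single constant, here the mean of its responses.

```python
import numpy as np

# Toy sketch (my illustration, not the actual CART algorithm): split the
# feature space at a fixed threshold and predict each region with a constant.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = np.where(x < 5, 2.0, 7.0) + rng.normal(scale=0.5, size=100)

threshold = 5.0                              # assume the split is already chosen
region_constants = {
    "left": y[x < threshold].mean(),         # constant for the left region
    "right": y[x >= threshold].mean(),       # constant for the right region
}

def predict(x_new):
    side = "left" if x_new < threshold else "right"
    return region_constants[side]

print(region_constants, predict(3.0), predict(8.0))
```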
How CART and its relatives perform the splitting is not the subject here. What I want to talk about is how to find a proper constant value for a region. To put it in simpler terms, I am interested in how to calculate a constant value usable for prediction, given a set of values \(y=(y_1,y_2,..,y_n)\). To obtain it we use a loss function: a function which establishes the criterion for choosing the proper constant value.
I was interested in two very often used loss functions: square loss and absolute loss.
Square loss
The square loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}(y_i-c)^2 $$ and what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{(y_i-c)^2}$$ Here there is a simple way to calculate it, using the fact that the loss is a convex quadratic function of \( c \): it has a unique minimum, attained where the derivative with respect to \( c \) equals 0. $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg]=0 $$ We calculate $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg] = \sum_{i=1}^{n}{\frac{\partial}{\partial{c}}{\big[(y_i-c)^2\big]}} = \sum_{i=1}^{n}{2(y_i-c)\frac{\partial}{\partial{c}}(y_i-c)} = \sum_{i=1}^{n}{-2(y_i-c)} = 0 $$ $$ \sum_{i=1}^{n}{y_i} = \sum_{i=1}^{n}{c} = nc $$ $$ c = \frac{\sum_{i=1}^{n}{y_i}}{n} = \overline{y} $$
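A short numerical sketch (my addition) of this result: evaluating the squared loss over a grid of candidate constants, the grid minimizer coincides with the mean.

```python
import numpy as np

# Numerical sketch (my addition): the constant minimizing the squared loss
# is the mean of the values.
rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, scale=2.0, size=500)

candidates = np.linspace(y.min(), y.max(), 2001)
losses = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)

best_c = candidates[np.argmin(losses)]
print(best_c, y.mean())   # grid minimizer is (approximately) the mean
```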
Absolute loss
The absolute loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}|y_i-c| $$ Again, what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{|y_i-c|}$$ We know the function is non-negative, so we know it has a minimum somewhere. This time we cannot simply set the derivative to zero, because the function is not differentiable at the points \( y_i \). I read somewhere that \( c \) is the median, and that this is statistically correct. I do not know what "statistically correct" means, and it sounds fuzzy to me. After a few tries I found a way to prove it. I hope some day I will learn a shorter proof.
We consider, without loss of generality, that the \(y\) values are sorted (\(y_1 \leq y_2 \leq ... \leq y_n\)). We are interested first in what happens when \(c < y_1\). We calculate $$ \underset{c<{y_1}}{f(y,c)}=\sum_{i=1}^{n}{|y_i-c|}=\sum_{i=1}^{n}{|(y_1-c)+(y_i-y_1)|}=\sum_{i=1}^{n}{|y_1-c|}+\sum_{i=1}^{n}{|y_i-y_1|} \geq \sum_{i=1}^{n}{|y_i-y_1|}=\underset{c={y_1}}{f(y,c)}$$ (both terms inside the absolute value are non-negative, so the absolute value splits into a sum). So $$ \underset{c<{y_1}}{f(y,c)} \geq \underset{c={y_1}}{f(y,c)}$$ In a similar manner we show that $$ \underset{c>{y_n}}{f(y,c)} \geq \underset{c={y_n}}{f(y,c)}$$ What we found is that, to minimize the absolute loss, \(c\) must lie between \(y_1\) and \(y_n\) inclusive. Next we note that, when \(c\) lies in this interval, the summed distance to the two ends is constant: $$ |y_1-c|+|y_n-c|=|y_n-y_1| $$ So we can drop those two points (\(y_1,y_n\)) from consideration, restrict our attention to the interval \([y_2,y_{n-1}]\), and apply the same process recursively.
Finally we will find either a single point in the middle or an interval with no other \(y\) values inside. If we find a point, that is our solution. If we find an interval, any point from that interval is a solution. We note that the definition of the median fits these conditions, so the median is a value which minimizes the absolute loss function.
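The same kind of numerical sketch (my addition) for the absolute loss: the grid minimizer coincides with the median.

```python
import numpy as np

# Numerical sketch (my addition): the constant minimizing the absolute loss
# is the median (with odd n the median is one of the sample values).
rng = np.random.default_rng(3)
y = rng.exponential(scale=2.0, size=501)

candidates = np.linspace(y.min(), y.max(), 2001)
losses = np.abs(y[:, None] - candidates[None, :]).sum(axis=0)

best_c = candidates[np.argmin(losses)]
print(best_c, np.median(y))   # grid minimizer is (approximately) the median
```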
January 8, 2013
Complete rewrite of myself
Over the past few years I slowly started to rewrite myself. My interests moved to fundamental computer science, data structures and algorithms. More recently I discovered machine learning, statistics and data mining. I am completely captivated by these problems and I feel that this is the way to go for the rest of my life. So, no more about programming, technologies or frameworks.
As a starting point I would like to share a gorgeous and insightful poem, written by the eminent statistician Maurice Kendall, called "Hiawatha Designs an Experiment".
Original source:
Kendall, Maurice (1959). Hiawatha Designs an Experiment. The American Statistician 13: 23-24.
- Hiawatha, mighty hunter,
He could shoot ten arrows upward,
Shoot them with such strength and swiftness
That the last had left the bow-string
Ere the first to earth descended.
- This was commonly regarded
As a feat of skill and cunning.
Several sarcastic spirits
Pointed out to him, however,
That it might be much more useful
If he sometimes hit the target.
"Why not shoot a little straighter
And employ a smaller sample?"
Hiawatha, who at college
Majored in applied statistics,
Consequently felt entitled
To instruct his fellow man
In any subject whatsoever,
Waxed exceedingly indignant,
Talked about the law of errors,
Talked about truncated normals,
Talked of loss of information,
Talked about his lack of bias,
Pointed out that (in the long run)
Independent observations,
Even though they missed the target,
Had an average point of impact
Very near the spot he aimed at,
With the possible exception
of a set of measure zero.
- "This," they said, "was rather doubtful;
Anyway it didn't matter.
What resulted in the long run:
Either he must hit the target
Much more often than at present,
Or himself would have to pay for
All the arrows he had wasted."
- Hiawatha, in a temper,
Quoted parts of R. A. Fisher,
Quoted Yates and quoted Finney,
Quoted reams of Oscar Kempthorne,
Quoted Anderson and Bancroft
(practically in extenso)
Trying to impress upon them
That what actually mattered
Was to estimate the error.
- Several of them admitted:
"Such a thing might have its uses;
Still," they said, "he would do better
If he shot a little straighter."
- Hiawatha, to convince them,
Organized a shooting contest.
Laid out in the proper manner
Of designs experimental
Recommended in the textbooks,
Mainly used for tasting tea
(but sometimes used in other cases)
Used factorial arrangements
And the theory of Galois,
Got a nicely balanced layout
And successfully confounded
Second order interactions.
- All the other tribal marksmen,
Ignorant benighted creatures
Of experimental setups,
Used their time of preparation
Putting in a lot of practice
Merely shooting at the target.
- Thus it happened in the contest
That their scores were most impressive
With one solitary exception.
This, I hate to have to say it,
Was the score of Hiawatha,
Who as usual shot his arrows,
Shot them with great strength and swiftness,
Managing to be unbiased,
Not however with a salvo
Managing to hit the target.
- "There!" they said to Hiawatha,
"That is what we all expected."
Hiawatha, nothing daunted,
Called for pen and called for paper.
But analysis of variance
Finally produced the figures
Showing beyond all peradventure,
Everybody else was biased.
And the variance components
Did not differ from each other's,
Or from Hiawatha's.
(This last point it might be mentioned,
Would have been much more convincing
If he hadn't been compelled to
Estimate his own components
From experimental plots on
Which the values all were missing.)
- Still they couldn't understand it,
So they couldn't raise objections.
(Which is what so often happens
with analysis of variance.)
All the same his fellow tribesmen,
Ignorant benighted heathens,
Took away his bow and arrows,
Said that though my Hiawatha
Was a brilliant statistician,
He was useless as a bowman.
As for variance components
Several of the more outspoken
Made primeval observations
Hurtful of the finer feelings
Even of the statistician.
- In a corner of the forest
Sits alone my Hiawatha
Permanently cogitating
On the normal law of errors.
Wondering in idle moments
If perhaps increased precision
Might perhaps be sometimes better
Even at the cost of bias,
If one could thereby now and then
Register upon a target.