January 16, 2013

Another formula for variance

Variance is the statistic that describes the spread of values with respect to the mean. Suppose we have a sample of values \( x = (x_1,x_2,\ldots,x_n) \). The classical formula is self-explanatory:
$$ Var(x) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n} $$
Expanding the first formula:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
= \frac{1}{n}\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)
= \frac{1}{n}\sum_{i=1}^{n}\bigg(x_i^2-2x_i\frac{\sum_{j=1}^{n}x_j}{n}+\bigg[\frac{\sum_{j=1}^{n}x_j}{n}\bigg]^2\bigg) \\
= \frac{1}{n}\sum_{i=1}^{n}x_i^2 -\frac{2}{n^2}\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+\frac{1}{n^3}\sum_{i=1}^{n}\bigg[\sum_{j=1}^{n}x_j\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{2}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2+\frac{n}{n^3}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
$$
This leads to another well-known formula for the variance:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2 $$
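Both expressions are easy to check numerically. Here is a minimal Python sketch (with a randomly generated sample, purely for illustration) comparing them:

    import random

    # hypothetical sample, only to check that the two formulas agree numerically
    x = [random.gauss(0, 1) for _ in range(1000)]
    n = len(x)

    mean = sum(x) / n
    var_classical = sum((xi - mean) ** 2 for xi in x) / n
    var_one_pass = sum(xi * xi for xi in x) / n - (sum(x) / n) ** 2

    print(var_classical, var_one_pass)  # the two values should match up to rounding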
There are other formulas for variance as well, more or less self-explanatory. While playing with these terms one evening, I found another formula for variance which I have not seen described elsewhere (at least I do not know of it, and I have the strong excuse of being a freshman in this field). So:
$$ Var(x)=\frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{2n^2}\bigg[2n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j \bigg]
= \frac{1}{2n^2}\bigg[n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+ n\sum_{j=1}^{n}x_j^2 \bigg] \\
= \frac{1}{2n^2}\bigg[\sum_{i=1}^{n}\sum_{j=1}^{n}x_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n}x_ix_j+ \sum_{i=1}^{n}\sum_{j=1}^{n}x_j^2 \bigg]
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i^2 - 2x_ix_j+ x_j^2)
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
Finally, the promised new formula is:
$$ Var(x) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
or, noticing that the diagonal terms vanish and each unordered pair is counted twice, simplified a little more:
$$ Var(x) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=i+1}^{n}(x_i-x_j)^2 $$
This formula is clearly not attractive from a computational point of view: evaluating it directly takes \(O(n^2)\) running time. What I really find interesting about this formula is its form. In plain English it reads something like "variance is half the average of the squared differences between all pairs of values".
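For what it is worth, here is a small Python sketch (again with random, made-up data) comparing the pairwise formula and its simplified form against the classical one:

    import random

    # hypothetical sample; compare the pairwise formula against the classical one
    x = [random.gauss(0, 1) for _ in range(200)]
    n = len(x)

    mean = sum(x) / n
    var_classical = sum((xi - mean) ** 2 for xi in x) / n

    # double sum over all ordered pairs, O(n^2) as noted above
    var_pairwise = sum((xi - xj) ** 2 for xi in x for xj in x) / (2 * n * n)

    # simplified form: each unordered pair counted once, diagonal terms vanish
    var_pairs_once = sum((x[i] - x[j]) ** 2
                         for i in range(n) for j in range(i + 1, n)) / (n * n)

    print(var_classical, var_pairwise, var_pairs_once)  # all three should agree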
This formula also gives a nice geometrical, or spatial, view of variance. Another idea that can be read from it is that the sample variance gets closer to the population variance as the sample grows. Seen through this formula, that idea becomes more intuitive: I picture the trust one can put in the sample variance as a predictor of the population variance as the density of links between sample values, where the links are the squared differences between pairs of values.
I have a friend who knows statistics much better than I do, and he found this formula interesting, not least because at first glance at the final formula he declared it incorrect. Well, it seems he was wrong this time.

January 10, 2013

Approximate with a constant value


When we do regression with a decision tree like CART, part of the learning bias is the fact that we choose to approximate the values from a region with a single constant value. Basically, what a decision tree does in the case of regression is to recursively split the p-dimensional input space into regions. Then, for each region, it selects the corresponding instances and, using the target values from that subset, it calculates a constant value used later for prediction, as in the sketch below.
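As a toy illustration (a sketch only, with made-up data and a fixed split point, not the actual CART splitting procedure), predicting with one constant per region could look like this in Python:

    # split a 1-D feature at a hypothetical threshold and predict with
    # one constant (here the mean of the targets) per region
    xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
    ys = [1.1, 0.9, 1.0, 4.2, 3.8, 4.0]
    threshold = 4.0  # assumed split point, just for illustration

    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    c_left, c_right = sum(left) / len(left), sum(right) / len(right)

    def predict(x):
        # each region is approximated by its own constant value
        return c_left if x <= threshold else c_right

    print(predict(2.5), predict(7.5))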

How CART and its relatives perform the splitting is not the subject here. What I want to talk about is how to find a proper constant value for a region. To reformulate in simpler terms, I am interested in how to calculate a constant value usable for prediction, given a set of values \(y=(y_1,y_2,\ldots,y_n)\). In order to obtain it we use a loss function, which establishes the criterion for choosing the proper constant value.

I was interested in two very often used loss functions: square loss and absolute loss.

Square loss


The square loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}(y_i-c)^2 $$ and what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{(y_i-c)^2}$$ Here there is a simple way to calculate it: the loss is a sum of squares, hence a convex quadratic in \( c \), so it has a minimum and that minimum is attained where the derivative equals 0. $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg]=0 $$ We calculate $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg] = \sum_{i=1}^{n}{\frac{\partial}{\partial{c}}{\big[(y_i-c)^2\big]}} = \sum_{i=1}^{n}{2(y_i-c)\frac{\partial}{\partial{c}}(y_i-c)} = -2\sum_{i=1}^{n}{(y_i-c)} = 0 $$ $$ \sum_{i=1}^{n}{y_i} = \sum_{i=1}^{n}{c} = nc $$ $$ c = \frac{\sum_{i=1}^{n}{y_i}}{n} = \overline{y} $$ So the constant which minimizes the square loss is the mean of the values.
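A tiny numerical check (a Python sketch with made-up values) that the mean indeed minimizes the square loss:

    # scan candidate constants around the mean and compare their square loss
    y = [2.0, 3.0, 5.0, 7.0, 11.0]

    def square_loss(c):
        return sum((yi - c) ** 2 for yi in y)

    mean = sum(y) / len(y)
    candidates = [mean + d / 10 for d in range(-20, 21)]
    best = min(candidates, key=square_loss)

    print(mean, best)  # the best scanned candidate is the mean itself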

Absolute loss


The absolute loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}|y_i-c| $$ Again, what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{|y_i-c|}$$ The function is positive and grows without bound as \( c \) moves far away from the data, so it attains a minimum somewhere. This time we cannot simply differentiate, since the absolute value is not differentiable everywhere. I read somewhere that \( c \) is the median, and that this is statistically correct. I do not know what "statistically correct" means, and it sounds fuzzy to me. After a few tries I found a way to prove it. I hope some day I will learn a shorter proof.

We consider, without loss of generality, that the \(y\) values are sorted (\(y_1 \leq y_2 \leq \ldots \leq y_n\)). We are interested first in what happens when \(c < y_1\). We calculate $$ \underset{c<{y_1}}{f(y,c)}=\sum_{i=1}^{n}{|y_i-c|}=\sum_{i=1}^{n}{|(y_i-y_1)+(y_1-c)|}=n(y_1-c)+\sum_{i=1}^{n}{(y_i-y_1)} \geq \sum_{i=1}^{n}{|y_i-y_1|}=\underset{c={y_1}}{f(y,c)}$$ since both terms inside the absolute value are non-negative when \(c < y_1 \leq y_i\). So $$ \underset{c<{y_1}}{f(y,c)} \geq \underset{c={y_1}}{f(y,c)}$$ In a similar manner we show also that $$ \underset{c>{y_n}}{f(y,c)} \geq \underset{c={y_n}}{f(y,c)}$$ What we found is that, to minimize the absolute loss, \(c\) must lie between \(y_1\) and \(y_n\) inclusive. Next we note that, when \(c\) is in this interval, the summed distance to the two ends is constant: $$ |y_1-c|+|y_n-c|=y_n-y_1 $$ So we can simply drop those two points (\(y_1, y_n\)) from consideration, restrict our attention to the interval \([y_2,y_{n-1}]\), and apply the same process recursively.

Finally we will find either a single point in the middle, or an interval with no other \(y\) values inside. If we found a point, then this is our solution. If we found an interval, any point from that interval is a solution. We note that the median, by its definition, satisfies these conditions, so it is a solution which minimizes the absolute loss function.
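The same thing can be checked numerically. A small Python sketch (made-up values) scanning candidate constants shows the absolute loss is smallest at, or arbitrarily close to, the median:

    import statistics

    # scan candidate constants over the data range and compare their absolute loss
    y = [1.0, 2.0, 2.5, 7.0, 40.0]

    def absolute_loss(c):
        return sum(abs(yi - c) for yi in y)

    median = statistics.median(y)
    candidates = [min(y) + i * (max(y) - min(y)) / 1000 for i in range(1001)]
    best = min(candidates, key=absolute_loss)

    print(median, best, absolute_loss(median), absolute_loss(best))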

January 8, 2013

Complete rewrite myself

Over the past years I have slowly started to rewrite myself. My interests have moved to fundamental computer science, data structures and algorithms. More recently I discovered machine learning, statistics and data mining. I am completely captivated by these problems and I feel that this is the way to go for the rest of my life. So, no more posts about programming, technologies or frameworks.

As a starting point I would like to share a gorgeous and insightful poem, written by the eminent statistician Maurice Kendall, called "Hiawatha Designs an Experiment".

Original source:
Kendall, Maurice (1959). "Hiawatha Designs an Experiment". The American Statistician, 13: 23-24.

    Hiawatha, mighty hunter,
    He could shoot ten arrows upward,
    Shoot them with such strength and swiftness
    That the last had left the bow-string
    Ere the first to earth descended.
    This was commonly regarded
    As a feat of skill and cunning.
    Several sarcastic spirits
    Pointed out to him, however,
    That it might be much more useful
    If he sometimes hit the target.
    "Why not shoot a little straighter
    And employ a smaller sample?"
    Hiawatha, who at college
    Majored in applied statistics,
    Consequently felt entitled
    To instruct his fellow man
    In any subject whatsoever,
    Waxed exceedingly indignant,
    Talked about the law of errors,
    Talked about truncated normals,
    Talked of loss of information,
    Talked about his lack of bias,
    Pointed out that (in the long run)
    Independent observations,
    Even though they missed the target,
    Had an average point of impact
    Very near the spot he aimed at,
    With the possible exception
    of a set of measure zero.
    "This," they said, "was rather doubtful;
    Anyway it didn't matter.
    What resulted in the long run:
    Either he must hit the target
    Much more often than at present,
    Or himself would have to pay for
    All the arrows he had wasted."
    Hiawatha, in a temper,
    Quoted parts of R. A. Fisher,
    Quoted Yates and quoted Finney,
    Quoted reams of Oscar Kempthorne,
    Quoted Anderson and Bancroft
    (practically in extenso)
    Trying to impress upon them
    That what actually mattered
    Was to estimate the error.
    Several of them admitted:
    "Such a thing might have its uses;
    Still," they said, "he would do better
    If he shot a little straighter."
    Hiawatha, to convince them,
    Organized a shooting contest.
    Laid out in the proper manner
    Of designs experimental
    Recommended in the textbooks,
    Mainly used for tasting tea
    (but sometimes used in other cases)
    Used factorial arrangements
    And the theory of Galois,
    Got a nicely balanced layout
    And successfully confounded
    Second order interactions.
    All the other tribal marksmen,
    Ignorant benighted creatures
    Of experimental setups,
    Used their time of preparation
    Putting in a lot of practice
    Merely shooting at the target.
    Thus it happened in the contest
    That their scores were most impressive
    With one solitary exception.
    This, I hate to have to say it,
    Was the score of Hiawatha,
    Who as usual shot his arrows,
    Shot them with great strength and swiftness,
    Managing to be unbiased,
    Not however with a salvo
    Managing to hit the target.
    "There!" they said to Hiawatha,
    "That is what we all expected."
    Hiawatha, nothing daunted,
    Called for pen and called for paper.
    But analysis of variance
    Finally produced the figures
    Showing beyond all peradventure,
    Everybody else was biased.
    And the variance components
    Did not differ from each other's,
    Or from Hiawatha's.
    (This last point it might be mentioned,
    Would have been much more convincing
    If he hadn't been compelled to
    Estimate his own components
    From experimental plots on
    Which the values all were missing.)
    Still they couldn't understand it,
    So they couldn't raise objections.
    (Which is what so often happens
    with analysis of variance.)
    All the same his fellow tribesmen,
    Ignorant benighted heathens,
    Took away his bow and arrows,
    Said that though my Hiawatha
    Was a brilliant statistician,
    He was useless as a bowman.
    As for variance components
    Several of the more outspoken
    Make primeval observations
    Hurtful of the finer feelings
    Even of the statistician.
    In a corner of the forest
    Sits alone my Hiawatha
    Permanently cogitating
    On the normal law of errors.
    Wondering in idle moments
    If perhaps increased precision
    Might perhaps be sometimes better
    Even at the cost of bias,
    If one could thereby now and then
    Register upon a target.