January 16, 2013

Another formula for variance

Variance is the statistic that describes the spread of values with respect to the mean. Suppose we have a sample of values \( x = (x_1,x_2,\ldots,x_n) \). The classical formula is self-explanatory:
$$ Var(x) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n} $$
Expanding the first formula:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
= \frac{1}{n}\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)
= \frac{1}{n}\sum_{i=1}^{n}\bigg(x_i^2-2x_i\frac{\sum_{j=1}^{n}x_j}{n}+\bigg[\frac{\sum_{j=1}^{n}x_j}{n}\bigg]^2\bigg) \\
= \frac{1}{n}\sum_{i=1}^{n}x_i^2 -\frac{2}{n^2}\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+\frac{1}{n^3}\sum_{i=1}^{n}\bigg[\sum_{j=1}^{n}x_j\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{2}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2+\frac{n}{n^3}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
$$
This leads to another well-known formula for the variance:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2 $$
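Both expressions are easy to check numerically. Here is a minimal Python sketch (with a randomly generated sample, purely for illustration) comparing them:

    import random

    # hypothetical sample, only to check that the two formulas agree numerically
    x = [random.gauss(0, 1) for _ in range(1000)]
    n = len(x)

    mean = sum(x) / n
    var_classical = sum((xi - mean) ** 2 for xi in x) / n
    var_one_pass = sum(xi * xi for xi in x) / n - (sum(x) / n) ** 2

    print(var_classical, var_one_pass)  # the two values should match up to rounding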
There are other formulas for variance as well, more or less self-explanatory. While playing with these terms one evening, I found another formula for variance which I have not seen described elsewhere (at least I do not know of it, and I have the strong excuse of being a freshman in this field). So:
$$ Var(x)=\frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{2n^2}\bigg[2n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j \bigg]
= \frac{1}{2n^2}\bigg[n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+ n\sum_{j=1}^{n}x_j^2 \bigg] \\
= \frac{1}{2n^2}\bigg[\sum_{i=1}^{n}\sum_{j=1}^{n}x_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n}x_ix_j+ \sum_{i=1}^{n}\sum_{j=1}^{n}x_j^2 \bigg]
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i^2 - 2x_ix_j+ x_j^2)
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
Finally, the promised new formula is:
$$ Var(x) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
or, noticing that the diagonal terms vanish and each unordered pair is counted twice, simplified a little more:
$$ Var(x) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=i+1}^{n}(x_i-x_j)^2 $$
This formula is clearly not attractive from a computational point of view: evaluating it directly takes \(O(n^2)\) running time. What I really find interesting about this formula is its form. In plain English it reads something like "variance is half the average of the squared differences between all pairs of values".
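For what it is worth, here is a small Python sketch (again with random, made-up data) comparing the pairwise formula and its simplified form against the classical one:

    import random

    # hypothetical sample; compare the pairwise formula against the classical one
    x = [random.gauss(0, 1) for _ in range(200)]
    n = len(x)

    mean = sum(x) / n
    var_classical = sum((xi - mean) ** 2 for xi in x) / n

    # double sum over all ordered pairs, O(n^2) as noted above
    var_pairwise = sum((xi - xj) ** 2 for xi in x for xj in x) / (2 * n * n)

    # simplified form: each unordered pair counted once, diagonal terms vanish
    var_pairs_once = sum((x[i] - x[j]) ** 2
                         for i in range(n) for j in range(i + 1, n)) / (n * n)

    print(var_classical, var_pairwise, var_pairs_once)  # all three should agree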
This formula also gives a nice geometrical, or spatial, view of variance. Another idea that can be read from it is that the sample variance gets closer to the population variance as the sample grows. Seen through this formula, that idea becomes more intuitive: I picture the trust one can put in the sample variance as a predictor of the population variance as the density of links between sample values, where the links are the squared differences between pairs of values.
I have a friend who knows statistics much better than I do, and he found this formula interesting, not least because at first glance at the final formula he declared it incorrect. Well, it seems he was wrong this time.

January 10, 2013

Approximate with a constant value


When we do regression with a decision tree like CART, part of the learning bias is the fact that we choose to approximate the values from a region with a single constant value. Basically, what a decision tree does in the case of regression is to recursively split the p-dimensional input space into regions. Then, for each region, it selects the corresponding instances and, using the target values from that subset, it calculates a constant value used later for prediction, as in the sketch below.
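As a toy illustration (a sketch only, with made-up data and a fixed split point, not the actual CART splitting procedure), predicting with one constant per region could look like this in Python:

    # split a 1-D feature at a hypothetical threshold and predict with
    # one constant (here the mean of the targets) per region
    xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
    ys = [1.1, 0.9, 1.0, 4.2, 3.8, 4.0]
    threshold = 4.0  # assumed split point, just for illustration

    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    c_left, c_right = sum(left) / len(left), sum(right) / len(right)

    def predict(x):
        # each region is approximated by its own constant value
        return c_left if x <= threshold else c_right

    print(predict(2.5), predict(7.5))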

How CART and its relatives perform the splitting is not the subject here. What I want to talk about is how to find a proper constant value for a region. To reformulate in simpler terms, I am interested in how to calculate a constant value usable for prediction, given a set of values \(y=(y_1,y_2,\ldots,y_n)\). In order to obtain it we use a loss function, which establishes the criterion for choosing the proper constant value.

I was interested in two very often used loss functions: square loss and absolute loss.

Square loss


The square loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}(y_i-c)^2 $$ and what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{(y_i-c)^2}$$ Here there is a simple way to calculate it: the loss is a sum of squares, hence a convex quadratic in \( c \), so it has a minimum and that minimum is attained where the derivative equals 0. $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg]=0 $$ We calculate $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg] = \sum_{i=1}^{n}{\frac{\partial}{\partial{c}}{\big[(y_i-c)^2\big]}} = \sum_{i=1}^{n}{2(y_i-c)\frac{\partial}{\partial{c}}(y_i-c)} = -2\sum_{i=1}^{n}{(y_i-c)} = 0 $$ $$ \sum_{i=1}^{n}{y_i} = \sum_{i=1}^{n}{c} = nc $$ $$ c = \frac{\sum_{i=1}^{n}{y_i}}{n} = \overline{y} $$ So the constant which minimizes the square loss is the mean of the values.
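A tiny numerical check (a Python sketch with made-up values) that the mean indeed minimizes the square loss:

    # scan candidate constants around the mean and compare their square loss
    y = [2.0, 3.0, 5.0, 7.0, 11.0]

    def square_loss(c):
        return sum((yi - c) ** 2 for yi in y)

    mean = sum(y) / len(y)
    candidates = [mean + d / 10 for d in range(-20, 21)]
    best = min(candidates, key=square_loss)

    print(mean, best)  # the best scanned candidate is the mean itself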

Absolute loss


The absolute loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}|y_i-c| $$ Again, what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{|y_i-c|}$$ The function is positive and grows without bound as \( c \) moves far away from the data, so it attains a minimum somewhere. This time we cannot simply differentiate, since the absolute value is not differentiable everywhere. I read somewhere that \( c \) is the median, and that this is statistically correct. I do not know what "statistically correct" means, and it sounds fuzzy to me. After a few tries I found a way to prove it. I hope some day I will learn a shorter proof.

We consider, without loss of generality, that the \(y\) values are sorted (\(y_1 \leq y_2 \leq \ldots \leq y_n\)). We are interested first in what happens when \(c < y_1\). We calculate $$ \underset{c<{y_1}}{f(y,c)}=\sum_{i=1}^{n}{|y_i-c|}=\sum_{i=1}^{n}{|(y_i-y_1)+(y_1-c)|}=n(y_1-c)+\sum_{i=1}^{n}{(y_i-y_1)} \geq \sum_{i=1}^{n}{|y_i-y_1|}=\underset{c={y_1}}{f(y,c)}$$ since both terms inside the absolute value are non-negative when \(c < y_1 \leq y_i\). So $$ \underset{c<{y_1}}{f(y,c)} \geq \underset{c={y_1}}{f(y,c)}$$ In a similar manner we show also that $$ \underset{c>{y_n}}{f(y,c)} \geq \underset{c={y_n}}{f(y,c)}$$ What we found is that, to minimize the absolute loss, \(c\) must lie between \(y_1\) and \(y_n\) inclusive. Next we note that, when \(c\) is in this interval, the summed distance to the two ends is constant: $$ |y_1-c|+|y_n-c|=y_n-y_1 $$ So we can simply drop those two points (\(y_1, y_n\)) from consideration, restrict our attention to the interval \([y_2,y_{n-1}]\), and apply the same process recursively.

Finally we will find either a single point in the middle, or an interval with no other \(y\) values inside. If we found a point, then this is our solution. If we found an interval, any point from that interval is a solution. We note that the median, by its definition, satisfies these conditions, so it is a solution which minimizes the absolute loss function.
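The same thing can be checked numerically. A small Python sketch (made-up values) scanning candidate constants shows the absolute loss is smallest at, or arbitrarily close to, the median:

    import statistics

    # scan candidate constants over the data range and compare their absolute loss
    y = [1.0, 2.0, 2.5, 7.0, 40.0]

    def absolute_loss(c):
        return sum(abs(yi - c) for yi in y)

    median = statistics.median(y)
    candidates = [min(y) + i * (max(y) - min(y)) / 1000 for i in range(1001)]
    best = min(candidates, key=absolute_loss)

    print(median, best, absolute_loss(median), absolute_loss(best))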

January 8, 2013

Complete rewrite myself

Over the past years I have slowly started to rewrite myself. My interests have moved to fundamental computer science, data structures and algorithms. More recently I discovered machine learning, statistics and data mining. I am completely captivated by these problems and I feel that this is the way to go for the rest of my life. So, no more posts about programming, technologies or frameworks.

As a starting point I would like to share a gorgeous and insightful poem, written by the eminent statistician Maurice Kendall, called "Hiawatha Designs an Experiment".

Original source:
Kendall, Maurice (1959). "Hiawatha Designs an Experiment". The American Statistician, 13: 23-24.

    Hiawatha, mighty hunter,
    He could shoot ten arrows upward,
    Shoot them with such strength and swiftness
    That the last had left the bow-string
    Ere the first to earth descended.
    This was commonly regarded
    As a feat of skill and cunning.
    Several sarcastic spirits
    Pointed out to him, however,
    That it might be much more useful
    If he sometimes hit the target.
    "Why not shoot a little straighter
    And employ a smaller sample?"
    Hiawatha, who at college
    Majored in applied statistics,
    Consequently felt entitled
    To instruct his fellow man
    In any subject whatsoever,
    Waxed exceedingly indignant,
    Talked about the law of errors,
    Talked about truncated normals,
    Talked of loss of information,
    Talked about his lack of bias,
    Pointed out that (in the long run)
    Independent observations,
    Even though they missed the target,
    Had an average point of impact
    Very near the spot he aimed at,
    With the possible exception
    of a set of measure zero.
    "This," they said, "was rather doubtful;
    Anyway it didn't matter.
    What resulted in the long run:
    Either he must hit the target
    Much more often than at present,
    Or himself would have to pay for
    All the arrows he had wasted."
    Hiawatha, in a temper,
    Quoted parts of R. A. Fisher,
    Quoted Yates and quoted Finney,
    Quoted reams of Oscar Kempthorne,
    Quoted Anderson and Bancroft
    (practically in extenso)
    Trying to impress upon them
    That what actually mattered
    Was to estimate the error.
    Several of them admitted:
    "Such a thing might have its uses;
    Still," they said, "he would do better
    If he shot a little straighter."
    Hiawatha, to convince them,
    Organized a shooting contest.
    Laid out in the proper manner
    Of designs experimental
    Recommended in the textbooks,
    Mainly used for tasting tea
    (but sometimes used in other cases)
    Used factorial arrangements
    And the theory of Galois,
    Got a nicely balanced layout
    And successfully confounded
    Second order interactions.
    All the other tribal marksmen,
    Ignorant benighted creatures
    Of experimental setups,
    Used their time of preparation
    Putting in a lot of practice
    Merely shooting at the target.
    Thus it happened in the contest
    That their scores were most impressive
    With one solitary exception.
    This, I hate to have to say it,
    Was the score of Hiawatha,
    Who as usual shot his arrows,
    Shot them with great strength and swiftness,
    Managing to be unbiased,
    Not however with a salvo
    Managing to hit the target.
    "There!" they said to Hiawatha,
    "That is what we all expected."
    Hiawatha, nothing daunted,
    Called for pen and called for paper.
    But analysis of variance
    Finally produced the figures
    Showing beyond all peradventure,
    Everybody else was biased.
    And the variance components
    Did not differ from each other's,
    Or from Hiawatha's.
    (This last point it might be mentioned,
    Would have been much more convincing
    If he hadn't been compelled to
    Estimate his own components
    From experimental plots on
    Which the values all were missing.)
    Still they couldn't understand it,
    So they couldn't raise objections.
    (Which is what so often happens
    with analysis of variance.)
    All the same his fellow tribesmen,
    Ignorant benighted heathens,
    Took away his bow and arrows,
    Said that though my Hiawatha
    Was a brilliant statistician,
    He was useless as a bowman.
    As for variance components
    Several of the more outspoken
    Make primeval observations
    Hurtful of the finer feelings
    Even of the statistician.
    In a corner of the forest
    Sits alone my Hiawatha
    Permanently cogitating
    On the normal law of errors.
    Wondering in idle moments
    If perhaps increased precision
    Might perhaps be sometimes better
    Even at the cost of bias,
    If one could thereby now and then
    Register upon a target.