January 16, 2013

Another formula for variance

Variance is the statistic which describes the spread of values around the mean. Suppose we have a sample \( x = (x_1,x_2,\ldots,x_n) \). The classical formula is self-explanatory:
$$ Var(x) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n} $$
Expanding the first formula:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
= \frac{1}{n}\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)
= \frac{1}{n}\sum_{i=1}^{n}\bigg(x_i^2-2x_i\frac{\sum_{j=1}^{n}x_j}{n}+\bigg[\frac{\sum_{j=1}^{n}x_j}{n}\bigg]^2\bigg) \\
= \frac{1}{n}\sum_{i=1}^{n}x_i^2 -\frac{2}{n^2}\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+\frac{1}{n^3}\sum_{i=1}^{n}\bigg[\sum_{j=1}^{n}x_j\bigg]^2 \\
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{2}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2+\frac{n}{n^3}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
$$
This leads to another well-known formula for the variance:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2 $$
There are other formulas for the variance as well, more or less self-explanatory. While playing with these terms one evening I also found a formula for the variance which I have not seen described anywhere else (at least not that I know of, and I have the strong excuse of being a freshman in this field). Multiplying and dividing by \(2n\), and renaming the dummy index in half the terms, gives:
$$ Var(x)=\frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{2n^2}\bigg[n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j + n\sum_{j=1}^{n}x_j^2 \bigg] \\
= \frac{1}{2n^2}\bigg[\sum_{i=1}^{n}\sum_{j=1}^{n}x_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n}x_ix_j + \sum_{i=1}^{n}\sum_{j=1}^{n}x_j^2 \bigg]
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i^2 - 2x_ix_j + x_j^2)
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
Finally, the promised new formula is:
$$ Var(x) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
or, since the diagonal terms with \(i=j\) vanish and the sum is symmetric in \(i\) and \(j\), simplified a little more:
$$ Var(x) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=i}^{n}(x_i-x_j)^2 $$
This formula is clearly not feasible from a computational point of view: calculating it takes \(O(n^2)\) running time. What I really found interesting about this formula is its form. In plain English it can be translated as: "the variance is the average of the squared differences between all pairs of values".
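As a quick sanity check, here is a minimal sketch in Java (with a made-up sample) which computes the variance with the classical formula, the sum-of-squares formula, and the pairwise formula, and verifies that all three agree:

    public class VarianceDemo {
        public static void main(String[] args) {
            double[] x = {2, 4, 4, 4, 5, 5, 7, 9}; // hypothetical sample
            int n = x.length;

            // classical formula: mean of squared deviations from the mean
            double sum = 0;
            for (double v : x) sum += v;
            double mean = sum / n;
            double classical = 0;
            for (double v : x) classical += (v - mean) * (v - mean);
            classical /= n;

            // second formula: (1/n) sum(x_i^2) - (1/n^2) (sum(x_i))^2
            double sumSq = 0;
            for (double v : x) sumSq += v * v;
            double second = sumSq / n - (sum * sum) / ((double) n * n);

            // pairwise formula: average squared difference over all pairs, O(n^2)
            double pairwise = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    pairwise += (x[i] - x[j]) * (x[i] - x[j]);
            pairwise /= 2.0 * n * n;

            System.out.printf("classical=%f second=%f pairwise=%f%n",
                    classical, second, pairwise);
        }
    }

For this sample all three print the same value (4), as expected.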
The formula also gives a nice geometrical or spatial view of variance. Another idea which can be derived from it is that the sample variance gets closer to the population variance as the sample grows, and this formula makes that idea more intuitive. I imagine the trust one can put in the sample variance as a predictor of the population variance as the density of links between the sample values (where those links are the squared differences between the values): the more values in the sample, the denser the web of links.
I have a friend who knows statistics much better than I do, and he found this formula interesting, not least because at first glance he declared it incorrect. Well, it seems he was wrong this time.

January 10, 2013

Approximate with a constant value

When we do regression with a decision tree like CART, part of the learning bias is the fact that we choose to approximate the values from a region with a single constant. Basically, what a decision tree does in the case of regression is to recursively split the p-dimensional input space into regions. Then, for each region, it selects the corresponding instances and, using the target variable values from that subset, calculates a constant value used later for prediction.

How CART and its friends work with regard to the splitting process is not the subject here. What I want to talk about is how to find a proper constant value for a region. To reformulate in simpler terms: given a set of values \(y=(y_1,y_2,\ldots,y_n)\), I am interested in how to calculate a constant value usable for prediction. To obtain it we use a loss function, which is a function that establishes a criterion for choosing the proper constant value.

I was interested in two very often used loss functions: square loss and absolute loss.

Square loss

The square loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}(y_i-c)^2 $$ and what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{(y_i-c)^2}$$ There is a simple way to find it: the loss is a sum of squares, hence a convex quadratic in \(c\), so it has a minimum, and that minimum is attained where the derivative equals 0. $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg]=0 $$ We calculate $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg] = \sum_{i=1}^{n}{\frac{\partial}{\partial{c}}{\big[(y_i-c)^2\big]}} = \sum_{i=1}^{n}{2(y_i-c)\frac{\partial}{\partial{c}}(y_i-c)} = \sum_{i=1}^{n}{-2(y_i-c)} = 0 $$ $$ \sum_{i=1}^{n}{y_i} = \sum_{i=1}^{n}{c} = nc $$ $$ c = \frac{\sum_{i=1}^{n}{y_i}}{n} = \overline{y} $$
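As a quick numerical check of this result, here is a minimal sketch in Java (hypothetical sample, brute-force grid search instead of the analytic argument above) showing that the square loss is minimized at the mean:

    import java.util.Arrays;

    public class SquareLossDemo {

        // square loss of a constant prediction c over the sample
        static double squareLoss(double[] y, double c) {
            double s = 0;
            for (double v : y) s += (v - c) * (v - c);
            return s;
        }

        public static void main(String[] args) {
            double[] y = {1.0, 2.0, 2.5, 7.0, 10.0}; // hypothetical values
            double mean = Arrays.stream(y).average().getAsDouble();

            // scan candidate constants; the minimum should land on the mean
            double bestC = 0, bestLoss = Double.MAX_VALUE;
            for (double c = 0; c <= 11; c += 0.01) {
                double loss = squareLoss(y, c);
                if (loss < bestLoss) {
                    bestLoss = loss;
                    bestC = c;
                }
            }
            System.out.printf("mean=%.2f argmin=%.2f%n", mean, bestC);
        }
    }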

Absolute loss

The absolute loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}|y_i-c| $$ Again, what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{|y_i-c|}$$ The function is non-negative and piecewise linear in \(c\), so it has a minimum somewhere, but this time we can't simply set a derivative to zero, since the absolute value is not differentiable everywhere. I read somewhere that \(c\) is the median, and that this is "statistically correct". I do not know what "statistically correct" means, and it sounds fuzzy to me. After a few tries I found a way to prove it. I hope some day I will learn a shorter proof.

We assume without loss of generality that the \(y\) values are sorted (\(y_1 \leq y_2 \leq \ldots \leq y_n\)). We are first interested in what happens when \(c < y_1\). We calculate $$ \underset{c<{y_1}}{f(y,c)}=\sum_{i=1}^{n}{|y_i-c|}=\sum_{i=1}^{n}{|y_i-y_1+y_1-c|}=\sum_{i=1}^{n}{\big(|y_i-y_1|+|y_1-c|\big)}=\sum_{i=1}^{n}{|y_i-y_1|}+n|y_1-c| \geq \sum_{i=1}^{n}{|y_i-y_1|}=\underset{c={y_1}}{f(y,c)}$$ (the absolute value splits into a sum because, with the values sorted and \(c<y_1\), both \(y_i-y_1\) and \(y_1-c\) are non-negative). So $$ \underset{c<{y_1}}{f(y,c)} \geq \underset{c={y_1}}{f(y,c)}$$ In a similar manner we show that $$ \underset{c>{y_n}}{f(y,c)} \geq \underset{c={y_n}}{f(y,c)}$$ What we have found is that, to minimize the absolute loss, \(c\) must lie between \(y_1\) and \(y_n\) inclusive. Next we note that when \(c\) is in this interval, the sum of the distances to the two ends is constant: $$ |y_1-c|+|y_n-c|=y_n-y_1 $$ We can therefore drop those two points (\(y_1,y_n\)) from consideration, restrict our interest to the interval \([y_2,y_{n-1}]\), and apply the same process recursively.

Finally we end up with either a single point in the middle (when \(n\) is odd) or an interval containing no other \(y\) values (when \(n\) is even). If we end with a point, that is our solution; if we end with an interval, any point in it is a possible solution. We note that the definition of the median fits these conditions, so the median is a solution which minimizes the absolute loss function.
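The same kind of numerical check works for this proof as well: a minimal sketch in Java (hypothetical odd-sized sample, grid search) showing that the absolute loss is minimized at the median:

    import java.util.Arrays;

    public class AbsoluteLossDemo {

        // absolute loss of a constant prediction c over the sample
        static double absoluteLoss(double[] y, double c) {
            double s = 0;
            for (double v : y) s += Math.abs(v - c);
            return s;
        }

        public static void main(String[] args) {
            double[] y = {1.0, 2.0, 2.5, 7.0, 10.0}; // hypothetical values, n odd
            double[] sorted = y.clone();
            Arrays.sort(sorted);
            double median = sorted[sorted.length / 2];

            // scan candidate constants; the minimum should land on the median
            double bestC = 0, bestLoss = Double.MAX_VALUE;
            for (double c = 0; c <= 11; c += 0.01) {
                double loss = absoluteLoss(y, c);
                if (loss < bestLoss) {
                    bestLoss = loss;
                    bestC = c;
                }
            }
            System.out.printf("median=%.2f argmin=%.2f%n", median, bestC);
        }
    }

With an even number of values the grid search reports the first point of the flat minimum interval, which matches the observation above that any point in that interval is a solution.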

January 8, 2013

Complete rewrite myself

In the past years I have slowly started to rewrite myself. My interests moved to fundamental computer science, data structures, and algorithms. More recently I discovered machine learning, statistics, and data mining. I am completely captivated by these subjects and I feel that this is the way to go for the rest of my life. So, no more posts about programming, technologies, or frameworks.

As a starting point I would like to share a gorgeous and insightful poem, written by the eminent statistician Maurice Kendall, called "Hiawatha Designs an Experiment".

Original source:
Kendall, Maurice (1959). Hiawatha Designs an Experiment. The American Statistician 13: 23-24.

    Hiawatha, mighty hunter,
    He could shoot ten arrows upward,
    Shoot them with such strength and swiftness
    That the last had left the bow-string
    Ere the first to earth descended.
    This was commonly regarded
    As a feat of skill and cunning.
    Several sarcastic spirits
    Pointed out to him, however,
    That it might be much more useful
    If he sometimes hit the target.
    "Why not shoot a little straighter
    And employ a smaller sample?"
    Hiawatha, who at college
    Majored in applied statistics,
    Consequently felt entitled
    To instruct his fellow man
    In any subject whatsoever,
    Waxed exceedingly indignant,
    Talked about the law of errors,
    Talked about truncated normals,
    Talked of loss of information,
    Talked about his lack of bias,
    Pointed out that (in the long run)
    Independent observations,
    Even though they missed the target,
    Had an average point of impact
    Very near the spot he aimed at,
    With the possible exception
    of a set of measure zero.
    "This," they said, "was rather doubtful;
    Anyway it didn't matter.
    What resulted in the long run:
    Either he must hit the target
    Much more often than at present,
    Or himself would have to pay for
    All the arrows he had wasted."
    Hiawatha, in a temper,
    Quoted parts of R. A. Fisher,
    Quoted Yates and quoted Finney,
    Quoted reams of Oscar Kempthorne,
    Quoted Anderson and Bancroft
    (practically in extenso)
    Trying to impress upon them
    That what actually mattered
    Was to estimate the error.
    Several of them admitted:
    "Such a thing might have its uses;
    Still," they said, "he would do better
    If he shot a little straighter."
    Hiawatha, to convince them,
    Organized a shooting contest.
    Laid out in the proper manner
    Of designs experimental
    Recommended in the textbooks,
    Mainly used for tasting tea
    (but sometimes used in other cases)
    Used factorial arrangements
    And the theory of Galois,
    Got a nicely balanced layout
    And successfully confounded
    Second order interactions.
    All the other tribal marksmen,
    Ignorant benighted creatures
    Of experimental setups,
    Used their time of preparation
    Putting in a lot of practice
    Merely shooting at the target.
    Thus it happened in the contest
    That their scores were most impressive
    With one solitary exception.
    This, I hate to have to say it,
    Was the score of Hiawatha,
    Who as usual shot his arrows,
    Shot them with great strength and swiftness,
    Managing to be unbiased,
    Not however with a salvo
    Managing to hit the target.
    "There!" they said to Hiawatha,
    "That is what we all expected."
    Hiawatha, nothing daunted,
    Called for pen and called for paper.
    But analysis of variance
    Finally produced the figures
    Showing beyond all peradventure,
    Everybody else was biased.
    And the variance components
    Did not differ from each other's,
    Or from Hiawatha's.
    (This last point it might be mentioned,
    Would have been much more convincing
    If he hadn't been compelled to
    Estimate his own components
    From experimental plots on
    Which the values all were missing.)
    Still they couldn't understand it,
    So they couldn't raise objections.
    (Which is what so often happens
    with analysis of variance.)
    All the same his fellow tribesmen,
    Ignorant benighted heathens,
    Took away his bow and arrows,
    Said that though my Hiawatha
    Was a brilliant statistician,
    He was useless as a bowman.
    As for variance components
    Several of the more outspoken
    Made primeval observations
    Hurtful to the finer feelings
    Even of the statistician.
    In a corner of the forest
    Sits alone my Hiawatha
    Permanently cogitating
    On the normal law of errors.
    Wondering in idle moments
    If perhaps increased precision
    Might perhaps be sometimes better
    Even at the cost of bias,
    If one could thereby now and then
    Register upon a target.

July 28, 2010

PerforceNB 1.0.5.1 build for 6.9 - Call for help!

I am aware that in this modern world one of the main problems is time. That happens to me also. I do not write here to complain about that; I write this call for only two things.
The first is to thank everybody who has enough patience to try and use this plugin. I wrote this plugin only to give something back to NetBeans and its great community. That alone is enough.
The second is to ask for some help from you, the people who have spent time and nerves using this plugin. If you want to contribute, you are welcome; just ask for the proper rights. If you can't afford the time for that, please file bugs and suggestions, as many as possible. I can't cover everything, and I am sure you have many brilliant and useful ideas to make your life and mine easier when using this plugin.
Finally, thanks to all of you. NetBeans rocks!

http://kenai.com/projects/perforcenb/pages/Home

March 8, 2010

Five things I like most about NetBeans Platform 6.8

NetBeans 6.8 has been out since December 2009. At first it did not look like a very important upgrade, at least not for Platform developers. But after using the new version of the Platform, some things which may appear small at first turn out to be visibly improved. The most important improvements (according to my personal feelings) are listed here. For a complete overview of the changes in the Platform API you can take a look here.

Wrapped libraries
That was one of the major sources of headaches. Developing modules for NetBeans (as for other platforms) sometimes requires linking to other libraries. Third-party libraries, other than the modules provided by the platform, need to be linked and wrapped into your custom plugins or platform-based applications.

In the past, wrapping libraries was done in two ways. The first was to create a module library wrapper plugin for every library which needed to be imported. But when you had to link a bigger library with a lot of jars, that was a pain, because a module library wrapper could be created for only one jar. You could easily end up with twenty or thirty wrapper modules just to link to something like the Apache logging libraries, Seam, or similar.

The second approach was to create one big jar containing all the jars from the library. For this operation NetBeans is helpful, but you have two other problems to solve. The first is that the new jar will have its own manifest; that means any information in the manifest files of the individual jars is lost or incomplete, and if that information is important you have a nice problem to solve. The second is that you can lose track of which libraries and which versions you have, and maintaining upgrades becomes very difficult over time.

The solution is the new wrapped libraries feature. Now each plugin module in NetBeans can have its own dependencies on external libraries. For configuration there is a new tab panel where you can add as many libraries as needed. The cool thing is that you can also specify the source code and javadoc for every wrapped jar, which was not possible in the old versions.

ActionListeners are everywhere

This may not look as important as it is. In the old versions you could use ActionListener only for the always-enabled type of action. Now you can use the same interface for context-aware and callback action types as well. That is really cool, even if it does not sound like it at first. A key aspect here is that a well-known interface is being used.

The first consequence is that if you want to migrate a Swing-based application to the NetBeans Platform, it is easier because you just have to wire up your existing actions. The second consequence is that you don't have to use platform-specific interfaces like cookies. I am not saying that cookies were bad, but they are old and there is now a better, more usable alternative.

Declarative asynchronous actions

That's a small thing, but it makes the code cleaner and more flexible. It also removes the need to manage the asynchronous behavior of your actions yourself. Nicely done.

Enhanced IO API

We already had colors and links, and they were very useful. But now the API is even more flexible.

You can color the output as you like, because you can print with IOColorPrint.print; this lets you build output lines from pieces colored independently. You can also put several hyperlinks on the same line. You can, again, add an importance marker, which improves readability of verbose output. Finally, you can specify the parent of an IOTab and set its icon and tooltip message. All of these are small things, but used together they can be very helpful in creating really effective and friendly output for your application. I very much appreciate an application which lets you know what is going on.
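As an illustration, here is a minimal sketch of the kind of code I mean; the class and method names (IOProvider, IOColorPrint, IOTab in org.openide.windows) are from the enhanced IO API javadoc as I recall them, and the tab name, colors, and tooltip are made up:

    import java.awt.Color;
    import java.io.IOException;
    import org.openide.windows.IOColorPrint;
    import org.openide.windows.IOProvider;
    import org.openide.windows.IOTab;
    import org.openide.windows.InputOutput;

    public class ColoredOutputDemo {

        public static void printStatus() throws IOException {
            InputOutput io = IOProvider.getDefault().getIO("Demo", true);

            // color pieces of the same line independently
            if (IOColorPrint.isSupported(io)) {
                IOColorPrint.print(io, "build: ", Color.GRAY);
                IOColorPrint.print(io, "FAILED", Color.RED);
            }
            io.getOut().println();

            // decorate the output tab itself
            if (IOTab.isSupported(io)) {
                IOTab.setToolTipText(io, "Demo output tab");
            }
            io.getOut().close();
        }
    }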

Annotations

This is not one specific point. Some very useful annotations have been added, but the point is not the specific annotations themselves; it is the trend toward using annotations as much as possible. I hope this trend will continue. I have already used @OptionsPanelController.SubRegistration and @ConvertAsJavaBean, but there are more; take a look.