January 16, 2013

Another formula for variance

Variance is the statistic which describes the spread of values around the mean. Suppose we have a sample \( x = (x_1,x_2,\ldots,x_n) \). The classical formula is self-explanatory:
$$ Var(x) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n} $$
Expanding the first formula:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
= \frac{1}{n}\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)
= \frac{1}{n}\sum_{i=1}^{n}\bigg(x_i^2-2x_i\frac{\sum_{j=1}^{n}x_j}{n}+\bigg[\frac{\sum_{j=1}^{n}x_j}{n}\bigg]^2\bigg) \\
= \frac{1}{n}\sum_{i=1}^{n}x_i^2 -\frac{2}{n^2}\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j+\frac{1}{n^3}\sum_{i=1}^{n}\bigg[\sum_{j=1}^{n}x_j\bigg]^2 \\
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{2}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2+\frac{n}{n^3}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
$$
This leads to another well-known formula for the variance:
$$ Var(x) = \frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2 $$
There are other formulas for the variance as well, more or less self-explanatory. While playing with these terms one evening I also found a formula for the variance which I have not seen described anywhere else (at least not that I know of, and I have the strong excuse of being a freshman in this field). Multiplying and dividing by \(2n\), and renaming the dummy index in half the terms, gives:
$$ Var(x)=\frac{1}{n}\sum_{i=1}^{n}x_i^2-\frac{1}{n^2}\bigg[\sum_{i=1}^{n}x_i\bigg]^2
= \frac{1}{2n^2}\bigg[n\sum_{i=1}^{n}x_i^2 - 2\sum_{i=1}^{n}x_i\sum_{j=1}^{n}x_j + n\sum_{j=1}^{n}x_j^2 \bigg] \\
= \frac{1}{2n^2}\bigg[\sum_{i=1}^{n}\sum_{j=1}^{n}x_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n}x_ix_j + \sum_{i=1}^{n}\sum_{j=1}^{n}x_j^2 \bigg]
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i^2 - 2x_ix_j + x_j^2)
= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
Finally, the promised new formula is:
$$ Var(x) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2 $$
or, since the diagonal terms with \(i=j\) vanish and the sum is symmetric in \(i\) and \(j\), simplified a little more:
$$ Var(x) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=i}^{n}(x_i-x_j)^2 $$
This formula is clearly not feasible from a computational point of view: calculating it takes \(O(n^2)\) running time. What I really found interesting about this formula is its form. In plain English it can be translated as: "the variance is the average of the squared differences between all pairs of values".
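As a quick sanity check, here is a minimal sketch in Java (with a made-up sample) which computes the variance with the classical formula, the sum-of-squares formula, and the pairwise formula, and verifies that all three agree:

    public class VarianceDemo {
        public static void main(String[] args) {
            double[] x = {2, 4, 4, 4, 5, 5, 7, 9}; // hypothetical sample
            int n = x.length;

            // classical formula: mean of squared deviations from the mean
            double sum = 0;
            for (double v : x) sum += v;
            double mean = sum / n;
            double classical = 0;
            for (double v : x) classical += (v - mean) * (v - mean);
            classical /= n;

            // second formula: (1/n) sum(x_i^2) - (1/n^2) (sum(x_i))^2
            double sumSq = 0;
            for (double v : x) sumSq += v * v;
            double second = sumSq / n - (sum * sum) / ((double) n * n);

            // pairwise formula: average squared difference over all pairs, O(n^2)
            double pairwise = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    pairwise += (x[i] - x[j]) * (x[i] - x[j]);
            pairwise /= 2.0 * n * n;

            System.out.printf("classical=%f second=%f pairwise=%f%n",
                    classical, second, pairwise);
        }
    }

For this sample all three print the same value (4), as expected.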
The formula also gives a nice geometrical or spatial view of variance. Another idea which can be derived from it is that the sample variance gets closer to the population variance as the sample grows, and this formula makes that idea more intuitive. I imagine the trust one can put in the sample variance as a predictor of the population variance as the density of links between the sample values (where those links are the squared differences between the values): the more values in the sample, the denser the web of links.
I have a friend who knows statistics much better than I do, and he found this formula interesting, not least because at first glance he declared it incorrect. Well, it seems he was wrong this time.

January 10, 2013

Approximate with a constant value

When we do regression with a decision tree like CART, part of the learning bias is the fact that we choose to approximate the values from a region with a single constant. Basically, what a decision tree does in the case of regression is to recursively split the p-dimensional input space into regions. Then, for each region, it selects the corresponding instances and, using the target variable values from that subset, calculates a constant value used later for prediction.

How CART and its friends work with regard to the splitting process is not the subject here. What I want to talk about is how to find a proper constant value for a region. To reformulate in simpler terms: given a set of values \(y=(y_1,y_2,\ldots,y_n)\), I am interested in how to calculate a constant value usable for prediction. To obtain it we use a loss function, which is a function that establishes a criterion for choosing the proper constant value.

I was interested in two very often used loss functions: square loss and absolute loss.

Square loss

The square loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}(y_i-c)^2 $$ and what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{(y_i-c)^2}$$ There is a simple way to find it: the loss is a sum of squares, hence a convex quadratic in \(c\), so it has a minimum, and that minimum is attained where the derivative equals 0. $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg]=0 $$ We calculate $$ \frac{\partial}{\partial{c}}\bigg[\sum_{i=1}^{n}{(y_i-c)^2}\bigg] = \sum_{i=1}^{n}{\frac{\partial}{\partial{c}}{\big[(y_i-c)^2\big]}} = \sum_{i=1}^{n}{2(y_i-c)\frac{\partial}{\partial{c}}(y_i-c)} = \sum_{i=1}^{n}{-2(y_i-c)} = 0 $$ $$ \sum_{i=1}^{n}{y_i} = \sum_{i=1}^{n}{c} = nc $$ $$ c = \frac{\sum_{i=1}^{n}{y_i}}{n} = \overline{y} $$
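As a quick numerical check of this result, here is a minimal sketch in Java (hypothetical sample, brute-force grid search instead of the analytic argument above) showing that the square loss is minimized at the mean:

    import java.util.Arrays;

    public class SquareLossDemo {

        // square loss of a constant prediction c over the sample
        static double squareLoss(double[] y, double c) {
            double s = 0;
            for (double v : y) s += (v - c) * (v - c);
            return s;
        }

        public static void main(String[] args) {
            double[] y = {1.0, 2.0, 2.5, 7.0, 10.0}; // hypothetical values
            double mean = Arrays.stream(y).average().getAsDouble();

            // scan candidate constants; the minimum should land on the mean
            double bestC = 0, bestLoss = Double.MAX_VALUE;
            for (double c = 0; c <= 11; c += 0.01) {
                double loss = squareLoss(y, c);
                if (loss < bestLoss) {
                    bestLoss = loss;
                    bestC = c;
                }
            }
            System.out.printf("mean=%.2f argmin=%.2f%n", mean, bestC);
        }
    }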

Absolute loss

The absolute loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}|y_i-c| $$ Again, what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{|y_i-c|}$$ The function is non-negative and piecewise linear in \(c\), so it has a minimum somewhere, but this time we can't simply set a derivative to zero, since the absolute value is not differentiable everywhere. I read somewhere that \(c\) is the median, and that this is "statistically correct". I do not know what "statistically correct" means, and it sounds fuzzy to me. After a few tries I found a way to prove it. I hope some day I will learn a shorter proof.

We assume without loss of generality that the \(y\) values are sorted (\(y_1 \leq y_2 \leq \ldots \leq y_n\)). We are first interested in what happens when \(c < y_1\). We calculate $$ \underset{c<{y_1}}{f(y,c)}=\sum_{i=1}^{n}{|y_i-c|}=\sum_{i=1}^{n}{|y_i-y_1+y_1-c|}=\sum_{i=1}^{n}{\big(|y_i-y_1|+|y_1-c|\big)}=\sum_{i=1}^{n}{|y_i-y_1|}+n|y_1-c| \geq \sum_{i=1}^{n}{|y_i-y_1|}=\underset{c={y_1}}{f(y,c)}$$ (the absolute value splits into a sum because, with the values sorted and \(c<y_1\), both \(y_i-y_1\) and \(y_1-c\) are non-negative). So $$ \underset{c<{y_1}}{f(y,c)} \geq \underset{c={y_1}}{f(y,c)}$$ In a similar manner we show that $$ \underset{c>{y_n}}{f(y,c)} \geq \underset{c={y_n}}{f(y,c)}$$ What we have found is that, to minimize the absolute loss, \(c\) must lie between \(y_1\) and \(y_n\) inclusive. Next we note that when \(c\) is in this interval, the sum of the distances to the two ends is constant: $$ |y_1-c|+|y_n-c|=y_n-y_1 $$ We can therefore drop those two points (\(y_1,y_n\)) from consideration, restrict our interest to the interval \([y_2,y_{n-1}]\), and apply the same process recursively.

Finally we end up with either a single point in the middle (when \(n\) is odd) or an interval containing no other \(y\) values (when \(n\) is even). If we end with a point, that is our solution; if we end with an interval, any point in it is a possible solution. We note that the definition of the median fits these conditions, so the median is a solution which minimizes the absolute loss function.
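The same kind of numerical check works for this proof as well: a minimal sketch in Java (hypothetical odd-sized sample, grid search) showing that the absolute loss is minimized at the median:

    import java.util.Arrays;

    public class AbsoluteLossDemo {

        // absolute loss of a constant prediction c over the sample
        static double absoluteLoss(double[] y, double c) {
            double s = 0;
            for (double v : y) s += Math.abs(v - c);
            return s;
        }

        public static void main(String[] args) {
            double[] y = {1.0, 2.0, 2.5, 7.0, 10.0}; // hypothetical values, n odd
            double[] sorted = y.clone();
            Arrays.sort(sorted);
            double median = sorted[sorted.length / 2];

            // scan candidate constants; the minimum should land on the median
            double bestC = 0, bestLoss = Double.MAX_VALUE;
            for (double c = 0; c <= 11; c += 0.01) {
                double loss = absoluteLoss(y, c);
                if (loss < bestLoss) {
                    bestLoss = loss;
                    bestC = c;
                }
            }
            System.out.printf("median=%.2f argmin=%.2f%n", median, bestC);
        }
    }

With an even number of values the grid search reports the first point of the flat minimum interval, which matches the observation above that any point in that interval is a solution.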

January 8, 2013

Complete rewrite myself

In the past years I have slowly started to rewrite myself. My interests moved to fundamental computer science, data structures, and algorithms. More recently I discovered machine learning, statistics, and data mining. I am completely captivated by these subjects and I feel that this is the way to go for the rest of my life. So, no more posts about programming, technologies, or frameworks.

As a starting point I would like to share a gorgeous and insightful poem, written by the eminent statistician Maurice Kendall, called "Hiawatha Designs an Experiment".

Original source:
Kendall, Maurice (1959). Hiawatha Designs an Experiment. The American Statistician 13: 23-24.

    Hiawatha, mighty hunter,
    He could shoot ten arrows upward,
    Shoot them with such strength and swiftness
    That the last had left the bow-string
    Ere the first to earth descended.
    This was commonly regarded
    As a feat of skill and cunning.
    Several sarcastic spirits
    Pointed out to him, however,
    That it might be much more useful
    If he sometimes hit the target.
    "Why not shoot a little straighter
    And employ a smaller sample?"
    Hiawatha, who at college
    Majored in applied statistics,
    Consequently felt entitled
    To instruct his fellow man
    In any subject whatsoever,
    Waxed exceedingly indignant,
    Talked about the law of errors,
    Talked about truncated normals,
    Talked of loss of information,
    Talked about his lack of bias,
    Pointed out that (in the long run)
    Independent observations,
    Even though they missed the target,
    Had an average point of impact
    Very near the spot he aimed at,
    With the possible exception
    of a set of measure zero.
    "This," they said, "was rather doubtful;
    Anyway it didn't matter.
    What resulted in the long run:
    Either he must hit the target
    Much more often than at present,
    Or himself would have to pay for
    All the arrows he had wasted."
    Hiawatha, in a temper,
    Quoted parts of R. A. Fisher,
    Quoted Yates and quoted Finney,
    Quoted reams of Oscar Kempthorne,
    Quoted Anderson and Bancroft
    (practically in extenso)
    Trying to impress upon them
    That what actually mattered
    Was to estimate the error.
    Several of them admitted:
    "Such a thing might have its uses;
    Still," they said, "he would do better
    If he shot a little straighter."
    Hiawatha, to convince them,
    Organized a shooting contest.
    Laid out in the proper manner
    Of designs experimental
    Recommended in the textbooks,
    Mainly used for tasting tea
    (but sometimes used in other cases)
    Used factorial arrangements
    And the theory of Galois,
    Got a nicely balanced layout
    And successfully confounded
    Second order interactions.
    All the other tribal marksmen,
    Ignorant benighted creatures
    Of experimental setups,
    Used their time of preparation
    Putting in a lot of practice
    Merely shooting at the target.
    Thus it happened in the contest
    That their scores were most impressive
    With one solitary exception.
    This, I hate to have to say it,
    Was the score of Hiawatha,
    Who as usual shot his arrows,
    Shot them with great strength and swiftness,
    Managing to be unbiased,
    Not however with a salvo
    Managing to hit the target.
    "There!" they said to Hiawatha,
    "That is what we all expected."
    Hiawatha, nothing daunted,
    Called for pen and called for paper.
    But analysis of variance
    Finally produced the figures
    Showing beyond all peradventure,
    Everybody else was biased.
    And the variance components
    Did not differ from each other's,
    Or from Hiawatha's.
    (This last point it might be mentioned,
    Would have been much more convincing
    If he hadn't been compelled to
    Estimate his own components
    From experimental plots on
    Which the values all were missing.)
    Still they couldn't understand it,
    So they couldn't raise objections.
    (Which is what so often happens
    with analysis of variance.)
    All the same his fellow tribesmen,
    Ignorant benighted heathens,
    Took away his bow and arrows,
    Said that though my Hiawatha
    Was a brilliant statistician,
    He was useless as a bowman.
    As for variance components
    Several of the more outspoken
    Made primeval observations
    Hurtful to the finer feelings
    Even of the statistician.
    In a corner of the forest
    Sits alone my Hiawatha
    Permanently cogitating
    On the normal law of errors.
    Wondering in idle moments
    If perhaps increased precision
    Might perhaps be sometimes better
    Even at the cost of bias,
    If one could thereby now and then
    Register upon a target.

July 28, 2010

PerforceNB 1.0.5.1 build for 6.9 - Call for help!

I am aware that in this modern world one of the main problems is time. That happens to me also. I do not write here to complain about that; I write this call for only two things.
The first is to thank everybody who has enough patience to try and use this plugin. I wrote this plugin only to give something back to NetBeans and its great community. That alone is enough.
The second is to ask for some help from you, the people who have spent time and nerves using this plugin. If you want to contribute, you are welcome; just ask for the proper rights. If you can't afford the time for that, please file bugs and suggestions, as many as possible. I can't cover everything, and I am sure you have many brilliant and useful ideas to make your life and mine easier when using this plugin.
Finally, thanks to all of you. NetBeans rocks!

http://kenai.com/projects/perforcenb/pages/Home

March 8, 2010

Five things I like most about NetBeans Platform 6.8

NetBeans 6.8 has been out since December 2009. At first it did not look like a very important upgrade, at least not for Platform developers. But after using the new version of the Platform, some things which may appear small at first turn out to be visibly improved. The most important improvements (according to my personal feelings) are listed here. For a complete overview of the changes in the Platform API you can take a look here.

Wrapped libraries
That was one of the major sources of headaches. Developing modules for NetBeans (as for other platforms) sometimes requires linking to other libraries. Third-party libraries, other than the modules provided by the platform, need to be linked and wrapped into your custom plugins or platform-based applications.

In the past, wrapping libraries was done in two ways. The first was to create a module library wrapper plugin for every library which needed to be imported. But when you had to link a bigger library with a lot of jars, that was a pain, because a module library wrapper could be created for only one jar. You could easily end up with twenty or thirty wrapper modules just to link to something like the Apache logging libraries, Seam, or similar.

The second approach was to create one big jar containing all the jars from the library. For this operation NetBeans is helpful, but you have two other problems to solve. The first is that the new jar will have its own manifest; that means any information in the manifest files of the individual jars is lost or incomplete, and if that information is important you have a nice problem to solve. The second is that you can lose track of which libraries and which versions you have, and maintaining upgrades becomes very difficult over time.

The solution is the new wrapped libraries feature. Now each plugin module in NetBeans can have its own dependencies on external libraries. For configuration there is a new tab panel where you can add as many libraries as needed. The cool thing is that you can also specify the source code and javadoc for every wrapped jar, which was not possible in the old versions.

ActionListeners are everywhere

This may not look as important as it is. In the old versions you could use ActionListener only for the always-enabled type of action. Now you can use the same interface for context-aware and callback action types as well. That is really cool, even if it does not sound like it at first. A key aspect here is that a well-known interface is being used.

The first consequence is that if you want to migrate a Swing-based application to the NetBeans Platform, it is easier because you just have to wire up your existing actions. The second consequence is that you don't have to use platform-specific interfaces like cookies. I am not saying that cookies were bad, but they are old and there is now a better, more usable alternative.

Declarative asynchronous actions

That's a small thing, but it makes the code cleaner and more flexible. It also removes the need to manage the asynchronous behavior of your actions yourself. Nicely done.

Enhanced IO API

We already had colors and links, and they were very useful. But now the API is even more flexible.

You can color the output as you like, because you can print with IOColorPrint.print; this lets you build output lines from pieces colored independently. You can also put several hyperlinks on the same line. You can, again, add an importance marker, which improves readability of verbose output. Finally, you can specify the parent of an IOTab and set its icon and tooltip message. All of these are small things, but used together they can be very helpful in creating really effective and friendly output for your application. I very much appreciate an application which lets you know what is going on.
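As an illustration, here is a minimal sketch of the kind of code I mean; the class and method names (IOProvider, IOColorPrint, IOTab in org.openide.windows) are from the enhanced IO API javadoc as I recall them, and the tab name, colors, and tooltip are made up:

    import java.awt.Color;
    import java.io.IOException;
    import org.openide.windows.IOColorPrint;
    import org.openide.windows.IOProvider;
    import org.openide.windows.IOTab;
    import org.openide.windows.InputOutput;

    public class ColoredOutputDemo {

        public static void printStatus() throws IOException {
            InputOutput io = IOProvider.getDefault().getIO("Demo", true);

            // color pieces of the same line independently
            if (IOColorPrint.isSupported(io)) {
                IOColorPrint.print(io, "build: ", Color.GRAY);
                IOColorPrint.print(io, "FAILED", Color.RED);
            }
            io.getOut().println();

            // decorate the output tab itself
            if (IOTab.isSupported(io)) {
                IOTab.setToolTipText(io, "Demo output tab");
            }
            io.getOut().close();
        }
    }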

Annotations

This is not one specific point. Some very useful annotations have been added, but the point is not the specific annotations themselves; it is the trend toward using annotations as much as possible. I hope this trend will continue. I have already used @OptionsPanelController.SubRegistration and @ConvertAsJavaBean, but there are more; take a look.