December 30, 2013

AdaBoost.M1 and SAMME formula simplification

During the implementation of AdaBoost algorithm for rapaio library I studied the relevant papers and some well known implementations of this algorithm. I am talking here about the original proposed algorithm called M1, and its multi-class generalization called SAMME. Both of them are described in the paper which propose SAMME algorithm: Hastie, T., Rosset, S., Zhu, J., & Zou, H. (2009). Multi-class AdaBoost. Statistics and Its Interface, 2(3), 349–360. doi:10.4310/SII.2009.v2.n3.a8.

Perhaps for most of you the description of the algorithm is well-known. However, I will post it here since it helps to understand my idea.

The algorithm is straightforward: use multiple iterations where, on each iteration train a weak-learner (errors less than \((k-1)/k\) ) on the current weights, after which the weights corresponding to mislabeled cases are boosted. In the end all the weak classifiers are used to produce the prediction.

AdaBoost.SAMME

1. Initialize the observation weights \( w_i = \frac{1}{n} \)
2. For m = 1 to M
    a. Fit classifier \(h_m(x)\) using weights \(w_i\)
    b. Compute error \(err_m = \sum_{i=1}^{n}w_i I(c_i \neq h_m(x_i))/\sum_{i=1}^{n}w_i \)
    c. Let \( \alpha_m = log\frac{1-err_m}{err_m} + log(K-1) \)
    d. Set \( w_i \gets w_i exp(\alpha_m I(c_i \neq h_m(x_i)) ) \)
    e. Re-normalize \( w_i \)
3. Output\( C(x) = \arg \max_k \sum_{m=1}^{M} \alpha_m I(h_m(x) = k) \)

When I tried to work out my implementation I found that there is a step which could be simplified.

At step 2.d the weights corresponding to the mis-classified instances are increased. Accordingly, the weights corresponding to the correctly classified instances remains the same. Because of that, the total amount of weight is increased, and is for sure greater than 1. Now, the algorithm works when the total amount of weight is 1, so we have to normalize those quantities (to have a total amount of 1). This is done in step 2.e, which basically contains, from the point of view of the implementation, one iteration to find the total amount and one iteration for the division of each weight.

So, to summarize steps 2.d and 2.e we have weights summed up to 1, some of them are increased by a factor, and in the end all the weights are normalized to sum up again to 1. While studying this, I wondered if is possible to update all the weights \(w_i\) in a single iteration. And there is a clean way to do that.

Our initial total weight mass is 1, we denote that by :
$$T_0 = \sum_{i=1}^{n}w_i = 1$$
We update all weights corresponding to misclassified instances with \(\alpha_m\), the other weights remain the same. We denote the resulting total weight mass by:
$$T_1 = \sum_{i=1}^{n}w_i*\alpha_m*(K-1)*I(h_m(x_i) \neq c_i) + \sum_{i=1}^{n}w_i*I(h_m(x_i)=c_i)$$
In order to have a total weight mass equal to 1, we have to scale the whole equation of \(T_1\) with it's own value. But which is the value of \(T_1\)?

We note that \(\sum_{i=1}^{n}w_i*I(h_m(x_i) \neq c_i \) equals by definition with \(err_m\) (see step 2.b of the algorithm). Accordingly \(\sum_{i=1}^{n}w_i*I(h_m(x_i)=c_i) = 1 - err_m\). Now we have:

$$ T_1 = err*\alpha_m*(K-1) + 1 - err_m = err_m * \frac{(1 -err_m)*(K-1)}{err_m} + 1 - err_m = K(1-err_m) $$
Using this scaling factor we replace steps 2.d and 2.e with a new one:

\( w_i \gets w_i /(K err_m) \) if \( h_m(x_i) \neq c_i \)

\( w_i \gets w_i / K(1-err_m) \) if \( h_m(x_i) = c_i \)


If we note also that \( \sum_{i=1}^{n}w_i = 1 \), we simplify 2.b, also. And we have finally a new description of the SAMME algorithm:

AdaBoost.SAMME.2

1. Initialize the observation weights \( w_i = \frac{1}{n} \)
2. For m = 1 to M
    a. Fit classifier \(h_m(x)\) using weights \(w_i\)
    b. Compute error \(err_m = \sum_{i=1}^{n}w_i I(c_i \neq h_m(x_i)) \)
    c. Let \( \alpha_m = log\frac{1-err_m}{err_m} + log(K-1) \)
    d. Set \( w_i \gets w_i /(K err_m) \) if \( h_m(x_i) \neq c_i\). Set \( w_i \gets w_i / K(1-err_m) \) if \( h_m(x_i) = c_i \)
3. Output\( C(x) = \arg \max_k \sum_{m=1}^{M} \alpha_m I(h_m(x) = k) \)


Notes

Worth noting that the original formulation of the algorithm express much clearer the mechanism of AdaBoost. The improvement is not great. It is M*n*2, in execution time, which might be dominated by the execution time of the weak learner if it is greater than linear time (it is not always the case, often decision stumps or onerule are used with AdaBoost).

Another thing worth noting is that I found this simplification in no other implementation I studied (for sure this is not how it's implemented in scikit learn ( see adaboost source here, and also is not implemented in weka). A potential reason would be that the simplified variant might lead to unstable weights with more iterations due to numerical computation. One might argue that periodical normalization will keep the inherent errors under a certain limit. My belief is that periodical normalization will keep eventually the total amount of weight near 1, but this does not offer by itself any guarantee on the computed values of the individual weights. However, some tests could be realized. I will leave that for the next time.






December 27, 2013

Andrica's Conjecture

I've always loved to stare at prime numbers. Probably that happens because I really do not understand their properties. However, taking into consideration the limited capacity of my own skull, I like at least to try to understand what others, much more capable than me, have to say about prime numbers.

And taking about others, I tried these days to find what other Romanian fellows have to say about this topic. And one interesting thing I found is called Andrica's Conjecture.

Andrica's Conjecture

Prof. Univ. Dr. Dorin Andrica works at the Babeş-Bolyai University, Cluj, Romania where he is a full professor within the Faculty of Mathematics and Computer Science.

This is a list of books from amazon.com which contains his name: list.

Among them, there is a book co-authored by the regretted Ioan Cucurezeanu, which was also my own Arithmetics professor. My only professor who put me to take the same exam nine times in a row (for the whole sumer holiday).

Andrica's Conjecture definition can be found here: http://planetmath.org/AndricasConjecture.

It simply states that: Given the n-th prime \(p_{n}\), it is always the case that \(1 > \sqrt{p_{n+1}}-\sqrt{p_{n}}\).

The conjecture is unproved until now. The description states that the conjecture has been proven by computers up to n = 10^5. This looks odd since the vector of primes which I have is build with a trivial version of a sieve and has 50847534 prime numbers.

As always I try to use my rapaio tools, so plotting a line with the Andrica's gap function is easy like

    private void drawAndrica(Vector primes, int size) {
        Frame m = Frames.newMatrixFrame(size, new String[]{"index", "gap"});
        for (int i = 0; i < m.getRowCount(); i++) {
            m.setValue(i, "index", i + 1);
            m.setValue(i, "gap", sqrt(primes.getValue(i + 1)) - sqrt(primes.getValue(i)));
        }
        draw(new Plot()
                .add(new Lines(m.getCol("index"), m.getCol("gap")))
                .add(new ABLine(0, true))
                .setYRange(0, 0.7),
                700, 300);
    }

First we plot the function for the first 500 values.

graphics

What is already been discovered is that the largest known difference in this function happens for a small n, aka for n=4.

The plotted function looks like a decreasing function (not monotone, but decreasing on average). The question is, if we plot many values the tendency remains the same?

graphics

This are all the prime numbers I have at hand. There are more than 50 millions. And the graph shows the tendency is to decrease on average. Which is really interesting. This is not a proof, of course, but one could find reasonable to expect the conjecture to be true. Nice work Dr. Andrica!