December 30, 2013

AdaBoost.M1 and SAMME formula simplification

During the implementation of AdaBoost algorithm for rapaio library I studied the relevant papers and some well known implementations of this algorithm. I am talking here about the original proposed algorithm called M1, and its multi-class generalization called SAMME. Both of them are described in the paper which propose SAMME algorithm: Hastie, T., Rosset, S., Zhu, J., & Zou, H. (2009). Multi-class AdaBoost. Statistics and Its Interface, 2(3), 349–360. doi:10.4310/SII.2009.v2.n3.a8.

Perhaps for most of you the description of the algorithm is well-known. However, I will post it here since it helps to understand my idea.

The algorithm is straightforward: use multiple iterations where, on each iteration train a weak-learner (errors less than \((k-1)/k\) ) on the current weights, after which the weights corresponding to mislabeled cases are boosted. In the end all the weak classifiers are used to produce the prediction.

AdaBoost.SAMME

1. Initialize the observation weights \( w_i = \frac{1}{n} \)
2. For m = 1 to M
    a. Fit classifier \(h_m(x)\) using weights \(w_i\)
    b. Compute error \(err_m = \sum_{i=1}^{n}w_i I(c_i \neq h_m(x_i))/\sum_{i=1}^{n}w_i \)
    c. Let \( \alpha_m = log\frac{1-err_m}{err_m} + log(K-1) \)
    d. Set \( w_i \gets w_i exp(\alpha_m I(c_i \neq h_m(x_i)) ) \)
    e. Re-normalize \( w_i \)
3. Output\( C(x) = \arg \max_k \sum_{m=1}^{M} \alpha_m I(h_m(x) = k) \)

When I tried to work out my implementation I found that there is a step which could be simplified.

At step 2.d the weights corresponding to the mis-classified instances are increased. Accordingly, the weights corresponding to the correctly classified instances remains the same. Because of that, the total amount of weight is increased, and is for sure greater than 1. Now, the algorithm works when the total amount of weight is 1, so we have to normalize those quantities (to have a total amount of 1). This is done in step 2.e, which basically contains, from the point of view of the implementation, one iteration to find the total amount and one iteration for the division of each weight.

So, to summarize steps 2.d and 2.e we have weights summed up to 1, some of them are increased by a factor, and in the end all the weights are normalized to sum up again to 1. While studying this, I wondered if is possible to update all the weights \(w_i\) in a single iteration. And there is a clean way to do that.

Our initial total weight mass is 1, we denote that by :
$$T_0 = \sum_{i=1}^{n}w_i = 1$$
We update all weights corresponding to misclassified instances with \(\alpha_m\), the other weights remain the same. We denote the resulting total weight mass by:
$$T_1 = \sum_{i=1}^{n}w_i*\alpha_m*(K-1)*I(h_m(x_i) \neq c_i) + \sum_{i=1}^{n}w_i*I(h_m(x_i)=c_i)$$
In order to have a total weight mass equal to 1, we have to scale the whole equation of \(T_1\) with it's own value. But which is the value of \(T_1\)?

We note that \(\sum_{i=1}^{n}w_i*I(h_m(x_i) \neq c_i \) equals by definition with \(err_m\) (see step 2.b of the algorithm). Accordingly \(\sum_{i=1}^{n}w_i*I(h_m(x_i)=c_i) = 1 - err_m\). Now we have:

$$ T_1 = err*\alpha_m*(K-1) + 1 - err_m = err_m * \frac{(1 -err_m)*(K-1)}{err_m} + 1 - err_m = K(1-err_m) $$
Using this scaling factor we replace steps 2.d and 2.e with a new one:

\( w_i \gets w_i /(K err_m) \) if \( h_m(x_i) \neq c_i \)

\( w_i \gets w_i / K(1-err_m) \) if \( h_m(x_i) = c_i \)


If we note also that \( \sum_{i=1}^{n}w_i = 1 \), we simplify 2.b, also. And we have finally a new description of the SAMME algorithm:

AdaBoost.SAMME.2

1. Initialize the observation weights \( w_i = \frac{1}{n} \)
2. For m = 1 to M
    a. Fit classifier \(h_m(x)\) using weights \(w_i\)
    b. Compute error \(err_m = \sum_{i=1}^{n}w_i I(c_i \neq h_m(x_i)) \)
    c. Let \( \alpha_m = log\frac{1-err_m}{err_m} + log(K-1) \)
    d. Set \( w_i \gets w_i /(K err_m) \) if \( h_m(x_i) \neq c_i\). Set \( w_i \gets w_i / K(1-err_m) \) if \( h_m(x_i) = c_i \)
3. Output\( C(x) = \arg \max_k \sum_{m=1}^{M} \alpha_m I(h_m(x) = k) \)


Notes

Worth noting that the original formulation of the algorithm express much clearer the mechanism of AdaBoost. The improvement is not great. It is M*n*2, in execution time, which might be dominated by the execution time of the weak learner if it is greater than linear time (it is not always the case, often decision stumps or onerule are used with AdaBoost).

Another thing worth noting is that I found this simplification in no other implementation I studied (for sure this is not how it's implemented in scikit learn ( see adaboost source here, and also is not implemented in weka). A potential reason would be that the simplified variant might lead to unstable weights with more iterations due to numerical computation. One might argue that periodical normalization will keep the inherent errors under a certain limit. My belief is that periodical normalization will keep eventually the total amount of weight near 1, but this does not offer by itself any guarantee on the computed values of the individual weights. However, some tests could be realized. I will leave that for the next time.






4 comments:

  1. Hi Aurelian,
    Could I possibly get in touch with you for a GentleBoost implementation? I'm the same person who posted this question, and I'm very grateful for your consistent help here: http://stats.stackexchange.com/questions/93691/what-does-it-mean-to-fit-a-regression-function-and-then-use-it-to-update-other/93693?noredirect=1#comment187991_93693

    I just want to understand a couple of more things, but the explanation of my question is getting too lengthy for the comments on that question. Could I please email or chat with you? I'm really rather stuck in this project and from the various forums I've been posting on, you do seem to be the most qualified in terms of statistics, especially since you've worked on a very closely related thing! (AdaBoost SAMME).

    My questions are more conceptual and related to general implementation and understanding, they're not related to python. So I really hope you wouldn't mind helping a bit more! Thank you so much...

    ReplyDelete
  2. Of course I want to work on that. My email address is padreati at yahoo dot com. I am not sure if I will be able to talk online since I do not own entirely my time, however for an off line long discussion it will work.
    So, what I am thinking about is to implement myself gentle boost in order to have something to put on table. However you can write me in the meantime.

    ReplyDelete
  3. Hi Aurelian,
    Excuse me i'm new with the Adaboost concept. I have some questions:
    1) What do you mean by "I" in I(ci≠hm(xi))?
    2) Do we need to multiply the observations by their weights at first before diving into the loop (step2) ?
    Thanks

    ReplyDelete
  4. Hi. Thank you for reading, first.
    1. I comes from identity function. It is 1 if the category of i-th observation is equal with what the classifier hm produces for i-th observation, 0 otherwise. In plain English, it is 1 if the classifier works correctly for that observation, 0 otherwise.
    2. It depends on how the classifier uses the weights, it is an implementation detail of the hm classifier. For trees, during computation of the impurity function (test function) it is true that this is done by multiplication.
    Best regards,
    Aurelian

    ReplyDelete