Perhaps for most of you the description of the algorithm is well known; still, I will post it here, since it helps to understand my idea.
The algorithm is straightforward: iterate, and on each iteration train a weak learner (one with an error rate less than \((K-1)/K\)) on the current weights, then boost the weights of the misclassified cases. In the end, all the weak classifiers are combined to produce the prediction.
AdaBoost.SAMME
1. Initialize the observation weights \( w_i = \frac{1}{n} \)
2. For m = 1 to M
a. Fit classifier \(h_m(x)\) using weights \(w_i\)
b. Compute error \(err_m = \sum_{i=1}^{n}w_i I(c_i \neq h_m(x_i))/\sum_{i=1}^{n}w_i \)
c. Let \( \alpha_m = \log\frac{1-err_m}{err_m} + \log(K-1) \)
d. Set \( w_i \gets w_i \exp(\alpha_m I(c_i \neq h_m(x_i))) \)
e. Re-normalize \( w_i \)
3. Output \( C(x) = \arg \max_k \sum_{m=1}^{M} \alpha_m I(h_m(x) = k) \)
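To make the listing concrete, here is a minimal self-contained sketch of the algorithm above in Python/NumPy. The `Stump` weak learner and all function names are my own illustrative choices, not taken from any library, and labels are assumed to be integers `0..K-1`; real implementations such as scikit-learn's are organized differently.

```python
import numpy as np

class Stump:
    """Weak learner: a weighted decision stump (one threshold on one
    feature, weighted-majority class on each side)."""
    def fit(self, X, y, w):
        classes = np.unique(y)
        best = np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                cl = max(classes, key=lambda c: w[left & (y == c)].sum())
                cr = max(classes, key=lambda c: w[~left & (y == c)].sum())
                err = w[(left & (y != cl)) | (~left & (y != cr))].sum()
                if err < best:
                    best, self.j, self.t, self.cl, self.cr = err, j, t, cl, cr
        return self

    def predict(self, X):
        return np.where(X[:, self.j] <= self.t, self.cl, self.cr)

def samme_fit(X, y, M=50, K=None):
    """AdaBoost.SAMME, steps 1 and 2 of the listing above."""
    n = len(y)
    K = K or len(np.unique(y))
    w = np.full(n, 1.0 / n)                  # step 1: uniform weights
    stumps, alphas = [], []
    for m in range(M):
        h = Stump().fit(X, y, w)             # step 2.a
        miss = h.predict(X) != y
        err = w[miss].sum() / w.sum()        # step 2.b
        if err <= 0 or err >= (K - 1) / K:   # weak-learner requirement
            break
        alpha = np.log((1 - err) / err) + np.log(K - 1)  # step 2.c
        w = w * np.exp(alpha * miss)         # step 2.d: boost misclassified
        w = w / w.sum()                      # step 2.e: re-normalize
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def samme_predict(X, stumps, alphas, K):
    """Step 3: alpha-weighted vote over the weak classifiers."""
    votes = np.zeros((len(X), K))
    for h, a in zip(stumps, alphas):
        votes[np.arange(len(X)), h.predict(X)] += a
    return votes.argmax(axis=1)
```

On a small separable toy set, a handful of boosting rounds is enough for the ensemble to fit the training data.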
When I tried to work out my implementation I found that there is a step which could be simplified.
At step 2.d the weights of the misclassified instances are increased, while the weights of the correctly classified instances remain the same. Because of that, the total weight mass grows and is certainly greater than 1. The algorithm works when the total weight mass is 1, so we have to normalize the weights (so that they again total 1). This is done in step 2.e, which, from the point of view of the implementation, takes one pass over the weights to compute the total and a second pass to divide each weight by it.
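In NumPy terms, steps 2.d and 2.e look like the snippet below. The concrete weights, mask, and \(\alpha_m\) value are made-up numbers for one hypothetical boosting round, chosen only to make the passes visible.

```python
import numpy as np

# Hypothetical values for one round: current weights (summing to 1),
# the misclassification mask, and alpha_m from step 2.c.
w = np.array([0.25, 0.25, 0.25, 0.25])
miss = np.array([True, False, False, True])
alpha = np.log(3.0)            # example value of alpha_m

# Step 2.d: boost the weights of the misclassified instances (one pass).
w = w * np.exp(alpha * miss)
# Step 2.e: re-normalize -- one pass to sum, one pass to divide.
total = w.sum()
w = w / total
```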
So, to summarize steps 2.d and 2.e: we start with weights summing to 1, some of them are increased by a factor, and in the end all the weights are normalized so that they sum to 1 again. While studying this, I wondered if it is possible to update all the weights \(w_i\) in a single pass. And there is a clean way to do that.
Our initial total weight mass is 1; we denote it by:
$$T_0 = \sum_{i=1}^{n}w_i = 1$$
We multiply each weight corresponding to a misclassified instance by \( e^{\alpha_m} = \frac{1-err_m}{err_m}(K-1) \); the other weights remain the same. We denote the resulting total weight mass by:
$$T_1 = \sum_{i=1}^{n}w_i\frac{1-err_m}{err_m}(K-1)I(h_m(x_i) \neq c_i) + \sum_{i=1}^{n}w_i I(h_m(x_i)=c_i)$$
In order to bring the total weight mass back to 1, we have to divide the whole expression for \(T_1\) by its own value. But what is the value of \(T_1\)?
We note that \(\sum_{i=1}^{n}w_i I(h_m(x_i) \neq c_i)\) equals, by definition, \(err_m\) (see step 2.b of the algorithm). Accordingly, \(\sum_{i=1}^{n}w_i I(h_m(x_i)=c_i) = 1 - err_m\). Now we have:
$$ T_1 = err_m \cdot \frac{(1-err_m)(K-1)}{err_m} + 1 - err_m = (1-err_m)(K-1) + (1-err_m) = K(1-err_m) $$
Dividing by this scaling factor, we replace steps 2.d and 2.e with a single new step:
\( w_i \gets w_i \frac{K-1}{K \, err_m} \) if \( h_m(x_i) \neq c_i \)
\( w_i \gets w_i \frac{1}{K(1-err_m)} \) if \( h_m(x_i) = c_i \)
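The combined step is a single vectorized pass; the concrete numbers below are again made-up values for one hypothetical round. Note that \(err_m\) is already available from step 2.b, so no extra summation is needed.

```python
import numpy as np

# Hypothetical one-round example: K classes, current weights summing
# to 1, and the misclassification mask.
K = 3
w = np.array([0.25, 0.25, 0.25, 0.25])
miss = np.array([True, False, False, True])
err = w[miss].sum()            # err_m from step 2.b (weights sum to 1)

# Combined update: each weight is scaled directly to its normalized
# value, in a single pass, without computing the new total.
w = np.where(miss, w * (K - 1) / (K * err), w / (K * (1 - err)))
```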
If we also note that \( \sum_{i=1}^{n}w_i = 1 \) holds at every iteration, the denominator in step 2.b is always 1, so 2.b simplifies as well. Finally, we have a new description of the SAMME algorithm:
AdaBoost.SAMME.2
1. Initialize the observation weights \( w_i = \frac{1}{n} \)
2. For m = 1 to M
a. Fit classifier \(h_m(x)\) using weights \(w_i\)
b. Compute error \(err_m = \sum_{i=1}^{n}w_i I(c_i \neq h_m(x_i)) \)
c. Let \( \alpha_m = \log\frac{1-err_m}{err_m} + \log(K-1) \)
d. Set \( w_i \gets w_i \frac{K-1}{K \, err_m} \) if \( h_m(x_i) \neq c_i \); set \( w_i \gets w_i \frac{1}{K(1-err_m)} \) if \( h_m(x_i) = c_i \)
3. Output \( C(x) = \arg \max_k \sum_{m=1}^{M} \alpha_m I(h_m(x) = k) \)
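A quick sanity check that the two formulations agree: apply the original boost-then-normalize update and the direct single-pass update to the same random (illustrative) weights and mask, and compare.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 4, 10
w0 = rng.random(n)
w0 = w0 / w0.sum()                   # normalized starting weights
miss = rng.random(n) < 0.3
miss[0], miss[1] = True, False       # ensure 0 < err_m < 1
err = w0[miss].sum()

# Original SAMME: multiplicative boost (2.d) then re-normalization (2.e).
alpha = np.log((1 - err) / err) + np.log(K - 1)
w1 = w0 * np.exp(alpha * miss)
w1 = w1 / w1.sum()

# SAMME.2: direct single-pass update.
w2 = np.where(miss, w0 * (K - 1) / (K * err), w0 / (K * (1 - err)))

assert np.allclose(w1, w2)           # identical up to rounding
```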
Notes
It is worth noting that the original formulation of the algorithm expresses the mechanism of AdaBoost much more clearly.

The improvement is not great: roughly \(2 \cdot M \cdot n\) operations of execution time, which might be dominated by the execution time of the weak learner if it runs in more than linear time (this is not always the case; decision stumps or OneR are often used with AdaBoost).

Another thing worth noting is that I found this simplification in no other implementation I studied: this is certainly not how it is implemented in scikit-learn (see the AdaBoost source here), and it is not implemented in Weka either. A potential reason is that the simplified variant might lead to unstable weights over many iterations due to numerical error. One might argue that periodic normalization will keep the inherent errors under a certain limit. My belief is that periodic normalization will eventually keep the total weight mass near 1, but this by itself offers no guarantee on the computed values of the individual weights. However, some tests could be run. I will leave that for next time.
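As a starting point for such a test, here is a minimal sketch that tracks how far the total weight mass drifts from 1 under the single-pass update. The misclassification masks and error rates are synthetic (no weak learner is involved), and float32 is used deliberately to make rounding effects more visible; this measures only the drift of the total mass, not the accuracy of individual weights.

```python
import numpy as np

rng = np.random.default_rng(42)
K, n = 3, 1000
w = np.full(n, 1.0 / n, dtype=np.float32)
drift = 0.0
for _ in range(1000):
    miss = rng.random(n) < 0.3               # synthetic mask for one round
    err = w[miss].sum()
    if err <= 0 or err >= (K - 1) / K:       # skip degenerate rounds
        continue
    # Single-pass SAMME.2 update; in exact arithmetic the sum stays 1.
    w = np.where(miss, w * (K - 1) / (K * err), w / (K * (1 - err)))
    drift = max(drift, abs(float(w.sum()) - 1.0))
print(drift)   # worst observed deviation of the total weight mass from 1
```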