Broken thoughts: Rapaio Random Forest - a benchmark

While searching for various ideas and implementations of random forests I noticed the project called fast-random-forest. This is basically a faster and optimized version of Weka implementation. Take a look here: http://code.google.com/p/fast-random-forest/.

I don't discuss the implementation. Some things seems to be faster than what Weka provides, other seems to provide no significant performance boost. Anyway, it deserves kudos for his effort and the fact that he open sourced the code.

However what I found more interesting for my purpose is a small benchmark which compares some results of Weka Random Forest and fast-random-forest. I tried to check my implementation against those two.

Here are some results:

test               Weka       FastRF      RapaioRF
anneal        0.9960000000 0.9930000000 0.8630289532
balance-scale 0.8160000000 0.8130000000 0.8208000000
breast-w      0.9670000000 0.9680000000 0.9642346209
credit-a      0.8620000000 0.8650000000 0.8724637681
credit-g      0.7580000000 0.7510000000 0.7520000000
diabetes      0.7630000000 0.7620000000 0.7565104167
hypothyroid 0.9940000000 0.9950000000 0.9904559915
kr-vs-kp      0.9920000000 0.9870000000 0.9881101377
mushroom      1.0000000000 1.0000000000 1.0000000000
segment       0.9810000000 0.9800000000 0.9766233766
sick          0.9830000000 0.9840000000 0.9803817603
soybean       0.9320000000 0.9250000000 0.9385065886
vehicle       0.7470000000 0.7520000000 0.7517730496
vowel         0.9810000000 0.9680000000 0.9585858586
letter        0.9620000000 0.9620000000 0.9659500000
splice        0.9300000000 0.9410000000 0.9639498433
waveform-5000 0.8480000000 0.8480000000 0.8494000000

Note: the comparison does not show a best algorithm. The test is not intended to induce the idea of searching for the best implementation. Such a tentative is a dead end for far too many reasons, and I don't spend time to enumerate them here.

However what I wanted to know is if my implementation is comparable. And it is.

Later update: after I fix a bug I ran again same test.

         test         Weka       FastRF     RapaioRF
       anneal 0.9960000000 0.9930000000 0.9109131403
balance-scale 0.8160000000 0.8130000000 0.8208000000
     breast-w 0.9670000000 0.9680000000 0.9642346209
     credit-a 0.8620000000 0.8650000000 0.8753623188
     credit-g 0.7580000000 0.7510000000 0.7630000000
     diabetes 0.7630000000 0.7620000000 0.7643229167
hypothyroid 0.9940000000 0.9950000000 0.9931071050
     kr-vs-kp 0.9920000000 0.9870000000 0.9871714643
     mushroom 1.0000000000 1.0000000000 1.0000000000
      segment 0.9810000000 0.9800000000 0.9753246753
         sick 0.9830000000 0.9840000000 0.9827677625
      soybean 0.9320000000 0.9250000000 0.9355783309
      vehicle 0.7470000000 0.7520000000 0.7565011820
        vowel 0.9810000000 0.9680000000 0.9626262626
       letter 0.9620000000 0.9620000000 0.9658500000
       splice 0.9300000000 0.9410000000 0.9620689655
waveform-5000 0.8480000000 0.8480000000 0.8522000000

This time the results of my implementation seems better. Again, I am not finding the best implementation. There are, however, some differences which I think is better to highlight here:

Rapaio implementation follows the instructions proposed by Breiman regarding the impurity function to be used. I use Gini impurity function and the splitting criteria is the gain Gini impurity given by split. Both Weka and fast-random-forest uses InfoGain as impurity function.
Gini variable importance, one of the VI proposed by Breiman, is computed based on Gini index (again, both Weka and fast-random-forest uses InfoGain for that purpose)
Rapaio and fast-random-forest uses binary splits for growing trees. Weka does not.
Regarding missing values there are even more differences:

Impurity value computation: Weka incorporates missing values in the computed distributions, proportionally with the classes frequencies. Fast-random-forest does the same, however, for nominal attributes uses only binary splits. Rapaio ignores the missing values for this computation.
Split data: Weka and Rapaio uses all values with missing data on selected attribute, this instances are all distributed to all children nodes (only 2 for Rapaio, possible more than 2 for Weka) with weights adjusted by the proportions given by class distribution. Fast-random-forest does not, it distributes randomly instances in children nodes, proportionally with the class distribution.

Rapaio implementation is probably the slowest among all of them. That is explained partially because "early optimization is the root of all evil" and partly because it uses the most expensive choices.

Broken thoughts

October 16, 2013

Rapaio Random Forest - a benchmark - updated

No comments:

Post a Comment