I don't discuss the implementation. Some things seems to be faster than what Weka provides, other seems to provide no significant performance boost. Anyway, it deserves kudos for his effort and the fact that he open sourced the code.
However what I found more interesting for my purpose is a small benchmark which compares some results of Weka Random Forest and fast-random-forest. I tried to check my implementation against those two.
Here are some results:
test Weka FastRF RapaioRF
anneal 0.9960000000 0.9930000000 0.8630289532
balance-scale 0.8160000000 0.8130000000 0.8208000000
breast-w 0.9670000000 0.9680000000 0.9642346209
credit-a 0.8620000000 0.8650000000 0.8724637681
credit-g 0.7580000000 0.7510000000 0.7520000000
diabetes 0.7630000000 0.7620000000 0.7565104167
hypothyroid 0.9940000000 0.9950000000 0.9904559915
kr-vs-kp 0.9920000000 0.9870000000 0.9881101377
mushroom 1.0000000000 1.0000000000 1.0000000000
segment 0.9810000000 0.9800000000 0.9766233766
sick 0.9830000000 0.9840000000 0.9803817603
soybean 0.9320000000 0.9250000000 0.9385065886
vehicle 0.7470000000 0.7520000000 0.7517730496
vowel 0.9810000000 0.9680000000 0.9585858586
letter 0.9620000000 0.9620000000 0.9659500000
splice 0.9300000000 0.9410000000 0.9639498433
waveform-5000 0.8480000000 0.8480000000 0.8494000000
Note: the comparison does not show a best algorithm. The test is not intended to induce the idea of searching for the best implementation. Such a tentative is a dead end for far too many reasons, and I don't spend time to enumerate them here.
However what I wanted to know is if my implementation is comparable. And it is.
Later update: after I fix a bug I ran again same test.
test Weka FastRF RapaioRF
anneal 0.9960000000 0.9930000000 0.9109131403
balance-scale 0.8160000000 0.8130000000 0.8208000000
breast-w 0.9670000000 0.9680000000 0.9642346209
credit-a 0.8620000000 0.8650000000 0.8753623188
credit-g 0.7580000000 0.7510000000 0.7630000000
diabetes 0.7630000000 0.7620000000 0.7643229167
hypothyroid 0.9940000000 0.9950000000 0.9931071050
kr-vs-kp 0.9920000000 0.9870000000 0.9871714643
mushroom 1.0000000000 1.0000000000 1.0000000000
segment 0.9810000000 0.9800000000 0.9753246753
sick 0.9830000000 0.9840000000 0.9827677625
soybean 0.9320000000 0.9250000000 0.9355783309
vehicle 0.7470000000 0.7520000000 0.7565011820
vowel 0.9810000000 0.9680000000 0.9626262626
letter 0.9620000000 0.9620000000 0.9658500000
splice 0.9300000000 0.9410000000 0.9620689655
waveform-5000 0.8480000000 0.8480000000 0.8522000000
This time the results of my implementation seems better. Again, I am not finding the best implementation. There are, however, some differences which I think is better to highlight here:
- Rapaio implementation follows the instructions proposed by Breiman regarding the impurity function to be used. I use Gini impurity function and the splitting criteria is the gain Gini impurity given by split. Both Weka and fast-random-forest uses InfoGain as impurity function.
- Gini variable importance, one of the VI proposed by Breiman, is computed based on Gini index (again, both Weka and fast-random-forest uses InfoGain for that purpose)
- Rapaio and fast-random-forest uses binary splits for growing trees. Weka does not.
- Regarding missing values there are even more differences:
- Impurity value computation: Weka incorporates missing values in the computed distributions, proportionally with the classes frequencies. Fast-random-forest does the same, however, for nominal attributes uses only binary splits. Rapaio ignores the missing values for this computation.
- Split data: Weka and Rapaio uses all values with missing data on selected attribute, this instances are all distributed to all children nodes (only 2 for Rapaio, possible more than 2 for Weka) with weights adjusted by the proportions given by class distribution. Fast-random-forest does not, it distributes randomly instances in children nodes, proportionally with the class distribution.
No comments:
Post a Comment