Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. More about this topic can be found at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
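Before diving into the toolbox, here is a minimal sketch of the voting step in plain Java (not the toolbox implementation), just to fix ideas: the forest's prediction for one instance is simply the most frequent label among the trees' predictions.
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of majority voting: returns the label predicted by
// most trees (the mode). Ties are broken arbitrarily here.
static String majorityVote(String[] treePredictions) {
    Map<String, Integer> votes = new HashMap<>();
    for (String label : treePredictions) {
        Integer count = votes.get(label);
        votes.put(label, count == null ? 1 : count + 1);
    }
    String winner = null;
    int best = -1;
    for (Map.Entry<String, Integer> e : votes.entrySet()) {
        if (e.getValue() > best) {
            best = e.getValue();
            winner = e.getKey();
        }
    }
    return winner;
}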
Implementation
Rapaio toolbox is and will continue to be under construction for a long period. However, there is a Random Forest implementation available and ready to be used. Implemented features:
1. Classification only (regression will follow soon).
2. It uses the Gini impurity function for finding the best split attributes/values, according to the original specification. It could also be implemented with InfoGain (as in Weka) or something else; however, there is not a huge difference in results, so for now it stays with the original specification. (A small sketch of the impurity computation follows after this list.)
3. It computes the OOB (out-of-bag) error, an estimate constructed in the same spirit as cross-validation. For faster execution the OOB computation can be disabled.
4. It does not yet compute either of the two forms of feature importance; these will be implemented soon.
5. There is no computation of proximities, and I do not know for sure if I want that in the immediate future.
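To make point 2 above concrete, here is a minimal, toolbox-independent sketch of the Gini impurity. For class proportions p_k in a node, the impurity is 1 - sum(p_k^2); the split search picks the attribute/value that minimizes the weighted impurity of the resulting children.
// Gini impurity of a node, given per-class instance counts:
// gini = 1 - sum over classes of p_k^2, where p_k = count_k / total.
// A pure node has impurity 0; impurity is maximal when all classes
// are equally frequent.
static double gini(int[] classCounts) {
    int total = 0;
    for (int count : classCounts) total += count;
    if (total == 0) return 0.0;
    double impurity = 1.0;
    for (int count : classCounts) {
        double p = count / (double) total;
        impurity -= p * p;
    }
    return impurity;
}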
Why Random Forests?
Random Forests are very popular because the intuition behind the algorithm is easy to understand and, to some extent, the algorithm is easy to implement. However, I have often found the opinion that Random Forests are something like a panacea for learning. In my humble opinion they are far from that, simply because there is no such thing in machine learning. Random Forests learn well in a variety of situations and are usually useful when it is very hard or complex to understand the mechanics of your data. However, finding and exploiting valuable knowledge from the data is often more successful than a random forest.
I like random forests for some other qualities, which I find more valuable but less popular:
- the ability to capture knowledge about the importance of the features
- the possibility to use it as an exploratory tool, or for unsupervised learning
- the theory behind the algorithm, which explains how some variance vanishes and how some "noisy random salt" produces stability; all the inspiring simple or subtle things from the theory behind it
Data setup
I will use a classical data set called spam-base, which was imported into the Rapaio toolbox from the well-known UCI repository: http://archive.ics.uci.edu/ml/datasets/Spambase
In order to compute faster, I will use only some dimensions of this data set: the first 20 features for prediction, plus the class attribute called "spam".
Frame all = Datasets.loadSpamBase();
all = ColFilters.retainCols(all, "1-20,spam"); // keep the first 20 features plus the class
Summary.summary(ColFilters.retainCols(all, "1-5")); // summary of the first 5 features
Summary.summary(ColFilters.retainCols(all, "spam")); // summary of the class attribute
>>summary("spam-base", [word_freq_all, word_freq_3d, word_freq_our, word_freq_over, word_freq_remove])
rows: 4601, cols: 5
word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu. : 0.000 1st Qu. : 0.000 1st Qu. : 0.000 1st Qu. : 0.000 1st Qu. : 0.000
Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000
Mean : 0.281 Mean : 0.065 Mean : 0.312 Mean : 0.096 Mean : 0.114
2nd Qu. : 0.420 2nd Qu. : 0.000 2nd Qu. : 0.383 2nd Qu. : 0.000 2nd Qu. : 0.000
Max. : 5.100 Max. : 42.810 Max. : 10.000 Max. : 5.880 Max. : 7.270
>>summary("spam-base", [spam])
rows: 4601, cols: 1
spam
0 : 2788
1 : 1813
Above you can see some five-number summary information on the data. It is not exhaustive, since that is not the purpose of this tutorial. We will now split the data set in two parts: one will be used for training the random forest and the other one will be used for testing its prediction accuracy.
// the first sample holds ~15% of the rows and is used for training;
// the remaining rows form the test set
List<Frame> frames = Sample.randomSample(all, new int[]{all.getRowCount() * 15 / 100});
Frame train = frames.get(0);
Frame test = frames.get(1);
Playing with number of trees grown
Now that we have a train and a test data set, we can learn and predict. RF grows a number of trees over bootstrap samples and uses voting for classification. How large does this number of trees have to be? You can check how well you predict as the number of trees grows, which is what the code below does. Note from the resulting plot how both the test and the OOB errors go down as the number of trained trees grows. However, the improvement stops at some point, after which adding new trees becomes useless.
int pos = 0;
// 20 sample points: mtree = 1, 6, ..., 96 (sizing the vectors to 1000
// would leave trailing zeros in the plot)
Vector index = new IndexVector("number of trees", 20);
Vector accuracy = new NumericVector("test error", 20);
Vector oob = new NumericVector("oob error", 20);
for (int mtree = 1; mtree < 100; mtree += 5) {
    // mtree trees, 3 random features per split, OOB computation enabled
    RandomForest rf = new RandomForest(mtree, 3, true);
    rf.learn(train, "spam");
    ClassifierModel model = rf.predict(test);
    index.setIndex(pos, mtree);
    // computeAccuracy is a small helper, sketched after this listing
    accuracy.setValue(pos, 1 - computeAccuracy(model, test));
    oob.setValue(pos, rf.getOobError());
    pos++;
}
Plot p = new Plot();
Lines lines = new Lines(p, index, accuracy);
lines.opt().setColorIndex(new OneIndexVector(2));
p.add(lines);
Points pts = new Points(p, index, accuracy);
pts.opt().setColorIndex(new OneIndexVector(2));
p.add(pts);
p.add(new Lines(p, index, oob));
p.add(new Points(p, index, oob));
p.setLeftLabel("test (blue), oob (black)");
p.setTitle("Accuracy errors (% misclassified)");
p.getOp().setYRange(0, 0.4);
draw(p, 600, 400);
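The listing above (and the one that follows) relies on a computeAccuracy helper whose definition is not shown. A possible sketch is given below; note that getClassification(), getCol() and getLabel() are assumptions about the ClassifierModel, Frame and Vector APIs made only for illustration, so check the actual toolbox sources for the real accessor names.
// Hypothetical helper: fraction of test rows whose predicted label
// matches the actual "spam" label. getClassification(), getCol() and
// getLabel() are assumed accessors, not verified against the Rapaio API.
static double computeAccuracy(ClassifierModel model, Frame test) {
    Vector predicted = model.getClassification(); // assumed accessor
    Vector actual = test.getCol("spam");          // assumed accessor
    int hits = 0;
    for (int i = 0; i < test.getRowCount(); i++) {
        if (predicted.getLabel(i).equals(actual.getLabel(i))) {
            hits++;
        }
    }
    return hits / (double) test.getRowCount();
}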
Playing with number of random features
The main difference between bagging and random forests is that, while bagging only grows trees on bootstrap samples, random forests additionally randomize the choice of features at each split in order to decorrelate the trees. The main effect of this is a further reduction of the variance of the prediction, and the reward is better accuracy. It can be seen in the plot below that the best prediction, according to both the OOB error and the test set used, is obtained when the number of random features lies in the 3 to 6 interval.
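A standard calculation makes this variance argument precise: for B identically distributed trees, each with variance σ² and pairwise correlation ρ, the variance of their average is ρσ² + ((1 − ρ)/B)σ². Growing more trees can only shrink the second term, while the random feature selection lowers ρ and thus attacks the first term, which bagging alone cannot touch.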
The code which produces the plot is listed below.
pos = 0;
// 19 sample points: mcols = 1, 2, ..., 19 (the loop variable counts
// random features per split, not trees)
index = new IndexVector("mcols", 19);
accuracy = new NumericVector("test error", 19);
oob = new NumericVector("oob error", 19);
for (int mcols = 1; mcols < 20; mcols += 1) {
    // 10 trees, mcols random features per split, OOB computation enabled
    RandomForest rf = new RandomForest(10, mcols, true);
    rf.learn(train, "spam");
    ClassifierModel model = rf.predict(test);
    index.setIndex(pos, mcols);
    accuracy.setValue(pos, 1 - computeAccuracy(model, test));
    oob.setValue(pos, rf.getOobError());
    pos++;
}
p = new Plot();
lines = new Lines(p, index, accuracy);
lines.opt().setColorIndex(new OneIndexVector(2));
p.add(lines);
pts = new Points(p, index, accuracy);
pts.opt().setColorIndex(new OneIndexVector(2));
p.add(pts);
p.add(new Lines(p, index, oob));
p.add(new Points(p, index, oob));
p.setLeftLabel("test (blue), oob (black");
p.setBottomLabel("mcols - number of features considered");
p.setTitle("Accuracy errors (% misclassified)");
p.getOp().setYRange(0, 0.4);
draw(p, 600, 400);
Note: the sole purpose of this tutorial is to show what can be done with the Rapaio toolbox library and how. This tutorial was generated with the Rapaio document printer facilities.