October 9, 2013

Random Forests classification with Rapaio toolbox


Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. More about this topic can be found at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Implementation

Rapaio toolbox is and will continue to be under construction for a long period. However, there is a Random Forest implementation available and ready to be used. The implemented features are:
1. Classification only (regression will follow soon).
2. It uses the Gini impurity function to find the best split attributes/values, in accordance with the original specification. It could also be implemented with information gain (as in Weka) or something else; however, there is no huge difference in the results, so for now it stays with the original specification.
3. It computes the OOB (out-of-bag) error, which is an error estimate constructed in a similar manner to cross-validation. OOB computation can be disabled for faster execution.
4. It does not yet compute either of the two variable importance measures; they will be implemented soon.
5. There is no computation of proximities, and I do not know for sure if I want that in the immediate future.
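To make the split criterion concrete: the Gini impurity of a node with class proportions p_k is 1 - sum(p_k^2), and a candidate split is scored by the weighted impurity of the two children it produces. Below is a minimal sketch in plain Java, independent of the Rapaio API; the class and method names are illustrative only.

```java
public class GiniDemo {

    // Gini impurity of a node given its class counts: 1 - sum((n_k / n)^2)
    static double gini(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0;
        double sum = 0;
        for (int c : counts) {
            double p = (double) c / total;
            sum += p * p;
        }
        return 1 - sum;
    }

    // Weighted impurity of a binary split; the best split minimizes this value
    static double splitGini(int[] left, int[] right) {
        int nl = 0, nr = 0;
        for (int c : left) nl += c;
        for (int c : right) nr += c;
        double n = nl + nr;
        return (nl / n) * gini(left) + (nr / n) * gini(right);
    }

    public static void main(String[] args) {
        System.out.println(gini(new int[]{5, 5}));   // 0.5 (maximally impure, two classes)
        System.out.println(gini(new int[]{10, 0}));  // 0.0 (pure node)
        System.out.println(splitGini(new int[]{8, 2}, new int[]{1, 9}));
    }
}
```

The tree induction simply evaluates this score for every candidate attribute/value pair and keeps the split with the lowest weighted impurity.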

Why Random Forests?

Random Forests are very popular because the intuition behind the algorithm is easy to understand and, to some extent, easy to implement. However, I have often encountered the opinion that Random Forests are something like a panacea for learning. In my humble opinion they are far from that, simply because there is no such thing in machine learning.
Random Forests learn well in a variety of situations and are usually useful when it is very hard or complex to understand the mechanics of your data. However, finding and exploiting valuable knowledge in the data is often more successful than running random forests.
I like random forests for some other qualities which I find more valuable, but less often mentioned:
- the ability to capture knowledge about the importance of the features
- the possibility to be used as an exploratory tool or for unsupervised learning
- the theory behind the algorithm, which explains how some variance vanishes and how some "noisy random salt" produces stability; all the inspiring simple or subtle things in the theory behind it

Data setup

I will use a classical data set called spam-base, which was imported into the Rapaio toolbox from the well-known UCI repository: http://archive.ics.uci.edu/ml/datasets/Spambase
In order to compute faster I will use only some dimensions of this data set: for prediction I will use only the first 20 features and the class called "spam".
        Frame all = Datasets.loadSpamBase();
        all = ColFilters.retainCols(all, "1-20,spam");

        Summary.summary(ColFilters.retainCols(all, "1-5"));
        Summary.summary(ColFilters.retainCols(all, "spam"));
>>summary("spam-base", [word_freq_all, word_freq_3d, word_freq_our, word_freq_over, word_freq_remove])
rows: 4601, cols: 5
  word_freq_all      word_freq_3d     word_freq_our   word_freq_over  word_freq_remove 
   Min. : 0.000     Min. :  0.000     Min. :  0.000     Min. : 0.000      Min. : 0.000 
1st Qu. : 0.000  1st Qu. :  0.000  1st Qu. :  0.000  1st Qu. : 0.000   1st Qu. : 0.000 
 Median : 0.000   Median :  0.000   Median :  0.000   Median : 0.000    Median : 0.000 
   Mean : 0.281     Mean :  0.065     Mean :  0.312     Mean : 0.096      Mean : 0.114 
2nd Qu. : 0.420  2nd Qu. :  0.000  2nd Qu. :  0.383  2nd Qu. : 0.000   2nd Qu. : 0.000 
   Max. : 5.100     Max. : 42.810     Max. : 10.000     Max. : 5.880      Max. : 7.270 
                                                                                       
>>summary("spam-base", [spam])
rows: 4601, cols: 1
    spam 
0 : 2788 
1 : 1813 
         
         
         
         
         
Above you see a five-number summary of the data. It is not exhaustive, since exploration is not the purpose of this tutorial.
We will split the data set in two parts: one will be used for training the random forest and the other will be used for testing its prediction accuracy.
        List<Frame> frames = Sample.randomSample(all, new int[]{all.getRowCount() * 15 / 100});
        Frame train = frames.get(0);
        Frame test = frames.get(1);

Playing with number of trees grown

Now that we have a train and a test data set we can learn and predict. RF grows a number of trees over bootstrap samples and uses voting for classification. How large must this number of trees be? You can check how well you predict as the number of trees grows.
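The two ingredients mentioned above, bootstrap sampling and majority voting, can be sketched in a few lines of plain Java. This is only an illustration of the mechanism, not the toolbox implementation; the class and method names are my own.

```java
import java.util.Arrays;
import java.util.Random;

public class VotingDemo {

    // Draw a bootstrap sample of row indices: n draws with replacement.
    // On average about 36.8% of the rows are left out and form the
    // out-of-bag (OOB) set for that tree.
    static int[] bootstrap(int n, Random rng) {
        int[] rows = new int[n];
        for (int i = 0; i < n; i++) rows[i] = rng.nextInt(n);
        return rows;
    }

    // Majority vote over the per-tree predictions for a single instance
    static int vote(int[] treePredictions, int numClasses) {
        int[] counts = new int[numClasses];
        for (int p : treePredictions) counts[p]++;
        int best = 0;
        for (int k = 1; k < numClasses; k++)
            if (counts[k] > counts[best]) best = k;
        return best;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bootstrap(10, new Random(42))));
        // 7 trees say spam (1), 3 say not spam (0) -> the forest predicts 1
        System.out.println(vote(new int[]{1, 1, 0, 1, 1, 0, 1, 1, 0, 1}, 2));
    }
}
```

The OOB error reported by the forest is obtained by predicting each row only with the trees whose bootstrap sample did not contain that row, which is why it behaves like a built-in cross-validation estimate.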

[figure: test error (blue) and OOB error (black) as a function of the number of trees]

Note from the previous plot how both the test and OOB errors go down as the number of trained trees grows. However, the improvement levels off at some point, after which adding new trees becomes useless.
        int pos = 0;
        Vector index = new IndexVector("number of trees", 1000);
        Vector accuracy = new NumericVector("test error", 1000);
        Vector oob = new NumericVector("oob error", 1000);
        for (int mtree = 1; mtree < 100; mtree += 5) {
            RandomForest rf = new RandomForest(mtree, 3, true);
            rf.learn(train, "spam");
            ClassifierModel model = rf.predict(test);
            index.setIndex(pos, mtree);
            accuracy.setValue(pos, 1 - computeAccuracy(model, test));
            oob.setValue(pos, rf.getOobError());
            pos++;
        }
        Plot p = new Plot();
        Lines lines = new Lines(p, index, accuracy);
        lines.opt().setColorIndex(new OneIndexVector(2));
        p.add(lines);
        Points pts = new Points(p, index, accuracy);
        pts.opt().setColorIndex(new OneIndexVector(2));
        p.add(pts);
        p.add(new Lines(p, index, oob));
        p.add(new Points(p, index, oob));

        p.setLeftLabel("test (blue), oob (black)");
        p.setTitle("Accuracy errors (% misclassified)");
        p.getOp().setYRange(0, 0.4);
        draw(p, 600, 400);

Playing with number of random features

The main difference between bagging and random forests is that while bagging only grows trees on bootstrap samples, random forests additionally randomize the features considered at each split in order to decorrelate those trees. The main effect of this is a further reduction in the variance of the prediction, which translates into better accuracy.
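The feature randomization is simple: at every split, only a small random subset of the p predictors is scored. A minimal sketch in plain Java follows; the names are illustrative and do not belong to the Rapaio API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class MtryDemo {

    // At each split, pick mcols features at random (without replacement)
    // from the p available; only those candidates are scored for the split.
    static List<Integer> randomFeatures(int p, int mcols, Random rng) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < p; i++) all.add(i);
        Collections.shuffle(all, rng);
        return all.subList(0, mcols);
    }

    public static void main(String[] args) {
        // 20 predictors, 3 candidates per split (close to the sqrt(p) rule of thumb)
        System.out.println(randomFeatures(20, 3, new Random(7)));
    }
}
```

Because each tree sees a different random subset of features at each split, two trees grown on similar bootstrap samples end up making different mistakes, and the vote averages those mistakes away.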

[figure: test error (blue) and OOB error (black) as a function of the number of random features]

It can be seen here that, according to both the OOB and the test errors, the best predictions are obtained when the number of random features lies in the 3 to 6 interval.
The code which produced the last plot is listed below.
        pos = 0;
        // the loop variable is the number of random features tried at each split (mcols),
        // not the number of trees, which stays fixed at 10
        index = new IndexVector("mcols", 1000);
        accuracy = new NumericVector("test error", 1000);
        oob = new NumericVector("oob error", 1000);
        for (int mcols = 1; mcols < 20; mcols += 1) {

            RandomForest rf = new RandomForest(10, mcols, true);
            rf.learn(train, "spam");
            ClassifierModel model = rf.predict(test);

            index.setIndex(pos, mcols);
            accuracy.setValue(pos, 1 - computeAccuracy(model, test));
            oob.setValue(pos, rf.getOobError());

            pos++;
        }
        p = new Plot();
        lines = new Lines(p, index, accuracy);
        lines.opt().setColorIndex(new OneIndexVector(2));
        p.add(lines);
        pts = new Points(p, index, accuracy);
        pts.opt().setColorIndex(new OneIndexVector(2));
        p.add(pts);
        p.add(new Lines(p, index, oob));
        p.add(new Points(p, index, oob));
        p.setLeftLabel("test (blue), oob (black)");
        p.setBottomLabel("mcols - number of features considered");
        p.setTitle("Accuracy errors (% misclassified)");
        p.getOp().setYRange(0, 0.4);
        draw(p, 600, 400);
Note: the sole purpose of this tutorial is to show what can be done with the Rapaio toolbox library and how.
>>>This tutorial is generated with Rapaio document printer facilities.<<<
