December 21, 2013

ROC graphs with Rapaio


Data set preparation

For exemplification I will use the classical spam data set. We load the data set and split it randomly into two pieces: the first sample, with roughly 60% of the rows, is used for training; the second is used for testing our model.
        RandomSource.setSeed(2718);
        final Frame spam = ColFilters.retainCols(Datasets.loadSpamBase(), "0-4,spam");
        List<Frame> samples = randomSample(spam, new int[]{(int) (spam.getRowCount() * 0.6)});
        final Frame train = samples.get(0);
        final Frame test = samples.get(1);
If you are not familiar with the spam data set, all you need to know is that it consists of many numerical attributes used to predict a nominal attribute called \(spam\).
There are 2788 instances classified as \(ham\), codified by value 0 (\(not spam\)), and 1813 instances codified by 1, which denotes spam emails. There are a lot of numeric features in this data set; we use only the first 5 numerical features for prediction.
>>summary(frame, [word_freq_make, word_freq_address, word_freq_all, word_freq_3d, word_freq_our, spam])
rows: 4601, cols: 6
 word_freq_make  word_freq_address    word_freq_all      word_freq_3d     word_freq_our 
   Min. : 0.000      Min. :  0.000     Min. : 0.000     Min. :  0.000     Min. :  0.000 
1st Qu. : 0.000   1st Qu. :  0.000  1st Qu. : 0.000  1st Qu. :  0.000  1st Qu. :  0.000 
 Median : 0.000    Median :  0.000   Median : 0.000   Median :  0.000   Median :  0.000 
   Mean : 0.105      Mean :  0.213     Mean : 0.281     Mean :  0.065     Mean :  0.312 
3rd Qu. : 0.000   3rd Qu. :  0.000  3rd Qu. : 0.420  3rd Qu. :  0.000  3rd Qu. :  0.383 
   Max. : 4.540      Max. : 14.280     Max. : 5.100     Max. : 42.810     Max. : 10.000 
                                                                                        
    spam 
0 : 2788 
1 : 1813 
Now we can do some predictions.

Binary classification

We will build 3 models for prediction, using the training set which contains 60% of our initial data. For testing how well each model predicts we use the remaining data.

OneRule

This first model is one of the simplest models possible: it basically builds a decision tree with a single level. For documentation about this algorithm you can check the original paper: Holte, R.C. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning 11, 63-91 (1993).
        OneRule oneRule = new OneRule();
        oneRule.learn(train, "spam");
        oneRule.predict(test);
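To make the idea concrete, here is an illustrative sketch of the OneRule selection step for nominal attributes only (rapaio's OneRule also handles numeric attributes by binning them; every name below is hypothetical, not rapaio API):

```java
import java.util.*;

// Sketch of the OneRule idea (Holte, 1993), nominal attributes only.
public class OneRuleSketch {

    // For one attribute: map each attribute value to its majority class,
    // and count how many rows that single-attribute rule misclassifies.
    static int ruleErrors(String[] attr, String[] target) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < attr.length; i++) {
            counts.computeIfAbsent(attr[i], k -> new HashMap<>())
                  .merge(target[i], 1, Integer::sum);
        }
        int errors = 0;
        for (Map<String, Integer> classCounts : counts.values()) {
            int total = 0, best = 0;
            for (int c : classCounts.values()) {
                total += c;
                best = Math.max(best, c);
            }
            errors += total - best;  // rows outside the majority class
        }
        return errors;
    }

    public static void main(String[] args) {
        String[] target   = {"spam", "spam", "ham", "ham", "ham"};
        String[] hasLink  = {"yes",  "yes",  "no",  "no",  "yes"};
        String[] fromBook = {"no",   "yes",  "no",  "yes", "no"};
        // OneRule keeps the attribute with the fewest errors.
        System.out.println("hasLink errors:  " + ruleErrors(hasLink, target));   // 1
        System.out.println("fromBook errors: " + ruleErrors(fromBook, target));  // 2
    }
}
```

The attribute with the lowest error count becomes the whole model, which is why OneRule amounts to a one-level decision tree.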
One of the most used ways to check the performance of a classifier is accuracy: the percentage of cases with correct predictions out of the total number of cases. With the rapaio library, one way to see the accuracy is to summarize the confusion matrix.
Confusion matrix

      | Predicted
Actual|     0      1| Total 
------ ------ ------ ------ 
     0|   941    182|  1123
     1|   361    357|   718
------ ------ ------ ------ 
 Total|  1302    539|  1841

Complete cases 1841 from 1841
Accuracy: 0.7051
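As a sanity check, the accuracy above can be recomputed directly from the confusion matrix: correct predictions sit on the diagonal. A minimal sketch, using the OneRule matrix printed above:

```java
// Accuracy = sum of the confusion matrix diagonal / total number of cases.
public class AccuracySketch {
    static double accuracy(int[][] cm) {
        int correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++) {
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
                if (i == j) correct += cm[i][j];  // diagonal = correct
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        int[][] oneRuleCm = {{941, 182}, {361, 357}};
        System.out.printf("Accuracy: %.4f%n", accuracy(oneRuleCm));  // 0.7051
    }
}
```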

Random Forest

The second prediction model is a random forest with 200 random trees.
        RandomForest rf = new RandomForest().setMtrees(200);
        rf.learn(train, "spam");
        rf.predict(test);
Confusion matrix

      | Predicted
Actual|     0      1| Total 
------ ------ ------ ------ 
     0|   968    155|  1123
     1|   272    446|   718
------ ------ ------ ------ 
 Total|  1240    601|  1841

Complete cases 1841 from 1841
Accuracy: 0.7681
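The forest classifies a case by majority vote over its trees, and the fraction of votes for a class is the kind of per-class score that the ROC curves below consume. A minimal sketch of the voting step only (tree training omitted; all names hypothetical):

```java
import java.util.*;

// Sketch of random forest aggregation: majority vote over tree predictions.
public class VoteSketch {
    static String majorityVote(List<String> treePredictions) {
        Map<String, Integer> votes = new HashMap<>();
        for (String p : treePredictions) {
            votes.merge(p, 1, Integer::sum);  // count each tree's vote
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // e.g. 200 trees: 120 voting "1" (spam) and 80 voting "0" (ham)
        List<String> preds = new ArrayList<>();
        for (int i = 0; i < 120; i++) preds.add("1");
        for (int i = 0; i < 80; i++) preds.add("0");
        System.out.println(majorityVote(preds));  // prints "1"
    }
}
```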

AdaBoost.M1

The third prediction model is a boosting algorithm called AdaBoost.M1. This model is built with a decision stump as the weak learner, boosted over 200 iterations. The following code shows how one can achieve that using rapaio.
        AdaBoostM1 ab = new AdaBoostM1(new DecisionStump(), 200);
        ab.learn(train, "spam");
        ab.predict(test);
Confusion matrix

      | Predicted
Actual|     0      1| Total 
------ ------ ------ ------ 
     0|   909    214|  1123
     1|   263    455|   718
------ ------ ------ ------ 
 Total|  1172    669|  1841

Complete cases 1841 from 1841
Accuracy: 0.7409
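The re-weighting step that drives each boosting iteration can be sketched as follows. This follows Freund and Schapire's M1 scheme (correctly classified cases are multiplied by \(beta = err / (1 - err)\), then weights are normalized); it is a sketch, not rapaio's code:

```java
import java.util.Arrays;

// Sketch of one AdaBoost.M1 round: re-weight the sample so that
// misclassified cases matter more in the next round.
public class AdaBoostSketch {
    static double[] reweight(double[] w, boolean[] correct, double err) {
        double beta = err / (1 - err);      // requires err < 0.5
        double[] out = w.clone();
        double sum = 0;
        for (int i = 0; i < out.length; i++) {
            if (correct[i]) out[i] *= beta; // shrink correct cases
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;  // normalize
        return out;
    }

    public static void main(String[] args) {
        double[] w = {0.25, 0.25, 0.25, 0.25};
        boolean[] correct = {true, true, true, false};
        double[] next = reweight(w, correct, 0.25);
        // the single misclassified case now carries half the total weight
        System.out.println(Arrays.toString(next));
    }
}
```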

ROC Curves

When accuracy is used to compare the performance of classifiers, the comparison is very often misleading. That happens because accuracy depends on the class distribution and implicitly assumes that all misclassification errors cost the same, assumptions which are not always true.
I will not explain what a ROC graph is; there is enough literature on this topic. Among many useful documents, I found one which gives crystal clear details and explanations on ROC curves: Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine Learning.
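Briefly, a ROC curve is traced by sweeping a decision threshold over the classifier's scores and recording (false positive rate, true positive rate) pairs. A minimal sketch in plain Java (not rapaio's implementation; ties between scores are ignored for brevity):

```java
import java.util.*;

// Sketch: compute ROC points by sorting cases by decreasing score and
// sweeping the threshold one case at a time.
public class RocSketch {
    static double[][] rocPoints(double[] scores, boolean[] positive) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));
        int p = 0, n = 0;
        for (boolean b : positive) { if (b) p++; else n++; }
        double[][] points = new double[scores.length + 1][2];  // points[0] = (0,0)
        int tp = 0, fp = 0;
        for (int k = 0; k < idx.length; k++) {
            if (positive[idx[k]]) tp++; else fp++;
            points[k + 1] = new double[]{(double) fp / n, (double) tp / p};
        }
        return points;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.5};
        boolean[] pos = {true, true, false, true, false};
        for (double[] pt : rocPoints(scores, pos)) {
            System.out.printf("FPR=%.2f TPR=%.2f%n", pt[0], pt[1]);
        }
    }
}
```

Plotting these points with FPR on the x axis and TPR on the y axis gives the ROC curve; the closer the curve hugs the top-left corner, the better the classifier.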
In order to draw ROC graphs for the previous models with rapaio you can use the ROCCurve plot component, which builds and draws a curve according to a given computed ROC object. The following code does this.

        ROC rocOR = new ROC(oneRule.getPrediction(), test.getCol("spam"), "1");
        ROC rocRF = new ROC(rf.getDistribution().getCol("1"), test.getCol("spam"), "1");
        ROC rocAB = new ROC(ab.getDistribution().getCol("1"), test.getCol("spam"), "1");
        draw(new Plot()
                .add(new ROCCurve(rocOR).setColorIndex(1))
                .add(new ROCCurve(rocRF).setColorIndex(2))
                .add(new ROCCurve(rocAB).setColorIndex(3))
                .add(new Legend(0.6, 0.33, 
                        new String[]{"onerule", "rf", "adaboost.m1"}, 
                        new int[]{1, 2, 3})),
                600, 400);
As you can see, ROC objects are used to compute the values for the ROC curves, and the ROCCurve plot component is used to add them to a plot graphic.
Note, however, that the random forest model exhibits a better ROC graph than the adaboost model over most of the conservative region of the graph. AdaBoost tends to be slightly better in the liberal region, but at the extreme liberal end the random forest model again performs better.
OneRule behaves sub-optimally, as was expected in this specific case.
This tutorial is generated with Rapaio document printer facilities.

