Data set preparation
For exemplification I will use the classical spam data set. We load the data set and split it randomly into two pieces. The first sample, about 60% of the data, is used for training; the second sample is used for testing our model.
RandomSource.setSeed(2718);
final Frame spam = ColFilters.retainCols(Datasets.loadSpamBase(), "0-4,spam");
List<Frame> samples = randomSample(spam, new int[]{(int) (spam.getRowCount() * 0.6)});
final Frame train = samples.get(0);
final Frame test = samples.get(1);
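To make the split itself concrete, here is a minimal sketch in plain Java that does not depend on rapaio. The class `SplitSketch` and its `split` method are hypothetical names used only for illustration, and I am assuming that a seeded shuffle followed by a cut is a fair picture of what a random sample does; it is not rapaio's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of rapaio.
public class SplitSketch {

    /** Shuffles the row indices 0..rowCount-1 and cuts them into a train part and a test part. */
    public static List<List<Integer>> split(int rowCount, double trainFraction, long seed) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            rows.add(i);
        }
        // A fixed seed makes the shuffle, and therefore the split, reproducible.
        Collections.shuffle(rows, new Random(seed));

        // Same truncation as in the tutorial code: (int) (rowCount * 0.6).
        int trainSize = (int) (rowCount * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(rows.subList(0, trainSize)));
        parts.add(new ArrayList<>(rows.subList(trainSize, rowCount)));
        return parts;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(4601, 0.6, 2718L);
        System.out.println("train rows: " + parts.get(0).size());
        System.out.println("test rows:  " + parts.get(1).size());
    }
}
```

With 4601 rows and a fraction of 0.6 this gives 2760 training rows and 1841 test rows, which is the number of test cases reported in the confusion matrices of this tutorial.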
If you are not familiar with the spam data set, what you need to know is that it consists of many numerical attributes used to predict a nominal attribute called \(spam\). From the summary below we know there are 2788 instances classified as \(ham\), codified by value 0 (\(not spam\)), and 1813 instances codified by 1, which denotes spam emails. There are a lot of numeric features in this data set; we use only the first 5 numerical features for prediction.
>>summary(frame, [word_freq_make, word_freq_address, word_freq_all, word_freq_3d, word_freq_our, spam])
rows: 4601, cols: 6
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu. : 0.000 1st Qu. : 0.000 1st Qu. : 0.000 1st Qu. : 0.000 1st Qu. : 0.000
Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000
Mean : 0.105 Mean : 0.213 Mean : 0.281 Mean : 0.065 Mean : 0.312
2nd Qu. : 0.000 2nd Qu. : 0.000 2nd Qu. : 0.420 2nd Qu. : 0.000 2nd Qu. : 0.383
Max. : 4.540 Max. : 14.280 Max. : 5.100 Max. : 42.810 Max. : 10.000
0 : 2788
1 : 1813
Now we can do some predictions.
Binary classification
We will build 3 models for prediction. We will use the training set, which consists of 60% of our initial data. To test how well each model predicts we use the remaining data.
OneRule
This first model is one of the simplest models possible. It basically builds a decision tree with a single level. For documentation about this algorithm you can check the original paper: Holte, R.C. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning 11, 63-91 (1993).
OneRule oneRule = new OneRule();
oneRule.learn(train, "spam");
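To give an intuition of what a single-level decision tree means, here is a simplified sketch in plain Java. It is not rapaio's `OneRule` and it is not Holte's full algorithm, which splits a numeric attribute into several intervals; I assume binary labels 0 and 1 and a single cut point per attribute. The names `StumpSketch`, `Rule` and `fit` are hypothetical.

```java
// Hypothetical illustration of a one-level tree; not rapaio's OneRule.
public class StumpSketch {

    /** Predicts leftLabel when row[feature] <= threshold, and the other label otherwise. */
    public static final class Rule {
        public final int feature;
        public final double threshold;
        public final int leftLabel;
        public final int errors; // misclassified training rows

        Rule(int feature, double threshold, int leftLabel, int errors) {
            this.feature = feature;
            this.threshold = threshold;
            this.leftLabel = leftLabel;
            this.errors = errors;
        }

        public int predict(double[] row) {
            return row[feature] <= threshold ? leftLabel : 1 - leftLabel;
        }
    }

    /**
     * columns[f][i] is the value of feature f for training row i, y[i] is its label (0 or 1).
     * Tries every observed value of every feature as a cut point and keeps the rule
     * with the fewest training errors. Returns null when there are no rows.
     */
    public static Rule fit(double[][] columns, int[] y) {
        Rule best = null;
        for (int f = 0; f < columns.length; f++) {
            double[] x = columns[f];
            for (double t : x) {
                for (int leftLabel = 0; leftLabel <= 1; leftLabel++) {
                    int errors = 0;
                    for (int i = 0; i < y.length; i++) {
                        int predicted = x[i] <= t ? leftLabel : 1 - leftLabel;
                        if (predicted != y[i]) {
                            errors++;
                        }
                    }
                    if (best == null || errors < best.errors) {
                        best = new Rule(f, t, leftLabel, errors);
                    }
                }
            }
        }
        return best;
    }
}
```

The whole model is one attribute and one comparison, which is why such a rule is cheap to learn and easy to read, and also why it cannot capture interactions between attributes.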
One of the most used ways to check the performance of a classifier is the accuracy. Accuracy is the proportion of cases with a correct prediction out of the total number of cases. With the rapaio library, one way to see the accuracy is to summarize the confusion matrix.
ROC rocOR = new ROC(oneRule.getPrediction(), test.getCol("spam"), "1");
Confusion matrix
      |   Predicted
Actual|     0     1| Total
------ ------ ------ ------
     0|   941   182|  1123
     1|   361   357|   718
------ ------ ------ ------
 Total|  1302   539|  1841
Complete cases 1841 from 1841
Accuracy: 0.7051
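The accuracy can be checked by hand from the matrix: the correct predictions sit on the diagonal, so for OneRule we get (941 + 357) / 1841 = 0.7051. A minimal sketch of the same computation in plain Java follows; `AccuracySketch` is a hypothetical name and the code does not use rapaio.

```java
import java.util.Locale;

// Hypothetical helper, not part of rapaio.
public class AccuracySketch {

    /** matrix[actual][predicted] holds the number of test cases in that cell. */
    public static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < matrix.length; actual++) {
            for (int predicted = 0; predicted < matrix[actual].length; predicted++) {
                total += matrix[actual][predicted];
                if (actual == predicted) {
                    correct += matrix[actual][predicted]; // diagonal cell: prediction matches the label
                }
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        int[][] oneRule = {{941, 182}, {361, 357}}; // counts taken from the matrix above
        System.out.println(String.format(Locale.ROOT, "Accuracy: %.4f", accuracy(oneRule)));
    }
}
```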
Random Forest
The second prediction model is a random forest with 200 random trees.
RandomForest rf = new RandomForest().setMtrees(200);
rf.learn(train, "spam");
Confusion matrix
      |   Predicted
Actual|     0     1| Total
------ ------ ------ ------
     0|   968   155|  1123
     1|   272   446|   718
------ ------ ------ ------
 Total|  1240   601|  1841
Complete cases 1841 from 1841
Accuracy: 0.7681
AdaBoost.M1
The third prediction model is a boosting algorithm called AdaBoost.M1. This model is built with a decision stump as the weak learner and 200 boosting iterations. The following code shows how one can achieve that using rapaio.
AdaBoostM1 ab = new AdaBoostM1(new DecisionStump(), 200);
ab.learn(train, "spam");
Confusion matrix
      |   Predicted
Actual|     0     1| Total
------ ------ ------ ------
     0|   909   214|  1123
     1|   263   455|   718
------ ------ ------ ------
 Total|  1172   669|  1841
Complete cases 1841 from 1841
Accuracy: 0.7409
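For readers who want to see what one boosting iteration does, here is a sketch of the textbook AdaBoost.M1 re-weighting step in plain Java. I am assuming the standard formulation (weights that sum to 1, beta = err / (1 - err)); the class and method names are hypothetical and this is not a description of rapaio's internals.

```java
// Hypothetical illustration of one AdaBoost.M1 round; not rapaio's AdaBoostM1.
public class BoostRoundSketch {

    /** Weighted error of the weak learner: total weight of the misclassified instances. */
    public static double weightedError(double[] weights, boolean[] correct) {
        double err = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (!correct[i]) {
                err += weights[i];
            }
        }
        return err;
    }

    /**
     * Shrinks the weights of correctly classified instances by beta and renormalizes,
     * so the next weak learner concentrates on the instances that were missed.
     * Expects weights that sum to 1 and a weak learner with 0 < err < 0.5.
     */
    public static double[] reweight(double[] weights, boolean[] correct) {
        double err = weightedError(weights, correct);
        if (err <= 0.0 || err >= 0.5) {
            throw new IllegalArgumentException("weighted error must be in (0, 0.5), got " + err);
        }
        double beta = err / (1.0 - err);
        double[] updated = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            updated[i] = correct[i] ? weights[i] * beta : weights[i];
            sum += updated[i];
        }
        for (int i = 0; i < updated.length; i++) {
            updated[i] /= sum;
        }
        return updated;
    }

    /** Weight of this weak learner in the final vote: log(1 / beta). */
    public static double voteWeight(double err) {
        return Math.log((1.0 - err) / err);
    }
}
```

As a small worked case, with four instances of weight 0.25 and one of them misclassified, the error is 0.25 and beta is 1/3; after the update the misclassified instance carries weight 0.5 and each of the other three carries 1/6.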
ROC Curves
When accuracy is used to compare the performance of classifiers, the comparison is very often misleading. That happens because accuracy depends on factors such as the class distribution and the relative cost of each kind of error, and it assumes these are fixed and known, which is not always true.
I will not explain what a ROC graph is. There is enough literature on this topic. Among many useful documents, I found one which gives crystal clear details and explanations on ROC curves: Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine Learning.
To draw ROC graphs for the previous models with rapaio you can use the ROCCurve plot component, which builds and draws a curve from a computed ROC object. The following code does this.
ROC rocOR = new ROC(oneRule.getPrediction(), test.getCol("spam"), "1");
ROC rocRF = new ROC(rf.getDistribution().getCol("1"), test.getCol("spam"), "1");
ROC rocAB = new ROC(ab.getDistribution().getCol("1"), test.getCol("spam"), "1");
draw(new Plot()
        .add(new ROCCurve(rocOR).setColorIndex(1))
        .add(new ROCCurve(rocRF).setColorIndex(2))
        .add(new ROCCurve(rocAB).setColorIndex(3))
        .add(new Legend(0.6, 0.33,
                new String[]{"onerule", "rf", "adaboost.m1"},
                new int[]{1, 2, 3})),
        600, 400);
As you can see, ROC objects are used to compute the values for ROC curves, and the ROCCurve plot component is used to add them to a plot. Note, however, that the random forest model exhibits a ROC graph which is better than the AdaBoost model most of the time in the conservative area of the graph. AdaBoost tends to be a little better in the liberal area, but in the extreme liberal area the random forest model again exhibits better performance.
OneRule behaves sub-optimally, as expected in this specific case.
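If you wonder where the points of such a curve come from, the sketch below computes them in plain Java by sweeping a threshold from the highest score to the lowest, in the way described in Fawcett's notes. The conservative area corresponds to the first points (high thresholds, few false positives) and the liberal area to the last ones. `RocSketch` and its methods are hypothetical names; this is not the code behind rapaio's `ROC` class, and it assumes at least one positive and one negative case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of ROC points; not rapaio's ROC class.
public class RocSketch {

    /** Returns {falsePositiveRate, truePositiveRate} points, starting at (0,0) and ending at (1,1). */
    public static List<double[]> rocPoints(double[] scores, boolean[] positive) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Highest score first: lowering the threshold accepts one more instance at a time.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        int totalPos = 0;
        for (boolean p : positive) {
            if (p) {
                totalPos++;
            }
        }
        int totalNeg = scores.length - totalPos;

        List<double[]> points = new ArrayList<>();
        points.add(new double[]{0.0, 0.0});
        int tp = 0;
        int fp = 0;
        for (int k = 0; k < order.length; k++) {
            int i = order[k];
            if (positive[i]) {
                tp++;
            } else {
                fp++;
            }
            // Emit a point only after the last instance of a group of tied scores.
            boolean lastOfTie = k == order.length - 1 || scores[order[k + 1]] != scores[i];
            if (lastOfTie) {
                points.add(new double[]{(double) fp / totalNeg, (double) tp / totalPos});
            }
        }
        return points;
    }

    /** Area under the curve, using the trapezoid rule over consecutive points. */
    public static double auc(List<double[]> points) {
        double area = 0.0;
        for (int k = 1; k < points.size(); k++) {
            double[] p = points.get(k - 1);
            double[] q = points.get(k);
            area += (q[0] - p[0]) * (p[1] + q[1]) / 2.0;
        }
        return area;
    }
}
```

The scores play the role of the values passed as the first argument to the ROC objects above, for example the probability of class "1", and the boolean array marks which test cases are actually spam.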
>>>This tutorial is generated with Rapaio document printer facilities.<<<