December 23, 2013

Chebyshev Bias

Every so often, my good friend Aurel comes to me with mathematical facts, more or less amusing, but always interesting. This time he told me about something called the Chebyshev bias.

Chebyshev noted in 1853 that, if you take the primes modulo 4, those which leave remainder 1 occur less often than those which leave remainder 3. We denote by \(\pi(x,4,1)\) the number of primes \(p \leq x\) congruent to 1 modulo 4, and by \(\pi(x,4,3)\) the number of primes \(p \leq x\) congruent to 3 modulo 4.

It is a known fact that \(\pi(x,q,a) \sim \frac{1}{\varphi(q)}\frac{x}{\ln x}\) whenever \(a\) and \(q\) are coprime. This asymptotic does not depend on the residue \(a\). As a consequence, one would expect that \(\pi(x,4,1) \sim \pi(x,4,3)\), which is indeed true when \(x \to \infty\).
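
To make this concrete for our case: the modulus \(q = 4\) has \(\varphi(4) = 2\) coprime residue classes, namely 1 and 3, so each of them should asymptotically contain about \(\frac{1}{2}\frac{x}{\ln x}\) of the primes up to \(x\).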

If we draw the ratio between \(\pi(x,4,3)\) and \(\pi(x,4,1)\) we can start to see what Chebyshev noted. The ratio gets closer to 1, but it seems to stay almost always slightly above it.

[Figure: ratio \(\pi(x,4,3) / \pi(x,4,1)\), approaching 1 from above]

But we get a better picture if we plot the difference between those values instead of the ratio. Denote \(\delta(x) = \pi(x,4,3) - \pi(x,4,1)\). If we plot the \(\delta\) values for \(x \leq 5000\) we have the plot below.

[Figure: \(\delta(x)\) for \(x \leq 5000\)]
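
For the curious, here is a minimal sketch of how such values can be computed (my own illustration, not the exact code behind these plots): a naive sieve of Eratosthenes followed by counting residues modulo 4.
        // naive sieve of Eratosthenes: composite[i] == false means i is prime
        int limit = 5000;
        boolean[] composite = new boolean[limit + 1];
        for (int i = 2; i * i <= limit; i++) {
            if (!composite[i]) {
                for (int j = i * i; j <= limit; j += i) {
                    composite[j] = true;
                }
            }
        }
        // delta(x) = #{p <= x : p = 3 mod 4} - #{p <= x : p = 1 mod 4}
        int delta = 0;
        for (int x = 2; x <= limit; x++) {
            if (!composite[x] && x % 4 == 3) delta++;
            if (!composite[x] && x % 4 == 1) delta--;
            System.out.println(x + "\t" + delta);
        }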

It is obvious that the value of the delta function is greater than or equal to 0, with one exception. This exception was also determined by Chebyshev, and the value is 2946. But is this the only exception? We can see more if we plot the delta values for \(x \leq 60000\).

[Figure: \(\delta(x)\) for \(x \leq 60000\)]

It is obvious that there are some points where the delta function is negative, clustered around 50380. This time there are many such points, not just one.

If I plot the delta for all the prime numbers that I have (in a file computed with a naive sieve), we see that the next time the delta function takes negative values is somewhere around 48 million, which is pretty high.

[Figure: \(\delta(x)\) over all available primes; the next negative region appears around 48 million]

Of course, we cannot draw any firm conclusion from these plots alone. The delta function looks like one which is able to provide unexpected behavior. However, it seems safe to say that most of the time the delta function has a positive value. For me, at least, this is an interesting and unexpected fact.


December 21, 2013

ROC graphs with Rapaio


Data set preparation

For exemplification I will use the classical spam data set. We load the data set and split it randomly in two pieces. The first sample will be used for training purposes and holds roughly 60% of the data; the second sample will be used for testing our model.
        // fix the seed for reproducibility; keep only the first 5 features and the class
        RandomSource.setSeed(2718);
        final Frame spam = ColFilters.retainCols(Datasets.loadSpamBase(), "0-4,spam");
        // random split: ~60% of the rows for training, the rest for testing
        List<Frame> samples = randomSample(spam, new int[]{(int) (spam.getRowCount() * 0.6)});
        final Frame train = samples.get(0);
        final Frame test = samples.get(1);
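Note that \((int)(4601 \cdot 0.6) = 2760\) rows go to the train sample, which leaves \(4601 - 2760 = 1841\) rows for testing; this matches the test totals reported in the confusion matrices below.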
If you are not familiar with how the spam data set looks, what you have to know is that it consists of many numerical attributes used to predict a nominal attribute called \(spam\).
There are 2788 instances classified as \(ham\), coded by value 0 (not spam), and 1813 instances coded by 1, which denotes spam emails. There are a lot of numeric features in this data set; we use only the first 5 numerical features for prediction.
>>summary(frame, [word_freq_make, word_freq_address, word_freq_all, word_freq_3d, word_freq_our, spam])
rows: 4601, cols: 6
 word_freq_make  word_freq_address    word_freq_all      word_freq_3d     word_freq_our 
   Min. : 0.000      Min. :  0.000     Min. : 0.000     Min. :  0.000     Min. :  0.000 
1st Qu. : 0.000   1st Qu. :  0.000  1st Qu. : 0.000  1st Qu. :  0.000  1st Qu. :  0.000 
 Median : 0.000    Median :  0.000   Median : 0.000   Median :  0.000   Median :  0.000 
   Mean : 0.105      Mean :  0.213     Mean : 0.281     Mean :  0.065     Mean :  0.312 
3rd Qu. : 0.000   3rd Qu. :  0.000  3rd Qu. : 0.420  3rd Qu. :  0.000  3rd Qu. :  0.383 
   Max. : 4.540      Max. : 14.280     Max. : 5.100     Max. : 42.810     Max. : 10.000 

    spam 
0 : 2788 
1 : 1813 
Now we can do some predictions.

Binary classification

We will build 3 models for prediction. We will use the train set, which consists of about 60% of our initial data. For testing how well each model predicts we use the remaining data.

OneRule

This first model is one of the simplest models possible. It basically builds a decision tree with a single level (a sketch of the idea follows the code below). For documentation about this algorithm you can check the original paper: Holte, R.C., Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, Machine Learning 11, 63-91 (1993).
        OneRule oneRule = new OneRule();
        oneRule.learn(train, "spam");
        oneRule.predict(test);
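To make the idea concrete, here is a hypothetical sketch of OneRule on nominal data (an illustration only, not rapaio's implementation, which also handles numeric attributes by binning them). For each attribute, the rule predicts the majority class of each attribute value; we keep the attribute whose rule makes the fewest training errors. The variables rows, labels and attrCount are assumed inputs, and java.util.Map / java.util.HashMap are used.
        // hypothetical illustration: rows[i][a] is the value of attribute a
        // on instance i, labels[i] is its class
        int bestAttr = -1;
        int bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < attrCount; a++) {
            // count how often each (value, class) pair occurs for attribute a
            Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
            for (int i = 0; i < labels.length; i++) {
                Map<String, Integer> byClass = counts.get(rows[i][a]);
                if (byClass == null) {
                    byClass = new HashMap<String, Integer>();
                    counts.put(rows[i][a], byClass);
                }
                Integer old = byClass.get(labels[i]);
                byClass.put(labels[i], old == null ? 1 : old + 1);
            }
            // the one-level rule predicts the majority class for each value,
            // so its training errors are the non-majority counts
            int errors = 0;
            for (Map<String, Integer> byClass : counts.values()) {
                int total = 0, majority = 0;
                for (int cnt : byClass.values()) {
                    total += cnt;
                    majority = Math.max(majority, cnt);
                }
                errors += total - majority;
            }
            if (errors < bestErrors) {
                bestErrors = errors;
                bestAttr = a;
            }
        }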
One of the most used ways to check the performance of a classifier is accuracy. Accuracy is the percentage of cases with a correct prediction out of the total number of cases. With the rapaio library, one way to see the accuracy is to summarize the confusion matrix.
Confusion matrix

      | Predicted
Actual|     0      1| Total 
------ ------ ------ ------ 
     0|   941    182|  1123
     1|   361    357|   718
------ ------ ------ ------ 
 Total|  1302    539|  1841

Complete cases 1841 from 1841
Accuracy: 0.7051
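
As a sanity check, the reported accuracy is just the proportion of correct predictions on the diagonal of the matrix: \((941 + 357) / 1841 = 1298 / 1841 \approx 0.7051\).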

Random Forest

The second prediction model is a random forest with 200 random trees.
        RandomForest rf = new RandomForest().setMtrees(200);
        rf.learn(train, "spam");
        rf.predict(test);
Confusion matrix

      | Predicted
Actual|     0      1| Total 
------ ------ ------ ------ 
     0|   968    155|  1123
     1|   272    446|   718
------ ------ ------ ------ 
 Total|  1240    601|  1841

Complete cases 1841 from 1841
Accuracy: 0.7681

AdaBoost.M1

The third prediction model is a boosting algorithm called AdaBoost.M1. This model is built with a decision stump as weak learner and 200 boosting iterations. The following code shows how one can achieve that using rapaio.
        AdaBoostM1 ab = new AdaBoostM1(new DecisionStump(), 200);
        ab.learn(train, "spam");
        ab.predict(test);
Confusion matrix

      | Predicted
Actual|     0      1| Total 
------ ------ ------ ------ 
     0|   909    214|  1123
     1|   263    455|   718
------ ------ ------ ------ 
 Total|  1172    669|  1841

Complete cases 1841 from 1841
Accuracy: 0.7409
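
For reference, each boosting iteration in Freund and Schapire's AdaBoost.M1 works as follows: train the weak learner on the weighted sample, compute its weighted error \(\epsilon_t\) (the iteration requires \(\epsilon_t < 1/2\)), set \(\beta_t = \epsilon_t / (1 - \epsilon_t)\), multiply the weight of every correctly classified instance by \(\beta_t\) and renormalize. The final prediction for an instance \(x\) is the class \(y\) which maximizes \(\sum_{t : h_t(x) = y} \log(1/\beta_t)\).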

ROC Curves

When accuracy is used to compare the performance of classifiers, the comparison is very often misleading. That happens because accuracy depends on the class distribution and on a single fixed decision threshold, assumptions which are not always appropriate.
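To see why on our own data: a trivial classifier which predicts 0 (\(ham\)) for every test instance would already reach an accuracy of \(1123 / 1841 \approx 0.61\) without learning anything, simply because the classes are imbalanced.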
I will not explain what a ROC graph is; there is enough literature on this topic. Among many useful documents, I found one which gives crystal clear details and explanations on ROC curves: Fawcett, T. (2004). ROC Graphs: Notes and Practical Considerations for Researchers. Machine Learning.
In order to draw ROC graphs for the previous models with rapaio you can use the ROCCurve plot component, which builds and draws a curve according to a given computed ROC object. The following code does this.
[Figure: ROC curves for onerule, rf and adaboost.m1, produced by the code below]

        // build ROC data from the nominal predictions (OneRule) or from the
        // scores of class "1" (RandomForest, AdaBoost.M1), against the true classes
        ROC rocOR = new ROC(oneRule.getPrediction(), test.getCol("spam"), "1");
        ROC rocRF = new ROC(rf.getDistribution().getCol("1"), test.getCol("spam"), "1");
        ROC rocAB = new ROC(ab.getDistribution().getCol("1"), test.getCol("spam"), "1");
        // draw the three curves on the same plot, with a legend
        draw(new Plot()
                .add(new ROCCurve(rocOR).setColorIndex(1))
                .add(new ROCCurve(rocRF).setColorIndex(2))
                .add(new ROCCurve(rocAB).setColorIndex(3))
                .add(new Legend(0.6, 0.33, 
                        new String[]{"onerule", "rf", "adaboost.m1"}, 
                        new int[]{1, 2, 3})),
                600, 400);
As you can see, ROC objects are used to compute the values of the ROC curves, and the ROCCurve plot component is used to draw them on a plot graphic.
Note, however, that the random forest model exhibits a ROC graph which is better than the AdaBoost model most of the time in the conservative area of the graph. AdaBoost tends to be a little better in the liberal area, but in the extreme liberal area the random forest model again exhibits better performance.
OneRule behaves sub-optimally, as expected in this specific case.
This tutorial is generated with Rapaio document printer facilities.
