October 23, 2013

Discrete Sampling with Rapaio


In statistics, sampling is the process of selecting a subset of observations from within a statistical population. There are many types of sampling methods which can be employed in statistics, like simple random sampling, stratified sampling, systematic sampling and so on.
The purpose of this tutorial, however, is narrower: to present four discrete random sampling algorithms which are used as building blocks by various statistical sampling methods.
Definition:
A statistical distribution whose random variable can take only discrete values is called a discrete distribution.
A discrete probability distribution function is completely described by the set of possible values the random variable can take and by the probabilities assigned to each value.
An example of a discrete distribution is the process of throwing a standard die. We have a finite set of outcomes of the process (6 possible values) and a probability associated with each outcome (for a fair die we can associate the probability \( p(x_i) = \frac{1}{6} \)).
Drawing a sample from a distribution is the process of selecting some values from the possible values \( x_i, i=1..n \) according to their probabilities. Sampling is useful for far more purposes than I can describe here. Two very important scenarios are simulation and the fact that working with a smaller sample than the given population is faster and still provides enough information for analysis. For the latter, note that working with the heights and weights of all the people in the world (assuming this information could even be collected, there are probably over 7 billion records) is much harder than working with a much smaller sample.

Uniform random sample with replacement

A uniform random sample is a sample from a discrete population with a uniform distribution. A discrete uniform distribution is a distribution which assigns equal probability mass to each outcome. The previous example of throwing a fair die is an example of a uniform distribution, since it assigns the equal value \(\frac{1}{6}\) to each possible outcome \( x_i \).
A sample with replacement is a sample where the values of the sample can appear multiple times. The expression "with replacement" can be misleading. The intuition behind it follows from this description of the process:
We have a set of possible elements \(x_i\), each with an assigned equal probability \(p(x_i)\). Take one element from the set at random, according to these probabilities (denote the taken element by \(x_k\)). Replace the element taken from the set with a new element which has the same value as the element previously removed. At this stage we have again a situation identical to the initial one. Repeat the process of taking elements from the original set, followed by replacing each element with a similar one, until you collect the desired number of elements for the sample.
Of course, we don't have to actually replace the element. Repeating the process of throwing a fair die multiple times is sampling with replacement (we don't remove and replace anything during the process).
The algorithm for drawing a discrete uniform sample with replacement is fairly simple, since it is based on a basic operation provided by the Java language (java.util.Random.nextInt(int n)).
        // draw 1000 outcomes of a fair 6-sided die and plot their frequencies
        int[] sample = new DiscreteSamplingWR(6).sample(1000);
        Vector vector = new NumericVector("fair-dice", sample);
        draw(new Histogram(vector, 6, false), 500, 200);
[figure: histogram of the 1000 fair-die outcomes]

In the histogram above we see the frequencies obtained by taking a sample of size 1000 of the fair-die process outcomes. We note that the proportions are roughly equal, which agrees with our assumption that each element has equal probability. The values in the sample, however, look random:
2, 3, 4, 1, 0, 1, 4, 0, 3, 4, 5, 4, 3, 0, 3, 2, 2, 2, 1, 4, 5, 1, 4, 4, 2, 3, 5, 3, 4, 4, 
2, 3, 4, 5, 3, 1, 5, 0, 0, 1, 1, 3, 1, 0, 5, 1, 5, 4, 4, 2, 2, 4, 0, 2, 1, 4, 2, 2, 2, 3, 
3, 0, 3, 2, 0, 1, 0, 4, 1, 3, 0, 2, 2, 2, 4, 1, 3, 2, 2, 4, 5, 2, 2, 0, 1, 0, 0, 5, 3, 3, 
2, 3, 0, 1, 0, 1, 2, 4, 4, 4, ...
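
For intuition, here is a minimal sketch of how such a sampler can be written with nothing but java.util.Random; the class name mirrors DiscreteSamplingWR, but this is an illustration of the idea, not the Rapaio source.

        import java.util.Random;

        // Illustrative sketch, not the Rapaio implementation: each draw is
        // independent, so uniform sampling with replacement reduces to
        // repeated calls to nextInt.
        public class UniformSamplerWR {
            private final Random random = new Random();
            private final int populationSize;

            public UniformSamplerWR(int populationSize) {
                this.populationSize = populationSize;
            }

            public int[] sample(int sampleSize) {
                int[] sample = new int[sampleSize];
                for (int i = 0; i < sampleSize; i++) {
                    sample[i] = random.nextInt(populationSize); // value in 0..populationSize-1
                }
                return sample;
            }
        }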

Uniform random sample without replacement

Sampling without replacement implies that during the selection of the elements which will be collected in the desired sample, the chosen elements are not available again for further selection. A process like this appears when numbers are drawn for a lottery: the already selected numbers are not available for selection again. Another expression which, at least for me, is much clearer is "without repetition". So, with replacement means a sample with repetition, and without replacement means a sample without repetition.
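A standard way to implement uniform sampling without replacement is a partial Fisher-Yates shuffle. The sketch below is an assumption about how this can be done, not necessarily how DiscreteSamplingWOR works internally.

        import java.util.Arrays;
        import java.util.Random;

        // Sketch: after sampleSize swap steps of a Fisher-Yates shuffle, the
        // first sampleSize positions hold a uniform sample without replacement.
        public final class UniformSamplerWOR {
            public static int[] sample(int populationSize, int sampleSize, Random random) {
                int[] population = new int[populationSize];
                for (int i = 0; i < populationSize; i++) {
                    population[i] = i;
                }
                for (int i = 0; i < sampleSize; i++) {
                    // choose uniformly among the not-yet-selected elements
                    int j = i + random.nextInt(populationSize - i);
                    int tmp = population[i];
                    population[i] = population[j];
                    population[j] = tmp;
                }
                return Arrays.copyOf(population, sampleSize);
            }
        }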
Let's suppose that we have a 6/49 lottery. At each draw, 6 unique numbers in the range 1-49 are drawn. We take 100 draws and simulate:
        final int TRIALS = 100;
        final int SAMPLE_SIZE = 6;
        final int POPULATION_SIZE = 49;
        Vector[] vectors = new Vector[2];
        vectors[0] = new NumericVector("lottery trial", SAMPLE_SIZE * TRIALS);
        vectors[1] = new NumericVector("winning number", SAMPLE_SIZE * TRIALS);

        for (int i = 0; i < TRIALS; i++) {
            // 6 unique values from 0..48, shifted by 1 below to obtain numbers in 1..49
            int[] numbers = new DiscreteSamplingWOR(POPULATION_SIZE).sample(SAMPLE_SIZE);
            for (int j = 0; j < numbers.length; j++) {
                vectors[0].setValue(i * SAMPLE_SIZE + j, i + 1);
                vectors[1].setValue(i * SAMPLE_SIZE + j, numbers[j] + 1);
            }
            }
        }

        final Frame df = new SolidFrame("lottery", SAMPLE_SIZE * TRIALS, vectors);
        draw(new Plot() {{
            add(new Points(this, df.getCol(0), df.getCol(1)));
            getOp().setPchIndex(new OneIndexVector(1));
            getOp().setColorIndex(new OneIndexVector(34));
        }}, 600, 300);
[figure: scatter plot of winning numbers against lottery trial]

There is randomness in that plot. Everywhere. A summary of the data, however, gives us enough clues to see that the distribution of those numbers is still symmetric and roughly uniform.
>>summary("lottery", [lottery trial, winning number])
rows: 600, cols: 2
    lottery trial    winning number 
   Min. :   1.000     Min. :  1.000 
1st Qu. :  25.417  1st Qu. : 13.000 
 Median :  50.500   Median : 25.000 
   Mean :  50.500     Mean : 25.148 
3rd Qu. :  75.583  3rd Qu. : 38.000 
   Max. : 100.000     Max. : 49.000 
                                    

Weighted random sample with replacement

A weighted sample is a discrete random sample which does not have a uniform distribution. A well-known example used in almost all introductory probability classes is the biased coin, a coin which is not fair: the probability of seeing HEAD after a toss is different from the probability of seeing TAIL.
Let's suppose we have a biased coin with \( p(coin=HEAD) = 0.6\) and \( p(coin=TAIL)=0.4\). We simulate this experiment by tossing the coin many times. The law of large numbers tells us that after a reasonable number of repetitions the plug-in estimator will tend toward the population parameter it estimates.
During the experiment we will toss the coin 300 times and plot the plug-in estimate, which is the number of times HEAD was drawn divided by the number of tosses so far.
        RandomSource.setSeed(1);
        // the vectors are sized to match the 300 tosses of the experiment
        final Vector index = new IndexVector("experiment no.", 1, 300, 1);
        final Vector value = new NumericVector("HEAD/TOTAL", 300);
        double count = 0;
        double total = 0;
        for (int i = 0; i < 300; i++) {
            // sample index 0 stands for HEAD (weight 0.6), index 1 for TAIL (weight 0.4)
            int[] samples = new DiscreteWeightedSamplingWR(new double[]{0.6, 0.4}).sample(1);
            if (samples[0] == 0) count++;
            total++;
            value.setValue(i, count / total);
        }
        draw(new Plot() {{
            add(new ABLine(this, 0.6, true));
            add(new Lines(this, index, value) {{
                opt().setColorIndex(new OneIndexVector(2));
                opt().setLwd(1.5f);
            }});
            getOp().setYRange(0, 1);
        }});
[figure: running proportion of HEAD draws, with a reference line at 0.6]

From the previous plot we see that the plug-in estimate of the probability of HEAD has a large variation at the beginning of our experiment. However, as the number of trials increases, we have clear reason to believe that the coin is biased: the variation decreases and the estimator converges to the value 0.6, which is not what we would expect from a fair coin.
The sampling algorithm implemented belongs to the family of alias methods; specifically, it is Vose's algorithm, one of the linear-time algorithms used today for discrete weighted random sampling. See more about this algorithm here: http://en.wikipedia.org/wiki/Alias_method.
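For the curious, here is a compact sketch of the construction, written from the published description of Vose's algorithm; it assumes the input probabilities are normalized to sum to 1 and is not the Rapaio source.

        import java.util.ArrayDeque;
        import java.util.Deque;
        import java.util.Random;

        // Sketch of Vose's alias method: O(n) table construction, O(1) per draw.
        public final class AliasSampler {
            private final double[] prob;
            private final int[] alias;
            private final Random random = new Random();

            public AliasSampler(double[] p) {
                int n = p.length;
                prob = new double[n];
                alias = new int[n];
                double[] scaled = new double[n];
                Deque<Integer> small = new ArrayDeque<>();
                Deque<Integer> large = new ArrayDeque<>();
                for (int i = 0; i < n; i++) {
                    scaled[i] = p[i] * n;
                    if (scaled[i] < 1.0) small.push(i); else large.push(i);
                }
                while (!small.isEmpty() && !large.isEmpty()) {
                    int s = small.pop(), l = large.pop();
                    prob[s] = scaled[s];   // column s keeps its own mass ...
                    alias[s] = l;          // ... and borrows the rest from l
                    scaled[l] = scaled[l] + scaled[s] - 1.0;
                    if (scaled[l] < 1.0) small.push(l); else large.push(l);
                }
                // numeric leftovers: remaining columns are filled completely
                while (!large.isEmpty()) prob[large.pop()] = 1.0;
                while (!small.isEmpty()) prob[small.pop()] = 1.0;
            }

            public int sample() {
                int column = random.nextInt(prob.length);  // pick a column uniformly
                return random.nextDouble() < prob[column] ? column : alias[column];
            }
        }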

Weighted random sample without replacement

This is the last type of discrete random sampling covered here. What we are interested in is generating samples without replacement (without repetition) from a discrete distribution other than the uniform one.
We consider again the lottery experiment. However, we want to simulate a situation where some winning numbers are preferred over others. Let's suppose that our lottery favors big numbers, those greater than or equal to 40, while some other numbers, those in the interval 8-12, have a smaller probability than usual. We repeat the experiment with a weighted sampling technique.
        vectors[0] = new NumericVector("loaded lottery", SAMPLE_SIZE * TRIALS);
        vectors[1] = new NumericVector("winning number", SAMPLE_SIZE * TRIALS);

        double[] prob = new double[49];
        // prob[i] is the (unnormalized) weight of the winning number i + 1
        for (int i = 0; i < prob.length; i++) {
            if (i + 1 >= 8 && i + 1 <= 12) {
                prob[i] = 3;
                continue;
            }
            if (i + 1 >= 40) {
                prob[i] = 30;
                continue;
            }
            prob[i] = 10;
        }
        for (int i = 0; i < TRIALS; i++) {
            int[] numbers = new DiscreteWeightedSamplingWOR(prob).sample(SAMPLE_SIZE);
            for (int j = 0; j < numbers.length; j++) {
                vectors[0].setValue(i * SAMPLE_SIZE + j, i + 1);
                vectors[1].setValue(i * SAMPLE_SIZE + j, numbers[j] + 1);
            }
        }

        final Frame df2 = new SolidFrame("lottery", SAMPLE_SIZE * TRIALS, vectors);
        draw(new Plot() {{
            add(new Points(this, df2.getCol(0), df2.getCol(1)));
            getOp().setPchIndex(new OneIndexVector(1));
            getOp().setColorIndex(new OneIndexVector(34));
            getOp().setSizeIndex(new OneNumericVector(2));
        }}, 600, 300);
[figure: scatter plot of winning numbers for the loaded lottery]

This time we see more than randomness there. There is a clearly denser region in the upper side of the graph. Also, we can note, although not as clearly, a stripe with low density somewhere under y=13.
To clarify, a kernel density plot which approximates the population density helps more.
[figure: kernel density estimate of the loaded lottery winning numbers]

The Rapaio implementation of this last algorithm is based on a wonderful algorithm invented by Efraimidis and Spirakis: http://link.springer.com/content/pdf/10.1007/978-0-387-30162-4_478.pdf.
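The idea fits in a few lines: draw for each item a key \( u_i^{1/w_i} \) with \( u_i \) uniform on (0,1), then keep the items with the largest keys. A minimal sketch written from that description (not the Rapaio source):

        import java.util.PriorityQueue;
        import java.util.Random;

        // Sketch of the Efraimidis-Spirakis scheme: the sampleSize items with
        // the largest keys u^(1/w) form a weighted sample without replacement.
        public final class WeightedSamplerWOR {
            public static int[] sample(double[] weights, int sampleSize, Random random) {
                // min-heap ordered by key keeps the current sampleSize largest keys
                PriorityQueue<double[]> heap =
                        new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
                for (int i = 0; i < weights.length; i++) {
                    double key = Math.pow(random.nextDouble(), 1.0 / weights[i]);
                    if (heap.size() < sampleSize) {
                        heap.add(new double[]{key, i});
                    } else if (key > heap.peek()[0]) {
                        heap.poll();
                        heap.add(new double[]{key, i});
                    }
                }
                int[] sample = new int[heap.size()];
                for (int i = 0; i < sample.length; i++) {
                    sample[i] = (int) heap.poll()[1];
                }
                return sample;
            }
        }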
Note: the sole purpose of this tutorial is to show what and how it can be done with Rapaio toolbox library.
>>>This tutorial is generated with Rapaio document printer facilities.<<<


October 16, 2013

Rapaio Random Forest - a benchmark - updated

While searching for various ideas and implementations of random forests I noticed a project called fast-random-forest. This is basically a faster and optimized version of the Weka implementation. Take a look here: http://code.google.com/p/fast-random-forest/.

I won't discuss the implementation here. Some things seem to be faster than what Weka provides, others seem to provide no significant performance boost. Anyway, the author deserves kudos for his effort and for open-sourcing the code.

However, what I found more interesting for my purpose is a small benchmark which compares some results of the Weka Random Forest and fast-random-forest. I checked my implementation against those two.

Here are some results:

test               Weka       FastRF      RapaioRF
anneal        0.9960000000 0.9930000000 0.8630289532
balance-scale 0.8160000000 0.8130000000 0.8208000000
breast-w      0.9670000000 0.9680000000 0.9642346209
credit-a      0.8620000000 0.8650000000 0.8724637681
credit-g      0.7580000000 0.7510000000 0.7520000000
diabetes      0.7630000000 0.7620000000 0.7565104167
hypothyroid   0.9940000000 0.9950000000 0.9904559915
kr-vs-kp      0.9920000000 0.9870000000 0.9881101377
mushroom      1.0000000000 1.0000000000 1.0000000000
segment       0.9810000000 0.9800000000 0.9766233766
sick          0.9830000000 0.9840000000 0.9803817603
soybean       0.9320000000 0.9250000000 0.9385065886
vehicle       0.7470000000 0.7520000000 0.7517730496
vowel         0.9810000000 0.9680000000 0.9585858586
letter        0.9620000000 0.9620000000 0.9659500000
splice        0.9300000000 0.9410000000 0.9639498433
waveform-5000 0.8480000000 0.8480000000 0.8494000000


Note: the comparison does not identify a best algorithm. The test is not intended to induce the idea of searching for the best implementation. Such an attempt is a dead end for far too many reasons, and I won't spend time enumerating them here.

However, what I wanted to know is whether my implementation is comparable. And it is.

Later update: after I fixed a bug, I ran the same test again.

         test         Weka       FastRF     RapaioRF
       anneal 0.9960000000 0.9930000000 0.9109131403
balance-scale 0.8160000000 0.8130000000 0.8208000000
     breast-w 0.9670000000 0.9680000000 0.9642346209
     credit-a 0.8620000000 0.8650000000 0.8753623188
     credit-g 0.7580000000 0.7510000000 0.7630000000
     diabetes 0.7630000000 0.7620000000 0.7643229167
  hypothyroid 0.9940000000 0.9950000000 0.9931071050
     kr-vs-kp 0.9920000000 0.9870000000 0.9871714643
     mushroom 1.0000000000 1.0000000000 1.0000000000
      segment 0.9810000000 0.9800000000 0.9753246753
         sick 0.9830000000 0.9840000000 0.9827677625
      soybean 0.9320000000 0.9250000000 0.9355783309
      vehicle 0.7470000000 0.7520000000 0.7565011820
        vowel 0.9810000000 0.9680000000 0.9626262626
       letter 0.9620000000 0.9620000000 0.9658500000
       splice 0.9300000000 0.9410000000 0.9620689655
waveform-5000 0.8480000000 0.8480000000 0.8522000000


This time the results of my implementation seem better. Again, I am not looking for the best implementation. There are, however, some differences which I think are worth highlighting here:
  • The Rapaio implementation follows the instructions proposed by Breiman regarding the impurity function to be used. I use the Gini impurity function, and the splitting criterion is the Gini impurity gain given by the split (see the sketch after this list). Both Weka and fast-random-forest use InfoGain as the impurity function.
  • Gini variable importance, one of the variable importance measures proposed by Breiman, is computed based on the Gini index (again, both Weka and fast-random-forest use InfoGain for that purpose).
  • Rapaio and fast-random-forest use binary splits for growing trees. Weka does not.
  • Regarding missing values there are even more differences:
    • Impurity value computation: Weka incorporates missing values in the computed distributions, proportionally with the class frequencies. Fast-random-forest does the same; however, for nominal attributes it uses only binary splits. Rapaio ignores the missing values for this computation.
    • Split data: Weka and Rapaio use all instances with missing values on the selected attribute; these instances are distributed to all children nodes (only 2 for Rapaio, possibly more than 2 for Weka) with weights adjusted by the proportions given by the class distribution. Fast-random-forest does not: it distributes instances randomly to the children nodes, proportionally with the class distribution.
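To make the first point concrete, here is a minimal sketch of the two quantities involved; counts[c] is assumed to hold the number of instances of class c in a node, and these helpers are illustrative, not the Rapaio source.

        // Gini impurity of a node: 0 for a pure node, approaching 1 for a mixed one.
        public static double gini(double[] counts) {
            double total = 0;
            for (double c : counts) total += c;
            if (total == 0) return 0;
            double sumSquares = 0;
            for (double c : counts) sumSquares += (c / total) * (c / total);
            return 1.0 - sumSquares;
        }

        // Gini gain of a binary split: parent impurity minus the weighted
        // average impurity of the two children.
        public static double giniGain(double[] parent, double[] left, double[] right) {
            double nl = 0, nr = 0;
            for (double c : left) nl += c;
            for (double c : right) nr += c;
            double n = nl + nr;
            return gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right);
        }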
The Rapaio implementation is probably the slowest among all of them. That is explained partly by the fact that "premature optimization is the root of all evil" and partly by the fact that it uses the most expensive choices.

October 9, 2013

Random Forests classification with Rapaio toolbox


Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees.
The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. More about this topic can be found at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Implementation

Rapaio toolbox is, and will continue to be, under construction for a long period. However, there is a Random Forest implementation available and ready to be used. Implemented features:
1. Classification only (regression will follow soon).
2. It uses the Gini impurity function for finding the best split attributes/values, according to the original specification. It could also be implemented with InfoGain (as it is in Weka), or something else. However, there is not a huge difference in results, so for now it remains as in the original specification.
3. It computes the OOB (out of bag) error, which is an estimate constructed in the same way as a cross-validation (a sketch of the idea follows this list). For faster execution one can disable the OOB computation.
4. It does not yet compute either of the two kinds of feature importance; they will be implemented soon.
5. There is no computation of proximities, and I do not know for sure if I want that in the immediate future.
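Since the OOB error comes up repeatedly below, here is a minimal sketch of the idea, with hypothetical shapes for the trees and the bootstrap bookkeeping (this is not the Rapaio source): every instance is predicted only by the trees whose bootstrap samples did not contain it, so the resulting misclassification rate behaves like a cross-validation estimate.

        import java.util.List;
        import java.util.function.IntUnaryOperator;

        public final class OobSketch {
            // Hypothetical shapes, for illustration only:
            //   predictors.get(t).applyAsInt(i) = class predicted by tree t for row i
            //   inBag.get(t)[i] = true if row i was in tree t's bootstrap sample
            public static double oobError(List<IntUnaryOperator> predictors,
                                          List<boolean[]> inBag,
                                          int[] trueClass, int numClasses) {
                int n = trueClass.length;
                int[][] votes = new int[n][numClasses];
                for (int t = 0; t < predictors.size(); t++) {
                    for (int i = 0; i < n; i++) {
                        if (!inBag.get(t)[i]) {  // only trees that never saw row i vote
                            votes[i][predictors.get(t).applyAsInt(i)]++;
                        }
                    }
                }
                int wrong = 0, counted = 0;
                for (int i = 0; i < n; i++) {
                    int best = 0, total = 0;
                    for (int c = 0; c < numClasses; c++) {
                        total += votes[i][c];
                        if (votes[i][c] > votes[i][best]) best = c;
                    }
                    if (total > 0) {
                        counted++;
                        if (best != trueClass[i]) wrong++;
                    }
                }
                return counted == 0 ? Double.NaN : wrong / (double) counted;
            }
        }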

Why Random Forests?

Random Forests are very popular due to the fact that the intuition behind the algorithm is easy to understand and, to some extent, easy to implement. However, I have found a very popular opinion that Random Forests are something like a panacea for learning. In my humble opinion they are far from that, simply because there is no such thing in machine learning.
Random Forests learn well in a variety of situations and are usually useful when it is very hard or complex to understand the mechanics of your data. However, finding and exploiting valuable knowledge in the data often works better than random forests.
I like random forests for some other qualities which I found more valuable, but not so popular:
- the ability to capture knowledge about the importance of the features
- the possibility of being used as an exploratory tool or for unsupervised learning
- the theory behind the algorithm, which explains how some variance vanishes and how some "noisy random salt" produces stability; all the inspiring simple or subtle things in the theory behind it

Data setup

I will use a classical data set called spam-base, imported into the Rapaio toolbox from the well-known UCI repository: http://archive.ics.uci.edu/ml/datasets/Spambase
In order to compute faster I will use only some dimensions of this dataset: for prediction I will use only the first 20 features and the class called "spam".
        Frame all = Datasets.loadSpamBase();
        all = ColFilters.retainCols(all, "1-20,spam");

        Summary.summary(ColFilters.retainCols(all, "1-5"));
        Summary.summary(ColFilters.retainCols(all, "spam"));
>>summary("spam-base", [word_freq_all, word_freq_3d, word_freq_our, word_freq_over, word_freq_remove])
rows: 4601, cols: 5
  word_freq_all      word_freq_3d     word_freq_our   word_freq_over  word_freq_remove 
   Min. : 0.000     Min. :  0.000     Min. :  0.000     Min. : 0.000      Min. : 0.000 
1st Qu. : 0.000  1st Qu. :  0.000  1st Qu. :  0.000  1st Qu. : 0.000   1st Qu. : 0.000 
 Median : 0.000   Median :  0.000   Median :  0.000   Median : 0.000    Median : 0.000 
   Mean : 0.281     Mean :  0.065     Mean :  0.312     Mean : 0.096      Mean : 0.114 
3rd Qu. : 0.420  3rd Qu. :  0.000  3rd Qu. :  0.383  3rd Qu. : 0.000   3rd Qu. : 0.000 
   Max. : 5.100     Max. : 42.810     Max. : 10.000     Max. : 5.880      Max. : 7.270 
                                                                                       
>>summary("spam-base", [spam])
rows: 4601, cols: 1
    spam 
0 : 2788 
1 : 1813 
         
Above you see some five-number summaries of the data. They are not exhaustive, since that is not the purpose of this tutorial.
We will split the data set into two parts: one will be used for training the random forest and the other for testing its prediction accuracy.
        // the first frame receives about 15% of the rows and is used for training
        List<Frame> frames = Sample.randomSample(all, new int[]{all.getRowCount() * 15 / 100});
        Frame train = frames.get(0);
        Frame test = frames.get(1);

Playing with number of trees grown

Now that we have train and test data sets, we can learn and predict. RF grows a number of trees over bootstrap samples and uses voting for classification. How large must this number of trees be? You can check how well you predict as the number of trees grows.

[figure: test and OOB errors as the number of trees grows]

Note from the previous plot how both the test and OOB errors go down as the number of trained trees grows. However, the improvement stops at some point, and adding new trees becomes useless.
        int pos = 0;
        Vector index = new IndexVector("number of trees", 1000);
        Vector accuracy = new NumericVector("test error", 1000);
        Vector oob = new NumericVector("oob error", 1000);
        // grow forests of increasing size, recording test and OOB errors at each step
        for (int mtree = 1; mtree < 100; mtree += 5) {
            RandomForest rf = new RandomForest(mtree, 3, true);
            rf.learn(train, "spam");
            ClassifierModel model = rf.predict(test);
            index.setIndex(pos, mtree);
            accuracy.setValue(pos, 1 - computeAccuracy(model, test));
            oob.setValue(pos, rf.getOobError());
            pos++;
        }
        Plot p = new Plot();
        Lines lines = new Lines(p, index, accuracy);
        lines.opt().setColorIndex(new OneIndexVector(2));
        p.add(lines);
        Points pts = new Points(p, index, accuracy);
        pts.opt().setColorIndex(new OneIndexVector(2));
        p.add(pts);
        p.add(new Lines(p, index, oob));
        p.add(new Points(p, index, oob));

        p.setLeftLabel("test (blue), oob (black)");
        p.setTitle("Accuracy errors (% misclassified)");
        p.getOp().setYRange(0, 0.4);
        draw(p, 600, 400);

Playing with number of random features

The main difference between bagging and random forests is that while bagging only grows trees on bootstrap samples, random forests introduce additional randomization in order to decorrelate those trees. The main effect of this is a further reduction in the variance of the prediction, and the payoff is better accuracy.
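That extra randomization amounts to letting only a random subset of mcols features compete for the best split at every node. A minimal sketch of that selection step (an illustrative helper, not the Rapaio source):

        import java.util.Arrays;
        import java.util.Random;

        // At every node only mcols randomly chosen features compete for the
        // best split; a partial shuffle selects them without replacement.
        public static int[] selectSplitCandidates(int totalFeatures, int mcols, Random random) {
            int[] features = new int[totalFeatures];
            for (int i = 0; i < totalFeatures; i++) {
                features[i] = i;
            }
            for (int i = 0; i < mcols; i++) {
                int j = i + random.nextInt(totalFeatures - i);
                int tmp = features[i];
                features[i] = features[j];
                features[j] = tmp;
            }
            return Arrays.copyOf(features, mcols);
        }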

[figure: test and OOB errors as the number of random features grows]

It can be seen here that the best prediction, according to the OOB and test errors, is obtained when the number of random features lies in the interval from 3 to 6.
The code which produced the last plot is listed below.
        pos = 0;
        index = new IndexVector("mcols", 1000);
        accuracy = new NumericVector("test error", 1000);
        oob = new NumericVector("oob error", 1000);
        // the loop variable is the number of random features tried at each split
        for (int mcols = 1; mcols < 20; mcols += 1) {

            RandomForest rf = new RandomForest(10, mcols, true);
            rf.learn(train, "spam");
            ClassifierModel model = rf.predict(test);

            index.setIndex(pos, mcols);
            accuracy.setValue(pos, 1 - computeAccuracy(model, test));
            oob.setValue(pos, rf.getOobError());

            pos++;
        }
        p = new Plot();
        lines = new Lines(p, index, accuracy);
        lines.opt().setColorIndex(new OneIndexVector(2));
        p.add(lines);
        pts = new Points(p, index, accuracy);
        pts.opt().setColorIndex(new OneIndexVector(2));
        p.add(pts);
        p.add(new Lines(p, index, oob));
        p.add(new Points(p, index, oob));
        p.setLeftLabel("test (blue), oob (black)");
        p.setBottomLabel("mcols - number of features considered");
        p.setTitle("Accuracy errors (% misclassified)");
        p.getOp().setYRange(0, 0.4);
        draw(p, 600, 400);
Note: the sole purpose of this tutorial is to show what and how it can be done with Rapaio toolbox library.
>>>This tutorial is generated with Rapaio document printer facilities.<<<