Back

Discrete Sampling with Rapaio

In statistics, sampling is the process which selects a subset of observations from within a statistical population. There are many types of sampling methods which can be employed in statistics, like simple random sampling, stratified sampling, systematic sampling and so on.
However the purpose of this tutorial is to present four algorithms which are used as building blocks by various statistical sampling methods. The purpose of this tutorial is to present discrete random sampling methods.
Definition:
A statistical distribution whose values can take only discrete values is called a discrete distribution.
A discrete probability distribution function is completely described by the set of possible values the random variable can take and by the probabilities assigned to each value.
An example of discrete distribution is the process of throwing a standard dice. We have a finite set of outcomes of the process (6 possible values) and a probability function value associated with each output (for a fair dice we can associate probability \( p(x_i) = \frac{1}{6} \)).
Drawing a sample from a distribution is the process of selecting some values from the possible values \( x_i,i=1..n \) according with their probabilities. Sampling is useful for far many purposes that I can describe here. Among some very important scenarios are the simulation and the fapt that working with a smaller sample than the given population is faster and provides enough information for analysis. For the latter example we can note that working with the heights and weights of the all the people from the world (assuming that this information is possible to collect, there are probably over 7 billions records) is much harder than working with a sample much smaller.

Uniform random sample with replacement

An uniform random sample is a sample from a discrete population with an uniform distribution. A discrete uniform distribution is a distribution which assigns equal probability mass function values to each outcome. The previous example of throwing a fair dice is an example of uniform distribution, since it assigns equal value \(\frac{1}{6}\) to each possible outcome \( x_i \).
A sample with replacement is a sample where values of the sample can appear multiple times. The expression "with replacement" can be misleading. The intuition behind seems to follow from the following description of the process:
We have a set of possible elements \(x_i\), each with assigned equal probability of \(p(x_i) \). Take randomly one element from the set, according with their probabilities (denote the taken element with \(x_k\)). Replace the element taken from the set with a new element, which has the same value as the element previously removed. At this stage we have again a situation identical with the initial situation. Repeat the process of taking elements from the original set, followed by replacing that element with another similar element unit you collect the desired number of elements for the sample.
Of course, we don't have to replace effectively the element. Repeating the process of throwing a fair dice multiple times is a sampling with replacement (we don't remove and replace something during process).
The algorithm for taking a discrete uniform sample with replacement is fairly simple since it is based on a basic operation provided by Java language (java.util.Random.nextInt(int n);)

        int[] sample = new DiscreteSamplingWR(6).sample(1000);
        Vector vector = new NumericVector("fair-dice", sample);
        draw(new Histogram(vector, 6, false), 500, 200);

In the presented histogram we see frequencies obtained be taking a sample of size 1000 of the fair-dice process outcomes. We note that the proportions are somehow equal, which is according with our assumption that each element have equal probability. However the values from sample looks random:

2, 3, 4, 1, 0, 1, 4, 0, 3, 4, 5, 4, 3, 0, 3, 2, 2, 2, 1, 4, 5, 1, 4, 4, 2, 3, 5, 3, 4, 4, 
2, 3, 4, 5, 3, 1, 5, 0, 0, 1, 1, 3, 1, 0, 5, 1, 5, 4, 4, 2, 2, 4, 0, 2, 1, 4, 2, 2, 2, 3, 
3, 0, 3, 2, 0, 1, 0, 4, 1, 3, 0, 2, 2, 2, 4, 1, 3, 2, 2, 4, 5, 2, 2, 0, 1, 0, 0, 5, 3, 3, 
2, 3, 0, 1, 0, 1, 2, 4, 4, 4, ...

Uniform random sample without replacement

Sampling without replacement implies that during the process of selection of elements which will be collected in teh desired sample, the chosen elements are not available again for further selection. A process like this appears when are drawn numbers for lottery. The already selected numbers are not available for selection again. Another expression which, at least for me, seems much clearer is "without repetition". So, with replacement means a sample with repetition, and without replacement is a sample without repetition.
Let's suppose that we have a lottery 6/49. At each draw 6 unique numbers in range 1-69 are drawn. We take 100 draws and simulate:

        final int TRIALS = 100;
        final int SAMPLE_SIZE = 6;
        final int POPULATION_SIZE = 49;
        Vector[] vectors = new Vector[2];
        vectors[0] = new NumericVector("lottery trial", SAMPLE_SIZE * TRIALS);
        vectors[1] = new NumericVector("winning number", SAMPLE_SIZE * TRIALS);

        for (int i = 0; i < TRIALS; i++) {
            int[] numbers = new DiscreteSamplingWOR(POPULATION_SIZE).sample(SAMPLE_SIZE);
            for (int j = 0; j < numbers.length; j++) {
                vectors[0].setValue(i * SAMPLE_SIZE + j, i);
                vectors[1].setValue(i * SAMPLE_SIZE + j, numbers[j] + 1);
            }
        }

        final Frame df = new SolidFrame("lottery", SAMPLE_SIZE * TRIALS, vectors);
        draw(new Plot() {{
            add(new Points(this, df.getCol(0), df.getCol(1)));
            getOp().setPchIndex(new OneIndexVector(1));
            getOp().setColorIndex(new OneIndexVector(34));
        }}, 600, 300);

There is random in that plot. Everywhere. A summary on the data, however, can give us enough clues to understand that the distribution of those numbers are still symmetric and somehow uniform.

>>summary("lottery", [lottery trial, winning number])
rows: 600, cols: 2
    lottery trial    winning number 
   Min. :   1.000     Min. :  1.000 
1st Qu. :  25.417  1st Qu. : 13.000 
 Median :  50.500   Median : 25.000 
   Mean :  50.500     Mean : 25.148 
2nd Qu. :  75.583  2nd Qu. : 38.000 
   Max. : 100.000     Max. : 49.000

Weighted random sample with replacement

A weighted sample is a discrete random sample which does not have an uniform distribution. A well-know example used in almost all introductory probability classes is the biased coin. A biased coin is a coin which is not fair. That mean the probability after a draw to see HEAD is different than the probability to see a TAIL.
Let's suppose we have a biased coin with \( p(coin=HEAD) = 0.6\) and \( p(coin=TAIL)=0.4\). We simulate this experiment and we draw the coins a lot of times. The law of large numbers tells us that after a reasonable amount of repetitions the plugged in estimator will tend to go the the population parameter estimated.
During the experiment we will throw the coin 300 times and we will plot the plugin estimate which is the number of times HEAD is drawn divided by the number of experiments.

        RandomSource.setSeed(1);
        final Vector index = new IndexVector("experiment no.", 1, 1000, 1);
        final Vector value = new NumericVector("HEAD/TOTAL", 1000);
        double count = 0;
        double total = 0;
        for (int i = 0; i < 300; i++) {
            int[] samples = new DiscreteWeightedSamplingWR(new double[]{0.6, 0.4}).sample(1);
            if (samples[0] == 0) count++;
            total++;
            value.setValue(i, count / total);
        }
        draw(new Plot() {{
            add(new ABLine(this, 0.6, true));
            add(new Lines(this, index, value) {{
                opt().setColorIndex(new OneIndexVector(2));
                opt().setLwd(1.5f);
            }});
            getOp().setYRange(0, 1);
        }});

From the previous function line we see that the plugged in estimate of the probability of HEAD has a large variation at the beginning of our experiment. However, as number of trials increases we have clear reasoning to confirm that the coin is biased, since the variation decrease, the estimator converge to value 0.6 which is not what we could expect from a fair coin.
The sampling algorithm implemented is one of the family of alias method, specifically is called Vose algorithm and is one of the linear algorithms used today for discrete weighted random sampling. See more about this algorithm here: http://en.wikipedia.org/wiki/Alias_method.

Weighted random sample without replacement

This is the last type of discrete random sampling covered here. What we are interested in is to generate samples without replacement (no repetition), from a discrete distribution different than uniform distribution.
We consider again the lottery experiment. However we want to simulate a situation when some winning numbers are preferred over the others. Let's suppose that our lottery favors big numbers, >= that 40. And some other numbers, ones in interval 8-12 have smaller probability than usual. We repeat the experiment with a weighted sampling technique.

        vectors[0] = new NumericVector("loaded lottery", SAMPLE_SIZE * TRIALS);
        vectors[1] = new NumericVector("winning number", SAMPLE_SIZE * TRIALS);

        double[] prob = new double[49];
        for (int i = 0; i =8 && i-1<=12) {
                prob[i]=3;
                continue;
            }
            if(i-1>=40) {
                prob[i]=30;
                continue;
            }
            prob[i]=10;
        }
        for (int i = 0; i < TRIALS; i++) {
            int[] numbers = new DiscreteWeightedSamplingWOR(prob).sample(SAMPLE_SIZE);
            for (int j = 0; j < numbers.length; j++) {
                vectors[0].setValue(i * SAMPLE_SIZE + j, i + 1);
                vectors[1].setValue(i * SAMPLE_SIZE + j, numbers[j] + 1);
            }
        }

        final Frame df2 = new SolidFrame("lottery", SAMPLE_SIZE * TRIALS, vectors);
        draw(new Plot() {{
            add(new Points(this, df2.getCol(0), df2.getCol(1)));
            getOp().setPchIndex(new OneIndexVector(1));
            getOp().setColorIndex(new OneIndexVector(34));
            getOp().setSizeIndex(new OneNumericVector(2));
        }}, 600, 300);

This time we see more than random there. There is a clear more dense region in the upper side of the graph. Also, we can note, perhaps not as very clear, a stripe with low density somewhere under y=13.
To clarify a kernel density plot which approximates the population density would help more.
graphics

Rapaio implementation of this last algorithm is based on a wonderful algorithm invented by Efraimidis-Spirakis. http://link.springer.com/content/pdf/10.1007/978-0-387-30162-4_478.pdf.
Note: the sole purpose of this tutorial is to show what and how it can be done with Rapaio toolbox library.
>>>This tutorial is generated with Rapaio document printer facilities.<<<

Back

Broken thoughts

October 23, 2013