September 25, 2013

Histograms and kernel density estimation with Rapaio toolbox


Data set used

First we need to load a frame with data. For convenience we don't use input/output facilities. Instead, we load a data set already built into the library.

We will use the classical Pearson father-son data set.

>>summary("pearson", [Father, Son])
rows: 1078, cols: 2
          Father               Son 
   Min. : 59.000     Min. : 58.500 
1st Qu. : 65.800  1st Qu. : 66.900 
 Median : 67.800   Median : 68.600 
   Mean : 67.687     Mean : 68.684 
3rd Qu. : 69.600  3rd Qu. : 70.500 
   Max. : 75.400     Max. : 78.400 
                                   

We have two continuous random variables recording the heights of father-son pairs.

The five-number summaries give us some hints about how the distributions of the two random variables look. We note that the mean and the median are very close for each variable, which suggests that the distributions are roughly symmetric.

Histograms

One of the most important and commonly used tools for learning about the distribution of a random variable is the histogram.

In statistics, a histogram is a graphical representation of the distribution of data. It is an estimate of the probability distribution of a continuous variable and was first introduced by Karl Pearson. A histogram is a representation of tabulated frequencies, shown as adjacent rectangles, erected over discrete intervals (bins), with an area equal to the frequency of the observations in the interval.
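To make the definition concrete, the computation behind a histogram amounts to counting observations into equal-width bins. The following is an illustrative sketch, not Rapaio's actual implementation; the class and method names are mine:

```java
// Illustrative sketch of histogram binning (not Rapaio's implementation).
public class HistogramSketch {

    // Counts how many values fall into each of `bins` equal-width
    // intervals spanning [min, max]; a value equal to max is placed
    // in the last bin.
    public static int[] binCounts(double[] values, int bins, double min, double max) {
        int[] counts = new int[bins];
        double width = (max - min) / bins;
        for (double v : values) {
            int i = (int) ((v - min) / width);
            if (i == bins) i = bins - 1; // v == max falls in the last bin
            counts[i]++;
        }
        return counts;
    }
}
```

Each rectangle of the histogram is then drawn over its interval with a height derived from these counts.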

We draw histograms with Rapaio toolbox in the following way:

        draw(new Histogram(df.getCol("Father")));

graphics

The height of a rectangle is also equal to the frequency density of the interval, i.e., the frequency divided by the width of the interval. The total area of the histogram is then equal to the number of observations.

The previous code uses the simplest form of building a histogram. Several parameters are available to tune the output.

One can change the number of bins.

        draw(new Histogram(df.getCol("Father"), 100, false));

graphics

Note that the vertical axis shows the count of the elements falling into each displayed bin. We can instead compute the bin heights as densities, which makes the total area of the histogram equal to 1. This is also a key property of a probability density function.

        draw(new Histogram(df.getCol("Father"), 30, true));

graphics
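The density normalization above can be sketched as dividing each bin count by n times the bin width, so that the rectangle areas sum to 1. Again an illustrative sketch, not Rapaio's code; the names are mine:

```java
// Illustrative sketch of converting bin counts to densities
// (not Rapaio's implementation).
public class DensitySketch {

    // height = count / (n * binWidth), so the total histogram area is 1.
    public static double[] toDensities(int[] counts, double binWidth) {
        int n = 0;
        for (int c : counts) n += c;
        double[] d = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            d[i] = counts[i] / (n * binWidth);
        }
        return d;
    }
}
```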

The histogram is useful but has a weak point: its flexibility, given by the number of bins. There is no "best" number of bins, and different bin sizes can reveal different features of the data.

Some theoreticians have attempted to determine an optimal number of bins, but these methods generally make strong assumptions about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis, different bin widths may be appropriate, so experimentation is usually needed to determine an appropriate width.
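One widely cited example of such a rule is Sturges' formula, k = ceil(log2 n) + 1, which assumes approximately normally distributed data. A minimal sketch (the class name is mine, not part of Rapaio):

```java
// Illustrative sketch of a bin-count rule (not part of Rapaio).
public class BinRules {

    // Sturges' formula: k = ceil(log2(n)) + 1.
    // Assumes the data is roughly normally distributed.
    public static int sturges(int n) {
        return (int) Math.ceil(Math.log(n) / Math.log(2)) + 1;
    }
}
```

For the 1078 observations of the Pearson data set, Sturges' formula suggests 12 bins, far fewer than the 100 used above.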

An alternative to the histogram is kernel density estimation, which uses a kernel to smooth samples. This will construct a smooth probability density function, which will in general more accurately reflect the underlying variable.

One can also draw the kernel density approximation, either over a histogram or as a separate plot.

        Vector col = df.getCol("Father");
        Plot plot = new Plot();
        HistogramBars hb = new HistogramBars(plot, col);
        hb.opt().setColorIndex(new IndexVector("rainbow", 1, 255, 1)); // color the bars with a rainbow palette
        plot.add(hb);
        plot.add(new DensityLine(plot, col)); // overlay the kernel density estimate
        draw(plot);

graphics

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.

In its default implementation, used without parameters, the Rapaio toolbox builds a kernel density estimate with Gaussian kernels and with the bandwidth approximated by Silverman's rule of thumb.
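As a rough sketch of what such an estimator computes (this is not Rapaio's code; the class name is mine, and I use the common normal-reference form of Silverman's rule, h = 1.06 * sd * n^(-1/5), which may differ in detail from the constant Rapaio uses):

```java
// Illustrative sketch of Gaussian kernel density estimation
// (not Rapaio's implementation).
public class KdeSketch {

    // Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2 * pi)
    static double gaussian(double u) {
        return Math.exp(-u * u / 2) / Math.sqrt(2 * Math.PI);
    }

    // Silverman's rule of thumb, normal-reference form:
    // h = 1.06 * sd * n^(-1/5)
    public static double silverman(double[] xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double var = 0;
        for (double x : xs) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / (xs.length - 1)); // sample standard deviation
        return 1.06 * sd * Math.pow(xs.length, -1.0 / 5.0);
    }

    // KDE estimate at point x: f(x) = 1/(n*h) * sum_i K((x - x_i) / h)
    public static double density(double[] xs, double h, double x) {
        double sum = 0;
        for (double xi : xs) sum += gaussian((x - xi) / h);
        return sum / (xs.length * h);
    }
}
```

Evaluating `density` over a grid of points produces the smooth curve drawn by the density line.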

However, one can use a different value for the bandwidth in order to obtain a smoother or rougher approximation of the density function.

graphics

Another thing one can try with the kernel density estimator is to change the kernel function, which is the function used to disperse, in a small neighborhood around each point, the probability mass initially assigned to the points of the discrete sample. The available kernel choices are: uniform, triangular, Epanechnikov, biweight, triweight, tricube, Gaussian and cosine. Of course, it is easy to implement your own smoothing method once you implement a custom kernel function.

graphics
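For reference, a kernel is just a symmetric non-negative function that integrates to 1. The Epanechnikov kernel from the list above can be written as follows (an illustrative sketch, not Rapaio's implementation):

```java
// Illustrative sketch of a standard kernel function
// (not Rapaio's implementation).
public class Kernels {

    // Epanechnikov kernel: K(u) = 0.75 * (1 - u^2) for |u| <= 1, else 0.
    public static double epanechnikov(double u) {
        return Math.abs(u) <= 1 ? 0.75 * (1 - u * u) : 0.0;
    }
}
```

A custom kernel plugged into the estimator would replace the Gaussian in the KDE sum in the same way.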

You may agree that my implementation of a kernel function is ugly and perhaps not very useful; the point, however, is that the purpose of the Rapaio toolbox is to give you sharp and precise standard tools and, at the same time, the opportunity to experiment with your own.

One final graph shows the kernel approximation of the density functions for both numerical variables we have: fathers' heights and sons' heights.

The blue line represents the density approximation of the fathers' heights; the red line represents the density approximation of the sons' heights.

graphics

Note: the sole purpose of this tutorial is to show what and how it can be done with Rapaio toolbox library.

>>>This tutorial is generated with Rapaio document printer facilities.<<<