Introduction
This tutorial presents the correlation tools offered by the Rapaio library. We will use the classical iris data set. The numerical columns of this data set are:
Frame df = Datasets.loadIrisDataset();
df = ColFilters.retainNumeric(df);
names(df);
>>names("iris")
sepal-length
sepal-width
petal-length
petal-width
Pearson product-moment correlation
Pearson product-moment correlation measures the linear correlation between two random variables. Unlike some other correlation measures, it detects only linear relationships.
Definition
The Pearson product-moment coefficient measures the linear correlation between two random variables \(X\) and \(Y\), giving a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no linear correlation, and −1 is total negative correlation.
Pearson's correlation coefficient when applied to a population is commonly represented by the Greek letter \(\rho\) (rho) and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula for \(\rho\) is:
$$ \rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y} $$
Pearson's correlation coefficient when applied to a sample is commonly represented by the letter \(r\) and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for \(r\) by substituting estimates of the covariances and variances based on a sample into the formula above. That formula for \(r\) is:
$$ r = \frac{\sum ^n _{i=1}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum ^n _{i=1}(X_i - \bar{X})^2} \sqrt{\sum ^n _{i=1}(Y_i - \bar{Y})^2}} $$
The interpretation of a correlation coefficient depends on the context and purpose. A correlation of 0.8 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.
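The sample formula for \(r\) above can be written out directly in plain Java. The following is only a minimal sketch to make the formula concrete: the class name `PearsonSketch`, the method `pearson` and the toy arrays are assumptions made for illustration and are not part of the Rapaio API.

```java
// Illustrative sketch of the sample Pearson r; names are hypothetical, not Rapaio API.
class PearsonSketch {

    /** Sample Pearson correlation coefficient r for two arrays of equal length. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;

        // sample means of X and Y
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // sums of cross products and squared deviations from the means
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // the result is NaN when either variable is constant (zero variance)
        return sxy / (Math.sqrt(sxx) * Math.sqrt(syy));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] up = {2, 4, 6, 8, 10};       // increasing linear function of x
        double[] down = {10, 8, 6, 4, 2};     // decreasing linear function of x
        double[] parabola = {4, 1, 0, 1, 4};  // (x - 3)^2: depends on x, but not linearly
        System.out.println("linear up:   " + pearson(x, up));
        System.out.println("linear down: " + pearson(x, down));
        System.out.println("parabola:    " + pearson(x, parabola));
    }
}
```

On these toy values the two linear cases give coefficients at the ends of the range, while the parabola gives a coefficient of zero even though it is fully determined by \(x\). This illustrates that the coefficient captures only linear relationships.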
Use Rapaio for Pearson correlation
The Rapaio library can compute Pearson \(r\) for more than one vector at a time. The result is a matrix of \(r\) values for every pair of vectors, where the position of each vector gives its row and column index in the matrix.
PearsonRCorrelation corr = new PearsonRCorrelation(df);
summary(corr);
pearson[[sepal-length, sepal-width, petal-length, petal-width]] - Pearson product-moment correlation coefficient
    1.sepal-length  2.sepal-width   3.petal-length  4.petal-width
1.  x               -0.109369       0.871754        0.817954
2.  -0.109369       x               -0.420516       -0.356544
3.  0.871754        -0.420516       x               0.962757
4.  0.817954        -0.356544       0.962757        x
We can spot with ease that many of the attributes are linearly correlated. For example, the correlation summary shows that petal-length and petal-width have a very strong linear correlation. Let's check this intuition with a plot:
Another \(r\) coefficient with a value close to \(1\) is the one between sepal-length and petal-length. Let's check that with a plot as well:
Finally, we plot again, but this time for a pair whose coefficient is closer to 0, which could mean that the variables are not linearly correlated. Such a value occurs between sepal-length and sepal-width.
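The matrix layout used in the summary above, where the position of each vector gives its row and column, can be sketched in plain Java as follows. This is an illustration of the idea under stated assumptions, not the way Rapaio implements it: `CorrelationMatrixSketch`, `pearson` and `correlationMatrix` are hypothetical names, and the three short columns are toy values, not iris data.

```java
// Illustrative sketch of a pairwise correlation matrix; names are hypothetical, not Rapaio API.
class CorrelationMatrixSketch {

    /** Sample Pearson r for two arrays of equal length. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / (Math.sqrt(sxx) * Math.sqrt(syy));
    }

    /** Symmetric matrix m where m[i][j] is the Pearson r between columns[i] and columns[j]. */
    static double[][] correlationMatrix(double[][] columns) {
        int k = columns.length;
        double[][] m = new double[k][k];
        for (int i = 0; i < k; i++) {
            // diagonal fixed to 1 by convention in this sketch (the summary above prints "x" there)
            m[i][i] = 1.0;
            for (int j = i + 1; j < k; j++) {
                double r = pearson(columns[i], columns[j]);
                m[i][j] = r; // r is symmetric, so each pair is computed only once
                m[j][i] = r;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // toy columns, stored column-wise: columns[i] holds all values of variable i
        double[][] columns = {
                {1, 2, 3, 4, 5},
                {2, 4, 6, 8, 10},
                {5, 3, 4, 1, 2}
        };
        for (double[] row : correlationMatrix(columns)) {
            StringBuilder sb = new StringBuilder();
            for (double r : row) {
                sb.append(String.format("%10.6f", r));
            }
            System.out.println(sb);
        }
    }
}
```

Because \(r\) is symmetric in its two arguments, only the pairs above the diagonal need to be computed, and the values are mirrored below it.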
Spearman's rank correlation coefficient
Spearman's rank correlation coefficient, often denoted by the Greek letter \(\rho\) (rho) or as \(r_s\), is a nonparametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of \(+1\) or \(-1\) occurs when each of the variables is a perfect monotone function of the other.
Definition
The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables. For a sample of size \(n\), the \(n\) raw scores \(X_i\), \(Y_i\) are converted to ranks \(x_i\), \(y_i\), and \(\rho\) is computed from these:
$$ \rho = \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \sum_i(y_i-\bar{y})^2}} $$
Identical values (rank ties or value duplicates) are assigned a rank equal to the average of their positions in the ascending order of the values.
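The definition above, ranking with averaged ties followed by Pearson on the ranks, can be sketched in plain Java. This is a minimal illustration and not the Rapaio implementation: `SpearmanSketch`, `averageRanks`, `pearson` and `spearman` are hypothetical names, and the arrays are toy values.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of Spearman's rho; names are hypothetical, not Rapaio API.
class SpearmanSketch {

    /** Ranks starting at 1; tied values receive the average of the positions they occupy. */
    static double[] averageRanks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // indices sorted by ascending value
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            // extend [start, end] over a run of equal values
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            // average of the 1-based positions start+1 .. end+1
            double avg = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = avg;
            }
            start = end + 1;
        }
        return ranks;
    }

    /** Sample Pearson r for two arrays of equal length. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / (Math.sqrt(sxx) * Math.sqrt(syy));
    }

    /** Spearman's rho: Pearson r computed on the ranks of the raw values. */
    static double spearman(double[] x, double[] y) {
        return pearson(averageRanks(x), averageRanks(y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] cubes = {1, 8, 27, 64, 125}; // monotone in x, but not linear
        System.out.println("ranks with a tie: " + Arrays.toString(averageRanks(new double[]{10, 20, 20, 30})));
        System.out.println("spearman: " + spearman(x, cubes));
        System.out.println("pearson:  " + pearson(x, cubes));
    }
}
```

For the cubes, the ranks of both variables coincide, so the Spearman coefficient reaches its maximum, while the Pearson coefficient on the raw values stays below 1 because the relationship is monotone but not linear.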
Use Rapaio to compute Spearman's rank correlation
The Rapaio library can compute Spearman \(\rho\) for more than one vector at a time. The result is a matrix of \(\rho\) values for every pair of vectors, where the position of each vector gives its row and column index in the matrix.
spearman[[sepal-length, sepal-width, petal-length, petal-width]] - Spearman's rank correlation coefficient
    1.sepal-length  2.sepal-width   3.petal-length  4.petal-width
1.  x               -0.159457       0.881386        0.834421
2.  -0.159457       x               -0.303421       -0.277511
3.  0.881386        -0.303421       x               0.936003
4.  0.834421        -0.277511       0.936003        x
For comparison, these are the Pearson coefficients computed earlier for the same columns:
pearson[[sepal-length, sepal-width, petal-length, petal-width]] - Pearson product-moment correlation coefficient
    1.sepal-length  2.sepal-width   3.petal-length  4.petal-width
1.  x               -0.109369       0.871754        0.817954
2.  -0.109369       x               -0.420516       -0.356544
3.  0.871754        -0.420516       x               0.962757
4.  0.817954        -0.356544       0.962757        x