The RMarkdown file for this lesson can be found here
This lesson will follow Chapter 5 in Quinn and Keough (2002).
Correlation analysis
Consider a study in which we are interested in the relationship between two random variables.
Bivariate normal distribution
We need to think of our data as a population of \( y_{i1} \) and \(y_{i2} \) pairs (a joint distribution of two variables or a bivariate distribution).
The bivariate normal distribution is defined by the mean and standard deviation of each variable and a parameter called the correlation coefficient, which measures the strength of the relationship between the two variables. A bivariate normal distribution implies that each individual variable is also normally distributed and that any relationship between the two variables is linear.
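As a rough illustration of the five parameters described above, the following sketch draws pairs from a bivariate normal distribution using only the Python standard library. The function name `simulate_bivariate_normal`, the means and standard deviations, and the seed are hypothetical choices made for this example; the correlation of 0.55 is used simply because it is the value this lesson mentions.

```python
import math
import random


def simulate_bivariate_normal(n, mean1, sd1, mean2, sd2, rho, seed=1):
    """Draw n (y1, y2) pairs from a bivariate normal distribution.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Two independent standard normal draws.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Mix them so that the correlation between z1 and w equals rho.
        w = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        # Shift and scale to the requested means and standard deviations.
        pairs.append((mean1 + sd1 * z1, mean2 + sd2 * w))
    return pairs


# Assumed example values: means 10 and 5, standard deviations 2 and 1.
pairs = simulate_bivariate_normal(
    n=1000, mean1=10.0, sd1=2.0, mean2=5.0, sd2=1.0, rho=0.55
)
print(pairs[:3])
```

Each margin of the simulated data is normal, and the only dependence between the two variables is linear, which is what the bivariate normal assumption implies.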
Covariance and correlation
Covariance measures how two continuous variables vary together, that is, the direction and (unstandardized) magnitude of their linear relationship.
Covariance
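For a sample of \( n \) pairs, the covariance is estimated as \( s_{y_1 y_2} = \frac{\sum_{i=1}^{n} (y_{i1} - \bar{y}_1)(y_{i2} - \bar{y}_2)}{n - 1} \), and it ranges from \( -\infty \) to \( +\infty \).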
One problem with covariance is that its absolute magnitude depends on the units of the two variables, so covariances measured on different scales cannot be compared directly.
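The unit dependence can be seen with a small sketch. The data below are made up for illustration (hypothetical lengths and masses), and `sample_covariance` is a helper written for this example using the usual \( n - 1 \) denominator.

```python
def sample_covariance(y1, y2):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(y1)
    mean1 = sum(y1) / n
    mean2 = sum(y2) / n
    cross_products = sum((a - mean1) * (b - mean2) for a, b in zip(y1, y2))
    return cross_products / (n - 1)


# Hypothetical measurements, invented for this example.
length_cm = [10.0, 12.0, 15.0, 19.0, 24.0]
mass_g = [2.0, 3.0, 5.0, 8.0, 12.0]

# The same lengths expressed in millimetres.
length_mm = [10.0 * v for v in length_cm]

cov_cm = sample_covariance(length_cm, mass_g)
cov_mm = sample_covariance(length_mm, mass_g)
print(cov_cm, cov_mm)
```

Changing the units of one variable from centimetres to millimetres multiplies the covariance by 10, even though the underlying relationship is unchanged.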
Pearson correlation
The covariance can be standardized by dividing by the standard deviations of the two variables so that its value ranges between -1 and +1. This is called the Pearson (product-moment) correlation.
The Pearson correlation measures the “strength” of the linear relationship between the two continuous variables.
Recall that when we generated x1 and y1 above, we used a correlation value, r, of 0.55.
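A minimal sketch of the Pearson correlation as a standardized covariance follows. It uses made-up data (not the x1 and y1 from this lesson) and a helper, `pearson_r`, written for this example; the \( n - 1 \) terms cancel, so sums of squares and cross-products are used directly.

```python
import math


def pearson_r(y1, y2):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(y1)
    mean1 = sum(y1) / n
    mean2 = sum(y2) / n
    # Sum of cross-products and the two sums of squares.
    sxy = sum((a - mean1) * (b - mean2) for a, b in zip(y1, y2))
    sxx = sum((a - mean1) ** 2 for a in y1)
    syy = sum((b - mean2) ** 2 for b in y2)
    # Covariance divided by both standard deviations.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements, invented for this example.
length_cm = [10.0, 12.0, 15.0, 19.0, 24.0]
mass_g = [2.0, 3.0, 5.0, 8.0, 12.0]
length_mm = [10.0 * v for v in length_cm]

r_cm = pearson_r(length_cm, mass_g)
r_mm = pearson_r(length_mm, mass_g)
print(r_cm, r_mm)
```

Unlike the covariance, the correlation is the same whether length is recorded in centimetres or millimetres.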
Robust correlation (Spearman’s rank correlation)
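We may have a situation where the joint distribution of our two variables is not bivariate normal, for example because of non-normality in either variable or because the relationship is monotonic but not linear. In these cases we can use Spearman's rank correlation, which is the Pearson correlation calculated on the ranks of the observations rather than on their raw values.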
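To illustrate the monotonic but non-linear case, the sketch below computes a rank-based (Spearman) correlation as the Pearson correlation of the ranks, with ties given their average rank. The helper names and the exponential example data are assumptions made for this illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson_r(y1, y2):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(y1)
    mean1 = sum(y1) / n
    mean2 = sum(y2) / n
    sxy = sum((a - mean1) * (b - mean2) for a, b in zip(y1, y2))
    sxx = sum((a - mean1) ** 2 for a in y1)
    syy = sum((b - mean2) ** 2 for b in y2)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(y1, y2):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(y1), average_ranks(y2))


# Hypothetical data: y increases with x, but exponentially rather than linearly.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [math.exp(v) for v in x]

print(pearson_r(x, y), spearman_rho(x, y))
```

Because the ranks of x and y agree perfectly, the rank correlation reflects the perfect monotonic relationship, whereas the Pearson correlation on the raw values is below 1.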
Parametric and non-parametric confidence regions
When representing a bivariate relationship with a scatterplot, it is often useful to include confidence regions. The confidence region is the region within which we would expect the observation represented by the population mean of the two variables to occur a given percentage of the time (for example, 95%) under repeated sampling from this population.
Confidence ellipse
Assuming our two variables follow a bivariate normal distribution, the confidence region will always be an ellipse centered on the sample means of \( y_{i1} \) and \(y_{i2} \). The orientation of the ellipse is determined by the covariance or correlation.
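The geometry of such an ellipse can be sketched from the two variances and the covariance. The code below is an illustration under stated assumptions, not the method used in this lesson: it scales the ellipse with a chi-square quantile on 2 degrees of freedom (which has the closed form \( -2 \ln(1 - p) \)), a large-sample approximation. For a region about the population mean, the inputs would be the variances and covariance of the sample means (the sample values divided by \( n \)), and an exact small-sample version would use an F quantile instead. The function names are hypothetical.

```python
import math


def ellipse_parameters(var1, var2, cov12, level=0.95):
    """Return (semi_major, semi_minor, angle) of an approximate ellipse.

    angle is the orientation of the major axis in radians, measured
    from the axis of the first variable.
    """
    # Chi-square quantile with 2 degrees of freedom (closed form).
    q = -2.0 * math.log(1.0 - level)
    # Eigenvalues of the 2 x 2 covariance matrix.
    half_trace = (var1 + var2) / 2.0
    spread = math.sqrt(((var1 - var2) / 2.0) ** 2 + cov12 ** 2)
    lam_major = half_trace + spread
    lam_minor = max(half_trace - spread, 0.0)  # guard against rounding
    # Orientation of the major axis, set by the covariance.
    angle = 0.5 * math.atan2(2.0 * cov12, var1 - var2)
    return math.sqrt(q * lam_major), math.sqrt(q * lam_minor), angle


def ellipse_points(center, semi_major, semi_minor, angle, k=100):
    """Return k boundary points of the ellipse, centered on `center`."""
    points = []
    for i in range(k):
        t = 2.0 * math.pi * i / k
        u = semi_major * math.cos(t)
        v = semi_minor * math.sin(t)
        # Rotate by `angle`, then shift to the center (the sample means).
        points.append((
            center[0] + u * math.cos(angle) - v * math.sin(angle),
            center[1] + u * math.sin(angle) + v * math.cos(angle),
        ))
    return points


# Assumed example values: equal variances and a positive covariance.
a, b, theta = ellipse_parameters(var1=1.0, var2=1.0, cov12=0.55)
outline = ellipse_points(center=(10.0, 5.0), semi_major=a, semi_minor=b, angle=theta)
print(a, b, theta)
```

With equal variances and a positive covariance the major axis lies along the 45-degree line; a stronger correlation makes the ellipse narrower.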
Kernel density
Sometimes we are not interested in the population mean of \( y_{i1} \) and \(y_{i2} \) but just want a confidence region based on the observed data. The kernel density estimate for a value of \( y \) is the sum of the estimates from a series of symmetrical distributions fitted to groups of local observations. Note that the resulting regions are not constrained to an elliptical shape.
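The idea of summing symmetrical distributions can be sketched with a simple two-dimensional kernel density estimate. This example assumes one circular Gaussian kernel per observation with a single, hand-picked bandwidth; the function name, the data, and the bandwidth are all hypothetical, and practical implementations usually choose the bandwidth from the data.

```python
import math


def kernel_density_2d(point, observations, bandwidth):
    """Gaussian kernel density estimate at `point` from (y1, y2) pairs."""
    px, py = point
    total = 0.0
    for ox, oy in observations:
        # Squared distance from the point to this observation, in bandwidths.
        d2 = ((px - ox) ** 2 + (py - oy) ** 2) / bandwidth ** 2
        # Contribution of the symmetrical kernel centered on the observation.
        total += math.exp(-0.5 * d2)
    # Normalize so the estimate integrates to 1 over the plane.
    return total / (2.0 * math.pi * bandwidth ** 2 * len(observations))


# Hypothetical observations: a tight cluster plus one distant point.
observations = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (4.0, 4.0)]

near_cluster = kernel_density_2d((1.0, 1.0), observations, bandwidth=0.5)
between = kernel_density_2d((2.5, 2.5), observations, bandwidth=0.5)
print(near_cluster, between)
```

Contours of this estimated density follow wherever the observations are concentrated, so the enclosed region can take any shape rather than being forced into an ellipse.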