The RMarkdown file for this lesson can be found here.
This lesson will follow Chapter 2 in Quinn and Keough (2002).
Samples and populations
Biologists want to make inferences about a population based on subsamples of that population.
A collection of observations from the population is a sample.
The number of observations in the sample is the sample size.
The basic method of sampling is simple random sampling, in which all observations have the same probability of being sampled.
In practice this rarely happens (why is this a concern?).
Random sampling is important because we want to use sample statistics to estimate the population parameters. We cannot measure the population parameters directly because the population is too large.
A good estimator of a population should:
be unbiased: repeated samples should produce estimates that neither under- nor overestimate the population parameter
be consistent: increasing the sample size should bring the sample estimate closer to the population parameter
be efficient: it should have the lowest variance among competing estimators
There are two broad types of estimators:
point estimates (single value)
interval estimates (range of values)
Common parameters and statistics
Center (location of distribution)
To explore these statistics, we will generate a large sample of data.
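A minimal sketch of this step; the distribution, the parameters, and the object name conc reused in the later examples are assumptions, not necessarily what the original lesson used.

```r
# Simulate a large sample of "concentration" values (assumed setup;
# the original lesson's distribution and parameters may differ)
set.seed(42)
conc <- data.frame(concentration = rnorm(n = 1000, mean = 50, sd = 10))
head(conc)
```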
Let's visualize this using a histogram. There are two ways we can do this: one is to generate the binned data with dplyr, and the other is to use geom_histogram in ggplot.
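A sketch of both approaches, using the simulated conc data assumed above (the number of bins is an arbitrary choice):

```r
library(dplyr)
library(ggplot2)

# Approach 1: bin the data ourselves with dplyr, then plot the counts
conc %>%
  mutate(bin = cut(concentration, breaks = 30)) %>%
  count(bin) %>%
  ggplot(aes(x = bin, y = n)) +
  geom_col()

# Approach 2: let geom_histogram() do the binning
ggplot(conc, aes(x = concentration)) +
  geom_histogram(bins = 30)
```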
L-estimator
An L-estimator is based on ordering the data from smallest to largest and then using a linear combination of weighted order statistics.
Mean
The mean is an unbiased estimator of the population mean in which each observation is weighted by 1/n.
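Using the simulated conc data assumed above:

```r
mean(conc$concentration)
```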
Median
The median is an unbiased estimator of the population mean for a normal distribution and is a better estimator for skewed distributions.
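Using the assumed conc data:

```r
median(conc$concentration)
```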
Trimmed mean
The trimmed mean is the mean calculated after trimming a proportion of the data from the highest and lowest observations. It can help deal with outliers.
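For example, trimming 20% of the observations from each tail (the proportion here is an arbitrary choice):

```r
mean(conc$concentration, trim = 0.2)
```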
Winsorized mean
The Winsorized mean is similar to the trimmed mean, but the values that would be excluded are replaced by the nearest retained value (substituted rather than dropped).
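A sketch comparing a by-hand Winsorized mean with psych's winsor.mean(), again using the assumed conc data and a 20% trimming proportion:

```r
library(psych)

x <- sort(conc$concentration)
n <- length(x)
k <- floor(0.2 * n)

# By hand: replace the lowest and highest 20% of values with the
# nearest retained observation, then take the mean
x[1:k] <- x[k + 1]
x[(n - k + 1):n] <- x[n - k]
mean(x)

# psych's implementation
winsor.mean(conc$concentration, trim = 0.2)
```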
Notice that these numbers are slightly different. That is because, instead of replacing the trimmed values with the nearest retained number, winsor.mean replaces them with the 20% and 80% quantiles.
M-estimators
M-estimators gradually give different weights to observations away from the middle of the sample and incorporate a measure of variability into the estimation procedure. They use an iterative approach and are useful when there are extreme outliers.
They are not commonly used but do have a role in robust regression and ANOVA techniques for analyzing linear models.
We can see from these data that there is one HUGE outlier. Running huber and mean gives us different results.
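The chunk that created the outlier-containing data is not reproduced here; as a stand-in, a small hypothetical vector with one extreme value illustrates the idea using MASS's huber():

```r
library(MASS)

# Hypothetical data with one extreme outlier
y <- c(2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 55)

huber(y)  # robust M-estimate of location (with its scale estimate)
mean(y)   # pulled strongly toward the outlier
```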
R-estimators
R-estimators are based on the ranks of the observations rather than the observations themselves and form the basis for many rank-based "non-parametric" tests.
Hodges–Lehmann estimator
The Hodges–Lehmann estimator is the median of the averages of all possible pairs of observations.
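A sketch with the assumed conc data; wilcox.test() reports essentially the same quantity as its (pseudo)median estimate:

```r
x <- conc$concentration

# Median of all pairwise (Walsh) averages, including each value with itself
walsh <- outer(x, x, "+") / 2
median(walsh[upper.tri(walsh, diag = TRUE)])

# Should closely match (up to numerical approximation for large samples)
wilcox.test(x, conf.int = TRUE)$estimate
```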
Spread or variability of your sample
Like estimators for the central tendency of your data, there are also numerous ways to assess the spread in your sample.
Going back to our concentration data that we created earlier, we will look at some of these estimates.
Range
The range is perhaps the simplest estimate of the spread of your data.
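Using the assumed conc data:

```r
range(conc$concentration)        # minimum and maximum
diff(range(conc$concentration))  # the range as a single number
```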
Variance
Variance is the average squared deviation from the mean.
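Using the assumed conc data:

```r
var(conc$concentration)
```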
Standard deviation
Square root of the variance.
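Using the assumed conc data:

```r
sd(conc$concentration)
sqrt(var(conc$concentration))  # identical
```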
Coefficient of variation
Used to compare standard deviations across populations with different means because it is independent of the measurement units.
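Computed by hand with the assumed conc data:

```r
# Coefficient of variation expressed as a percentage of the mean
sd(conc$concentration) / mean(conc$concentration) * 100
```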
Median absolute deviation
Less sensitive to outliers than the above measures and is presented in association with medians.
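Using the assumed conc data:

```r
mad(conc$concentration)
```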
Interquartile range
The difference between the first quartile and the third quartile. Used in ggplot2's geom_boxplot.
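Using the assumed conc data:

```r
IQR(conc$concentration)
quantile(conc$concentration, probs = c(0.25, 0.75))
```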
Degrees of freedom
Degrees of freedom is simply the number of observations in our sample that are "free to vary" when we are estimating the variance. Because the mean has already been estimated from the same data, only n - 1 of the deviations are free to vary, thus df = n - 1.
Standard error
The standard error of the mean describes the variation in our sample mean. It is termed an error because it is the error about $\bar{y}$. If the error is large, then repeated sampling would produce different means.
The standard error is calculated as the standard deviation divided by the square root of the number of observations. There is no function in the stats package that calculates the standard error.
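Computed by hand with the assumed conc data:

```r
sd(conc$concentration) / sqrt(length(conc$concentration))
```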
Confidence intervals
NOTE: all of this assumes normality. If you are not working with a normal distribution, then other methods may be necessary to calculate the variance (in particular).
In frequentist terms, the confidence interval is not a probability statement. A confidence interval can be thought of, in this context, as one interval generated by a procedure that will give correct intervals 95% of the time.
Confidence intervals can be calculated using the t critical value and the standard error. The critical value can be found using the qt() function in R.
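A sketch of a 95% confidence interval for the mean of the assumed conc data:

```r
n    <- nrow(conc)
ybar <- mean(conc$concentration)
se   <- sd(conc$concentration) / sqrt(n)

# 95% CI: mean +/- t critical value * standard error
tcrit <- qt(0.975, df = n - 1)
c(lower = ybar - tcrit * se, upper = ybar + tcrit * se)
```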
We can illustrate this using ggplot2.
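The original figure is not reproduced here; a minimal stand-in contrasts error bars of one standard deviation with error bars of one standard error for the assumed conc data:

```r
library(ggplot2)

summ <- data.frame(
  measure = c("standard deviation", "standard error"),
  mean    = mean(conc$concentration),
  spread  = c(sd(conc$concentration),
              sd(conc$concentration) / sqrt(nrow(conc)))
)

ggplot(summ, aes(x = measure, y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean - spread, ymax = mean + spread), width = 0.1)
```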
You can see the difference in the above plot. The standard deviation describes the spread in the data, whereas the standard error describes the uncertainty in where the mean (or predicted value) falls.
What happens to the standard error as we increase the sample size to 200, 500, 1000?
Resampling methods
There are a couple of different ways to calculate the spread or confidence interval when the sampling distribution is unknown or is definitively not normal. These methods involve resampling your data over many different iterations to build distributions of the expected values (i.e., means, medians, confidence intervals).
Jackknifing
Jackknifing is a predecessor of bootstrapping and is less computationally expensive. Jackknifing is done by leaving out one observation at a time and calculating the sample statistic of interest on the remaining data. The term 'jackknife' was coined by John Tukey, referring to its robustness as a tool.
We can explore this below:
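A sketch of a jackknife of the mean for the assumed conc data:

```r
x <- conc$concentration
n <- length(x)

# Leave one observation out at a time and recompute the mean
jack_means <- sapply(seq_len(n), function(i) mean(x[-i]))

mean(jack_means)  # jackknife estimate of the mean

# Jackknife standard error of the mean
sqrt((n - 1) / n * sum((jack_means - mean(jack_means))^2))
```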
Bootstrapping
Bootstrapping is another resampling technique. Instead of "leaving one out", we resample the data. We have two options: take a random subset of the data (typically without replacement), or resample the full data set (with replacement). Like the jackknife, we calculate our sample statistic on each resample. Unlike the jackknife, the general rule is to resample many times, roughly 1,000.
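A sketch of a bootstrap of the mean, resampling the full assumed conc data set with replacement 1,000 times:

```r
set.seed(123)
x <- conc$concentration

# Resample the full data set with replacement and recompute the mean each time
boot_means <- replicate(1000, mean(sample(x, size = length(x), replace = TRUE)))

mean(boot_means)                               # bootstrap estimate of the mean
quantile(boot_means, probs = c(0.025, 0.975))  # percentile 95% confidence interval
```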
Now it is your turn
Load the lovett data from our github website (data/ExperimentalDesignData/chpt2/) using the appropriate means
Calculate these statistics for SO4, SO4MOD, and CL