The RMarkdown file for this lesson can be found here.
This lesson will follow Chapter 5 in Quinn and Keough (2002).
Load the packages we will be using in this lesson
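A minimal setup chunk (the specific packages are my assumption here, based on what is used later in the lesson: the tidyverse for plotting and data wrangling, and broom for tidy model output):

```r
# Packages assumed for this lesson
library(tidyverse)  # ggplot2, dplyr, etc.
library(broom)      # tidy(), glance(), augment() for model output
```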
Linear regression analysis
Statistical models that assume a linear relationship between a response variable and a single, (usually) continuous predictor variable are simple linear regression models.
These models have three primary purposes:
describe a linear relationship between \( Y \) and \( X \)
determine how much of the variation in \( Y \) is explained by \( X \) and how much remains unexplained
predict values of \( Y \) from \( X \)
Simple bivariate linear regression
Linear model for regression
Consider you have a set of observations (\( i = 1, \ldots, n \)), where each observation was chosen based on its \( X \) value, and its \( Y \) value is sampled from a population of possible \( Y \) values.
This model can be represented as \( y_i = \beta_0 + \beta_1 x_i + \epsilon_i \), where:
\( y_i \) is the value of \( Y \) for the ith observation when the predictor \( X = x_i \)
\( \beta_0 \) is the population intercept, i.e., the mean value of the probability distribution of \( Y \) when \( x_i = 0 \)
\( \beta_1 \) is the population slope and measures the change in \( Y \) per unit change in \( X \)
\( \epsilon_i \) is the random or unexplained error associated with the ith observation
In this model, the response variable \( Y \) is a random variable and \( X \) represents fixed values chosen by the researcher. Thus, under repeated sampling, you would have the same values of \( X \) while \( Y \) would vary.
Estimating model parameters
The main goal in regression analysis is estimating \( \beta_0 \), \( \beta_1 \), and \( \sigma_\epsilon^2 \).
We discussed solving for \( \beta_0 \) and \( \beta_1 \) using OLS in an earlier lesson
Regression slope
The most informative of the parameters in a regression equation is \( \beta_{1} \), because this describes the relationship between \( Y \) and \( X \).
Intercept
The OLS regression line must pass through the point (\( \bar{x} \), \( \bar{y} \)). We can then estimate \( \beta_{0} \) by substituting \( \beta_{1} \), \( \bar{x} \), and \( \bar{y} \) into the regression equation.
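For reference, the standard OLS estimators (as covered in the earlier lesson) are

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]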
Often the intercept does not contain a lot of usable information because rarely do we have situations where \( X = 0 \).
Let's begin to explore this with the coarse woody debris (CWD) data in lakes (the christ data in Chap 5 on GitHub).
`lm()` is the function in R to conduct simple linear regression.
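As a sketch, the model might be fit like this (the data object christ and the column names CWDBASAL and RIPDENS are assumptions based on the Chapter 5 dataset; adjust them to match your data):

```r
# Fit a simple linear regression of coarse woody debris basal area (response)
# on riparian tree density (predictor); column names are assumed
mod_cwd <- lm(CWDBASAL ~ RIPDENS, data = christ)
```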
There is a lot of information stored in our object mod_cwd.
We can call on these directly from our mod_cwd or use several ‘helper’ functions.
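For example, a few of the base R 'helper' functions applied to mod_cwd:

```r
summary(mod_cwd)    # coefficients, standard errors, t-tests, r-squared
coef(mod_cwd)       # estimated intercept and slope
confint(mod_cwd)    # confidence intervals for the coefficients
fitted(mod_cwd)     # predicted y-hat for each observation
residuals(mod_cwd)  # observed minus fitted values
```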
The broom package makes inspection of the models a bit easier than base R (although base R is not too difficult). The biggest plus for broom is that model outputs are returned in a tidy format.
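A quick sketch of the three main broom verbs applied to mod_cwd:

```r
tidy(mod_cwd)     # one row per coefficient: estimate, SE, t statistic, p-value
glance(mod_cwd)   # one-row model summary: r-squared, F statistic, AIC, etc.
augment(mod_cwd)  # original data plus fitted values, residuals, leverage, Cook's D
```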
Confidence intervals
Confidence intervals for \( \beta_{1} \) are calculated in the usual manner when we know the standard error of a statistic and use the t distribution.
This can be represented as a confidence band (e.g. 95%) for the regression line. The 95% confidence band is a band that will contain the true population regression line 95% of the time.
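A sketch of that calculation (estimate ± t × SE with n − 2 df), which should match the built-in confint():

```r
# Confidence intervals 'by hand' from the tidy() output
tidy(mod_cwd) %>%
  mutate(
    lower = estimate + qt(0.025, df = df.residual(mod_cwd)) * std.error,
    upper = estimate + qt(0.975, df = df.residual(mod_cwd)) * std.error
  )

# The built-in helper gives the same intervals
confint(mod_cwd)
```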
We can display our confidence intervals using geom_smooth in ggplot.
We can also use geom_smooth to explore other non-linear relationships between \( X \) and \( Y \).
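A sketch of both uses (data and column names assumed, as above):

```r
# Linear fit with its 95% confidence band
ggplot(christ, aes(x = RIPDENS, y = CWDBASAL)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# The same plot with a loess smoother, to explore possible non-linearity
ggplot(christ, aes(x = RIPDENS, y = CWDBASAL)) +
  geom_point() +
  geom_smooth(method = "loess", se = TRUE)
```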
While using geom_smooth makes nice visuals, I think you have a lot more flexibility when you build your own predictions. The predict() function is one of my favorite functions in R.
Predicted values
Prediction from the OLS regression equation is straightforward by substituting an X-value into the regression equation and calculating the predicted Y-value. Do not predict from X-values outside the range of your data.
If we run predict() with just the model, we get the same results as in fitted.values.
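For example:

```r
# With no new data, predict() returns the fitted values for the original data
head(predict(mod_cwd))
head(fitted(mod_cwd))  # same values
```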
It helps to bind your predictions (and their standard errors) to your original data. NOTE: augment() already does this for you.
While these values are helpful in displaying the basic model fit, there are often times (especially when doing multiple linear regression) when you want to look at predictions based on specific values. We can do this by using the newdata argument in predict(). NOTE that the column header names need to reflect the independent variables in your model.
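A sketch, where the specific RIPDENS values in newdata are hypothetical:

```r
# Predictions at chosen predictor values; the column name in newdata must
# match the predictor in the model (RIPDENS is assumed here)
new_dat <- data.frame(RIPDENS = seq(800, 2000, by = 200))

pred <- predict(mod_cwd, newdata = new_dat, se.fit = TRUE)

# Bind the predictions and their standard errors to the new data
new_dat %>%
  mutate(fit = pred$fit, se = pred$se.fit)
```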
Residuals
The difference between each observed \( y_{i} \) and each predicted \( \hat{y_i} \) is called a residual: \( e_{i} = y_i - \hat{y_i} \)
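In R, we can compute these by hand and compare them with the extractor function (data and column names assumed, as above):

```r
# Residuals: observed y minus fitted y-hat
e_i <- christ$CWDBASAL - fitted(mod_cwd)

# Same values via residuals() (or the .resid column from augment())
all.equal(unname(e_i), unname(residuals(mod_cwd)))
```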
Analysis of variance
In the biological sciences we often want to partition the total variation in \( Y \) into the part explained by \( X \) and the part that remains unexplained. The partitioned variance is often presented as an analysis of variance (ANOVA) table.
Total variation in \( Y \) is the sum of squared deviations of each observation from the sample mean
\( SS_{total} \) has n-1 df and can be partitioned into two additive components
Variation in \( Y \) explained by \( X \) (the difference between \( \hat{y_i} \) and \( \bar{y} \)). The number of degrees of freedom associated with this component is the number of model parameters minus one (here, \( 2 - 1 = 1 \)).
Variation in \( Y \) not explained by \( X \) (the difference between each observed \( y_i \) and \( \hat{y_i} \)); this is the residual (or error) variation. The \( df_{residual} \) is n-2, because we have already estimated \( \beta_0 \) and \( \beta_1 \) to determine the \( \hat{y_i} \).
The SS and df are additive
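In symbols, the partition is

\[
\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{SS_{Total}} =
\underbrace{\sum_{i=1}^{n}(\hat{y_i} - \bar{y})^2}_{SS_{Regression}} +
\underbrace{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}_{SS_{Residual}},
\qquad (n - 1) = 1 + (n - 2)
\]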
The \( SS_{total} \) increases with sample size. The mean square (MS) is a measure of variability that does not depend on sample size. Each MS is calculated by dividing a SS by its df; unlike the SS, the MS are not additive.
The \( MS_{Residual} \) estimates the common variance of the error terms \( \epsilon_{i} \), and therefore of the \( Y \)-values at each \( x_i \). NOTE: a key assumption is homogeneity of variances.
We can calculate the ANOVA table from our linear model in R by using the anova() function.
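For example:

```r
# ANOVA table for the regression: SS, df, MS, F statistic, and p-value
anova(mod_cwd)
```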
Variance explained ( \(r^2\) or \( R^2 \))
A descriptive measure of association between Y and X (also termed the coefficient of determination); it is the proportion of the total variation in Y that is explained by its linear relationship with X.
\( r^2 = 1 - \frac{SS_{residual}}{SS_{total}} \)
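In R, for example:

```r
# r-squared from the model summary (also reported by broom::glance())
summary(mod_cwd)$r.squared
glance(mod_cwd)$r.squared
```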
Scatterplot with marginal boxplots
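One way to build this figure is with ggExtra's ggMarginal() (the package choice, data, and column names are my assumptions; any marginal-plot approach works):

```r
# Scatterplot of the regression data with marginal boxplots
library(ggExtra)

p <- ggplot(christ, aes(x = RIPDENS, y = CWDBASAL)) +
  geom_point() +
  geom_smooth(method = "lm")

ggMarginal(p, type = "boxplot")
```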
Assumptions of a regression model
Normality (except GLMs)
Homogeneity of variance
Independence
Fixed X
Regression diagnostics
A proper interpretation of a linear regression analysis should also include checks of how well the model fits the observed data
Is a straight line appropriate?
Influence of outliers?
See-saw, balanced on the mean of X
Leverage
Leverage is a measure of how extreme an observation is for the \(X \)-variable
Generally we are concerned when a value is 2 or 3 times greater than the mean leverage value
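For example:

```r
# Leverage (hat) values for each observation
h <- hatvalues(mod_cwd)

# Flag observations with leverage more than 2-3 times the mean leverage
which(h > 2 * mean(h))
which(h > 3 * mean(h))
```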
Residuals
Residuals are an important way of checking regression assumptions
Studentized residuals do have constant variance so different studentized residuals can be compared
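A quick sketch of extracting and plotting them:

```r
# Standardized (internally studentized) and studentized (externally) residuals
rstandard(mod_cwd)
rstudent(mod_cwd)

# Residual check: studentized residuals against fitted values
plot(fitted(mod_cwd), rstudent(mod_cwd),
     xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = 0, lty = 2)
```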
Influence
Cook’s distance statistic, \( D_i \), is the measure of the influence each observation has on the fitted regression line and the estimates of the regression parameters.
A large \( D_i \) indicates that removal of that observation would change the estimates of the regression parameters considerably
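For example:

```r
# Cook's distance for each observation
cooks.distance(mod_cwd)

# Base R diagnostic plot of Cook's distance by observation
plot(mod_cwd, which = 4)
```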
Both \( X \) and \( Y \) are chosen haphazardly or at random
This situation calls for Model II regression, and the approach is controversial
If the purpose of regression is prediction, then OLS
If the purpose of regression is mechanisms, then not OLS (?)
When there is error variability associated with both \( Y \) (\( \sigma_{\epsilon}^2 \)) and \( X \) (\( \sigma_{\gamma}^2 \)), the OLS estimate of \( \beta_1 \) is biased towards zero
Major axis (MA) regression fits line minimizing the sum of squared perpendicular distances from each observation to the fitted line
Reduced major axis (RMA) regression or standard major axis (SMA) regression is fitted by minimizing the sum of areas of the triangles formed by vertical and horizontal lines from each observation to the fitted line
RMA is appropriate when the error variances are proportional to the variances of the corresponding variables: \( \sigma_{\epsilon}^2 \propto \sigma_y^2 \) and \( \sigma_{\gamma}^2 \propto \sigma_x^2 \)
Simulated comparisons of OLS, MA and RMA regression analyses when X is random indicated:
RMA estimate of \( \beta_1 \) is less biased than the MA estimate
If the error variability in X is more than ~ a third of the error variability in Y, then RMA is the preferred method; otherwise OLS is acceptable
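One option for fitting MA and RMA lines in R is the lmodel2 package (the package choice is my assumption; data and column names are assumed as above). Note that lmodel2 labels reduced major axis regression as SMA:

```r
# Model II regression: returns OLS, MA, and SMA (= RMA here) estimates
library(lmodel2)

mod2_cwd <- lmodel2(CWDBASAL ~ RIPDENS, data = christ)
mod2_cwd
```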
Robust regression
A limitation of OLS is that the estimates of model parameters, and therefore subsequent hypothesis tests, can be sensitive to distributional assumptions and affected by outlying observations
Least absolute deviance (LAD)
Minimize the sum of absolute values of the residuals rather than the sum of squared residuals
Because the residuals are not squared, extreme observations have less influence on the fitted model
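A sketch using the quantreg package (my assumption), where median regression (tau = 0.5) is equivalent to LAD:

```r
# Least absolute deviations fit via median (tau = 0.5) quantile regression
library(quantreg)

mod_lad <- rq(CWDBASAL ~ RIPDENS, tau = 0.5, data = christ)
summary(mod_lad)
```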
M-estimator
M-estimators involve minimizing the sum of some function of \( e_i \)
Huber M-estimators, described earlier, weight the observations differently depending how far they are from the central tendency
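A sketch using MASS::rlm() (one option; data and column names assumed as above):

```r
# Huber M-estimation of the regression parameters
library(MASS)

mod_m <- rlm(CWDBASAL ~ RIPDENS, data = christ, psi = psi.huber)
summary(mod_m)
```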
Rank-based regression
Does not assume any specific distribution of the error terms but still fits the usual linear regression model
Useful when transformations are either ineffective or misrepresent the underlying biological process
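A sketch using the Rfit package (my assumption) for rank-based estimation:

```r
# Rank-based (R) estimation of the linear regression model
library(Rfit)

mod_rank <- rfit(CWDBASAL ~ RIPDENS, data = christ)
summary(mod_rank)
```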
Relationship between regression and correlation
Simple correlation analysis is used when we seek to measure the strength of the linear relationship (the correlation coefficient) between the two variables
Regression analysis is used when we can biologically distinguish a response variable \( Y \) from a predictor variable \( X \)
We can construct a model relating \( Y \) to \( X \) and use this model to predict \( Y \) from \( X \)
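As a quick check of that connection (data and column names assumed as above), the squared Pearson correlation equals the regression \( r^2 \) in simple linear regression:

```r
# Correlation coefficient and regression r-squared are directly related
r <- cor(christ$RIPDENS, christ$CWDBASAL)
r^2
summary(mod_cwd)$r.squared  # same value
```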