1. Introduction and Review
In the previous tutorial, we explored the basics of linear models, focusing on models with homogeneous variance.
We discussed how to fit linear models using the lm() function in R, examined the structure of linear models, and analyzed how to interpret the output from summary(lm_fit).
In this tutorial, we'll dive deeper into these concepts, particularly focusing on using the CO2 dataset as an example.
2. The lm() Function
The lm() function in R is used to fit linear models. It takes a formula, which specifies the model, and a dataset as its primary arguments. The syntax is as follows:
lm(formula, data, subset, weights, na.action, method, ...)
- formula: Specifies the model to be fitted, using a response variable and one or more predictor variables.
- data: The data frame containing the variables referenced in the formula.
- subset: An optional argument specifying a subset of observations to be used.
- weights: An optional vector of weights to be used in the fitting process.
- na.action: A function specifying how to handle missing values.
- method: The method to be used for fitting; the default is “qr”.
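To make the optional arguments concrete, here is a minimal, hypothetical sketch using the built-in CO2 data; the subset and weights are chosen only to show the syntax (unit weights match the default):
# Illustrative use of the optional lm() arguments
fit_q <- lm(uptake ~ conc, data = CO2,
            subset = Type == "Quebec",    # use only the Quebec plants
            weights = rep(1, nrow(CO2)),  # one weight per row of CO2
            na.action = na.omit)          # drop incomplete cases
coef(fit_q)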
3. Mean Structure Using a Model Formula
In linear models, the mean structure is specified using a model formula, which describes the relationship between the response and predictor variables. In R, the syntax is:
response ~ predictor1 + predictor2 + ...
Each term in the formula represents a different component of the model:
– Main effects: Represent the effect of individual predictors.
– Interaction effects: Represent the combined effect of multiple predictors.
The model formula is interpreted as a sequence of terms, and each term corresponds to one or more columns in the design matrix built from the model frame.
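As a quick illustration with the formula used later in this tutorial, the * operator expands into main effects plus their interaction:
# uptake ~ conc * Type is shorthand for uptake ~ conc + Type + conc:Type
tt <- terms(uptake ~ conc * Type)
attr(tt, "term.labels")   # "conc" "Type" "conc:Type"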
4. Creating a Model Frame and Design Matrix
The model frame is a data structure that includes the response and predictor variables, together with any transformations specified in the formula. The design matrix is the numeric matrix representation of the model, in which each term in the formula contributes one or more columns. In R, you can create a model frame with the model.frame() function and a design matrix with the model.matrix() function.
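For the CO2 example used throughout this tutorial:
# Model frame: response, predictors, and any transformations from the formula
mf <- model.frame(uptake ~ conc * Type, data = CO2)
# Design matrix: each term contributes one or more numeric columns
X <- model.matrix(uptake ~ conc * Type, data = CO2)
colnames(X)  # "(Intercept)" "conc" "TypeMississippi" "conc:TypeMississippi"
dim(X)       # 84 rows, 4 columns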
5. Fitting a Linear Model Using lm() and gls()
To fit a linear model, you can use the lm() function for ordinary least squares regression, or the gls() function from the nlme package for generalized least squares. The gls() function is useful when you need to account for correlation structures or heteroscedasticity in your data.
library(nlme)  # provides gls()
lm_fit <- lm(uptake ~ conc * Type, data = CO2)
gls_fit <- gls(uptake ~ conc * Type, data = CO2)
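As written, the gls() call above specifies no correlation or variance structure, so it reproduces the OLS fit. As an illustrative sketch (not the model analyzed below), a within-plant compound-symmetry correlation could be added via the correlation argument:
# Sketch: GLS with compound-symmetric correlation within each plant
gls_cs <- gls(uptake ~ conc * Type, data = CO2,
              correlation = corCompSymm(form = ~ 1 | Plant))
summary(gls_cs)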
6. Extracting Information from the Model Fit Object
After fitting a model, the resulting object contains a wealth of information. You can use functions such as summary(), coef(), confint(), and anova() to extract and interpret various aspects of the model. For example, summary(lm_fit) provides a detailed summary of the model, including estimated coefficients, standard errors, t-values, and p-values.
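For example, applied to the lm_fit object created above:
coef(lm_fit)     # point estimates of the coefficients
confint(lm_fit)  # 95% confidence intervals (level = 0.95 by default)
anova(lm_fit)    # sequential (Type I) ANOVA table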
7. Testing a Linear Hypothesis for Fixed Effects
Testing linear hypotheses is crucial for understanding whether specific effects in your model are significant. There are two main types of tests:
- t-Test for Individual Coefficients: This test evaluates the significance of an individual regression coefficient. The test statistic is
\[
t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)},
\]
where \( \hat{\beta}_j \) is the estimated coefficient and \( \text{SE}(\hat{\beta}_j) \) is its standard error. Under the null hypothesis, \( t_j \) follows a t-distribution with \( n - p \) degrees of freedom.
- F-Test for Multiple Coefficients: This test evaluates the significance of one or more coefficients simultaneously by comparing a full model with \( p_1 \) parameters to a nested reduced model with \( p_0 \) parameters. The test statistic is
\[
F = \frac{\left(\text{SSR}_{\text{reduced}} - \text{SSR}_{\text{full}}\right) / (p_1 - p_0)}{\text{SSR}_{\text{full}} / (n - p_1)},
\]
where \( \text{SSR}_{\text{full}} \) and \( \text{SSR}_{\text{reduced}} \) are the sums of squared residuals for the full and reduced models, respectively. Since the reduced model has fewer parameters, it has the larger residual sum of squares, so the numerator is non-negative. Under the null hypothesis, the F-statistic follows an F-distribution with \( p_1 - p_0 \) and \( n - p_1 \) degrees of freedom. In R, this comparison can be carried out with anova(), as sketched below.
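A minimal sketch of this F-test, using the nested CO2 models from this tutorial:
# Does the conc:Type interaction improve the fit?
fit_reduced <- lm(uptake ~ conc + Type, data = CO2)  # p0 = 3 parameters
fit_full    <- lm(uptake ~ conc * Type, data = CO2)  # p1 = 4 parameters
anova(fit_reduced, fit_full)  # F-test with p1 - p0 = 1 and n - p1 = 80 df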
8. Example with the CO2 Dataset
8.1. Plot and Analysis
Below is a repeated measures plot of the CO2 data, showing how CO2 uptake varies with concentration for each plant type. This plot helps us understand the trends and variations within the data.
# Load necessary libraries
library(ggplot2)

# Load the CO2 dataset
data(CO2)

# Plot uptake against concentration, one line per plant, colored by type
ggplot(CO2, aes(x = conc, y = uptake, color = Type,
                group = interaction(Type, Plant))) +
  geom_line() +
  geom_point(size = 2) +
  labs(title = "Repeated Measures Plot of CO2 Uptake",
       x = "Concentration (conc)",
       y = "CO2 Uptake (uptake)",
       color = "Plant Type") +
  theme_minimal() +
  theme(legend.position = "bottom")
The plot shows how CO2 uptake changes with concentration for the two plant types. Each line traces the measurements of an individual plant, with color distinguishing Quebec from Mississippi plants. This visualization makes the trends and the within-type variation easy to see: uptake clearly depends on concentration, and the two plant types behave differently. These observations motivate the model fits and residual checks that follow.
8.2. Summaries of Linear Models
Let’s fit a linear model to the CO2 dataset and examine the output using the summary() function. The model formula is uptake ~ conc * Type, where uptake is the response variable, and conc and Type are the predictors.
lm_fit <- lm(uptake ~ conc * Type, data = CO2)
summary(lm_fit)
The output of summary(lm_fit) provides the following information:
Call:
lm(formula = uptake ~ conc * Type, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-16.3956 -5.5250 -0.1604 5.5724 12.0072
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.503038 1.910531 12.302 < 2e-16 ***
conc 0.023080 0.003638 6.344 1.25e-08 ***
TypeMississippi -8.005495 2.701899 -2.963 0.00401 **
conc:TypeMississippi -0.010699 0.005145 -2.079 0.04079 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.936 on 80 degrees of freedom
Multiple R-squared: 0.6035, Adjusted R-squared: 0.5887
F-statistic: 40.59 on 3 and 80 DF, p-value: 4.78e-16
- Residuals: These are the differences between the observed and fitted values. The summary provides the minimum, 1st quartile, median, 3rd quartile, and maximum of the residuals.
- Coefficients: The table shows the estimated coefficients for the intercept, conc, TypeMississippi, and their interaction. Each coefficient is accompanied by its standard error, t-value, and p-value. For example:
  - (Intercept): The estimated intercept is 23.503038, with a highly significant p-value (< 2e-16).
  - conc: The estimated coefficient for conc is 0.023080, indicating that for Quebec plants (the reference level), a one-unit increase in concentration raises uptake by approximately 0.023. This effect is highly significant (p-value = 1.25e-08).
  - TypeMississippi: The estimated coefficient for TypeMississippi is -8.005495, indicating that Mississippi plants have lower uptake than Quebec plants, a significant effect (p-value = 0.00401).
  - conc:TypeMississippi: The interaction term has a coefficient of -0.010699, suggesting that the effect of concentration on uptake is weaker for Mississippi plants than for Quebec plants; its p-value of 0.04079 indicates significance at the 5% level.
- Residual standard error: This value (6.936) estimates the standard deviation of the error term, i.e., the typical size of a residual.
- R-squared and Adjusted R-squared: These values (0.6035 and 0.5887, respectively) indicate the proportion of variance in the response variable explained by the model.
- F-statistic: The F-statistic (40.59 on 3 and 80 degrees of freedom) tests the null hypothesis that all coefficients other than the intercept are zero; its p-value of 4.78e-16 indicates a highly significant overall model fit.
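These summary quantities can also be extracted programmatically from the summary object:
s <- summary(lm_fit)
s$sigma          # residual standard error (6.936)
s$r.squared      # multiple R-squared (0.6035)
s$adj.r.squared  # adjusted R-squared (0.5887)
s$fstatistic     # F value with its numerator and denominator df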
Generalized Least Squares (GLS) Fit
The generalized least squares (GLS) method, here fitted by restricted maximum likelihood (REML), can account for correlation and non-constant variance in the residuals. Because no correlation or variance structure is specified in the call below, the coefficient estimates coincide with the OLS estimates from lm(); the value of gls() lies in the extensions it permits. Here is a detailed explanation of the results from summary(gls_fit) for the CO2 dataset:
# Load the nlme package
library(nlme)
# Fit the GLS model
gls_fit <- gls(uptake ~ conc * Type, data = CO2, method = "REML")
# Summarize the GLS fit
summary(gls_fit)
Generalized least squares fit by REML
Model: uptake ~ conc * Type
Data: CO2
AIC BIC logLik
584.5892 596.4994 -287.2946
Coefficients:
Value Std.Error t-value p-value
(Intercept) 23.503038 1.9105309 12.301836 0.0000
conc 0.023080 0.0036383 6.343708 0.0000
TypeMississippi -8.005495 2.7018987 -2.962915 0.0040
conc:TypeMississippi -0.010699 0.0051453 -2.079368 0.0408
Correlation:
(Intr) conc TypMss
conc -0.828
TypeMississippi -0.707 0.586
conc:TypeMississippi 0.586 -0.707 -0.828
Standardized residuals:
Min Q1 Med Q3 Max
-2.36390769 -0.79658623 -0.02312032 0.80342921 1.73118290
Residual standard error: 6.935822
Degrees of freedom: 84 total; 80 residual
Summary of GLS Results:
- Model Fit Statistics:
- AIC (Akaike Information Criterion): 584.5892
- BIC (Bayesian Information Criterion): 596.4994
- Log-Likelihood: -287.2946
These statistics provide an overall measure of model fit. Lower AIC and BIC values generally indicate a better-fitting model, while the log-likelihood measures the likelihood of the data given the model. Comparing these values across models can help in model selection.
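These criteria can be read directly off the fitted object:
AIC(gls_fit)     # 584.5892
BIC(gls_fit)     # 596.4994
logLik(gls_fit)  # -287.2946 (REML criterion)
# Note: REML criteria are only comparable across models with the same
# fixed-effects structure.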
- Coefficients:
- (Intercept): Estimate = 23.503038, Std. Error = 1.910531, t-value = 12.302, p-value < 0.0001
- conc: Estimate = 0.023080, Std. Error = 0.003638, t-value = 6.344, p-value < 0.0001
- TypeMississippi: Estimate = -8.005495, Std. Error = 2.701899, t-value = -2.963, p-value = 0.0040
- conc:TypeMississippi: Estimate = -0.010699, Std. Error = 0.005145, t-value = -2.079, p-value = 0.0408
These coefficients indicate the effect of each predictor on the response variable. Significant predictors are those with low p-values (below the typical threshold of 0.05), indicating a strong effect on the response.
- Correlation of Coefficients:
- conc and TypeMississippi: 0.586
- conc and conc:TypeMississippi: -0.707
- TypeMississippi and conc:TypeMississippi: -0.828
The correlation between coefficients reveals how changes in one coefficient are associated with changes in another. High correlations can indicate multicollinearity.
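The full matrix can be recovered from the fitted object via its varBeta component, which holds the approximate covariance of the coefficient estimates:
# Correlation matrix of the coefficient estimates
cov2cor(gls_fit$varBeta)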
- Standardized Residuals:
- Min: -2.364
- Q1 (1st Quartile): -0.797
- Median: -0.023
- Q3 (3rd Quartile): 0.803
- Max: 1.731
These are the raw residuals divided by the residual standard error, giving a scale-free measure of residual variation.
- Residual Standard Error: 6.936
- Degrees of Freedom: 84 total; 80 residual
The residual standard error indicates the average distance between the observed and fitted values, while the degrees of freedom reflect the number of observations used in the model fitting process, adjusted for the number of parameters estimated.
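The standardized residual summary printed above can be reproduced from the fit; for this model, the standardized residuals are the Pearson residuals (raw residuals divided by the residual standard error):
res_std <- residuals(gls_fit, type = "pearson")
summary(res_std)  # compare with the quantiles printed in the summary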
Interpretation:
The GLS model provides a fit that can adjust for correlation and variance structure when one is specified. The fit statistics (AIC, BIC, and log-likelihood) summarize the quality of the fit, with lower AIC and BIC generally indicating a better fit; the coefficients quantify the impact of each predictor, with significant predictors having p-values below 0.05; the correlations between coefficients help assess multicollinearity; and the standardized residuals, residual standard error, and degrees of freedom describe the residual variation and the information available for estimation.
By checking these aspects, we ensure that the model adequately captures the relationships in the data and meets the assumptions underlying the GLS method.