"

CLINICAL BIOSTATS

Linear Models with Homogeneous Variance

This tutorial addresses the application of linear models (LMs) to data from research studies where observations are independent. LMs are broadly used to quantify the relationship between a dependent variable and a set of covariates through a linear function that depends on a few regression parameters.

The focus here is on the classical LM, which is suitable for analyzing data with independent observations and homogeneous variance. The types of linear models covered include standard linear regression, analysis of variance (ANOVA), and analysis of covariance (ANCOVA) models. Later, we will relax the assumption of variance homogeneity and explore LMs for independent observations with nonconstant variance. Generalized linear models (GLIMs) and nonlinear regression models are also used for independent observation data but are outside the scope of this tutorial.

The foundational concepts introduced here, such as design matrices and likelihood estimation, will be essential for understanding fixed-effects LMs for correlated data and linear mixed models (LMMs). By presenting these concepts in a simpler context, we aim to make their application in more complex models more accessible.

In this tutorial, we delve into the theoretical concepts underlying the classical LM, with a particular focus on implementations in R. For a more detailed treatment, readers are referred to Neter et al. (1990).

Model Specification

The classical linear model (LM) is formulated both at the individual observation level and for all observations collectively.

Model Specification at the Observation Level

The classical LM for independent, normally distributed observations \( y_i \) (where \( i = 1, \dots, n \)) with constant variance is specified as follows: \[ y_i = x_i(1) b_1 + \dots + x_i(p) b_p + e_i, \] where \( e_i \sim N(0, \sigma^2) \), \( x_i(1), \dots, x_i(p) \) are known covariate values for the \( i \)-th observation, \( b_1, \dots, b_p \) are unknown regression parameters, and the residual errors \( e_1, \dots, e_n \) are assumed to be independent.
Defining the column vectors \( \mathbf{x}_i \equiv (x_i(1), \dots, x_i(p))^T \) and \( \mathbf{b} \equiv (b_1, \dots, b_p)^T \), the equation can be written as: \[ y_i = \mathbf{x}_i^T \mathbf{b} + e_i. \] From these equations, the expected value and variance of \( y_i \) are: \[ E(y_i) \equiv \mu_i = \mathbf{x}_i^T \mathbf{b}, \quad \text{and} \quad \text{Var}(y_i) = \sigma^2. \]

Model Equation for All Data

The model can be compactly expressed by defining: \[ \mathbf{y} \equiv \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{e} \equiv \begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}, \] \[ \mathbf{X} \equiv \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} = \begin{pmatrix} x_1(1) & \dots & x_1(p) \\ \vdots & \ddots & \vdots \\ x_n(1) & \dots & x_n(p) \end{pmatrix} \equiv (\mathbf{x}(1), \mathbf{x}(2), \dots, \mathbf{x}(p)). \] Then, the model can be written as: \[ \mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e}, \] where \( \mathbf{e} \sim N_n(\mathbf{0}, \mathbf{R}) \), with \( \mathbf{R} = \sigma^2 \mathbf{I}_n \) (where \( \mathbf{I}_n \) is the \( n \times n \) identity matrix).

We assume that \( \mathbf{X} \) is of full rank \( p \), i.e., that the columns \( \mathbf{x}(1), \dots, \mathbf{x}(p) \) are linearly independent; the \( i \)-th row of \( \mathbf{X} \) is \( \mathbf{x}_i^T \).
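To make the matrix formulation concrete, here is a minimal R sketch. It uses R's built-in mtcars data with an arbitrary choice of response (mpg) and covariates (wt, hp) purely for illustration; neither the data set nor the variables are part of this tutorial's development.

```r
## Classical LM in matrix form: y = X b + e, Var(e) = sigma^2 * I_n.
## Illustrated with R's built-in mtcars data (an arbitrary example).
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)        # n x p design matrix (includes the intercept column)
y <- mtcars$mpg               # response vector
qr(X)$rank == ncol(X)         # TRUE: X is of full rank p (linearly independent columns)

## Fitted values reproduce X %*% b_hat
b_hat <- coef(fit)
all.equal(as.vector(X %*% b_hat), unname(fitted(fit)))
```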

Offset

Models can be modified by introducing a known additional term \( x_i(0) \) into the equation: \[ y_i = x_i(0) + x_i(1) b_1 + \dots + x_i(p) b_p + e_i, \] where the residual error \( e_i \) follows the same distribution as before. This model can be represented for all data as: \[ \mathbf{y} = \mathbf{x}(0) + \mathbf{X} \mathbf{b} + \mathbf{e}, \] where \( \mathbf{x}(0) \equiv (x_1(0), \dots, x_n(0))^T \), and the residual error vector \( \mathbf{e} \) is distributed as before.

An offset is included as an additional (first) column in the design matrix \( \mathbf{X} \), with the corresponding coefficient \( b_0 \) fixed at 1 rather than estimated. Alternatively, the offset can be absorbed by redefining the response as \( y_i^* \equiv y_i - x_i(0) \), which yields a classical LM for \( y_i^* \) as specified earlier. Offsets, while not commonly needed for classical LMs in practice, are useful for illustrating computational approaches, such as generalized least squares algorithms, and are available in statistical software like R.
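A short R sketch of the offset idea, again with the built-in mtcars data; the offset vector off below is an arbitrary, hypothetical known term chosen only to show the mechanics of offset() in lm().

```r
## Offset term x_i(0): a known component of the mean with coefficient fixed at 1.
off <- 0.5 * mtcars$disp / 100            # hypothetical known term x(0), for illustration

fit_off <- lm(mpg ~ wt + hp + offset(off), data = mtcars)

## Equivalently, absorb the offset into the response: y* = y - x(0)
fit_adj <- lm(I(mpg - off) ~ wt + hp, data = mtcars)
all.equal(coef(fit_off), coef(fit_adj))   # identical regression coefficients
```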

Estimation

Researchers typically seek to estimate the parameters \( \mathbf{b} \) and \( \sigma^2 \). For classical LMs, ordinary least squares (OLS) is the standard approach; we also introduce maximum-likelihood (ML) and restricted maximum-likelihood (REML) estimation, which become essential for more complex models such as linear mixed models (LMMs).

Ordinary Least Squares

In OLS, estimates of \( \mathbf{b} \) are obtained by minimizing the residual sum of squares: \[ \sum_{i=1}^n (y_i - \mathbf{x}_i^T \mathbf{b})^2. \] The OLS estimator is: \[ \mathbf{b}_{\text{OLS}} \equiv \left(\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T \right)^{-1} \sum_{i=1}^n \mathbf{x}_i y_i = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. \] OLS does not require normality assumptions, but it does assume uncorrelated residual errors with constant variance. The unbiased estimator of \( \sigma^2 \) is: \[ \sigma_{\text{OLS}}^2 \equiv \frac{1}{n-p} \sum_{i=1}^n (y_i - \mathbf{x}_i^T \mathbf{b}_{\text{OLS}})^2 = \frac{1}{n-p} (\mathbf{y} - \mathbf{X} \mathbf{b}_{\text{OLS}})^T (\mathbf{y} - \mathbf{X} \mathbf{b}_{\text{OLS}}). \]
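The OLS formulas can be verified directly in R. The sketch below, using the same illustrative mtcars example, computes \( \mathbf{b}_{\text{OLS}} \) and \( \sigma_{\text{OLS}}^2 \) "by hand" and compares them with lm().

```r
## OLS "by hand" versus lm(): b_OLS = (X'X)^{-1} X'y, sigma2_OLS = RSS / (n - p)
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit); y <- mtcars$mpg
n <- nrow(X); p <- ncol(X)

b_ols <- solve(crossprod(X), crossprod(X, y))     # (X'X)^{-1} X'y
rss   <- sum((y - X %*% b_ols)^2)                 # residual sum of squares
sigma2_ols <- rss / (n - p)                       # unbiased variance estimate

all.equal(as.vector(b_ols), unname(coef(fit)))    # matches lm()'s coefficients
all.equal(sigma2_ols, sigma(fit)^2)               # matches lm()'s residual variance
```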

Maximum-Likelihood Estimation

Given the assumptions of normality and independence of the \( y_i \), the likelihood function is: \[ L_{\text{Full}}(\mathbf{b}, \sigma^2; \mathbf{y}) \equiv (2\pi \sigma^2)^{-\frac{n}{2}} \prod_{i=1}^n \exp \left( -\frac{(y_i - \mathbf{x}_i^T \mathbf{b})^2}{2\sigma^2} \right), \] with log-likelihood (up to a constant not involving the parameters): \[ \log L_{\text{Full}}(\mathbf{b}, \sigma^2; \mathbf{y}) \equiv -\frac{n}{2} \log (\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mathbf{x}_i^T \mathbf{b})^2. \] Maximizing it with respect to \( \mathbf{b} \) and \( \sigma^2 \) yields the ML estimators. The ML estimator of \( \mathbf{b} \) is: \[ \mathbf{b}_{\text{ML}} \equiv (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \] which is identical to \( \mathbf{b}_{\text{OLS}} \). The ML estimator of \( \sigma^2 \) is: \[ \sigma_{\text{ML}}^2 \equiv \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf{x}_i^T \mathbf{b}_{\text{ML}})^2 = \frac{1}{n} (\mathbf{y} - \mathbf{X} \mathbf{b}_{\text{ML}})^T (\mathbf{y} - \mathbf{X} \mathbf{b}_{\text{ML}}), \] which is biased downward because it does not account for the estimation of \( \mathbf{b} \).
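A brief R check of these results, under the same illustrative mtcars example: \( \mathbf{b}_{\text{ML}} \) coincides with the lm() coefficients, \( \sigma_{\text{ML}}^2 \) divides the residual sum of squares by \( n \), and the manual log-likelihood reproduces logLik() for an lm fit.

```r
## ML estimation: b_ML equals b_OLS; sigma2_ML divides the RSS by n rather than n - p
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)
rss <- sum(residuals(fit)^2)

sigma2_ml <- rss / n                            # ML estimate of sigma^2

## Full log-likelihood evaluated at the ML estimates (including the 2*pi constant)
ll_manual <- -n/2 * (log(2 * pi) + log(sigma2_ml) + 1)
all.equal(ll_manual, as.numeric(logLik(fit)))   # matches R's logLik() for lm
```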

Restricted Maximum-Likelihood Estimation

REML avoids the bias in estimating \( \sigma^2 \) that results from estimating \( \mathbf{b} \). Up to terms not involving \( \sigma^2 \), the REML log-likelihood function is: \[ \log L_{\text{REML}}(\sigma^2; \mathbf{y}) \equiv -\frac{n-p}{2} \log (\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n \hat{e}_i^2, \] where \( \hat{e}_i = y_i - \mathbf{x}_i^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \) are the residuals from the OLS/ML fit; the factor \( n-p \) reflects the degrees of freedom lost by estimating the \( p \) elements of \( \mathbf{b} \).

The REML estimator of \( \sigma^2 \) is: \[ \sigma_{\text{REML}}^2 \equiv \frac{1}{n-p} \sum_{i=1}^n \hat{e}_i^2 = \frac{1}{n-p} (\mathbf{y} - \mathbf{X} \mathbf{b}_{\text{ML}})^T (\mathbf{y} - \mathbf{X} \mathbf{b}_{\text{ML}}), \] which coincides with the unbiased OLS estimator \( \sigma_{\text{OLS}}^2 \).
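The following sketch contrasts the REML and ML estimates of \( \sigma^2 \) for the illustrative mtcars fit; for a classical LM, the REML estimate is simply the unbiased \( \text{RSS}/(n-p) \) reported by lm().

```r
## REML vs ML estimates of sigma^2 for the classical LM:
## sigma2_REML = RSS/(n - p)  (the unbiased estimate reported by lm),
## sigma2_ML   = RSS/n.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit); p <- length(coef(fit))
rss <- sum(residuals(fit)^2)

c(REML = rss / (n - p), ML = rss / n, lm_sigma2 = sigma(fit)^2)
```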
Model Diagnostics

Diagnostic tools assess the validity of the model and its assumptions. Key diagnostics include residuals, leverage, and measures of influence.

Residuals

Various types of residuals are used to diagnose model fit; the R sketch after this list shows how each is obtained.
- **Raw Residuals**: \( \hat{e}_i \equiv y_i - \hat{y}_i \), where \( \hat{y}_i = \mathbf{x}_i^T \mathbf{b}_{\text{OLS}} \).
- **Standardized Residuals**: \( \hat{e}_i / \sigma_{\text{OLS}} \).
- **Internally Studentized Residuals**: \( \hat{e}_i / \left( \sigma_{\text{OLS}} \sqrt{1 - h_i} \right) \), where \( h_i \) is the leverage of the \( i \)-th observation (defined below).
- **Externally Studentized Residuals**: \( \hat{e}_i / \left( \hat{\sigma}_{(-i)} \sqrt{1 - h_i} \right) \), where \( \hat{\sigma}_{(-i)} \) is the estimate of \( \sigma \) obtained with the \( i \)-th observation excluded.
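In R, these residual types are available through residuals(), rstandard(), and rstudent(); a minimal sketch with the illustrative mtcars fit:

```r
## Residual diagnostics for a fitted LM (built-in mtcars data, for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)

raw   <- residuals(fit)               # e_hat_i = y_i - y_hat_i
stand <- residuals(fit) / sigma(fit)  # standardized residuals
r_int <- rstandard(fit)               # internally studentized residuals
r_ext <- rstudent(fit)                # externally studentized residuals

## Quick visual check of the constant-variance assumption
plot(fitted(fit), r_int, xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = 0, lty = 2)
```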

Leverage

Leverage quantifies how much each observation can influence its own fitted value. The hat matrix is \[ \mathbf{H} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T, \] and the leverage of the \( i \)-th observation is its diagonal element \( h_i \equiv H_{ii} \); leverages satisfy \( 0 \le h_i \le 1 \) and \( \sum_i h_i = p \).
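Leverages can be computed from the hat matrix directly or obtained with hatvalues(); a minimal sketch using the same illustrative fit (the \( 2p/n \) cutoff is a common rule of thumb, not a formal test):

```r
## Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)

H <- X %*% solve(crossprod(X)) %*% t(X)        # n x n hat matrix
h <- diag(H)                                   # leverages h_i

all.equal(unname(h), unname(hatvalues(fit)))   # matches R's hatvalues()
which(h > 2 * mean(h))                         # rule of thumb for high-leverage points
```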

Influential Observations

Cook’s distance helps identify influential observations: \[ D_i \equiv \frac{\hat{e}_i^2 \, h_i}{p \, \sigma_{\text{OLS}}^2 \, (1 - h_i)^2}, \] which combines the size of the residual \( \hat{e}_i \) with the leverage \( h_i \); large values of \( D_i \) flag observations whose removal would markedly change the estimated coefficients.
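Cook’s distances are available via cooks.distance(); the sketch below also recomputes them from the formula above as a check, using the same illustrative fit.

```r
## Cook's distances: influence of each observation on the fitted coefficients
fit <- lm(mpg ~ wt + hp, data = mtcars)

d <- cooks.distance(fit)
head(sort(d, decreasing = TRUE), 3)            # the most influential observations

## Manual check against D_i = e_i^2 * h_i / (p * sigma2 * (1 - h_i)^2)
e <- residuals(fit); h <- hatvalues(fit); p <- length(coef(fit))
d_manual <- e^2 * h / (p * sigma(fit)^2 * (1 - h)^2)
all.equal(d, d_manual)
```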
Inferential Methods

Inferential methods evaluate parameter estimates and model fit, including hypothesis testing and confidence intervals.

Hypothesis Testing

\( t \)-tests are used to assess individual regression parameters, while \( F \)-tests compare nested models (for example, testing several parameters simultaneously). The \( t \)-test statistic for the null hypothesis \( b_i = b_i^{(0)} \) is: \[ t_i \equiv \frac{\hat{b}_i - b_i^{(0)}}{\text{SE}(\hat{b}_i)}, \] where \( \hat{b}_i \) is the estimate of \( b_i \) and \( \text{SE}(\hat{b}_i) \) is its standard error; under the null hypothesis, \( t_i \) follows a \( t \)-distribution with \( n - p \) degrees of freedom.
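In R, the \( t \)-tests appear in summary(), and nested models can be compared with anova(); a minimal sketch in which the reduced model drops hp (an arbitrary choice for illustration):

```r
## t-tests for individual coefficients and an F-test comparing nested models
fit_full <- lm(mpg ~ wt + hp, data = mtcars)
fit_red  <- lm(mpg ~ wt, data = mtcars)   # reduced model under H0: b_hp = 0

summary(fit_full)$coefficients            # estimates, SEs, t-values, p-values
anova(fit_red, fit_full)                  # F-test for the dropped term
```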

Confidence Intervals

A \( 1 - \alpha \) confidence interval for \( b_i \) is calculated as: \[ \hat{b}_i \pm t_{1-\alpha/2,\, n-p} \, \text{SE}(\hat{b}_i), \] where \( t_{1-\alpha/2,\, n-p} \) is the corresponding quantile of the \( t \)-distribution with \( n - p \) degrees of freedom.
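Confidence intervals are returned by confint(); the sketch below also constructs them manually from the formula above, using the same illustrative fit.

```r
## Confidence intervals b_hat_i +/- t_{1 - alpha/2, n - p} * SE(b_hat_i)
fit <- lm(mpg ~ wt + hp, data = mtcars)
confint(fit, level = 0.95)

## Manual construction for comparison
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
tq  <- qt(0.975, df = df.residual(fit))
cbind(lower = est - tq * se, upper = est + tq * se)
```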

Model Selection and Reduction

Selecting the best model involves criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

Stepwise Selection

Stepwise methods include forward selection, backward elimination, and bidirectional (stepwise) selection, which combines the two.
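A minimal sketch of stepwise selection with step(), using AIC as the criterion; the candidate covariates (wt, hp, disp, qsec) are an arbitrary illustrative choice.

```r
## Backward elimination guided by AIC, using step() (built-in mtcars data)
fit_full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
fit_back <- step(fit_full, direction = "backward", trace = FALSE)

## Forward selection from the intercept-only model over the same candidate terms
fit_null <- lm(mpg ~ 1, data = mtcars)
fit_fwd  <- step(fit_null, scope = formula(fit_full),
                 direction = "forward", trace = FALSE)

formula(fit_back); formula(fit_fwd)       # selected mean structures
```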

Information Criteria

AIC and BIC are computed as follows: \[ \text{AIC} \equiv -2 \times \text{Log-Likelihood} + 2k, \] \[ \text{BIC} \equiv -2 \times \text{Log-Likelihood} + k \log(n), \] where \( k \) is the number of estimated parameters (including \( \sigma^2 \)) and \( n \) is the sample size. Smaller values indicate a better balance between goodness of fit and model complexity; when comparing models that differ in their mean structure, the criteria should be based on the ML rather than the REML log-likelihood.
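AIC() and BIC() compute these criteria directly for lm fits; the sketch below also recomputes them from logLik() as a check, with the same illustrative models.

```r
## AIC and BIC for competing models (smaller is better)
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(fit1, fit2)                           # -2*logLik + 2k
BIC(fit1, fit2)                           # -2*logLik + k*log(n)

## Manual check for fit2: k counts the p regression coefficients plus sigma^2
k <- length(coef(fit2)) + 1
c(AIC = -2 * as.numeric(logLik(fit2)) + 2 * k,
  BIC = -2 * as.numeric(logLik(fit2)) + k * log(nobs(fit2)))
```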