Cox Proportional Hazards Model Tutorial


The Cox Proportional Hazards (CPH) model is a powerful statistical tool commonly used in medical and survival analysis to investigate the relationship between predictor variables and the survival time of individuals. Developed by David Cox in 1972, the CPH model is a regression-based approach that assumes proportional hazards, meaning that the relative hazard rates between groups remain constant over time. This makes it suitable for analyzing data where the timing of events or outcomes is of interest.

Unlike traditional regression models, the CPH model does not require specifying the baseline hazard function, making it more flexible and applicable to a wide range of scenarios. It allows for the inclusion of both continuous and categorical predictor variables, enabling researchers to assess the impact of multiple factors on survival outcomes while controlling for potential confounding variables. The CPH model provides hazard ratios, which quantify the relative risk of an event occurring in one group compared to another, allowing for meaningful interpretation and comparison of the effects of different predictors on survival. Its versatility, robustness, and ability to handle censored data make the Cox Proportional Hazards model a popular choice in medical research and other fields where survival analysis is essential.

The CPH model is a regression model for the hazard rate. It assumes that the hazard function can be expressed as the product of two parts: a baseline hazard function, which depends only on time, and the exponential of a linear combination of the covariates, which does not depend on time. The model can handle both continuous and categorical predictors, making it versatile in various research settings.

Mathematically, the Cox Proportional Hazards model can be represented as follows:

\[ h(t, x) = h_0(t) \, \exp(b_1 x_1 + b_2 x_2 + \cdots + b_p x_p) \]

where:
  • h(t, x) is the hazard rate at time t for an individual with covariate values x1, x2, ..., xp.
  • h0(t) is the baseline hazard function, which represents the hazard rate when all covariates are equal to zero.
  • b1, b2, ..., bp are the regression coefficients corresponding to each covariate x1, x2, ..., xp.
  • exp() denotes the exponential function.

The model assumes that the hazard ratio remains constant over time, meaning that the proportional relationship between the covariates and the hazard rate holds throughout the study period. Each estimated regression coefficient quantifies the effect of its predictor on the log-hazard, so exponentiating a coefficient gives the hazard ratio for a one-unit increase in that predictor. The Cox Proportional Hazards model is implemented in statistical software such as R, making it accessible for researchers to analyze and interpret survival data.
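To make the formula concrete, here is a minimal sketch in base R. The baseline hazard `h0`, the coefficients `b`, and the two covariate profiles are hypothetical values chosen only for illustration; they do not come from any dataset. The sketch evaluates h(t, x) for two individuals at several time points and shows that their hazard ratio does not depend on t.

```r
# Hypothetical baseline hazard and coefficients, chosen only for illustration.
h0 <- function(t) 0.01 * t            # assumed baseline hazard h0(t)
b  <- c(0.03, -0.5)                   # assumed coefficients b1, b2

# h(t, x) = h0(t) * exp(b1 * x1 + b2 * x2)
hazard <- function(t, x) h0(t) * exp(sum(b * x))

x_a <- c(60, 1)                       # individual A: x1 = 60, x2 = 1
x_b <- c(50, 0)                       # individual B: x1 = 50, x2 = 0

times <- c(1, 5, 10)
ratio <- sapply(times, function(t) hazard(t, x_a) / hazard(t, x_b))

# The ratio is identical at every time point and equals
# exp(sum(b * (x_a - x_b))), because h0(t) cancels.
print(ratio)
print(exp(sum(b * (x_a - x_b))))
```

Because the baseline hazard cancels in the ratio, its exact form never has to be specified in order to compare individuals.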

Assumptions of the Cox Proportional Hazards Model

The Cox Proportional Hazards model is a widely used statistical model for analyzing survival data. It makes several assumptions to ensure the validity of the model’s estimates and interpretations. Below are five key assumptions of the Cox Proportional Hazards model, along with detailed explanations for each:

  1. Proportional Hazards Assumption: The primary assumption of the Cox model is that the hazards (instantaneous risk of an event) for any two individuals are proportional over time, meaning the ratio of their hazard functions remains constant. In other words, the effect of a covariate on the hazard is assumed to be constant over time. Violation of this assumption suggests that the hazard ratio changes over time, indicating a time-varying effect.

  2. Independence Assumption: The observations in the dataset must be independent of each other. This assumption implies that the occurrence of an event for one individual does not affect the occurrence of an event for any other individual. Independence is typically assumed for survival data, meaning the risk of an event for one individual should be unrelated to the risks of other individuals at any given time.

  3. Non-informative Censoring Assumption: Censoring occurs when the event of interest is not observed for an individual, for example because the study ends or the individual is lost to follow-up before the event occurs. The non-informative censoring assumption states that the censoring mechanism is unrelated to the survival time and the potential event occurrence. In other words, given the covariates in the model, being censored should carry no information about when the event would have occurred. Violation of this assumption can lead to biased estimates of the hazard ratios.

  4. Linearity Assumption: The relationship between the covariates and the log-hazard function is assumed to be linear. This means that the effect of each covariate on the hazard function is constant and additive in the log-scale. If the relationship is not linear, it may result in biased coefficient estimates and inaccurate inference.

  5. Absence of Multicollinearity: The covariates included in the model should be independent or minimally correlated with each other. Multicollinearity refers to a high degree of correlation between two or more covariates, which can make it challenging to interpret the effects of individual covariates accurately. Multicollinearity can lead to unstable and imprecise coefficient estimates.
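Some of these assumptions can be examined after fitting a model. The following is a rough sketch, assuming the `survival` package and its bundled `lung` dataset; the choice of `age` and `sex` as covariates is only for illustration. It uses `cox.zph()` for the proportional hazards assumption, a martingale-residual plot as an informal look at linearity, and a pairwise correlation as a first screen for multicollinearity.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Proportional hazards: test based on scaled Schoenfeld residuals.
# A small p-value for a covariate hints that its effect changes over time.
ph_check <- cox.zph(fit)
print(ph_check)

# Linearity: martingale residuals from a model without age, plotted against
# age. A clearly curved smooth would suggest a non-linear effect of age.
fit_sex <- coxph(Surv(time, status) ~ sex, data = lung)
mart <- residuals(fit_sex, type = "martingale")
plot(lung$age, mart, xlab = "age", ylab = "Martingale residual")
lines(lowess(lung$age, mart))

# Multicollinearity: a quick look at the pairwise correlation.
print(cor(lung$age, lung$sex))
```

Independence and non-informative censoring cannot be verified from the data in this way; they usually have to be argued from the study design.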


Obtaining a Hazard Ratio

Step 1: Fit the Cox Regression Model

Let's consider a Cox Regression model with three covariates: age, gender, and treatment (treatment vs. control). The model can be expressed as:

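Treatment Arm:

\[ h(t, x_T) = h_0(t) \, \exp(b_1 x_{T1} + b_2 x_{T2} + b_3 x_{T3}) \]

Control Arm:

\[ h(t, x_C) = h_0(t) \, \exp(b_1 x_{C1} + b_2 x_{C2} + b_3 x_{C3}) \]

  • $h(t, x_T)$ represents the hazard function at time t for the treatment arm with covariates $x_T = (x_{T1}, x_{T2}, x_{T3})$.
  • $h(t, x_C)$ represents the hazard function at time t for the control arm with covariates $x_C = (x_{C1}, x_{C2}, x_{C3})$.
  • $h_0(t)$ is the baseline hazard function at time t.
  • $b_1, b_2, b_3$ are the estimated coefficients for age, gender, and treatment, respectively.

Step 2: Derive the Hazard Ratio

The hazard ratio (HR) can be obtained by taking the ratio of the hazard function for the treatment arm over the hazard function for the control arm:

\[ HR = \frac{h(t, x_T)}{h(t, x_C)} \]

Substituting the expressions for the hazard functions, we have:

\[ HR = \frac{h_0(t) \, \exp(b_1 x_{T1} + b_2 x_{T2} + b_3 x_{T3})}{h_0(t) \, \exp(b_1 x_{C1} + b_2 x_{C2} + b_3 x_{C3})} = \exp\big(b_1 (x_{T1} - x_{C1}) + b_2 (x_{T2} - x_{C2}) + b_3 (x_{T3} - x_{C3})\big) \]

The baseline hazard cancels. Therefore, the Hazard Ratio (HR) is given by the exponential of the differences in covariate values between the treatment and control arms, multiplied by their respective coefficient estimates.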
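As a small numerical illustration of this derivation, the sketch below computes a hazard ratio in base R. The coefficient values and covariate profiles are hypothetical, and the coding of treatment as 1 (treatment) versus 0 (control) is an assumption made for the example.

```r
# Hypothetical coefficient estimates for age, gender, and treatment.
b <- c(age = 0.02, gender = -0.40, treatment = -0.70)

# Hypothetical covariate values for one patient in each arm
# (treatment assumed to be coded 1 = treatment, 0 = control).
x_treat   <- c(age = 60, gender = 1, treatment = 1)
x_control <- c(age = 60, gender = 1, treatment = 0)

# HR = exp(sum over covariates of coefficient * difference in values);
# the baseline hazard has already cancelled out of the ratio.
hr <- exp(sum(b * (x_treat - x_control)))
print(hr)

# With age and gender equal in both arms, only the treatment term remains,
# so the HR reduces to exp(treatment coefficient).
print(unname(exp(b["treatment"])))
```

When the two arms are compared at the same age and gender, the hazard ratio is simply the exponential of the treatment coefficient, which is why exp(coef) is read as an adjusted hazard ratio.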

Step 3: Interpretation

The HR quantifies the relative hazard between the two groups. If the HR is greater than 1, it suggests a higher hazard (higher risk of the event) in the treatment group compared to the control group, while an HR less than 1 indicates a lower hazard in the treatment group.

In summary:
  • HR = 1: It indicates that there is no difference in the hazard rate between the two groups being compared. The event of interest occurs at the same rate in both groups, suggesting no association between the covariate being compared and the outcome.

  • HR < 1: It implies that the treatment or exposure in the first group is associated with a lower hazard or risk compared to the reference group. This suggests a potentially beneficial or protective effect of the treatment or exposure.

  • HR > 1: It signifies that the treatment or exposure in the first group is associated with a higher hazard or risk compared to the reference group. This indicates an increased risk or adverse effect associated with the treatment or exposure.

Therefore, the HR serves as a valuable tool for quantifying and interpreting the relative risk or hazard in Cox regression analysis.


Cox Proportional Hazards Model in R

Compute the Cox model:

We will fit the Cox regression using the following covariates: age, sex, ph.ecog, and wt.loss based on the “lung” dataset in R.

We start by computing univariate Cox analyses for all these variables; then we will fit multivariate Cox analyses using two variables to describe how the factors jointly impact survival.

Univariate Cox analyses can be computed as follows:

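library(survival)   # provides coxph(), Surv(), and the lung dataset
fit.cox <- coxph(Surv(time, status) ~ sex, data = lung)
summary(fit.cox)    # prints the output shown below

Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)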
  n= 228, number of events= 165 
       coef exp(coef) se(coef)      z Pr(>|z|)   
sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    exp(coef) exp(-coef) lower .95 upper .95
sex     0.588      1.701     0.4237     0.816
Concordance= 0.579 (se = 0.022)
Rsquare= 0.046 (max possible= 0.999)
Likelihood ratio test= 10.63  on 1 df, p=0.001111
Wald test            = 10.09  on 1 df, p=0.001491
Score (logrank) test = 10.33  on 1 df, p=0.001312
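The same univariate fit can be repeated for the remaining covariates. The sketch below is one possible way to do so, assuming the `survival` package; the helper names `uni_fits`, `uni_hr`, and `fit_multi` are hypothetical, and entering all four covariates in the multivariable model is an illustrative choice rather than the analysis plan of this tutorial.

```r
library(survival)

covariates <- c("age", "sex", "ph.ecog", "wt.loss")

# One univariate Cox model per covariate.
uni_fits <- lapply(covariates, function(v) {
  coxph(as.formula(paste("Surv(time, status) ~", v)), data = lung)
})
names(uni_fits) <- covariates

# Hazard ratio (exp(coef)) from each univariate fit.
uni_hr <- sapply(uni_fits, function(f) unname(exp(coef(f))))
print(round(uni_hr, 3))

# A multivariable model with the covariates entered together
# (rows with missing covariate values are dropped by default).
fit_multi <- coxph(Surv(time, status) ~ age + sex + ph.ecog + wt.loss,
                   data = lung)
print(summary(fit_multi))
```

Comparing the univariate and multivariable hazard ratios for the same covariate gives a sense of how much its estimated effect changes after adjustment.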

When examining the Cox regression results, several key aspects can be considered:

Statistical significance: The Wald statistic, represented by the column labeled “z,” determines whether the regression coefficient (β) of a variable is statistically significant. It is calculated as the ratio of the coefficient to its standard error (z = coef/se(coef)). In this analysis, the coefficient for “sex” is statistically significant (z = -3.176, p = 0.00149), indicating an association between sex and the outcome.
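The Wald statistic and its p-value can be reproduced by hand from the printed coefficient and standard error. This is a minimal sketch in base R; the two input numbers are copied from the summary output above, and the variable names are only illustrative.

```r
coef_sex <- -0.5310   # coef, copied from the summary output
se_sex   <-  0.1672   # se(coef), copied from the summary output

z <- coef_sex / se_sex      # Wald z statistic
p <- 2 * pnorm(-abs(z))     # two-sided p-value from the standard normal

print(round(z, 3))
print(signif(p, 3))
```

Up to rounding of the inputs, the results agree with the “z” and “Pr(>|z|)” columns of the output.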

Regression coefficients: The sign of the regression coefficients (coef) provides valuable information. A positive coefficient suggests a higher hazard (greater risk of death) and a poorer prognosis for individuals with higher values of the corresponding variable. In this study, “sex” is represented as a numeric vector, where 1 represents males and 2 represents females. The Cox model’s summary indicates that females (sex = 2) have a beta coefficient of -0.53, indicating a lower risk of death (higher survival rates) compared to males.

Hazard ratios: Hazard ratios, obtained by exponentiating the coefficients (exp(coef) = exp(-0.53) = 0.59), quantify the effect size of covariates. For instance, being female (sex = 2) multiplies the hazard by a factor of 0.59, which corresponds to a 41% lower hazard of death than for males at any given time. Therefore, being female in this dataset is linked to a favorable prognosis.

Confidence intervals of hazard ratios: The summary output includes the lower and upper bounds of the 95% confidence interval for the hazard ratio (exp(coef)). In this case, the lower bound is 0.4237 and the upper bound is 0.816. The interval gives a range of plausible values for the hazard ratio, accounting for the uncertainty in the estimation; because it does not contain 1, it is consistent with the significant Wald test.
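The hazard ratio and its confidence interval can also be extracted from the fitted object instead of being read off the printout. A short sketch, assuming the `survival` package and the same univariate model; `hr` and `hr_ci` are illustrative names.

```r
library(survival)

fit.cox <- coxph(Surv(time, status) ~ sex, data = lung)

hr    <- exp(coef(fit.cox))      # hazard ratio
hr_ci <- exp(confint(fit.cox))   # Wald-type 95% interval, back-transformed

print(round(cbind(HR = hr, hr_ci), 4))

# Percentage change in the hazard implied by the hazard ratio
# (negative values mean a lower hazard).
print(round(100 * (hr - 1), 1))
```

The interval is computed on the coefficient scale and then exponentiated, which is why it is not symmetric around the hazard ratio.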

Global statistical significance of the model: The output also provides p-values for three tests assessing the overall significance of the model: the likelihood-ratio test, the Wald test, and the score (log-rank) test. These tests are asymptotically equivalent, so they give similar results for large sample sizes (N). In smaller samples they may differ, and the likelihood-ratio test is generally preferred because it behaves better there.
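The three global tests are stored in the summary object, so they can be collected into one table for reporting. A brief sketch, assuming the `survival` package; the name `global_tests` is hypothetical.

```r
library(survival)

fit.cox <- coxph(Surv(time, status) ~ sex, data = lung)
s <- summary(fit.cox)

# Each element holds the test statistic, degrees of freedom, and p-value.
global_tests <- rbind(
  likelihood_ratio = s$logtest,
  wald             = s$waldtest,
  score_logrank    = s$sctest
)
print(global_tests)
```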

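We can then plot the survival curves of the lung data set by sex with the following R code, which assumes the survival package is loaded (printing output displays the plot):

library(GGally)   # provides ggsurv()
output <- ggsurv(survfit(Surv(time, status) ~ sex, data = lung))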
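Survival curves can also be drawn from the fitted Cox model itself rather than from the raw data. The sketch below is one way to do this with the `survival` package alone; the data frame `new_profiles` and the plotting colors are illustrative choices.

```r
library(survival)

fit.cox <- coxph(Surv(time, status) ~ sex, data = lung)

# Predicted survival for one male (sex = 1) and one female (sex = 2) profile.
new_profiles <- data.frame(sex = c(1, 2))
pred <- survfit(fit.cox, newdata = new_profiles)

plot(pred, col = c("blue", "red"),
     xlab = "Days", ylab = "Survival probability")
legend("topright", legend = c("sex = 1 (male)", "sex = 2 (female)"),
       col = c("blue", "red"), lty = 1)
```

Unlike curves estimated separately for each group, these curves are constrained by the proportional hazards assumption, so comparing the two kinds of plot is an informal check of that assumption.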