Survival Analysis Tutorial

Survival Analysis Basics with the ‘Ovarian’ Dataset
Welcome to this tutorial on survival analysis! In this tutorial, we will explore the fundamental concepts of survival analysis using the ‘ovarian’ dataset. This dataset contains information about the survival times of 26 women with ovarian cancer. Let’s dive in and explore the different aspects of survival analysis.
Loading and Examining the Dataset
To begin, we need to load the ‘ovarian’ dataset into our statistical environment. You can obtain the dataset from the following link: ovarian dataset. In R, the dataset also ships with the ‘survival’ package, so it is available as soon as that package is loaded.
Once you have the dataset, load it into your statistical software. Take a moment to examine the structure of the dataset and familiarize yourself with the variables it contains. The variables used in this tutorial are ‘futime’ (follow-up time or survival time, in days), ‘fustat’ (status indicator, where 1 denotes death and 0 denotes a censored observation), and ‘rx’ (treatment group, coded 1 or 2).
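As a minimal sketch of this step in R, assuming the ‘survival’ package is installed, the dataset can be inspected as follows:

```r
library(survival)   # attaching the package makes the 'ovarian' data frame available

# Structure of the dataset: variable names, types, and the first few values
str(ovarian)

# First rows of the data
head(ovarian)

# Number of censored (0) and death (1) observations
table(ovarian$fustat)
```

The call to table() is a quick check of how many observations are censored, which matters for everything that follows.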
Estimating Survival Curves
One of the primary goals of survival analysis is to estimate the survival probability over time. We can visualize this using a survival curve, also known as the Kaplan-Meier curve. The Kaplan-Meier estimator provides a non-parametric estimate of the survival function.
The formula for the Kaplan-Meier estimate is as follows:
$$S(t) = \prod_{i:\, T_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$
where:
  •  $S(t)$ is the estimated survival probability at time $t$
  •  $T_i$ is the i-th distinct observed event time
  •  $d_i$ is the number of deaths at time $T_i$
  •  $n_i$ is the number of individuals at risk just before time $T_i$
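To make the product formula concrete, here is a small base-R sketch that computes the Kaplan-Meier estimate by hand. The six observations are hypothetical toy data invented for illustration, not values from the ‘ovarian’ dataset.

```r
# Hypothetical toy data: follow-up times and status (1 = death, 0 = censored)
time   <- c(5, 8, 8, 12, 15, 20)
status <- c(1, 1, 0, 1, 0, 1)

# Distinct death times T_i
event_times <- sort(unique(time[status == 1]))

# d_i: deaths at each T_i; n_i: individuals still at risk just before T_i
d <- sapply(event_times, function(t) sum(time == t & status == 1))
n <- sapply(event_times, function(t) sum(time >= t))

# Kaplan-Meier estimate: running product of (1 - d_i / n_i)
km <- cumprod(1 - d / n)

km_table <- data.frame(time = event_times, n_risk = n, n_event = d, survival = km)
print(km_table)
```

The censored observation at time 8 contributes to the risk set at time 8 but not to the death count, which is how censoring enters the estimate.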
In R, you can calculate the survival curve for the ‘ovarian’ dataset using the survfit() function from the ‘survival’ package. Here’s an example of the R code:
library(survival)

ovarian_surv <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)
ovarian_surv

Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)

      n events median 0.95LCL 0.95UCL
rx=1 13      7    638     268      NA
rx=2 13      5     NA     475      NA
Printing the object returned by survfit() summarizes the survival analysis results for the two treatment groups (rx=1 and rx=2) in the ovarian dataset.
Here’s the breakdown of the output:
  • Call: This indicates the function call used to generate the survival analysis results.
  • n: This represents the number of individuals in each treatment group.
  • events: It shows the number of events (e.g., deaths, failures) observed in each treatment group.
  • median: This displays the median survival time for each treatment group. In this case, the median survival time for rx=1 is 638 days, while for rx=2 the median is not available (NA) because the estimated survival curve for that group never drops to 0.5 during follow-up.
  • 0.95LCL: This is the lower limit of the 95% confidence interval for the median survival time. For rx=1 the lower limit is 268 days, and for rx=2 it is 475 days.
  • 0.95UCL: This is the upper limit of the 95% confidence interval for the median survival time. In this output, the upper limit is not available (NA) for either treatment group, because the upper confidence band of the survival curve never drops to 0.5.
Overall, this output provides summary statistics for the survival times in each treatment group, including the number of events, median survival time, and confidence intervals.
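Beyond the printed summary, the fitted object can be queried at specific time points. The sketch below assumes ‘futime’ is measured in days, so 365 and 730 correspond roughly to one and two years. The object names km_fit and km_summary are illustrative.

```r
library(survival)

km_fit <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)

# Estimated survival probabilities, with confidence limits, at about 1 and 2 years
km_summary <- summary(km_fit, times = c(365, 730))
print(km_summary)

# Median survival time and its confidence limits for each treatment arm
quantile(km_fit, probs = 0.5)
```

Reporting survival at fixed time points is a useful complement when the median is not estimable, as is the case for rx=2 here.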
Assumptions of Kaplan-Meier Survival Analysis
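Kaplan-Meier survival analysis relies on several key assumptions, which are important to consider when interpreting the results. The first three assumptions below apply to the Kaplan-Meier estimator itself; the remaining ones apply to regression models that are often fitted alongside it. These assumptions are: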
Assumption 1: Independent Censoring

Individuals are censored independently of their survival times, meaning that the probability of censoring at a given time does not depend on the survival time itself or the occurrence of events. This assumption ensures that censoring does not introduce bias into the analysis.

Assumption 2: Non-Informative Censoring

Censoring is non-informative, meaning that the censoring mechanism is not related to the probability of survival. In other words, the likelihood of being censored does not provide any information about the survival outcome. This assumption ensures that censoring does not bias the estimated survival probabilities.

Assumption 3: No Competing Risks

The Kaplan-Meier estimator assumes that there is no competing event that precludes the occurrence of the event of interest. In other words, individuals in the study are only at risk of experiencing the specific event under investigation, and other events do not affect the probability of the event of interest.

Assumption 4: Constant Hazard Ratios

For the Cox proportional hazards model, the assumption is that the hazard ratios comparing different groups remain constant over time. This is the proportional hazards assumption: the hazards in different groups are proportional at all time points.

Assumption 5: No Violation of Linearity Assumption

When using covariates in the Cox proportional hazards model, the assumption is that the relationship between the covariates and the hazard is linear on the log scale. Violation of this assumption may result in biased estimates of the hazard ratios.

Assumption 6: No Perfect Separation

In logistic regression models, the assumption is that there is no perfect separation of events, where a combination of predictor variables perfectly predicts the occurrence or non-occurrence of the event. Perfect separation can lead to unreliable estimates and difficulties in model convergence.

It is essential to assess these assumptions before interpreting the results of Kaplan-Meier survival analysis and Cox proportional hazards models to ensure the validity and reliability of the findings.
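As one way to assess the proportional hazards assumption in practice, the ‘survival’ package provides cox.zph(), which tests for a time trend in the scaled Schoenfeld residuals. The sketch below fits a Cox model with treatment arm as the only covariate; that model choice is an assumption made purely for illustration.

```r
library(survival)

# Illustrative Cox model: treatment arm as the only covariate
cox_fit <- coxph(Surv(futime, fustat) ~ rx, data = ovarian)

# Test of the proportional hazards assumption based on Schoenfeld residuals;
# a small p-value would suggest the hazard ratio changes over time
ph_check <- cox.zph(cox_fit)
print(ph_check)

# Residuals against time; a roughly flat trend is consistent with proportional hazards
plot(ph_check)
```

With only 26 observations this test has little power, so a non-significant result should be read with caution.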

Comparing Survival Curves
Survival analysis allows us to compare survival curves among different groups. This is particularly useful when studying the effect of certain factors on survival. To compare survival curves, we can use statistical tests such as the log-rank test.
The log-rank test compares the observed and expected number of events in each group at each event time and assesses whether there is a statistically significant difference in survival between the groups. Under the null hypothesis of no difference, the test statistic approximately follows a chi-square distribution with degrees of freedom equal to the number of groups minus one.
The p-value for the log-rank test is obtained by comparing the observed test statistic to the chi-square distribution with the appropriate degrees of freedom.
In R, you can perform the log-rank test using the survdiff() function. Here’s an example of the R code:
logrank_test <- survdiff(Surv(futime, fustat) ~ rx, data = ovarian)
logrank_test

Call:
survdiff(formula = Surv(futime, fustat) ~ rx, data = ovarian)

      N Observed Expected (O-E)^2/E (O-E)^2/V
rx=1 13        7     5.23     0.596      1.06
rx=2 13        5     6.77     0.461      1.06

 Chisq= 1.1  on 1 degrees of freedom, p= 0.3 

The log-rank test is used in survival analysis to compare the survival distributions of two or more groups. It assesses whether the survival experiences of the groups differ significantly. The p-value for the log-rank test is obtained through the following steps:

Step 1: Compute the observed and expected number of events in each group.
Step 2: Calculate the variance of the observed minus expected events for each group.
Step 3: Compute the weighted sum of the variance.
Step 4: Obtain the test statistic.
Step 5: Obtain the degrees of freedom.
Step 6: Calculate the p-value.

The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one obtained, assuming the null hypothesis is true (i.e., the survival distributions are equal for all groups).
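For two groups, one standard way to write these steps is sketched below. The notation is an assumption introduced here: $t_j$ are the distinct event times, $n_{1j}$ and $n_{2j}$ are the numbers at risk in each group just before $t_j$ with $n_j = n_{1j} + n_{2j}$, and $d_{1j}$ and $d_j$ are the deaths at $t_j$ in group 1 and overall. Every event time receives weight 1, which is what distinguishes the log-rank test from weighted variants.

```latex
\begin{align*}
O_1 &= \sum_j d_{1j}, \qquad E_1 = \sum_j d_j \,\frac{n_{1j}}{n_j}
  && \text{(Step 1: observed and expected events)} \\
V_j &= \frac{n_{1j}\, n_{2j}\, d_j\, (n_j - d_j)}{n_j^{2}\,(n_j - 1)}
  && \text{(Step 2: variance contribution at } t_j\text{)} \\
V &= \sum_j V_j
  && \text{(Step 3: sum over event times)} \\
\chi^2 &= \frac{(O_1 - E_1)^2}{V}
  && \text{(Step 4: test statistic)} \\
\mathrm{df} &= k - 1 = 1 \quad \text{for } k = 2 \text{ groups}
  && \text{(Step 5: degrees of freedom)} \\
p &= P\!\left(\chi^2_{1} \ge \chi^2_{\text{obs}}\right)
  && \text{(Step 6: p-value)}
\end{align*}
```

The quantity $(O_1 - E_1)^2 / V$ corresponds to the (O-E)^2/V column in the survdiff() output.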

We then compare the p-value with a pre-defined significance level (e.g., 0.05). If the p-value is less than the significance level, we reject the null hypothesis and conclude that survival differs significantly between the groups being compared.

Based on the R output from the logrank_test call, the p-value is 0.3. Since this exceeds 0.05, we fail to reject the null hypothesis: the data do not provide evidence of a significant difference in survival between the two treatment arms.
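To see where the reported chi-square and p-value come from, the following sketch recomputes the two-group log-rank statistic by hand and compares it with survdiff(). It assumes the usual hypergeometric variance at each event time. The variable names (O1, E1, V, and so on) are illustrative.

```r
library(survival)

# Distinct death times across both treatment arms
event_times <- sort(unique(ovarian$futime[ovarian$fustat == 1]))

O1 <- 0; E1 <- 0; V <- 0
for (t in event_times) {
  at_risk <- ovarian$futime >= t
  died    <- ovarian$futime == t & ovarian$fustat == 1
  n  <- sum(at_risk)                      # at risk overall just before t
  n1 <- sum(at_risk & ovarian$rx == 1)    # at risk in arm rx=1
  d  <- sum(died)                         # deaths at t overall
  d1 <- sum(died & ovarian$rx == 1)       # deaths at t in arm rx=1
  O1 <- O1 + d1
  E1 <- E1 + d * n1 / n
  # Hypergeometric variance contribution (zero when only one subject is at risk)
  if (n > 1) V <- V + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
}

chisq   <- (O1 - E1)^2 / V
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

# Compare the hand computation with survdiff()
logrank_test <- survdiff(Surv(futime, fustat) ~ rx, data = ovarian)
c(by_hand = chisq, survdiff = logrank_test$chisq, p_value = p_value)
```

The observed and expected counts for rx=1 computed this way should line up with the first row of the survdiff() table.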

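By using the ggsurvplot() function from the ‘survminer’ package, we can visualize the Kaplan-Meier curves of the two treatment arms.
library(survival)
library(survminer)

# Create the survfit object
fit <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)

# Plot the Kaplan-Meier curves for each treatment arm
ggsurvplot(fit,
           data = ovarian,
           pval = TRUE,                  # show the log-rank p-value on the plot
           conf.int = FALSE,             # omit confidence bands
           risk.table = TRUE,            # add the number-at-risk table
           risk.table.col = "strata",    # color the risk table by treatment arm
           linetype = "strata",          # use a different line type per arm
           surv.median.line = "hv",      # mark the median survival time
           ggtheme = theme_bw(),
           palette = c("red", "blue"))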