Survival Analysis Tutorial
Survival Analysis Basics with the ‘Ovarian’ Dataset
Welcome to this tutorial on survival analysis! In this tutorial, we will explore the fundamental concepts of survival analysis using the ‘ovarian’ dataset. This dataset contains information about the survival times of 26 women with ovarian cancer. Let’s dive in and explore the different aspects of survival analysis.
Loading and Examining the Dataset
To begin, we need to load the ‘ovarian’ dataset into our statistical environment. You can obtain the dataset from the following link: ovarian dataset.
Once you have the dataset, load it into your statistical software. Take a moment to examine the structure of the dataset and familiarize yourself with the variables it contains. The variables include ‘futime’ (follow-up time or survival time), ‘fustat’ (status indicator, where 1 denotes death and 0 denotes censored observations), and ‘rx’ (treatment group), which is used to define the groups in the examples that follow.
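As a minimal sketch of this step in R, the snippet below assumes you are using the copy of ‘ovarian’ that ships with the ‘survival’ package rather than a downloaded file. It prints the structure of the data and counts deaths and censored observations in each treatment group.

```r
library(survival)

# Structure of the data: one row per patient
str(ovarian)
head(ovarian)

# Cross-tabulate the status indicator (1 = death, 0 = censored)
# against the treatment group
status_by_rx <- table(status = ovarian$fustat, treatment = ovarian$rx)
status_by_rx
```

If you loaded the data from a file instead, replace `ovarian` with the name of your own data frame.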
Estimating Survival Curves
One of the primary goals of survival analysis is to estimate the survival probability over time. We can visualize this using a survival curve, also known as the Kaplan-Meier curve. The Kaplan-Meier estimator provides a non-parametric estimate of the survival function.
The formula for the Kaplan-Meier estimate is as follows:
S(t) = \prod_{i : T_{i} \le t} \left(1 - \frac{d_{i}}{n_{i}}\right)
Where:
S(t) is the estimated survival probability at time t
T_{i} is the i-th distinct time at which at least one death is observed
d_{i} is the number of deaths at time T_{i}
n_{i} is the number of individuals at risk just before time T_{i}
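As a rough check on these definitions, the estimate can be built by hand as a running product of (1 - d_i / n_i) over the event times and compared with survfit(). The six observations below are hypothetical toy values made up for illustration; they are not part of the ‘ovarian’ dataset.

```r
library(survival)

# Hypothetical toy data (illustration only): follow-up time and status
# (1 = death, 0 = censored)
time   <- c(2, 3, 3, 5, 8, 9)
status <- c(1, 1, 0, 1, 0, 1)

# Distinct times at which at least one death occurs
event_times <- sort(unique(time[status == 1]))

# d_i: deaths at each event time; n_i: individuals still at risk just before it
d <- sapply(event_times, function(tt) sum(time == tt & status == 1))
n <- sapply(event_times, function(tt) sum(time >= tt))

# Kaplan-Meier estimate: cumulative product of (1 - d_i / n_i)
km_manual <- cumprod(1 - d / n)

# Same quantity from survfit(), evaluated at the event times
km_fit <- summary(survfit(Surv(time, status) ~ 1))

data.frame(time = event_times, at_risk = n, deaths = d,
           manual = km_manual, survfit = km_fit$surv)
```

The two survival columns should agree, which is a quick way to confirm how censored observations leave the risk set without counting as deaths.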
In R, you can calculate the survival curve for the ‘ovarian’ dataset using the survfit() function from the ‘survival’ package. Here’s an example of the R code:
library(survival)
ovarian_surv <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)
ovarian_surv
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)

      n events median 0.95LCL 0.95UCL
rx=1 13      7    638     268      NA
rx=2 13      5     NA     475      NA
The output from survfit() summarizes the fitted survival curves for the two treatment groups (rx=1 and rx=2) in the ovarian dataset.
Here’s the breakdown of the output:
Call: This indicates the function call used to generate the survival analysis results.
n: This represents the number of individuals in each treatment group.
events: It shows the number of events (e.g., deaths, failures) observed in each treatment group.
median: This displays the median survival time for each treatment group, that is, the time at which the estimated survival probability drops to 0.5. In this case, the median survival time for rx=1 is 638 time units, while for rx=2 it is not available (NA) because the estimated survival probability in that group never drops to 0.5 during follow-up.
0.95LCL: This is the lower limit of the 95% confidence interval for the median survival time. For rx=1, the lower confidence limit is 268; for rx=2, it is 475, even though the median itself cannot be estimated.
0.95UCL: This is the upper limit of the 95% confidence interval for the median survival time. In this output, the upper confidence limit is not available (NA) for either treatment group, which means the data do not allow an upper bound on the median to be estimated.
Overall, this output provides summary statistics for the survival times in each treatment group, including the number of events, median survival time, and confidence intervals.
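To see where the NA values come from, the sketch below extracts the medians with quantile() and then looks at the lowest survival probability each group reaches. The object names `med` and `lowest` are chosen here for illustration.

```r
library(survival)

ovarian_surv <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)

# Median (50th percentile) survival time per group, with confidence limits
med <- quantile(ovarian_surv, probs = 0.5)
med$quantile   # point estimates
med$lower      # lower confidence limits
med$upper      # upper confidence limits

# Lowest estimated survival probability reached in each group;
# a median can only be reported if the curve falls to 0.5
s <- summary(ovarian_surv)
lowest <- tapply(s$surv, s$strata, min)
lowest
```

A group whose curve stays above 0.5 for the whole follow-up period has no estimable median, which is why NA appears in the summary.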
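Assumptions of Kaplan-Meier Survival Analysis
Kaplan-Meier survival analysis relies on several key assumptions, which are important to consider when interpreting the results. The first three below concern the Kaplan-Meier estimator itself; the remaining ones concern regression models that are often used alongside it. These assumptions are: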
Assumption 1: Independent Censoring
Individuals are censored independently of their survival times, meaning that the probability of censoring at a given time does not depend on the survival time itself or the occurrence of events. This assumption ensures that censoring does not introduce bias into the analysis.
Assumption 2: Non-Informative Censoring
Censoring is non-informative, meaning that the censoring mechanism is not related to the probability of survival. In other words, the likelihood of being censored does not provide any information about the survival outcome. This assumption ensures that censoring does not bias the estimated survival probabilities.
Assumption 3: No Competing Risks
The Kaplan-Meier estimator assumes that there is no competing event that precludes the occurrence of the event of interest. In other words, individuals in the study are only at risk of experiencing the specific event under investigation, and other events do not affect the probability of the event of interest.
Assumption 4: Constant Hazard Ratios
For the Cox proportional hazards model, the assumption is that the hazard ratios comparing different groups remain constant over time. This is the proportional hazards assumption: the hazards in different groups are proportional at all time points.
Assumption 5: No Violation of Linearity Assumption
When using continuous covariates in the Cox proportional hazards model, the assumption is that the log hazard is a linear function of the covariates. Violation of this assumption may result in biased estimates of the hazard ratios.
Assumption 6: No Perfect Separation
In logistic regression models, the assumption is that there is no perfect separation of events, where a combination of predictor variables perfectly predicts the occurrence or nonoccurrence of the event. Perfect separation can lead to unreliable estimates and difficulties in model convergence.
It is essential to assess these assumptions before interpreting the results of Kaplan-Meier survival analysis and Cox proportional hazards models to ensure the validity and reliability of the findings.
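As one possible way to assess the proportional hazards assumption described above, the sketch below fits a Cox model with treatment group as the only covariate and applies cox.zph() from the ‘survival’ package, which tests for a time trend in the scaled Schoenfeld residuals. The model is an assumption made for illustration; the tutorial itself does not fit one.

```r
library(survival)

# Cox model with treatment group as the only covariate (illustrative choice)
cox_fit <- coxph(Surv(futime, fustat) ~ rx, data = ovarian)

# Test of the proportional hazards assumption;
# a small p-value would suggest the hazard ratio changes over time
ph_check <- cox.zph(cox_fit)
ph_check
```

With only 26 patients this test has little power, so a large p-value should be read as an absence of evidence against proportionality rather than as proof that it holds.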
Comparing Survival Curves
Survival analysis allows us to compare survival curves among different groups. This is particularly useful when studying the effect of certain factors on survival. To compare survival curves, we can use statistical tests such as the log-rank test.
The log-rank test compares the observed and expected number of events in each group at each event time and assesses whether there is a statistically significant difference in survival between the groups. Under the null hypothesis of no difference, the test statistic approximately follows a chi-square distribution.
The p-value for the log-rank test is obtained by comparing the observed test statistic to the chi-square distribution with degrees of freedom equal to the number of groups minus one.
In R, you can perform the log-rank test using the survdiff() function. Here’s an example of the R code:
logrank_test <- survdiff(Surv(futime, fustat) ~ rx, data = ovarian)
logrank_test
Call:
survdiff(formula = Surv(futime, fustat) ~ rx, data = ovarian)

      N Observed Expected (O-E)^2/E (O-E)^2/V
rx=1 13        7     5.23     0.596      1.06
rx=2 13        5     6.77     0.461      1.06

 Chisq= 1.1  on 1 degrees of freedom, p= 0.3
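The log-rank test is a statistical test used in survival analysis to compare the survival distributions between two or more groups. It assesses whether there are significant differences in the survival experiences of the groups being compared. For two groups j = 1, 2, with d_{ji} deaths and n_{ji} individuals at risk in group j at event time T_{i} (and d_{i}, n_{i} the totals over both groups), the p-value for the log-rank test is obtained through the following steps:
Step 1: Compute the observed and expected number of events in each group:
O_{j} = \sum_{i} d_{ji}, \qquad E_{j} = \sum_{i} \frac{n_{ji} d_{i}}{n_{i}}
Step 2: Calculate the variance of the observed minus expected events at each event time:
V_{i} = \frac{n_{1i} n_{2i} d_{i} (n_{i} - d_{i})}{n_{i}^{2} (n_{i} - 1)}
Step 3: Compute the sum of the variances over all event times:
V = \sum_{i} V_{i}
Step 4: Obtain the test statistic:
\chi^{2} = \frac{(O_{1} - E_{1})^{2}}{V}
Step 5: Obtain the degrees of freedom:
df = (number of groups) - 1, which is 1 when two groups are compared
Step 6: Calculate the p-value:
p = P(\chi^{2}_{df} \ge \chi^{2})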
The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one obtained, assuming the null hypothesis is true (i.e., the survival distributions are equal for all groups).
The p-value is then compared with a predefined significance level (e.g., 0.05). If the p-value is less than the significance level, we reject the null hypothesis and conclude that there are significant differences in survival experiences between the groups being compared.
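The steps above can be sketched by hand for the two treatment groups and checked against survdiff(). This is a minimal illustration for the two-group case only, using the standard hypergeometric variance at each event time; the variable names are chosen here for readability.

```r
library(survival)

# Distinct times at which at least one death occurs (both groups pooled)
event_times <- sort(unique(ovarian$futime[ovarian$fustat == 1]))

# Observed deaths, expected deaths, and variance for group rx = 1
# at each event time
pieces <- sapply(event_times, function(tt) {
  at_risk <- ovarian$futime >= tt
  died    <- ovarian$futime == tt & ovarian$fustat == 1
  n  <- sum(at_risk)                      # at risk in both groups
  n1 <- sum(at_risk & ovarian$rx == 1)    # at risk in group 1
  d  <- sum(died)                         # deaths in both groups
  d1 <- sum(died & ovarian$rx == 1)       # deaths in group 1
  e1 <- d * n1 / n                        # expected deaths in group 1
  v  <- if (n > 1) d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1) else 0
  c(observed = d1, expected = e1, variance = v)
})

O1 <- sum(pieces["observed", ])
E1 <- sum(pieces["expected", ])
V  <- sum(pieces["variance", ])

# Test statistic and p-value on 1 degree of freedom (two groups)
chisq   <- (O1 - E1)^2 / V
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

# Compare with the statistic reported by survdiff()
sd_fit <- survdiff(Surv(futime, fustat) ~ rx, data = ovarian)
c(manual = chisq, survdiff = sd_fit$chisq)
```

The hand-computed statistic should match the one from survdiff(), and `p_value` is the quantity that is compared with the chosen significance level.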
Based on the R output from the logrank_test call, we can see that the p-value is equal to 0.3. Thus, at the 0.05 significance level, we fail to reject the null hypothesis: the data do not provide evidence of a significant difference in the survival experiences between the two treatment arms.
By using the ‘survminer’ package’s ggsurvplot() function, we can visualize the Kaplan-Meier curves of the two treatment arms.
library(survival)
library(survminer)
# Create the survfit object
fit <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)
# Plot the Kaplan-Meier curves for each treatment arm
ggsurvplot(fit,
           data = ovarian,
           pval = TRUE,                 # show the log-rank p-value
           conf.int = FALSE,            # no confidence bands
           risk.table = TRUE,           # add the number-at-risk table
           risk.table.col = "strata",
           linetype = "strata",
           surv.median.line = "hv",     # mark the median survival times
           ggtheme = theme_bw(),
           palette = c("red", "blue"))