Week 12: Survival Analysis

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Class Sessions
Author
Affiliation

Jeff Jacobs

Published

Monday, April 7, 2025

Open slides in new tab →

Schedule

Today’s Planned Schedule:

Start End Topic
Lecture 6:30pm 7:00pm Estimating Survival Curves →
7:00pm 7:20pm Comparing Between Groups →
7:20pm 8:00pm Regression (Cox Proportional Hazard Model) →
Break! 8:00pm 8:10pm
8:10pm 9:00pm Quiz 3 →

Roadmap

  • Basic tool: survival curve \(S(t)\) = probability of surviving past period \(t\)
  • Complicating factor: Censored obs (e.g., drop out of study)
  • \(\Rightarrow\) Need to think about DGP: Why is observation \(i\) censored while \(j\) is observed?
    • Basic estimators only valid if censoring \(\perp\) survival time
  • How to compare survival between two groups
  • Regression: Effect of features \(X_1, \ldots, X_p\) on survival

Survival Analysis Basics

  • Lots of information crunched into this one figure!

ISLR Figure 11.1

Read Left to Right \(\Rightarrow\) Sequence of Events

Modified ISLR Figure 11.1

“Slice” at Deaths \(\Rightarrow\) At-Risk Observations

Modified ISLR Figure 11.1

The Actual Dataset We Get

Patient (\(i\)) Observed Outcome (\(Y_i\)) Observed? (\(\delta_i\))
1 300 1
2 365 0
3 150 1
4 250 0

The Dataset We Can Infer

Patient (\(i\)) \(Y_i\) \(\delta_i\) Survival Time (\(T_i\)) Censor Point (\(C_i\))
1 300 1 300 NA
2 365 0 NA 365
3 150 1 150 NA
4 250 0 NA 250

…If we’re testing effect of treatment, which column do we most care about?

Measuring Effect of Treatment!

Patient (\(i\)) \(Y_i\) \(\delta_i\) Survival Time (\(T_i\)) Censor Point (\(C_i\))
1 300 1 300 NA
2 365 0 NA 365
3 150 1 150 NA
4 250 0 NA 250

…If we’re testing effect of treatment, \(T_i\) is what we care about!

Basic Question 1: Survival

  • Let \(T\) be a RV representing time of death for a patient
  • What is probability that patient survives past given time \(t\)?

\[ S_T(t) = \Pr(T > t) \]

  • Note relationship to something you saw in 5100!
  • \(S_T(t)\) defined to be \(1 - F_T(t)\), where \(F_T(t)\) is CDF of \(T\):

\[ F_T(t) = \Pr(T \leq t) \]

Kaplan-Meier Estimator: Intuition

Each death event \(d_k\) gives us info that survival probability lower by some amount

Break \(S(t)\) into sequence of stepwise changes at \(d_1, \ldots, d_K\):

\[ \begin{align*} S(d_k) = \Pr(T > d_k) &\overset{\mathclap{\small\text{LTP}}}{=} \Pr(T > d_k \mid T > d_{k-1})\overbrace{\Pr(T > d_{k-1})}^{S(d_{k-1})} \\ &\phantom{=} \; \; + \underbrace{\Pr(T > d_k \mid T \leq d_{k-1})}_{\text{Contradiction} \implies \Pr = 0}\Pr(T \leq d_{k-1}) \end{align*} \]

Gives us a recurrence relation:

\[ \begin{align*} S(d_k) &= \Pr(T > d_k \mid T > d_{k-1})S(d_{k-1}) \\ S(d_{k-1}) &= \Pr(T > d_{k-1} \mid T > d_{k-2})S(d_{k-2}) \\ %S(d_{k-2}) &= \Pr(T > d_{k-2} \mid T > d_{k-3})S(d_{k-3}) \\ &\vdots \\ S(d_2) &= \Pr(T > d_2 \mid T > d_1)S(d_1) \\ S(d_1) &= \Pr(T > d_1 \mid T > d_0)S(d_0) = \Pr(T > d_1) \end{align*} \]

Plug each eq into eq above it to derive:

\[ \begin{align*} S(d_k) = &\Pr(\underbrace{T > d_k}_{\mathclap{\small\text{Survives past }d_k}} \; \mid \; \underbrace{T > d_{k-1}}_{\mathclap{\small\text{Survives past }d_{k-1}}}) \\ &\times \Pr(\underbrace{T > d_{k-1}}_{\small\text{Survives past }d_{k-1}} \mid \underbrace{T > d_{k-2}}_{\small\text{Survives past }d_{k-2}}) \\ &\times \cdots \times \Pr(\underbrace{T > d_2}_{\mathclap{\small\text{Survives past }d_2}} \; \mid \; \underbrace{T > d_1}_{\mathclap{\small\text{Survives past }d_1}}) \\ &\times \Pr(\underbrace{T > d_1}_{\mathclap{\small\text{Survives past }d_1}}) \end{align*} \]

Kaplan-Meier Estimator

  • Defined at death points \(d_k\) as

\[ \widehat{S}(d_k) = \prod_{j=1}^{k} \Bigl( \overbrace{ \frac{ r_j - q_j }{ \underbrace{r_j}_{\mathclap{\small\text{Num At Risk}}} } }^{\mathclap{\small\text{Num Survived}}} \Bigr) \]

  • Then, for \(t \in (d_k, d_{k+1})\), \(\widehat{S}(t) = \widehat{S}(d_k)\), producing stepwise survival function:

ISLR Figure 11.2

Kaplan-Meier Estimator for our Example

Two death points: \(d_1 = 150, d_2 = 300\) (plus start point \(d_0 = 0\))

\[ \begin{align*} {\color{#e69f00}\widehat{S}(d_0)} &= \prod_{j=0}^{0}\left( \frac{r_k - q_k}{r_k} \right) = \left( \frac{4 - 0}{4} \right) = {\color{#e69f00}\boxed{1}} \\ {\color{#56b4e9}\widehat{S}(d_1)} &= \prod_{j=0}^{1}\left( \frac{r_k - q_k}{r_k} \right) = {\color{#e69f00}\boxed{1}} \cdot \left( \frac{r_1-q_1}{r_1} \right) \\ &= {\color{#e69f00}\boxed{1}} \cdot \left( \frac{4 - 1}{4} \right) = {\color{#56B4E9}\boxed{\frac{3}{4}}} \\ {\color{#009e73}\widehat{S}(d_2)} &= \prod_{j=0}^{2}\left( \frac{r_k - q_k}{r_k} \right) = {\color{#e69f00}\boxed{1}} \cdot {\color{#56b4e9}\boxed{\frac{3}{4}}} \cdot \left( \frac{r_2 - q_2}{r_2} \right) \\ &= {\color{#e69f00}\boxed{1}} \cdot {\color{#56b4e9}\boxed{\frac{3}{4}}} \cdot \left( \frac{2-1}{2} \right) = \frac{3}{4}\cdot \frac{1}{2} = {\color{#009e73}\boxed{\frac{3}{8}}} \end{align*} \]

Code
library(tidyverse) |> suppressPackageStartupMessages()
library(survival) |> suppressPackageStartupMessages()
library(latex2exp) |> suppressPackageStartupMessages()
surv_df <- tribble(
  ~id, ~y, ~delta,
  1, 300, 1,
  2, 365, 0,
  3, 150, 1,
  4, 250, 0
)
surv_obj <- Surv(surv_df$y, event = surv_df$delta)
surv_model <- survfit(surv_obj ~ 1)
# Plot options
par(mar=c(2,4,1.25,1.0)) # bltr
y_label <- TeX("$\\Pr(T > t)$")
plot(
  surv_model,
  ylab=y_label,
  lwd=1,
  main="Survival Curve for 4-Patient Example"
) # conf.int=FALSE
# Add colors
# lines(c(0, 150), c(1.0, 1.0), type='l', col='#E69F00', lwd=2)
rect(xleft = 0, xright = 150, ybottom = 0, ytop = 1.0, col="#E69F0040", lwd=0)
# lines(c(150, 300), c(3/4, 3/4), type='l', col='#56B4E9', lwd=2)
rect(xleft = 150, xright = 300, ybottom = 0, ytop = 1.0, col="#56B4E940", lwd=0)
# lines(c(300, 365), c(3/8, 3/8), type='l', col='#009E73', lwd=2)
rect(xleft = 300, xright = 365, ybottom = 0, ytop = 1.0, col="#009E7340", lwd=0)

Comparing Survival Curves

  • Are female patients more likely to survive than male patients?

ISLR Figure 11.3

Log-Rank Test

  • At each death event \(d_k\), construct a table like:
Group 1 Group 2 Total
Died \(q_{1k}\) \(q_{2k}\) \(q_k\)
Survived \(r_{1k} - q_{1k}\) \(r_{2k} - q_{2k}\) \(r_k - q_k\)
Total \(r_{1k}\) \(r_{2k}\) \(r_k\)
  • Focus on \(q_{1k}\)! Null hypothesis: across all \(k \in \{1, \ldots, K\}\), \(q_{1k}\) not systematically lower or higher than RHS:

\[ \mathbb{E}[ \underbrace{q_{1k}}_{\mathclap{\small\substack{\text{Group 1 deaths} \\ \text{at }d_k}}} ] = \overbrace{r_{1k}}^{\mathclap{\small\substack{\text{At risk in} \\[0.2em] \text{Group 1}}}} \cdot \underbrace{\left( \frac{q_k}{r_k} \right)}_{\mathclap{\small\text{Overall death rate at }d_k}} \]

Log-Rank Test Statistic

  • Test statistic \(W\) should “detect” the alternative hypothesis, on basis of information \(X\)… We can use \(\boxed{\textstyle X = \sum_{k=1}^{K}q_{1k}}\)!
  • Here, log-rank test statistic \(W\) detects how much \(q_{1k}\) deviates from expected value from prev slide:

\[ \begin{align*} W &= \frac{X - \mathbb{E}[X]}{\sqrt{\text{Var}[X]}} = \frac{ \sum_{k=1}^{K}q_{1k} - \mathbb{E}\mkern-3mu\left[ \sum_{k=1}^{K}q_{1k} \right] }{ \sqrt{\text{Var}\mkern-3mu\left[ \sum_{k=1}^{K}q_{1k} \right]} } = \frac{ \sum_{k=1}^{K} \left( q_{1k} - \mathbb{E}[q_{1k}] \right) }{ \sqrt{\text{Var}\mkern-3mu\left[ \sum_{k=1}^{K}q_{1k} \right]} } \\ &= \frac{ \sum_{k=1}^{K}\left( q_{1k} - r_{1k}\cdot \frac{q_k}{r_k} \right) }{ \sqrt{\text{Var}\mkern-3mu\left[ \sum_{k=1}^{K}q_{1k} \right]} } \underset{\small\text{Ex 11.7}}{\overset{\small\text{ISLR}}{=}} \frac{ \sum_{k=1}^{K}\left( q_{1k} - r_{1k}\cdot \frac{q_k}{r_k} \right) }{ \sqrt{\sum_{k=1}^{K} \frac{ q_k(r_{1k}/r_k)(1 - r_{1k}/r_k)(r_k - q_k) }{ r_k - 1 }} } \end{align*} \]

Regression with Survival Response

The Hazard Function

  • Death rate at tiny instant after \(t\) (between \(t\) and \(t + \Delta t\)), given survival past \(t\):

\[ h(t) \definedas \lim_{\Delta t \rightarrow 0}\frac{\Pr(t < T \leq t + \Delta t)}{\Delta t} \]

  • If we define a RV \(T_{>t} \definedas [T \mid T > t]\), \(h(t)\) is the pdf of \(T_{>t}\)!
  • Can relate \(h(t)\) to quantities we know (e.g., from 5100):

\[ \underbrace{h(t)}_{\small\text{pdf of }T_{>t}} = \frac{\overbrace{f(t)}^{\small\text{pdf of }T}}{\underbrace{S(t)}_{\small \Pr(T > t)}} \]

Proportional Hazard Assumption

\[ h(t \mid x_i) = h_0(t)\exp\left[ \sum_{j=1}^{p}\beta_j x_{ij} \right] \iff \underbrace{\log[h(t \mid x_i)]}_{\hbar(t \mid x_i)} = \underbrace{\log[h_0(t)]}_{\hbar_0(t)} + \sum_{j=1}^{p}\beta_j x_{ij} \]

Basically: Features \(X_{j}\) shift [log] baseline hazard function \(\hbar_0(t)\) up and down by constant amounts, via multiplication by \(e^{\beta_j}\)

  • Top row: \(\hbar_0(t)\) in black, \(X_j = 1\) shifts it down via multiplication by \(e^{\beta_j}\) to form \(\hbar(t \mid X_j)\) in green
  • Bottom row: Proportional hazard violated, since \(X_j = 1\) associated with different changes to \(\hbar_0(t)\) at different \(t\) values

Cox Proportional Hazard Model

  • Intuition: Best \(\beta\)s are those which best predict \(i\)’s death among all at risk at same time. Called “Partial Likelihood” \(\text{PL}(\boldsymbol\beta)\) since we don’t need to estimate \(h_0(t)\)!

\[ \prod_{i : \, \delta_i = 1} \; \frac{ {\color{red}\cancel{\color{black}h_0(t)}}\exp\mkern-3mu\left[ \sum_{j=1}^{p} \beta_j x_{ij} \right] }{ {\displaystyle \sum\limits_{\mathclap{i': \, y_{i'} \geq y_i}}} {\color{red}\cancel{\color{black}h_0(t)}}\exp\mkern-3mu\left[ \sum_{j=1}^{p} \beta_j x_{i'j} \right] } = \prod_{i : \, \delta_i = 1} \; \frac{ \exp\mkern-3mu\left[ \sum_{j=1}^{p} \beta_j x_{ij} \right] }{ {\displaystyle \sum\limits_{\mathclap{i': \, y_{i'} \geq y_i}}} \exp\mkern-3mu\left[ \sum_{j=1}^{p} \beta_j x_{i'j} \right] } \]

  • Also note the missing intercept \(\beta_0\): handled by the baseline hazard function \(h_0(t)\) (which we cancel out anyways!)

In Code

  • We’ll use the survival library in R, very similar syntax to lm(), glm(), etc.!
Code
library(survival) |> suppressPackageStartupMessages()
library(ISLR2) |> suppressPackageStartupMessages()
bc_df <- BrainCancer |> filter(diagnosis != "Other")
bc_df$diagnosis = factor(bc_df$diagnosis)
bc_df$sex <- factor(substr(bc_df$sex, 1, 1))
bc_df$loc <- factor(substr(bc_df$loc, 1, 5))
options(width=130)
summary(bc_df)
 sex         diagnosis     loc           ki              gtv         stereo       status           time      
 F:39   Meningioma:42   Infra:10   Min.   : 40.00   Min.   : 0.040   SRS:19   Min.   :0.000   Min.   : 0.07  
 M:34   LG glioma : 9   Supra:63   1st Qu.: 80.00   1st Qu.: 2.500   SRT:54   1st Qu.:0.000   1st Qu.: 9.77  
        HG glioma :22              Median : 80.00   Median : 6.480            Median :0.000   Median :26.46  
                                   Mean   : 81.37   Mean   : 8.297            Mean   :0.411   Mean   :27.83  
                                   3rd Qu.: 90.00   3rd Qu.:11.380            3rd Qu.:1.000   3rd Qu.:41.44  
                                   Max.   :100.00   Max.   :33.690            Max.   :1.000   Max.   :82.56  
Code
bc_df |> head(5)
sex diagnosis loc ki gtv stereo status time
F Meningioma Infra 90 6.11 SRS 0 57.64
M HG glioma Supra 90 19.35 SRT 1 8.98
F Meningioma Infra 70 7.95 SRS 0 26.46
F LG glioma Supra 80 7.61 SRT 1 47.80
M HG glioma Supra 90 5.06 SRT 1 6.30

coxph() Estimation

Code
library(broom) |> suppressPackageStartupMessages()
full_cox_model <- coxph(
  Surv(time, status) ~ sex + diagnosis + loc + ki + gtv + stereo,
  data=bc_df
)
broom::tidy(full_cox_model) |> mutate_if(is.numeric, round, 3)
term estimate std.error statistic p.value
sexM 0.456 0.396 1.154 0.249
diagnosisLG glioma 0.864 0.641 1.347 0.178
diagnosisHG glioma 2.116 0.470 4.504 0.000
locSupra 1.482 1.105 1.342 0.180
ki -0.048 0.020 -2.406 0.016
gtv 0.036 0.026 1.361 0.173
stereoSRT -0.475 0.591 -0.803 0.422
  • Diagnosis: Relative to baseline of Meningioma, HG (High-Grade) glioma associated with \(e^{2.155} \approx 8.628\) times greater hazard
  • Karnofsky Index (ki): 1-unit increase associated with reduction of hazard to \(e^{-0.055} \approx 94.65\)% of previous value [0-100 scale of self-functioning abilities]

Survival Curves for Each Diagnosis

Code
library(extrafont) |> suppressPackageStartupMessages()
par(cex=1.2, family="CMU Sans Serif")
diag_levels <- c("Meningioma", "LG glioma", "HG glioma")
diag_df <- tibble(
  diagnosis = diag_levels,
  sex = rep("F", 3),
  loc = rep("Supra", 3),
  ki = rep(mean(bc_df$ki), 3),
  gtv = rep(mean(bc_df$gtv), 3),
  stereo = rep("SRT", 3)
)
survplots <- survfit(full_cox_model, newdata = diag_df)
plot(
  survplots,
  main = "Survival Curves by Diagnosis",
  xlab = "Months", ylab = "Survival Probability",
  col = cb_palette, lwd=1.5
)
legend(
  "bottomleft",
  diag_levels,
  col = cb_palette, lty = 1, lwd=1.5
)

Less Straightforward: Curves for Each ki, gtv Val

  • Technically a different survival curve for each value of ki \(\in [0, 100]\), gtv \(\in (0, \infty)\)
Code
bc_df |> ggplot(aes(x=ki)) +
  geom_density(
    linewidth=g_linewidth,
    fill=cb_palette[1], alpha=0.333
  ) +
  theme_dsan(base_size=28) +
  labs(title = "Karnofsky Index Distribution")

Code
bc_df |> ggplot(aes(x=gtv)) +
  geom_density(
    linewidth=g_linewidth,
    fill=cb_palette[1], alpha=0.333
  ) +
  theme_dsan(base_size=28) +
  labs(title = "Gross Tumor Volume (GTV) Distribution")

  • One approach: bin into (low, medium, high) via terciles, one curve per bin median:
Code
ki_terciles <- quantile(bc_df$ki, c(1/3, 2/3))
bc_df <- bc_df |> mutate(
  tercile = ifelse(ki < ki_terciles[1], 1, ifelse(ki < ki_terciles[2], 2, 3))
)
(terc_df <- bc_df |>
  group_by(tercile) |>
  summarize(med_ki=median(ki)))
tercile med_ki
1 70
2 80
3 90
Code
library(latex2exp) |> suppressPackageStartupMessages()
ki_df <- tibble(
  diagnosis = rep("Meningioma", 3),
  sex = rep("F", 3),
  loc = rep("Supra", 3),
  ki = terc_df$med_ki,
  gtv = rep(mean(bc_df$gtv), 3),
  stereo = rep("SRT", 3)
)
ki_plots <- survfit(full_cox_model, newdata = ki_df)
par(
  mar=c(4.0,4.0,1.2,0.5),
  cex=1.2,
  family="CMU Sans Serif"
) # bltr
plot(
  ki_plots,
  main = "Survival Curves by KI Tercile",
  xlab = "Months",
  ylab = TeX("$\\Pr(T > t)$"),
  lwd = 1,
  col = cb_palette
)
ki_labs <- c(
  TeX("$h( t \\, | \\, KI = 70 )$"),
  TeX("$h( t \\, | \\, KI = 80 )$"),
  TeX("$h( t \\, | \\, KI = 90 )$")
)
legend(
  "bottomleft",
  ki_labs, lwd=1,
  col = cb_palette, lty = 1, cex=0.8
)

Quiz

  • Last one! You got this, حَبَائب!