Week 12: Survival Analysis

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author
Affiliation

Jeff Jacobs

Published

Monday, April 7, 2025

Open slides in new tab →

Schedule

Today’s Planned Schedule:

Start End Topic
Lecture 6:30pm 7:00pm Estimating Survival Curves →
7:00pm 7:20pm Comparing Between Groups →
7:20pm 8:00pm Regression (Cox Proportional Hazard Model) →
Break! 8:00pm 8:10pm
8:10pm 9:00pm Quiz 3 →

Roadmap

  • Basic tool: survival curve S(t) = probability of surviving past period t
  • Complicating factor: Censored obs (e.g., drop out of study)
  • Need to think about DGP: Why is observation i censored while j is observed?
    • Basic estimators only valid if censoring survival time
  • How to compare survival between two groups
  • Regression: Effect of features X1,,Xp on survival

Survival Analysis Basics

  • Lots of information crunched into this one figure!

ISLR Figure 11.1

Read Left to Right Sequence of Events

Modified ISLR Figure 11.1

“Slice” at Deaths At-Risk Observations

Modified ISLR Figure 11.1

The Actual Dataset We Get

Patient (i) Observed Outcome (Yi) Observed? (δi)
1 300 1
2 365 0
3 150 1
4 250 0

The Dataset We Can Infer

Patient (i) Yi δi Survival Time (Ti) Censor Point (Ci)
1 300 1 300 NA
2 365 0 NA 365
3 150 1 150 NA
4 250 0 NA 250

…If we’re testing effect of treatment, which column do we most care about?

Measuring Effect of Treatment!

Patient (i) Yi δi Survival Time (Ti) Censor Point (Ci)
1 300 1 300 NA
2 365 0 NA 365
3 150 1 150 NA
4 250 0 NA 250

…If we’re testing effect of treatment, Ti is what we care about!

Basic Question 1: Survival

  • Let T be a RV representing time of death for a patient
  • What is probability that patient survives past given time t?

ST(t)=Pr(T>t)

  • Note relationship to something you saw in 5100!
  • ST(t) defined to be 1FT(t), where FT(t) is CDF of T:

FT(t)=Pr(Tt)

Kaplan-Meier Estimator: Intuition

Each death event dk gives us info that survival probability lower by some amount

Break S(t) into sequence of stepwise changes at d1,,dK:

S(dk)=Pr(T>dk)=LTPPr(T>dkT>dk1)Pr(T>dk1)S(dk1)=+Pr(T>dkTdk1)ContradictionPr=0Pr(Tdk1)

Gives us a recurrence relation:

S(dk)=Pr(T>dkT>dk1)S(dk1)S(dk1)=Pr(T>dk1T>dk2)S(dk2)S(d2)=Pr(T>d2T>d1)S(d1)S(d1)=Pr(T>d1T>d0)S(d0)=Pr(T>d1)

Plug each eq into eq above it to derive:

S(dk)=Pr(T>dkSurvives past dkT>dk1Survives past dk1)×Pr(T>dk1Survives past dk1T>dk2Survives past dk2)××Pr(T>d2Survives past d2T>d1Survives past d1)×Pr(T>d1Survives past d1)

Kaplan-Meier Estimator

  • Defined at death points dk as

S^(dk)=j=1k(rjqjrjNum At RiskNum Survived)

  • Then, for t(dk,dk+1), S^(t)=S^(dk), producing stepwise survival function:

ISLR Figure 11.2

Kaplan-Meier Estimator for our Example

Two death points: d1=150,d2=300 (plus start point d0=0)

S^(d0)=j=00(rkqkrk)=(404)=1S^(d1)=j=01(rkqkrk)=1(r1q1r1)=1(414)=34S^(d2)=j=02(rkqkrk)=134(r2q2r2)=134(212)=3412=38

Code
library(tidyverse) |> suppressPackageStartupMessages()
library(survival) |> suppressPackageStartupMessages()
library(latex2exp) |> suppressPackageStartupMessages()
surv_df <- tribble(
  ~id, ~y, ~delta,
  1, 300, 1,
  2, 365, 0,
  3, 150, 1,
  4, 250, 0
)
surv_obj <- Surv(surv_df$y, event = surv_df$delta)
surv_model <- survfit(surv_obj ~ 1)
# Plot options
par(mar=c(2,4,1.25,1.0)) # bltr
y_label <- TeX("$\\Pr(T > t)$")
plot(
  surv_model,
  ylab=y_label,
  lwd=1,
  main="Survival Curve for 4-Patient Example"
) # conf.int=FALSE
# Add colors
# lines(c(0, 150), c(1.0, 1.0), type='l', col='#E69F00', lwd=2)
rect(xleft = 0, xright = 150, ybottom = 0, ytop = 1.0, col="#E69F0040", lwd=0)
# lines(c(150, 300), c(3/4, 3/4), type='l', col='#56B4E9', lwd=2)
rect(xleft = 150, xright = 300, ybottom = 0, ytop = 1.0, col="#56B4E940", lwd=0)
# lines(c(300, 365), c(3/8, 3/8), type='l', col='#009E73', lwd=2)
rect(xleft = 300, xright = 365, ybottom = 0, ytop = 1.0, col="#009E7340", lwd=0)

Comparing Survival Curves

  • Are female patients more likely to survive than male patients?

ISLR Figure 11.3

Log-Rank Test

  • At each death event dk, construct a table like:
Group 1 Group 2 Total
Died q1k q2k qk
Survived r1kq1k r2kq2k rkqk
Total r1k r2k rk
  • Focus on q1k! Null hypothesis: across all k{1,,K}, q1k not systematically lower or higher than RHS:

E[q1kGroup 1 deathsat dk]=r1kAt risk inGroup 1(qkrk)Overall death rate at dk

Log-Rank Test Statistic

  • Test statistic W should “detect” the alternative hypothesis, on basis of information X… We can use X=k=1Kq1k!
  • Here, log-rank test statistic W detects how much q1k deviates from expected value from prev slide:

W=XE[X]Var[X]=k=1Kq1kE[k=1Kq1k]Var[k=1Kq1k]=k=1K(q1kE[q1k])Var[k=1Kq1k]=k=1K(q1kr1kqkrk)Var[k=1Kq1k]=ISLREx 11.7k=1K(q1kr1kqkrk)k=1Kqk(r1k/rk)(1r1k/rk)(rkqk)rk1

Regression with Survival Response

The Hazard Function

  • Death rate at tiny instant after t (between t and t+Δt), given survival past t:

h(t)=deflimΔt0Pr(t<Tt+Δt)Δt

  • If we define a RV T>t=def[TT>t], h(t) is the pdf of T>t!
  • Can relate h(t) to quantities we know (e.g., from 5100):

h(t)pdf of T>t=f(t)pdf of TS(t)Pr(T>t)

Proportional Hazard Assumption

h(txi)=h0(t)exp[j=1pβjxij]log[h(txi)](txi)=log[h0(t)]0(t)+j=1pβjxij

Basically: Features Xj shift [log] baseline hazard function 0(t) up and down by constant amounts, via multiplication by eβj

  • Top row: 0(t) in black, Xj=1 shifts it down via multiplication by eβj to form (tXj) in green
  • Bottom row: Proportional hazard violated, since Xj=1 associated with different changes to 0(t) at different t values

Cox Proportional Hazard Model

  • Intuition: Best βs are those which best predict i’s death among all at risk at same time. Called “Partial Likelihood” PL(β) since we don’t need to estimate h0(t)!

i:δi=1h0(t)exp[j=1pβjxij]i:yiyih0(t)exp[j=1pβjxij]=i:δi=1exp[j=1pβjxij]i:yiyiexp[j=1pβjxij]

  • Also note the missing intercept β0: handled by the baseline hazard function h0(t) (which we cancel out anyways!)

In Code

  • We’ll use the survival library in R, very similar syntax to lm(), glm(), etc.!
Code
library(survival) |> suppressPackageStartupMessages()
library(ISLR2) |> suppressPackageStartupMessages()
bc_df <- BrainCancer |> filter(diagnosis != "Other")
bc_df$diagnosis = factor(bc_df$diagnosis)
bc_df$sex <- factor(substr(bc_df$sex, 1, 1))
bc_df$loc <- factor(substr(bc_df$loc, 1, 5))
options(width=130)
summary(bc_df)
 sex         diagnosis     loc           ki              gtv         stereo       status           time      
 F:39   Meningioma:42   Infra:10   Min.   : 40.00   Min.   : 0.040   SRS:19   Min.   :0.000   Min.   : 0.07  
 M:34   LG glioma : 9   Supra:63   1st Qu.: 80.00   1st Qu.: 2.500   SRT:54   1st Qu.:0.000   1st Qu.: 9.77  
        HG glioma :22              Median : 80.00   Median : 6.480            Median :0.000   Median :26.46  
                                   Mean   : 81.37   Mean   : 8.297            Mean   :0.411   Mean   :27.83  
                                   3rd Qu.: 90.00   3rd Qu.:11.380            3rd Qu.:1.000   3rd Qu.:41.44  
                                   Max.   :100.00   Max.   :33.690            Max.   :1.000   Max.   :82.56  
Code
bc_df |> head(5)
sex diagnosis loc ki gtv stereo status time
F Meningioma Infra 90 6.11 SRS 0 57.64
M HG glioma Supra 90 19.35 SRT 1 8.98
F Meningioma Infra 70 7.95 SRS 0 26.46
F LG glioma Supra 80 7.61 SRT 1 47.80
M HG glioma Supra 90 5.06 SRT 1 6.30

coxph() Estimation

Code
library(broom) |> suppressPackageStartupMessages()
full_cox_model <- coxph(
  Surv(time, status) ~ sex + diagnosis + loc + ki + gtv + stereo,
  data=bc_df
)
broom::tidy(full_cox_model) |> mutate_if(is.numeric, round, 3)
term estimate std.error statistic p.value
sexM 0.456 0.396 1.154 0.249
diagnosisLG glioma 0.864 0.641 1.347 0.178
diagnosisHG glioma 2.116 0.470 4.504 0.000
locSupra 1.482 1.105 1.342 0.180
ki -0.048 0.020 -2.406 0.016
gtv 0.036 0.026 1.361 0.173
stereoSRT -0.475 0.591 -0.803 0.422
  • Diagnosis: Relative to baseline of Meningioma, HG (High-Grade) glioma associated with e2.1558.628 times greater hazard
  • Karnofsky Index (ki): 1-unit increase associated with reduction of hazard to e0.05594.65% of previous value [0-100 scale of self-functioning abilities]

Survival Curves for Each Diagnosis

Code
library(extrafont) |> suppressPackageStartupMessages()
par(cex=1.2, family="CMU Sans Serif")
diag_levels <- c("Meningioma", "LG glioma", "HG glioma")
diag_df <- tibble(
  diagnosis = diag_levels,
  sex = rep("F", 3),
  loc = rep("Supra", 3),
  ki = rep(mean(bc_df$ki), 3),
  gtv = rep(mean(bc_df$gtv), 3),
  stereo = rep("SRT", 3)
)
survplots <- survfit(full_cox_model, newdata = diag_df)
plot(
  survplots,
  main = "Survival Curves by Diagnosis",
  xlab = "Months", ylab = "Survival Probability",
  col = cb_palette, lwd=1.5
)
legend(
  "bottomleft",
  diag_levels,
  col = cb_palette, lty = 1, lwd=1.5
)

Less Straightforward: Curves for Each ki, gtv Val

  • Technically a different survival curve for each value of ki [0,100], gtv (0,)
Code
bc_df |> ggplot(aes(x=ki)) +
  geom_density(
    linewidth=g_linewidth,
    fill=cb_palette[1], alpha=0.333
  ) +
  theme_dsan(base_size=28) +
  labs(title = "Karnofsky Index Distribution")

Code
bc_df |> ggplot(aes(x=gtv)) +
  geom_density(
    linewidth=g_linewidth,
    fill=cb_palette[1], alpha=0.333
  ) +
  theme_dsan(base_size=28) +
  labs(title = "Gross Tumor Volume (GTV) Distribution")

  • One approach: bin into (low, medium, high) via terciles, one curve per bin median:
Code
ki_terciles <- quantile(bc_df$ki, c(1/3, 2/3))
bc_df <- bc_df |> mutate(
  tercile = ifelse(ki < ki_terciles[1], 1, ifelse(ki < ki_terciles[2], 2, 3))
)
(terc_df <- bc_df |>
  group_by(tercile) |>
  summarize(med_ki=median(ki)))
tercile med_ki
1 70
2 80
3 90
Code
library(latex2exp) |> suppressPackageStartupMessages()
ki_df <- tibble(
  diagnosis = rep("Meningioma", 3),
  sex = rep("F", 3),
  loc = rep("Supra", 3),
  ki = terc_df$med_ki,
  gtv = rep(mean(bc_df$gtv), 3),
  stereo = rep("SRT", 3)
)
ki_plots <- survfit(full_cox_model, newdata = ki_df)
par(
  mar=c(4.0,4.0,1.2,0.5),
  cex=1.2,
  family="CMU Sans Serif"
) # bltr
plot(
  ki_plots,
  main = "Survival Curves by KI Tercile",
  xlab = "Months",
  ylab = TeX("$\\Pr(T > t)$"),
  lwd = 1,
  col = cb_palette
)
ki_labs <- c(
  TeX("$h( t \\, | \\, KI = 70 )$"),
  TeX("$h( t \\, | \\, KI = 80 )$"),
  TeX("$h( t \\, | \\, KI = 90 )$")
)
legend(
  "bottomleft",
  ki_labs, lwd=1,
  col = cb_palette, lty = 1, cex=0.8
)

Quiz 3

  • Last one! You got this, حَبَائب!