Week 4: Clearing the Path from Cause to Effect

DSAN 5650: Causal Inference for Computational Social Science
Summer 2025, Georgetown University

Jeff Jacobs

jj1088@georgetown.edu

Wednesday, June 11, 2025

Schedule

Today’s Planned Schedule:

	Start	End	Topic
Lecture	6:30pm	7:10pm	PGM as Modeling Language
	7:10pm	7:30pm	The Ladder of Causal Inference
	7:30pm	7:50pm	Elemental Confounds I: Forks and Chains
Break!	7:50pm	8:00pm
	8:00pm	8:50pm	Elemental Confounds II: ⚠️Colliders⚠️
	8:50pm	9:00pm	Elemental Confounds III: Proxies

Roadmap

5300 → Now

In e.g. 5300, you learned a bunch of ad hoc models: Linear Regression, Decision Trees, SVMs
PGMs provide a formalized modeling language for “writing out” models unambiguously in a way your computer understands: specifying exactly how to estimate parameters from data

Now → August: Class splits into two themes, running in parallel!

What kinds of cool comp social sci models are unlocked, now that we can implement them in this language? [HW2]
How can we expand PGM vocabulary to incorporate causality? [Midterm]

From PGMs to SWIGs (Bezuidenhout et al. 2024)

Why Take the Time to Learn a Modeling Language (vs. Individual Models)?

My answer: it allows you to adapt to the specifics/idiosyncrasies of your problem!
Language metaphor: Learning models vs. learning modeling language \(\Leftrightarrow\) Learning phrases in a language vs. learning to speak the language
“Hello, one hamburger please” is good, but what if you…
Are allergic to ketchup and need to make sure it’s removed
Want to replace sesame seed bun with poppy seed bun, if they have it
Prefer spicy, but not too spicy, mustard
Bun only
Animal style
…

Languages give us a syntax…

S	\(\rightarrow\)	NP VP
NP	\(\rightarrow\)	DetP N | AdjP NP
VP	\(\rightarrow\)	V NP
AdjP	\(\rightarrow\)	Adj | Adv AdjP
N	\(\rightarrow\)	`frog` | `tadpole`
V	\(\rightarrow\)	`sees` | `likes`
Adj	\(\rightarrow\)	`big` | `small`
Adv	\(\rightarrow\)	`very`
DetP	\(\rightarrow\)	`a` | `the`

…For expressing arbitrary (infinitely many!) sentences
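To make the "infinitely many sentences" point concrete, here is a minimal base-R sketch that samples sentences from the grammar above. The rule table copies the slide; the `expand()` helper and the choice to pick uniformly among alternatives are illustrative assumptions, not part of the course code.

```r
# Toy generator for the grammar on the slide.
# expand() and the uniform choice among alternatives are illustrative assumptions.
grammar <- list(
  S    = list(c("NP", "VP")),
  NP   = list(c("DetP", "N"), c("AdjP", "NP")),
  VP   = list(c("V", "NP")),
  AdjP = list(c("Adj"), c("Adv", "AdjP")),
  N    = list("frog", "tadpole"),
  V    = list("sees", "likes"),
  Adj  = list("big", "small"),
  Adv  = list("very"),
  DetP = list("a", "the")
)

expand <- function(symbol) {
  # Anything without a rewrite rule is a terminal (an actual word)
  if (!(symbol %in% names(grammar))) return(symbol)
  options <- grammar[[symbol]]
  # Pick one right-hand side uniformly at random
  choice <- options[[sample.int(length(options), 1)]]
  # Recursively expand every symbol on that right-hand side
  unlist(lapply(choice, expand))
}

set.seed(5650)
sentences <- replicate(5, paste(expand("S"), collapse = " "))
print(sentences)
```

Because NP and AdjP can rewrite to themselves, there is no upper bound on sentence length, even though the rule table has only nine rows.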

Example 1: Multilevel Tadpoles (McElreath, Ch. 13)

Need a language that can communicate the following info to estimation algorithm:

Unit of observation is tadpole, but unit of analysis is tank
Ultimately, I care about \(Y =\) survival rate (dependent var), as function of \(X =\) tank properties (independent var)
…But the \(n_i = 48\) tanks actually come in \(n_j = 3\) types: small (10 tadpoles), medium (25), large (35) (Bonus: What if there are different numbers of tanks per type?)
I need it to account for impact of tank size, then pool info across tank sizes
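One way these requirements might be written down, loosely following the multilevel tank model in McElreath (2020, Ch. 13). Here \(S_i\) is the number of survivors and \(N_i\) the initial count in tank \(i\); the size-level means \(\mu_j\), the index \(j[i]\) for tank \(i\)'s size type, and the specific priors are illustrative assumptions rather than the model used later in the course:

\[
\begin{aligned}
S_i &\sim \text{Binomial}(N_i, p_i), \quad i = 1, \ldots, 48 \\
\text{logit}(p_i) &= \alpha_i \\
\alpha_i &\sim \mathcal{N}(\mu_{j[i]}, \sigma) \\
\mu_j &\sim \mathcal{N}(\bar{\mu}, \tau), \quad j = 1, 2, 3 \\
\bar{\mu} &\sim \mathcal{N}(0, 1.5), \quad \sigma \sim \text{Exponential}(1), \quad \tau \sim \text{Exponential}(1)
\end{aligned}
\]

The first line says tadpoles are observed but tanks are analyzed; the third accounts for tank size; the fourth pools information across sizes.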

From McElreath (2020)

Example 2: Dissertation Nightmare

*Above*: Data from Soviet archives; *Above Right*: US Military archives; *Below Right*: NATO archives

Nightmarish Without a Modeling Language!

Modeling language \(\Rightarrow\) Unambiguously “encode” idiosyncratic domain knowledge
Dissertation: Cold War \(\times\) “Third World” \(\leadsto\) Cuban 🇨🇺 trans-continental operations¹
Main narrative (for estimation): 1975 (South Africa invades Angola, 14 Oct → 🇨🇺 intervention, 4 Nov) to 1979 (USSR requests 🇨🇺 troops to Ethiopia for Ogaden War)
[Ontology] Fix 1979 geographic entities at National level (as modeling choice, like fixing 2000 USD to measure inflation): \(\textsf{Cuba}_{1979}\), \(\textsf{Angola}_{1979}\), \(\textsf{PDRY}_{1979}\), \(\textsf{YAR}_{1979}\)
Different tokens (Think NLP: "Congo", "DRC", "Republic of Congo") can then be contextualized: can “track” and link data appropriately despite splits, merges, name changes
Say we have data on “Number of Communist Militants in \(X\)” (Hoover Yearbook)…

Entity	Level	Data from 1947-1971	Data from 1971-Present
\(\textsf{Pakistan}_{1979}\)	National	\(\frac{62}{62+70} \times\) “Pakistan”	“Pakistan”
\(\textsf{Pakistan}_{1979}\)	Subnational	“West Pakistan”	\(\sum_{i \in \text{Regions}}\text{data}_i\)
\(\textsf{Bangladesh}_{1979}\)	National	\(\frac{70}{62+70} \times\) “Pakistan”	“Bangladesh”
\(\textsf{Bangladesh}_{1979}\)	Subnational	“East Pakistan”	\(\sum_{i \in \text{Regions}}\text{data}_i\)
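The national-level rule in the table is a proportional split, which can be sketched in a few lines of base R. The 62 : 70 weights come from the table; the `apportion()` helper and the input count of 3000 are hypothetical, chosen only to show the arithmetic.

```r
# Split a pre-1971 national "Pakistan" figure between the two 1979 entities,
# using the 62 : 70 weights from the table.
# apportion() is a hypothetical helper; the input count is made up.
apportion <- function(total, weights) {
  total * weights / sum(weights)
}

weights_1979 <- c(Pakistan_1979 = 62, Bangladesh_1979 = 70)
pakistan_national <- 3000  # hypothetical national-level count, for illustration only

shares <- apportion(pakistan_national, weights_1979)
print(round(shares, 1))

# The apportioned pieces add back up to the original national figure
stopifnot(isTRUE(all.equal(sum(shares), pakistan_national)))
```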

The Ladder of Causal Inference

		Counterfactuals: What would have happened, if history was slightly different… \(\Pr(Y_{M=M_0} \mid \textsf{do}(X)) - \Pr(Y_{M=M_0} \mid \textsf{do}(\neg X))\)
	Intervention: What happens if I… \(\Pr(Y \mid \textsf{do}(X)) - \Pr(Y \mid \textsf{do}(\neg X))\)
Association: What happened? \(\Pr(Y \mid X) - \Pr(Y \mid \neg X)\)

\(\leadsto\) Stuff we add to probability theory in 5650 is to combat confounding: to “fix” whatever is making \(\Pr(Y \mid X) \neq \Pr(Y \mid \textsf{do}(X))\)!
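A small simulation can make the gap between the bottom two rungs visible. The sketch below assumes a fork \(X \leftarrow Z \rightarrow Y\) with no arrow from \(X\) to \(Y\), so the true effect of \(X\) on \(Y\) is zero. It uses base R's `rbinom()` rather than `rbern()`, and `sim_fork()` with its `do_x` argument is a hypothetical helper, not course code.

```r
# Fork: Z -> X and Z -> Y, with NO arrow from X to Y (true causal effect = 0).
# sim_fork() and do_x are hypothetical names used only for this sketch.
set.seed(5650)
n <- 100000

sim_fork <- function(n, do_x = NULL) {
  z <- rbinom(n, 1, 0.5)
  # Observational regime: X listens to Z.
  # Under do(X = x) we overwrite X, which cuts the Z -> X arrow.
  x <- if (is.null(do_x)) rbinom(n, 1, 0.1 + 0.8 * z) else rep(do_x, n)
  y <- rbinom(n, 1, 0.1 + 0.8 * z)
  data.frame(x = x, y = y, z = z)
}

# Rung 1 (association): compare Y across observed values of X
obs <- sim_fork(n)
assoc_diff <- mean(obs$y[obs$x == 1]) - mean(obs$y[obs$x == 0])

# Rung 2 (intervention): compare Y across worlds where we set X ourselves
do_diff <- mean(sim_fork(n, do_x = 1)$y) - mean(sim_fork(n, do_x = 0)$y)

cat("Pr(Y | X) - Pr(Y | not X):         ", round(assoc_diff, 3), "\n")
cat("Pr(Y | do(X)) - Pr(Y | do(not X)): ", round(do_diff, 3), "\n")
```

The associational difference is large because \(X\) carries information about \(Z\); the interventional difference sits near zero because setting \(X\) by hand tells us nothing about \(Z\).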

The Four Elemental Confounds

From Richard McElreath’s Statistical Rethinking Lectures

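library(tidyverse) # For tibble, dplyr verbs, and ggplot
library(extraDistr) # For rbern()
library(patchwork) # For side-by-side plotting
# NOTE: theme_dsan(), remove_legend(), and g_pointsize are used below but are not
# defined on these slides; they are assumed to come from the course's shared setup code
n_d <- 10000 # Sample size for discrete RVs
n_c <- 300 # Sample size for continuous RVs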

The Fork: \(X \leftarrow Z \rightarrow Y\)

set.seed(5650)
fork_df <- tibble(
    Z = rbern(n_d),
    X = rbern(n_d, (1-Z)*0.1 + Z*0.9),
    Y = rbern(n_d, (1-Z)*0.1 + Z*0.9),
)

plot_freqs <- function(df, plot_title, y_lab=TRUE) {
  df_cor <- cor(df$X, df$Y)
  df_label <- paste0("Cor(X,Y) = ",round(df_cor,3))
  freq_df <- df |>
    group_by(X, Y) |>
    summarize(count=n(), .groups="drop")
  freq_plot <- freq_df |>
    ggplot(
      aes(x=factor(X), y=factor(Y), fill=count)
    ) +
    geom_tile() +
    coord_equal() +
    scale_fill_distiller(
      palette="Greens", direction=1,
      limits=c(0,5000)
    ) +
    geom_label(
      aes(label=count),
      fill="white", color="black", size=7
    ) +
    labs(
      title = plot_title,
      subtitle = df_label,
      x="X", y="Y"
    ) +
    theme_dsan(base_size=24) +
    theme(
      plot.title = element_text(size=21),
      plot.subtitle = element_text(size=18)
    ) +
    remove_legend()
  if (!y_lab) {
    freq_plot <- freq_plot + theme(
      axis.title.y = element_blank()
    )
  }
  return(freq_plot)
}
# The full df
full_label <- paste0("Raw Data (n = 10K)")
full_plot <- plot_freqs(fork_df, full_label)
# Conditioning on Z = 0
z0_df <- fork_df |> filter(Z == 0)
z0_n <- nrow(z0_df)
z0_label <- paste0("Z == 0 (",z0_n," obs)")
z0_plot <- plot_freqs(z0_df, z0_label, y_lab=FALSE)
# Conditioning on Z = 1
z1_df <- fork_df |> filter(Z == 1)
z1_n <- nrow(z1_df)
z1_label <- paste0("Z == 1 (",z1_n," obs)")
z1_plot <- plot_freqs(z1_df, z1_label, y_lab=FALSE)
full_plot | z0_plot | z1_plot
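The unstratified correlation in the discrete fork can also be worked out by hand from the simulation's parameters, assuming `rbern(n_d)` uses its default \(\Pr(Z = 1) = 0.5\), so that \(\Pr(X = 1 \mid Z) = \Pr(Y = 1 \mid Z) = 0.1 + 0.8Z\). Since \(X\) and \(Y\) are independent given \(Z\), the law of total covariance leaves only the between-stratum term:

\[
\begin{aligned}
\text{Cov}(X, Y) &= \underbrace{\mathbb{E}[\text{Cov}(X, Y \mid Z)]}_{= 0} + \text{Cov}(\mathbb{E}[X \mid Z], \mathbb{E}[Y \mid Z]) = 0.8^2 \, \text{Var}(Z) = 0.64 \times 0.25 = 0.16 \\
\text{Var}(X) &= \text{Var}(Y) = 0.5 \times 0.5 = 0.25 \\
\text{Cor}(X, Y) &= \frac{0.16}{0.25} = 0.64
\end{aligned}
\]

So the raw-data panel should report a correlation near 0.64, while each \(Z\) stratum should report one near 0.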

set.seed(5650)
cfork_df <- tibble(
    Z = rbern(n_c),
    X = rnorm(n_c, 2 * Z - 1),
    Y = rnorm(n_c, 2 * Z - 1)
)

library(latex2exp)
overall_lm <- lm(Y ~ X, data=cfork_df)
overall_slope <- round(overall_lm$coef['X'], 3)
z0_lm <- lm(Y ~ X, data=cfork_df |> filter(Z == 0))
z0_slope <- round(z0_lm$coef['X'], 2)
z0_label <- paste0("$Slope_{Z=0} = ",z0_slope,"$")
z0_leg_label <- TeX(paste0("0 $(m=",z0_slope,")$"))
z1_lm <- lm(Y ~ X, data=cfork_df |> filter(Z == 1))
z1_slope <- round(z1_lm$coef['X'], 2)
z1_label <- paste0("$Slope_{Z=1} = ",z1_slope,"$")
z_texlabel <- TeX(paste0(z0_label, " | ", z1_label))
cfork_xmin <- min(cfork_df$X)
cfork_xmax <- max(cfork_df$X)
ggplot() +
  # Points
  geom_point(
    data=cfork_df,
    aes(x=X, y=Y, color=factor(Z)),
    size=0.6*g_pointsize,
    alpha=0.8
  ) +
  # Overall lm
  geom_smooth(
    data=cfork_df, aes(x=X, y=Y),
    method = lm, se = FALSE,
    linewidth = 2.5, color='black'
  ) +
  # Stratified lm
  # (slightly larger black lines)
  geom_smooth(
    data=cfork_df,
    aes(x=X, y=Y, group=factor(Z)),
    method=lm, se=FALSE, fullrange=TRUE,
    linewidth=2.75, color='black'
  ) +
  # (Colored lines)
  geom_smooth(
    data=cfork_df,
    aes(x=X, y=Y, color=factor(Z)),
    method=lm, se=FALSE, fullrange=TRUE,
    linewidth=2
  ) +
  theme_dsan(base_size=24) +
  theme(
    plot.title = element_text(size=24),
    plot.subtitle = element_text(size=20)
  ) +
  coord_equal() +
  labs(
    title = paste0(
      "Unstratified Slope = ",overall_slope
    ),
    subtitle=z_texlabel,
    x = "X", y = "Y", color = "Z"
  )

The Pipe: \(X \rightarrow Z \rightarrow Y\)

set.seed(5650)
pipe_df <- tibble(
    X = rbern(n_d),
    Z = rbern(n_d, (1-X)*0.1 + X*0.9),
    Y = rbern(n_d, (1-Z)*0.1 + Z*0.9),
)

# The full df
pipe_full_label <- paste0("Raw Data (n = 10K)")
pipe_full_plot <- plot_freqs(pipe_df, pipe_full_label)
# Conditioning on Z = 0
pipe_z0_df <- pipe_df |> filter(Z == 0)
pipe_z0_n <- nrow(pipe_z0_df)
pipe_z0_label <- paste0("Z == 0 (",pipe_z0_n," obs)")
pipe_z0_plot <- plot_freqs(pipe_z0_df, pipe_z0_label, y_lab=FALSE)
# Conditioning on Z = 1
pipe_z1_df <- pipe_df |> filter(Z == 1)
pipe_z1_n <- nrow(pipe_z1_df)
pipe_z1_label <- paste0("Z == 1 (",pipe_z1_n," obs)")
pipe_z1_plot <- plot_freqs(pipe_z1_df, pipe_z1_label, y_lab=FALSE)
pipe_full_plot | pipe_z0_plot | pipe_z1_plot

set.seed(5650)
cpipe_df <- tibble(
    X = rnorm(n_c),
    Z = rbern(n_c, plogis(X)),
    Y = rnorm(n_c, 2 * Z - 1)
)

cpipe_lm <- lm(Y ~ X, data=cpipe_df)
cpipe_slope <- round(cpipe_lm$coef['X'], 3)
cpipe_z0_lm <- lm(Y ~ X, data=cpipe_df |> filter(Z == 0))
cpipe_z0_slope <- round(cpipe_z0_lm$coef['X'], 2)
cpipe_z0_label <- paste0("$Slope_{Z=0} = ",cpipe_z0_slope,"$")
cpipe_z1_lm <- lm(Y ~ X, data=cpipe_df |> filter(Z == 1))
cpipe_z1_slope <- round(cpipe_z1_lm$coef['X'], 2)
cpipe_z1_label <- paste0("$Slope_{Z=1} = ",cpipe_z1_slope,"$")
cpipe_z_texlabel <- TeX(paste0(cpipe_z0_label, " | ", cpipe_z1_label))
cpipe_xmin <- min(cpipe_df$X)
cpipe_xmax <- max(cpipe_df$X)
ggplot() +
  # Points
  geom_point(
    data=cpipe_df |> filter(Y > -3),
    aes(x=X, y=Y, color=factor(Z)),
    size=0.6*g_pointsize,
    alpha=0.8
  ) +
  # Overall lm
  geom_smooth(
    data=cpipe_df, aes(x=X, y=Y),
    method = lm, se = FALSE,
    linewidth = 2.5, color='black'
  ) +
  # Stratified lm
  # (slightly larger black lines)
  geom_smooth(
    data=cpipe_df,
    aes(x=X, y=Y, group=factor(Z)),
    method=lm, se=FALSE, fullrange=TRUE,
    linewidth=2.75, color='black'
  ) +
  # (Colored lines)
  geom_smooth(
    data=cpipe_df,
    aes(x=X, y=Y, color=factor(Z)),
    method=lm, se=FALSE, fullrange=TRUE,
    linewidth=2
  ) +
  theme_dsan(base_size=24) +
  theme(
    plot.title = element_text(size=24),
    plot.subtitle = element_text(size=20)
  ) +
  coord_equal() +
  labs(
    title = paste0(
      "Unstratified Slope = ",cpipe_slope
    ),
    subtitle=cpipe_z_texlabel,
    x = "X", y = "Y", color = "Z"
  )

⚠️The Collider⚠️: \(X \rightarrow Z \leftarrow Y\)

set.seed(5650)
coll_df <- tibble(
    X = rbern(n_d),
    Y = rbern(n_d),
    Z = rbern(n_d, ifelse(X + Y > 0, 0.9, 0.2)),
)

# The full df
coll_full_label <- paste0("Raw Data (n = 10K)")
coll_full_plot <- plot_freqs(coll_df, coll_full_label)
# Conditioning on Z = 0
coll_z0_df <- coll_df |> filter(Z == 0)
coll_z0_n <- nrow(coll_z0_df)
coll_z0_label <- paste0("Z == 0 (",coll_z0_n," obs)")
coll_z0_plot <- plot_freqs(coll_z0_df, coll_z0_label, y_lab=FALSE)
# Conditioning on Z = 1
coll_z1_df <- coll_df |> filter(Z == 1)
coll_z1_n <- nrow(coll_z1_df)
coll_z1_label <- paste0("Z == 1 (",coll_z1_n," obs)")
coll_z1_plot <- plot_freqs(coll_z1_df, coll_z1_label, y_lab=FALSE)
coll_full_plot | coll_z0_plot | coll_z1_plot

Conditioning on colliders induces correlation where there previously was none ☠️
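The same point can be checked numerically without any plotting. This is a base-R sketch of the discrete collider with the same probabilities as the simulation above; it uses `rbinom()` instead of `rbern()`, so the individual draws will differ from the course code's.

```r
# Discrete collider: X and Y are independent coin flips; Z depends on both.
# rbinom() stands in for rbern(), so draws differ from the slides' simulation.
set.seed(5650)
n <- 10000
x <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.5)
z <- rbinom(n, 1, ifelse(x + y > 0, 0.9, 0.2))

# Correlation in the full data vs. within each stratum of the collider
cor_all <- cor(x, y)
cor_z0 <- cor(x[z == 0], y[z == 0])
cor_z1 <- cor(x[z == 1], y[z == 1])
print(round(c(all = cor_all, z0 = cor_z0, z1 = cor_z1), 3))
```

The full-data correlation is near zero, while the two strata show correlations of opposite sign: positive within \(Z = 0\), negative within \(Z = 1\).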

set.seed(5650)
ccoll_df <- tibble(
    X = rnorm(n_c),
    Y = rnorm(n_c),
    Z = rbern(n_c, plogis(2 * (X + Y - 1)))
)

ccoll_lm <- lm(Y ~ X, data=ccoll_df)
ccoll_slope <- round(ccoll_lm$coef['X'], 3)
ccoll_z0_lm <- lm(Y ~ X, data=ccoll_df |> filter(Z == 0))
ccoll_z0_slope <- round(ccoll_z0_lm$coef['X'], 2)
ccoll_z0_label <- paste0("$Slope_{Z=0} = ",ccoll_z0_slope,"$")
ccoll_z1_lm <- lm(Y ~ X, data=ccoll_df |> filter(Z == 1))
ccoll_z1_slope <- round(ccoll_z1_lm$coef['X'], 2)
ccoll_z1_label <- paste0("$Slope_{Z=1} = ",ccoll_z1_slope,"$")
ccoll_z_texlabel <- TeX(paste0(ccoll_z0_label, " | ", ccoll_z1_label))
ccoll_xmin <- min(ccoll_df$X)
ccoll_xmax <- max(ccoll_df$X)
ggplot() +
  # Points
  geom_point(
    data=ccoll_df |> filter(Y > -3),
    aes(x=X, y=Y, color=factor(Z)),
    size=0.6*g_pointsize,
    alpha=0.8
  ) +
  # Overall lm
  geom_smooth(
    data=ccoll_df, aes(x=X, y=Y),
    method = lm, se = FALSE,
    linewidth = 2.5, color='black'
  ) +
  # Stratified lm
  # (slightly larger black lines)
  geom_smooth(
    data=ccoll_df,
    aes(x=X, y=Y, group=factor(Z)),
    method=lm, se=FALSE, fullrange=TRUE,
    linewidth=2.75, color='black'
  ) +
  # (Colored lines)
  geom_smooth(
    data=ccoll_df,
    aes(x=X, y=Y, color=factor(Z)),
    method=lm, se=FALSE, fullrange=TRUE,
    linewidth=2
  ) +
  theme_dsan(base_size=24) +
  theme(
    plot.title = element_text(size=24),
    plot.subtitle = element_text(size=20)
  ) +
  coord_equal() +
  labs(
    title = paste0(
      "Unstratified Slope = ",ccoll_slope
    ),
    subtitle=ccoll_z_texlabel,
    x = "X", y = "Y", color = "Z"
  )

…This is why we have to think, rather than just “control for everything”! 😭
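In regression terms, "controlling for" a collider means adding it as a covariate. The base-R sketch below reuses the continuous collider setup from above, with a larger sample (n = 5000, an assumption made here so the estimates are stable) and `rbinom()` in place of `rbern()`.

```r
# Continuous collider: X and Y are independent; Z is more likely when X + Y is large.
# n = 5000 is chosen for this sketch only; the slides use n_c = 300.
set.seed(5650)
n <- 5000
x <- rnorm(n)
y <- rnorm(n)
z <- rbinom(n, 1, plogis(2 * (x + y - 1)))

# Slope on X without and with the collider as a "control"
slope_unadjusted <- coef(lm(y ~ x))[["x"]]
slope_adjusted <- coef(lm(y ~ x + z))[["x"]]
print(round(c(unadjusted = slope_unadjusted, adjusted = slope_adjusted), 3))
```

The unadjusted slope is close to the true value of zero; adding \(Z\) to the regression pushes it clearly negative, so the "controlled" model is the biased one.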

Proxies for \(Z\)

set.seed(5650)
prox_df <- tibble(
  X = rbern(n_d),
  Z = rbern(n_d, (1-X)*0.1 + X*0.9),
  Y = rbern(n_d, (1-Z)*0.1 + Z*0.9),
  A = rbern(n_d, (1-Z)*0.1 + Z*0.9)
)

# The full df
prox_full_label <- paste0("Raw Data (n = 10K)")
prox_full_plot <- plot_freqs(prox_df, prox_full_label)
# Conditioning on A == 0
prox_a0_df <- prox_df |> filter(A == 0)
prox_a0_n <- nrow(prox_a0_df)
prox_a0_label <- paste0("A == 0 (",prox_a0_n," obs)")
prox_a0_plot <- plot_freqs(prox_a0_df, prox_a0_label, y_lab=FALSE)
# Conditioning on A == 1
prox_a1_df <- prox_df |> filter(A == 1)
prox_a1_n <- nrow(prox_a1_df)
prox_a1_label <- paste0("A == 1 (",prox_a1_n," obs)")
prox_a1_plot <- plot_freqs(prox_a1_df, prox_a1_label, y_lab=FALSE)
prox_full_plot | prox_a0_plot | prox_a1_plot

With just \(X \rightarrow Z \rightarrow Y\), we’d have a pipe
Observing \(A\) gives us some (not all!) information about \(Z\)
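A quick numerical comparison of "some" vs. "all": the base-R sketch below uses the same probabilities as the discrete proxy simulation and compares the \(X\)-\(Y\) correlation in the full data, within strata of \(Z\), and within strata of \(A\). The `cor_within()` helper is a hypothetical name, and `rbinom()` replaces `rbern()`.

```r
# Pipe X -> Z -> Y, plus a noisy proxy A that is a child of Z.
# cor_within() is a hypothetical helper used only for this sketch.
set.seed(5650)
n <- 10000
x <- rbinom(n, 1, 0.5)
z <- rbinom(n, 1, 0.1 + 0.8 * x)
y <- rbinom(n, 1, 0.1 + 0.8 * z)
a <- rbinom(n, 1, 0.1 + 0.8 * z)

# Correlation of X and Y within each level of a conditioning variable g
cor_within <- function(g) {
  sapply(split(data.frame(x, y), g), function(d) cor(d$x, d$y))
}

cor_all <- cor(x, y)
cor_by_z <- cor_within(z)  # conditioning on the mediator itself
cor_by_a <- cor_within(a)  # conditioning on its proxy
print(round(c(all = cor_all, z = cor_by_z, a = cor_by_a), 3))
```

Conditioning on \(Z\) drives the correlation to roughly zero in both strata, while conditioning on \(A\) only shrinks it partway: the proxy blocks part of the path, not all of it.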

library(tidyverse)
library(extraDistr)
library(latex2exp)
set.seed(5650)
n_c <- 300
cprox_df <- tibble(
    X = rnorm(n_c),
    Z = rbern(n_c, plogis(X)),
    Y = rnorm(n_c, 2 * Z - 1),
    A = rbern(n_c, (1-Z)*0.86 + Z*0.14)
)
cprox_lm <- lm(Y ~ X, data=cprox_df)
cprox_slope <- round(cprox_lm$coef['X'], 3)
cprox_a0_lm <- lm(Y ~ X, data=cprox_df |> filter(A == 0))
cprox_a0_slope <- round(cprox_a0_lm$coef['X'], 2)
cprox_a0_label <- paste0("$Slope_{A=0} = ",cprox_a0_slope,"$")
# A == 1 lm
cprox_a1_lm <- lm(Y ~ X, data=cprox_df |> filter(A == 1))
cprox_a1_slope <- round(cprox_a1_lm$coef['X'], 2)
cprox_a1_label <- paste0("$Slope_{A=1} = ",cprox_a1_slope,"$")
cprox_a_texlabel <- TeX(paste0(cprox_a0_label, " | ", cprox_a1_label))
cprox_xmin <- min(cprox_df$X)
cprox_xmax <- max(cprox_df$X)
ggplot() +
  # Points
  geom_point(
    data=cprox_df |> filter(Y > -3),
    aes(x=X, y=Y, color=factor(A)),
    size=0.6*g_pointsize,
    alpha=0.8
  ) +
  # Overall lm
  geom_smooth(
    data=cprox_df, aes(x=X, y=Y),
    method = lm, se = FALSE,
    linewidth = 2.5, color='black'
  ) +
  # Stratified lm
  # (slightly larger black lines)
  geom_smooth(
    data=cprox_df,
    aes(x=X, y=Y, group=factor(A)),
    method=lm, se=FALSE, fullrange=TRUE,
    linewidth=2.75, color='black'
  ) +
  # (Colored lines)
  geom_smooth(
    data=cprox_df,
    aes(x=X, y=Y, color=factor(A)),
    method=lm, se=FALSE, fullrange=TRUE,
    linewidth=2
  ) +
  theme_dsan(base_size=22) +
  theme(
    plot.title = element_text(size=22),
    plot.subtitle = element_text(size=18)
  ) +
  coord_equal() +
  labs(
    title = paste0(
      "Unstratified Slope = ",cprox_slope
    ),
    subtitle=cprox_a_texlabel,
    x = "X", y = "Y", color = "A"
  )

References

Bezuidenhout, Dana, Sarah Forthal, Kara Rudolph, and Matthew R Lamb. 2024. “Single World Intervention Graphs (SWIGs): A Practical Guide.” American Journal of Epidemiology, September, kwae353. https://doi.org/10.1093/aje/kwae353.

Gleijeses, Piero. 2013. Visions of Freedom: Havana, Washington, Pretoria, and the Struggle for Southern Africa, 1976-1991. UNC Press Books.

McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.