Week 6: Causality in Ethics and Policy
DSAN 5450: Data Ethics and Policy
Spring 2024, Georgetown University
Class Sessions
Causal Inference
The Intuitive Problem of Inferring Causality
source("../_globals.r")
library(dplyr)
library(ggplot2)
# Number of lawyers in Georgia, 2000-2009
ga_lawyers <- c(21362, 22254, 23134, 23698, 24367, 24930, 25632, 26459, 27227, 27457)
# Total revenue of ski facilities (million USD), 2000-2009
ski_df <- tibble::tribble(
  ~year, ~varname, ~value,
  2000, "ski_revenue", 1551,
  2001, "ski_revenue", 1635,
  2002, "ski_revenue", 1801,
  2003, "ski_revenue", 1827,
  2004, "ski_revenue", 1956,
  2005, "ski_revenue", 1989,
  2006, "ski_revenue", 2178,
  2007, "ski_revenue", 2257,
  2008, "ski_revenue", 2476,
  2009, "ski_revenue", 2438
)
# Standardize each series (z-scores) so both fit on a shared axis
ski_mean <- mean(ski_df$value)
ski_sd <- sd(ski_df$value)
ski_df <- ski_df %>% mutate(val_scaled = 12*value, val_norm = (value - ski_mean)/ski_sd)
law_df <- tibble::tibble(year=2000:2009, varname="ga_lawyers", value=ga_lawyers)
law_mean <- mean(law_df$value)
law_sd <- sd(law_df$value)
law_df <- law_df %>% mutate(val_norm = (value - law_mean)/law_sd)
spur_df <- dplyr::bind_rows(ski_df, law_df)
# Explicit levels keep each label attached to the right series
# (factor() otherwise sorts levels alphabetically, putting "ga_lawyers" first)
ggplot(spur_df, aes(x=year, y=val_norm, color=factor(varname, levels = c("ski_revenue","ga_lawyers"), labels = c("Ski Revenue","Lawyers in Georgia")))) +
  stat_smooth(method="loess", se=FALSE) +
  geom_point(size=g_pointsize/1.5) +
  labs(
    fill="",
    title="Ski Revenue vs. Georgia Lawyers",
    x="Year",
    color="Correlation: 99.2%",
    linetype=NULL
  ) +
  dsan_theme("custom", 18) +
  scale_x_continuous(
    breaks=seq(from=2000, to=2014, by=2)
  ) +
  #scale_y_continuous(
  #  name="Total Revenue, Ski Facilities (Million USD)",
  #  sec.axis = sec_axis(~ . * law_sd + law_mean, name = "Number of Lawyers in Georgia")
  #) +
  # Left axis: z-scores mapped back to ski revenue; right axis: mapped back to lawyer counts
  scale_y_continuous(breaks = -1:1,
    labels = ~ . * round(ski_sd,1) + round(ski_mean,1),
    name="Total Revenue, Ski Facilities (Million USD)",
    sec.axis = sec_axis(~ . * law_sd + law_mean, name = "Number of Lawyers in Georgia")) +
  expand_limits(x=2010) +
  #geom_hline(aes(yintercept=x, color="Mean Values"), as.data.frame(list(x=0)), linewidth=0.75, alpha=1.0, show.legend = TRUE) +
  scale_color_manual(
    breaks=c('Ski Revenue', 'Lawyers in Georgia'),
    values=c('Ski Revenue'=cbPalette[1], 'Lawyers in Georgia'=cbPalette[2]))
`geom_smooth()` using formula = 'y ~ x'
cor(ski_df$value, law_df$value)
[1] 0.9921178
(Based on Spurious Correlations, Tyler Vigen)
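The plot above puts both series on one axis by converting them to z-scores. As a minimal sketch (the helper name `z_score` is an assumption introduced here, not part of the course code), the following checks that this rescaling leaves the Pearson correlation unchanged, so the normalized plot and the raw `cor()` call describe the same relationship:

```r
# Same two series as on the slide (ski revenue in million USD, lawyer counts)
ski_revenue <- c(1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438)
ga_lawyers <- c(21362, 22254, 23134, 23698, 24367, 24930, 25632, 26459, 27227, 27457)

# Hypothetical helper: subtract the mean, divide by the standard deviation
z_score <- function(x) (x - mean(x)) / sd(x)

# Correlation on the raw values and on the standardized values
r_raw <- cor(ski_revenue, ga_lawyers)
r_norm <- cor(z_score(ski_revenue), z_score(ga_lawyers))

print(c(raw = r_raw, normalized = r_norm))
```

Both numbers should match, since correlation is invariant to shifting and positive rescaling of either variable. Neither number says anything about whether ski revenue and lawyer counts are causally related.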
- This, however, is only a mini-boss. Beyond it lies the truly invincible FINAL BOSS… 🙀
The Fundamental Problem of Causal Inference
The only workable definition of “\(X\) causes \(Y\)”:
- The problem? We live in one world, not two identical worlds simultaneously 😭
What Is To Be Done?
Face Everything And Rise: Controlled, Randomized Experiment Paradigm
- Find good comparison cases: Treatment and Control
- Without a control group, you cannot make inferences!
- Selecting on the dependent variable…
Selecting on the Dependent Variable
- Jeff’s rant: If you care about actually solving social issues, this should infuriate you
Complications: Selection
- TL;DR: Why did this person (unit) end up in the treatment group? Why did this other person (unit) end up in the control group?
- Are there systematic differences?
- “Vietnam” “War” Draft: Why can’t we just study [men who join the military] versus [men who don’t], and take the difference as a causal estimate?
The Solution: Matching
- Controlled experiment: we can ensure (since we have control over the assignment mechanism) the only systematic difference between \(C\) and \(T\) is: \(T\) received treatment, \(C\) did not
- In an observational study, we’re “too late”! Thus, we no longer refer to assignment but to selection
- Our job is to figure out (reverse engineer!) the selection mechanism, then correct for its non-randomness. Spoiler: “transform” observational \(\rightarrow\) experimental via weighting.
- That’s the gold at the end of the rainbow. The rainbow itself is…
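The idea of correcting a non-random selection mechanism by weighting can be sketched with a small simulation. Everything here is hypothetical: the confounder `v`, the selection probabilities, and the outcome rates are made up for illustration, and inverse probability weighting is just one common weighting scheme (the slides do not say which one is meant). The true treatment effect is set to zero, so any nonzero naive difference is pure selection bias.

```r
set.seed(5450)
n <- 200000

# Hypothetical confounder that drives both selection and outcome
v <- rbinom(n, 1, 0.5)

# Selection mechanism: units with v = 1 are less likely to be treated
treated <- rbinom(n, 1, ifelse(v == 1, 0.2, 0.6))

# Outcome depends only on v, so the true effect of treatment is zero
y <- rbinom(n, 1, ifelse(v == 1, 0.1, 0.4))

# Naive comparison of treated vs. untreated units
naive <- mean(y[treated == 1]) - mean(y[treated == 0])

# Estimated probability of treatment within each level of v
p_hat <- ifelse(v == 1, mean(treated[v == 1]), mean(treated[v == 0]))

# Weight each unit by the inverse probability of the group it landed in
w <- ifelse(treated == 1, 1 / p_hat, 1 / (1 - p_hat))
ipw <- weighted.mean(y[treated == 1], w[treated == 1]) -
  weighted.mean(y[treated == 0], w[treated == 0])

print(c(naive = naive, weighted = ipw))
```

Under these assumptions the naive difference should land near 0.125 while the weighted difference should land near 0. The weighting only works here because the selection mechanism depends on an observed variable and every unit has some chance of ending up in either group.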
Do-Calculus
Our Data-Generating Process
- \(Y\): Future success, \(\mathcal{R}_Y = \{0, 1\}\)
- \(E\): Private school education, \(\mathcal{R}_E = \{0, 1\}\)
- \(V\): Born into poverty, \(\mathcal{R}_V = \{0, 1\}\)
The Private School \(\leadsto\) Success Pipeline 🤑
- Sample independent RVs \(U_1 \sim \mathcal{B}(1/2)\), \(U_2 \sim \mathcal{B}(1/3)\), \(U_3 \sim \mathcal{B}(1/3)\)
- \(V \leftarrow U_1\)
- \(E \leftarrow \textsf{if }(V = 1)\textsf{ then } 0\textsf{ else }U_2\)
- \(Y \leftarrow \textsf{if }(V = 1)\textsf{ then }0\textsf{ else }U_3\)
Chalkboard Time…
- \(\Pr(Y = 1) = \; ?\)
- \(\Pr(Y = 1 \mid E = 1) = \; ?\)
Top Secret Answers Slide (Don’t Peek)
- \(\Pr(Y = 1) = \frac{1}{6}\)
- \(\Pr(Y = 1 \mid E = 1) = \frac{1}{3}\)
- \(\overset{✅}{\implies}\) One out of every three private-school graduates is successful, vs. one out of every six graduates overall
- \(\overset{❓}{\implies}\) Private school education doubles likelihood of success!
- The latter is only true if intervening/changing/doing \(E = 0 \leadsto E = 1\) is what moves \(\Pr(Y = 1)\) from \(\frac{1}{6}\) to \(\frac{1}{3}\)!
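The two observational answers can also be checked by simulation. This is a minimal sketch that samples directly from the DGP on the earlier slide; the sample size and seed are arbitrary choices, and the estimates are only approximate.

```r
set.seed(5450)
n <- 1e6

# Independent Bernoulli draws, as in the DGP
u1 <- rbinom(n, 1, 1/2)
u2 <- rbinom(n, 1, 1/3)
u3 <- rbinom(n, 1, 1/3)

# Structural assignments
v <- u1                     # born into poverty
e <- ifelse(v == 1, 0, u2)  # private school education
y <- ifelse(v == 1, 0, u3)  # future success

# Marginal and conditional success rates
p_y <- mean(y)
p_y_given_e <- mean(y[e == 1])

print(c(p_y = p_y, p_y_given_e = p_y_given_e))
```

The estimates should land close to \(\frac{1}{6}\) and \(\frac{1}{3}\). Note that the code never lets \(E\) influence \(Y\): the gap between the two numbers arises because conditioning on \(E = 1\) filters out everyone with \(V = 1\).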
Chalkboard Time 2: Electric Boogaloo
- \(\Pr(Y = 1) = \frac{1}{6}\)
- \(\Pr(Y = 1 \mid E = 1) = \frac{1}{3}\)
- \(\Pr(Y = 1 \mid \textsf{do}(E = 1)) = \; ?\)
- Here, \(\textsf{do}(E = 1)\) means diving into the DGP below the surface and changing it so that \(E = 1\): setting \(E\) to be \(1\), not merely observing that \(E = 1\)
\(\text{DGP}(Y \mid \textsf{do}(E = 1))\)
- Sample independent RVs \(U_1 \sim \mathcal{B}(1/2)\), \(U_2 \sim \mathcal{B}(1/3)\), \(U_3 \sim \mathcal{B}(1/3)\)
- \(V \leftarrow U_1\)
- \(E \leftarrow 1\) (replacing \(E \leftarrow \textsf{if }(V = 1)\textsf{ then } 0\textsf{ else }U_2\))
- \(Y \leftarrow \textsf{if }(V = 1)\textsf{ then }0\textsf{ else }U_3\)
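Intervening on the DGP can be sketched in code by overriding the assignment to \(E\) while leaving every other step alone. The helper `simulate_dgp` and its `do_e` argument are names introduced here for illustration, not part of the course code.

```r
# Hypothetical helper: draw n units from the DGP.
# If do_e is NULL, E follows its usual rule; otherwise E is forced to do_e.
simulate_dgp <- function(n, do_e = NULL) {
  u1 <- rbinom(n, 1, 1/2)
  u2 <- rbinom(n, 1, 1/3)
  u3 <- rbinom(n, 1, 1/3)
  v <- u1
  e <- if (is.null(do_e)) ifelse(v == 1, 0, u2) else rep(do_e, n)
  y <- ifelse(v == 1, 0, u3)
  data.frame(v = v, e = e, y = y)
}

set.seed(5450)
obs <- simulate_dgp(1e6)             # observational world
intv <- simulate_dgp(1e6, do_e = 1)  # world where everyone gets E = 1

# Conditioning (filter rows) vs. intervening (rewrite the DGP)
p_cond <- mean(obs$y[obs$e == 1])
p_do <- mean(intv$y)

print(c(conditional = p_cond, interventional = p_do))
```

Conditioning keeps the original DGP and discards rows, while intervening keeps every row and changes the DGP. Since \(Y\) never reads \(E\), forcing \(E = 1\) should leave the success rate where it started.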
Double Quadruple Secret Answer Slide
- \(\Pr(Y = 1) = \frac{1}{6}\)
- \(\Pr(Y = 1 \mid E = 1) = \frac{1}{3}\)
- \(\Pr(Y = 1 \mid \textsf{do}(E = 1)) = \frac{1}{6}\)
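For reference, the three answers can be worked out directly from the DGP, using only the independence of \(U_1\), \(U_2\), \(U_3\) and the fact that \(Y = 1\) requires \(V = 0\) and \(U_3 = 1\):

\[
\begin{aligned}
\Pr(Y = 1) &= \Pr(U_1 = 0)\Pr(U_3 = 1) = \tfrac{1}{2} \cdot \tfrac{1}{3} = \tfrac{1}{6} \\
\Pr(Y = 1 \mid E = 1) &= \frac{\Pr(Y = 1, E = 1)}{\Pr(E = 1)} = \frac{\Pr(U_1 = 0)\Pr(U_2 = 1)\Pr(U_3 = 1)}{\Pr(U_1 = 0)\Pr(U_2 = 1)} = \frac{1/18}{1/6} = \tfrac{1}{3} \\
\Pr(Y = 1 \mid \textsf{do}(E = 1)) &= \Pr(U_1 = 0)\Pr(U_3 = 1) = \tfrac{1}{2} \cdot \tfrac{1}{3} = \tfrac{1}{6}
\end{aligned}
\]

The third line matches the first because the rule for \(Y\) does not mention \(E\), so replacing the rule for \(E\) changes nothing about how \(Y\) is generated.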
References
Hume, David. 1739. A Treatise of Human Nature: Being an Attempt to Introduce the Experimental Method of Reasoning Into Moral Subjects; and Dialogues Concerning Natural Religion. Longmans, Green.