Week 6: Bayesian Workflow, Midterm Pre-Review

DSAN 5650: Causal Inference for Computational Social Science
Summer 2025, Georgetown University

Author: Jeff Jacobs

Published: Wednesday, June 25, 2025

Open slides in new window →

Logistics / Table-Setting

  • The midterm will be a 27-hour take-home exam
  • A “Midterm” folder will magically appear on guhub.io at 9pm EDT on Wednesday, July 2 (immediately after lecture ends), due 11:59pm EDT on Thursday, July 3
  • The lecture recording will be a video resource for you, providing all the necessary background for the midterm questions (like the intro to Afrobarometer last week!)

From HWs to Midterm

  • HWs: More focus on social science / more “in the weeds” (checking details, manipulating prior parameters, debugging, etc.)
  • Midterm: More focus on causality, because in ~2 hours I can more easily test you on concepts like “What are the backdoor paths here? How would you close them?” than on loading datasets, cleaning, fitting models, etc.

Looking Forward (Post-Midterm)

  • More text analysis!
  • Making the connection between modeling and predicting

Bayesian Workflow

(There are a lot of words in HW2 that I haven’t had the chance to explain yet!)

  • Prior
  • Prior Predictive
  • Posterior
  • Posterior Predictive
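
As a quick map of how these four terms fit together, here is a minimal sketch using a coin-flip (Beta-Binomial) model; the Beta(2, 2) prior and the “7 heads in 10 flips” data are illustrative assumptions, not anything from HW2:

Code
set.seed(5650)
a0 <- 2; b0 <- 2         # assumed prior: theta ~ Beta(2, 2)
heads <- 7; flips <- 10  # assumed observed data: 7 heads in 10 flips
theta_prior <- rbeta(1000, a0, b0)                         # Prior: Pr(theta)
x_prior     <- rbinom(1000, flips, theta_prior)            # Prior predictive: synthetic data, before observing anything
theta_post  <- rbeta(1000, a0 + heads, b0 + flips - heads) # Posterior: Pr(theta | X = x), via Beta-Binomial conjugacy
x_post      <- rbinom(1000, flips, theta_post)             # Posterior predictive: new data, after fitting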

Modeling How Trees Become Forests

  • Your model, if it’s generative (which… nearly all in 5650 are), relates observed data D to a set of underlying parameters θ that are hypothesized as “giving rise” to D
  • When you specify how exactly this “giving rise” works, you’re specifying a DGP!
  • PGMs: human-brain-friendly (because graphical) language for writing DGPs; then move to PyAgrum / Stan / PyMC to “encode” PGMs (in 0100101) so the computer can…
      Super-charge your EDA/modeling ⇒ Prior distributions
      Estimate θ from data ⇒ Posterior distributions
Code
library(tidyverse)
library(ggExtra)
# g_linewidth and theme_dsan() come from the course's shared setup code
gen_walk_plot <- function(walk_data, a = 0.0075) {
  grid_color <- rgb(0, 0, 0, 0.1)
  walkplot <- ggplot() +
    # One faint black line per walker; overlap makes dense regions darker
    geom_line(
      data = walk_data$long_df,
      aes(x = t, y = pos, group = pid),
      linewidth = g_linewidth,
      alpha = a,
      color = "black"
    ) +
    # Invisible endpoints, so ggMarginal() can histogram final positions
    geom_point(
      data = walk_data$end_df,
      aes(x = t, y = endpos),
      alpha = 0
    ) +
    scale_x_continuous(
      breaks = seq(0, walk_data$num_steps, walk_data$num_steps / 4)
    ) +
    scale_y_continuous(breaks = seq(-20, 20, 10)) +
    theme_dsan(base_size = 24) +
    theme(
      legend.position = "none",
      panel.grid.major.y = element_line(
        color = grid_color,
        linewidth = 1,
        linetype = 1
      )
    ) +
    labs(
      title = paste0(
        walk_data$num_people, " Random Walks, ",
        walk_data$num_steps, " Steps"
      ),
      x = "Number of Steps",
      y = "Position"
    )
  # Return the assembled plot
  walkplot
}
walk_data <- readRDS("assets/walk_data.rds")
wp <- gen_walk_plot(walk_data) + ylim(-30, 30)
# Attach a marginal histogram of ending positions along the y-axis
ggMarginal(
  wp, margins = "y",
  type = "histogram",
  yparams = list(binwidth = 1)
)
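
Aside: walk_data is loaded from an .rds file above, so the generating code isn't shown. Here is a hedged sketch of how such an object could be built; the list fields (long_df, end_df, num_people, num_steps) are exactly what gen_walk_plot() expects, and everything else is an assumption:

Code
library(tidyverse)
set.seed(5650)
gen_walk_data <- function(num_people = 500, num_steps = 16) {
  # Each person takes one +1/-1 step per time point; position is the running sum
  long_df <- expand_grid(pid = 1:num_people, t = 1:num_steps) |>
    group_by(pid) |>
    mutate(pos = cumsum(sample(c(-1, 1), num_steps, replace = TRUE))) |>
    ungroup()
  # Each walker's final position, used for the marginal histogram
  end_df <- long_df |>
    group_by(pid) |>
    slice_max(t) |>
    ungroup() |>
    select(pid, t, endpos = pos)
  list(long_df = long_df, end_df = end_df,
       num_people = num_people, num_steps = num_steps)
}
# walk_data <- gen_walk_data(num_people = 500, num_steps = 16)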

Prior “Stage” Distributions

  • RVs θ = params of your model, RVs X = data. Generative model: θ → X
  • Enter Prior World (before observing any data): since we have priors over θ, we can sample values of θ, then generate synthetic data on the basis of these values (sketched below)
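
For instance, a minimal sketch of Prior World for a height model; all the specific numbers here are illustrative guesses, not estimates:

Code
set.seed(5650)
# Step 1: sample parameter values theta = (mu, sigma) from their priors
mu_draws    <- rnorm(100, mean = 170, sd = 10)  # average height, in cm
sigma_draws <- rexp(100, rate = 1 / 8)          # spread; Exponential keeps it positive
# Step 2: run each sampled theta through the likelihood to get synthetic data
height_draws <- rnorm(100, mean = mu_draws, sd = sigma_draws)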

Prior Distribution: Pr(θ)

What can I guess about values of my parameters from background knowledge of the world? e.g.:

  • Human heights can’t be negative
  • Data collected at a bar ⇒ Age ≥ 18, with Pr(Age = x) decreasing as x goes above 30 (one possible encoding is sketched below)
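
A hedged way to encode that second guess in code: a prior with zero density below 18 and a long right tail thinning out past 30 (the Gamma shape and rate here are illustrative, not calibrated):

Code
library(tidyverse)
set.seed(5650)
# Shifted Gamma: nobody under 18, mode in the early 20s, decaying past 30
age_draws <- 18 + rgamma(10000, shape = 2, rate = 0.4)
tibble(age = age_draws) |>
  ggplot(aes(x = age)) +
  geom_histogram(binwidth = 1)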

Prior Predictive Distribution: Pr(X | θ_prior)

What could the outcomes look like if I ran my guesses through the DGP?

100 simulated heights, none are negative

1K sim bar-goers; 80% have this haircut
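
A sketch of that second check; the Beta(8, 2) prior, with mean 0.8, is an assumed stand-in for “this haircut is pretty common at this bar”:

Code
set.seed(5650)
theta_haircut <- rbeta(1, 8, 2)              # one draw from the assumed prior
bar_goers <- rbinom(1000, 1, theta_haircut)  # 1K simulated bar-goers
mean(bar_goers)                              # share with the haircut, around 0.8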

Posterior “Stage” Distributions

  • RVs θ = params of your model, RVs X = data. Generative model: θ → X

Posterior Distribution: Pr(θ | X = x)

Now we observe data X = x, which means we can use Bayes’ Rule to infer a distribution over θ: what values are most likely to have produced x?

  • 60% of observed bar-goers have that haircut ⇒ θ_post ≈ (0.6 + 0.8)/2 = 0.7
  • Coin = H 4/5 times ⇒ θ = p ≈ (0.5 + 0.8)/2 = 0.65
  • Coin = H 5/5 times ⇒ θ = p ≈ (0.5 + 1.0)/2 = 0.75
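
The (prior + data)/2 arithmetic above is a quick heuristic; the exact conjugate computation, under an assumed Beta(3, 3) prior (mean 0.5), lands close by:

Code
a0 <- 3; b0 <- 3        # assumed prior: Beta(3, 3), mean 0.5
heads <- 4; flips <- 5  # Coin = H 4/5 times
(a0 + heads) / (a0 + b0 + flips)  # posterior mean: 7/11 ~ 0.64 (vs. heuristic 0.65)
# And for H 5/5 times: (3 + 5) / (3 + 3 + 5) = 8/11 ~ 0.73 (vs. heuristic 0.75)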

Posterior Predictive Distribution: Pr(X | θ_post)

Now that we’ve fit Pr(θ) to data, we can generate as much new data as we want, e.g. to evaluate how well the model predicts outcomes for test data (a quick sketch follows)
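
Continuing the 4/5-heads example, a minimal posterior predictive sketch (same assumed Beta(3, 3) prior):

Code
set.seed(5650)
theta_post <- rbeta(1000, 3 + 4, 3 + 1)  # posterior: Beta(7, 4), by conjugacy
x_new <- rbinom(1000, 5, theta_post)     # heads in 5 *new* flips, per posterior draw
table(x_new) / 1000                      # predictive distribution, to compare against test data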

Why Do We Need “Subjective” Priors?

  • Under frequentism… (tl;dr) there is literally no method for dealing with shades of uncertainty
  • Frequentist assumption: for a given coin, Pr(Heads) is not a Random Variable! It’s some number, like 0.5 or 0.8: the limit of #Heads / (#Heads + #Tails) as n → ∞
  • So, if we flip the coin 1 time and get Heads, there’s no basis for inferring Pr(Coin is Fair) vs. Pr(Coin is Biased): both are undefined. We’ve… “discovered” that Pr(Heads) = 1
  • By using Bayesian inference, we can bring prior knowledge into our studies, which we’ll need to do especially for complex emergent systems: societies, economies, etc.!
  • In fact, as a guy named Laplace discovered back in the 18th century, frequentism = Bayesian inference with “flat priors” (demonstrated below)
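
To see the equivalence concretely: under a flat Beta(1, 1) prior, one observed Heads gives a posterior whose mode is exactly the frequentist “discovery” Pr(Heads) = 1, while even a mildly informative prior tempers that single data point (the Beta(3, 3) choice is illustrative):

Code
x <- seq(0, 1, 0.01)
flat_post <- dbeta(x, 1 + 1, 1 + 0)  # flat prior + 1 Heads => Beta(2, 1)
x[which.max(flat_post)]              # mode at 1: "we've discovered Pr(Heads) = 1"
inf_post <- dbeta(x, 3 + 1, 3 + 0)   # Beta(3, 3) prior + 1 Heads => Beta(4, 3)
x[which.max(inf_post)]               # mode at 0.6: prior knowledge still matters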

Flat vs. Informative Priors

Code
library(tidyverse)
library(latex2exp)
# cb_palette, g_linewidth, and theme_dsan() come from the course's shared setup

# Panel 1: the "flat" prior on Pr(Heads), drawn as a constant line (at height 0 here)
flat_df <- tibble(x = seq(0, 1, 0.1), y = 0)
flat_df |> ggplot(aes(x = x, y = y)) +
  geom_line(color = cb_palette[1], linewidth = g_linewidth) +
  ylim(0, 1) +
  labs(title = "Flat Prior on Pr(Heads)", y = "Density") +
  theme_dsan(base_size = 28) +
  theme(title = element_text(size = 20))

# Panel 2: the observed data, a single flip landing Heads
data_df <- tibble(x = 1, y = 1)
data_df |> ggplot(aes(x = x, y = y)) +
  geom_point(size = 5) +
  geom_segment(x = 1, xend = 1, y = 0, yend = 1, linewidth = g_linewidth) +
  xlim(0, 1) +
  ylim(0, 1) +
  labs(title = "Observed Data", y = "Density") +
  theme_dsan(base_size = 28) +
  theme(title = element_text(size = 20))

# Panel 3: the resulting posterior, a point mass (width 1/n, height n) at 1
w_label <- TeX("Width = $1/n$")
h_label <- TeX("Height = $n$")
data_df |> ggplot(aes(x = x, y = y)) +
  geom_segment(
    x = 1, xend = 1, y = 0, yend = 1, linewidth = g_linewidth,
    color = cb_palette[1], arrow = arrow()
  ) +
  geom_segment(
    x = 0, xend = 1, y = 0, yend = 0, linewidth = g_linewidth,
    color = cb_palette[1]
  ) +
  xlim(0, 1) +
  ylim(0, 1) +
  labs(title = "Posterior of Pr(Heads)", y = "Density") +
  theme_dsan(base_size = 28) +
  theme(title = element_text(size = 20)) +
  annotate(
    geom = "text", x = 0.5, y = 0.8,
    label = w_label, hjust = 0, vjust = 1, size = 8
  ) +
  annotate(
    geom = "text", x = 0.5, y = 0.7,
    label = h_label, hjust = 0, vjust = 1, size = 8
  )

Code
library(tidyverse)

# Panel 1: a proper Uniform(0, 1) prior on Pr(Heads), density 1 everywhere
unif_df <- tibble(x = seq(0, 1, 0.1), y = 1)
unif_df |> ggplot(aes(x = x, y = y)) +
  geom_line(color = "#e69f00", linewidth = g_linewidth) +
  annotate('rect', xmin = 0, xmax = 1, ymin = 0, ymax = 1,
           fill = '#e69f00', alpha = 0.3) +
  xlim(0, 1) + ylim(0, 1) +
  labs(title = "Uniform Prior on Pr(Heads)") +
  theme_dsan(base_size = 28) +
  theme(title = element_text(size = 20))

# Panel 2: the same observed data, a single flip landing Heads
data_df <- tibble(x = 1, y = 1)
data_df |> ggplot(aes(x = x, y = y)) +
  geom_point(size = 5) +
  geom_segment(x = 1, xend = 1, y = 0, yend = 1, linewidth = g_linewidth) +
  xlim(0, 1) +
  ylim(0, 1) +
  labs(title = "Observed Data") +
  theme_dsan(base_size = 28) +
  theme(title = element_text(size = 20))

# Panel 3: now the posterior is a smooth density rather than a point mass
x_vals <- seq(0, 1, 0.01)
my_exp <- function(x) exp(1 - 1 / (x^2))  # illustrative posterior shape
y_vals <- sapply(x_vals, my_exp)
curve_df <- tibble(x = x_vals, y = y_vals)
rib_df <- tibble(x = x_vals, ymax = y_vals, ymin = 0)
ggplot() +
  geom_line(
    data = curve_df,
    aes(x = x, y = y),
    linewidth = g_linewidth, color = cb_palette[1]
  ) +
  geom_ribbon(
    data = rib_df,
    aes(x = x, ymin = ymin, ymax = ymax),
    fill = cb_palette[1], alpha = 0.3
  ) +
  xlim(0, 1) +
  ylim(0, 1) +
  labs(title = "Posterior of Pr(Heads)") +
  theme_dsan(base_size = 28) +
  theme(title = element_text(size = 20))

Lab Time!
