DSAN 5100: Probabilistic Modeling and Statistical Computing
Section 01
Tuesday, November 4, 2025
\[ \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\bigexp}[1]{\exp\mkern-4mu\left[ #1 \right]} \newcommand{\bigexpect}[1]{\mathbb{E}\mkern-4mu \left[ #1 \right]} \newcommand{\definedas}{\overset{\small\text{def}}{=}} \newcommand{\definedalign}{\overset{\phantom{\text{defn}}}{=}} \newcommand{\eqeventual}{\overset{\text{eventually}}{=}} \newcommand{\Err}{\text{Err}} \newcommand{\expect}[1]{\mathbb{E}[#1]} \newcommand{\expectsq}[1]{\mathbb{E}^2[#1]} \newcommand{\fw}[1]{\texttt{#1}} \newcommand{\given}{\mid} \newcommand{\green}[1]{\color{green}{#1}} \newcommand{\heads}{\outcome{heads}} \newcommand{\iid}{\overset{\text{\small{iid}}}{\sim}} \newcommand{\lik}{\mathcal{L}} \newcommand{\loglik}{\ell} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\minimize}{minimize} \newcommand{\mle}{\textsf{ML}} \newcommand{\nimplies}{\;\not\!\!\!\!\implies} \newcommand{\orange}[1]{\color{orange}{#1}} \newcommand{\outcome}[1]{\textsf{#1}} \newcommand{\param}[1]{{\color{purple} #1}} \newcommand{\pgsamplespace}{\{\green{1},\green{2},\green{3},\purp{4},\purp{5},\purp{6}\}} \newcommand{\pedge}[2]{\require{enclose}\enclose{circle}{~{#1}~} \rightarrow \; \enclose{circle}{\kern.01em {#2}~\kern.01em}} \newcommand{\pnode}[1]{\require{enclose}\enclose{circle}{\kern.1em {#1} \kern.1em}} \newcommand{\ponode}[1]{\require{enclose}\enclose{box}[background=lightgray]{{#1}}} \newcommand{\pnodesp}[1]{\require{enclose}\enclose{circle}{~{#1}~}} \newcommand{\purp}[1]{\color{purple}{#1}} \newcommand{\sign}{\text{Sign}} \newcommand{\spacecap}{\; \cap \;} \newcommand{\spacewedge}{\; \wedge \;} \newcommand{\tails}{\outcome{tails}} \newcommand{\Var}[1]{\text{Var}[#1]} \newcommand{\bigVar}[1]{\text{Var}\mkern-4mu \left[ #1 \right]} \]
Recall that the \(k\)th moment of an RV \(X\) is \(\mu_k = \expect{X^k}\)
e.g., \(\mu_1 = \expect{X}\), \(\mu_2 = \expect{X^2} = \Var{X} + \expect{X}^2\)
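As a quick sanity check (my own addition, not from the lecture), we can verify the second-moment identity by simulation:
set.seed(5100)
x_sim <- rnorm(100000, mean = 2, sd = 3)
mean(x_sim^2)               # second raw moment: ~ Var + mean^2 = 9 + 4 = 13
var(x_sim) + mean(x_sim)^2  # should match, up to sampling error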
Also recall (as I rambled on about) that the MGF encodes all of a distribution's moments, and hence all of the information about the distribution. This means we can estimate a distribution's parameters from data by matching moments:
Define the \(k\)th sample moment of \(\mathbf{X}_N\) to be \(\widehat{\mu}_k = \frac{1}{N}\sum_{i=1}^{N}X_i^k\). Then the equations
\[ \begin{align*} \mu_1(\param{\theta}) &= \widehat{\mu}_1 \\ \mu_2(\param{\theta}) &= \widehat{\mu}_2 \\ &\vdots \\ \mu_m(\param{\theta}) &= \widehat{\mu}_m \end{align*} \]
give us a system of equations (one per unknown parameter in \(\param{\theta}\), so \(m\) equations for \(m\) parameters), which we can solve for the parameters of our distribution!
For example, if \(X_i \iid \text{Bern}(\param{p})\), then \(\mu_1(\param{p}) = \param{p}\), so the single moment equation gives
\[ \mu_1(p) = \widehat{\mu}_1 \iff p = \frac{1}{N}\sum_{i=1}^N X_i^1, \]
\[ \underbrace{p^*_{\text{GMM}}}_{\mathclap{\text{Found using algebra}}} = \frac{1}{N}\sum_{i=1}^N X_i = \underbrace{p^*_{\text{MLE}}}_{\mathclap{\text{Found using calculus}}} \]
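As a minimal numerical sketch (my own, assuming Bernoulli data), the moment equation above just recovers the sample mean, which coincides with the MLE:
set.seed(5100)
x_bern <- rbinom(100, size = 1, prob = 0.3)  # hypothetical Bernoulli(0.3) sample
p_hat <- mean(x_bern)  # first sample moment = method-of-moments estimate = MLE
p_hat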
There’s no such thing as a free lunch.
But modern Machine Learning basically gets us really close to a free lunch
Jeff
|   | Low Variance | High Variance |
|---|---|---|
| Low Bias | (figure) | (figure) |
| High Bias | (figure) | (figure) |
\[ \begin{align*} \Err(x_0) &= \bigexpect{\left. (Y - \widehat{f}(x_0))^2 \,\right|\, X = x_0} \\ &= \sigma^2_{\varepsilon} + \left( \bigexpect{\widehat{f}(x_0)} - f(x_0) \right)^2 + \bigexpect{ \left( \widehat{f}(x_0) - \bigexpect{\widehat{f}(x_0)} \right)^2 } \\ &= \sigma^2_{\varepsilon} + \text{Bias}^2\left(\widehat{f}(x_0)\right) + \bigVar{\widehat{f}(x_0)} \\ &= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}. \end{align*} \]
Figure from Tharwat (2019)
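To make the decomposition concrete, here is a minimal simulation sketch (my own, not from Tharwat 2019) for a deliberately biased estimator of a Normal mean; since we are estimating a parameter rather than predicting a noisy \(Y\), the irreducible-error term drops out, and squared bias plus variance should equal the mean squared error:
set.seed(5100)
mu <- 3; sigma <- 1; n <- 25
# A shrunken (hence biased) estimator of mu: 0.8 times the sample mean
est <- replicate(10000, 0.8 * mean(rnorm(n, mean = mu, sd = sigma)))
bias_sq <- (mean(est) - mu)^2  # ~ (0.2 * 3)^2 = 0.36
variance <- var(est)           # ~ 0.64 * sigma^2 / n = 0.0256
mse <- mean((est - mu)^2)      # ~ bias_sq + variance
c(bias_sq = bias_sq, variance = variance, mse = mse)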
Definition: Efficiency
If \(T_1\) and \(T_2\) are both unbiased estimators for the same parameter \(\theta\) (i.e., if \(\mathbb{E}[T_1] = \theta\) and \(\mathbb{E}[T_2] = \theta\)), then we can compute their relative efficiency as
\[ \text{RE}(T_2, T_1) = \frac{\Var{T_2}}{\Var{T_1}}, \]
and then we say that \(T_2\) is more efficient than \(T_1\) if
\[ \Var{T_2} < \Var{T_1} \iff \text{RE}(T_2, T_1) < 1. \]
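As a hedged illustration (not from the slides): for Normal data both the sample mean and the sample median are unbiased for \(\mu\), so we can estimate their relative efficiency by simulation; theory says \(\text{RE}(\text{median}, \text{mean}) \approx \pi/2 \approx 1.57 > 1\), i.e., the mean is the more efficient of the two.
set.seed(5100)
n <- 100
sims <- replicate(10000, {
  x <- rnorm(n, mean = 0, sd = 1)
  c(mean = mean(x), median = median(x))
})
var(sims["median", ]) / var(sims["mean", ])  # RE(median, mean), roughly pi/2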
Consider the following dataset:
x <- seq(from = 0, to = 1, by = 0.1)
n <- length(x)
eps <- rnorm(n, 0, 0.04)
y <- x + eps
# But make one big outlier
midpoint <- ceiling((3/4)*n)
y[midpoint] <- 0
of_data <- tibble::tibble(x=x, y=y)
# Linear model
lin_model <- lm(y ~ x, data = of_data)
# But now polynomial regression
poly_model <- lm(y ~ poly(x, degree = 10, raw = TRUE), data = of_data)
ggplot(of_data, aes(x = x, y = y)) +
geom_point(size = g_pointsize / 1.5) +
dsan_theme("full")
Fitting a linear model gives us:
ggplot(of_data, aes(x = x, y = y)) +
geom_point(size = g_pointsize / 1.5) +
geom_smooth(aes(color="Linear"), method = lm, se = FALSE, show.legend=FALSE) +
# geom_abline(aes(intercept = 0, slope = 1, color = "Linear"), linewidth = 1, show.legend = FALSE) +
# stat_smooth(
# method = "lm",
# formula = y ~ poly(x, 10, raw = TRUE),
# se = FALSE, aes(color = "Polynomial")
# ) +
dsan_theme("full")
## Part 1: Set up data
library(dplyr)
library(ggplot2)
library(tibble)
# Draw num_subsamples subsamples (with replacement) of size subsample_size
# from obs_data, fit y ~ x to each, and collect the fitted intercepts/slopes
gen_subsamples <- function(obs_data, num_subsamples, subsample_size) {
  subsample_ints <- c()
  subsample_coefs <- c()
  for (i in 1:num_subsamples) {
    cur_subsample <- obs_data |> sample_n(subsample_size, replace = TRUE)
    cur_lin_model <- lm(y ~ x, data = cur_subsample)
    cur_int <- cur_lin_model$coefficients[1]
    subsample_ints <- c(subsample_ints, cur_int)
    cur_coef <- cur_lin_model$coefficients[2]
    subsample_coefs <- c(subsample_coefs, cur_coef)
  }
  subsample_df <- tibble(intercept = subsample_ints, coef = subsample_coefs)
  return(subsample_df)
}
num_subsamples <- 50
subsample_size <- floor(nrow(of_data) / 2)
subsample_df <- gen_subsamples(of_data, num_subsamples, subsample_size)
full_model <- lm(y ~ x, data = of_data)
full_int <- full_model$coefficients[1]
full_coef <- full_model$coefficients[2]
full_df <- tibble(intercept=full_int, coef=full_coef)
mean_df <- tibble(
intercept=mean(subsample_df$intercept),
coef = mean(subsample_df$coef)
)
## Part 2: Plot
ggplot(of_data, aes(x = x, y = y)) +
geom_point(size=g_pointsize) +
# The random lines
geom_abline(data = subsample_df, aes(slope = coef, intercept = intercept, color='Subsample Model'), linewidth=g_linewidth, linetype="solid", alpha=0.25) +
# The original regression line
geom_abline(data=full_df, aes(slope = coef, intercept = intercept, color='Full-Data Model'), linewidth=2*g_linewidth) +
# The average of the random lines
#geom_abline(data=mean_df, aes(slope = coef, intercept = intercept, color='mean'), linewidth=2*g_linewidth) +
labs(
title = paste0("Linear Models for ", num_subsamples, " Subsamples of Size n = ", subsample_size),
color = element_blank()
) +
dsan_theme("full") +
theme(
legend.title = element_blank(),
legend.spacing.y = unit(0, "mm")
)
x <- seq(from = 0, to = 1, by = 0.1)
n <- length(x)
eps <- rnorm(n, 0, 0.04)
y <- x + eps
robust_data <- tibble(x = x, y = y)
num_subsamples_robust <- 30
subsample_size_robust <- 5
robust_sub_df <- gen_subsamples(robust_data, num_subsamples_robust, subsample_size_robust)
full_model_robust <- lm(y ~ x, data = robust_data)
full_int_robust <- full_model_robust$coefficients[1]
full_coef_robust <- full_model_robust$coefficients[2]
full_df_robust <- tibble(intercept = full_int_robust, coef = full_coef_robust)
ggplot(robust_data, aes(x = x, y = y)) +
geom_point(size=g_pointsize) +
# The random lines
geom_abline(data = robust_sub_df, aes(slope = coef, intercept = intercept, color='Subsample Model'), linewidth=g_linewidth, linetype="solid", alpha=0.25) +
# The original regression line
geom_abline(data=full_df_robust, aes(slope = coef, intercept = intercept, color='Full-Data Model'), linewidth=2*g_linewidth) +
# The average of the random lines
#geom_abline(data=mean_df, aes(slope = coef, intercept = intercept, color='mean'), linewidth=2*g_linewidth) +
labs(
title = paste0("Linear Models for ", num_subsamples, " Subsamples of Size n = ", subsample_size),
color = element_blank()
) +
dsan_theme("full") +
theme(
legend.title = element_blank(),
legend.spacing.y = unit(0, "mm")
)
Here the model is not "misled" by outliers, since this dataset contains none: the subsample fits stay close to the full-data fit.
\[ \begin{align*} \widetilde{X}_1 &= \{x_2, x_4, x_5, x_7, x_9\} \\ \widetilde{X}_2 &= \{x_2, x_3, x_4, x_7, x_{10}\} \\ &~\vdots \\ \widetilde{X}_{100} &= \{x_3, x_3, x_7, x_8, x_8\} \end{align*} \]
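A minimal sketch (variable names are mine) of drawing such with-replacement resamples in R:
set.seed(5100)
x_obs <- rnorm(10)  # hypothetical observed sample of size 10
# 100 resamples of size 5, each drawn with replacement from x_obs
resamples <- replicate(100, sample(x_obs, size = 5, replace = TRUE))
dim(resamples)  # 5 rows (one resample per column), 100 columns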
Answer: Absurdly, unreasonably well.
pop <- rnorm(1000000, mean = 3, sd = 1)
# Sampling 1k times
rand_samples <- replicate(
1000,
sample(pop, size=100, replace = FALSE)
)
sample_means <- colMeans(rand_samples)
sample_df <- tibble(est = sample_means, Method = "1000 Samples")
# Sampling 1 time and bootstrapping
bs_sample <- sample(pop, size = 100, replace = FALSE)
subsamples <- replicate(1000, sample(bs_sample, size=100, replace = TRUE))
bs_means <- colMeans(subsamples)
bs_df <- tibble(est = bs_means, Method = "Bootstrap (1 Sample)")
result_df <- bind_rows(sample_df, bs_df)
sim_dnorm <- function(x) dnorm(x, mean = 3, sd = 1)
ggplot(result_df, aes(x=est, fill=Method)) +
dsan_theme("full") +
geom_density(alpha=0.2, linewidth=g_linewidth) +
geom_vline(aes(xintercept=3, linetype="value"), linewidth=g_linewidth) +
scale_linetype_manual("", values=c("value"="dashed"), labels=c("Population Mean")) +
theme(
legend.title = element_blank(),
legend.spacing.y = unit(0, "mm")
) +
labs(
title = "Bootstrap vs. Multiple Samples",
x = "Sample / Subsample Means",
y = "Density"
)
sample_est <- mean(sample_means)
sample_str <- sprintf("%.3f", sample_est)
sample_err <- abs(sample_est - 3)
sample_err_str <- sprintf("%.3f", sample_err)
sample_output <- paste0("1K samples estimate: ", sample_str, " (abs. err: ", sample_err_str, ")")
bs_est <- mean(bs_means)
bs_str <- sprintf("%.3f", bs_est)
bs_err <- abs(bs_est - 3)
bs_err_str <- sprintf("%.3f", bs_err)
bs_output <- paste0("Bootstrap estimate: ", bs_str, " (abs. err: ", bs_err_str, ")")
writeLines(paste0(sample_output,"\n",bs_output))
1K samples estimate: 3.001 (abs. err: 0.001)
Bootstrap estimate: 3.025 (abs. err: 0.025)
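Beyond a point estimate, the bootstrap distribution also gives interval estimates; a minimal percentile-interval sketch (my own addition) using the bs_means computed above:
# 95% percentile interval from the bootstrap distribution of the mean
quantile(bs_means, probs = c(0.025, 0.975))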
| (New York City) | Public Schools | Charter Schools | Total |
|---|---|---|---|
| Enrolled Students | \(N_\text{pub} \approx 893000\) | \(N_\text{ch} \approx 157000\) | \(N = 1050649\) |
| Yearly Dropouts | \(D_\text{pub} \approx 24500\) | \(D_\text{ch} \approx 16000\) | \(D \approx 44000\) |
| Yearly Dropout Rate | \(R_\text{pub} \approx 2.7\%\) | \(R_\text{ch} \approx 10\%\) | \(R \approx 4.2\%\) |
| \(\mathbb{E}[D \mid R = 4.2\%]\) | \(\mathbb{E}[D_\text{pub}] = 37506\) | \(\mathbb{E}[D_\text{ch}] = 6594\) | \(\mathbb{E}[\Delta] = 30912\) |
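The last row of the table is just the pooled 4.2% dropout rate applied to each school type's enrollment; as a quick arithmetic check:
# E[D | R = 4.2%]: pooled dropout rate times enrollment
R_pool <- 0.042
R_pool * 893000             # E[D_pub] = 37506
R_pool * 157000             # E[D_ch]  = 6594
R_pool * (893000 - 157000)  # E[Delta] = 30912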
1 0 0 0 1 0 0 0 0 1 => 3 heads
library(tidyverse)
library(extraDistr)  # provides rbern() for Bernoulli draws
num_replications <- 1000
num_flips <- 10  # flips per trial
p <- 0.5         # Pr(heads) on each flip
# Each column of coin_seqs is one 10-flip trial (1 = heads, 0 = tails)
coin_seqs <- replicate(num_replications, rbern(num_flips, p))
heads_per_seq <- colSums(coin_seqs)
heads_df <- tibble(num_heads = heads_per_seq)
highlight_3 <- c(rep("grey", 2), rep(cbPalette[1], 1), rep("grey", 7))
ggplot(heads_df, aes(x=factor(num_heads))) +
geom_histogram(stat='count', fill=highlight_3) +
dsan_theme("quarter") +
labs(
title=paste0("Results From N=",num_replications," 10-Coin-Flip Trials")
)
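The printed value below is, I assume, the share of the 1,000 trials that landed exactly 3 heads (the exact Binomial probability is dbinom(3, 10, 0.5) ≈ 0.117); it could be computed as something like:
# Proportion of simulated trials with exactly 3 heads (hypothetical reconstruction)
mean(heads_per_seq == 3)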
[1] 0.107
89 is a prime number, so if someone asks you to justify it, you can stare at them meaningfully and incant, “why, because it’s prime, of course!” That’s no worse than the conventional justification for 95%. (McElreath 2020, 88)
# The true population sizes
N_ch <- 157000
N_pub <- 893000
# We'll use a function to compute num_sims simulations of a school with N_j
# students, each with dropout probability p
simulate_dropouts <- function(num_sims, N_j, p) {
return(rbinom(num_sims, N_j, p))
}
# We're simulating "null world" by generating data
# on the basis of the assumption p_pub_sim = p_ch_sim
# So we call the function with the same p for both
# simulated schools. In this case, we'll use the
# *overall average* dropout rate across all schools:
# 4.2% = 0.042
p_sim <- 0.042
# But we'll write a general function for simulating
# pairs of schools, whether or not they have the same
# Pr(dropout):
simulate_pair <- function(num_sims, p_ch, p_pub) {
D_ch <- simulate_dropouts(num_sims, N_ch, p_ch)
D_pub <- simulate_dropouts(num_sims, N_pub, p_pub)
sim_df <- tibble(D_ch=D_ch, D_pub=D_pub)
return(sim_df)
}
# Run the simulation, with the *same* p parameter
# for both our simulated schools:
run_sims_same_p <- function(num_sims, printResults=FALSE) {
sim_df <- simulate_pair(num_sims, p_sim, p_sim)
# And from these two counts, we can compute the
# *test statistic*: in this case, the difference
sim_df <- sim_df |> mutate(
test_stat = D_pub - D_ch
)
return(sim_df)
}
small_sim_df <- run_sims_same_p(3)
small_sim_df
| D_ch | D_pub | test_stat |
|---|---|---|
| 6528 | 37395 | 30867 |
| 6606 | 37287 | 30681 |
| 6607 | 37469 | 30862 |
num_trials <- 10000000
bw <- 100
sim_df <- run_sims_same_p(num_trials)
# And plot the values of our test statistic
null_dist_plot <- ggplot(sim_df, aes(x=test_stat)) +
geom_histogram(binwidth=bw) +
geom_density(
aes(y = bw * after_stat(count)),
linewidth = g_linewidth,
fill = cbPalette[1],
alpha = 0.333
) +
dsan_theme() +
labs(
x = "Test Statistic (D_pub - D_ch)",
y = "Count",
title = paste0("(Public Dropouts - Charter Dropouts), ",format(num_trials,big.mark=' ', scientific=FALSE)," Simulations")
)
null_dist_plot
library(scales)
null_obs_plot <- ggplot(sim_df, aes(x=test_stat)) +
geom_histogram(binwidth=bw) +
geom_density(
aes(y = (bw/3) * after_stat(count)),
linewidth = g_linewidth,
fill = cbPalette[1],
alpha = 0.333
) +
# Red dashed line: the observed difference, 24,500 - 16,000 = 8,500 dropouts
geom_vline(
xintercept = 8500,
color='red',
linewidth = g_linewidth,
linetype = 'dashed'
) +
scale_x_continuous(breaks=seq(from=5000, to=35000, by=5000), limits=c(5000, 35000)) +
scale_y_continuous(labels = label_number(big.mark=' ')) +
dsan_theme() +
labs(
x = "Test Statistic (D_pub - D_ch)",
y = "Count",
title = paste0("(Public Dropouts - Charter Dropouts), ",format(num_trials,big.mark=' ',scientific=FALSE)," Simulations")
)
null_obs_plot
mean(sim_df$test_stat)
# Empirical CDF: proportion of simulated test statistics strictly below v
cdf_at_val <- function(v) {
lt_val_df <- sim_df |> filter(test_stat < v)
prop_lt <- nrow(lt_val_df) / nrow(sim_df)
return(prop_lt)
}
writeLines(format(cdf_at_val(31000), scientific=FALSE))
writeLines(format(cdf_at_val(30500), scientific=FALSE))
cdf_val <- cdf_at_val(30000)
cdf_val_fmt <- format(cdf_val, scientific=FALSE)
lt30k_df <- sim_df |> filter(test_stat < 30000)
num_lt30k <- nrow(lt30k_df)
num_total <- nrow(sim_df)
num_total_fmt <- format(num_total, big.mark=',')
writeLines(paste0(cdf_val_fmt," = ",num_lt30k," / ",num_total_fmt))
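Note that cdf_at_val() is just the empirical CDF of the simulated test statistics evaluated at v; an equivalent one-liner would be:
# Same quantity as cdf_at_val(30000): share of simulations below 30,000
mean(sim_df$test_stat < 30000)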
min(sim_df$test_stat)
[1] 30911.98   # mean(sim_df$test_stat)
0.6650288   # cdf_at_val(31000)
0.0222126   # cdf_at_val(30500)
0.0000037 = 37 / 10,000,000   # cdf_at_val(30000), computed above
[1] 29879   # min(sim_df$test_stat)
(From The Guardian, 2016 Feb 21)
DSAN 5100-01 W11: Method of Moments and Bootstrap