DSAN 5100: Probabilistic Modeling and Statistical Computing
Section 01
Tuesday, November 18, 2025
\[ \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\bigexp}[1]{\exp\mkern-4mu\left[ #1 \right]} \newcommand{\bigexpect}[1]{\mathbb{E}\mkern-4mu \left[ #1 \right]} \newcommand{\definedas}{\overset{\small\text{def}}{=}} \newcommand{\definedalign}{\overset{\phantom{\text{defn}}}{=}} \newcommand{\eqeventual}{\overset{\text{eventually}}{=}} \newcommand{\Err}{\text{Err}} \newcommand{\expect}[1]{\mathbb{E}[#1]} \newcommand{\expectsq}[1]{\mathbb{E}^2[#1]} \newcommand{\fw}[1]{\texttt{#1}} \newcommand{\given}{\mid} \newcommand{\green}[1]{\color{green}{#1}} \newcommand{\heads}{\outcome{heads}} \newcommand{\iid}{\overset{\text{\small{iid}}}{\sim}} \newcommand{\lik}{\mathcal{L}} \newcommand{\loglik}{\ell} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\minimize}{minimize} \newcommand{\mle}{\textsf{ML}} \newcommand{\nimplies}{\;\not\!\!\!\!\implies} \newcommand{\orange}[1]{\color{orange}{#1}} \newcommand{\outcome}[1]{\textsf{#1}} \newcommand{\param}[1]{{\color{purple} #1}} \newcommand{\pgsamplespace}{\{\green{1},\green{2},\green{3},\purp{4},\purp{5},\purp{6}\}} \newcommand{\pedge}[2]{\require{enclose}\enclose{circle}{~{#1}~} \rightarrow \; \enclose{circle}{\kern.01em {#2}~\kern.01em}} \newcommand{\pnode}[1]{\require{enclose}\enclose{circle}{\kern.1em {#1} \kern.1em}} \newcommand{\ponode}[1]{\require{enclose}\enclose{box}[background=lightgray]{{#1}}} \newcommand{\pnodesp}[1]{\require{enclose}\enclose{circle}{~{#1}~}} \newcommand{\purp}[1]{\color{purple}{#1}} \newcommand{\sign}{\text{Sign}} \newcommand{\spacecap}{\; \cap \;} \newcommand{\spacewedge}{\; \wedge \;} \newcommand{\tails}{\outcome{tails}} \newcommand{\Var}[1]{\text{Var}[#1]} \newcommand{\bigVar}[1]{\text{Var}\mkern-4mu \left[ #1 \right]} \]
Type of Distribution | + | Parameter Values | = | All Possible Info (MGF) |
---|---|---|---|---|
Normal \(~\mathcal{N}(\param{\mu}, \param{\sigma^2})\) | + | \(\param{\mu} = m\), \(\param{\sigma^2} = s^2\) | = | \(M_X(t) = \exp[mt + s^2t^2/2]\) |
Uniform \(~\mathcal{U}(\param{\alpha},\param{\beta})\) | + | \(\param{\alpha} = a\), \(\param{\beta} = b\) | = | \(M_X(t) = \frac{e^{tb} - e^{ta}}{t(b-a)}\) |
Poisson \(~\text{Pois}(\param{\lambda})\) | + | \(\param{\lambda} = \ell\) | = | \(M_X(t) = \exp[\ell(e^t - 1)]\) |
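As a quick sanity check, we can estimate \(\expect{e^{tX}}\) by simulation and compare it against the closed-form MGF from the table. A minimal sketch in R; the parameter values \(m = 1\), \(s = 2\), \(t = 0.3\) are arbitrary choices for illustration:

```r
# Monte Carlo check of the Normal MGF: E[exp(tX)] vs exp(mt + s^2 t^2 / 2)
set.seed(5100)
m <- 1; s <- 2; t <- 0.3
x <- rnorm(1e6, mean = m, sd = s)
mc_estimate <- mean(exp(t * x))            # empirical E[exp(tX)]
closed_form <- exp(m * t + s^2 * t^2 / 2)  # MGF entry from the table
c(mc_estimate, closed_form)                # both approximately 1.616
```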
Instead of trying to estimate all possible info about the underlying distribution, we only estimate one piece of information that we need to test a hypothesis
Example: for populations \(\chi\), \(\psi\), if \(H_A: \mu_\chi > \mu_\psi\), skip intermediate step of estimating distributions \(\mathcal{D}_\chi(\param{\theta})\), \(\mathcal{D}_\psi(\param{\theta})\) from samples \(\mathbf{X}\), \(\mathbf{Y}\)!
Instead, just directly “extract” the greater-than vs. not-greater-than info from \(\mathbf{X}\) and \(\mathbf{Y}\) (for well-behaved \(X\), the median \(m\) always exists as a “pivot”: \(\Pr(X > m) = 0.5\))
It will help us to define \(\text{Sign}(x) = \begin{cases}\phantom{-}1 &\text{if } x > 0 \\ \phantom{-}0 &\text{if }x = 0 \\ -1 &\text{if }x < 0\end{cases}\)
If \(\mathbf{X} = (15, 7, 11, 3)\), \(\mathbf{Y} = (-3, 4, 1, 4)\), we can compute a “greater-than score”:
\[ \begin{align*} \sum_{i=1}^N \sign(X_i - Y_i) &= \sign(15 - (-3)) + \sign(7-4) + \sign(11-1) + \sign(3-4) \\ &= 1 + 1 + 1 + (-1) = 2 \end{align*} \]
\[ \text{GT}_i(X_i, Y_i) = \text{Sign}(X_{i} - Y_{i}) = \begin{cases} \phantom{-}1 &\text{if }X_{i} > Y_{i}, \\ \phantom{-}0 &\text{if }X_{i} = Y_{i}, \\ -1 &\text{if }X_{i} < Y_{i} \end{cases} \]
We can then sum each pair’s score into an aggregate score:
\[ \text{GT}(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^N \text{GT}_i(X_i, Y_i) \]
such that if \(\chi\) and \(\psi\) are in fact the same population, we expect \(\text{GT}(\mathbf{X}, \mathbf{Y}) \approx 0\) (positive and negative values of \(\text{GT}_i\) cancel each other out, for sufficiently large \(N\))
Non-parametric tests which use this value as a test statistic are called Sign tests
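Using the example samples from above, the sign-test score is one line of base R (base R's `sign()` matches the Sign function defined earlier):

```r
# Greater-than score for the example pair of samples
X <- c(15, 7, 11, 3)
Y <- c(-3, 4, 1, 4)
GT <- sum(sign(X - Y))  # 1 + 1 + 1 + (-1)
GT                      # 2
```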
\(X_i\) | \(Y_j\) | \(>\) | \(=\) | \(<\) |
---|---|---|---|---|
5 | 7 | | | +1 |
5 | 6 | | | +1 |
9 | 7 | +1 | | |
9 | 6 | +1 | | |
7 | 7 | | +0.5 | |
7 | 6 | +1 | | |
Final Score: | | \(S_{\mathbf{X}} = 3.5\) | | \(S_{\mathbf{Y}} = 2.5\) |

(the tie at \(7 = 7\) contributes \(+0.5\) to each sample's score)
Values | \(5_X\) | \(6_Y\) | \(7_X\) | \(7_Y\) | \(9_X\) | |
---|---|---|---|---|---|---|
Rank in \(\mathbf{X}\) | \([1]_{X/\mathbf{X}}\) | | \([2]_{X/\mathbf{X}}\) | | \([3]_{X/\mathbf{X}}\) | \(\Sigma_{X/\mathbf{X}} = [6]\) |
Rank in \(\mathbf{Y}\) | | \([1]_{Y/\mathbf{Y}}\) | | \([2]_{Y/\mathbf{Y}}\) | | \(\Sigma_{Y/\mathbf{Y}} = [3]\) |
Rank in \(\mathbf{Z}\) | \([1]_{X/\mathbf{Z}}\) | \([2]_{Y/\mathbf{Z}}\) | \([3.5]_{X/\mathbf{Z}}\) | \([3.5]_{Y/\mathbf{Z}}\) | \([5]_{X/\mathbf{Z}}\) | \(\Sigma_{X/\mathbf{Z}} = [9.5]\), \(\Sigma_{Y/\mathbf{Z}} = [5.5]\) |
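R's `rank()` computes exactly these pooled ranks, assigning tied values their average rank by default. A sketch using the pooled sample \(\mathbf{Z}\) from the table:

```r
# Pooled sample Z: 5, 7, 9 come from X; 6, 7 come from Y
Z <- c(5, 6, 7, 7, 9)
from_X <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
r <- rank(Z)      # ties get the average rank: 1, 2, 3.5, 3.5, 5
sum(r[from_X])    # Sigma_{X/Z} = 1 + 3.5 + 5 = 9.5
sum(r[!from_X])   # Sigma_{Y/Z} = 2 + 3.5 = 5.5
```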
Wales | 78 | 71 | 76 | 75 | 72 | 70 | 68 | 72 | 69 | 73 | 67 | \(\overline{X} \approx 71.91\) |
USA | 75 | 68 | 75 | 69 | 70 | 70 | 72 | 67 | 72 | 72 | 68 | \(\overline{Y} \approx 70.73\) |
library(tidyverse)
# `g_linewidth`, `cbPalette`, and `dsan_theme()` are defined in the course
# setup chunk; minimal fallbacks so this block runs standalone:
if (!exists("g_linewidth")) g_linewidth <- 1
if (!exists("cbPalette")) cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73")
if (!exists("dsan_theme")) dsan_theme <- function(...) theme_classic(base_size = 14)
# Manual (via tibble)
# us_heights <- tibble(height=c(75, 68, 74, 68, 68, 70, 72, 67, 72, 72, 70))
# corrected
us_heights <- tibble(height=c(75, 68, 75, 69, 70, 70, 72, 67, 72, 72, 68))
mean_us <- mean(us_heights$height)
us_heights <- us_heights |> mutate(Team = "US")
wales_heights <- tibble(height=c(78, 71, 76, 75, 72, 70, 68, 72, 69, 73, 67))
mean_wales <- mean(wales_heights$height)
# From csv (alternative source; file lives in the course repo)
# https://www.fifa.com/en/match-centre/match/17/255711/285063/400235455
# team_df <- read_csv("assets/wc_usa_wales.csv")
# team_df |> group_by(team) |> summarize(height_mean = mean(height_in))
mean_df <- tibble(mean_height = c(mean_us, mean_wales), Team = c("US", "Wales"))
wales_heights <- wales_heights |> mutate(Team = "Wales")
players <- bind_rows(us_heights, wales_heights)
ggplot(players, aes(x=height, fill=Team)) +
geom_density(
alpha=0.3, adjust=4/5
) +
geom_vline(
data=mean_df,
aes(xintercept=mean_height, color=Team),
linetype = "dashed",
linewidth = g_linewidth
) +
xlim(c(65,80)) +
dsan_theme("half") +
scale_fill_manual(values=c(cbPalette[1], cbPalette[2])) +
labs(
title = "Empirical Distribution of Team Heights",
x = "Height (inches)",
y = "Empirical Density"
)
Original Data | | Sorted Total Samples | | Rank | |
---|---|---|---|---|---|
USA | Wales | USA | Wales | USA | Wales |
75 | 78 | 67 | 67 | 1.5 | 1.5 |
68 | 71 | 68 | 68 | 4 | 4 |
75 | 76 | 68 | | 4 | |
69 | 75 | 69 | 69 | 6.5 | 6.5 |
70 | 72 | 70 | 70 | 9 | 9 |
70 | 70 | 70 | | 9 | |
72 | 68 | | 71 | | 11 |
67 | 72 | 72 | 72 | 14 | 14 |
72 | 69 | 72 | | 14 | |
72 | 73 | 72 | | 14 | |
68 | 67 | | 72 | | 14 |
 | | | 73 | | 17 |
 | | 75 | 75 | 19 | 19 |
 | | 75 | | 19 | |
 | | | 76 | | 21 |
 | | | 78 | | 22 |
\[ \sum_{i=1}^{N}i = \underbrace{\overbrace{(1 + N)}^{N + 1} + \overbrace{(2 + (N-1))}^{N + 1} + \cdots}_{N/2\text{ terms}} = \frac{N(N+1)}{2} \]
\[ \Sigma_{X/\mathbf{X}} = \Sigma_{Y/\mathbf{Y}} = \sum_{i=1}^{11}i = \frac{11(12)}{2} = 66 \]
\[ \begin{align*} U_X &= 114 - 66 = 48, \; U_Y = 139 - 66 = 73 \\ U &= \min\{U_X, U_Y\} = \min\{48, 73\} = 48 \end{align*} \]
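The same computation in R: pool the two height samples from the table above, sum each team's ranks, and subtract the minimum possible rank sum (\(66\)) from each:

```r
# Mann-Whitney U from rank sums, using the heights from the table above
usa   <- c(75, 68, 75, 69, 70, 70, 72, 67, 72, 72, 68)
wales <- c(78, 71, 76, 75, 72, 70, 68, 72, 69, 73, 67)
pooled_ranks <- rank(c(usa, wales))   # ties get average ranks
R_X <- sum(pooled_ranks[1:11])        # USA rank sum
R_Y <- sum(pooled_ranks[12:22])       # Wales rank sum
U_X <- R_X - 11 * 12 / 2              # subtract min possible rank sum (66)
U_Y <- R_Y - 11 * 12 / 2
U   <- min(U_X, U_Y)
```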
One simulation:
set.seed(5100)
library(tidyverse)
N1 <- 11
N2 <- 11
N <- N1 + N2
totalRankSum <- (N * (N+1)) / 2
s_1 <- runif(N1)
df_1 <- tibble(x = s_1, team = "A")
s_2 <- runif(N2)
df_2 <- tibble(x = s_2, team = "B")
df_combined <- bind_rows(df_1, df_2)
df_combined['rank'] <- rank(df_combined$x)
writeLines(paste0("N1 = ",N1,", N2 = ",N2," => Sum(1...(N1+N2)) = ",totalRankSum))
N1 = 11, N2 = 11 => Sum(1...(N1+N2)) = 253
x | team | rank |
---|---|---|
0.0281644 | B | 1 |
0.0282184 | A | 2 |
0.0730575 | B | 3 |
0.1343398 | B | 4 |
0.1351495 | B | 5 |
0.1858087 | A | 6 |
And we can repeat this process (say) 1000 times:
simulate_ranksums <- function(N1, N2) {
N <- N1 + N2
totalRankSum <- (N * (N+1)) / 2
s_1 <- runif(N1)
df_1 <- tibble(x = s_1, team = "A")
s_2 <- runif(N2)
df_2 <- tibble(x = s_2, team = "B")
df_combined <- bind_rows(df_1, df_2)
df_combined['rank'] <- rank(df_combined$x)
ranksum_df <- df_combined |> group_by(team) |> summarize(ranksum = sum(rank))
return(ranksum_df$ranksum)
}
num_sims <- 1000
results <- replicate(num_sims, simulate_ranksums(11, 11))
t(results[, 1:10])
[,1] [,2]
[1,] 133 120
[2,] 98 155
[3,] 115 138
[4,] 166 87
[5,] 137 116
[6,] 140 113
[7,] 146 107
[8,] 157 96
[9,] 119 134
[10,] 106 147
Averaging each team's ranksum across all simulations (expected value \(11 \cdot 23 / 2 = 126.5\) for each):
[1] 126.407 126.593
# Separate ranksums
ranksum_A <- tibble(ranksum=results[1,], team="A")
ranksum_A_mean <- mean(ranksum_A$ranksum)
ranksum_B <- tibble(ranksum=results[2,], team="B")
ranksum_B_mean <- mean(ranksum_B$ranksum)
sim_df <- bind_rows(ranksum_A, ranksum_B)
# Means
mean_df <- tibble(mean_value = c(ranksum_A_mean, ranksum_B_mean), team=c("A","B"))
mean_center <- (ranksum_A_mean + ranksum_B_mean) / 2
gen_ranksum_plot <- function(radius=Inf) {
ranksum_plot <- ggplot(sim_df, aes(x=ranksum, fill=team)) +
geom_density(linewidth = g_linewidth, alpha=0.333) +
geom_vline(
data=mean_df,
aes(xintercept = mean_value, color=team),
linewidth = g_linewidth
) +
theme_classic(base_size=14) +
scale_fill_manual(values=c(cbPalette[1], cbPalette[2]))
if (radius != Inf) {
ranksum_plot <- ranksum_plot +
xlim(mean_center - radius, mean_center + radius)
}
return(ranksum_plot)
}
gen_ranksum_plot()
wilcox.test in R
Wilcoxon rank sum test with continuity correction
data: height by Team
W = 48, p-value = 0.4262
alternative hypothesis: true location shift is not equal to 0
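The output above can be reproduced with a call along these lines (a sketch: the course version builds `players` via the tibble pipeline shown earlier, but a plain `data.frame` suffices):

```r
# Rank-sum test on the US vs. Wales heights from the table above
usa   <- c(75, 68, 75, 69, 70, 70, 72, 67, 72, 72, 68)
wales <- c(78, 71, 76, 75, 72, 70, 68, 72, 69, 73, 67)
players <- data.frame(
  height = c(usa, wales),
  Team   = rep(c("US", "Wales"), each = 11)
)
# Ties in the pooled sample trigger the normal approximation with
# continuity correction (hence "with continuity correction" in the output)
wilcox.test(height ~ Team, data = players)
```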
\[ H = \frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(N+1) \]
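A minimal sketch of computing \(H\) directly from the formula above and checking it against R's built-in test (the three samples here are made up for illustration; with no ties, the two values agree exactly):

```r
# Kruskal-Wallis H computed by hand, then via kruskal.test
g1 <- c(2.9, 3.0, 2.5, 2.6, 3.2)
g2 <- c(3.8, 2.7, 4.0, 2.4)
g3 <- c(2.8, 3.4, 3.7, 2.2, 2.0)
x   <- c(g1, g2, g3)
grp <- rep(1:3, times = c(5, 4, 5))
N   <- length(x)
R_j <- tapply(rank(x), grp, sum)   # rank sum per group
n_j <- tabulate(grp)               # group sizes
H   <- 12 / (N * (N + 1)) * sum(R_j^2 / n_j) - 3 * (N + 1)
kruskal.test(x, grp)$statistic     # same value as H
```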
kruskal.test in R
DSAN 5100 W12: Non-Parametric Statistics