DSAN 5000: Data Science and Analytics
Thursday, October 10, 2024
dataset | x_mean | y_mean |
I | 9.00 | 7.50 |
II | 9.00 | 7.50 |
III | 9.00 | 7.50 |
IV | 9.00 | 7.50 |
dataset | variable | x | y |
I | x | 1.00 | 0.82 |
I | y | 0.82 | 1.00 |
II | x | 1.00 | 0.82 |
II | y | 0.82 | 1.00 |
III | x | 1.00 | 0.82 |
III | y | 0.82 | 1.00 |
IV | x | 1.00 | 0.82 |
IV | y | 0.82 | 1.00 |
Dataset I R^2 = 0.67 |
coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 3 | 1.12 | 2.67 | 0.03 | 0.46 | 5.54 |
x | 0.5 | 0.12 | 4.24 | 0 | 0.23 | 0.77 |
Dataset II R^2 = 0.67 |
coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 3 | 1.12 | 2.67 | 0.03 | 0.46 | 5.54 |
x | 0.5 | 0.12 | 4.24 | 0 | 0.23 | 0.77 |
Dataset III R^2 = 0.67 |
coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 3 | 1.12 | 2.67 | 0.03 | 0.46 | 5.54 |
x | 0.5 | 0.12 | 4.24 | 0 | 0.23 | 0.77 |
Dataset IV R^2 = 0.67 |
coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 3 | 1.12 | 2.67 | 0.03 | 0.46 | 5.54 |
x | 0.5 | 0.12 | 4.24 | 0 | 0.23 | 0.77 |
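A minimal sketch of how summaries like these can coincide, assuming the standard published Anscombe values for datasets I and II (the raw data are not shown on these slides). The helper name `summarize` is made up for illustration; it computes the means, the Pearson correlation, and the least-squares line by hand.

```python
from math import sqrt

# Standard published Anscombe values (assumption: not taken from these slides).
# Datasets I and II share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summarize(xs, ys):
    """Return (x_mean, y_mean, pearson_r, slope, intercept) for the line y ~ x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((a - x_mean) ** 2 for a in xs)
    syy = sum((b - y_mean) ** 2 for b in ys)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    pearson_r = sxy / sqrt(sxx * syy)
    return x_mean, y_mean, pearson_r, slope, intercept


summary_1 = summarize(x, y1)
summary_2 = summarize(x, y2)
for name, s in [("I", summary_1), ("II", summary_2)]:
    print(name, [round(v, 2) for v in s])
```

Rounded to two decimals the two rows agree, even though a scatterplot of each dataset looks very different.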
“I got a 238.25 on the first test!” 🤩
“But only a 0.31 on the second” 😭
id | t1_score | t2_score |
17 | 268.30 | -0.54 |
27 | 258.44 | -0.33 |
26 | 245.86 | -0.55 |
5 | 238.25 | 0.31 |
11 | 206.54 | -0.02 |
16 | 205.49 | -0.06 |
“And higher than 60% on the second!” 😎
id | t1_score | t1_pctile | t2_score | t2_pctile | t1_z_score | t2_z_score |
17 | 268.30 | 100.0 | -0.54 | 30.0 | 1.87 | -0.82 |
27 | 258.44 | 96.7 | -0.33 | 46.7 | 1.73 | -0.52 |
26 | 245.86 | 93.3 | -0.55 | 26.7 | 1.54 | -0.83 |
5 | 238.25 | 90.0 | 0.31 | 60.0 | 1.44 | 0.39 |
11 | 206.54 | 86.7 | -0.02 | 56.7 | 0.98 | -0.08 |
16 | 205.49 | 83.3 | -0.06 | 50.0 | 0.96 | -0.14 |
The percentile places everyone at evenly-spaced intervals from 0 to 100:
# https://community.rstudio.com/t/number-line-in-ggplot/162894/4
# Add a binary indicator to track "me" (student #8)
whoami <- 29
But what if we want to see their absolute performance, on a 0 to 100 scale?
\[ x'_i = x_i - \mu \]
\[ z_i = \frac{x_i - \mu}{\sigma} \]
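A small sketch of the three rescalings discussed here (centering, z-scores, percentiles), using made-up scores rather than the class data. It assumes \(\sigma\) is the population standard deviation and that a percentile is the share of scores at or below a given score; the function names are illustrative only.

```python
from statistics import mean, pstdev

# Hypothetical raw scores (not the class data from the slides)
scores = [52.0, 61.0, 70.0, 70.5, 88.0]


def center(xs):
    """Subtract the mean: x'_i = x_i - mu."""
    mu = mean(xs)
    return [x - mu for x in xs]


def z_scores(xs):
    """Standardize: z_i = (x_i - mu) / sigma, with sigma the population SD."""
    mu = mean(xs)
    sigma = pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def percentile_ranks(xs):
    """Percent of scores that are less than or equal to each score."""
    n = len(xs)
    return [100 * sum(1 for other in xs if other <= x) / n for x in xs]


centered = center(scores)
standardized = z_scores(scores)
percentiles = percentile_ranks(scores)
print(centered)
print([round(z, 2) for z in standardized])
print(percentiles)
```

Percentiles keep only the ordering, so the near-tie between 70.0 and 70.5 looks the same as any other one-rank gap; the z-scores keep the spacing.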
Image Credit: Peter Dovak
\[ || \mathbf{x} - \mathbf{y} ||_p = \left(\sum_{i=1}^n |x_i - y_i|^p \right)^{1/p} \]
Edit Distance, e.g., Hamming distance:
\[ \begin{array}{c|c|c|c|c|c} x & \green{1} & \green{1} & \red{0} & \red{1} & 1 \\ \hline & ✅ & ✅ & ❌ & ❌ & ✅ \\\hline y & \green{1} & \green{1} & \red{1} & \red{0} & 1 \\ \end{array} \; \leadsto d(x,y) = 2 \]
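The comparison in the array above can be sketched as a position-by-position count of mismatches. The function name is illustrative, and the sketch assumes both sequences have the same length (Hamming distance is not defined otherwise).

```python
def hamming_distance(x, y):
    """Count the positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


# The two bit strings from the array above
x = "11011"
y = "11101"
d = hamming_distance(x, y)
print(d)
```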
KL Divergence (Probability distributions):
\[ \begin{align*} \kl(P \parallel Q) &= \sum_{x \in \mathcal{R}_X}P(x)\log\left[ \frac{P(x)}{Q(x)} \right] \\ &\neq \kl(Q \parallel P) \; (!) \end{align*} \]
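A minimal sketch of the asymmetry, using two made-up distributions over the same two outcomes. It uses the natural log and assumes \(Q(x) > 0\) wherever \(P(x) > 0\); terms with \(P(x) = 0\) are skipped, following the usual convention.

```python
from math import log


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as {outcome: probability} dicts."""
    return sum(p[x] * log(p[x] / q[x]) for x in p if p[x] > 0)


# Hypothetical distributions (assumption: chosen only to show the asymmetry)
P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.9, "b": 0.1}

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
print(round(kl_pq, 4), round(kl_qp, 4))
```

The two directions give different numbers, which is why KL divergence is not a distance metric in the strict sense.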
\[ || \mathbf{x} - \mathbf{y} ||_2 = \sqrt{\sum_{i=1}^n(x_i-y_i)^2} \]
\[ || \mathbf{x} - \mathbf{y} ||_1 = \sum_{i=1}^n |x_i - y_i| \]
\[ || \mathbf{x} - \mathbf{y} ||_{\infty} = \lim_{p \rightarrow \infty}\left[|| \mathbf{x} - \mathbf{y} ||_p\right] = \max\{|x_1-y_1|, \ldots, |x_n - y_n|\} \]
\[ || \mathbf{x} - \mathbf{y} ||_0 = \sum_{i=1}^n \mathbf{1}\left[x_i \neq y_i\right] \]
\[ || \mathbf{x} - \mathbf{y} ||_{1/2} = \left(\sum_{i=1}^n \sqrt{|x_i - y_i|} \right)^2 \]
\[ \forall a, b, c \left[ d(a,c) \leq d(a,b) + d(b,c) \right] \]
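A short sketch that evaluates the general \(L^p\) formula for a few values of \(p\) and checks the triangle inequality on three hand-picked points. The points and function names are illustrative; the check suggests why \(p = 1/2\) does not give a true metric.

```python
def minkowski(x, y, p):
    """L^p distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def chebyshev(x, y):
    """L^infinity distance: the largest coordinate-wise absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def satisfies_triangle(a, b, c, p):
    """Check d(a,c) <= d(a,b) + d(b,c) for these three points only."""
    return minkowski(a, c, p) <= minkowski(a, b, p) + minkowski(b, c, p)


# Hand-picked points: going from a to c "around the corner" through b
a, b, c = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

for p in (2, 1, 0.5):
    print(p, minkowski(a, c, p), satisfies_triangle(a, b, c, p))
print("inf", chebyshev(a, c))
```

For \(p = 1/2\) the direct distance from `a` to `c` is 4, while the detour through `b` costs only 1 + 1, so the inequality fails.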
Visualizing “circles” in \(L^p\) space:
\[ h_i = \beta_0 + \beta_1 s_i + \varepsilon_i \]
There are scarier alternatives, though! What if…
Dogs eat homework because their owners studied so much that the dogs got ignored?
Dogs hate sloppy work, and eat bad homework that would have gotten a low score anyway?
Noisy homes (\(Z = 1\)) cause dogs to get agitated and eat homework more often, and also cause students to do worse?
\[ \iqr = Q_3 - Q_1 \]
\[ [Q_1 - 1.5 \cdot \iqr, \; Q_3 + 1.5 \cdot \iqr] \]
mean_score <- mean(data_df$Score)
sd_score <- sd(data_df$Score)
lower_cutoff <- mean_score - 3 * sd_score
upper_cutoff <- mean_score + 3 * sd_score
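A sketch comparing the two outlier rules on a made-up sample. It assumes the sample standard deviation (as R's `sd()` uses) and the default quartile method of Python's `statistics.quantiles`; other quartile conventions can move the fences slightly.

```python
from statistics import mean, quantiles, stdev

# Hypothetical scores with one extreme value (not the data_df from the slides)
scores = [55, 60, 62, 65, 67, 70, 72, 75, 78, 150]

# Rule 1: IQR fences, [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1, _, q3 = quantiles(scores, n=4)
iqr = q3 - q1
iqr_lower = q1 - 1.5 * iqr
iqr_upper = q3 + 1.5 * iqr
iqr_outliers = [s for s in scores if s < iqr_lower or s > iqr_upper]

# Rule 2: three standard deviations from the mean
mean_score = mean(scores)
sd_score = stdev(scores)
lower_cutoff = mean_score - 3 * sd_score
upper_cutoff = mean_score + 3 * sd_score
sigma_outliers = [s for s in scores if s < lower_cutoff or s > upper_cutoff]

print("IQR rule:", iqr_outliers)
print("3-sigma rule:", sigma_outliers)
```

On this sample the IQR fences flag 150 while the three-sigma cutoffs do not, because the extreme value itself inflates the standard deviation.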
Presumed DGP:
Actual DGP:
In math (I wish I had learned it like this), the \(\log()\) function is a magic function that “reduces” complicated operations to less-complicated operations:
Exponentiation \(\rightarrow\) Multiplication:
\[ \log(a^b) = b\cdot \log(a) \]
Multiplication \(\rightarrow\) Addition:
\[ \log(a\cdot b) = \log(a) + \log(b) \]
\[ y = e^{mx + b} \iff \log(y) = mx + b \]
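A sketch of why this equivalence is useful: data generated from an exponential curve become a straight line after taking logs, so an ordinary least-squares line recovers \(m\) and \(b\). The values of `m_true` and `b_true` and the helper `fit_line` are made up for illustration.

```python
from math import exp, log


def fit_line(xs, ys):
    """Least-squares slope and intercept for ys ~ xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical parameters for y = e^(m x + b), with no noise added
m_true, b_true = 0.3, 1.0
xs = [0, 1, 2, 3, 4, 5]
ys = [exp(m_true * x + b_true) for x in xs]

# Taking logs turns the curve into the line log(y) = m x + b
log_ys = [log(y) for y in ys]
m_hat, b_hat = fit_line(xs, log_ys)
print(round(m_hat, 6), round(b_hat, 6))

# The two log rules from above, checked numerically
power_rule_gap = abs(log(2 ** 5) - 5 * log(2))
product_rule_gap = abs(log(3 * 7) - (log(3) + log(7)))
```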
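library(ggplot2)    # plotting
library(latex2exp)  # TeX() for the axis labels
# dsan_theme(), remove_legend(), and g_pointsize come from the course's shared setup code
x_mean <- mean(df$x)
y_mean <- mean(df$y)
ggplot(df, aes(x=x, y=y)) +
  geom_point() +
  # Reference lines at the two means split the plot into four quadrants
  geom_vline(xintercept = x_mean) +
  geom_hline(yintercept = y_mean) +
  #facet_grid(. ~ rel) +
  # The opening call of this layer was lost in extraction: geom_label() is an assumption,
  # and it needs a `label` column in the data
  geom_label(
    aes(x=x, y=y, label=label, color=label),
    size = g_pointsize * 1.5
  ) +
  scale_color_manual(values=c("darkgreen","red")) +
  dsan_theme() +
  remove_legend() +
  theme(
    #axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    #axis.ticks.x = element_blank(),
    #axis.text.y = element_blank(),
    #axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  ) +
  xlim(c(-1,1)) + ylim(c(-1,1)) +
  coord_fixed(ratio=1) +
  scale_x_continuous(breaks=c(-1, x_mean, 1), labels=c("-1",TeX(r"($\mu_x$)"),"1")) +
  scale_y_continuous(breaks=c(-1, y_mean, 1), labels=c("-1",TeX(r"($\mu_y$)"),"1"))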
x- x+
y- 50 0
y+ 0 50
100 matches, 0 mismatches
x- x+
y- 38 11
y+ 12 39
77 matches, 23 mismatches
y_noisy_neg <- x_vals
x- x+
y- 13 34
y+ 37 16
29 matches, 71 mismatches
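A sketch of how tables like these can be produced: label each point by whether it lies above or below the mean on each axis, then count how often the two labels agree. Points whose labels match contribute positive terms \((x_i - \mu_x)(y_i - \mu_y)\) to the covariance sum, and mismatches contribute negative ones. The data here are made up (perfectly collinear), not the simulated data behind the slides.

```python
from statistics import mean


def sign_match_table(xs, ys):
    """Count points by (x side of mean, y side of mean); return table, matches, mismatches."""
    x_mean, y_mean = mean(xs), mean(ys)
    table = {("-", "-"): 0, ("+", "-"): 0, ("-", "+"): 0, ("+", "+"): 0}
    for x, y in zip(xs, ys):
        x_sign = "+" if x > x_mean else "-"
        y_sign = "+" if y > y_mean else "-"
        table[(x_sign, y_sign)] += 1
    matches = table[("-", "-")] + table[("+", "+")]
    return table, matches, len(xs) - matches


# Hypothetical data: 100 evenly spaced x values with mean 0
xs = [i - 49.5 for i in range(100)]
ys_pos = list(xs)            # perfectly positive relationship
ys_neg = [-x for x in xs]    # perfectly negative relationship

table_pos, matches_pos, mismatches_pos = sign_match_table(xs, ys_pos)
table_neg, matches_neg, mismatches_neg = sign_match_table(xs, ys_neg)
print(matches_pos, "matches,", mismatches_pos, "mismatches")
print(matches_neg, "matches,", mismatches_neg, "mismatches")
```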
library(dplyr)      # mutate()
library(ggplot2)    # plotting
library(latex2exp)  # TeX() for the axis labels
# dsan_theme(), remove_legend(), and g_pointsize come from the course's shared setup code
gen_rect_plot <- function(df, col_order=c("red","darkgreen")) {
  x_mean <- mean(df$x)
  y_mean <- mean(df$y)
  # Flag each point by which side of each mean it falls on
  df <- df |> mutate(
    x_contrib = ifelse(x > x_mean, "+", "-"),
    y_contrib = ifelse(y > y_mean, "+", "-"),
    match = x_contrib == y_contrib
  )
  ggplot(df, aes(x=x, y=y)) +
    geom_point() +
    geom_vline(xintercept = x_mean) +
    geom_hline(yintercept = y_mean) +
    #facet_grid(. ~ rel) +
    # One rectangle per point, spanning from the means to the point
    geom_rect(aes(xmin=x_mean, xmax=x, ymin=y_mean, ymax=y, fill=match), color='black', linewidth=0.1, alpha=0.075) +
    scale_color_manual(values=c("darkgreen","red")) +
    scale_fill_manual(values=col_order) +
    # The opening call of this layer was lost in extraction: geom_label() is an assumption,
    # and it needs a `label` column in the data
    geom_label(
      aes(x=x, y=y, label=label, color=label),
      size = g_pointsize * 1.5
    ) +
    dsan_theme() +
    remove_legend() +
    theme(
      #axis.text.x = element_blank(),
      axis.title.x = element_blank(),
      #axis.ticks.x = element_blank(),
      #axis.text.y = element_blank(),
      #axis.ticks.y = element_blank(),
      axis.title.y = element_blank()
    ) +
    xlim(c(-1,1)) + ylim(c(-1,1)) +
    coord_fixed(ratio=1) +
    scale_x_continuous(breaks=c(-1, x_mean, 1), labels=c("-1",TeX(r"($\mu_x$)"),"1")) +
    scale_y_continuous(breaks=c(-1, y_mean, 1), labels=c("-1",TeX(r"($\mu_y$)"),"1"))
}
gen_rect_plot(df_collinear, col_order=c("darkgreen","red"))
x- x+
y- 50 0
y+ 0 50
100 matches, 0 mismatches
x- x+
y- 38 11
y+ 12 39
77 matches, 23 mismatches
gen_rect_plot_expanded(df_collinear_expanded, col_order=c("darkgreen","red"))
x- x+
y- 50 0
y+ 0 50
100 matches, 0 mismatches
x- x+
y- 42 7
y+ 8 43
85 matches, 15 mismatches
x- x+
y- 10 44
y+ 40 6
16 matches, 84 mismatches
\[ \begin{align*} &A = (5,0), B = (3,4) \\ &\implies \cos(A,B) = \frac{3}{5} \end{align*} \]
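The value above can be checked with a short sketch of cosine similarity, the dot product divided by the product of the two vector lengths. The function name is illustrative.

```python
from math import sqrt


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a||_2 * ||b||_2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


A = (5, 0)
B = (3, 4)
cos_ab = cosine_similarity(A, B)
print(cos_ab)
```

Both vectors have length 5 and their dot product is 15, so the ratio is 15/25 = 3/5. Rescaling either vector leaves the result unchanged, since only the angle matters.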
Plus new names for ones you already know!
“Levenshtein Distance”: Edit distance
“Chebyshev Distance”: \(L^{\infty}\)-norm, meaning, maximum absolute distance. In \(\mathbb{R}^2\):
\[ \begin{align*} &D((x_1,y_1),(x_2,y_2)) \\ &= \max\{ |x_2 - x_1|, |y_2 - y_1| \} \end{align*} \]
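A sketch of the two renamed distances. The Levenshtein function uses the standard dynamic-programming recurrence (insertions, deletions, and substitutions each cost 1); the example strings and points are made up.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def chebyshev(p, q):
    """Largest absolute difference across coordinates."""
    return max(abs(a - b) for a, b in zip(p, q))


edit_dist = levenshtein("kitten", "sitting")
cheb_dist = chebyshev((1, 2), (4, 6))
print(edit_dist, cheb_dist)
```

Unlike Hamming distance, this edit distance is defined for strings of different lengths.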
\(\mathcal{A} = \mathcal{U}\{0,10\}\)? \(\kl(\mathcal{O} \parallel \mathcal{A})=0.338\)
\(\mathcal{B} = \text{Binom}(10,0.57)\)? \(\kl(\mathcal{O} \parallel \mathcal{B})=0.477\)
Supervised Learning: You want the computer to learn the existing pattern of how you are classifying observations
Unsupervised Learning: You want the computer to find patterns in a dataset, without any prior classification info
Supervised Learning: Dataset has both explanatory variables (“features”) and response variables (“labels”)
home_id | sqft | bedrooms | rating |
0 | 1000 | 1 | Disliked |
1 | 2000 | 2 | Liked |
2 | 2500 | 1 | Liked |
3 | 1500 | 2 | Disliked |
4 | 2200 | 1 | Liked |
# Assumes ggplot2 is loaded; dsan_theme(), g_pointsize, and cbPalette come from the course's shared setup code
ggplot(sup_data, aes(x=sqft, y=bedrooms, color=rating)) +
  geom_point(size = g_pointsize * 2) +
  labs(
    title = "Supervised Data: House Listings",
    x = "Square Footage",
    y = "Number of Bedrooms",
    color = "Outcome"
  ) +
  dsan_theme("half") +
  expand_limits(x=c(800,2700), y=c(0.8,2.2)) +
  # Dashed line: a candidate decision boundary at 1750 sqft, with each side shaded
  geom_vline(xintercept = 1750, linetype="dashed", color = "black", linewidth=1) +
  annotate('rect', xmin=-Inf, xmax=1750, ymin=-Inf, ymax=Inf, alpha=.2, fill=cbPalette[1]) +
  annotate('rect', xmin=1750, xmax=Inf, ymin=-Inf, ymax=Inf, alpha=.2, fill=cbPalette[2])
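The dashed line at 1750 square feet can be read as a one-rule classifier. A minimal sketch, using the five listings from the table above; the function name and the choice to treat the threshold as a default parameter are illustrative, not the course's implementation.

```python
# The five labeled listings from the table above
homes = [
    {"home_id": 0, "sqft": 1000, "bedrooms": 1, "rating": "Disliked"},
    {"home_id": 1, "sqft": 2000, "bedrooms": 2, "rating": "Liked"},
    {"home_id": 2, "sqft": 2500, "bedrooms": 1, "rating": "Liked"},
    {"home_id": 3, "sqft": 1500, "bedrooms": 2, "rating": "Disliked"},
    {"home_id": 4, "sqft": 2200, "bedrooms": 1, "rating": "Liked"},
]


def predict_rating(sqft, threshold=1750):
    """Predict 'Liked' for homes larger than the threshold, 'Disliked' otherwise."""
    return "Liked" if sqft > threshold else "Disliked"


predictions = [predict_rating(h["sqft"]) for h in homes]
correct = sum(pred == h["rating"] for pred, h in zip(predictions, homes))
accuracy = correct / len(homes)
print(predictions, accuracy)
```

The rule ignores `bedrooms` entirely, which matches the vertical boundary in the plot: on these five listings, square footage alone separates the two labels.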
library(ggforce)  # provides geom_mark_ellipse()
# dsan_theme(), g_pointsize, and cbPalette come from the course's shared setup code
ggplot(unsup_grouped, aes(x=sqft, y=bedrooms)) +
  #scale_color_brewer(palette = "PuOr") +
  # Ellipses around the groups the algorithm found; their meaning is left unlabeled ("?")
  geom_mark_ellipse(expand=0.1, aes(fill=big), size = 1) +
  geom_point(size=g_pointsize) +
  labs(
    x = "Square Footage",
    y = "Number of Bedrooms",
    fill = "?"
  ) +
  dsan_theme("half") +
  ggtitle("Unsupervised Data: House Listings") +
  #theme(legend.position = "none") +
  #theme(legend.title = text_element("?"))
  expand_limits(x=c(800,2700), y=c(0.8,2.2)) +
  scale_fill_manual(values=c(cbPalette[3],cbPalette[4]), labels=c("?","?"))
Image credit: DataCamp tutorial
Guessing House Prices:
Guessing Word Frequencies:
\[ \begin{align*} &\Pr(S = 1 \mid w_5 = \texttt{dollars}, w_4 = \texttt{million}) \\ &> \Pr(S = 1 \mid w_5 = \texttt{dollars}, w_4 = \texttt{octopus}) \end{align*} \]
\[ \Pr(S = 1 \mid w_5) \perp \Pr(S = 1 \mid w_4) \]
What are the keys to success in the NBA?
