Week 4: The Scourge of Overfitting

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author: Jeff Jacobs

Published: Monday, February 3, 2025

Schedule

Today’s Planned Schedule:

            Start    End      Topic
Lecture     6:30pm   6:40pm   Logistic Regression Recap
            6:40pm   7:00pm   CV and Model Selection Motivation
            7:00pm   8:00pm   Anti-Overfitting Toolkit
Break!      8:00pm   8:10pm
            8:10pm   9:00pm   Quizzo

Recap: Logistic Regression

  • What happens to the probability \(\Pr(Y = 1 \mid X)\) when \(X\) increases by 1 unit?
  • “Linear Probability Model” \(\Pr(Y = 1 \mid X) = \beta_0 + \beta_1X\) fails, but, “fixing” it leads to

\[ \begin{align*} &\log\left[ \frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)} \right] = \beta_0 + \beta_1 X \\ \iff &\Pr(Y = 1 \mid X) = \frac{\exp[\beta_0 + \beta_1X]}{1 + \exp[\beta_0 + \beta_1X]} = \frac{1}{1 + \exp\left[ -(\beta_0 + \beta_1X) \right] } \end{align*} \]

\(\leadsto\) A 1-unit increase in \(X\) is associated with a \(\beta_1\) increase in the log-odds of \(Y\)
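A quick sketch of this in R (hedged: simulated data, not an example from the slides), showing how glm() with family=binomial estimates \(\beta_1\) on the log-odds scale:

Code
# Minimal logistic-regression sketch on simulated data (an illustrative
# assumption, not the slides' example): beta_1 is a log-odds effect
set.seed(5300)
n <- 500
x <- rnorm(n)
pr_y1 <- 1 / (1 + exp(-(-1 + 2 * x)))  # true log-odds: -1 + 2x
y <- rbinom(n, size = 1, prob = pr_y1)
logit_model <- glm(y ~ x, family = binomial)
coef(logit_model)       # slope estimate should land near 2
exp(coef(logit_model))  # exponentiating moves us to the odds-ratio scale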

What We Have Thus Far

  • We have a core model, regression, that we can build up into pretty much anything we want!

Class Topic           This Video
Linear regression     Pachelbel's Canon in D (1m26s-1m46s)
Logistic regression   Add swing (1m46s)
Neural networks       Triads \(\mapsto\) 7th/9th chords (5m24s-5m53s)

Taking Off the Linearity Training Wheels

Linear Models

[Figures 1-2: image slides; caption: Font Credit]

How Can We Attain Hiromi’s Aura?

  • Ingredient 1 is lots of examples: find mysteries/questions you care about in the world and think about how regression could help us understand them!
  • But Ingredient 2 is Generalized Linear Models (GLMs), which I'll give an intro to on the board 🏃‍♂️‍➡️ (a small glm() preview follows below)
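As a preview (a hedged sketch, not the board intro): in R, glm() generalizes lm() by letting you choose a noise family and link function. For example, a Poisson GLM for count outcomes:

Code
# Minimal GLM sketch with simulated count data (an illustrative assumption)
set.seed(5300)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.5 * x))  # true log-mean: 0.5 + 1.5x
# Same workflow as lm(), plus a family + link choice
pois_model <- glm(y ~ x, family = poisson(link = "log"))
coef(pois_model)  # estimates should land near (0.5, 1.5)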

Where Are We Going / What Problems Are We Solving?

  • Another nice property we have: under the Gauss-Markov assumptions, the OLS estimator is BLUE (the Best Linear Unbiased Estimator) of the conditional mean \(\mathbb{E}[Y \mid X]\)
  • The first problem we'll tackle: as we move from linear models with these kinds of guarantees to fancier models with more uncertainties / "potholes"… how do we ensure they still achieve what we want them to achieve?
  • tl;dr: we can study more complex relationships between \(X\) and \(Y\) than linear ones, but we lose guarantees like "if it's linear, then it is [this]": in other words, we lose this automatic generalizability
  • (With great[er] power comes great[er] responsibility!)

The Level 2 Goal: Generalizability

The Goal of Statistical Learning

Find…

  • A function \(\widehat{y} = f(x)\)
  • That best predicts \(Y\) for given values of \(X\)
  • For data that has not yet been observed! 😳❓

…Can We Just, Like, Not?

  • What happens if we “unleash” fancier non-linear models on data the same way we’ve been using linear models?
  • The evil scourge of… OVERFITTING (⚡️ a single overly-dramatic lightning bolt strikes the whiteboard behind me right at this exact moment what are the odds ⚡️)

Your computer is Yes Man
Code
library(tidyverse)
set.seed(5300)
# Simulate N points from a noisy linear relationship: y = 3x + eps
N <- 30
x_vals <- runif(N, min=0, max=1)
y_vals_raw <- 3 * x_vals
y_noise <- rnorm(N, mean=0, sd=0.5)
y_vals <- y_vals_raw + y_noise
data_df <- tibble(x=x_vals, y=y_vals)
# Fit and plot an ordinary linear regression
data_df |> ggplot(aes(x=x, y=y)) +
  geom_point(size=2) +
  stat_smooth(
    method="lm",
    formula=y ~ x,
    se=FALSE,
    linewidth=1
  ) +
  labs(
    title = paste0("Linear Regression, N = ",N)
  ) +
  theme_dsan(base_size=28)  # theme_dsan() comes from the course setup code

You: “Fit the data… but you’re only allowed to be linear!” Computer: “You got it boss!”
Code
# Same data, but now a degree-N polynomial: with N points and N+1
# coefficients the curve can chase every observation (lm() fits the
# rank-deficient model, leaving unidentifiable coefficients as NA)
data_df |> ggplot(aes(x=x, y=y)) +
  geom_point(size=2.5) +
  stat_smooth(
    method="lm",
    formula=y ~ poly(x, N, raw=TRUE),
    se=FALSE,
    linewidth=1
  ) +
  labs(
    title = paste0("Polynomial Regression, N = ",N)
  ) +
  theme_dsan(base_size=28)

You: “Fit the data…” Computer: “You got it boss!”

Memorizing Data vs. Learning the Relationship

Code
# Eleven evenly-spaced points from y = x + eps, with small noise
x <- seq(from = 0, to = 1, by = 0.1)
n <- length(x)
eps <- rnorm(n, 0, 0.04)
y <- x + eps
# But make one big outlier
midpoint <- ceiling((3/4)*n)
y[midpoint] <- 0
of_data <- tibble::tibble(x=x, y=y)
# Linear model
lin_model <- lm(y ~ x)
# But now polynomial regression
poly_model <- lm(y ~ poly(x, degree = 10, raw=TRUE))
# Plot the training data alone
# (g_pointsize, like theme_dsan(), is defined in the course setup code)
ggplot(of_data, aes(x=x, y=y)) +
  geom_point(size=g_pointsize/2) +
  labs(
    title = "Training Data",
    color = "Model"
  ) +
  theme_dsan(base_size=16)

Code
ggplot(of_data, aes(x=x, y=y)) +
  geom_point(size=g_pointsize/2) +
  geom_abline(aes(intercept=0, slope=1, color="Linear"), linewidth=1, show.legend = FALSE) +
  stat_smooth(method = "lm",
              formula = y ~ poly(x, 10, raw=TRUE),
              se = FALSE, aes(color="Polynomial")) +
  labs(
    title = "A Perfect Model?",
    color = "Model"
  ) +
  theme_dsan(base_size=16)

How have we measured “good” fit? High \(R^2\)? Low \(RSS\)?
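(The get_rss() helper used below isn't defined on the slides; a minimal definition consistent with the outputs shown, assuming it simply sums squared residuals:)

Code
# Assumption: get_rss() is a course helper not shown here; a minimal version
get_rss <- function(model) sum(residuals(model)^2)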

Linear Model:
Code
summary(lin_model)$r.squared
[1] 0.5679903
Code
get_rss(lin_model)
[1] 0.5638522
Polynomial Model:
Code
summary(poly_model)$r.squared
[1] 1
Code
get_rss(poly_model)
[1] 0

5000: Accuracy \(\leadsto\) 5300: Generalization

  • Training Accuracy: How well does it fit the data we can see?
  • Test Accuracy: How well does it generalize to future data?
Code
# Data setup: fresh test draws from the same true relationship y = x + eps
x_test <- seq(from = 0, to = 1, by = 0.1)
n_test <- length(x_test)
eps_test <- rnorm(n_test, 0, 0.04)
y_test <- x_test + eps_test
of_data_test <- tibble::tibble(x=x_test, y=y_test)
# Linear model predictions on the test set
# (newdata needs a column named "x", matching the model formula)
lin_y_pred_test <- predict(lin_model, newdata=tibble(x=x_test))
lin_resids_test <- y_test - lin_y_pred_test
lin_rss_test <- sum(lin_resids_test^2)
# Linear R^2 = 1 - RSS/TSS
tss_test <- sum((y_test - mean(y_test))^2)
lin_r2_test <- 1 - (lin_rss_test / tss_test)
# Same computations for the polynomial model
poly_y_pred_test <- predict(poly_model, newdata=tibble(x=x_test))
poly_resids_test <- y_test - poly_y_pred_test
poly_rss_test <- sum(poly_resids_test^2)
poly_r2_test <- 1 - (poly_rss_test / tss_test)
# Plot both fitted models against the unseen test points
ggplot(of_data, aes(x=x, y=y)) +
  stat_smooth(method = "lm",
              formula = y ~ poly(x, 10, raw = TRUE),
              se = FALSE, aes(color="Polynomial")) +
  geom_point(data=of_data_test, size=g_pointsize/2) +
  geom_abline(aes(intercept=0, slope=1, color="Linear"), linewidth=1, show.legend = FALSE) +
  labs(
    title = "Evaluation: Unseen Test Data",
    color = "Model"
  ) +
  theme_dsan(base_size=16)

Linear Model:
Code
lin_r2_test
[1] 0.8906269
Code
lin_rss_test
[1] 0.139159
Polynomial Model:
Code
poly_r2_test
[1] 0.5237016
Code
poly_rss_test
[1] 0.6060099

In Other Words…

Image source: circulated as secret shitposting among PhD students in seminars

Building an Anti-Overfitting Toolkit

  • Penalizing Complexity
  • Cross-Validation
  • Model Selection (Penalizing Complexity 2.0)

Ok So, How Do We Avoid Overfitting?

  • The gist: penalize model complexity

  • Original optimization:

    \[ \theta^* = \underset{\theta}{\operatorname{argmin}} \mathcal{L}(y, \widehat{y}) \]

  • New optimization:

    \[ \theta^* = \underset{\theta}{\operatorname{argmin}} \left[ \lambda \mathcal{L}(y, \widehat{y}) + (1-\lambda) \mathsf{Complexity}(\theta) \right] \]

  • But how do we measure, and penalize, “complexity”?
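To make the new objective concrete, here's a toy version (hedged: polynomial degree as the complexity measure and RSS as the loss, both purely for illustration), reusing the of_data points from above:

Code
# Toy penalized selection: score(d) = lambda * RSS(d) + (1 - lambda) * d
lambda <- 0.8
penalized_score <- function(degree) {
  fit <- lm(y ~ poly(x, degree, raw = TRUE), data = of_data)
  lambda * sum(residuals(fit)^2) + (1 - lambda) * degree
}
scores <- sapply(1:10, penalized_score)
which.min(scores)  # the complexity term counteracts the "RSS = 0" temptation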

Regularization: Measuring and Penalizing Complexity

  • In the case of polynomials, fairly simple complexity measure: degree of polynomial

\[ \mathsf{Complexity}(\widehat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3) > \mathsf{Complexity}(\widehat{y} = \beta_0 + \beta_1 x) \]

  • In general machine learning, however, we might not be working with polynomials
  • In neural networks, for example, we sometimes toss in millions of features and ask the algorithm to “just figure it out”
  • The gist, in the general case, is thus: try to “amplify” the most important features and shrink the rest, so that

\[ \mathsf{Complexity} \propto \frac{|\text{AmplifiedFeatures}|}{|\text{ShrunkFeatures}|} \]
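One concrete version of "shrink the rest" (a hedged illustration, not the only formalization): in the orthonormal-design special case, the LASSO solution we meet next applies a soft-thresholding operator that pulls small coefficients all the way to zero:

Code
# Soft-thresholding: the shrinkage operator behind LASSO
# (exact solution only in the orthonormal-design special case)
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}
beta_hat <- c(2.5, -1.2, 0.3, -0.05, 0.01)
soft_threshold(beta_hat, lambda = 0.5)
# [1]  2.0 -0.7  0.0  0.0  0.0  <- small coefficients become exactly zero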

LASSO and Elastic Net Regularization

  • Many ways to translate this intuition into math!
  • In several fields (econ, biostatistics), however, LASSO¹ (Tibshirani 1996) is the standard:

\[ \beta^*_{LASSO} = {\underset{\beta}{\operatorname{argmin}}}\left\{{\frac {1}{N}}\left\|y-X\beta \right\|_{2}^{2}+\lambda \|\beta \|_{1}\right\} \]

  • Why does this work to penalize complexity? What does the parameter \(\lambda\) do?
  • Some known issues with LASSO are fixed by an extension built on the same intuitions: Elastic Net

\[ \beta^*_{EN} = {\underset {\beta }{\operatorname {argmin} }}\left\{ \|y-X\beta \|_{2}^{2}+\lambda _{2}\|\beta \|_{2}^{2}+\lambda _{1}\|\beta \|_{1} \right\} \]

  • Ensures a unique global minimum! Note that setting \(\lambda_2 = 0\) recovers the LASSO problem (with \(\lambda_1\) absorbing the \(\frac{1}{N}\) rescaling of the loss term), so \(\beta^*_{EN} = \beta^*_{LASSO}\)
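In R, both estimators are available via the glmnet package (a hedged sketch on simulated data; alpha = 1 gives the LASSO penalty, alpha strictly between 0 and 1 gives an Elastic Net mix):

Code
# Hedged sketch: LASSO / Elastic Net via glmnet, on simulated data
library(glmnet)
set.seed(5300)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n, sd = 0.5)  # only 2 features matter
lasso_cv <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 => LASSO
coef(lasso_cv, s = "lambda.min")         # irrelevant betas shrunk to ~0
en_fit <- glmnet(X, y, alpha = 0.5)      # alpha in (0,1) => Elastic Net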

Training vs. Test Data

Code
graph grid
{
    graph [
        overlap=true
    ]
    nodesep=0.0
    ranksep=0.0
    rankdir="TB"
    node [
        style="filled",
        color=black,
        fillcolor=lightblue,
        shape=box
    ]

    // invisible edges: used only to control layout
    edge [style=invis]
    
    subgraph cluster_01 {
        label="Training Set (80%)"
    N1[label="20%"] N2[label="20%"] N3[label="20%"] N4[label="20%"]
    }
    subgraph cluster_02 {
        label="Test Set (20%)"
    N5[label="20%",fillcolor=orange]
    }
}


Cross-Validation

  • The idea that good models generalize well is crucial!
    • What if we could leverage this insight to optimize over our training data?
    • The key: Validation Sets
Code
graph grid
{
    graph [
        overlap=true,
        scale=0.2
    ]
    nodesep=0.0
    ranksep=0.0
    rankdir="LR"
    scale=0.2
    node [
        style="filled",
        color=black,
        fillcolor=lightblue,
        shape=box
    ]

    // invisible edges: used only to control layout
    edge [style=invis]
    
    subgraph cluster_01 {
        label="Training Set (80%)"
        subgraph cluster_02 {
            label="Training Fold (80%)"
            A1[label="16%"] A2[label="16%"] A3[label="16%"] A4[label="16%"]
        }
        subgraph cluster_03 {
            label="Validation Fold (20%)"
            B1[label="16%",fillcolor=lightgreen]
        }
    }
    subgraph cluster_04 {
        label="Test Set (20%)"
    C1[label="20%",fillcolor=orange]
    }
    A1 -- A2 -- A3 -- A4 -- B1 -- C1;
}

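A minimal hand-rolled version of this picture (a hedged base-R sketch, not a course helper): split the training data into \(K\) folds, fit on \(K-1\) of them, evaluate on the held-out fold, and average the validation errors:

Code
# Hedged sketch: 5-fold cross-validation over polynomial degree, by hand
set.seed(5300)
n <- 100
cv_df <- data.frame(x = runif(n))
cv_df$y <- 3 * cv_df$x + rnorm(n, sd = 0.5)
K <- 5
fold_id <- sample(rep(1:K, length.out = n))  # random fold assignments
cv_mse <- function(degree) {
  fold_errs <- sapply(1:K, function(k) {
    train <- cv_df[fold_id != k, ]
    valid <- cv_df[fold_id == k, ]
    fit <- lm(y ~ poly(x, degree, raw = TRUE), data = train)
    mean((valid$y - predict(fit, newdata = valid))^2)
  })
  mean(fold_errs)  # average held-out error across folds
}
sapply(1:6, cv_mse)  # CV error should favor low degrees for this linear DGP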

Hyperparameters

  • The unspoken (but highly consequential!) “settings” for our learning procedure (that we haven’t optimized via gradient descent)
  • There are several you've already seen (in DSAN 5000, for example) – can you name them?

Hyperparameters You’ve Already Seen

  • Unsupervised Clustering: The number of clusters we want \(K\)
  • Gradient Descent: The step size \(\gamma\)
  • LASSO/Elastic Net: \(\lambda\)
  • The train/validation/test split!

Hyperparameter Selection

  • Every model comes with its own hyperparameters:
    • Neural Networks: Number of layers, nodes per layer
    • Decision Trees: Max tree depth, max features to include
    • Topic Models: Number of topics, document/topic priors
  • So, how do we choose?
    • Often more art than science
    • Principled, universally applicable, but slow: grid search (see the sketch after this list)
    • Specific methods for specific algorithms: e.g., Adam (Kingma and Ba 2017) for neural network learning rates
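As a concrete instance of grid search (a hedged sketch: the rpart package and its maxdepth/cp hyperparameters are real, but the data and the particular grid are illustrative assumptions):

Code
# Hedged sketch: grid search over two decision-tree hyperparameters
library(rpart)
set.seed(5300)
n <- 400
gs_df <- data.frame(x1 = runif(n), x2 = runif(n))
gs_df$y <- factor(gs_df$x1 + gs_df$x2 + rnorm(n, sd = 0.3) > 1)
train_idx <- sample(n, 0.8 * n)
train_df <- gs_df[train_idx, ]
valid_df <- gs_df[-train_idx, ]
# Every combination of max tree depth and complexity parameter
grid <- expand.grid(maxdepth = c(2, 4, 8), cp = c(0.01, 0.001))
grid$val_acc <- apply(grid, 1, function(g) {
  fit <- rpart(y ~ x1 + x2, data = train_df, method = "class",
               control = rpart.control(maxdepth = g["maxdepth"], cp = g["cp"]))
  mean(predict(fit, valid_df, type = "class") == valid_df$y)
})
grid[which.max(grid$val_acc), ]  # best combo on the validation split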

…Now What?

  • So we’ve got a trained model…
    • Data collected ✅
    • Loss function chosen ✅
    • Gradient descent complete ✅
    • Hyperparameters tuned ✅
    • Good \(F_1\) score on test data ✅
  • What’s our next step?
    • This is where engineers and social scientists diverge…
    • Stay tuned!

Quiz Time!

  • Jeff hands out your (paper) quizzes now!
  • You have 55 minutes (8:05 to 9pm)
  • You got this 💪

References

Kingma, Diederik P., and Jimmy Ba. 2017. “Adam: A Method for Stochastic Optimization.” arXiv. https://doi.org/10.48550/arXiv.1412.6980.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) 58 (1): 267–88. https://www.jstor.org/stable/2346178.

Footnotes

  1. Least Absolute Shrinkage and Selection Operator