Week 2: Linear Regression

DSAN 5300: Statistical Learning
Spring 2026, Georgetown University

Author: Jeff Jacobs

Published: Monday, January 12, 2026


Schedule

Today’s Planned Schedule:

|         | Start  | End    | Topic                        |
|---------|--------|--------|------------------------------|
| Lecture | 6:30pm | 7:10pm | Simple Linear Regression →   |
|         | 7:10pm | 7:30pm | Deriving the OLS Solution →  |
|         | 7:30pm | 8:00pm | Interpreting OLS Output →    |
| Break!  | 8:00pm | 8:10pm |                              |
|         | 8:10pm | 8:30pm | Quiz Review →                |
|         | 8:30pm | 9:00pm | Quiz 2!                      |

Linear Regression

  • What happens to my dependent variable \(Y\) when my independent variable \(X\) increases by 1 unit?

  • Keep the goal in front of your mind:

    Tip: The Goal of Regression

    Find a line \(\widehat{y} = mx + b\) that best predicts \(Y\) for given values of \(X\)

  • Sanity Note 1: This goal \(\Rightarrow\) we measure error via vertical distance from the line
  • Sanity Note 2: This goal \(\Rightarrow\) we model the distribution of \(\boxed{Y \mid X}\), not of \((X,Y)\)!
    • A line for predicting \(Y\) from \(X\) and \(X\) from \(Y\) symmetrically \(\Rightarrow\) the principal component line \(\neq\) regression!

How Do We Define “Best”?

  • Intuitively, two different ways to measure how well a line fits the data:
Code
library(tidyverse)
set.seed(5321)
N <- 11
x <- seq(from = 0, to = 1, by = 1 / (N - 1))
y <- x + rnorm(N, 0, 0.2)
mean_y <- mean(y)
spread <- y - mean_y
df <- tibble(x = x, y = y, spread = spread)
ggplot(df, aes(x=x, y=y)) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color=cbPalette[1], linewidth=g_linewidth*1.5) +
  geom_segment(xend=(x+y)/2, yend=(x+y)/2, linewidth=g_linewidth*2, color=cbPalette[2]) +
  geom_point(size=g_pointsize) +
  coord_equal() +
  xlim(0, 1) + ylim(0, 1) +
  dsan_theme("half") +
  labs(
    title = "Principal Component Line"
  )
ggplot(df, aes(x=x, y=y)) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color=cbPalette[1], linewidth=g_linewidth*1.5) +
  geom_segment(xend=x, yend=x, linewidth=g_linewidth*2, color=cb_palette[3]) +
  geom_point(size=g_pointsize) +
  coord_equal() +
  xlim(0, 1) + ylim(0, 1) +
  dsan_theme("half") +
  labs(
    title = "Regression Line"
  )
Figure 1: The line that minimizes the blue distances does not predict \(Y\) as well as the regression line, despite its intuitive appeal!
Figure 2: The line that minimizes the green distances optimally predicts \(Y\) from \(X\), in a mathematically provable sense!

On the difference between these two lines, and why it matters, I cannot recommend Gelman and Hill (2007) enough!

Principal Component Analysis

  • Principal Component Line can be used to project the data onto its dimension of highest variance (recap from 5000!)
  • More simply: PCA can discover meaningful axes in data (unsupervised learning / exploratory data analysis settings)
Code
library(readr)
library(ggplot2)
gdp_df <- read_csv("assets/gdp_pca.csv")

dist_to_line <- function(x0, y0, a, c) {
    numer <- abs(a * x0 - y0 + c)
    denom <- sqrt(a * a + 1)
    return(numer / denom)
}
# Finding PCA line for industrial vs. exports
x <- gdp_df$industrial
y <- gdp_df$exports
lossFn <- function(lineParams, x0, y0) {
    a <- lineParams[1]
    c <- lineParams[2]
    return(sum(dist_to_line(x0, y0, a, c)))
}
o <- optim(c(0, 0), lossFn, x0 = x, y0 = y)
ggplot(gdp_df, aes(x = industrial, y = exports)) +
    geom_point(size=g_pointsize/2) +
    geom_abline(aes(slope = o$par[1], intercept = o$par[2], color="pca"), linewidth=g_linewidth, show.legend = TRUE) +
    geom_smooth(aes(color="lm"), method = "lm", se = FALSE, linewidth=g_linewidth, key_glyph = "blank") +
    scale_color_manual(element_blank(), values=c("pca"=cbPalette[2],"lm"=cbPalette[1]), labels=c("Regression","PCA")) +
    dsan_theme("half") +
    remove_legend_title() +
    labs(
      title = "PCA Line vs. Regression Line",
        x = "Industrial Production (% of GDP)",
        y = "Exports (% of GDP)"
    )

See this amazing blog post using PCA, with 2 dimensions, to explore UN voting patterns!
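
If you'd rather not hand-roll the line-fitting, base R's `prcomp()` computes principal components directly. A minimal sketch, using the same `industrial` and `exports` columns as above (note that `prcomp()` minimizes squared rather than absolute perpendicular distance, so its line may differ slightly from the `optim()` fit):
Code
library(readr)
gdp_df <- read_csv("assets/gdp_pca.csv")
# PCA on the two columns of interest (centered, unscaled)
pca_fit <- prcomp(gdp_df[, c("industrial", "exports")], center = TRUE, scale. = FALSE)
# First column of the rotation (loadings) matrix = direction of greatest variance
pc1_dir <- pca_fit$rotation[, 1]
# The first-PC line passes through the column means, with slope loading_y / loading_x
pc1_slope <- pc1_dir["exports"] / pc1_dir["industrial"]
pc1_intercept <- mean(gdp_df$exports) - pc1_slope * mean(gdp_df$industrial)
c(slope = unname(pc1_slope), intercept = unname(pc1_intercept))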

Create Your Own Dimension!

Code
ggplot(gdp_df, aes(pc1, .fittedPC2)) +
    geom_point(size = g_pointsize/2) +
    geom_hline(aes(yintercept=0, color='PCA Line'), linetype='solid', linewidth=g_linewidth) +
    geom_rug(sides = "b", linewidth=g_linewidth/1.2, length = unit(0.1, "npc"), color=cbPalette[3]) +
    expand_limits(y=-1.6) +
    scale_color_manual(element_blank(), values=c("PCA Line"=cbPalette[2])) +
    dsan_theme("full") +
    remove_legend_title() +
    labs(
      title = "Exports vs. Industrial Production in Principal Component Space",
      x = "First Principal Component (Dimension of Greatest Variance)",
      y = "Second Principal Component"
    )

…And Use It for EDA

Code
library(dplyr)
library(tidyr)
plot_df <- gdp_df %>% select(c(country_code, pc1, agriculture, military))
long_df <- plot_df %>% pivot_longer(!c(country_code, pc1), names_to = "var", values_to = "val")
long_df <- long_df |> mutate(
  var = case_match(
    var,
    "agriculture" ~ "Agricultural Production",
    "military" ~ "Military Spending"
  )
)
ggplot(long_df, aes(x = pc1, y = val)) +
    geom_point() +
    geom_smooth(method=lm, formula='y ~ x') +
    facet_wrap(vars(var), scales = "free") +
    dsan_theme("full") +
    labs(
        x = "\"Industrial-Export Dimension\" (First PC)",
        y = "% of GDP"
    )

But in Our Case…

  • \(x\) and \(y\) dimensions already have meaning, and we have a hypothesis about effect of \(x\) on \(y\)!
Tip: The Regression Hypothesis \(\mathcal{H}_{\text{reg}}\)

Given data \((X, Y)\), we estimate \(\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1x\), hypothesizing that:

  • Starting from \(y = \underbrace{\widehat{\beta}_0}_{\mathclap{\text{Intercept}}}\) when \(x = 0\),
  • An increase of \(x\) by 1 unit is associated with an increase of \(y\) by \(\underbrace{\widehat{\beta}_1}_{\mathclap{\text{Coefficient}}}\) units
  • We want to measure how well our line predicts \(y\) for any given \(x\) value \(\implies\) vertical distance from regression line

Example: Advertising Effects

  • Independent variable: $ put into advertisements; Dependent variable: Sales
  • Goal 1: Predict sales for a given allocation
  • Goal 2: Infer best allocation for a given advertising budget (more simply: a new $1K appears! Where should we invest it?)
Code
library(tidyverse)
ad_df <- read_csv("assets/Advertising.csv") |> rename(id=`...1`)
long_df <- ad_df |> pivot_longer(-c(id, sales), names_to="medium", values_to="allocation")
long_df |> ggplot(aes(x=allocation, y=sales)) +
  geom_point() +
  facet_wrap(vars(medium), scales="free_x") +
  geom_smooth(method='lm', formula="y ~ x") +
  theme_dsan() +
  labs(
    x = "Allocation ($1K)",
    y = "Sales (1K Units)"
  )
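
To read the fitted lines off as numbers rather than pictures, we can fit one of these simple regressions directly with `lm()`. A minimal sketch, assuming `ad_df` from the chunk above is in scope and that the TV spending column is named `TV` (as in the standard Advertising dataset):
Code
# Simple linear regression of sales on TV advertising alone
tv_model <- lm(sales ~ TV, data = ad_df)
coef(tv_model)
# (Intercept): predicted sales (in 1K units) when TV spending is $0
# TV: estimated increase in sales (1K units) per additional $1K of TV advertising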

Simple Linear Regression

  • For now, we treat Newspaper, Radio, and TV advertising separately: how much do sales increase per $1K put into [medium]? (Later we’ll consider them jointly: multiple regression)

Our model:

\[ Y = \underbrace{\param{\beta_0}}_{\mathclap{\text{Intercept}}} + \underbrace{\param{\beta_1}}_{\mathclap{\text{Slope}}}X + \varepsilon \]

…Generates predictions via:

\[ \widehat{y} = \underbrace{\widehat{\beta}_0}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-5mm] \text{intercept}\end{array}}} ~+~ \underbrace{\widehat{\beta}_1}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-4mm] \text{slope}\end{array}}}\cdot x \]

  • Note how these predictions will be wrong (unless the data is perfectly linear)
  • We’ve accounted for this in our model (by including \(\varepsilon\) term)!
  • But, we’d like to find estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) that produce the “least wrong” predictions: motivates focus on residuals \(\widehat{\varepsilon}_i\)

\[ \widehat{\varepsilon}_i = \underbrace{y_i}_{\mathclap{\small\begin{array}{c}\text{Real} \\[-5mm] \text{label}\end{array}}} - \underbrace{\widehat{y}_i}_{\mathclap{\small\begin{array}{c}\text{Predicted} \\[-5mm] \text{label}\end{array}}} = \underbrace{y_i}_{\mathclap{\small\begin{array}{c}\text{Real} \\[-5mm] \text{label}\end{array}}} - \underbrace{ \left( \widehat{\beta}_0 + \widehat{\beta}_1 \cdot x \right) }_{\text{\small{Predicted label}}} \]
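
As a sanity check on these definitions (and a preview of the OLS solution we derive next), here is a self-contained sketch on simulated data: it computes the closed-form estimates \(\widehat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}\) and \(\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1\bar{x}\) by hand, verifies they match `lm()`, and forms the residuals \(\widehat{\varepsilon}_i = y_i - \widehat{y}_i\).
Code
set.seed(5300)
# Simulated data from a known line plus noise: y = 1 + 2x + epsilon
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, 0, 0.3)
# Closed-form OLS estimates
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)
# Residuals: real label minus predicted label
resid_by_hand <- y - (beta0_hat + beta1_hat * x)
# lm() produces the same estimates and residuals
ols_fit <- lm(y ~ x)
all.equal(unname(coef(ols_fit)), c(beta0_hat, beta1_hat))
all.equal(unname(resid(ols_fit)), resid_by_hand)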

Least Squares: Minimizing Residuals

What can we optimize to ensure these residuals are as small as possible?

Code
N <- 21
x <- seq(from = 0, to = 1, by = 1 / (N - 1))
y <- x + rnorm(N, 0, 0.25)
mean_y <- mean(y)
spread <- y - mean_y
sim_lg_df <- tibble(x = x, y = y, spread = spread)
sim_lg_df |> ggplot(aes(x=x, y=y)) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color=cbPalette[1], linewidth=g_linewidth) +
  # geom_segment(xend=x, yend=x, linewidth=g_linewidth*2, color=cbPalette[2]) +
  geom_segment(aes(xend=x, yend=x, color=ifelse(y>x,"Positive","Negative")), linewidth=1.5*g_linewidth) +
  geom_point(size=g_pointsize) +
  # coord_equal() +
  theme_dsan("half") +
  scale_color_manual("Spread", values=c("Positive"=cbPalette[3],"Negative"=cbPalette[6]), labels=c("Positive"="Positive","Negative"="Negative"))
large_sum <- sum(sim_lg_df$spread)
writeLines(fmt_decimal(large_sum))
large_sqsum <- sum((sim_lg_df$spread)^2)
writeLines(fmt_decimal(large_sqsum))
large_abssum <- sum(abs(sim_lg_df$spread))
writeLines(fmt_decimal(large_abssum))

Sum?

0.0000000000

Sum of Squares?

3.8405017200

Sum of absolute vals?

7.6806094387
Code
N <- 21
x <- seq(from = 0, to = 1, by = 1 / (N - 1))
y <- x + rnorm(N, 0, 0.05)
mean_y <- mean(y)
spread <- y - mean_y
sim_sm_df <- tibble(x = x, y = y, spread = spread)
sim_sm_df |> ggplot(aes(x=x, y=y)) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color=cbPalette[1], linewidth=g_linewidth) +
  # geom_segment(xend=x, yend=x, linewidth=g_linewidth*2, color=cbPalette[2]) +
  geom_segment(aes(xend=x, yend=x, color=ifelse(y>x,"Positive","Negative")), linewidth=1.5*g_linewidth) +
  geom_point(size=g_pointsize) +
  # coord_equal() +
  theme_dsan("half") +
  scale_color_manual("Spread", values=c("Positive"=cbPalette[3],"Negative"=cbPalette[6]), labels=c("Positive"="Positive","Negative"="Negative"))
small_rsum <- sum(sim_sm_df$spread)
writeLines(fmt_decimal(small_rsum))
small_sqrsum <- sum((sim_sm_df$spread)^2)
writeLines(fmt_decimal(small_sqrsum))
small_abssum <- sum(abs(sim_sm_df$spread))
writeLines(fmt_decimal(small_abssum))

Sum?

0.0000000000

Sum of Squares?

1.9748635217

Sum of absolute vals?

5.5149697440

Why Not Absolute Value?

  • Two feasible ways to prevent positive and negative residuals cancelling out:
    • Absolute error \(\left|y - \widehat{y}\right|\) or squared error \(\left( y - \widehat{y} \right)^2\)
  • But remember: we’re aiming to minimize 👀 these residuals… cue the ghost of calculus past 😱
  • We minimize by taking derivatives… which one is differentiable everywhere?
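
Concretely, compare the two candidate penalties as functions of a single residual \(r\):

\[ \frac{\partial}{\partial r} r^2 = 2r \text{ for every } r, \qquad \frac{\partial}{\partial r} |r| = \begin{cases} 1 & r > 0 \\ -1 & r < 0 \\ \text{undefined} & r = 0 \end{cases} \]

The squared penalty has a derivative everywhere (and it shrinks smoothly to zero as the residual shrinks); the absolute penalty has a kink at exactly the point we care about, \(r = 0\).
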
Code
# Could use facet_grid() here, but it doesn't work too nicely with stat_function() :(
ggplot(data.frame(x=c(-4,4)), aes(x=x)) +
  stat_function(
    fun =~ abs(.x),
    linewidth=g_linewidth * 1.5,
    color=cb_palette[1],
  ) +
  dsan_theme("quarter") +
  labs(
    title="f(x) = |x|",
    y="f(x)"
  )

Code
library(latex2exp)
x2_label <- TeX("$f(x) = x^2$")
ggplot(data.frame(x=c(-4,4)), aes(x=x)) +
  stat_function(
    fun =~ .x^2,
    linewidth = g_linewidth * 1.5,
    color=cb_palette[2],
  ) +
  dsan_theme("quarter") +
  labs(
    title=x2_label,
    y="f(x)"
  )

Outliers Penalized Quadratically

  • May feel arbitrary at first (we’re “forced” to use squared error because of calculus?)
  • But squaring also has important consequences for “learnability” via gradient descent!
Code
library(latex2exp)
set.seed(5300)
slope <- 0.5
x_vals <- seq(from = -20, to = 20, length.out = 15)
N <- length(x_vals)
y_vals <- slope * x_vals + rnorm(N, 0, 5)
spread_vals <- y_vals - slope * x_vals
sim_lg_df <- tibble(
  x = x_vals, y = y_vals, spread = spread_vals
)
abs_sum <- round(sum(abs(sim_lg_df$spread)), 3)
abs_label <- TeX(paste0("$\\Sigma = ",abs_sum,"$"))
sim_lg_df |> ggplot(aes(x=x, y=y)) +
  geom_point(size=g_pointsize * 0.9) +
  geom_abline(
    slope=slope,
    intercept=0,
    linetype="dashed",
    color='black',
    linewidth=g_linewidth
  ) +
  geom_segment(
    aes(xend=x, yend=slope * x),
    arrow=arrow(
      angle=90,
      length=unit(0.125,'inches'),
      ends='both'
    ),
    linewidth=1.5*g_linewidth,
    color=cb_palette[7],
    alpha=0.9,
  ) +
  ylim(-15, 15) +
  xlim(-20, 27.5) +
  coord_equal() +
  theme_dsan(base_size=28) +
  labs(title="Absolute Penalties") +
  geom_text(
    data=data.frame(x=15, y=-10), label=abs_label,
    size=12
  )
sq_sum <- round(sum((sim_lg_df$spread)^2), 3)
sq_label <- TeX(paste0("$\\Sigma = ",sq_sum,"$"))
sim_lg_df |> ggplot(aes(x=x, y=y)) +
  geom_point(size=g_pointsize * 0.9) +
  geom_abline(
    slope=slope,
    intercept=0,
    linetype="dashed",
    color='black',
    linewidth=g_linewidth
  ) +
  geom_rect(
    aes(
      xmin=x,
      xmax=x+abs(spread),
      ymin=ifelse(spread < 0, y, y - spread),
      ymax=ifelse(spread < 0, y - spread, y)
    ),
    color='black',
    fill=cb_palette[7],
    alpha=0.2,
  ) +
  ylim(-15, 15) +
  xlim(-20, 27.5) +
  coord_equal() +
  theme_dsan(base_size=30) +
  labs(title="Squared Penalties") +
  geom_text(
    data=data.frame(x=15, y=-10), label=sq_label,
    size=12
  )

Key Features of Regression Line

  • Regression line is BLUE: Best Linear Unbiased Estimator
  • What exactly is it the “best” linear estimator of?

\[ \widehat{y} = \underbrace{\widehat{\beta}_0}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-5mm] \text{intercept}\end{array}}} ~+~ \underbrace{\widehat{\beta}_1}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-4mm] \text{slope}\end{array}}}\cdot x \]

is chosen so that

\[ \widehat{\theta} = \left(\widehat{\beta}_0, \widehat{\beta}_1\right) = \argmin_{\beta_0, \beta_1}\left[ \sum_{x_i \in X} \left(~~\overbrace{\widehat{y}(x_i)}^{\mathclap{\small\text{Predicted }y}} - \overbrace{\expect{Y \mid X = x_i}}^{\small \text{Avg. }y\text{ when }x = x_i}\right)^{2~} \right] \]

Where Did That \(\mathbb{E}[Y \mid X = x_i]\) Come From?

  • From our assumption that the irreducible errors \(\varepsilon_i\) are normally distributed \(\mathcal{N}(0, \sigma^2)\)
  • Kind of an immensely important point, since it gives us a hint for checking whether model assumptions hold: spread around the regression line should be \(\mathcal{N}(0, \sigma^2)\)
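
Here is a minimal sketch of that check, on simulated data where the assumption holds by construction: fit the line, then ask whether the residuals look like draws from \(\mathcal{N}(0, \sigma^2)\).
Code
set.seed(5300)
# Data generated exactly as the model assumes: y = 2 + 3x + N(0, 0.5^2) noise
x <- runif(200)
y <- 2 + 3 * x + rnorm(200, 0, 0.5)
ols_fit <- lm(y ~ x)
par(mfrow = c(1, 2))
# Residuals vs. fitted values: should be a patternless band around zero
plot(fitted(ols_fit), resid(ols_fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normal Q-Q plot: points near the line => residuals look Gaussian
qqnorm(resid(ols_fit)); qqline(resid(ols_fit))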

Heteroskedasticity

  • If spread increases or decreases for larger \(x\), for example, then \(\varepsilon \nsim \mathcal{N}(0, \sigma^2)\)

Figure 3.11 from James et al. (2023)
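
A quick simulation of the same phenomenon: below, the noise standard deviation grows with \(x\), and the residuals show the tell-tale funnel shape even though the conditional mean is still perfectly linear.
Code
library(tidyverse)
set.seed(5300)
n <- 300
x <- runif(n, 0, 10)
# Noise sd grows with x => heteroskedastic errors
y <- 1 + 0.5 * x + rnorm(n, mean = 0, sd = 0.2 * x)
het_fit <- lm(y ~ x)
tibble(x = x, resid = resid(het_fit)) |>
  ggplot(aes(x = x, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual spread increasing with x", y = "Residual")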

But… What About Other Types of Vars?

  • 5000: you saw nominal, ordinal, cardinal vars
  • 5100: you wrestled with discrete vs. continuous RVs
  • Good News #1: Regression can handle all these types+more!
  • Good News #2: Distinctions between classification and regression diminish as you learn fancier regression methods!
    • tldr: Predict continuous probabilities \(\Pr(Y) \in [0,1]\) (regression), then guess 1 if \(\Pr(Y) > 0.5\) (classification); see the sketch after this list
  • By the end of 5300 you should have something in your toolbelt for handling most cases like “I want to do [regression / classification], but my data is [not cardinal+continuous]”
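
A tiny preview of that regression-then-threshold recipe, using logistic regression (which we’ll meet properly later in the course) on simulated binary data:
Code
set.seed(5300)
n <- 500
x <- rnorm(n)
# True model: Pr(Y = 1 | X = x) follows a logistic curve in x
p <- 1 / (1 + exp(-(0.5 + 2 * x)))
y <- rbinom(n, size = 1, prob = p)
# "Regression" step: estimate Pr(Y = 1 | X = x)
logit_fit <- glm(y ~ x, family = binomial)
p_hat <- predict(logit_fit, type = "response")
# "Classification" step: guess 1 whenever the predicted probability exceeds 0.5
y_hat <- as.integer(p_hat > 0.5)
mean(y_hat == y)  # accuracy of the thresholded predictions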

Quiz Review

Objective Functions

“Fitting” a statistical model to data means minimizing some loss function that measures “how bad” our predictions are:

Note: Optimization Problems: General Form

Find \(x^*\), the solution to

\[ \begin{align} \min_{x} ~ & f(x) &\text{(Objective function)} \\ \text{s.t. } ~ & g(x) = 0 &\text{(Constraints)} \end{align} \]

  • Earlier we were able to write \(x^* = \argmax_x{f(x)}\), since there were no constraints. Is there a way to write a formula like this with constraints?
  • Answer: Yes! Thx Giuseppe-Luigi Lagrangia = Joseph-Louis Lagrange:

\[ x^* = \argmax_{x,~\lambda}f(x) - \lambda[g(x)] \]

(Why worry about constraints when OLS is unconstrained? …Soon we’ll need to penalize complexity!)
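
A quick worked example of the recipe (two variables for concreteness): maximize \(f(x, y) = xy\) subject to the constraint \(g(x, y) = x + y - 1 = 0\). Form the Lagrangian and set all partial derivatives to zero:

\[ \mathcal{L}(x, y, \lambda) = xy - \lambda(x + y - 1), \quad \frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, ~~ \frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, ~~ \frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 1) = 0, \]

so \(x^* = y^* = \lambda^* = \frac{1}{2}\), and the constrained maximum is \(f\left(\frac{1}{2}, \frac{1}{2}\right) = \frac{1}{4}\).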

Example Problem

Note: Example 1: Unconstrained Optimization

Find \(x^*\), the solution to

\[ \begin{align} \min_{x} ~ & f(x) = 3x^2 - x \\ \text{s.t. } ~ & \varnothing \end{align} \]

Tip: Our Plan
  • Compute the derivative \(f'(x) = \frac{\partial}{\partial x}f(x)\),
  • Set it equal to zero: \(f'(x) = 0\), and
  • Solve this equality for \(x\), i.e., find values \(x^*\) satisfying \(f'(x^*) = 0\)

Computing the derivative:

\[ f'(x) = \frac{\partial}{\partial x}f(x) = \frac{\partial}{\partial x}\left[3x^2 - x\right] = 6x - 1 \]

Solving for \(x^*\), the value(s) satisfying \(f'(x^*) = 0\) for the just-derived \(f'(x)\):

\[ f'(x^*) = 0 \iff 6x^* - 1 = 0 \iff x^* = \frac{1}{6}. \]
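
We can sanity-check this numerically with base R's one-dimensional optimizer (a sketch; the bracketing interval is arbitrary):
Code
f <- function(x) 3 * x^2 - x
# Numerical minimization over an arbitrary interval containing the optimum
optimize(f, interval = c(-10, 10))
# $minimum is approximately 1/6 ~ 0.1667, and $objective is f(1/6) = -1/12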

Derivative Cheatsheet

| Type of Thing | Thing | Change in Thing when \(x\) Changes by a Tiny Amount |
|---|---|---|
| Polynomial | \(f(x) = x^n\) | \(f'(x) = \frac{\partial}{\partial x}f(x) = nx^{n-1}\) |
| Fraction | \(f(x) = \frac{1}{x}\) | Use the Polynomial rule (since \(\frac{1}{x} = x^{-1}\)) to get \(f'(x) = -\frac{1}{x^2}\) |
| Logarithm | \(f(x) = \ln(x)\) | \(f'(x) = \frac{\partial}{\partial x}\ln(x) = \frac{1}{x}\) |
| Exponential | \(f(x) = e^x\) | \(f'(x) = \frac{\partial}{\partial x}e^x = e^x\) (🧐❗️) |
| Multiplication | \(f(x) = g(x)h(x)\) | \(f'(x) = g'(x)h(x) + g(x)h'(x)\) |
| Division | \(f(x) = \frac{g(x)}{h(x)}\) | Too hard to memorize… turn it into Multiplication, as \(f(x) = g(x)(h(x))^{-1}\) |
| Composition (Chain Rule) | \(f(x) = g(h(x))\) | \(f'(x) = g'(h(x))h'(x)\) |
| Fancy Logarithm | \(f(x) = \ln(g(x))\) | \(f'(x) = \frac{g'(x)}{g(x)}\) by the Chain Rule |
| Fancy Exponential | \(f(x) = e^{g(x)}\) | \(f'(x) = g'(x)e^{g(x)}\) by the Chain Rule |
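
The Chain Rule row is exactly what the OLS derivation needs: differentiating a single squared residual with respect to the slope estimate, with outer function \(g(u) = u^2\) and inner function \(h(\widehat{\beta}_1) = y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\),

\[ \frac{\partial}{\partial \widehat{\beta}_1}\left( y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i \right)^2 = 2\left( y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i \right)\cdot(-x_i) = -2x_i\widehat{\varepsilon}_i. \]

Summing these over \(i\) and setting the total to zero gives one of the two first-order conditions that pin down \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\).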

References

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. https://www.dropbox.com/scl/fi/asbumi3g0gqa4xl9va7wp/Andrew-Gelman-Jennifer-Hill-Data-Analysis-Using-Regression-and-Multilevel_Hierarchical-Models.pdf?rlkey=zf8icjhm7rswvxrpm7d10m65o&dl=1.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning: With Applications in Python. Springer Nature. https://books.google.com?id=ygzJEAAAQBAJ.