Week 2: Linear Regression

DSAN 5300: Statistical Learning
Spring 2026, Georgetown University

Jeff Jacobs

jj1088@georgetown.edu

Monday, January 12, 2026

Schedule

Today’s Planned Schedule:

          Start    End      Topic
Lecture   6:30pm   7:10pm   Simple Linear Regression
          7:10pm   7:30pm   Deriving the OLS Solution
          7:30pm   8:00pm   Interpreting OLS Output
Break!    8:00pm   8:10pm
          8:10pm   8:30pm   Quiz Review
          8:30pm   9:00pm   Quiz 2!

Linear Regression

  • What happens to my dependent variable \(Y\) when my independent variable \(X\) increases by 1 unit?

  • Keep the goal in front of your mind:

    The Goal of Regression

    Find a line \(\widehat{y} = mx + b\) that best predicts \(Y\) for given values of \(X\)

  • Sanity Note 1: this goal \(\Rightarrow\) we measure error via vertical distance from the line
  • Sanity Note 2: this goal \(\Rightarrow\) we are modeling the distribution of \(\boxed{Y \mid X}\), not \((X,Y)\)!
    • Treating \(X\) and \(Y\) symmetrically (predicting each from the other) \(\Rightarrow\) the principal component line, which \(\neq\) regression!

How Do We Define “Best”?

  • Intuitively, two different ways to measure how well a line fits the data:
Figure 1: The line that minimizes blue distances does not predict \(Y\) as well as regression line, despite intuitive appeal!
Figure 2: The line that minimizes green distances optimally predicts \(Y\) from \(X\), in a mathematically-provable sense!

Principal Component Analysis

  • Principal Component Line can be used to project the data onto its dimension of highest variance (recap from 5000!)
  • More simply: PCA can discover meaningful axes in data (unsupervised learning / exploratory data analysis settings)
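As a quick refresher on that recap (a minimal sketch with made-up variable names and seed, not taken from the 5000 materials), base R's prcomp() recovers exactly that "dimension of highest variance":

Code
# Sketch: first principal component of two correlated variables (base R only)
set.seed(5300)                            # arbitrary seed, for reproducibility
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.5)       # y correlated with x
pc <- prcomp(cbind(x, y), scale. = TRUE)  # prcomp() is in base R's stats package
pc$rotation[, 1]                          # loadings of PC1: the direction of highest variance
summary(pc)                               # share of total variance captured by each PC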

Create Your Own Dimension!

…And Use It for EDA

But in Our Case…

  • \(x\) and \(y\) dimensions already have meaning, and we have a hypothesis about the effect of \(x\) on \(y\)!

The Regression Hypothesis \(\mathcal{H}_{\text{reg}}\)

Given data \((X, Y)\), we estimate \(\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1x\), hypothesizing that:

  • Starting from \(y = \underbrace{\widehat{\beta}_0}_{\mathclap{\text{Intercept}}}\) when \(x = 0\),
  • An increase of \(x\) by 1 unit is associated with an increase of \(y\) by \(\underbrace{\widehat{\beta}_1}_{\mathclap{\text{Coefficient}}}\) units
  • We want to measure how well our line predicts \(y\) for any given \(x\) value \(\implies\) vertical distance from regression line

Example: Advertising Effects

  • Independent variable: $ put into advertisements; Dependent variable: Sales
  • Goal 1: Predict sales for a given allocation
  • Goal 2: Infer best allocation for a given advertising budget (more simply: a new $1K appears! Where should we invest it?)

Simple Linear Regression

  • For now, we treat Newspaper, Radio, TV advertising separately: how much do sales increase per $1 into [medium]? (Later we’ll consider them jointly: multiple regression)

Our model:

\[ Y = \underbrace{\beta_0}_{\mathclap{\text{Intercept}}} + \underbrace{\beta_1}_{\mathclap{\text{Slope}}}X + \varepsilon \]

…Generates predictions via:

\[ \widehat{y} = \underbrace{\widehat{\beta}_0}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-5mm] \text{intercept}\end{array}}} ~+~ \underbrace{\widehat{\beta}_1}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-4mm] \text{slope}\end{array}}}\cdot x \]

  • Note how these predictions will be wrong (unless the data is perfectly linear)
  • We’ve accounted for this in our model (by including \(\varepsilon\) term)!
  • But, we’d like to find estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) that produce the “least wrong” predictions: motivates focus on residuals \(\widehat{\varepsilon}_i\)

\[ \widehat{\varepsilon}_i = \underbrace{y_i}_{\mathclap{\small\begin{array}{c}\text{Real} \\[-5mm] \text{label}\end{array}}} - \underbrace{\widehat{y}_i}_{\mathclap{\small\begin{array}{c}\text{Predicted} \\[-5mm] \text{label}\end{array}}} = \underbrace{y_i}_{\mathclap{\small\begin{array}{c}\text{Real} \\[-5mm] \text{label}\end{array}}} - \underbrace{ \left( \widehat{\beta}_0 + \widehat{\beta}_1 \cdot x_i \right) }_{\text{\small{Predicted label}}} \]

Least Squares: Minimizing Residuals

What can we optimize to ensure these residuals are as small as possible?

Residual summaries for two candidate lines (from the slide's interactive figure):

                          Line 1          Line 2
Sum                       0.0000000000    0.0000000000
Sum of Squares            3.8405017200    1.9748635217
Sum of absolute vals      7.6806094387    5.5149697440

Why Not Absolute Value?

  • Two feasible ways to prevent positive and negative residuals cancelling out:
    • Absolute error \(\left|y - \widehat{y}\right|\) or squared error \(\left( y - \widehat{y} \right)^2\)
  • But remember: we’re aiming to minimize 👀 these residuals; ghost of calculus past 😱
  • We minimize by taking derivatives… which one is differentiable everywhere?
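Spelling that last question out (a one-line check, writing the residual as \(u = y - \widehat{y}\)):

\[ \frac{d}{du}\,u^2 = 2u \text{ for every } u, \qquad \frac{d}{du}\,|u| = \begin{cases} 1 & u > 0 \\ -1 & u < 0 \\ \text{undefined} & u = 0 \end{cases} \]

Squared error is differentiable everywhere, while absolute error has a kink exactly where the residual hits zero.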

Outliers Penalized Quadratically

  • May feel arbitrary at first (we’re “forced” to use squared error because of calculus?)
  • It also has important consequences for “learnability” via gradient descent!
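A small illustration of that quadratic penalty in action (a sketch with made-up data, not from the Advertising example): a single badly mismeasured point can drag the least-squares line, because its huge residual gets squared in the objective.

Code
# Sketch: one high-leverage outlier noticeably moves the OLS slope
set.seed(5300)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 1)           # roughly linear data, true slope = 2
coef(lm(y ~ x))["x"]                     # slope without the outlier: close to 2
x_out <- c(x, 40); y_out <- c(y, 0)      # add one far-out point with a wildly wrong y
coef(lm(y_out ~ x_out))["x_out"]         # slope is dragged well below 2 by the squared loss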

Key Features of Regression Line

  • Regression line is BLUE: Best Linear Unbiased Estimator
  • What exactly is it the “best” linear estimator of?

\[ \widehat{y} = \underbrace{\widehat{\beta}_0}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-5mm] \text{intercept}\end{array}}} ~+~ \underbrace{\widehat{\beta}_1}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-4mm] \text{slope}\end{array}}}\cdot x \]

is chosen so that

\[ \widehat{\theta} = \left(\widehat{\beta}_0, \widehat{\beta}_1\right) = \argmin_{\beta_0, \beta_1}\left[ \sum_{x_i \in X} \left(~~\overbrace{\widehat{y}(x_i)}^{\mathclap{\small\text{Predicted }y}} - \overbrace{\mathbb{E}[Y \mid X = x_i]}^{\small \text{Avg. }y\text{ when }x = x_i}\right)^{2} \right] \]

Where Did That \(\mathbb{E}[Y \mid X = x_i]\) Come From?

  • From our assumption that the irreducible errors \(\varepsilon_i\) are normally distributed \(\mathcal{N}(0, \sigma^2)\)


  • Kind of an immensely important point, since it gives us a hint for checking whether model assumptions hold: spread around the regression line should be \(\mathcal{N}(0, \sigma^2)\)

Heteroskedasticity

  • If spread increases or decreases for larger \(x\), for example, then \(\varepsilon \nsim \mathcal{N}(0, \sigma^2)\)

Figure 3.11 from James et al. (2023)

But… What About Other Types of Vars?

  • 5000: you saw nominal, ordinal, cardinal vars
  • 5100: you wrestled with discrete vs. continuous RVs
  • Good News #1: Regression can handle all these types+more!
  • Good News #2: Distinctions between classification and regression diminish as you learn fancier regression methods!
    • tldr: Predict continuous probabilities \(\Pr(Y = 1) \in [0,1]\) (regression), then guess 1 if \(\Pr(Y = 1) > 0.5\) (classification)
  • By the end of 5300 you should have something in your toolbelt for handling most cases like “I want to do [regression / classification], but my data is [not cardinal+continuous]”

Deriving the Least Squares Estimate

A Sketch (HW is the Full Thing)

  • OLS for regression without an intercept: Which line through the origin best predicts \(Y\)?
  • (Good practice + reminder of how restricted linear models are!)

\[ Y = \beta_1 X + \varepsilon \]

Evaluating with Residuals

Now the Math

\[ \begin{align*} \beta_1^* = \overbrace{\argmin}^{\begin{array}{c} \text{\small{Find thing}} \\[-5mm] \text{\small{that minimizes}}\end{array}}_{\beta_1}\left[ \sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 \right] = \argmin_{\beta_1}\left[ \sum_{i=1}^{n}(y_i - \beta_1x_i)^2 \right] \end{align*} \]

Taking the derivative of this objective with respect to \(\beta_1\):

\[ \frac{\partial}{\partial\beta_1}\left[ \sum_{i=1}^{n}(\beta_1x_i - y_i)^2 \right] = \sum_{i=1}^{n}\frac{\partial}{\partial\beta_1}(\beta_1x_i - y_i)^2 = \sum_{i=1}^{n}2(\beta_1x_i - y_i)x_i \]

And our first-order condition means that:

\[ \sum_{i=1}^{n}2(\beta_1^*x_i - y_i)x_i = 0 \iff \beta_1^*\sum_{i=1}^{n}x_i^2 = \sum_{i=1}^{n}x_iy_i \iff \boxed{\beta_1^* = \frac{\sum_{i=1}^{n}x_iy_i}{\sum_{i=1}^{n}x_i^2}} \]
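Since the HW asks for the full derivation, here is just a numerical sanity check (a sketch with simulated data, not part of the assignment): the boxed formula should match what lm() reports when the intercept is dropped.

Code
# Sketch: verify beta_1* = sum(x*y) / sum(x^2) against a no-intercept lm() fit
set.seed(5300)
x <- runif(100, 0, 10)
y <- 3 * x + rnorm(100, sd = 2)            # true slope 3, plus noise
beta1_formula <- sum(x * y) / sum(x^2)     # the boxed closed-form solution
beta1_lm <- coef(lm(y ~ x - 1))[["x"]]     # "~ x - 1" (or "~ 0 + x") drops the intercept
c(formula = beta1_formula, lm = beta1_lm)  # the two estimates agree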

Doing Things With Regression

Regression: R vs. statsmodels

In (Base) R: lm()
Code
lin_model <- lm(sales ~ TV, data=ad_df)
summary(lin_model)

Call:
lm(formula = sales ~ TV, data = ad_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,    Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

General syntax:

lm(
  dependent ~ independent + controls,
  data = my_df
)
In Python: smf.ols()
Code
import statsmodels.formula.api as smf
results = smf.ols("sales ~ TV", data=ad_df).fit()
print(results.summary(slim=True))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
No. Observations:                 200   F-statistic:                     312.1
Covariance Type:            nonrobust   Prob (F-statistic):           1.47e-42
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.0326      0.458     15.360      0.000       6.130       7.935
TV             0.0475      0.003     17.668      0.000       0.042       0.053
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

General syntax:

smf.ols(
  "dependent ~ independent + controls",
  data = my_df
)

Interpreting Output


Call:
lm(formula = military ~ industrial, data = gdp_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3354 -1.0997 -0.3870  0.6081  6.7508 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.61969    0.59526   1.041   0.3010  
industrial   0.05253    0.02019   2.602   0.0111 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.671 on 79 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.07895,   Adjusted R-squared:  0.06729 
F-statistic: 6.771 on 1 and 79 DF,  p-value: 0.01106

Zooming In: Coefficients

             Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  0.61969    0.59526      1.041     0.3010
industrial   0.05253    0.02019      2.602     0.0111 *

Column guide: Estimate = \(\widehat{\beta}\); Std. Error = uncertainty; t value = test statistic \(t\); Pr(>|t|) = how extreme is \(t\)?; \(*\) = significance level

\[ \widehat{y} \approx \overset{\beta_0}{\underset{\small \pm 0.595}{0.620}} + \overset{\beta_1}{\underset{\small \pm 0.020}{0.053}} \cdot x \]
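These same numbers can be pulled out programmatically (a sketch; the object name gdp_model is mine, with gdp_df as in the output above):

Code
# Sketch: refit military ~ industrial and extract the coefficient table
gdp_model <- lm(military ~ industrial, data = gdp_df)
summary(gdp_model)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
confint(gdp_model)                # 95% CIs, roughly Estimate +/- 2 * Std. Error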

Zooming In: Significance

             Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  0.61969    0.59526      1.041     0.3010
industrial   0.05253    0.02019      2.602     0.0111 *

Column guide: Estimate = \(\widehat{\beta}\); Std. Error = uncertainty; t value = test statistic \(t\); Pr(>|t|) = how extreme is \(t\)?; \(*\) = significance level

The Residual Plot

Recall homoskedasticity assumption: Given our model

\[ y_i = \beta_0 + \beta_1x_i + \varepsilon_i \]

the errors \(\varepsilon_i\) should not vary systematically with \(i\)

Formally: \(\forall i \left[ \mathrm{Var}[\varepsilon_i] = \sigma^2 \right]\)

Q-Q Plot

  • If \((y - \widehat{y}) \sim \mathcal{N}(0, \sigma^2)\), the (standardized) residual quantiles should lie approximately along the 45° line:
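Both diagnostic plots are one-liners in base R (a minimal sketch, assuming the lin_model object from the lm(sales ~ TV) fit earlier):

Code
# Sketch: residual plot (homoskedasticity check) and Q-Q plot (normality check)
plot(fitted(lin_model), resid(lin_model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)       # residuals should scatter evenly around this line
qqnorm(resid(lin_model))     # sample quantiles vs. theoretical normal quantiles
qqline(resid(lin_model))     # points near this line => residuals roughly normal
# plot(lin_model, which = 1:2) gives similar built-in diagnostics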

Multiple Linear Regression

  • Notation: \(x_{i,j}\) = value of independent variable \(j\) for person/observation \(i\)
  • \(M\) = total number of independent variables

\[ \widehat{y}_i = \beta_0 + \beta_1x_{i,1} + \beta_2x_{i,2} + \cdots + \beta_M x_{i,M} \]

  • \(\beta_j\) interpretation: a one-unit increase in \(x_{i,j}\) is associated with a \(\beta_j\) unit increase in \(y_i\), holding all other independent variables constant
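In R this is the same lm() call with extra terms on the right-hand side of the formula (a sketch; the object name mlr_model is mine, and ad_df is the advertising data from earlier). The corresponding output appears under "Interpreting MLR" below.

Code
# Sketch: multiple linear regression with all three advertising media
mlr_model <- lm(sales ~ TV + radio + newspaper, data = ad_df)
summary(mlr_model)   # one slope per medium, each interpreted "holding the others constant"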

Visualizing Multiple Linear Regression

(ISLR Fig 3.5): A pronounced non-linear relationship. Positive residuals (visible above the surface) tend to lie along the 45° line, where budgets are split evenly. Negative residuals (most not visible) tend to be away from this line, where budgets are more lopsided.

Interpreting MLR


Call:
lm(formula = sales ~ TV + radio + newspaper, data = ad_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Holding radio and newspaper spending constant

  • An increase of $1K in spending on TV ads is associated with…
  • An increase in sales of 46 units

Holding TV and newspaper spending constant

  • An increase of $1K in spending on radio ads is associated with…
  • An increase in sales of 189 units

But Wait…

\[ \texttt{sales} = \overset{*}{\beta_0} + \overset{*}{\beta_1}\texttt{newspaper} \]

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 12.351407   0.621420 19.8761 < 2.2e-16 ***
newspaper    0.054693   0.016576  3.2996  0.001148 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\[ \texttt{sales} = \overset{*}{\beta_0} + \overset{*}{\beta_1}\texttt{TV} + \overset{*}{\beta_2}\texttt{radio} + \overset{(~~)}{\beta_3}\texttt{paper} \]

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.9388894  0.3119082  9.4223   <2e-16 ***
TV           0.0457646  0.0013949 32.8086   <2e-16 ***
radio        0.1885300  0.0086112 21.8935   <2e-16 ***
newspaper   -0.0010375  0.0058710 -0.1767   0.8599    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • newspaper \(\overset{*}{\rightarrow}\) sales in SLR, but newspaper \(\overset{*}{\not\rightarrow}\) sales in MLR?
  • Correlations among the features \(\Rightarrow\) MLR results can be drastically different from SLR results
  • This is a good thing! It’s how we’re able to control for confounding vars!

Correlations Among Features

Code
library(dplyr)  # provides select(); |> is the base R pipe
cor(ad_df |> select(-id))
                  TV      radio  newspaper     sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
radio     0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000
  • Observe how \(\text{cor}(\texttt{radio}, \texttt{newspaper}) \approx 0.35\) (highest feat-feat correlation)
  • In markets where we spend more on radio our sales will tend to be higher…
  • Corr matrix \(\implies\) we spend more on newspaper in those same markets…
  • In SLR which only examines sales vs. newspaper, we (correctly!) observe that higher values of newspaper are associated with higher values of sales
  • In essence, newspaper advertising is a surrogate for radio advertising \(\implies\) in our SLR, newspaper “gets credit” for the association between radio and sales

Regression Superpower: Incorporating Categorical Vars

(Preview for next week)

\[ Y = \beta_0 + \beta_1 \times \texttt{income} \]

\[ \begin{align*} Y = &\beta_0 + \beta_1 \times \texttt{income} + \beta_2 \times \texttt{Student} \\ &+ \beta_3 \times (\texttt{Student} \times \texttt{Income}) \end{align*} \]

  • How does the \(\texttt{Student} \times \texttt{Income}\) term help?
  • Understanding this setup will open up a vast array of possibilities for regression 😎
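As a preview of the syntax (a sketch; credit_df, balance, income, and student are placeholder names, e.g. for ISLR's Credit data):

Code
# Sketch: an interaction term gives students their own intercept AND income slope
# credit_df is a hypothetical data frame with columns balance, income, student
int_model <- lm(balance ~ income * student, data = credit_df)
summary(int_model)
# "income * student" expands to income + student + income:student, so the
# income:student coefficient lets the income slope differ between groups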

Quiz Review

Objective Functions

“Fitting” a statistical model to data means minimizing some loss function that measures “how bad” our predictions are:

Optimization Problems: General Form

Find \(x^*\), the solution to

\[ \begin{align} \min_{x} ~ & f(x) &\text{(Objective function)} \\ \text{s.t. } ~ & g(x) = 0 &\text{(Constraints)} \end{align} \]

  • Earlier we were able to write \(x^* = \argmax_x{f(x)}\), since there were no constraints. Is there a way to write a formula like this with constraints?
  • Answer: Yes! Thx Giuseppe-Luigi Lagrangia = Joseph-Louis Lagrange:

\[ x^* = \argmax_{x,~\lambda}f(x) - \lambda[g(x)] \]
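A tiny worked example of that recipe (my own illustration, not a quiz problem): maximize \(f(x, y) = xy\) subject to \(g(x, y) = x + y - 1 = 0\).

\[ \mathcal{L}(x, y, \lambda) = xy - \lambda(x + y - 1), \qquad \frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \quad \frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \quad \frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 1) = 0 \]

Solving the three equations gives \(x^* = y^* = \lambda^* = \frac{1}{2}\), so the constrained maximum is \(f(x^*, y^*) = \frac{1}{4}\).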

Example Problem

Example 1: Unconstrained Optimization

Find \(x^*\), the solution to

\[ \begin{align} \min_{x} ~ & f(x) = 3x^2 - x \\ \text{s.t. } ~ & \varnothing \end{align} \]

Our Plan

  • Compute the derivative \(f'(x) = \frac{\partial}{\partial x}f(x)\),
  • Set it equal to zero: \(f'(x) = 0\), and
  • Solve this equality for \(x\), i.e., find values \(x^*\) satisfying \(f'(x^*) = 0\)

Computing the derivative:

\[ f'(x) = \frac{\partial}{\partial x}f(x) = \frac{\partial}{\partial x}\left[3x^2 - x\right] = 6x - 1, \]

Solving for \(x^*\), the value(s) satisfying \(f'(x^*) = 0\) for the just-derived \(f'(x)\):

\[ f'(x^*) = 0 \iff 6x^* - 1 = 0 \iff x^* = \frac{1}{6}. \]
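R can double-check both steps (a sketch, not required for the quiz): D() differentiates the expression symbolically, and optimize() finds the minimizer numerically.

Code
# Sketch: symbolic derivative and numeric minimum of f(x) = 3x^2 - x
f <- function(x) 3 * x^2 - x
D(expression(3 * x^2 - x), "x")      # base R symbolic derivative: 3 * (2 * x) - 1
optimize(f, interval = c(-10, 10))   # $minimum is approximately 1/6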

Derivative Cheatsheet

Type of Thing Thing Change in Thing when \(x\) Changes by Tiny Amount
Polynomial \(f(x) = x^n\) \(f'(x) = \frac{\partial}{\partial x}f(x) = nx^{n-1}\)
Fraction \(f(x) = \frac{1}{x}\) Use Polynomial rule (since \(\frac{1}{x} = x^{-1}\)) to get \(f'(x) = -\frac{1}{x^2}\)
Logarithm \(f(x) = \ln(x)\) \(f'(x) = \frac{\partial}{\partial x}\ln(x) = \frac{1}{x}\)
Exponential \(f(x) = e^x\) \(f'(x) = \frac{\partial}{\partial x}e^x = e^x\) (🧐❗️)
Multiplication \(f(x) = g(x)h(x)\) \(f'(x) = g'(x)h(x) + g(x)h'(x)\)
Division \(f(x) = \frac{g(x)}{h(x)}\) Too hard to memorize… turn it into Multiplication, as \(f(x) = g(x)(h(x))^{-1}\)
Composition (Chain Rule) \(f(x) = g(h(x))\) \(f'(x) = g'(h(x))h'(x)\)
Fancy Logarithm \(f(x) = \ln(g(x))\) \(f'(x) = \frac{g'(x)}{g(x)}\) by Chain Rule
Fancy Exponential \(f(x) = e^{g(x)}\) \(f'(x) = g'(x)e^{g(x)}\) by Chain Rule

References

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. https://www.dropbox.com/scl/fi/asbumi3g0gqa4xl9va7wp/Andrew-Gelman-Jennifer-Hill-Data-Analysis-Using-Regression-and-Multilevel_Hierarchical-Models.pdf?rlkey=zf8icjhm7rswvxrpm7d10m65o&dl=1.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning: With Applications in Python. Springer Nature. https://books.google.com?id=ygzJEAAAQBAJ.