Week 2: Linear Regression

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author: Jeff Jacobs
Published: Monday, January 13, 2025

Linear Regression

  • What happens to my dependent variable \(Y\) when my independent variable \(X\) changes by 1 unit?

  • Whenever you carry out a regression, keep the goal in the front of your mind:

    The Goal of Regression

    Find a line \(\widehat{y} = mx + b\) that best predicts \(Y\) for given values of \(X\)

How Do We Define “Best”?

  • Intuitively, two different ways to measure how well a line fits the data:
Code
library(tidyverse)
set.seed(5321)
N <- 11
x <- seq(from = 0, to = 1, by = 1 / (N - 1))
y <- x + rnorm(N, 0, 0.2)
mean_y <- mean(y)
spread <- y - mean_y
df <- tibble(x = x, y = y, spread = spread)
ggplot(df, aes(x=x, y=y)) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color=cbPalette[1], linewidth=g_linewidth*2) +
  geom_segment(aes(xend=(x+y)/2, yend=(x+y)/2), linewidth=g_linewidth*2, color=cbPalette[2]) +
  geom_point(size=g_pointsize) +
  coord_equal() +
  xlim(0, 1) + ylim(0, 1) +
  dsan_theme("half") +
  labs(
    title = "Principal Component Line"
  )
ggplot(df, aes(x=x, y=y)) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color=cbPalette[1], linewidth=g_linewidth*2) +
  geom_segment(aes(xend=x, yend=x), linewidth=g_linewidth*2, color=cbPalette[2]) +
  geom_point(size=g_pointsize) +
  coord_equal() +
  xlim(0, 1) + ylim(0, 1) +
  dsan_theme("half") +
  labs(
    title = "Regression Line"
  )

On the difference between these two lines, and why it matters, I cannot recommend Gelman and Hill (2007) enough!
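
To make the contrast concrete, here is a minimal sketch (not from the original slides) computing both slopes for the simulated df above: lm() minimizes vertical distances, while the first principal component minimizes perpendicular distances.

Code
# OLS slope (vertical distances) vs. PC1 slope (perpendicular distances)
ols_slope <- coef(lm(y ~ x, data = df))[["x"]]
pc1 <- prcomp(df[, c("x", "y")])$rotation[, 1]
pca_slope <- unname(pc1["y"] / pc1["x"])
c(ols = ols_slope, pca = pca_slope)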

Predictive Example: Advertising Effects

  • Independent variable: $ put into advertisements
  • Dependent variable: Sales
  • Goal: Figure out a good way to allocate an advertising budget
Code
library(tidyverse)
ad_df <- read_csv("assets/Advertising.csv") |> rename(id=`...1`)
Code
long_df <- ad_df |> pivot_longer(-c(id, sales), names_to="medium", values_to="allocation")
long_df |> ggplot(aes(x=allocation, y=sales)) +
  geom_point() +
  facet_wrap(vars(medium), scales="free_x") +
  geom_smooth(method='lm', formula="y ~ x") +
  theme_dsan() +
  labs(
    x = "Allocation ($1K)",
    y = "Sales (1K Units)"
  )

Explanatory Example: Industrialization Effects

Code
library(tidyverse)
gdp_df <- read_csv("assets/gdp_pca.csv")
Code
mil_plot <- gdp_df |> ggplot(aes(x=industrial, y=military)) +
  geom_point(size=0.5*g_pointsize) +
  geom_smooth(method='lm', formula="y ~ x", linewidth=1) +
  theme_dsan() +
  labs(
    title="Military Exports vs. Industrialization",
    x="Industrial Production (% of GDP)",
    y="Military Exports (% of All Exports)"
  )
mil_plot

Simple Linear Regression

  • For now, we treat Newspaper, Radio, and TV advertising separately: how much do sales increase per additional dollar spent on [medium]? (Later we’ll consider them jointly: multiple regression)

  • Our model:

    \[ Y = \underbrace{\param{\beta_0}}_{\mathclap{\text{Intercept}}} + \underbrace{\param{\beta_1}}_{\mathclap{\text{Slope}}}X + \varepsilon \]

  • This model generates predictions via

    \[ \widehat{y} = \underbrace{\widehat{\beta_0}}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-5mm] \text{intercept}\end{array}}} ~+~ \underbrace{\widehat{\beta_1}}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-4mm] \text{slope}\end{array}}}\cdot x \]

  • Note how these predictions will be wrong (unless the data is perfectly linear)

  • We’ve accounted for this in our model (by including \(\varepsilon\) term)!

  • But, we’d like to find estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) that produce the “least wrong” predictions: motivates focus on residuals…

Least Squares: Minimizing Residuals

What can we optimize to ensure these residuals are as small as possible?

Code
N <- 21
x <- seq(from = 0, to = 1, by = 1 / (N - 1))
y <- x + rnorm(N, 0, 0.25)
mean_y <- mean(y)
spread <- y - mean_y
sim_lg_df <- tibble(x = x, y = y, spread = spread)
sim_lg_df |> ggplot(aes(x=x, y=y)) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color=cbPalette[1], linewidth=g_linewidth) +
  # geom_segment(xend=x, yend=x, linewidth=g_linewidth*2, color=cbPalette[2]) +
  geom_segment(aes(xend=x, yend=x, color=ifelse(y>x,"Positive","Negative")), linewidth=1.5*g_linewidth) +
  geom_point(size=g_pointsize) +
  # coord_equal() +
  theme_dsan("half") +
  scale_color_manual("Spread", values=c("Positive"=cbPalette[3],"Negative"=cbPalette[6]), labels=c("Positive"="Positive","Negative"="Negative")) +
  labs(
    title = "Regression Line"
  )

  • Sum?
Code
large_sum <- sum(sim_lg_df$spread)
writeLines(fmt_decimal(large_sum))
0.0000000000
  • Sum of Squares?
Code
large_sqsum <- sum((sim_lg_df$spread)^2)
writeLines(fmt_decimal(large_sqsum))
3.8405017200
Code
N <- 21
x <- seq(from = 0, to = 1, by = 1 / (N - 1))
y <- x + rnorm(N, 0, 0.05)
mean_y <- mean(y)
spread <- y - mean_y
sim_sm_df <- tibble(x = x, y = y, spread = spread)
sim_sm_df |> ggplot(aes(x=x, y=y)) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color=cbPalette[1], linewidth=g_linewidth) +
  # geom_segment(xend=x, yend=x, linewidth=g_linewidth*2, color=cbPalette[2]) +
  geom_segment(aes(xend=x, yend=x, color=ifelse(y>x,"Positive","Negative")), linewidth=1.5*g_linewidth) +
  geom_point(size=g_pointsize) +
  # coord_equal() +
  theme_dsan("half") +
  scale_color_manual("Spread", values=c("Positive"=cbPalette[3],"Negative"=cbPalette[6]), labels=c("Positive"="Positive","Negative"="Negative")) +
  labs(
    title = "Regression Line"
  )

  • Sum?
Code
small_rsum <- sum(sim_sm_df$spread)
writeLines(fmt_decimal(small_rsum))
0.0000000000
  • Sum of Squares?
Code
small_sqrsum <- sum((sim_sm_df$spread)^2)
writeLines(fmt_decimal(small_sqrsum))
1.9748635217

Why Not Absolute Value?

  • Two feasible ways to prevent positive and negative residuals from cancelling out:
    • Absolute value: \(\left|y - \widehat{y}\right|\)
    • Squaring: \(\left( y - \widehat{y} \right)^2\)
  • But remember that we’re aiming to minimize these residuals…
  • Ghost of calculus past 😱: which is differentiable everywhere?
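  • Answer: the squared loss! Recall from calculus:

    \[ \frac{d}{dx}x^2 = 2x \text{ (defined for all } x\text{)}, \qquad \frac{d}{dx}|x| = \begin{cases} -1 & x < 0 \\ +1 & x > 0 \\ \text{undefined} & x = 0 \end{cases} \]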
Code
library(latex2exp)
x2_label <- TeX("$f(x) = x^2$")
Code
ggplot(data.frame(x=c(-4,4)), aes(x=x)) +
  stat_function(fun=~ .x^2, linewidth = g_linewidth) +
  dsan_theme("quarter") +
  labs(
    title=x2_label,
    y="f(x)"
  )

Code
# Could use facet_grid() here, but it doesn't work too nicely with stat_function() :(
ggplot(data.frame(x=c(-4,4)), aes(x=x)) +
  stat_function(fun=~ abs(.x), linewidth=g_linewidth) +
  dsan_theme("quarter") +
  labs(
    title="f(x) = |x|",
    y="f(x)"
  )

Outliers Penalized Quadratically
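
A quick numerical sketch (not from the original slides; the residual values are illustrative): an outlier's contribution grows linearly under the absolute-value loss but quadratically under the squared loss.

Code
# One large residual dominates the squared loss but not the absolute loss
resids <- c(0.1, -0.2, 0.15, -0.1)
resids_out <- c(resids, 5)  # add a single large residual
sum(abs(resids)); sum(abs(resids_out))  # 0.55 -> 5.55 (outlier adds 5)
sum(resids^2); sum(resids_out^2)        # 0.0825 -> 25.0825 (outlier adds 25)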

Key Features of Regression Line

  • Regression line is BLUE: Best Linear Unbiased Estimator
  • What exactly is it the “best” linear estimator of?

\[ \widehat{y} = \underbrace{\widehat{\beta}_0}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-5mm] \text{intercept}\end{array}}} ~+~ \underbrace{\widehat{\beta}_1}_{\mathclap{\small\begin{array}{c}\text{Estimated} \\[-4mm] \text{slope}\end{array}}}\cdot x \]

is chosen so that

\[ \widehat{\theta} = \left(\widehat{\beta}_0, \widehat{\beta}_1\right) = \argmin_{\beta_0, \beta_1}\left[ \sum_{x_i \in X} \left(~~\overbrace{\widehat{y}(x_i)}^{\mathclap{\small\text{Predicted }y}} - \overbrace{\expect{Y \mid X = x_i}}^{\small \text{Avg. }y\text{ when }x = x_i}\right)^{2~} \right] \]

Where Did That \(\mathbb{E}[Y \mid X = x_i]\) Come From?

But… What About All the Other Types of Vars?

  • 5000: you saw, e.g., nominal, ordinal, cardinal vars
  • 5100: you wrestled with discrete vs. continuous RVs
  • Good News #1: Regression can handle all these types+more!
  • Good News #2: Distinctions between classification and regression start to diminish as you learn fancier regression methods! (One key tool here: link functions)
  • By the end of 5300 you should have something in your toolbelt for handling most cases like “I want to do [regression / classification], but my data is [not cardinal+continuous]”

Deriving the Least Squares Estimate

A Sketch (HW is the Full Thing)

  • OLS for regression without intercept \(\param{\beta_0}\): Which line through the origin best predicts \(Y\)?
  • (Good practice + reminder of how restricted linear models are!)

\[ Y = \beta_1 X + \varepsilon \]

Code
library(latex2exp)
set.seed(5300)
# rand_slope <- log(runif(80, min=0, max=1))
# rand_slope[41:80] <- -rand_slope[41:80]
# rand_lines <- tibble::tibble(
#   id=1:80, slope=rand_slope, intercept=0
# )
# angles <- runif(100, -pi/2, pi/2)
angles <- seq(from=-pi/2, to=pi/2, length.out=50)
possible_lines <- tibble::tibble(
  slope=tan(angles), intercept=0
)
num_points <- 30
x_vals <- runif(num_points, 0, 1)
y0_vals <- 0.5 * x_vals + 0.25
y_noise <- rnorm(num_points, 0, 0.07)
y_vals <- y0_vals + y_noise
rand_df <- tibble::tibble(x=x_vals, y=y_vals)
title_exp <- TeX("Parameter Space ($\\beta_1$)")
Code
# Main plot object
gen_lines_plot <- function(point_size=2.5) {
  lines_plot <- rand_df |> ggplot(aes(x=x, y=y)) +
    geom_point(size=point_size) +
    geom_hline(yintercept=0, linewidth=1.5) +
    geom_vline(xintercept=0, linewidth=1.5) +
    # Point at origin
    geom_point(data=data.frame(x=0, y=0), aes(x=x, y=y), size=4) +
    xlim(-1,1) +
    ylim(-1,1) +
    # coord_fixed() +
    theme_dsan_min(base_size=28)
  return(lines_plot)
}
main_lines_plot <- gen_lines_plot()
main_lines_plot +
  # Parameter space of possible lines
  geom_abline(
    data=possible_lines,
    aes(slope=slope, intercept=intercept, color='possible'),
    # linetype="dotted",
    # linewidth=0.75,
    alpha=0.25
  ) +
  # True DGP
  geom_abline(
    aes(
      slope=0.5,
      intercept=0.25,
      color='true'
    ), linewidth=1, alpha=0.8
  ) + 
  scale_color_manual(
    element_blank(),
    values=c('possible'="black", 'true'=cb_palette[2]),
    labels=c('possible'="Possible Fits", 'true'="True DGP")
  ) +
  remove_legend_title() +
  labs(
    title=title_exp
  )

Evaluating with Residuals

Code
rc1_df <- possible_lines |> slice(n() - 14)
# Predictions for this choice
rc1_pred_df <- rand_df |> mutate(
  y_pred = rc1_df$slope * x,
  resid = y - y_pred
)
rc1_label <- TeX(paste0("Estimate 1: $\\beta_1 \\approx ", round(rc1_df$slope, 3), "$"))
Code
rc1_lines_plot <- gen_lines_plot(point_size=5)
rc1_lines_plot +
  geom_abline(
    data=rc1_df,
    aes(intercept=intercept, slope=slope),
    linewidth=2,
    color=cb_palette[1]
  ) +
  geom_segment(
    data=rc1_pred_df,
    aes(x=x, y=y, xend=x, yend=y_pred),
    # color=cb_palette[1]
  ) +
  xlim(0, 1) + ylim(0, 1) +
  labs(
    title = rc1_label
  )

Code
gen_resid_plot <- function(pred_df) {
  rc_rss <- sum((pred_df$resid)^2)
  rc_resid_label <- TeX(paste0("Residuals: RSS $\\approx$ ", round(rc_rss, 3)))
  rc_resid_plot <- pred_df |> ggplot(aes(x=x, y=resid)) +
    geom_point(size=5) +
    geom_hline(
      yintercept=0,
      color=cb_palette[1],
      linewidth=1.5
    ) +
    geom_segment(
      aes(xend=x, yend=0)
    ) +
    theme_dsan(base_size=28) +
    theme(axis.line.x = element_blank()) +
    labs(
      title=rc_resid_label
    )
  return(rc_resid_plot)
}
rc1_resid_plot <- gen_resid_plot(rc1_pred_df)
Code
rc1_resid_plot

Code
rc2_df <- possible_lines |> slice(n() - 9)
# Predictions for this choice
rc2_pred_df <- rand_df |> mutate(
  y_pred = rc2_df$slope * x,
  resid = y - y_pred
)
rc2_label <- TeX(paste0("Estimate 2: $\\beta_1 \\approx ", round(rc2_df$slope, 3), "$"))
Code
rc2_lines_plot <- gen_lines_plot(point_size=5)
rc2_lines_plot +
  geom_abline(
    data=rc2_df,
    aes(intercept=intercept, slope=slope),
    linewidth=2,
    color=cb_palette[3]
  ) +
  geom_segment(
    data=rc2_pred_df,
    aes(
      x=x, y=y, xend=x,
      yend=ifelse(y_pred <= 1, y_pred, Inf)
    )
    # color=cb_palette[1]
  ) +
  xlim(0, 1) + ylim(0, 1) +
  labs(
    title=rc2_label
  )

Code
rc2_resid_plot <- gen_resid_plot(rc2_pred_df)
Code
rc2_resid_plot

Now the Math

\[ \begin{align*} \beta_1^* = \argmin_{\beta_1}\left[ \sum_{i=1}^{n}(\widehat{y}_i - y_i)^2 \right] = \argmin_{\beta_1}\left[ \sum_{i=1}^{n}(\beta_1x_i - y_i)^2 \right] \end{align*} \]

To minimize, we take the derivative with respect to \(\beta_1\):

\[ \frac{\partial}{\partial\beta_1}\left[ \sum_{i=1}^{n}(\beta_1x_i - y_i)^2 \right] = \sum_{i=1}^{n}\frac{\partial}{\partial\beta_1}(\beta_1x_i - y_i)^2 = \sum_{i=1}^{n}2(\beta_1x_i - y_i)x_i \]

Setting this derivative to zero (the first-order condition) and solving for \(\beta_1^*\):

\[ \sum_{i=1}^{n}2(\beta_1^*x_i - y_i)x_i = 0 \iff \beta_1^*\sum_{i=1}^{n}x_i^2 = \sum_{i=1}^{n}x_iy_i \iff \boxed{\beta_1^* = \frac{\sum_{i=1}^{n}x_iy_i}{\sum_{i=1}^{n}x_i^2}} \]
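
As a sanity check, a minimal sketch (using the simulated x and y from the chunks above) confirming that the closed form matches R's no-intercept fit:

Code
# Closed-form no-intercept estimate vs. lm() with the intercept suppressed
beta1_closed <- sum(x * y) / sum(x^2)
beta1_lm <- coef(lm(y ~ x + 0))[["x"]]
all.equal(beta1_closed, beta1_lm)  # TRUE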

Doing Things With Regression

Regression: R vs. statsmodels

In (Base) R: lm()
Code
lin_model <- lm(sales ~ TV, data=ad_df)
summary(lin_model)

Call:
lm(formula = sales ~ TV, data = ad_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,    Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

General syntax:

lm(
  dependent ~ independent + controls,
  data = my_df
)
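
A handy follow-up (assuming the broom package, which appears again below): tidy() returns the same coefficient table as a data frame, convenient for plotting or reporting.

Code
# Coefficient table as a tidy data frame
library(broom)
tidy(lin_model)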
In Python: smf.ols()
Code
import statsmodels.formula.api as smf
results = smf.ols("sales ~ TV", data=ad_df).fit()
print(results.summary(slim=True))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
No. Observations:                 200   F-statistic:                     312.1
Covariance Type:            nonrobust   Prob (F-statistic):           1.47e-42
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.0326      0.458     15.360      0.000       6.130       7.935
TV             0.0475      0.003     17.668      0.000       0.042       0.053
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

General syntax:

smf.ols(
  "dependent ~ independent + controls",
  data = my_df
)

Interpreting Output

Code
mil_plot + theme_dsan("quarter")

Code
gdp_model <- lm(military ~ industrial, data=gdp_df)
summary(gdp_model)

Call:
lm(formula = military ~ industrial, data = gdp_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3354 -1.0997 -0.3870  0.6081  6.7508 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.61969    0.59526   1.041   0.3010  
industrial   0.05253    0.02019   2.602   0.0111 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.671 on 79 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.07895,   Adjusted R-squared:  0.06729 
F-statistic: 6.771 on 1 and 79 DF,  p-value: 0.01106

Zooming In: Coefficients

            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  0.61969     0.59526    1.041    0.3010
industrial   0.05253     0.02019    2.602    0.0111 *

Reading across: Estimate = \(\widehat{\beta}\); Std. Error = uncertainty; t value = test statistic \(t\); Pr(>|t|) = how extreme is \(t\)?; * = significance level

\[ \widehat{y} \approx \class{cb1}{\overset{\beta_0}{\underset{\small \pm 0.595}{0.620}}} + \class{cb2}{\overset{\beta_1}{\underset{\small \pm 0.020}{0.053}}} \cdot x \]
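
The “estimate ± uncertainty” reading above can be made explicit with confint(), a sketch using the same fitted model:

Code
# 95% confidence intervals for the intercept and slope
confint(gdp_model, level = 0.95)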

Zooming In: Significance

            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  0.61969     0.59526    1.041    0.3010
industrial   0.05253     0.02019    2.602    0.0111 *

Reading across: Estimate = \(\widehat{\beta}\); Std. Error = uncertainty; t value = test statistic \(t\); Pr(>|t|) = how extreme is \(t\)?; * = significance level
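
Where does Pr(>|t|) come from? A sketch: it is the two-sided tail probability of a t distribution with the model's residual degrees of freedom (79 here).

Code
# Two-sided p-value for the industrial coefficient, computed by hand
2 * pt(abs(2.602), df = 79, lower.tail = FALSE)  # approximately 0.0111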
Code
library(ggplot2)
int_tstat <- 1.041
int_tstat_str <- sprintf("%.02f", int_tstat)
label_df_int <- tribble(
    ~x, ~y, ~label,
    0.25, 0.05, paste0("P(t > ",int_tstat_str,")\n= 0.3")
)
label_df_signif_int <- tribble(
    ~x, ~y, ~label,
    2.7, 0.075, "95% Signif.\nCutoff"
)
funcShaded <- function(x, lower_bound, upper_bound){
    y <- dnorm(x)
    y[x < lower_bound | x > upper_bound] <- NA
    return(y)
}
funcShadedIntercept <- function(x) funcShaded(x, int_tstat, Inf)
funcShadedSignif <- function(x) funcShaded(x, 1.96, Inf)
ggplot(data=data.frame(x=c(-3,3)), aes(x=x)) +
  stat_function(fun=dnorm, linewidth=g_linewidth) +
  geom_vline(aes(xintercept=int_tstat), linewidth=g_linewidth, linetype="dashed") +
  geom_vline(aes(xintercept = 1.96), linewidth=g_linewidth, linetype="solid") +
  stat_function(fun = funcShadedIntercept, geom = "area", fill = cbPalette[1], alpha = 0.5) +
  stat_function(fun = funcShadedSignif, geom = "area", fill = "grey", alpha = 0.333) +
  geom_text(label_df_int, mapping = aes(x = x, y = y, label = label), size = 10) +
  geom_text(label_df_signif_int, mapping = aes(x = x, y = y, label = label), size = 8) +
  # Add single additional tick
  scale_x_continuous(breaks=c(-2, 0, int_tstat, 2), labels=c("-2","0",int_tstat_str,"2")) +
  dsan_theme("quarter") +
  labs(
    title = "t Value for Intercept",
    x = "t",
    y = "Density"
  ) +
  theme(axis.text.x = element_text(colour = c("black", "black", cbPalette[1], "black")))

Code
library(ggplot2)
coef_tstat <- 2.602
coef_tstat_str <- sprintf("%.02f", coef_tstat)
label_df_coef <- tribble(
    ~x, ~y, ~label,
    3.65, 0.06, paste0("P(t > ",coef_tstat_str,")\n= 0.01")
)
label_df_signif_coef <- tribble(
  ~x, ~y, ~label,
  1.05, 0.03, "95% Signif.\nCutoff"
)
funcShadedCoef <- function(x) funcShaded(x, coef_tstat, Inf)
ggplot(data=data.frame(x=c(-4,4)), aes(x=x)) +
  stat_function(fun=dnorm, linewidth=g_linewidth) +
  geom_vline(aes(xintercept=coef_tstat), linetype="dashed", linewidth=g_linewidth) +
  geom_vline(aes(xintercept=1.96), linetype="solid", linewidth=g_linewidth) +
  stat_function(fun = funcShadedCoef, geom = "area", fill = cbPalette[2], alpha = 0.5) +
  stat_function(fun = funcShadedSignif, geom = "area", fill = "grey", alpha = 0.333) +
  # Label shaded area
  geom_text(label_df_coef, mapping = aes(x = x, y = y, label = label), size = 10) +
  # Label significance cutoff
  geom_text(label_df_signif_coef, mapping = aes(x = x, y = y, label = label), size = 8) +
  coord_cartesian(clip = "off") +
  # Add single additional tick
  scale_x_continuous(breaks=c(-4, -2, 0, 2, coef_tstat, 4), labels=c("-4", "-2","0", "2", coef_tstat_str,"4")) +
  dsan_theme("quarter") +
  labs(
    title = "t Value for Coefficient",
    x = "t",
    y = "Density"
  ) +
  theme(axis.text.x = element_text(colour = c("black", "black", "black", "black", cbPalette[2], "black")))

The Residual Plot

  • A key assumption required for OLS: “homoskedasticity”
  • Given our model \[ y_i = \beta_0 + \beta_1x_i + \varepsilon_i \] the errors \(\varepsilon_i\) should not vary systematically with \(i\)
  • Formally: \(\forall i \left[ \Var{\varepsilon_i} = \sigma^2 \right]\)
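
Beyond eyeballing the residual plots below, one common formal check is the Breusch-Pagan test, sketched here under the assumption that the lmtest package is installed; a small p-value suggests heteroskedasticity.

Code
# Breusch-Pagan test on the military ~ industrial model
library(lmtest)
bptest(gdp_model)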
Code
library(broom)
gdp_resid_df <- augment(gdp_model)
ggplot(gdp_resid_df, aes(x = .fitted, y = .resid)) +
    geom_point(size = g_pointsize/2) +
    geom_hline(yintercept=0, linetype="dashed") +
    dsan_theme("quarter") +
    labs(
      title = "Residual Plot for Military ~ Industrial",
      x = "Fitted Value",
      y = "Residual"
    )

Code
x <- 1:80
errors <- rnorm(length(x), 0, x^2/1000)
y <- x + errors
het_model <- lm(y ~ x)
df_het <- augment(het_model)
ggplot(df_het, aes(x = .fitted, y = .resid)) +
    geom_point(size = g_pointsize / 2) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    dsan_theme("quarter") +
    labs(
        title = "Residual Plot for Heteroskedastic Data",
        x = "Fitted Value",
        y = "Residual"
    )

Q-Q Plot

  • If the residuals \((y - \widehat{y}) \sim \mathcal{N}(0, \sigma^2)\), the points will lie (approximately) along the reference line:
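
A numeric companion to the visual check (a sketch, using base R's built-in test): the Shapiro-Wilk test, where a small p-value suggests the residuals are not normally distributed.

Code
# Shapiro-Wilk normality test on the model residuals
shapiro.test(residuals(gdp_model))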
Code
ggplot(df_het, aes(sample=.resid)) +
  stat_qq(size = g_pointsize/2) + stat_qq_line(linewidth = g_linewidth) +
  dsan_theme("half") +
  labs(
    title = "Q-Q Plot for Heteroskedastic Data",
    x = "Normal Distribution Quantiles",
    y = "Observed Data Quantiles"
  )

Code
ggplot(gdp_resid_df, aes(sample=.resid)) +
  stat_qq(size = g_pointsize/2) + stat_qq_line(linewidth = g_linewidth) +
  dsan_theme("half") +
  labs(
    title = "Q-Q Plot for Industrial ~ Military Residuals",
    x = "Normal Distribution Quantiles",
    y = "Observed Data Quantiles"
  )

Multiple Linear Regression

  • Notation: \(x_{i,j}\) = value of independent variable \(j\) for person/observation \(i\)
  • \(M\) = total number of independent variables

\[ \widehat{y}_i = \beta_0 + \beta_1x_{i,1} + \beta_2x_{i,2} + \cdots + \beta_M x_{i,M} \]

  • \(\beta_j\) interpretation: a one-unit increase in \(x_{i,j}\) is associated with a \(\beta_j\) unit increase in \(y_i\), holding all other independent variables constant
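
For the curious, a minimal sketch of where these estimates come from computationally: stacking the predictors into a design matrix \(X\) and solving the normal equations \(X^\top X \beta = X^\top y\) reproduces coef(lm(...)) (here using the Advertising data loaded earlier).

Code
# Normal-equations solution; matches coef(lm(sales ~ TV + radio + newspaper))
X <- model.matrix(~ TV + radio + newspaper, data = ad_df)
beta_hat <- solve(t(X) %*% X, t(X) %*% ad_df$sales)
beta_hat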

Visualizing Multiple Linear Regression

(ISLR, Fig 3.5): A pronounced non-linear relationship. Positive residuals (visible above the surface) tend to lie along the 45-degree line, where budgets are split evenly. Negative residuals (most not visible) tend to be away from this line, where budgets are more lopsided.

Interpreting MLR

Code
mlr_model <- lm(sales ~ TV + radio + newspaper, data=ad_df)
print(summary(mlr_model))

Call:
lm(formula = sales ~ TV + radio + newspaper, data = ad_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
  • Holding radio and newspaper spending constant
    • An increase of $1K in spending on TV advertising is associated with
    • An increase in sales of about 46 units
  • Holding TV and newspaper spending constant
    • An increase of $1K in spending on radio advertising is associated with
    • An increase in sales of about 189 units

But Wait…

Code
print(summary(mlr_model))

Call:
lm(formula = sales ~ TV + radio + newspaper, data = ad_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
Code
lr_model <- lm(sales ~ newspaper, data=ad_df)
print(summary(lr_model))

Call:
lm(formula = sales ~ newspaper, data = ad_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2272  -3.3873  -0.8392   3.5059  12.7751 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
newspaper    0.05469    0.01658    3.30  0.00115 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared:  0.05212,   Adjusted R-squared:  0.04733 
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148
  • MLR results can be drastically different from SLR results, because of correlations (next slide)
  • This is a good thing! It’s how we’re able to control for confounding variables!

Correlations Among Features

Code
cor(ad_df |> select(-id))
                  TV      radio  newspaper     sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
radio     0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000
  • Observe how \(\text{cor}(\texttt{radio}, \texttt{newspaper}) \approx 0.35\)
  • In markets where we spend more on radio our sales will tend to be higher…
  • Corr matrix \(\implies\) we spend more on newspaper in those same markets…
  • In SLR which only examines sales vs. newspaper, we (correctly!) observe that higher values of newspaper are associated with higher values of sales
  • In essence, newspaper advertising is a surrogate for radio advertising \(\implies\) in our SLR, newspaper “gets credit” for the association between radio and sales
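
A sketch to make the “surrogate” claim concrete: compare the newspaper coefficient before and after controlling for radio; once radio enters the model, newspaper no longer gets the credit.

Code
# newspaper's coefficient with and without radio in the model
coef(lm(sales ~ newspaper, data = ad_df))
coef(lm(sales ~ newspaper + radio, data = ad_df))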

Another MLR Superpower: Incorporating Categorical Vars

(Preview for next week)

\[ Y = \beta_0 + \beta_1 \times \texttt{income} \]

Code
credit_df <- read_csv("assets/Credit.csv")
Code
credit_plot <- credit_df |> ggplot(aes(x=Income, y=Balance)) +
  geom_point(size=0.5*g_pointsize) +
  geom_smooth(method='lm', formula="y ~ x", linewidth=1) +
  theme_dsan() +
  labs(
    title="Credit Card Balance vs. Income Level",
    x="Income ($1K)",
    y="Credit Card Balance ($)"
  )
credit_plot

\[ \begin{align*} Y = &\beta_0 + \beta_1 \times \texttt{income} + \beta_2 \times \texttt{Student} \\ &+ \beta_3 \times (\texttt{Student} \times \texttt{Income}) \end{align*} \]

Code
student_plot <- credit_df |> ggplot(aes(x=Income, y=Balance, color=Student)) +
  geom_point(size=0.5*g_pointsize) +
  geom_smooth(method='lm', formula="y ~ x", linewidth=1) +
  theme_dsan() +
  labs(
    title="Credit Card Balance vs. Income Level",
    x="Income ($1K)",
    y="Credit Card Balance ($)"
  )
student_plot

  • Why do we need the \(\texttt{Student} \times \texttt{Income}\) term?
  • Understanding this setup will open up a vast array of possibilities for regression 😎
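
As a preview sketch, the interaction model above can be fit directly with R's formula shorthand (Income * Student expands to both main effects plus their product):

Code
# Varying-intercept, varying-slope model: Balance ~ Income * Student
student_model <- lm(Balance ~ Income * Student, data = credit_df)
summary(student_model)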

References

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.