Week 7: Basis Functions and Splines

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Class Sessions
Author: Jeff Jacobs

Published: Monday, February 24, 2025


Schedule

Today’s Planned Schedule:

         Start    End      Topic
Lecture  6:30pm   7:00pm   Roadmap: Even More Wiggly “Linearity”
         7:00pm   7:20pm   “Manual” Model Selection: Subsets
         7:20pm   7:40pm   Key Regularization Building Block: \(L^p\) Norm
         7:40pm   8:00pm   Regularized Regression Intro
Break!   8:00pm   8:10pm
         8:10pm   8:50pm   Basically Lasso is the Coolest Thing Ever
         8:50pm   9:00pm   Scary-Looking But Actually-Fun W07 Preview

Roadmap

[Weeks 4-6]

  • Polynomial Regression
  • Piecewise Regression

[Today]

  • Splines (Truncated Power Bases)
  • Natural Splines

What Problem Are We Solving?

  • From W04-06, you're now well-practiced at thinking through issues of non-linearity in terms of polynomial regression
    • (Hence why I kept “plugging in” degree of polynomial as our measure of complexity)
  • Now we want to move towards handling any kind of non-linearity
    • (Hence why I kept reminding you that degree of polynomial is only one possible measure of complexity)

What’s So Bad About Polynomial Regression?

  • Mathematically, the Weierstrass Approximation Theorem says that any continuous function (on a closed interval) can be approximated arbitrarily well by a polynomial of sufficiently high degree
  • In practice, this can be a horrifically bad way to actually model things:
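(The chunks below also lean on a few helpers defined in the course's setup chunk, which isn't shown on this page: get_rss(), theme_dsan(), and g_linewidth. A minimal stand-in sketch, purely so the chunks are reproducible — the real theme_dsan() presumably sets fonts/colors beyond this:)

Code
library(tidyverse) |> suppressPackageStartupMessages()
# Residual sum of squares for a fitted lm model
get_rss <- function(model) sum(residuals(model)^2)
# Stand-in for the course's custom ggplot theme
theme_dsan <- function(base_size = 14) theme_classic(base_size = base_size)
# Global line width used in several geom_smooth() calls
g_linewidth <- 1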

Raw data: \(y = \sin(x) + \varepsilon\)

Code
library(tidyverse) |> suppressPackageStartupMessages()
library(latex2exp) |> suppressPackageStartupMessages()
N <- 200
x_vals <- seq(from=-10, to=10, length.out=N)
y_raw <- sin(x_vals)
y_noise <- rnorm(length(y_raw), mean=0, sd=0.075)
y_vals <- y_raw + y_noise
dgp_label <- TeX("Raw Data")
data_df <- tibble(x=x_vals, y=y_vals)
base_plot <- data_df |> ggplot(aes(x=x, y=y)) +
  geom_point() +
  theme_dsan(base_size=28)
base_plot + labs(title=dgp_label)

Bad (quadratic) model

Code
quad_model <- lm(y ~ poly(x,2), data=data_df)
quad_rss <- round(get_rss(quad_model), 3)
poly_label_2 <- TeX(paste0("2 $\\beta$ Coefficients: RSS = ",quad_rss))
base_plot +
  geom_smooth(method='lm', formula=y ~ poly(x,2)) +
  labs(title=poly_label_2)

 

Making it “better” with more complex polynomials

Code
poly5_model <- lm(y ~ poly(x,5), data=data_df)
poly5_rss <- round(get_rss(poly5_model), 3)
poly5_label <- TeX(paste0("5 $\\beta$ Coefficients: RSS = ",poly5_rss))
base_plot +
  geom_smooth(method='lm', formula=y ~ poly(x,5)) +
  labs(title=poly5_label)

Code
poly8_model <- lm(y ~ poly(x,8), data=data_df)
poly8_rss <- round(get_rss(poly8_model), 3)
poly8_label <- TeX(paste0("8 $\\beta$ Coefficients: RSS = ",poly8_rss))
base_plot +
  geom_smooth(method='lm', formula=y ~ poly(x,8)) +
  labs(title=poly8_label)

 

Using all of the data to estimate a single parameter, by choosing the “correct” basis function!

\[ Y = \beta_0 + \beta_1 \sin(x) \]

Code
sine_model <- lm(y ~ sin(x), data=data_df)
sine_rss <- round(get_rss(sine_model), 3)
sine_label <- TeX(paste0("Single sin(x) Coefficient: RSS = ",sine_rss))
base_plot +
  geom_smooth(method='lm', formula=y ~ sin(x)) +
  labs(title=sine_label)

Basis Functions

  • Seems like it’ll help to have a toolkit filled with different basis functions, so we can use them with lm() 🤔
  • …Let’s get into it!

The General Form

(Decomposing Fancy Regressions into Core “Pieces”)

  • Q: What do all these types of regression have in common?

  • A: They can all be written in the form

    \[ Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \cdots + \beta_d b_d(X) \]

  • Where \(b(\cdot)\) is called a basis function

    • Polynomial: \(b_j(X) = X^j\)
    • Piecewise: \(b_j(X) = \mathbb{1}[c_{j-1} \leq X < c_j]\)
    • Earlier Example: \(b_1(X) = \sin(X)\)
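Concretely, each basis above is just a different set of transformed columns handed to lm(). A minimal sketch, reusing the simulated data_df from the \(\sin(x)\) example (the cut() breakpoints here are arbitrary illustrative choices):

Code
poly_fit  <- lm(y ~ poly(x, 3), data=data_df)                             # b_j(X) = X^j
piece_fit <- lm(y ~ cut(x, breaks=c(-Inf, -5, 0, 5, Inf)), data=data_df)  # b_j(X) = 1[c_{j-1} <= X < c_j]
sine_fit  <- lm(y ~ sin(x), data=data_df)                                 # b_1(X) = sin(X)
# Compare fits via residual sum of squares
sapply(list(poly=poly_fit, piecewise=piece_fit, sine=sine_fit), get_rss)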

Segments \(\rightarrow\) Splines

What we want → How we do it → Name for result

  • Model sub-regions of \(\text{domain}(X)\) → Chop \(\text{domain}(X)\) into pieces, one regression per piece → Discontinuous Segmented Regression
  • Continuous prediction function → Require the pieces to join at the chop points (“knots”) \(\{\xi_1, \ldots, \xi_K\}\) → Continuous Segmented Regression
  • Smooth prediction function → Require the pieces to join and have equal derivatives at each \(\xi_k\) → Spline
  • Less jump in variance at the boundaries → Reduce complexity of the polynomials in the two endpoint pieces → Natural Spline

Discontinuous Segmented Regression

  • Here we see why our first two basis functions were polynomial and piecewise: the most rudimentary segmented regression just fits a separate polynomial to each piece
Code
library(tidyverse) |> suppressPackageStartupMessages()
set.seed(5300)
compute_y <- function(x) {
  return(x * cos(x^2))
}
N <- 500
xmin <- -1.9
xmax <- 2.7
x_vals <- runif(N, min=xmin, max=xmax)
y_raw = compute_y(x_vals)
y_noise = rnorm(N, mean=0, sd=0.5)
y_vals <- y_raw + y_noise
prod_df <- tibble(x=x_vals, y=y_vals)
knot <- (xmin + xmax) / 2
prod_df <- prod_df |> mutate(segment = x <= knot)
# First segment model
#left_df <- prod_df |> filter(x <= knot_point)
#left_model <- lm(y ~ poly(x, 2), data=left_df)
prod_df |> ggplot(aes(x=x, y=y, group=segment)) +
  geom_point(size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  geom_smooth(method='lm', formula=y ~ poly(x, 2), se=TRUE) +
  theme_classic(base_size=22)

  • What’s the issue with this?

Forcing Continuity at Knot Points

  • Starting slow: let’s come back down to line world…
  • Why are these two lines “allowed” to be non-continuous, under our current approach?

\[ Y_L = \beta_0 + \beta_1 X_L \]

\[ Y_R = \beta_0^{🧐} + \beta_1 X_R \]

Code
set.seed(5300)
xmin_sub <- -1
xmax_sub <- 1.9
sub_df <- prod_df |> filter(x >= xmin_sub & x <= xmax_sub)
sub_knot <- (xmin_sub + xmax_sub) / 2
sub_df <- sub_df |> mutate(segment = x <= sub_knot)
# First segment model
sub_df |> ggplot(aes(x=x, y=y, group=segment)) +
  geom_point(size=0.5) +
  geom_vline(xintercept=sub_knot, linetype="dashed") +
  geom_smooth(method='lm', formula=y ~ x, se=TRUE, linewidth=g_linewidth) +
  theme_classic(base_size=28)

  • …We need \(X_R\) to have its own slope but not its own intercept! Something like:

\[ Y = \beta_0 + \beta_1 (X \text{ before } \xi) + \beta_2 (X \text{ after } \xi) \]

Truncated Power Basis (ReLU Basis)

\[ \text{ReLU}(x) \definedas (x)_+ \definedas \begin{cases} 0 &\text{if }x \leq 0 \\ x &\text{if }x > 0 \end{cases} \implies (x - \xi)_+ = \begin{cases} 0 &\text{if }x \leq \xi \\ x - \xi &\text{if }x > \xi \end{cases} \]

Code
library(latex2exp) |> suppressPackageStartupMessages()
trunc_x <- function(x, xi) {
  return(ifelse(x <= xi, 0, x - xi))
}
trunc_x_05 <- function(x) trunc_x(x, 1/2)
trunc_title <- TeX("$(x - \\xi)_+$ with $\\xi = 0.5$")
trunc_label <- TeX("$y = (x - \\xi)_+$")
ggplot() +
  stat_function(
    data=data.frame(x=c(-3,3)),
    fun=trunc_x_05,
    linewidth=g_linewidth
  ) +
  xlim(-3, 3) +
  theme_dsan(base_size=28) +
  labs(
    title = trunc_title,
    x = "x",
    y = trunc_label
  )

  • \(\Rightarrow\) we can include a “slope modifier” \(\beta_m (X - \xi)_+\) that only “kicks in” once \(x\) goes past \(\xi\)! (Changing slope by \(\beta_m\))

Linear Segments with Continuity at Knot

Our new (non-naïve) model:

\[ \begin{align*} Y &= \beta_0 + \beta_1 X + \beta_2 (X - \xi)_+ \\ &= \begin{cases} \beta_0 + \beta_1 X &\text{if }X \leq \xi \\ (\beta_0 - \beta_2 \xi) + (\beta_1 + \beta_2)X &\text{if }X > \xi \end{cases} \end{align*} \]

(The slope changes by \(\beta_2\) at the knot, and the intercept shifts by exactly \(-\beta_2 \xi\) so that the two pieces meet at \(\xi\).)

Code
sub_df <- sub_df |> mutate(x_tr = ifelse(x < sub_knot, 0, x - sub_knot))
linseg_model <- lm(y ~ x + x_tr, data=sub_df)
broom::tidy(linseg_model) |> mutate_if(is.numeric, round, 3) |> select(-p.value)
term estimate std.error statistic
(Intercept) 0.187 0.044 4.255
x 1.340 0.093 14.342
x_tr -2.868 0.168 -17.095
Code
sub_df |> ggplot(aes(x=x, y=y)) +
  geom_point(size=0.5) +
  geom_vline(xintercept=sub_knot, linetype="dashed") +
  geom_smooth(method='lm', formula=y ~ x + ifelse(x > sub_knot, x-sub_knot, 0), se=TRUE, linewidth=g_linewidth) +
  theme_classic(base_size=28)
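A quick numerical sanity check that the two fitted pieces really do meet at the knot (a sketch reusing sub_knot and linseg_model from above; eps is just a small offset):

Code
eps <- 1e-6
check_df <- tibble(x = c(sub_knot - eps, sub_knot + eps)) |>
  mutate(x_tr = ifelse(x < sub_knot, 0, x - sub_knot))
predict(linseg_model, newdata=check_df)  # the two predictions agree (up to ~eps)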

Continuous Segmented Regression

\[ \begin{align*} Y &= \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 (X - \xi)_+ + \beta_4 (X - \xi)_+^2 \\[0.8em] &= \begin{cases} \beta_0 + \beta_1 X + \beta_2 X^2 &\text{if }X \leq \xi \\ \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 (X - \xi) + \beta_4 (X - \xi)^2 &\text{if }X > \xi \end{cases} \end{align*} \]

Code
prod_df <- prod_df |> mutate(x_tr = ifelse(x > knot, x - knot, 0))
seg_model <- lm(
  y ~ poly(x, 2) + poly(x_tr, 2),
  data=prod_df
)
Code
seg_model |> broom::tidy() |> mutate_if(is.numeric, round, 3) |> select(-p.value)
term estimate std.error statistic
(Intercept) 0.104 0.035 2.961
poly(x, 2)1 112.929 6.824 16.548
poly(x, 2)2 64.558 3.692 17.484
poly(x_tr, 2)1 -125.195 7.690 -16.281
poly(x_tr, 2)2 1.526 1.036 1.473
Code
cont_seg_plot <- ggplot() +
  geom_point(data=prod_df, aes(x=x, y=y), size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  stat_smooth(
    data=prod_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ poly(x,2) + poly(ifelse(x > knot, (x - knot), 0), 2),
    n = 300
  ) +
  #geom_smooth(data=prod_df |> filter(segment == FALSE), aes(x=x, y=y), method='lm', formula=y ~ poly(x,2)) +
  # geom_smooth(method=segreg, formula=y ~ seg(x, npsi=1, fixed.psi=0.5)) + # + seg(I(x^2), npsi=1)) +
  theme_classic(base_size=22)
cont_seg_plot

  • There’s still a problem here… can you see what it is? (Hint: things that break calculus)
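One way to see the answer numerically (a sketch reusing seg_model and knot from above; eps is just a small step for a finite-difference slope):

Code
eps <- 1e-3
grid_df <- tibble(x = c(knot - 2*eps, knot - eps, knot + eps, knot + 2*eps)) |>
  mutate(x_tr = ifelse(x > knot, x - knot, 0))
preds <- predict(seg_model, newdata=grid_df)
slope_left  <- (preds[2] - preds[1]) / eps  # slope just left of the knot
slope_right <- (preds[4] - preds[3]) / eps  # slope just right of the knot
c(left = slope_left, right = slope_right)   # a gap here means a kink (non-differentiable point) at the knot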

Splines (The Moment You’ve Been Waiting For!)

Constrained Derivatives: “Quadratic Splines” (⚠️)

  • Like “Linear Probability Models”, “Quadratic Splines” are a red flag
  • Why? Start from the continuous segmented fit (leftmost plot below): matching the two pieces’ slopes at \(\xi\) forces \(\beta_3 = 0\), and matching their curvatures too (so the fit is genuinely smooth at the knot) forces \(\beta_4 = 0\) as well… leaving a single quadratic over the whole domain, defeating the whole point of the “chopping”!

\[ \frac{\partial Y}{\partial X} = \begin{cases} \beta_1 + 2\beta_2 X &\text{if } X \leq \xi \\ \beta_1 + 2\beta_2 X + \beta_3 + 2\beta_4 (X - \xi) &\text{if } X > \xi \end{cases} \]

Continuous Segmented:

Code
cont_seg_plot

 

“Quadratic Spline” (slopes now match at \(\xi\), but the curvature still jumps):

Code
ggplot() +
  geom_point(data=prod_df, aes(x=x, y=y), size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  stat_smooth(
    data=prod_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ poly(x,2) + ifelse(x > knot, (x - knot)^2, 0),
    n = 300
  ) +
  theme_dsan(base_size=22)

 

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 \]

Code
ggplot() +
  geom_point(data=prod_df, aes(x=x, y=y), size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  stat_smooth(
    data=prod_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ poly(x,2),
    n = 300
  ) +
  theme_dsan(base_size=22)

  • \(\implies\) The lowest-degree polynomial that can join smoothly at the knots, while still letting the pieces differ, is a cubic
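Spelling that out (a short derivation under the truncated-power parameterization from the previous slides; the cubic coefficient \(\beta_5\) in the last sentence is hypothetical notation, not from the slides):

\[ \begin{align*} g'(\xi^-) = \beta_1 + 2\beta_2\xi, \quad g'(\xi^+) = \beta_1 + 2\beta_2\xi + \beta_3 \quad &\Rightarrow \quad \text{matching slopes} \Rightarrow \beta_3 = 0 \\ g''(\xi^-) = 2\beta_2, \quad g''(\xi^+) = 2\beta_2 + 2\beta_4 \quad &\Rightarrow \quad \text{matching curvatures} \Rightarrow \beta_4 = 0 \end{align*} \]

With quadratic pieces, nothing survives the two constraints, so the “two” pieces collapse into one quadratic. A cubic term \(\beta_5 (X - \xi)_+^3\), on the other hand, has zero value, slope, and curvature at \(\xi\), so \(\beta_5\) survives both constraints and the pieces can still differ past the knot.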

Cubic Splines (aka, Splines)

(We did it, we finally did it)

Code
ggplot() +
  geom_point(data=prod_df, aes(x=x, y=y), size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  stat_smooth(
    data=prod_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ poly(x,3) + ifelse(x > knot, (x - knot)^3, 0),
    n = 300
  ) +
  theme_dsan(base_size=22)
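In practice you rarely hand-build the truncated-power columns; splines::bs() constructs a B-spline basis that spans the same space of cubic splines (but is better-conditioned numerically). A minimal sketch, reusing prod_df and knot from above:

Code
library(splines) |> suppressPackageStartupMessages()
# Cubic spline with one interior knot: same fitted curve as the truncated-power version above
bs_fit <- lm(y ~ bs(x, knots=knot, degree=3), data=prod_df)
round(get_rss(bs_fit), 3)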

Natural Splines: Why Not Stop There?

Polynomials start to BEHAVE BADLY near the boundaries of the data (and catastrophically so as \(x \rightarrow \pm\infty\)):

Code
library(splines) |> suppressPackageStartupMessages()
set.seed(5300)
N_sparse <- 400
x_vals <- runif(N_sparse, min=xmin, max=xmax)
y_raw = compute_y(x_vals)
y_noise = rnorm(N_sparse, mean=0, sd=1.0)
y_vals <- y_raw + y_noise
sparse_df <- tibble(x=x_vals, y=y_vals)
knot_sparse <- (xmin + xmax) / 2
knot_vec <- c(-1.8,-1.7,-1.6,-1.5,-1.4,-1.2,-1.0,-0.8,-0.6,-0.4,-0.2,0,knot_sparse,1,1.5,2)
knot_df <- tibble(knot=knot_vec)
# Boundary lines
left_bound_line <- geom_vline(
  xintercept = xmin + 0.1, linewidth=1,
  color="red", alpha=0.8
)
right_bound_line <- geom_vline(
  xintercept = xmax - 0.1,
  linewidth=1, color="red", alpha=0.8
)
ggplot() +
  geom_point(data=sparse_df, aes(x=x, y=y), size=0.5) +
  geom_vline(
    data=knot_df, aes(xintercept=knot),
    linetype="dashed"
  ) +
  stat_smooth(
    data=sparse_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ bs(x, knots=c(knot_sparse), degree=25),
    n=300
  ) +
  left_bound_line +
  right_bound_line +
  theme_dsan(base_size=22)

Natural Splines: Force the leftmost and rightmost pieces (outside the boundary knots, drawn in red below) to be linear

Code
library(splines) |> suppressPackageStartupMessages()
set.seed(5300)
N_sparse <- 400
x_vals <- runif(N_sparse, min=xmin, max=xmax)
y_raw = compute_y(x_vals)
y_noise = rnorm(N_sparse, mean=0, sd=1.0)
y_vals <- y_raw + y_noise
sparse_df <- tibble(x=x_vals, y=y_vals)
knot_sparse <- (xmin + xmax) / 2
ggplot() +
  geom_point(data=sparse_df, aes(x=x, y=y), size=0.5) +
  stat_smooth(
    data=sparse_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ ns(x, knots=knot_vec, Boundary.knots=c(xmin + 0.1, xmax - 0.1)),
    n=300
  ) +
  geom_vline(
    data=knot_df, aes(xintercept=knot),
    linetype="dashed"
  ) +
  left_bound_line +
  right_bound_line +
  theme_dsan(base_size=22)

How Do We Find The Magical Knot Points?

  • Hyperparameter 0: Number of knots \(K\)
  • Hyperparameters 1 to \(K\): Location \(\xi_k\) for each knot
  • Fear: Seems like we’re going to need a bunch of slides/lectures talking about this, right?
    • (e.g., if you really do have to choose them manually, the usual heuristic is to place more knots in the more “volatile” parts of \(\text{domain}(X)\))
  • Reality: Nope, just use CV! 😉😎 (see the sketch below)
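For concreteness, here is a minimal sketch of what “just use CV” could look like: k-fold CV over the number of natural-spline degrees of freedom, letting ns() place that many knots at quantiles of \(x\) for us. The fold count and df grid below are arbitrary choices, and sparse_df comes from the earlier chunks:

Code
library(splines) |> suppressPackageStartupMessages()
set.seed(5300)
k <- 5
fold_id <- sample(rep(1:k, length.out=nrow(sparse_df)))  # random fold assignment
df_grid <- 3:12                                          # candidate df values (more df = more knots)
cv_mse <- sapply(df_grid, function(d) {
  fold_errs <- sapply(1:k, function(fold) {
    train_df <- sparse_df[fold_id != fold, ]
    test_df  <- sparse_df[fold_id == fold, ]
    fit <- lm(y ~ ns(x, df=d), data=train_df)
    mean((predict(fit, newdata=test_df) - test_df$y)^2)  # held-out MSE for this fold
  })
  mean(fold_errs)
})
df_grid[which.min(cv_mse)]  # df value with the lowest CV error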

Even Cooler Alternative Approach: Smoothing Splines

(Where “cooler” here really means “more statistically-principled”)

Reminder (W04): Penalizing Complexity

Week 4 (General Form):

\[ \boldsymbol\theta^* = \underset{\boldsymbol\theta}{\operatorname{argmin}} \left[ \mathcal{L}(y, \widehat{y}; \boldsymbol\theta) + \lambda \cdot \mathsf{Complexity}(\boldsymbol\theta) \right] \]

Last Week (Penalizing Polynomials):

\[ \boldsymbol\beta^*_{\text{lasso}} = \underset{\boldsymbol\beta}{\operatorname{argmin}} \left[ \frac{1}{N}\sum_{i=1}^{N}(\widehat{y}_i(\boldsymbol\beta) - y_i)^2 + \lambda \|\boldsymbol\beta\|_1 \right] \]

Now (Penalizing Sharp Change / Wiggliness), with \(\lambda\) again a hyperparameter chosen e.g. by CV:

\[ g^* = \underset{g}{\operatorname{argmin}} \left[ \sum_{i=1}^{N}(g(x_i) - y_i)^2 + \lambda \overbrace{\int}^{\mathclap{\text{Sum of}}} [\underbrace{g''(t)}_{\mathclap{\text{Change in }g'(t)}}]^2 \mathrm{d}t \right] \]

Restricting the first derivative would optimize for a function that doesn’t change much at all. Restricting the second derivative just constrains sharpness of the changes.

Think of an Uber driver: penalizing speed would just make them go slowly (ensuring slow ride). Penalizing acceleration ensures that they can travel whatever speed they want, as long as they smoothly accelerate and decelerate between those speeds (ensuring non-bumpy ride!).

Penalizing Wiggles

\[ g^* = \underset{g}{\operatorname{argmin}} \left[ \sum_{i=1}^{N}(g(x_i) - y_i)^2 + \lambda \overbrace{\int}^{\mathclap{\text{Sum of}}} [\underbrace{g''(t)}_{\mathclap{\text{Change in }g'(t)}}]^2 \mathrm{d}t \right] \]

  • (Note that we do not assume it’s a spline! We optimize over \(g\) itself, not \(K\), \(\xi_k\))
  • Penalizing \([g'(t)]^2\) would mean: penalize function for changing over its domain
  • Penalizing \([g''(t)]^2\) means: it’s ok to change, but do so smoothly
  • Uber ride metaphor: remember how \(g'(t)\) gives speed while \(g''(t)\) gives acceleration
  • Penalizing speed… would just make driver go slowly (ensuring slow ride)
  • Penalizing acceleration
    • Ensures that they can fit the traffic pattern (can drive whatever speed they want)
    • As long as they smoothly accelerate and decelerate between those speeds (ensuring non-bumpy ride!)

Three Mind-Blowing Math Facts

  • Fact 1: As \(\lambda \rightarrow 0\), \(g^*\) grows as wiggly as needed to interpolate the data, driving RSS/MSE to zero (and \(R^2\) to 1)
  • Fact 2: As \(\lambda \rightarrow \infty\), \(g^*\) converges to the OLS linear regression line
  • Fact 3: We never restricted the optimization to splines, yet the optimal \(g^*\) is a spline!

Specifically, a Natural Cubic Spline with…

  • Knots at every data point (\(K = N\), \(\xi_i = x_i\)), but
  • Penalized via (optimal) \(\lambda^*\), so not as wiggly as “free to wiggle” splines from previous approach: shrunken natural cubic spline
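For reference, this whole construction is a single call in base R: stats::smooth.spline() solves the penalized problem above, placing knots at the (unique) data points when all.knots=TRUE and choosing \(\lambda\) by generalized cross-validation. A minimal sketch, reusing sparse_df from the earlier chunks:

Code
# Smoothing-spline fit: knots at every data point, lambda chosen by GCV
ss_fit <- smooth.spline(x=sparse_df$x, y=sparse_df$y, all.knots=TRUE, cv=FALSE)
c(lambda = ss_fit$lambda, effective_df = ss_fit$df)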
