Week 7: Basis Functions and Splines

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Class Sessions
Author: Jeff Jacobs

Published: Monday, February 24, 2025


Schedule

Today’s Planned Schedule:

         Start    End      Topic
Lecture  6:30pm   7:00pm   Roadmap: Even More Wiggly “Linearity”
         7:00pm   7:20pm   “Manual” Model Selection: Subsets
         7:20pm   7:40pm   Key Regularization Building Block: \(L^p\) Norm
         7:40pm   8:00pm   Regularized Regression Intro
Break!   8:00pm   8:10pm
         8:10pm   8:50pm   Basically Lasso is the Coolest Thing Ever
         8:50pm   9:00pm   Scary-Looking But Actually-Fun W07 Preview

Roadmap

[Weeks 4-6]

  • Polynomial Regression
  • Piecewise Regression

[Today]

  • Splines (Truncated Power Bases)
  • Natural Splines

What Problem Are We Solving?

  • From W04-06, you're now well-practiced at thinking through issues of non-linearity in terms of polynomial regression
    • (Hence why I kept “plugging in” degree of polynomial as our measure of complexity)
  • Now we want to move towards handling any kind of non-linearity
    • (Hence why I kept reminding you that degree of polynomial is only one possible measure of complexity)

What’s So Bad About Polynomial Regression?

  • Mathematically, the Weierstrass Approximation Theorem says that any continuous function (on a closed interval) can be approximated arbitrarily well by a polynomial of sufficiently high degree
  • In practice, this can be a horrifically bad way to actually model things:
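(The chunks below also lean on a few helpers defined in the course's setup chunk, which isn't shown on this page: get_rss(), theme_dsan(), and g_linewidth. A minimal stand-in sketch, purely so the chunks are reproducible — the real theme_dsan() presumably sets fonts/colors beyond this:)

Code
library(tidyverse) |> suppressPackageStartupMessages()
# Residual sum of squares for a fitted lm model
get_rss <- function(model) sum(residuals(model)^2)
# Stand-in for the course's custom ggplot theme
theme_dsan <- function(base_size = 14) theme_classic(base_size = base_size)
# Global line width used in several geom_smooth() calls
g_linewidth <- 1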

Raw data: \(y = \sin(x) + \varepsilon\)

Code
library(tidyverse) |> suppressPackageStartupMessages()
library(latex2exp) |> suppressPackageStartupMessages()
N <- 200
x_vals <- seq(from=-10, to=10, length.out=N)
y_raw <- sin(x_vals)
y_noise <- rnorm(length(y_raw), mean=0, sd=0.075)
y_vals <- y_raw + y_noise
dgp_label <- TeX("Raw Data")
data_df <- tibble(x=x_vals, y=y_vals)
base_plot <- data_df |> ggplot(aes(x=x, y=y)) +
  geom_point() +
  theme_dsan(base_size=28)
base_plot + labs(title=dgp_label)

Bad (quadratic) model

Code
quad_model <- lm(y ~ poly(x,2), data=data_df)
quad_rss <- round(get_rss(quad_model), 3)
poly_label_2 <- TeX(paste0("2 $\\beta$ Coefficients: RSS = ",quad_rss))
base_plot +
  geom_smooth(method='lm', formula=y ~ poly(x,2)) +
  labs(title=poly_label_2)

 

Making it “better” with more complex polynomials

Code
poly5_model <- lm(y ~ poly(x,5), data=data_df)
poly5_rss <- round(get_rss(poly5_model), 3)
poly5_label <- TeX(paste0("5 $\\beta$ Coefficients: RSS = ",poly5_rss))
base_plot +
  geom_smooth(method='lm', formula=y ~ poly(x,5)) +
  labs(title=poly5_label)

Code
poly8_model <- lm(y ~ poly(x,8), data=data_df)
poly8_rss <- round(get_rss(poly8_model), 3)
poly8_label <- TeX(paste0("8 $\\beta$ Coefficients: RSS = ",poly8_rss))
base_plot +
  geom_smooth(method='lm', formula=y ~ poly(x,8)) +
  labs(title=poly8_label)

 

Using all of the data to estimate a single parameter, by choosing the “correct” basis function!

\[ Y = \beta_0 + \beta_1 \sin(x) \]

Code
sine_model <- lm(y ~ sin(x), data=data_df)
sine_rss <- round(get_rss(sine_model), 3)
sine_label <- TeX(paste0("Single sin(x) Coefficient: RSS = ",sine_rss))
base_plot +
  geom_smooth(method='lm', formula=y ~ sin(x)) +
  labs(title=sine_label)

Basis Functions

  • Seems like it’ll help to have a toolkit filled with different basis functions, so we can use them with lm() 🤔
  • …Let’s get into it!

The General Form

(Decomposing Fancy Regressions into Core “Pieces”)

  • Q: What do all these types of regression have in common?

  • A: They can all be written in the form

    \[ Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \cdots + \beta_d b_d(X) \]

  • Where \(b(\cdot)\) is called a basis function

    • Polynomial: \(b_j(X) = X^j\)
    • Piecewise: \(b_j(X) = \mathbb{1}[c_{j-1} \leq X < c_j]\)
    • Earlier Example: \(b_1(X) = \sin(X)\)
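Concretely, each basis above is just a different set of transformed columns handed to lm(). A minimal sketch, reusing the simulated data_df from the \(\sin(x)\) example (the cut() breakpoints here are arbitrary illustrative choices):

Code
poly_fit  <- lm(y ~ poly(x, 3), data=data_df)                             # b_j(X) = X^j
piece_fit <- lm(y ~ cut(x, breaks=c(-Inf, -5, 0, 5, Inf)), data=data_df)  # b_j(X) = 1[c_{j-1} <= X < c_j]
sine_fit  <- lm(y ~ sin(x), data=data_df)                                 # b_1(X) = sin(X)
# Compare fits via residual sum of squares
sapply(list(poly=poly_fit, piecewise=piece_fit, sine=sine_fit), get_rss)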

Segments \(\rightarrow\) Splines

What we want → How we do it → Name for result

  • Model sub-regions of \(\text{domain}(X)\) → Chop \(\text{domain}(X)\) into pieces, one regression per piece → Discontinuous Segmented Regression
  • Continuous prediction function → Require the pieces to join at the chop points (“knots”) \(\{\xi_1, \ldots, \xi_K\}\) → Continuous Segmented Regression
  • Smooth prediction function → Require the pieces to join and have equal derivatives at each \(\xi_k\) → Spline
  • Less jump in variance at the boundaries → Reduce complexity of the polynomials in the two endpoint pieces → Natural Spline

Discontinuous Segmented Regression

  • Here we see why our first two basis functions were polynomial and piecewise: the most rudimentary segmented regression just fits a separate polynomial to each piece
Code
library(tidyverse) |> suppressPackageStartupMessages()
set.seed(5300)
compute_y <- function(x) {
  return(x * cos(x^2))
}
N <- 500
xmin <- -1.9
xmax <- 2.7
x_vals <- runif(N, min=xmin, max=xmax)
y_raw = compute_y(x_vals)
y_noise = rnorm(N, mean=0, sd=0.5)
y_vals <- y_raw + y_noise
prod_df <- tibble(x=x_vals, y=y_vals)
knot <- (xmin + xmax) / 2
prod_df <- prod_df |> mutate(segment = x <= knot)
# First segment model
#left_df <- prod_df |> filter(x <= knot_point)
#left_model <- lm(y ~ poly(x, 2), data=left_df)
prod_df |> ggplot(aes(x=x, y=y, group=segment)) +
  geom_point(size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  geom_smooth(method='lm', formula=y ~ poly(x, 2), se=TRUE) +
  theme_classic(base_size=22)

  • What’s the issue with this?

Forcing Continuity at Knot Points

  • Starting slow: let’s come back down to line world…
  • Why are these two lines “allowed” to be non-continuous, under our current approach?

\[ Y_L = \beta_0 + \beta_1 X_L \]

\[ Y_R = \beta_0^{🧐} + \beta_1 X_R \]

Code
set.seed(5300)
xmin_sub <- -1
xmax_sub <- 1.9
sub_df <- prod_df |> filter(x >= xmin_sub & x <= xmax_sub)
sub_knot <- (xmin_sub + xmax_sub) / 2
sub_df <- sub_df |> mutate(segment = x <= sub_knot)
# First segment model
sub_df |> ggplot(aes(x=x, y=y, group=segment)) +
  geom_point(size=0.5) +
  geom_vline(xintercept=sub_knot, linetype="dashed") +
  geom_smooth(method='lm', formula=y ~ x, se=TRUE, linewidth=g_linewidth) +
  theme_classic(base_size=28)

  • …We need \(X_R\) to have its own slope but not its own intercept! Something like:

\[ Y = \beta_0 + \beta_1 (X \text{ before } \xi) + \beta_2 (X \text{ after } \xi) \]

Truncated Power Basis (ReLU Basis)

\[ \text{ReLU}(x) \definedas (x)_+ \definedas \begin{cases} 0 &\text{if }x \leq 0 \\ x &\text{if }x > 0 \end{cases} \implies (x - \xi)_+ = \begin{cases} 0 &\text{if }x \leq \xi \\ x - \xi &\text{if }x > \xi \end{cases} \]

Code
library(latex2exp) |> suppressPackageStartupMessages()
trunc_x <- function(x, xi) {
  return(ifelse(x <= xi, 0, x - xi))
}
trunc_x_05 <- function(x) trunc_x(x, 1/2)
trunc_title <- TeX("$(x - \\xi)_+$ with $\\xi = 0.5$")
trunc_label <- TeX("$y = (x - \\xi)_+$")
ggplot() +
  stat_function(
    data=data.frame(x=c(-3,3)),
    fun=trunc_x_05,
    linewidth=g_linewidth
  ) +
  xlim(-3, 3) +
  theme_dsan(base_size=28) +
  labs(
    title = trunc_title,
    x = "x",
    y = trunc_label
  )

  • \(\Rightarrow\) we can include a “slope modifier” \(\beta_m (X - \xi)_+\) that only “kicks in” once \(x\) goes past \(\xi\)! (Changing slope by \(\beta_m\))

Linear Segments with Continuity at Knot

Our new (non-naïve) model:

\[ \begin{align*} Y &= \beta_0 + \beta_1 X + \beta_2 (X - \xi)_+ \\ &= \begin{cases} \beta_0 + \beta_1 X &\text{if }X \leq \xi \\ (\beta_0 - \beta_2 \xi) + (\beta_1 + \beta_2)X &\text{if }X > \xi \end{cases} \end{align*} \]

(The slope changes by \(\beta_2\) at the knot, and the intercept shifts by exactly \(-\beta_2 \xi\) so that the two pieces meet at \(\xi\).)

Code
sub_df <- sub_df |> mutate(x_tr = ifelse(x < sub_knot, 0, x - sub_knot))
linseg_model <- lm(y ~ x + x_tr, data=sub_df)
broom::tidy(linseg_model) |> mutate_if(is.numeric, round, 3) |> select(-p.value)
term estimate std.error statistic
(Intercept) 0.187 0.044 4.255
x 1.340 0.093 14.342
x_tr -2.868 0.168 -17.095
Code
sub_df |> ggplot(aes(x=x, y=y)) +
  geom_point(size=0.5) +
  geom_vline(xintercept=sub_knot, linetype="dashed") +
  geom_smooth(method='lm', formula=y ~ x + ifelse(x > sub_knot, x-sub_knot, 0), se=TRUE, linewidth=g_linewidth) +
  theme_classic(base_size=28)
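A quick numerical sanity check that the two fitted pieces really do meet at the knot (a sketch reusing sub_knot and linseg_model from above; eps is just a small offset):

Code
eps <- 1e-6
check_df <- tibble(x = c(sub_knot - eps, sub_knot + eps)) |>
  mutate(x_tr = ifelse(x < sub_knot, 0, x - sub_knot))
predict(linseg_model, newdata=check_df)  # the two predictions agree (up to ~eps)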

Continuous Segmented Regression

\[ \begin{align*} Y &= \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 (X - \xi)_+ + \beta_4 (X - \xi)_+^2 \\[0.8em] &= \begin{cases} \beta_0 + \beta_1 X + \beta_2 X^2 &\text{if }X \leq \xi \\ \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 (X - \xi) + \beta_4 (X - \xi)^2 &\text{if }X > \xi \end{cases} \end{align*} \]

Code
prod_df <- prod_df |> mutate(x_tr = ifelse(x > knot, x - knot, 0))
seg_model <- lm(
  y ~ poly(x, 2) + poly(x_tr, 2),
  data=prod_df
)
Code
seg_model |> broom::tidy() |> mutate_if(is.numeric, round, 3) |> select(-p.value)
term estimate std.error statistic
(Intercept) 0.104 0.035 2.961
poly(x, 2)1 112.929 6.824 16.548
poly(x, 2)2 64.558 3.692 17.484
poly(x_tr, 2)1 -125.195 7.690 -16.281
poly(x_tr, 2)2 1.526 1.036 1.473
Code
cont_seg_plot <- ggplot() +
  geom_point(data=prod_df, aes(x=x, y=y), size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  stat_smooth(
    data=prod_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ poly(x,2) + poly(ifelse(x > knot, (x - knot), 0), 2),
    n = 300
  ) +
  #geom_smooth(data=prod_df |> filter(segment == FALSE), aes(x=x, y=y), method='lm', formula=y ~ poly(x,2)) +
  # geom_smooth(method=segreg, formula=y ~ seg(x, npsi=1, fixed.psi=0.5)) + # + seg(I(x^2), npsi=1)) +
  theme_classic(base_size=22)
cont_seg_plot

  • There’s still a problem here… can you see what it is? (Hint: things that break calculus)
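One way to see the answer numerically (a sketch reusing seg_model and knot from above; eps is just a small step for a finite-difference slope):

Code
eps <- 1e-3
grid_df <- tibble(x = c(knot - 2*eps, knot - eps, knot + eps, knot + 2*eps)) |>
  mutate(x_tr = ifelse(x > knot, x - knot, 0))
preds <- predict(seg_model, newdata=grid_df)
slope_left  <- (preds[2] - preds[1]) / eps  # slope just left of the knot
slope_right <- (preds[4] - preds[3]) / eps  # slope just right of the knot
c(left = slope_left, right = slope_right)   # a gap here means a kink (non-differentiable point) at the knot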

Splines (The Moment You’ve Been Waiting For!)

Constrained Derivatives: “Quadratic Splines” (⚠️)

  • Like “Linear Probability Models”, “Quadratic Splines” are a red flag
  • Why? Start from the continuous segmented fit (leftmost plot below): matching the two pieces’ slopes at \(\xi\) forces \(\beta_3 = 0\), and matching their curvatures too (so the fit is genuinely smooth at the knot) forces \(\beta_4 = 0\) as well… leaving a single quadratic over the whole domain, defeating the whole point of the “chopping”!

\[ \frac{\partial Y}{\partial X} = \begin{cases} \beta_1 + 2\beta_2 X &\text{if } X \leq \xi \\ \beta_1 + 2\beta_2 X + \beta_3 + 2\beta_4 (X - \xi) &\text{if } X > \xi \end{cases} \]

Continuous Segmented:

Code
cont_seg_plot

 

“Quadratic Spline” (slopes now match at \(\xi\), but the curvature still jumps):

Code
ggplot() +
  geom_point(data=prod_df, aes(x=x, y=y), size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  stat_smooth(
    data=prod_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ poly(x,2) + ifelse(x > knot, (x - knot)^2, 0),
    n = 300
  ) +
  theme_dsan(base_size=22)

 

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 \]

Code
ggplot() +
  geom_point(data=prod_df, aes(x=x, y=y), size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  stat_smooth(
    data=prod_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ poly(x,2),
    n = 300
  ) +
  theme_dsan(base_size=22)

  • \(\implies\) The lowest-degree polynomial that can join smoothly at the knots, while still letting the pieces differ, is a cubic
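Spelling that out (a short derivation under the truncated-power parameterization from the previous slides; the cubic coefficient \(\beta_5\) in the last sentence is hypothetical notation, not from the slides):

\[ \begin{align*} g'(\xi^-) = \beta_1 + 2\beta_2\xi, \quad g'(\xi^+) = \beta_1 + 2\beta_2\xi + \beta_3 \quad &\Rightarrow \quad \text{matching slopes} \Rightarrow \beta_3 = 0 \\ g''(\xi^-) = 2\beta_2, \quad g''(\xi^+) = 2\beta_2 + 2\beta_4 \quad &\Rightarrow \quad \text{matching curvatures} \Rightarrow \beta_4 = 0 \end{align*} \]

With quadratic pieces, nothing survives the two constraints, so the “two” pieces collapse into one quadratic. A cubic term \(\beta_5 (X - \xi)_+^3\), on the other hand, has zero value, slope, and curvature at \(\xi\), so \(\beta_5\) survives both constraints and the pieces can still differ past the knot.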

Cubic Splines (aka, Splines)

(We did it, we finally did it)

Code
ggplot() +
  geom_point(data=prod_df, aes(x=x, y=y), size=0.5) +
  geom_vline(xintercept=knot, linetype="dashed") +
  stat_smooth(
    data=prod_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ poly(x,3) + ifelse(x > knot, (x - knot)^3, 0),
    n = 300
  ) +
  theme_dsan(base_size=22)
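In practice you rarely hand-build the truncated-power columns; splines::bs() constructs a B-spline basis that spans the same space of cubic splines (but is better-conditioned numerically). A minimal sketch, reusing prod_df and knot from above:

Code
library(splines) |> suppressPackageStartupMessages()
# Cubic spline with one interior knot: same fitted curve as the truncated-power version above
bs_fit <- lm(y ~ bs(x, knots=knot, degree=3), data=prod_df)
round(get_rss(bs_fit), 3)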

Natural Splines: Why Not Stop There?

Polynomials start to BEHAVE BADLY near the boundaries of the data (and catastrophically so as \(x \rightarrow \pm\infty\)):

Code
library(splines) |> suppressPackageStartupMessages()
set.seed(5300)
N_sparse <- 400
x_vals <- runif(N_sparse, min=xmin, max=xmax)
y_raw = compute_y(x_vals)
y_noise = rnorm(N_sparse, mean=0, sd=1.0)
y_vals <- y_raw + y_noise
sparse_df <- tibble(x=x_vals, y=y_vals)
knot_sparse <- (xmin + xmax) / 2
knot_vec <- c(-1.8,-1.7,-1.6,-1.5,-1.4,-1.2,-1.0,-0.8,-0.6,-0.4,-0.2,0,knot_sparse,1,1.5,2)
knot_df <- tibble(knot=knot_vec)
# Boundary lines
left_bound_line <- geom_vline(
  xintercept = xmin + 0.1, linewidth=1,
  color="red", alpha=0.8
)
right_bound_line <- geom_vline(
  xintercept = xmax - 0.1,
  linewidth=1, color="red", alpha=0.8
)
ggplot() +
  geom_point(data=sparse_df, aes(x=x, y=y), size=0.5) +
  geom_vline(
    data=knot_df, aes(xintercept=knot),
    linetype="dashed"
  ) +
  stat_smooth(
    data=sparse_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ bs(x, knots=c(knot_sparse), degree=25),
    n=300
  ) +
  left_bound_line +
  right_bound_line +
  theme_dsan(base_size=22)

Natural Splines: Force the leftmost and rightmost pieces (outside the boundary knots, drawn in red below) to be linear

Code
library(splines) |> suppressPackageStartupMessages()
set.seed(5300)
N_sparse <- 400
x_vals <- runif(N_sparse, min=xmin, max=xmax)
y_raw = compute_y(x_vals)
y_noise = rnorm(N_sparse, mean=0, sd=1.0)
y_vals <- y_raw + y_noise
sparse_df <- tibble(x=x_vals, y=y_vals)
knot_sparse <- (xmin + xmax) / 2
ggplot() +
  geom_point(data=sparse_df, aes(x=x, y=y), size=0.5) +
  stat_smooth(
    data=sparse_df, aes(x=x, y=y),
    method='lm',
    formula=y ~ ns(x, knots=knot_vec, Boundary.knots=c(xmin + 0.1, xmax - 0.1)),
    n=300
  ) +
  geom_vline(
    data=knot_df, aes(xintercept=knot),
    linetype="dashed"
  ) +
  left_bound_line +
  right_bound_line +
  theme_dsan(base_size=22)

How Do We Find The Magical Knot Points?

  • Hyperparameter 0: Number of knots \(K\)
  • Hyperparameters 1 to \(K\): Location \(\xi_k\) for each knot
  • Fear: Seems like we’re going to need a bunch of slides/lectures talking about this, right?
    • (e.g., if you really do have to choose them manually, the usual heuristic is to place more knots in the more “volatile” parts of \(\text{domain}(X)\))
  • Reality: Nope, just use CV! 😉😎 (see the sketch below)
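For concreteness, here is a minimal sketch of what “just use CV” could look like: k-fold CV over the number of natural-spline degrees of freedom, letting ns() place that many knots at quantiles of \(x\) for us. The fold count and df grid below are arbitrary choices, and sparse_df comes from the earlier chunks:

Code
library(splines) |> suppressPackageStartupMessages()
set.seed(5300)
k <- 5
fold_id <- sample(rep(1:k, length.out=nrow(sparse_df)))  # random fold assignment
df_grid <- 3:12                                          # candidate df values (more df = more knots)
cv_mse <- sapply(df_grid, function(d) {
  fold_errs <- sapply(1:k, function(fold) {
    train_df <- sparse_df[fold_id != fold, ]
    test_df  <- sparse_df[fold_id == fold, ]
    fit <- lm(y ~ ns(x, df=d), data=train_df)
    mean((predict(fit, newdata=test_df) - test_df$y)^2)  # held-out MSE for this fold
  })
  mean(fold_errs)
})
df_grid[which.min(cv_mse)]  # df value with the lowest CV error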

Even Cooler Alternative Approach: Smoothing Splines

(Where “cooler” here really means “more statistically-principled”)

Reminder (W04): Penalizing Complexity

Week 4 (General Form):

\[ \boldsymbol\theta^* = \underset{\boldsymbol\theta}{\operatorname{argmin}} \left[ \mathcal{L}(y, \widehat{y}; \boldsymbol\theta) + \lambda \cdot \mathsf{Complexity}(\boldsymbol\theta) \right] \]

Last Week (Penalizing Polynomials):

\[ \boldsymbol\beta^*_{\text{lasso}} = \underset{\boldsymbol\beta}{\operatorname{argmin}} \left[ \frac{1}{N}\sum_{i=1}^{N}(\widehat{y}_i(\boldsymbol\beta) - y_i)^2 + \lambda \|\boldsymbol\beta\|_1 \right] \]

Now (Penalizing Sharp Change / Wiggliness), with \(\lambda\) again a hyperparameter chosen e.g. by CV:

\[ g^* = \underset{g}{\operatorname{argmin}} \left[ \sum_{i=1}^{N}(g(x_i) - y_i)^2 + \lambda \overbrace{\int}^{\mathclap{\text{Sum of}}} [\underbrace{g''(t)}_{\mathclap{\text{Change in }g'(t)}}]^2 \mathrm{d}t \right] \]

Restricting the first derivative would optimize for a function that doesn’t change much at all. Restricting the second derivative just constrains sharpness of the changes.

Think of an Uber driver: penalizing speed would just make them go slowly (ensuring slow ride). Penalizing acceleration ensures that they can travel whatever speed they want, as long as they smoothly accelerate and decelerate between those speeds (ensuring non-bumpy ride!).

Penalizing Wiggles

\[ g^* = \underset{g}{\operatorname{argmin}} \left[ \sum_{i=1}^{N}(g(x_i) - y_i)^2 + \lambda \overbrace{\int}^{\mathclap{\text{Sum of}}} [\underbrace{g''(t)}_{\mathclap{\text{Change in }g'(t)}}]^2 \mathrm{d}t \right] \]

  • (Note that we do not assume it’s a spline! We optimize over \(g\) itself, not \(K\), \(\xi_k\))
  • Penalizing \([g'(t)]^2\) would mean: penalize function for changing over its domain
  • Penalizing \([g''(t)]^2\) means: it’s ok to change, but do so smoothly
  • Uber ride metaphor: remember how \(g'(t)\) gives speed while \(g''(t)\) gives acceleration
  • Penalizing speed… would just make driver go slowly (ensuring slow ride)
  • Penalizing acceleration
    • Ensures that they can fit the traffic pattern (can drive whatever speed they want)
    • As long as they smoothly accelerate and decelerate between those speeds (ensuring non-bumpy ride!)

Three Mind-Blowing Math Facts

  • Fact 1: As \(\lambda \rightarrow 0\), \(g^*\) grows as wiggly as needed to interpolate the data, driving RSS/MSE to zero (and \(R^2\) to 1)
  • Fact 2: As \(\lambda \rightarrow \infty\), \(g^*\) converges to the OLS linear regression line
  • Fact 3: We never restricted the optimization to splines, yet the optimal \(g^*\) is a spline!

Specifically, a Natural Cubic Spline with…

  • Knots at every data point (\(K = N\), \(\xi_i = x_i\)), but
  • Penalized via (optimal) \(\lambda^*\), so not as wiggly as “free to wiggle” splines from previous approach: shrunken natural cubic spline
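For reference, this whole construction is a single call in base R: stats::smooth.spline() solves the penalized problem above, placing knots at the (unique) data points when all.knots=TRUE and choosing \(\lambda\) by generalized cross-validation. A minimal sketch, reusing sparse_df from the earlier chunks:

Code
# Smoothing-spline fit: knots at every data point, lambda chosen by GCV
ss_fit <- smooth.spline(x=sparse_df$x, y=sparse_df$y, all.knots=TRUE, cv=FALSE)
c(lambda = ss_fit$lambda, effective_df = ss_fit$df)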
