Week 10: Deep Learning

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author: Jeff Jacobs
Published: Monday, March 24, 2025


Schedule

Today’s Planned Schedule:

          Start    End      Topic
Lecture   6:30pm   7:00pm   Single-Layer Neural Networks
          7:00pm   7:20pm   Max-Margin Classifiers
          7:20pm   8:00pm   Support Vector Classifiers
Break!    8:00pm   8:10pm
          8:10pm   9:00pm   Fancier Neural Networks

Quick Roadmap

  • We made it! The cutting edge of statistical learning: neural networks

Single-Layer Neural Networks

Single-Layer NN, Adapted from ISLR Fig 10.1

Diagram \(\leftrightarrow\) Math

  • \(p = 4\) features in Input Layer
  • \(K = 5\) Hidden Units
  • Output Layer: Regression on activations \(a_k\) (Hidden Unit outputs)

\[ \begin{align*} {\color{#976464} y} &= { \color{#976464} \beta_0 } + {\color{#666693} \sum_{k=1}^{5} } {\color{#976464} \beta_k } { \color{#666693} \overbrace{\boxed{a_k} }^{\mathclap{k^\text{th}\text{ activation}}} } \\ {\color{#976464} y} &= { \color{#976464} \beta_0 } + {\color{#666693} \sum_{k=1}^{5} } {\color{#976464} \beta_k } { \color{#666693} \underbrace{ g \mkern-4mu \left( w_{k0} + {\color{#679d67} \sum_{j=1}^{4} } w_{kj} {\color{#679d67} x_j} \right) }_{k^\text{th}\text{ activation}}} \end{align*} \]
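To make the diagram \(\leftrightarrow\) math mapping concrete, here is a minimal R sketch of this forward pass. Everything in it (the sigmoid choice of \(g\), the random weights, the function name `single_layer_nn`) is illustrative, not code from the slides.

Code
g <- function(z) 1 / (1 + exp(-z))  # one common choice of activation (sigmoid)
single_layer_nn <- function(x, W, beta) {
  # W: K x (p+1) matrix, row k = (w_k0, w_k1, ..., w_kp); beta = (beta_0, ..., beta_K)
  a <- apply(W, 1, function(w_k) g(w_k[1] + sum(w_k[-1] * x)))  # a_k for k = 1..K
  beta[1] + sum(beta[-1] * a)                                   # y = beta_0 + sum_k beta_k a_k
}
set.seed(5300)
p <- 4; K <- 5                                # matches the diagram: 4 features, 5 hidden units
W <- matrix(rnorm(K * (p + 1)), nrow = K)     # hidden-layer weights (random, for illustration)
beta <- rnorm(K + 1)                          # output-layer weights
single_layer_nn(x = rnorm(p), W = W, beta = beta)  # a single scalar prediction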

Matrix Form (Only if Sanity-Helping)
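For concreteness, here is one compact way to write the same model (a sketch consistent with the notation above, not necessarily the exact form on the slide): stack the hidden-layer weights \(w_{kj}\) into a \(K \times p\) matrix \(\mathbf{W}\), collect the biases into \(\mathbf{w}_0 = (w_{10}, \ldots, w_{K0})^\top\), and apply \(g\) elementwise:

\[ \begin{align*} {\color{#666693} \mathbf{a} } &= g\!\left( \mathbf{w}_0 + \mathbf{W} {\color{#679d67} \mathbf{x} } \right) \\ {\color{#976464} y } &= {\color{#976464} \beta_0 } + {\color{#976464} \boldsymbol\beta }^\top {\color{#666693} \mathbf{a} } \end{align*} \]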

Example

  • Rather than pondering over what that diagram can/can’t do, consider three “true” DGPs:

\[ \begin{align*} Y &= {\color{#e69f00} X_1 X_2 } \\ Y &= {\color{#56b4e9} X_1^2 + X_2^2 } \\ Y &= {\color{#009E73} X_1 \underset{\mathclap{\small \text{XOR}}}{\oplus} X_2} \end{align*} \]

  • How exactly is a neural net able to learn these relationships?

Sum of Squares

  • Can we learn \(y = {\color{#56b4e9} x_1^2 + x_2^2 }\)?
  • Let’s use \(g(x) = x^2\).
  • Let \(\mathbf{w}_1 = (0, 1, 0)\), \(\mathbf{w}_2 = (0, 0, 1)\).
  • Our two activations are:

\[ \begin{align*} {\color{#666693} a_1 } &= g(0 + (1)(x_1) + (0)(x_2)) = x_1^2 \\ {\color{#666693} a_2 } &= g(0 + (0)(x_1) + (1)(x_2)) = x_2^2 \end{align*} \]

  • So, if \(\boldsymbol\beta = (0, 1, 1)\), then

\[ {\color{#976464} y } = 0 + (1)(x_1^2) + (1)(x_2^2) = {\color{#56b4e9} x_1^2 + x_2^2} \; ✅ \]
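A quick numeric check of this construction (just a sketch; `y_hat` is an ad hoc name, and the weights are exactly the ones chosen above, with the bias as the first entry):

Code
g <- function(z) z^2                     # the activation chosen above
w1 <- c(0, 1, 0); w2 <- c(0, 0, 1)       # (w_k0, w_k1, w_k2) for each hidden unit
beta <- c(0, 1, 1)                       # (beta_0, beta_1, beta_2)
y_hat <- function(x1, x2) {
  a1 <- g(sum(w1 * c(1, x1, x2)))
  a2 <- g(sum(w2 * c(1, x1, x2)))
  beta[1] + beta[2] * a1 + beta[3] * a2
}
y_hat(3, 4)    # 25 = 3^2 + 4^2
y_hat(-2, 1)   # 5 = (-2)^2 + 1^2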

Interaction Term

  • Can we learn \(Y = {\color{#e69f00} x_1x_2}\)?
  • Let’s use \(g(x) = x^2\) again.
  • Let \(\mathbf{w}_1 = (0, 1, 1)\), \(\mathbf{w}_2 = (0, 1, -1)\).
  • Our two activations are:

\[ \begin{align*} {\color{#666693} a_1 } &= g(0 + (1)(x_1) + (1)(x_2)) = (x_1 + x_2)^2 = x_1^2 + x_2^2 +2x_1x_2 \\ {\color{#666693} a_2 } &= g(0 + (1)(x_1) + (-1)(x_2)) = (x_1 - x_2)^2 = x_1^2 + x_2^2 - 2x_1x_2 \end{align*} \]

  • So, if we let \(\boldsymbol\beta = \left( 0, \frac{1}{4}, -\frac{1}{4} \right)\), then

\[ {\color{#976464} y } = 0 + \left(\frac{1}{4}\right)(x_1^2 + x_2^2 + 2x_1x_2) + \left(-\frac{1}{4}\right)(x_1^2 + x_2^2 - 2x_1x_2) = {\color{#e69f00} x_1x_2} \; ✅ \]
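The same sanity check for the interaction construction, this time on a batch of random inputs (again just an illustrative sketch, using the weights and \(\boldsymbol\beta\) chosen above):

Code
g <- function(z) z^2
a1 <- function(x1, x2) g(0 + 1 * x1 + 1 * x2)    # w1 = (0, 1, 1)
a2 <- function(x1, x2) g(0 + 1 * x1 - 1 * x2)    # w2 = (0, 1, -1)
y_hat <- function(x1, x2) 0 + (1/4) * a1(x1, x2) - (1/4) * a2(x1, x2)
set.seed(5300)
x1 <- runif(5, -2, 2); x2 <- runif(5, -2, 2)
max(abs(y_hat(x1, x2) - x1 * x2))                # 0, up to floating-point error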

The XOR Problem

  • Can we learn \(Y = {\color{#009E73} x_1 \underset{\mathclap{\small \text{XOR}}}{\oplus} x_2}\)?
  • Let’s use \(g(x) = x^2\) once more.
  • Let \(\mathbf{w}_1 = (0, 1, 1)\), \(\mathbf{w}_2 = (0, 1, -1)\).
  • Our two activations are:

\[ \begin{align*} {\color{#666693} a_1 } &= g(0 + (1)(x_1) + (1)(x_2)) = (x_1 + x_2)^2 = x_1^2 + x_2^2 +2x_1x_2 \\ {\color{#666693} a_2 } &= g(0 + (1)(x_1) + (-1)(x_2)) = (x_1 - x_2)^2 = x_1^2 + x_2^2 - 2x_1x_2 \end{align*} \]

  • So, if we let \(\boldsymbol\beta = (0, 0, 1)\), then

\[ \begin{align*} {\color{#976464} y }(0,0) &= 0 + (0)(0^2 + 0^2 + 2(0)(0)) + (1)(0^2 + 0^2 - 2(0)(0)) = {\color{#009e73} 0} \; ✅ \\ {\color{#976464} y }(0,1) &= 0 + (0)(0^2 + 1^2 + 2(0)(1)) + (1)(0^2 + 1^2 - 2(0)(1)) = {\color{#009e73} 1} \; ✅ \\ {\color{#976464} y }(1,0) &= 0 + (0)(1^2 + 0^2 + 2(1)(0)) + (1)(1^2 + 0^2 - 2(1)(0)) = {\color{#009e73} 1} \; ✅ \\ {\color{#976464} y }(1,1) &= 0 + (0)(1^2 + 1^2 + 2(1)(1)) + (1)(1^2 + 1^2 - 2(1)(1)) = {\color{#009e73} 0} \; ✅ \end{align*} \]
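Evaluating the constructed network on all four binary inputs reproduces the XOR truth table (a sketch; `outer()` just evaluates `y_hat` over the grid of \((x_1, x_2)\) combinations):

Code
g <- function(z) z^2
y_hat <- function(x1, x2) {
  a1 <- g(x1 + x2)       # w1 = (0, 1, 1)
  a2 <- g(x1 - x2)       # w2 = (0, 1, -1)
  0 + 0 * a1 + 1 * a2    # beta = (0, 0, 1)
}
outer(0:1, 0:1, y_hat)   # rows: x1 = 0, 1; cols: x2 = 0, 1
# gives the XOR truth table:
#   0 1
#   1 0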

But How?

  • Output Layer is just linear regression on activations (Hidden Layer outputs)
  • We saw in Week 7 how a good basis function allows regression to learn (essentially) any function
  • Neural Networks: GOAT non-linear basis function learners!
Code
library(tidyverse) |> suppressPackageStartupMessages()
library(latex2exp) |> suppressPackageStartupMessages()
# g_pointsize, theme_dsan(), and remove_legend_title() are defined in the
# course's shared plotting setup (not shown in this chunk)
xor_df <- tribble(
    ~x1, ~x2, ~label,
    0, 0, 0,
    0, 1, 1,
    1, 0, 1,
    1, 1, 0
) |>
mutate(
    h1 = (x1 - x2)^2,
    label = factor(label)
)
xor_df |> ggplot(aes(x=x1, y=x2, label=label)) +
  geom_point(
    aes(color=label, shape=label),
    size=g_pointsize * 2,
    stroke=6
  ) +
  geom_point(aes(fill=label), color='black', shape=21, size=g_pointsize * 2.5, stroke=0.75, alpha=0.4) +
  scale_x_continuous(breaks=c(0, 1)) +
  scale_y_continuous(breaks=c(0, 1)) +
  expand_limits(y=c(-0.1,1.1)) +
  # pch 95 ('_') and 43 ('+') draw the minus-like and plus class markers
  # (95 renders as a wider dash than 45, '-')
  scale_shape_manual(values=c(95, 43)) +
  theme_dsan(base_size=32) +
  remove_legend_title() +
  labs(
    x=TeX("$x_1$"),
    y=TeX("$x_2$"),
    title="XOR Problem: Original Features"
  )
Figure 1: The DGP \(Y = x_1 \oplus x_2\) produces points in \([0,1]^2\) which are not linearly separable
Code
library(tidyverse)
# cb_palette is defined in the course's shared plotting setup (not shown here)
xor_df <- tribble(
    ~x1, ~x2, ~label,
    0, 0, 0,
    0, 1, 1,
    1, 0, 1,
    1, 1, 0
) |>
mutate(
    h1 = (x1 - x2)^2,
    h2 = (x1 + x2)^2,
    # (0,1) and (1,0) both map to (h1, h2) = (1, 1); nudge (1,0) so both points stay visible
    h2 = ifelse(h1 > 0.5 & x2==0, h2 + 0.5, h2),
    label = factor(label)
)
xor_df |> ggplot(aes(x=h1, y=h2, label=label)) +
  geom_vline(xintercept=0.5, linetype="dashed", linewidth=1) +
  # Negative space
  geom_rect(xmin=-Inf, xmax=0.5, ymin=-Inf, ymax=Inf, fill=cb_palette[1], alpha=0.15) +
  # Positive space
  geom_rect(xmin=0.5, xmax=Inf, ymin=-Inf, ymax=Inf, fill=cb_palette[2], alpha=0.15) +
  geom_point(
    aes(color=label, shape=label),
    size=g_pointsize * 2,
    stroke=6
  ) +
  geom_point(aes(fill=label), color='black', shape=21, size=g_pointsize*2.5, stroke=0.75, alpha=0.4) +
  expand_limits(y=c(-0.2,4.2)) +
  # pch 95 ('_') and 43 ('+') draw the minus-like and plus class markers
  scale_shape_manual(values=c(95, 43)) +
  theme_dsan(base_size=32) +
  remove_legend_title() +
  labs(
    title="NN-Learned Feature Space",
    x=TeX("$h_1(x_1, x_2)$"),
    y=TeX("$h_2(x_1, x_2)$")
  )
Figure 2: Learned bases \(h_1 = (x_1 - x_2)^2\) and \(h_2 = (x_1 + x_2)^2\) enable separating hyperplane \(h_1 = 0.5\)
Code
library(tidyverse)
x1_vals <- seq(from=0, to=1, by=0.0075)
x2_vals <- seq(from=0, to=1, by=0.0075)
grid_df <- expand.grid(x1=x1_vals, x2=x2_vals) |>
  as_tibble() |>
  mutate(
    label=factor(as.numeric((x1-x2)^2 > 0.5))
  )
ggplot() +
  geom_point(
    data=grid_df,
    aes(x=x1, y=x2, color=label),
    alpha=0.4
  ) +
  geom_point(
    data=xor_df,
    aes(x=x1, y=x2, color=label, shape=label),
    size=g_pointsize * 2,
    stroke=6
  ) +
  geom_point(
    data=xor_df,
    aes(x=x1, y=x2, fill=label),
    color='black', shape=21, size=g_pointsize*2.5, stroke=0.75, alpha=0.4
  ) +
  geom_abline(slope=1, intercept=0.7, linetype="dashed", linewidth=1) +
  geom_abline(slope=1, intercept=-0.7, linetype="dashed", linewidth=1) +
  scale_shape_manual(values=c(95, 43)) +
  theme_dsan(base_size=32) +
  remove_legend_title() +
  labs(
    title="XOR Problem: Inverted NN Features",
    x=TeX("$X_1$"), y=TeX("$X_2$")
  )
Figure 3: Here, the blue area represents points where \(h_1 = (x_1 - x_2)^2 > 0.5\)

Multilayer Neural Networks

Multilayer NN for MNIST Handwritten Digit Recognition, Adapted from ISLR Fig 10.4

Input Representation

From “But what is a neural network?”, 3Blue1Brown

But Wait… Ten Outputs?

  • The (magical) softmax function!

\[ z_d = \Pr(Y = d \mid X) = \frac{e^{y_d}}{\sum_{i=0}^{9}e^{y_i}} \]

  • Ensures that each \(z_d\) is a probability! (See the R sketch below.)

\[ \begin{align} 0 \leq z_d &\leq 1 \; \; \forall ~ d \in \{0,\ldots,9\} \\ \sum_{d=0}^{9}z_d &= 1 \end{align} \]
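A minimal softmax sketch in R (the shift by `max(y)` is a standard numerical-stability trick, not something required by the formula itself, and the scores are made up):

Code
softmax <- function(y) {
  z <- exp(y - max(y))   # subtracting max(y) avoids overflow; it cancels in the ratio
  z / sum(z)
}
y_scores <- c(2.0, 1.0, 0.1, -1.2)   # hypothetical output-layer scores y_d
z <- softmax(y_scores)
z            # each z_d lies in [0, 1]...
sum(z)       # ...and they sum to 1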

Visualizing Softmax Results

Fancier Neural Networks

  • Convolutional Neural Networks (CNNs)
  • Recurrent Neural Networks (RNNs)

CNNs

  • Key point: Convolutional layers are not fully connected!
  • Each unit combines (“pools”) info from only a small patch of units in the previous layer, rather than all of them (sketched below)
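A minimal base-R sketch of that local connectivity (not the course's code): each output unit is a weighted sum of one small patch of the input, with the same kernel weights reused at every location.

Code
conv2d <- function(img, kernel) {
  k <- nrow(kernel)                       # assumes a square kernel and square image
  out_dim <- nrow(img) - k + 1            # "valid" convolution, no padding
  out <- matrix(0, out_dim, out_dim)
  for (i in 1:out_dim) {
    for (j in 1:out_dim) {
      patch <- img[i:(i + k - 1), j:(j + k - 1)]
      out[i, j] <- sum(patch * kernel)    # one unit = weighted sum of one k x k patch
    }
  }
  out
}
set.seed(5300)
img <- matrix(runif(36), nrow = 6)        # toy 6 x 6 "image"
kernel <- rbind(c(1, -1), c(1, -1))       # toy 2 x 2 filter (vertical-edge-ish)
dim(conv2d(img, kernel))                  # 5 x 5 feature map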

Decoding the Thought Vector

  • Hidden layers closer to input layer detect low-level “fine-grained” features
  • Hidden layers closer to output layer detect high-level “coarse-grained” features

Decoding the Thought Vector

Variational Autoencoders

RNNs

…More next week, tbh

ISLR Figure 10.12

Ok But How Do We Learn The Weights?

Backpropagation! (3Blue1Brown Again!)

(Full NN playlist here)
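In one sentence: compute the loss on the training data, then use the chain rule to push its gradient backwards through the output layer and the hidden layer, nudging every \(\beta_k\) and \(w_{kj}\) downhill. Below is a minimal full-batch gradient-descent sketch for the single-hidden-layer regression net from earlier, trained on the interaction DGP \(y = x_1 x_2\) (squared-error loss, sigmoid activation; the learning rate, epoch count, and all names are illustrative choices, not the course's code).

Code
set.seed(5300)
sigmoid <- function(z) 1 / (1 + exp(-z))
n <- 500; p <- 2; K <- 5
X <- matrix(runif(n * p, -1, 1), nrow = n)
y <- X[, 1] * X[, 2]                                 # "true" DGP: y = x1 * x2
W <- matrix(rnorm(K * (p + 1), sd = 0.5), nrow = K)  # row k = (w_k0, w_k1, w_k2)
beta <- rnorm(K + 1, sd = 0.5)                       # (beta_0, ..., beta_K)
lr <- 0.1
for (epoch in 1:2000) {
  # Forward pass
  A <- sigmoid(cbind(1, X) %*% t(W))                 # n x K activations
  y_hat <- as.numeric(beta[1] + A %*% beta[-1])      # predictions
  resid <- y_hat - y                                 # d(loss)/d(y_hat), up to a 1/n factor
  # Backward pass (chain rule), for loss = mean((y_hat - y)^2) / 2
  grad_beta <- c(mean(resid), colMeans(resid * A))
  dZ <- outer(resid, beta[-1]) * A * (1 - A)         # gradient at hidden pre-activations
  grad_W <- t(dZ) %*% cbind(1, X) / n
  # Gradient-descent step
  beta <- beta - lr * grad_beta
  W <- W - lr * grad_W
}
A <- sigmoid(cbind(1, X) %*% t(W))
mean((beta[1] + A %*% beta[-1] - y)^2)   # training MSE; compare to var(y) ~ 0.11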

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd ed. New York: Springer.