Week 10: Deep Learning

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author: Jeff Jacobs
Published: Monday, March 24, 2025


Schedule

Today’s Planned Schedule:

          Start    End      Topic
Lecture   6:30pm   7:00pm   Single-Layer Neural Networks
          7:00pm   7:20pm   Max-Margin Classifiers
          7:20pm   8:00pm   Support Vector Classifiers
Break!    8:00pm   8:10pm
          8:10pm   9:00pm   Fancier Neural Networks

Quick Roadmap

  • We made it! The cutting edge of statistical learning: neural networks

Single-Layer Neural Networks

Single-Layer NN, Adapted from ISLR Fig 10.1

Diagram \(\leftrightarrow\) Math

  • \(p = 4\) features in Input Layer
  • \(K = 5\) Hidden Units
  • Output Layer: Regression on activations \(a_k\) (Hidden Unit outputs)

\[ \begin{align*} {\color{#976464} y} &= { \color{#976464} \beta_0 } + {\color{#666693} \sum_{k=1}^{5} } {\color{#976464} \beta_k } { \color{#666693} \overbrace{\boxed{a_k} }^{\mathclap{k^\text{th}\text{ activation}}} } \\ {\color{#976464} y} &= { \color{#976464} \beta_0 } + {\color{#666693} \sum_{k=1}^{5} } {\color{#976464} \beta_k } { \color{#666693} \underbrace{ g \mkern-4mu \left( w_{k0} + {\color{#679d67} \sum_{j=1}^{4} } w_{kj} {\color{#679d67} x_j} \right) }_{k^\text{th}\text{ activation}}} \end{align*} \]
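To make the diagram \(\leftrightarrow\) math mapping concrete, here is a minimal R sketch of this forward pass. Everything in it (the sigmoid choice of \(g\), the random weights, the function name `single_layer_nn`) is illustrative, not code from the slides.

Code
g <- function(z) 1 / (1 + exp(-z))  # one common choice of activation (sigmoid)
single_layer_nn <- function(x, W, beta) {
  # W: K x (p+1) matrix, row k = (w_k0, w_k1, ..., w_kp); beta = (beta_0, ..., beta_K)
  a <- apply(W, 1, function(w_k) g(w_k[1] + sum(w_k[-1] * x)))  # a_k for k = 1..K
  beta[1] + sum(beta[-1] * a)                                   # y = beta_0 + sum_k beta_k a_k
}
set.seed(5300)
p <- 4; K <- 5                                # matches the diagram: 4 features, 5 hidden units
W <- matrix(rnorm(K * (p + 1)), nrow = K)     # hidden-layer weights (random, for illustration)
beta <- rnorm(K + 1)                          # output-layer weights
single_layer_nn(x = rnorm(p), W = W, beta = beta)  # a single scalar prediction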

Matrix Form (Only if Sanity-Helping)
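For concreteness, here is one compact way to write the same model (a sketch consistent with the notation above, not necessarily the exact form on the slide): stack the hidden-layer weights \(w_{kj}\) into a \(K \times p\) matrix \(\mathbf{W}\), collect the biases into \(\mathbf{w}_0 = (w_{10}, \ldots, w_{K0})^\top\), and apply \(g\) elementwise:

\[ \begin{align*} {\color{#666693} \mathbf{a} } &= g\!\left( \mathbf{w}_0 + \mathbf{W} {\color{#679d67} \mathbf{x} } \right) \\ {\color{#976464} y } &= {\color{#976464} \beta_0 } + {\color{#976464} \boldsymbol\beta }^\top {\color{#666693} \mathbf{a} } \end{align*} \]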

Example

  • Rather than pondering over what that diagram can/can’t do, consider three “true” DGPs:

\[ \begin{align*} Y &= {\color{#e69f00} X_1 X_2 } \\ Y &= {\color{#56b4e9} X_1^2 + X_2^2 } \\ Y &= {\color{#009E73} X_1 \underset{\mathclap{\small \text{XOR}}}{\oplus} X_2} \end{align*} \]

  • How exactly is a neural net able to learn these relationships?

Sum of Squares

  • Can we learn \(y = {\color{#56b4e9} x_1^2 + x_2^2 }\)?
  • Let’s use \(g(x) = x^2\).
  • Let \(\mathbf{w}_1 = (0, 1, 0)\), \(\mathbf{w}_2 = (0, 0, 1)\).
  • Our two activations are:

\[ \begin{align*} {\color{#666693} a_1 } &= g(0 + (1)(x_1) + (0)(x_2)) = x_1^2 \\ {\color{#666693} a_2 } &= g(0 + (0)(x_1) + (1)(x_2)) = x_2^2 \end{align*} \]

  • So, if \(\boldsymbol\beta = (0, 1, 1)\), then

\[ {\color{#976464} y } = 0 + (1)(x_1^2) + (1)(x_2^2) = {\color{#56b4e9} x_1^2 + x_2^2} \; ✅ \]
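A quick numeric check of this construction (just a sketch; `y_hat` is an ad hoc name, and the weights are exactly the ones chosen above, with the bias as the first entry):

Code
g <- function(z) z^2                     # the activation chosen above
w1 <- c(0, 1, 0); w2 <- c(0, 0, 1)       # (w_k0, w_k1, w_k2) for each hidden unit
beta <- c(0, 1, 1)                       # (beta_0, beta_1, beta_2)
y_hat <- function(x1, x2) {
  a1 <- g(sum(w1 * c(1, x1, x2)))
  a2 <- g(sum(w2 * c(1, x1, x2)))
  beta[1] + beta[2] * a1 + beta[3] * a2
}
y_hat(3, 4)    # 25 = 3^2 + 4^2
y_hat(-2, 1)   # 5 = (-2)^2 + 1^2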

Interaction Term

  • Can we learn \(Y = {\color{#e69f00} x_1x_2}\)?
  • Let’s use \(g(x) = x^2\) again.
  • Let \(\mathbf{w}_1 = (0, 1, 1)\), \(\mathbf{w}_2 = (0, 1, -1)\).
  • Our two activations are:

\[ \begin{align*} {\color{#666693} a_1 } &= g(0 + (1)(x_1) + (1)(x_2)) = (x_1 + x_2)^2 = x_1^2 + x_2^2 +2x_1x_2 \\ {\color{#666693} a_2 } &= g(0 + (1)(x_1) + (-1)(x_2)) = (x_1 - x_2)^2 = x_1^2 + x_2^2 - 2x_1x_2 \end{align*} \]

  • So, if we let \(\boldsymbol\beta = \left( 0, \frac{1}{4}, -\frac{1}{4} \right)\), then

\[ {\color{#976464} y } = 0 + \left(\frac{1}{4}\right)(x_1^2 + x_2^2 + 2x_1x_2) + \left(-\frac{1}{4}\right)(x_1^2 + x_2^2 - 2x_1x_2) = {\color{#e69f00} x_1x_2} \; ✅ \]
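The same sanity check for the interaction construction, this time on a batch of random inputs (again just an illustrative sketch, using the weights and \(\boldsymbol\beta\) chosen above):

Code
g <- function(z) z^2
a1 <- function(x1, x2) g(0 + 1 * x1 + 1 * x2)    # w1 = (0, 1, 1)
a2 <- function(x1, x2) g(0 + 1 * x1 - 1 * x2)    # w2 = (0, 1, -1)
y_hat <- function(x1, x2) 0 + (1/4) * a1(x1, x2) - (1/4) * a2(x1, x2)
set.seed(5300)
x1 <- runif(5, -2, 2); x2 <- runif(5, -2, 2)
max(abs(y_hat(x1, x2) - x1 * x2))                # 0, up to floating-point error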

The XOR Problem

  • Can we learn \(Y = {\color{#009E73} x_1 \underset{\mathclap{\small \text{XOR}}}{\oplus} x_2}\)?
  • Let’s use \(g(x) = x^2\) once more.
  • Let \(\mathbf{w}_1 = (0, 1, 1)\), \(\mathbf{w}_2 = (0, 1, -1)\).
  • Our two activations are:

\[ \begin{align*} {\color{#666693} a_1 } &= g(0 + (1)(x_1) + (1)(x_2)) = (x_1 + x_2)^2 = x_1^2 + x_2^2 +2x_1x_2 \\ {\color{#666693} a_2 } &= g(0 + (1)(x_1) + (-1)(x_2)) = (x_1 - x_2)^2 = x_1^2 + x_2^2 - 2x_1x_2 \end{align*} \]

  • So, if we let \(\boldsymbol\beta = (0, 0, 1)\), then

\[ \begin{align*} {\color{#976464} y }(0,0) &= 0 + (0)(0^2 + 0^2 + 2(0)(0)) + (1)(0^2 + 0^2 - 2(0)(0)) = {\color{#009e73} 0} \; ✅ \\ {\color{#976464} y }(0,1) &= 0 + (0)(0^2 + 1^2 + 2(0)(1)) + (1)(0^2 + 1^2 - 2(0)(1)) = {\color{#009e73} 1} \; ✅ \\ {\color{#976464} y }(1,0) &= 0 + (0)(1^2 + 0^2 + 2(1)(0)) + (1)(1^2 + 0^2 - 2(1)(0)) = {\color{#009e73} 1} \; ✅ \\ {\color{#976464} y }(1,1) &= 0 + (0)(1^2 + 1^2 + 2(1)(1)) + (1)(1^2 + 1^2 - 2(1)(1)) = {\color{#009e73} 0} \; ✅ \end{align*} \]
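Evaluating the constructed network on all four binary inputs reproduces the XOR truth table (a sketch; `outer()` just evaluates `y_hat` over the grid of \((x_1, x_2)\) combinations):

Code
g <- function(z) z^2
y_hat <- function(x1, x2) {
  a1 <- g(x1 + x2)       # w1 = (0, 1, 1)
  a2 <- g(x1 - x2)       # w2 = (0, 1, -1)
  0 + 0 * a1 + 1 * a2    # beta = (0, 0, 1)
}
outer(0:1, 0:1, y_hat)   # rows: x1 = 0, 1; cols: x2 = 0, 1
# gives the XOR truth table:
#   0 1
#   1 0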

But How?

  • Output Layer is just linear regression on activations (Hidden Layer outputs)
  • We saw in Week 7 how a good basis function allows regression to learn (essentially) any function
  • Neural Networks: GOAT non-linear basis function learners!
Code
library(tidyverse) |> suppressPackageStartupMessages()
library(latex2exp) |> suppressPackageStartupMessages()
# g_pointsize, theme_dsan(), and remove_legend_title() are defined in the
# course's shared plotting setup (not shown in this chunk)
xor_df <- tribble(
    ~x1, ~x2, ~label,
    0, 0, 0,
    0, 1, 1,
    1, 0, 1,
    1, 1, 0
) |>
mutate(
    h1 = (x1 - x2)^2,
    label = factor(label)
)
xor_df |> ggplot(aes(x=x1, y=x2, label=label)) +
  geom_point(
    aes(color=label, shape=label),
    size=g_pointsize * 2,
    stroke=6
  ) +
  geom_point(aes(fill=label), color='black', shape=21, size=g_pointsize * 2.5, stroke=0.75, alpha=0.4) +
  scale_x_continuous(breaks=c(0, 1)) +
  scale_y_continuous(breaks=c(0, 1)) +
  expand_limits(y=c(-0.1,1.1)) +
  # pch 95 ('_') and 43 ('+') draw the minus-like and plus class markers
  # (95 renders as a wider dash than 45, '-')
  scale_shape_manual(values=c(95, 43)) +
  theme_dsan(base_size=32) +
  remove_legend_title() +
  labs(
    x=TeX("$x_1$"),
    y=TeX("$x_2$"),
    title="XOR Problem: Original Features"
  )
Figure 1: The DGP \(Y = x_1 \oplus x_2\) produces points in \([0,1]^2\) which are not linearly separable
Code
library(tidyverse)
# cb_palette is defined in the course's shared plotting setup (not shown here)
xor_df <- tribble(
    ~x1, ~x2, ~label,
    0, 0, 0,
    0, 1, 1,
    1, 0, 1,
    1, 1, 0
) |>
mutate(
    h1 = (x1 - x2)^2,
    h2 = (x1 + x2)^2,
    # (0,1) and (1,0) both map to (h1, h2) = (1, 1); nudge (1,0) so both points stay visible
    h2 = ifelse(h1 > 0.5 & x2==0, h2 + 0.5, h2),
    label = factor(label)
)
xor_df |> ggplot(aes(x=h1, y=h2, label=label)) +
  geom_vline(xintercept=0.5, linetype="dashed", linewidth=1) +
  # Negative space
  geom_rect(xmin=-Inf, xmax=0.5, ymin=-Inf, ymax=Inf, fill=cb_palette[1], alpha=0.15) +
  # Positive space
  geom_rect(xmin=0.5, xmax=Inf, ymin=-Inf, ymax=Inf, fill=cb_palette[2], alpha=0.15) +
  geom_point(
    aes(color=label, shape=label),
    size=g_pointsize * 2,
    stroke=6
  ) +
  geom_point(aes(fill=label), color='black', shape=21, size=g_pointsize*2.5, stroke=0.75, alpha=0.4) +
  expand_limits(y=c(-0.2,4.2)) +
  # pch 95 ('_') and 43 ('+') draw the minus-like and plus class markers
  scale_shape_manual(values=c(95, 43)) +
  theme_dsan(base_size=32) +
  remove_legend_title() +
  labs(
    title="NN-Learned Feature Space",
    x=TeX("$h_1(x_1, x_2)$"),
    y=TeX("$h_2(x_1, x_2)$")
  )
Figure 2: Learned bases \(h_1 = (x_1 - x_2)^2\) and \(h_2 = (x_1 + x_2)^2\) enable separating hyperplane \(h_1 = 0.5\)
Code
library(tidyverse)
x1_vals <- seq(from=0, to=1, by=0.0075)
x2_vals <- seq(from=0, to=1, by=0.0075)
grid_df <- expand.grid(x1=x1_vals, x2=x2_vals) |>
  as_tibble() |>
  mutate(
    label=factor(as.numeric((x1-x2)^2 > 0.5))
  )
ggplot() +
  geom_point(
    data=grid_df,
    aes(x=x1, y=x2, color=label),
    alpha=0.4
  ) +
  geom_point(
    data=xor_df,
    aes(x=x1, y=x2, color=label, shape=label),
    size=g_pointsize * 2,
    stroke=6
  ) +
  geom_point(
    data=xor_df,
    aes(x=x1, y=x2, fill=label),
    color='black', shape=21, size=g_pointsize*2.5, stroke=0.75, alpha=0.4
  ) +
  geom_abline(slope=1, intercept=0.7, linetype="dashed", linewidth=1) +
  geom_abline(slope=1, intercept=-0.7, linetype="dashed", linewidth=1) +
  scale_shape_manual(values=c(95, 43)) +
  theme_dsan(base_size=32) +
  remove_legend_title() +
  labs(
    title="XOR Problem: Inverted NN Features",
    x=TeX("$X_1$"), y=TeX("$X_2$")
  )
Figure 3: Here, the blue area represents points where \(h_1 = (x_1 - x_2)^2 > 0.5\)

Multilayer Neural Networks

Multilayer NN for MNIST Handwritten Digit Recognition, Adapted from ISLR Fig 10.4

Input Representation

From “But what is a neural network?”, 3Blue1Brown

But Wait… Ten Outputs?

  • The (magical) softmax function!

\[ z_d = \Pr(Y = d \mid X) = \frac{e^{y_d}}{\sum_{i=0}^{9}e^{y_i}} \]

  • Ensures that each \(z_d\) is a probability! (See the R sketch below.)

\[ \begin{align} 0 \leq z_d &\leq 1 \; \; \forall ~ d \in \{0,\ldots,9\} \\ \sum_{d=0}^{9}z_d &= 1 \end{align} \]
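A minimal softmax sketch in R (the shift by `max(y)` is a standard numerical-stability trick, not something required by the formula itself, and the scores are made up):

Code
softmax <- function(y) {
  z <- exp(y - max(y))   # subtracting max(y) avoids overflow; it cancels in the ratio
  z / sum(z)
}
y_scores <- c(2.0, 1.0, 0.1, -1.2)   # hypothetical output-layer scores y_d
z <- softmax(y_scores)
z            # each z_d lies in [0, 1]...
sum(z)       # ...and they sum to 1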

Visualizing Softmax Results

Fancier Neural Networks

  • Convolutional Neural Networks (CNNs)
  • Recurrent Neural Networks (RNNs)

CNNs

  • Key point: Convolutional layers are not fully connected!
  • Each unit combines (“pools”) info from only a small patch of units in the previous layer, rather than all of them (sketched below)
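A minimal base-R sketch of that local connectivity (not the course's code): each output unit is a weighted sum of one small patch of the input, with the same kernel weights reused at every location.

Code
conv2d <- function(img, kernel) {
  k <- nrow(kernel)                       # assumes a square kernel and square image
  out_dim <- nrow(img) - k + 1            # "valid" convolution, no padding
  out <- matrix(0, out_dim, out_dim)
  for (i in 1:out_dim) {
    for (j in 1:out_dim) {
      patch <- img[i:(i + k - 1), j:(j + k - 1)]
      out[i, j] <- sum(patch * kernel)    # one unit = weighted sum of one k x k patch
    }
  }
  out
}
set.seed(5300)
img <- matrix(runif(36), nrow = 6)        # toy 6 x 6 "image"
kernel <- rbind(c(1, -1), c(1, -1))       # toy 2 x 2 filter (vertical-edge-ish)
dim(conv2d(img, kernel))                  # 5 x 5 feature map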

Decoding the Thought Vector

  • Hidden layers closer to input layer detect low-level “fine-grained” features
  • Hidden layers closer to output layer detect high-level “coarse-grained” features

Decoding the Thought Vector

Variational Autoencoders

RNNs

…More next week, tbh

ISLR Figure 10.12

Ok But How Do We Learn The Weights?

Backpropagation! (3Blue1Brown Again!)

(Full NN playlist here)
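In one sentence: compute the loss on the training data, then use the chain rule to push its gradient backwards through the output layer and the hidden layer, nudging every \(\beta_k\) and \(w_{kj}\) downhill. Below is a minimal full-batch gradient-descent sketch for the single-hidden-layer regression net from earlier, trained on the interaction DGP \(y = x_1 x_2\) (squared-error loss, sigmoid activation; the learning rate, epoch count, and all names are illustrative choices, not the course's code).

Code
set.seed(5300)
sigmoid <- function(z) 1 / (1 + exp(-z))
n <- 500; p <- 2; K <- 5
X <- matrix(runif(n * p, -1, 1), nrow = n)
y <- X[, 1] * X[, 2]                                 # "true" DGP: y = x1 * x2
W <- matrix(rnorm(K * (p + 1), sd = 0.5), nrow = K)  # row k = (w_k0, w_k1, w_k2)
beta <- rnorm(K + 1, sd = 0.5)                       # (beta_0, ..., beta_K)
lr <- 0.1
for (epoch in 1:2000) {
  # Forward pass
  A <- sigmoid(cbind(1, X) %*% t(W))                 # n x K activations
  y_hat <- as.numeric(beta[1] + A %*% beta[-1])      # predictions
  resid <- y_hat - y                                 # d(loss)/d(y_hat), up to a 1/n factor
  # Backward pass (chain rule), for loss = mean((y_hat - y)^2) / 2
  grad_beta <- c(mean(resid), colMeans(resid * A))
  dZ <- outer(resid, beta[-1]) * A * (1 - A)         # gradient at hidden pre-activations
  grad_W <- t(dZ) %*% cbind(1, X) / n
  # Gradient-descent step
  beta <- beta - lr * grad_beta
  W <- W - lr * grad_W
}
A <- sigmoid(cbind(1, X) %*% t(W))
mean((beta[1] + A %*% beta[-1] - y)^2)   # training MSE; compare to var(y) ~ 0.11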

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd ed. New York: Springer.