Week 11: How Do Neural Networks Deep-Learn?

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author: Jeff Jacobs
Published: Monday, March 31, 2025


Schedule

Today’s Planned Schedule:

          Start    End      Topic
Lecture   6:30pm   7:00pm   Single Layer Neural Networks →
          7:00pm   7:20pm   Max-Margin Classifiers →
          7:20pm   8:00pm   Support Vector Classifiers
Break!    8:00pm   8:10pm
          8:10pm   9:00pm   Fancier Neural Networks →

Quick Roadmap

  • Last week: Examples of how NNs are capable of learning…

  • The types of features that let us learn fancy non-linear DGPs: \(Y = {\color{#e69f00} X_1 X_2 }\) , \(Y = {\color{#56b4e9} X_1^2 + X_2^2 }\) , \(Y = {\color{#009E73} X_1 \underset{\mathclap{\small \text{XOR}}}{\oplus} X_2}\)

  • Multi-layer networks like CNNs for “pooling” low-level/fine-grained information into high-level/coarse-grained information

    • Ex: Early layers detect lines, later layers figure out whether they’re brows or smiles
  • This week: How do we actually learn the weights/biases which enable these capabilities?

    • The answer is (🙈) calculus (chain rule)

Step-by-Step

Neural Network Training Procedure

For each training observation \((\mathbf{x}_i, y_i)\):

  1. Predict \(\widehat{y}_i\) from \(\mathbf{x}_i\)
  2. Evaluate loss \(\mathcal{L}(\widehat{y}_i, y_i)\): Cross-Entropy Loss
  3. Update parameters (weights/biases): Backpropagation

Key for success of NNs: Non-linear but differentiable

  • \(\Rightarrow\) the parameters \(w^*\) most responsible for the loss value can be
    1. identified: \(w^* = \argmax_w\left| \frac{\partial \mathcal{L}}{\partial w} \right|\) (largest-magnitude gradient), then
    2. changed the most in the update \(w^*_t \rightarrow w^*_{t+1}\)
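
Putting the three steps together, here is a minimal sketch in Python (an assumption-laden illustration, not the course's own code): a single-layer softmax classifier standing in for the full multilayer MNIST network, with a plain gradient-descent update and learning rate \(\eta = 0.1\). All names and data below are made up.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, b, x, y, eta=0.1):
    """One gradient-descent step for a single-layer softmax classifier."""
    # 1. Predict: class probabilities for this observation
    y_hat = softmax(W @ x + b)                 # shape (10,)
    # 2. Evaluate cross-entropy loss against the one-hot truth y
    #    (natural log here; the slides' log2 only rescales by a constant)
    loss = -np.sum(y * np.log(y_hat + 1e-12))
    # 3. Backpropagate: for softmax + cross-entropy, dL/dz = y_hat - y
    dz = y_hat - y
    dW = np.outer(dz, x)                       # dL/dW, shape (10, 784)
    db = dz                                    # dL/db, shape (10,)
    # Parameters with larger-magnitude gradients get changed the most
    W = W - eta * dW
    b = b - eta * db
    return W, b, loss

# Toy usage with made-up data: one 784-pixel image whose true digit is 3
rng = np.random.default_rng(0)
x = rng.random(784)
y = np.eye(10)[3]
W = rng.normal(scale=0.01, size=(10, 784))
b = np.zeros(10)
W, b, loss = train_step(W, b, x, y)
```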

Recall: MNIST Digits

Multilayer NN for MNIST Handwritten Digit Recognition, Adapted from ISLR Fig 10.4

How Do We Evaluate Output?

Multilayer NN for MNIST Handwritten Digit Recognition, Adapted from ISLR Fig 10.4

Entropy in General

“Entropy Loss”: Output Layer Uncertainty

Max entropy = max uncertainty

Step 1: NN has no idea, guesses (via softmax) \(\widehat{y}_d = \Pr(y = d) = 0.1 \; \forall d\)

Less entropy = less uncertainty

Step 2: NN starting to converge: \(\Pr(Y = 9)\) high, \(\Pr(Y = 3)\) medium, \(\Pr(Y = d)\) low for all other \(d\)

Min entropy = no uncertainty

Step 3: NN has converged to predicting ultra-high \(\widehat{y}_9 = \Pr(y = 9 \mid X)\)
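
As a sanity check on these three stages, here is a small sketch (not from the course materials) computing the Shannon entropy of the output layer in bits; the Step 2 probabilities are made-up numbers matching the description above.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(p) = -sum_d p_d log2(p_d), in bits (0 * log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

step1 = np.full(10, 0.1)                                      # uniform guess over the 10 digits
step2 = np.full(10, 0.02); step2[9] = 0.60; step2[3] = 0.24   # illustrative "converging" values
step3 = np.zeros(10); step3[9] = 1.0                          # all mass on the digit 9

print(entropy_bits(step1))   # log2(10) ≈ 3.32 bits: maximum uncertainty
print(entropy_bits(step2))   # ≈ 1.84 bits: less uncertainty
print(entropy_bits(step3))   # 0 bits: no uncertainty
```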

The Problem With Entropy Loss

Max entropy = max uncertainty

Step 1: NN has no idea, guesses (via softmax) \(\Pr(y = d) = 0.1\) for every \(d\)

Less entropy = less uncertainty

Step 2: NN starting to converge: probably \(d = 3\), maybe \(d = 9\), low probability on all other values

Min entropy = no uncertainty

Step 3: NN has converged to predicting ultra-high \(\Pr(y = 3)\)
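
To make the problem concrete (a worked check, not from the slides): if the network ends up putting all of its probability on \(d = 3\) while the digit actually written was a 9, the output entropy is still zero, so an entropy-only loss would report a "perfect" score for a completely wrong prediction.

\[
H(\widehat{y}) \;=\; -\sum_{d=0}^{9} \widehat{y}_d \log_2 \widehat{y}_d \;=\; -1 \cdot \log_2(1) \;=\; 0 \text{ bits, even though } y = 9 \neq 3.
\]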

Cross-Entropy Loss: Output Layer vs. Truth

Max entropy = max uncertainty

Step 1: \(H(y, \widehat{y}) = -\sum_d y_d \log_2(\widehat{y}_d) = -1\cdot \log_2(0.1) \approx 3.32\)

Less entropy = less uncertainty

Step 2: \(H(y,\widehat{y}) = -1\cdot \log_2(0.4) \approx 1.32\)

Min entropy = no uncertainty

Step 3: \(H(y,\widehat{y}) = -1\cdot \log_2(1) = 0\)
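
A small sketch reproducing these values (assuming, as above, that the true digit is 9 and using the slide's probabilities 0.1, 0.4, and 1 for that digit), plus the confidently-wrong case that plain entropy could not penalize:

```python
import numpy as np

def cross_entropy_bits(y_onehot, y_hat):
    """H(y, y_hat) = -sum_d y_d log2(y_hat_d); with one-hot y this reduces to
    -log2 of the probability assigned to the true digit."""
    return float(-np.sum(y_onehot * np.log2(np.clip(y_hat, 1e-12, 1.0))))

y_true = np.eye(10)[9]                      # true digit is 9

for p_true in (0.1, 0.4, 1.0):              # probability the network puts on the true digit
    y_hat = np.full(10, (1 - p_true) / 9)   # spread the rest evenly (illustrative)
    y_hat[9] = p_true
    print(round(cross_entropy_bits(y_true, y_hat), 2))   # 3.32, 1.32, 0.0

# The "converged to 3" network from the previous slide: zero entropy,
# but enormous cross-entropy (infinite in the limit; ≈ 39.86 with the 1e-12 clip)
wrong = np.zeros(10); wrong[3] = 1.0
print(round(cross_entropy_bits(y_true, wrong), 2))
```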

It’s Not as Silly as You Think!

  • In our example, we know the true digit… But remember the origin of the dataset: postal workers trying to figure out handwritten digits
  • May not know with certainty, but may be able to say, e.g., “It’s either a 1 or a 7”
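
Cross-entropy handles exactly this kind of partial knowledge: encode the soft label as a distribution over digits instead of a one-hot vector (the 50/50 split below is an assumed encoding, not from the slides), and the same formula still rewards outputs that match it.

```python
import numpy as np

def cross_entropy_bits(y, y_hat):
    return float(-np.sum(y * np.log2(np.clip(y_hat, 1e-12, 1.0))))

# Soft target: the postal worker says "it's either a 1 or a 7"
y_soft = np.zeros(10); y_soft[1] = y_soft[7] = 0.5

hedges_1_or_7 = np.zeros(10); hedges_1_or_7[1] = hedges_1_or_7[7] = 0.5
confident_in_4 = np.full(10, 0.01); confident_in_4[4] = 0.91

print(cross_entropy_bits(y_soft, hedges_1_or_7))   # 1.0 bit: best achievable against this target
print(cross_entropy_bits(y_soft, confident_in_4))  # ≈ 6.64 bits: heavily penalized
```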

Ok But How Do We Learn The Weights?

  • Backpropagation!

Backpropagation: Simple Example

  • Literally just one neuron (which is the output layer), \(\mathcal{L}(\widehat{y},y) = (\widehat{y} - y)^2\)
  • Consider a training datapoint \((x,y) = (2,10)\)
  • And say our current parameters are \(\beta_0 = 1, \beta_1 = 3\)
  • Predicted output: \(\widehat{y} = \beta_0 + \beta_1 x = 1 + 3\cdot 2 = 7\)
  • Since true output is \(y = 10\), we have loss \(\mathcal{L} = (10-7)^2 = 9\)
  • Now, let’s backpropagate to update \(\beta_1\) (on the board!)
    • (Using learning rate of \(0.1\))

Top Secret Answer Slide

  • Weight \(\beta_1\) becomes 4.2…
  • New prediction: \(\widehat{y} = 1 + 4.2\cdot 2 = 9.4\)
  • New loss: \((10-9.4)^2 = 0.36\) 🥳
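
For reference, here is one way to reconstruct the on-the-board chain-rule step that produces 4.2, assuming a plain gradient-descent update with learning rate \(\eta = 0.1\):

\[
\frac{\partial \mathcal{L}}{\partial \beta_1}
= \frac{\partial \mathcal{L}}{\partial \widehat{y}} \cdot \frac{\partial \widehat{y}}{\partial \beta_1}
= 2(\widehat{y} - y)\cdot x
= 2(7 - 10)\cdot 2
= -12
\]

\[
\beta_1 \;\leftarrow\; \beta_1 - \eta \frac{\partial \mathcal{L}}{\partial \beta_1}
= 3 - 0.1\cdot(-12)
= 4.2
\]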

Backpropagation Deeper Dive

Backpropagation! (3Blue1Brown Again!)

(Full NN playlist here)

Simplest Possible Backprop

  • One input unit, one hidden unit, one output unit
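
As a preview of what backprop through a hidden layer looks like, here is a hedged sketch for this 1-1-1 network; the sigmoid hidden activation, linear output, and squared-error loss are assumptions for illustration, not necessarily the slides' setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, w1, b1, w2, b2, eta=0.1):
    """One forward + backward pass for a 1-input, 1-hidden, 1-output network.
    Assumed setup: sigmoid hidden unit, linear output, loss L = (y_hat - y)^2."""
    # ---- Forward pass ----
    z1 = w1 * x + b1            # hidden pre-activation
    a1 = sigmoid(z1)            # hidden activation
    y_hat = w2 * a1 + b2        # output
    loss = (y_hat - y) ** 2

    # ---- Backward pass: chain rule, starting from the loss ----
    dL_dyhat = 2 * (y_hat - y)
    dL_dw2 = dL_dyhat * a1
    dL_db2 = dL_dyhat
    dL_da1 = dL_dyhat * w2
    dL_dz1 = dL_da1 * a1 * (1 - a1)   # sigmoid'(z1) = a1 * (1 - a1)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1

    # ---- Gradient-descent updates ----
    w1 -= eta * dL_dw1; b1 -= eta * dL_db1
    w2 -= eta * dL_dw2; b2 -= eta * dL_db2
    return w1, b1, w2, b2, loss

# Toy usage with made-up starting values
w1, b1, w2, b2 = 0.5, 0.0, 1.0, 0.0
for step in range(3):
    w1, b1, w2, b2, loss = backprop_step(x=2.0, y=1.0, w1=w1, b1=b1, w2=w2, b2=b2)
    print(step, round(loss, 4))
```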

References