Week 11: How Do Neural Networks Deep-Learn?

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author: Jeff Jacobs
Published: Monday, March 31, 2025


Schedule

Today’s Planned Schedule:

          Start    End      Topic
Lecture   6:30pm   7:00pm   Single Layer Neural Networks →
          7:00pm   7:20pm   Max-Margin Classifiers →
          7:20pm   8:00pm   Support Vector Classifiers
Break!    8:00pm   8:10pm
          8:10pm   9:00pm   Fancier Neural Networks →

Quick Roadmap

  • Last week: Examples of how NNs are capable of learning…

  • The types of features that let us learn fancy non-linear DGPs: \(Y = {\color{#e69f00} X_1 X_2 }\) , \(Y = {\color{#56b4e9} X_1^2 + X_2^2 }\) , \(Y = {\color{#009E73} X_1 \underset{\mathclap{\small \text{XOR}}}{\oplus} X_2}\)

  • Multi-layer networks like CNNs for “pooling” low-level/fine-grained information into high-level/coarse-grained information

    • Ex: Early layers detect lines, later layers figure out whether they’re brows or smiles
  • This week: How do we actually learn the weights/biases which enable these capabilities?

    • The answer is (🙈) calculus (chain rule)

Step-by-Step

Neural Network Training Procedure

For each training observation \((\mathbf{x}_i, y_i)\):

  1. Predict \(\widehat{y}_i\) from \(\mathbf{x}_i\)
  2. Evaluate loss \(\mathcal{L}(\widehat{y}_i, y_i)\): Cross-Entropy Loss
  3. Update parameters (weights/biases): Backpropagation

Key for success of NNs: Non-linear but differentiable

  • \(\Rightarrow\) the parameters \(w^*\) most responsible for the loss value can be
    1. identified: \(w^* = \argmax_w\left| \frac{\partial \mathcal{L}}{\partial w} \right|\) (largest-magnitude gradient), then
    2. changed the most in the update \(w^*_t \rightarrow w^*_{t+1}\)
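
Putting the three steps together, here is a minimal sketch in Python (an assumption-laden illustration, not the course's own code): a single-layer softmax classifier standing in for the full multilayer MNIST network, with a plain gradient-descent update and learning rate \(\eta = 0.1\). All names and data below are made up.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, b, x, y, eta=0.1):
    """One gradient-descent step for a single-layer softmax classifier."""
    # 1. Predict: class probabilities for this observation
    y_hat = softmax(W @ x + b)                 # shape (10,)
    # 2. Evaluate cross-entropy loss against the one-hot truth y
    #    (natural log here; the slides' log2 only rescales by a constant)
    loss = -np.sum(y * np.log(y_hat + 1e-12))
    # 3. Backpropagate: for softmax + cross-entropy, dL/dz = y_hat - y
    dz = y_hat - y
    dW = np.outer(dz, x)                       # dL/dW, shape (10, 784)
    db = dz                                    # dL/db, shape (10,)
    # Parameters with larger-magnitude gradients get changed the most
    W = W - eta * dW
    b = b - eta * db
    return W, b, loss

# Toy usage with made-up data: one 784-pixel image whose true digit is 3
rng = np.random.default_rng(0)
x = rng.random(784)
y = np.eye(10)[3]
W = rng.normal(scale=0.01, size=(10, 784))
b = np.zeros(10)
W, b, loss = train_step(W, b, x, y)
```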

Recall: MNIST Digits

Multilayer NN for MNIST Handwritten Digit Recognition, Adapted from ISLR Fig 10.4

How Do We Evaluate Output?

Multilayer NN for MNIST Handwritten Digit Recognition, Adapted from ISLR Fig 10.4

Entropy in General

“Entropy Loss”: Output Layer Uncertainty

Max entropy = max uncertainty

Step 1: NN has no idea, guesses (via softmax) \(\widehat{y}_d = \Pr(y = d) = 0.1 \; \forall d\)

Less entropy = less uncertainty

Step 2: NN starting to converge: \(\Pr(Y = 9)\) high, \(\Pr(Y = 3)\) medium, \(\Pr(Y = d)\) low for all other \(d\)

Min entropy = no uncertainty

Step 3: NN has converged to predicting ultra-high \(\widehat{y}_9 = \Pr(y = 9 \mid X)\)
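
As a sanity check on these three stages, here is a small sketch (not from the course materials) computing the Shannon entropy of the output layer in bits; the Step 2 probabilities are made-up numbers matching the description above.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(p) = -sum_d p_d log2(p_d), in bits (0 * log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

step1 = np.full(10, 0.1)                                      # uniform guess over the 10 digits
step2 = np.full(10, 0.02); step2[9] = 0.60; step2[3] = 0.24   # illustrative "converging" values
step3 = np.zeros(10); step3[9] = 1.0                          # all mass on the digit 9

print(entropy_bits(step1))   # log2(10) ≈ 3.32 bits: maximum uncertainty
print(entropy_bits(step2))   # ≈ 1.84 bits: less uncertainty
print(entropy_bits(step3))   # 0 bits: no uncertainty
```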

The Problem With Entropy Loss

Max entropy = max uncertainty

Step 1: NN has no idea, guesses (via softmax) \(\Pr(y = d) = 0.1\) for every \(d\)

Less entropy = less uncertainty

Step 2: NN starting to converge: probably \(d = 3\), maybe \(d = 9\), low probability on all other values

Min entropy = no uncertainty

Step 3: NN has converged to predicting ultra-high \(\Pr(y = 3)\)
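
To make the problem concrete (a worked check, not from the slides): if the network ends up putting all of its probability on \(d = 3\) while the digit actually written was a 9, the output entropy is still zero, so an entropy-only loss would report a "perfect" score for a completely wrong prediction.

\[
H(\widehat{y}) \;=\; -\sum_{d=0}^{9} \widehat{y}_d \log_2 \widehat{y}_d \;=\; -1 \cdot \log_2(1) \;=\; 0 \text{ bits, even though } y = 9 \neq 3.
\]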

Cross-Entropy Loss: Output Layer vs. Truth

Max entropy = max uncertainty

Step 1: \(H(y, \widehat{y}) = -\sum_d y_d \log_2(\widehat{y}_d) = -1\cdot \log_2(0.1) \approx 3.32\)

Less entropy = less uncertainty

Step 2: \(H(y,\widehat{y}) = -1\cdot \log_2(0.4) \approx 1.32\)

Min entropy = no uncertainty

Step 3: \(H(y,\widehat{y}) = -1\cdot \log_2(1) = 0\)
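
A small sketch reproducing these values (assuming, as above, that the true digit is 9 and using the slide's probabilities 0.1, 0.4, and 1 for that digit), plus the confidently-wrong case that plain entropy could not penalize:

```python
import numpy as np

def cross_entropy_bits(y_onehot, y_hat):
    """H(y, y_hat) = -sum_d y_d log2(y_hat_d); with one-hot y this reduces to
    -log2 of the probability assigned to the true digit."""
    return float(-np.sum(y_onehot * np.log2(np.clip(y_hat, 1e-12, 1.0))))

y_true = np.eye(10)[9]                      # true digit is 9

for p_true in (0.1, 0.4, 1.0):              # probability the network puts on the true digit
    y_hat = np.full(10, (1 - p_true) / 9)   # spread the rest evenly (illustrative)
    y_hat[9] = p_true
    print(round(cross_entropy_bits(y_true, y_hat), 2))   # 3.32, 1.32, 0.0

# The "converged to 3" network from the previous slide: zero entropy,
# but enormous cross-entropy (infinite in the limit; ≈ 39.86 with the 1e-12 clip)
wrong = np.zeros(10); wrong[3] = 1.0
print(round(cross_entropy_bits(y_true, wrong), 2))
```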

It’s Not as Silly as You Think!

  • In our example, we know the true digit… But remember the origin of the dataset: postal workers trying to figure out handwritten digits
  • May not know with certainty, but may be able to say, e.g., “It’s either a 1 or a 7”
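
Cross-entropy handles exactly this kind of partial knowledge: encode the soft label as a distribution over digits instead of a one-hot vector (the 50/50 split below is an assumed encoding, not from the slides), and the same formula still rewards outputs that match it.

```python
import numpy as np

def cross_entropy_bits(y, y_hat):
    return float(-np.sum(y * np.log2(np.clip(y_hat, 1e-12, 1.0))))

# Soft target: the postal worker says "it's either a 1 or a 7"
y_soft = np.zeros(10); y_soft[1] = y_soft[7] = 0.5

hedges_1_or_7 = np.zeros(10); hedges_1_or_7[1] = hedges_1_or_7[7] = 0.5
confident_in_4 = np.full(10, 0.01); confident_in_4[4] = 0.91

print(cross_entropy_bits(y_soft, hedges_1_or_7))   # 1.0 bit: best achievable against this target
print(cross_entropy_bits(y_soft, confident_in_4))  # ≈ 6.64 bits: heavily penalized
```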

Ok But How Do We Learn The Weights?

  • Backpropagation!

Backpropagation: Simple Example

  • Literally just one neuron (which is the output layer), \(\mathcal{L}(\widehat{y},y) = (\widehat{y} - y)^2\)
  • Consider a training datapoint \((x,y) = (2,10)\)
  • And say our current parameters are \(\beta_0 = 1, \beta_1 = 3\)
  • Predicted output: \(\widehat{y} = \beta_0 + \beta_1 x = 1 + 3\cdot 2 = 7\)
  • Since true output is \(y = 10\), we have loss \(\mathcal{L} = (10-7)^2 = 9\)
  • Now, let’s backpropagate to update \(\beta_1\) (on the board!)
    • (Using learning rate of \(0.1\))

Top Secret Answer Slide

  • Weight \(\beta_1\) becomes 4.2…
  • New prediction: \(\widehat{y} = 1 + 4.2\cdot 2 = 9.4\)
  • New loss: \((10-9.4)^2 = 0.36\) 🥳
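
For reference, here is one way to reconstruct the on-the-board chain-rule step that produces 4.2, assuming a plain gradient-descent update with learning rate \(\eta = 0.1\):

\[
\frac{\partial \mathcal{L}}{\partial \beta_1}
= \frac{\partial \mathcal{L}}{\partial \widehat{y}} \cdot \frac{\partial \widehat{y}}{\partial \beta_1}
= 2(\widehat{y} - y)\cdot x
= 2(7 - 10)\cdot 2
= -12
\]

\[
\beta_1 \;\leftarrow\; \beta_1 - \eta \frac{\partial \mathcal{L}}{\partial \beta_1}
= 3 - 0.1\cdot(-12)
= 4.2
\]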

Backpropagation Deeper Dive

Backpropagation! (3Blue1Brown Again!)

(Full NN playlist here)

Simplest Possible Backprop

  • One input unit, one hidden unit, one output unit
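
As a preview of what backprop through a hidden layer looks like, here is a hedged sketch for this 1-1-1 network; the sigmoid hidden activation, linear output, and squared-error loss are assumptions for illustration, not necessarily the slides' setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, w1, b1, w2, b2, eta=0.1):
    """One forward + backward pass for a 1-input, 1-hidden, 1-output network.
    Assumed setup: sigmoid hidden unit, linear output, loss L = (y_hat - y)^2."""
    # ---- Forward pass ----
    z1 = w1 * x + b1            # hidden pre-activation
    a1 = sigmoid(z1)            # hidden activation
    y_hat = w2 * a1 + b2        # output
    loss = (y_hat - y) ** 2

    # ---- Backward pass: chain rule, starting from the loss ----
    dL_dyhat = 2 * (y_hat - y)
    dL_dw2 = dL_dyhat * a1
    dL_db2 = dL_dyhat
    dL_da1 = dL_dyhat * w2
    dL_dz1 = dL_da1 * a1 * (1 - a1)   # sigmoid'(z1) = a1 * (1 - a1)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1

    # ---- Gradient-descent updates ----
    w1 -= eta * dL_dw1; b1 -= eta * dL_db1
    w2 -= eta * dL_dw2; b2 -= eta * dL_db2
    return w1, b1, w2, b2, loss

# Toy usage with made-up starting values
w1, b1, w2, b2 = 0.5, 0.0, 1.0, 0.0
for step in range(3):
    w1, b1, w2, b2, loss = backprop_step(x=2.0, y=1.0, w1=w1, b1=b1, w2=w2, b2=b2)
    print(step, round(loss, 4))
```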

References