Week 9: Generative vs. Discriminative Models

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author: Jeff Jacobs
Published: Monday, March 17, 2025


Schedule

Today’s Planned Schedule:

|         | Start  | End    | Topic                        |
|---------|--------|--------|------------------------------|
| Lecture | 6:30pm | 7:00pm | Separating Hyperplanes →     |
|         | 7:00pm | 7:20pm | Max-Margin Classifiers →     |
|         | 7:20pm | 8:00pm | Support Vector Classifiers   |
| Break!  | 8:00pm | 8:10pm |                              |
|         | 8:10pm | 9:00pm | Quiz 2 →                     |

Quick Roadmap

  • Weeks 8-9: Shift from focus on regression to focus on classification (Though we use lessons from regression!)
  • Last Week (W08): SVMs as new method with this focus
    • Emphasis on boundary between classes \(\leadsto\) 2.5hrs on separating hyperplanes: in original feature space (Max-Margin, SVCs) or derived feature spaces (SVMs)
  • Now: Wait, didn’t we discuss a classification method before, though its name confusingly had “regression” in it? 🤔
    • Take logistic regression but use Bayes rule to “flip” from regression[+thresholding] task to class-separation task (think of SVM’s max-width-of-“slab” objective!)

Logistic Regression Refresher

  • We don’t have time for a full refresher, but just remember that it involves learning \(\beta_j\) values that minimize a loss with respect to the model

\[ \begin{align*} &\log\left[ \frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)} \right] = \beta_0 + \beta_1 X \\ &\iff \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \end{align*} \]

  • And then, if we want to classify \(x\) rather than just predict \(\Pr(Y = 1 \mid X = x)\), we apply a threshold \(t \in [0,1]\):

\[ \widehat{y} = \begin{cases} 1 &\text{if }\Pr(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} > t \\ 0 &\text{otherwise} \end{cases} \]
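Here is a minimal sketch of the refresher above (not the lecture’s code): it fits a one-feature logistic regression with scikit-learn on simulated data, then applies a hypothetical threshold \(t = 0.5\) to turn \(\Pr(Y = 1 \mid X = x)\) into \(\widehat{y}\). The simulated DGP and all variable names are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5300)
x = rng.normal(size=(200, 1))                        # single feature X
p_true = 1 / (1 + np.exp(-(-0.5 + 2.0 * x[:, 0])))   # "true" P(Y=1 | X)
y = rng.binomial(1, p_true)                          # binary labels

model = LogisticRegression().fit(x, y)               # learns beta_0, beta_1
prob_1 = model.predict_proba(x)[:, 1]                # estimated P(Y=1 | X=x)

t = 0.5                                              # threshold in [0, 1]
y_hat = (prob_1 > t).astype(int)                     # classify: 1 if prob > t, else 0
print(model.intercept_, model.coef_.ravel(), y_hat[:10])
```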

Intuition

  • Logistic regression is called a discriminative model, since we are learning parameters \(\beta_j\) that best produce a predicted class \(\widehat{y_i}\) from features \(\mathbf{x}_i\)
  • We’re modeling \(\Pr(Y = k \mid X)\) (for two classes, \(k = 0\) and \(k = 1\)), hence the LHS of the Logistic Regression formula
  • But there are cases where we can do better¹ by instead modeling (learning parameters for) \(\Pr(X \mid Y = k)\), for each \(k\), then using Bayes’ rule to “flip” back to \(\Pr(Y = k \mid X)\)!
    \(\leadsto\) LDA, QDA, and Naïve Bayes classifiers

Linear Discriminant Analysis (LDA)

  • Not to be confused with the NLP model called “LDA”!
  • In that case LDA = “Latent Dirichlet Allocation”

Bayes’ Rule

  • First things first, we generalize from \(Y \in \{0, 1\}\) to \(K\) possible classes (labels), since the notation for \(K\) classes here is not much more complex than 2 classes!
  • We label the pieces using ISLR’s notation to make our lives easier:

\[ \underbrace{\Pr(Y = k \mid X = x)}_{p_k(x)} = \frac{ \overbrace{\Pr(X = x \mid Y = k)}^{f_k(x)} \overbrace{\Pr(Y = k)}^{\pi_k} }{ \sum_{\ell = 1}^{K} \underbrace{\Pr(X = x \mid Y = \ell)}_{f_{\ell}(x)} \underbrace{\Pr(Y = \ell)}_{\pi_{\ell}} } = \frac{f_k(x) \overbrace{\pi_k}^{\mathclap{\text{Prior}(k)}}}{\sum_{\ell = 1}^{K}f_{\ell}(x) \underbrace{\pi_\ell}_{\mathclap{\text{Prior}(\ell)}}} \]

  • So if we do have only two classes, \(K = 2\) and \(p_1(x) = \frac{f_1(x)\pi_1}{f_1(x)\pi_1 + f_0(x)\pi_0}\)

  • Priors can be estimated as \(n_k / n\). The hard work is in modeling \(f_k(x)\)! With estimates of these two “pieces” for each \(k\), we can derive a classifier \(\widehat{y}(x) = \argmax_k p_k(x)\)
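To make the “flip” concrete, here is a small numerical sketch (not from the lecture or ISLR): assume Gaussian class densities \(f_k\) with made-up parameter values and just evaluate \(p_k(x) = f_k(x)\pi_k / \sum_\ell f_\ell(x)\pi_\ell\) and its argmax. numpy/scipy and all names here are my own assumptions.

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])      # priors pi_k (e.g., estimated as n_k / n)
mu = np.array([-1.0, 2.0])     # per-class means (made-up values)
sigma = 1.5                    # shared standard deviation

def posterior(x):
    f = norm.pdf(x, loc=mu, scale=sigma)  # f_k(x) for each class k
    numer = f * pi                        # f_k(x) * pi_k
    return numer / numer.sum()            # divide by sum_l f_l(x) * pi_l

x_new = 0.8
p = posterior(x_new)
print(p, p.argmax())           # classifier: y_hat(x) = argmax_k p_k(x)
```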

The LDA Assumption (One Feature \(x\))

  • Within each class \(k\), values of \(x\) are normally distributed:

\[ (X \mid Y = k) \sim \mathcal{N}(\mu_k, \sigma^2) \iff f_k(x) = \frac{1}{\sqrt{2 \pi}\sigma}\exp\left[-\frac{1}{2}\left( \frac{x - \mu_k}{\sigma} \right)^2\right] \]

  • Plugging back into (notationally-simplified) classifier, we get

\[ \widehat{y}(x) = \argmax_{k}\left[ \frac{ \pi_k \frac{1}{\sqrt{2 \pi}\sigma}\exp\left[-\frac{1}{2}\left( \frac{x - \mu_k}{\sigma} \right)^2\right] }{ \sum_{\ell = 1}^{K}\pi_{\ell} \frac{1}{\sqrt{2 \pi}\sigma}\exp\left[-\frac{1}{2}\left( \frac{x - \mu_\ell}{\sigma} \right)^2\right] }\right], \]

  • Gross, BUT since \(\log\) is monotonically increasing, \(\argmax_k p_k(x) = \argmax_k \log(p_k(x))\); taking logs and dropping terms that don’t depend on \(k\) yields the “linear” discriminant \(\delta_k(x)\):

\[ \widehat{y}(x) = \argmax_k[\delta_k(x)] = \argmax_{k}\left[ \overbrace{\frac{\mu_k}{\sigma^2}}^{\smash{m}} x ~ \overbrace{- \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)}^{\smash{b}} \right] \]
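And a matching sketch of the discriminant itself, with the same made-up parameter values as the earlier snippet: because the denominator of \(p_k(x)\) doesn’t depend on \(k\), \(\argmax_k \delta_k(x)\) picks the same class as \(\argmax_k p_k(x)\).

```python
import numpy as np

mu = np.array([-1.0, 2.0])     # class means mu_k (made-up values)
sigma2 = 1.5 ** 2              # shared variance sigma^2
pi = np.array([0.3, 0.7])      # priors pi_k

def delta(x):
    # Linear in x: slope m = mu_k / sigma^2, intercept b = -mu_k^2/(2 sigma^2) + log(pi_k)
    return (mu / sigma2) * x - mu ** 2 / (2 * sigma2) + np.log(pi)

x_new = 0.8
print(delta(x_new), delta(x_new).argmax())  # same argmax as the full posterior p_k(x)
```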

Decision Boundaries

  • The boundary between two classes \(k\) and \(k'\) will be the point at which \(\delta_k(x) = \delta_{k'}(x)\)

  • For two classes with equal priors (\(\pi_0 = \pi_1\)), we can solve \(\delta_0(x) = \delta_1(x)\) for \(x\) to obtain \(x = \frac{\mu_0 + \mu_1}{2}\) (derivation below)

  • To derive a boundary from data: \(x = \frac{\widehat{\mu}_0 + \widehat{\mu}_1}{2}\) \(\Rightarrow\) Predict \(1\) if \(x > \frac{\widehat{\mu}_0 + \widehat{\mu}_1}{2}\), \(0\) otherwise (assuming \(\widehat{\mu}_1 > \widehat{\mu}_0\))
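Filling in the algebra for the equal-prior case: with \(\pi_0 = \pi_1\) the \(\log(\pi_k)\) terms cancel, and (assuming \(\mu_0 \neq \mu_1\)) we can divide through by \(\mu_0 - \mu_1\):

\[ \begin{align*} \frac{\mu_0}{\sigma^2}x - \frac{\mu_0^2}{2\sigma^2} &= \frac{\mu_1}{\sigma^2}x - \frac{\mu_1^2}{2\sigma^2} \\ \iff \frac{\mu_0 - \mu_1}{\sigma^2}x &= \frac{\mu_0^2 - \mu_1^2}{2\sigma^2} = \frac{(\mu_0 - \mu_1)(\mu_0 + \mu_1)}{2\sigma^2} \\ \iff x &= \frac{\mu_0 + \mu_1}{2} \end{align*} \]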

ISLR Figure 4.4: Estimating the Decision Boundary from data. The dashed line is the “true” boundary \(x = \frac{\mu_0 + \mu_1}{2}\), while the solid line in the right panel is the boundary estimated from data as \(x = \frac{\widehat{\mu}_0 + \widehat{\mu}_1}{2}\).

Number of Parameters

  • \(K = 2\) is a special case, since lots of things cancel out, but in general we need to estimate:

| English | Notation | How Many | Formula |
|---|---|---|---|
| Prior for class \(k\) | \(\widehat{\pi}_k\) | \(K - 1\) | \(\widehat{\pi}_k = n_k / n\) |
| Estimated mean for class \(k\) | \(\widehat{\mu}_k\) | \(K\) | \(\widehat{\mu}_k = \displaystyle \frac{1}{n_k}\sum_{\{i \mid y_i = k\}}x_i\) |
| Estimated (shared) variance | \(\widehat{\sigma}^2\) | \(1\) | \(\widehat{\sigma}^2 = \displaystyle \frac{1}{n - K}\sum_{k = 1}^{K}\sum_{i: y_i = k}(x_i - \widehat{\mu}_k)^2\) |
| Total | | \(2K\) | |
  • (Keep in mind for fancier methods! This may blow up to be much larger than \(n\))
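A quick numpy sketch of the estimates in the table, on simulated one-feature data (the simulation and variable names are my own, not the lecture’s): priors \(\widehat{\pi}_k = n_k/n\), per-class means \(\widehat{\mu}_k\), and the pooled variance with the \(1/(n - K)\) denominator.

```python
import numpy as np

rng = np.random.default_rng(5300)
K = 3
y = rng.integers(0, K, size=300)          # class labels 0, ..., K-1
x = rng.normal(loc=2.0 * y, scale=1.0)    # x | y = k ~ N(2k, 1)

n = len(x)
n_k = np.array([(y == k).sum() for k in range(K)])
pi_hat = n_k / n                                           # K - 1 free parameters
mu_hat = np.array([x[y == k].mean() for k in range(K)])    # K parameters
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum()      # 1 shared parameter
                 for k in range(K)) / (n - K)
print(pi_hat, mu_hat, sigma2_hat)
```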

LDA with Multiple Features (Here \(p = 2\))

  • Within each class \(k\), values of \(\mathbf{x}\) are (multivariate) normally distributed:

\[ \left( \begin{bmatrix}X_1 \\ X_2\end{bmatrix} \middle| ~ Y = k \right) \sim \mathcal{N}_2(\boldsymbol\mu_k, \mathbf{\Sigma}) \]

  • Increasing \(p\) to 2 and \(K\) to 3 means more parameters to estimate, but still linear boundaries: it turns out that a shared (co)variance (\(\sigma^2\) or \(\mathbf{\Sigma}\)) will always produce linear boundaries 🤔 (see the sketch below)

ISLR Figure 4.6: Like before, dashed lines are “true” boundaries while solid lines are boundaries estimated from data
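As a sketch (not the lecture’s code), scikit-learn’s `LinearDiscriminantAnalysis` fits exactly this shared-\(\mathbf{\Sigma}\) model; on simulated \(p = 2\), \(K = 3\) data it estimates the priors, the class means, and a single pooled covariance, and its decision boundaries are linear.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5300)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # mu_k for K = 3 classes
cov = np.array([[1.0, 0.3], [0.3, 1.0]])                # one shared Sigma
X = np.vstack([rng.multivariate_normal(m, cov, size=100) for m in means])
y = np.repeat([0, 1, 2], 100)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(lda.priors_)                  # estimated pi_k
print(lda.means_)                   # estimated mu_k
print(lda.covariance_)              # single pooled covariance estimate
print(lda.predict([[1.5, 1.5]]))    # classify a new point
```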

Quadratic Class Boundaries

  • To achieve non-linear boundaries, estimate covariance matrix \(\mathbf{\Sigma}_k\) for each class \(k\):

\[ \left( \begin{bmatrix}X_1 \\ X_2\end{bmatrix} \middle| ~ Y = k \right) \sim \mathcal{N}_2(\boldsymbol\mu_k, \mathbf{\Sigma}_k) \]

  • Pros: Non-linear class boundaries! Cons: More parameters to estimate, and it tends to do worse than LDA when the true boundary is linear (or nearly linear), since the extra flexibility adds variance (see the sketch below)
  • Deciding factor: do you think the data-generating process (DGP) produces normally-distributed classes with the same variance?

ISLR Figure 4.9: Dashed purple line is “true” boundary (Bayes decision boundary), dotted black line is LDA boundary, solid green line is QDA boundary
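Here is a hedged comparison sketch on simulated data (scikit-learn assumed; this is not ISLR’s figure): give each class its own \(\mathbf{\Sigma}_k\) with different orientations, and QDA can pick up the resulting curved boundary while LDA cannot.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(5300)
cov0 = np.array([[1.0, 0.7], [0.7, 1.0]])     # Sigma_0
cov1 = np.array([[1.0, -0.7], [-0.7, 1.0]])   # Sigma_1 (different orientation)
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov0, 200),
               rng.multivariate_normal([1.0, 1.0], cov1, 200)])
y = np.repeat([0, 1], 200)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    acc = model.fit(X, y).score(X, y)   # training accuracy, just for illustration
    print(type(model).__name__, round(acc, 3))
```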

Key Advantage of Generative Model

  • You get an actual “picture” of what the data looks like!
[Plot grid: estimated per-feature class densities \(\widehat{f}_k(X_1)\), \(\widehat{f}_k(X_2)\), \(\widehat{f}_k(X_3)\), one row for \(k = 1\) and one for \(k = 2\)]

Classifying New Points

  • For new feature values \(x_{ij}\), compare how likely this value is under \(k = 1\) vs. \(k = 2\)
  • Example: \(\mathbf{x} = (0.4, 1.5, 1)^{\top}\)
| | \(X_1 = 0.4\) | \(X_2 = 1.5\) | \(X_3 = 1\) |
|---|---|---|---|
| \(k = 1\) | \(f_1(0.4) = 0.368\) | \(f_1(1.5) = 0.484\) | \(f_1(1) = 0.226\) |
| \(k = 2\) | \(f_2(0.4) = 0.030\) | \(f_2(1.5) = 0.130\) | \(f_2(1) = 0.616\) |
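One way to combine these per-feature densities is the Naïve Bayes product mentioned earlier: multiply \(\prod_j f_k(x_j)\) by the prior \(\pi_k\) and normalize. The equal priors below are a hypothetical assumption on my part, not a value from the lecture.

```python
import numpy as np

f = np.array([[0.368, 0.484, 0.226],      # f_1(x_j) for x = (0.4, 1.5, 1)
              [0.030, 0.130, 0.616]])     # f_2(x_j)
pi = np.array([0.5, 0.5])                 # hypothetical equal priors

numer = pi * f.prod(axis=1)               # pi_k * prod_j f_k(x_j)
posterior = numer / numer.sum()           # p_k(x)
print(posterior, posterior.argmax() + 1)  # classes are labeled k = 1, 2
```

Under these assumptions the posterior for \(k = 1\) comes out to roughly 0.94: the new point’s first two features are far more likely under class 1, which outweighs the third feature favoring class 2.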

Quiz Time!

Appendix: Fuller Logistic Derivation

\[ \begin{align*} &\log\left[ \frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)} \right] = \beta_0 + \beta_1 X \\ &\iff \frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)} = e^{\beta_0 + \beta_1 X} \\ &\iff \Pr(Y = 1 \mid X) = e^{\beta_0 + \beta_1 X}(1 - \Pr(Y = 1 \mid X)) \\ &\iff \Pr(Y = 1 \mid X) = e^{\beta_0 + \beta_1 X} - e^{\beta_0 + \beta_1 X}\Pr(Y = 1 \mid X) \\ &\iff \Pr(Y = 1 \mid X) + e^{\beta_0 + \beta_1 X}\Pr(Y = 1 \mid X) = e^{\beta_0 + \beta_1 X} \\ &\iff \Pr(Y = 1 \mid X)(1 + e^{\beta_0 + \beta_1 X}) = e^{\beta_0 + \beta_1 X} \\ &\iff \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \end{align*} \]


Footnotes

  1. (The more normally-distributed \(X\) is within each class, the more likely these generative methods are to “beat” Logistic Regression)