Week 9: Generative vs. Discriminative Models

DSAN 5300: Statistical Learning
Spring 2025, Georgetown University

Author: Jeff Jacobs

Published: Monday, March 17, 2025

Open slides in new tab →

Schedule

Today’s Planned Schedule:

|         | Start  | End    | Topic                       |
|---------|--------|--------|-----------------------------|
| Lecture | 6:30pm | 7:00pm | Separating Hyperplanes →    |
|         | 7:00pm | 7:20pm | Max-Margin Classifiers →    |
|         | 7:20pm | 8:00pm | Support Vector Classifiers  |
| Break!  | 8:00pm | 8:10pm |                             |
|         | 8:10pm | 9:00pm | Quiz 2 →                    |

Quick Roadmap

  • Weeks 8-9: Shift from focus on regression to focus on classification (Though we use lessons from regression!)
  • Last Week (W08): SVMs as new method with this focus
    • Emphasis on the boundary between classes ⟹ 2.5hrs on separating hyperplanes: in the original feature space (Max-Margin, SVCs) or in derived feature spaces (SVMs)
  • Now: Wait, didn’t we discuss a classification method before, though its name confusingly had “regression” in it? 🤔
    • Take logistic regression but use Bayes’ rule to “flip” from a regression[+thresholding] task to a class-separation task (think of the SVM’s max-width-of-“slab” objective!)

Logistic Regression Refresher

  • We don’t have time for a full refresher, but just remember how it involves learning $\beta_j$ values to minimize loss w.r.t.

$$\log\left[\frac{\Pr(Y=1 \mid X)}{1-\Pr(Y=1 \mid X)}\right] = \beta_0 + \beta_1 X \implies \Pr(Y=1 \mid X = x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}$$

  • And then, if we want to classify $x$ rather than just predict $\Pr(Y=1 \mid X=x)$, we apply a threshold $t \in [0,1]$:

$$\hat{y} = \begin{cases} 1 & \text{if } \Pr(Y=1 \mid X=x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} > t \\ 0 & \text{otherwise} \end{cases}$$
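A minimal sketch of this fit-then-threshold pipeline (not from the slides; the toy data and the choice $t = 0.5$ are illustrative assumptions), using scikit-learn:

```python
# Sketch: fit logistic regression, then threshold Pr(Y = 1 | X = x) at t.
# The toy data and t = 0.5 are illustrative assumptions, not from the slides.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5300)
x = rng.normal(size=(100, 1))                            # one feature
y = (x[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(x, y)                   # learns beta_0, beta_1
probs = model.predict_proba(x)[:, 1]                     # Pr(Y = 1 | X = x)

t = 0.5                                                  # threshold in [0, 1]
y_hat = (probs > t).astype(int)                          # apply the decision rule
print(model.intercept_, model.coef_, y_hat[:10])
```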

Intuition

  • Logistic regression is called a discriminative model, since we are learning parameters $\beta_j$ that best produce a predicted class $\hat{y}_i$ from features $x_i$
  • We’re modeling $\Pr(Y=k \mid X)$ (for two classes, $k=0$ and $k=1$), hence the LHS of the Logistic Regression formula
  • But there are cases where we can do better by instead modeling (learning parameters for) $\Pr(X \mid Y=k)$, for each $k$, then using Bayes’ rule to “flip” back to $\Pr(Y=k \mid X)$!
    ⟹ LDA, QDA, and Naïve Bayes classifiers

Linear Discriminant Analysis (LDA)

  • Not to be confused with the NLP model called “LDA”!
  • In that case LDA = “Latent Dirichlet Allocation”

Bayes’ Rule

  • First things first, we generalize from $Y \in \{0,1\}$ to $K$ possible classes (labels), since the notation for $K$ classes here is not much more complex than for 2 classes!
  • We label the pieces using ISLR’s notation to make our lives easier:

$$\underbrace{\Pr(Y=k \mid X=x)}_{p_k(x)} = \frac{\overbrace{\Pr(X=x \mid Y=k)}^{f_k(x)}\;\overbrace{\Pr(Y=k)}^{\pi_k}}{\sum_{\ell=1}^{K}\underbrace{\Pr(X=x \mid Y=\ell)}_{f_\ell(x)}\;\underbrace{\Pr(Y=\ell)}_{\pi_\ell}} = \frac{f_k(x)\,\overbrace{\pi_k}^{\text{Prior}(k)}}{\sum_{\ell=1}^{K} f_\ell(x)\,\underbrace{\pi_\ell}_{\text{Prior}(\ell)}}$$

  • So if we do have only two classes, $K = 2$ and $p_1(x) = \frac{f_1(x)\,\pi_1}{f_1(x)\,\pi_1 + f_0(x)\,\pi_0}$

  • Priors can be estimated as $\hat{\pi}_k = n_k / n$. The hard work is in modeling $f_k(x)$! With estimates of these two “pieces” for each $k$, we can derive a classifier $\hat{y}(x) = \arg\max_k p_k(x)$
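Here is a small sketch of the Bayes-rule “flip” as a computation: given class-conditional densities $f_k$ and priors $\pi_k$ (the Gaussian densities and their parameters below are made up for illustration), compute $p_k(x)$ and take the argmax:

```python
# Sketch of the Bayes-rule "flip": p_k(x) = f_k(x) pi_k / sum_l f_l(x) pi_l.
# The Gaussian class-conditionals and priors are made up for illustration.
import numpy as np
from scipy.stats import norm

pis = np.array([0.3, 0.7])                 # priors pi_k, e.g. estimated as n_k / n
f = [norm(loc=-1.0, scale=1.0).pdf,        # f_0(x): density of X given Y = 0
     norm(loc=+2.0, scale=1.0).pdf]        # f_1(x): density of X given Y = 1

def posterior(x):
    """Return [p_0(x), p_1(x)] via Bayes' rule."""
    numer = np.array([f[k](x) * pis[k] for k in range(2)])
    return numer / numer.sum()

x_new = 0.2
p = posterior(x_new)
y_hat = int(np.argmax(p))                  # classifier: argmax_k p_k(x)
print(p, y_hat)
```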

The LDA Assumption (One Feature x)

  • Within each class k, values of x are normally distributed:

$$(X \mid Y = k) \sim \mathcal{N}(\mu_k, \sigma^2) \implies f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma}\right)^2\right]$$

  • Plugging back into the (notationally-simplified) classifier, we get

$$\hat{y}(x) = \arg\max_k\left[\frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma}\right)^2\right]}{\sum_{\ell=1}^{K}\pi_\ell \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu_\ell}{\sigma}\right)^2\right]}\right],$$

  • Gross, BUT since $\arg\max_k p_k(x) = \arg\max_k \log(p_k(x))$, this simplifies to a “linear” discriminant $\delta_k(x)$:

$$\hat{y}(x) = \arg\max_k\left[\delta_k(x)\right] = \arg\max_k\Bigg[\underbrace{\frac{\mu_k}{\sigma^2}}_{m}\,x + \underbrace{\left(-\frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)\right)}_{b}\Bigg]$$
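A quick sketch evaluating the discriminants $\delta_k(x)$ directly; the estimates $\hat{\mu}_k$, $\hat{\sigma}^2$, $\hat{\pi}_k$ below are made up purely for illustration:

```python
# Sketch: evaluate delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
# for each class k, then classify by argmax. Estimates below are illustrative only.
import numpy as np

mu_hat = np.array([-1.0, 2.0])      # class means mu_hat_k
sigma2_hat = 1.5                    # shared variance sigma_hat^2
pi_hat = np.array([0.4, 0.6])       # priors pi_hat_k

def delta(x):
    """Linear discriminant delta_k(x), one value per class k."""
    return x * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)

x_new = 0.3
print(delta(x_new), np.argmax(delta(x_new)))   # predicted class = argmax_k delta_k(x)
```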

Decision Boundaries

  • The boundary between two classes $k$ and $k'$ will be the point at which $\delta_k(x) = \delta_{k'}(x)$

  • For two classes (with equal priors, $\pi_0 = \pi_1$), can solve $\delta_0(x) = \delta_1(x)$ for $x$ to obtain $x = \frac{\mu_0 + \mu_1}{2}$

  • To derive a boundary from data: $x = \frac{\hat{\mu}_0 + \hat{\mu}_1}{2}$ ⟹ Predict 1 if $x > \frac{\hat{\mu}_0 + \hat{\mu}_1}{2}$, 0 otherwise

ISLR Figure 4.4: Estimating the decision boundary from data. The dashed line is the “true” boundary $x = \frac{\mu_0 + \mu_1}{2}$, while the solid line in the right panel is the boundary estimated from data as $x = \frac{\hat{\mu}_0 + \hat{\mu}_1}{2}$.
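A quick numerical check (with made-up estimates, used only for illustration) that, under equal priors, solving $\delta_0(x) = \delta_1(x)$ recovers the midpoint of the two estimated means:

```python
# Sketch: with equal priors the log(pi_k) terms cancel, so delta_0(x) = delta_1(x)
# gives x = (mu_0 + mu_1) / 2. The estimates below are made up for illustration.
import numpy as np

mu_hat = np.array([-1.0, 2.0])      # mu_hat_0, mu_hat_1

# delta_0(x) = delta_1(x)  =>  x (mu_0 - mu_1) / sigma^2 = (mu_0^2 - mu_1^2) / (2 sigma^2)
#                          =>  x = (mu_0 + mu_1) / 2   (sigma^2 cancels too)
boundary = (mu_hat[0]**2 - mu_hat[1]**2) / (2 * (mu_hat[0] - mu_hat[1]))
print(boundary, (mu_hat[0] + mu_hat[1]) / 2)   # both print 0.5
```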

Number of Parameters

  • $K=2$ is a special case, since lots of things cancel out, but in general we need to estimate:

| English | Notation | How Many | Formula |
|---------|----------|----------|---------|
| Prior for class $k$ | $\hat{\pi}_k$ | $K-1$ | $\hat{\pi}_k = n_k / n$ |
| Estimated mean for class $k$ | $\hat{\mu}_k$ | $K$ | $\hat{\mu}_k = \frac{1}{n_k}\sum_{\{i \,\mid\, y_i = k\}} x_i$ |
| Estimated (shared) variance | $\hat{\sigma}^2$ | $1$ | $\hat{\sigma}^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:\, y_i = k}(x_i - \hat{\mu}_k)^2$ |

Total: $2K$

  • (Keep this in mind for fancier methods! The parameter count may blow up to be much larger than $n$)
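As a quick sanity check on the count above, with (say) $K = 3$ classes:

$$\underbrace{(K-1)}_{\text{priors } \hat{\pi}_k} + \underbrace{K}_{\text{means } \hat{\mu}_k} + \underbrace{1}_{\text{shared } \hat{\sigma}^2} = 2 + 3 + 1 = 6 = 2K$$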

LDA with Multiple Features (Here p=2)

  • Within each class k, values of x are (multivariate) normally distributed:

$$\left(\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \,\middle|\; Y = k\right) \sim \mathcal{N}_2(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$$

  • Increasing $p$ to 2 and $K$ to 3 means more parameters, but still linear boundaries. It turns out: a shared variance ($\sigma^2$ or $\boldsymbol{\Sigma}$) will always produce linear boundaries 🤔

ISLR Figure 4.6: Like before, dashed lines are “true” boundaries while solid lines are boundaries estimated from data
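As a minimal sketch of LDA with $p = 2$ features, the data here is simulated from three Gaussians with a shared covariance (an illustrative assumption) and fit with scikit-learn:

```python
# Sketch: LDA with p = 2 features and K = 3 classes via scikit-learn.
# The simulated Gaussian data (shared covariance) is made up for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5300)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])      # true mu_k, k = 1..3
X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=50) for m in means])
y = np.repeat([0, 1, 2], 50)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(lda.means_)            # estimated class means mu_hat_k
print(lda.covariance_)       # single shared covariance estimate Sigma_hat
print(lda.predict([[1.0, 1.0]]))
```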

Quadratic Class Boundaries

  • To achieve non-linear boundaries, estimate a covariance matrix $\boldsymbol{\Sigma}_k$ for each class $k$:

$$\left(\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \,\middle|\; Y = k\right) \sim \mathcal{N}_2(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

  • Pros: Non-linear class boundaries! Cons: More parameters to estimate, and it does worse than LDA when the true boundary is linear (or nearly linear).
  • Deciding factor: do you think the DGP (data-generating process) produces normal classes with the same variance?

ISLR Figure 4.9: Dashed purple line is “true” boundary (Bayes decision boundary), dotted black line is LDA boundary, solid green line is QDA boundary
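A sketch of this kind of comparison: QDA estimates a separate $\hat{\boldsymbol{\Sigma}}_k$ per class, so its boundary can curve; the simulated data with unequal covariances is an illustrative assumption:

```python
# Sketch: QDA estimates a separate covariance Sigma_k per class, so its boundary
# can be quadratic. Simulated data with unequal covariances is illustrative only.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(5300)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=100)
X1 = rng.multivariate_normal([2, 2], [[2.0, 0.8], [0.8, 0.5]], size=100)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)     # one shared Sigma -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class Sigma_k -> quadratic boundary
x_new = [[1.0, 1.0]]
print(lda.predict_proba(x_new), qda.predict_proba(x_new))
```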

Key Advantage of Generative Model

  • You get an actual “picture” of what the data looks like!
[Figure: grid of estimated class-conditional densities $\hat{f}_k(X_1)$, $\hat{f}_k(X_2)$, $\hat{f}_k(X_3)$, with one row per class $k = 1, 2$]
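One way such a “picture” might be built, as a sketch: fit one univariate normal to each (class, feature) pair; the simulated data below is an illustrative assumption:

```python
# Sketch: the "picture" a generative model gives you -- one estimated density
# f_hat_k(X_j) per (class k, feature j). The simulated data is illustrative only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5300)
X = {1: rng.normal(loc=[0.5, 1.2, -0.8], scale=1.0, size=(60, 3)),
     2: rng.normal(loc=[-1.0, 3.0, 1.1], scale=1.0, size=(60, 3))}

# One fitted normal per (class, feature): mean and sd estimated from that column
f_hat = {k: [norm(loc=Xk[:, j].mean(), scale=Xk[:, j].std(ddof=1))
             for j in range(3)]
         for k, Xk in X.items()}

for k in (1, 2):
    print(k, [round(d.mean(), 2) for d in f_hat[k]])   # estimated means per feature
```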

Classifying New Points

  • For new feature values $x^*_j$, compare how likely each value is under $k=1$ vs. $k=2$
  • Example: $x^* = (0.4, 1.5, 1)$

|       | $\hat{f}_k(X_1)$ | $\hat{f}_k(X_2)$ | $\hat{f}_k(X_3)$ |
|-------|------------------|------------------|------------------|
| $k=1$ | $f_1(0.4)=0.368$ | $f_1(1.5)=0.484$ | $f_1(1)=0.226$   |
| $k=2$ | $f_2(0.4)=0.030$ | $f_2(1.5)=0.130$ | $f_2(1)=0.616$   |
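One way to turn the table into a prediction, as a sketch: multiply the per-feature density values (a naïve-Bayes-style independence assumption across features) and weight by priors (assumed equal here); both choices are assumptions made only for illustration:

```python
# Sketch: combine the per-feature density values from the table above, assuming
# independent features (naive-Bayes style) and equal priors -- both assumptions.
import numpy as np

f_vals = {1: [0.368, 0.484, 0.226],    # f_1(x_j) for x* = (0.4, 1.5, 1)
          2: [0.030, 0.130, 0.616]}    # f_2(x_j) for the same x*
priors = {1: 0.5, 2: 0.5}              # assumed equal priors

scores = {k: priors[k] * np.prod(f_vals[k]) for k in f_vals}
print(scores, max(scores, key=scores.get))   # class 1 wins for this x*
```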

Quiz Time!

Appendix: Fuller Logistic Derivation

$$\begin{aligned}
\log\left[\frac{\Pr(Y=1 \mid X)}{1-\Pr(Y=1 \mid X)}\right] &= \beta_0 + \beta_1 X \\
\implies \frac{\Pr(Y=1 \mid X)}{1-\Pr(Y=1 \mid X)} &= e^{\beta_0 + \beta_1 X} \\
\implies \Pr(Y=1 \mid X) &= e^{\beta_0 + \beta_1 X}\left(1 - \Pr(Y=1 \mid X)\right) \\
\implies \Pr(Y=1 \mid X) &= e^{\beta_0 + \beta_1 X} - e^{\beta_0 + \beta_1 X}\Pr(Y=1 \mid X) \\
\implies \Pr(Y=1 \mid X) + e^{\beta_0 + \beta_1 X}\Pr(Y=1 \mid X) &= e^{\beta_0 + \beta_1 X} \\
\implies \Pr(Y=1 \mid X)\left(1 + e^{\beta_0 + \beta_1 X}\right) &= e^{\beta_0 + \beta_1 X} \\
\implies \Pr(Y=1 \mid X) &= \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
\end{aligned}$$
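A quick numerical sanity check on the final line (the $\beta$ values and test points below are arbitrary):

```python
# Sanity check: Pr(Y=1|X) = exp(b0 + b1 x) / (1 + exp(b0 + b1 x)) recovers the
# log-odds b0 + b1 x. The beta values and test points are arbitrary.
import numpy as np

b0, b1 = -0.5, 2.0
x = np.linspace(-3, 3, 7)
p = np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))   # Pr(Y = 1 | X = x)
print(np.allclose(np.log(p / (1 - p)), b0 + b1 * x))  # True: log-odds recovered
```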


Footnotes

  1. (More normally-distributed $X$ ⟹ more likely to “beat” Logistic Regression)