Week 4: Discriminative vs. Generative Models

DSAN 5300: Statistical Learning
Spring 2026, Georgetown University

Jeff Jacobs

jj1088@georgetown.edu

Monday, February 2, 2026

Schedule

Today’s Planned Schedule:

Start End Topic
Lecture 6:30pm 6:45pm Logistic Regression Recap
6:45pm 7:10pm From Discriminative to Generative Models
7:10pm 8:00pm LDA, QDA, and Naive Bayes
Break! 8:00pm 8:10pm
8:10pm 8:30pm Quiz Review
8:30pm 9:00pm Quiz!

From Discriminative to Generative Models

Recap: Logistic Regression

  • What happens to the probability \(\Pr(Y = 1 \mid X)\) when \(X\) increases by 1 unit?
  • The “Linear Probability Model” \(\Pr(Y = 1 \mid X) = \beta_0 + \beta_1X\) fails (its predictions can leave the \([0, 1]\) range), but “fixing” it leads to

\[ \begin{align*} &\log\left[ \frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)} \right] = \beta_0 + \beta_1 X \\ \iff &\Pr(Y = 1 \mid X) = \frac{\exp[\beta_0 + \beta_1X]}{1 + \exp[\beta_0 + \beta_1X]} = \frac{1}{1 + \exp\left[ -(\beta_0 + \beta_1X) \right] } \end{align*} \]

  • A 1-unit increase in \(X\) is associated with a \(\beta_1\) increase in the log-odds \(\log\mkern-3mu\left[ \frac{\Pr(Y=1)}{\Pr(Y=0)}\right]\)
  • \(\Rightarrow\) the effect of a 1-unit increase in \(X\) on \(\Pr(Y = 1)\) depends on \(x\): small at low \(x\), largest in the middle, small again at high \(x\)
  • Then if we want a classification \(\widehat{y} \in \{0, 1\}\), we apply a threshold \(t \in [0,1]\):

\[ \widehat{y} = \begin{cases} 1 &\text{if }\Pr(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} > t \\ 0 &\text{otherwise} \end{cases} \]
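The thresholding rule above can be sketched in a few lines of Python (here \(\beta_0\), \(\beta_1\), and \(t\) are supplied by the caller, not fitted to any data):

```python
import numpy as np

def logistic_prob(x, beta0, beta1):
    # Pr(Y = 1 | X = x) = 1 / (1 + exp(-(beta0 + beta1 * x)))
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

def classify(x, beta0, beta1, t=0.5):
    # Predict 1 when the fitted probability exceeds the threshold t
    return int(logistic_prob(x, beta0, beta1) > t)
```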

What Does Logistic Regression Model?

Logistic Regression = Discriminative Model

  • We choose the \(\beta_0\) and \(\beta_1\) that best discriminate between classes \(Y = 0\) and \(Y = 1\)
  • In other words: we’re modeling the odds \(\frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 0 \mid x)}\) for given “vertical slices” \(x\)

  • …What if we instead modeled, for each \(Y\) value \(k\), what the distribution of the data \([X \mid Y = k]\) looks like?
  • Sometimes we can do better by learning \(\Pr(X \mid Y = k)\) for each \(k\), then using Bayes’ rule to “flip” to \(\Pr(Y = k \mid X)\) for prediction!

LDA, QDA, Naive Bayes = Generative Models

Where Did The Decision Boundary Come From?

  • May seem “obvious” for simple case (2 classes, 1 feature), but we need a method for deriving this boundary, for any number of classes/features!
  • Boundary point \(\frac{\mu_0 + \mu_1}{2}\) emerges out of LDA approach:
  • Compute discriminant functions \(\delta_k(x)\) (defined in a few slides; an increasing function of the probability that \(x\) is in class \(k\))
  • If \(\delta_i(x) > \delta_j(x)\), \(x\) more likely to be generated by class \(i\) than class \(j\)
  • If \(\delta_i(x) < \delta_j(x)\), \(x\) more likely to be generated by class \(j\) than class \(i\)
  • If \(\delta_i(x) = \delta_j(x)\), \(x\) equally likely to be generated by class \(i\) and class \(j\)
  • Class \(i\)-\(j\) boundary = set of \(x\) values satisfying \(\delta_i(x) = \delta_j(x)\)
  • On last slide: \(\delta_0(x) = \delta_1(x)\) when \(x = \frac{\mu_0 + \mu_1}{2}\)

Linear Discriminant Analysis (LDA)

  • Not to be confused with the NLP model called “LDA”!
  • In that case LDA = “Latent Dirichlet Allocation”

Bayes’ Rule

  • First things first, we generalize from \(Y \in \{0, 1\}\) to \(K\) possible classes (labels), since the notation for \(K\) classes here is not much more complex than 2 classes!
  • We label the pieces using ISLP’s notation to make our lives easier:

\[ \underbrace{\Pr(Y = k \mid X = x)}_{p_k(x)} = \frac{ \overbrace{\Pr(X = x \mid Y = k)}^{f_k(x)} \overbrace{\Pr(Y = k)}^{\pi_k} }{ \sum_{\ell = 1}^{K} \underbrace{\Pr(X = x \mid Y = \ell)}_{f_{\ell}(x)} \underbrace{\Pr(Y = \ell)}_{\pi_{\ell}} } = \frac{f_k(x) \overbrace{\pi_k}^{\mathclap{\text{Prior}(k)}}}{\sum_{\ell = 1}^{K}f_{\ell}(x) \underbrace{\pi_\ell}_{\mathclap{\text{Prior}(\ell)}}} \]

  • So if we do have only two classes, \(K = 2\) and \(p_1(x) = \frac{f_1(x)\pi_1}{f_1(x)\pi_1 + f_0(x)\pi_0}\)

  • Priors can be estimated as \(n_k / n\). The hard work is in modeling \(f_k(x)\)!

  • Estimates of these two “pieces” for each \(k\) \(\leadsto\) classifier \(\widehat{y}(x) = \argmax_k p_k(x)\)
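Bayes’ rule above translates directly to code. In this sketch the class-conditional densities \(f_k(x)\) and priors \(\pi_k\) are supplied as arrays; how they get estimated is the subject of the next slides:

```python
import numpy as np

def posteriors(f_x, priors):
    """Bayes' rule: p_k(x) = f_k(x) * pi_k / sum_l f_l(x) * pi_l.

    f_x    : class-conditional densities f_k(x), one entry per class
    priors : prior probabilities pi_k, one entry per class
    """
    unnorm = np.asarray(f_x) * np.asarray(priors)
    return unnorm / unnorm.sum()

def predict(f_x, priors):
    # Classifier y-hat(x) = argmax_k p_k(x)
    return int(np.argmax(posteriors(f_x, priors)))
```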

The LDA Assumption (One Feature \(x\))

  • Within each class \(k\), values of \(x\) are normally distributed:

\[ (X \mid Y = k) \sim \mathcal{N}(\mu_k, \sigma^2) \iff f_k(x) = \frac{1}{\sqrt{2 \pi}\sigma}\exp\left[-\frac{1}{2}\left( \frac{x - \mu_k}{\sigma} \right)^2\right] \]

  • Plugging back into (notationally-simplified) classifier, we get

\[ \widehat{y}(x) = \argmax_{k}\left[ \frac{ \pi_k \frac{1}{\sqrt{2 \pi}\sigma}\exp\left[-\frac{1}{2}\left( \frac{x - \mu_k}{\sigma} \right)^2\right] }{ \sum_{\ell = 1}^{K}\pi_{\ell} \frac{1}{\sqrt{2 \pi}\sigma}\exp\left[-\frac{1}{2}\left( \frac{x - \mu_\ell}{\sigma} \right)^2\right] }\right], \]

  • Gross 🤮 BUT \(\argmax_k p_k(x) = \argmax_k \log(p_k(x)) \leadsto\) “linear” discriminant \(\delta_k(x)\):

\[ \widehat{y}(x) = \argmax_k[\delta_k(x)] = \argmax_{k}\left[ \overbrace{\frac{\mu_k}{\sigma^2}}^{\smash{m}} x ~ \overbrace{- \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)}^{\smash{b}} \right] \]
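The linear discriminant above evaluates directly in code (a sketch with \(\mu_k\), \(\sigma^2\), and \(\pi_k\) assumed known; in practice they are estimated, as on the next slide):

```python
import numpy as np

def lda_discriminants(x, mus, sigma2, priors):
    # delta_k(x) = (mu_k / sigma^2) x - mu_k^2 / (2 sigma^2) + log(pi_k)
    mus = np.asarray(mus, dtype=float)
    priors = np.asarray(priors, dtype=float)
    return (mus / sigma2) * x - mus**2 / (2 * sigma2) + np.log(priors)

def lda_predict(x, mus, sigma2, priors):
    # y-hat(x) = argmax_k delta_k(x)
    return int(np.argmax(lda_discriminants(x, mus, sigma2, priors)))
```

With equal priors and \(\mu_0 = -1\), \(\mu_1 = 1\), the two discriminants tie exactly at the midpoint \(x = 0\).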

Estimating Decision Boundaries

  • For two classes, can solve \(\delta_0(x) = \delta_1(x)\) for \(x\) to obtain \(x = \frac{\mu_0 + \mu_1}{2}\)
  • …But in real world we don’t have \(\mu_0\) or \(\mu_1\) (population parameters)!
  • Instead we estimate boundary from data: \(x = \frac{\widehat{\mu}_0 + \widehat{\mu}_1}{2}\) \(\Rightarrow\) Predict \(1\) if \(x > \frac{\widehat{\mu}_0 + \widehat{\mu}_1}{2}\), \(0\) otherwise

ISLR Figure 4.4: Estimating the Decision Boundary from data. The dashed line is the “true” boundary \(x = \frac{\mu_0 + \mu_1}{2}\), while the solid line in the right panel is the boundary estimated from data as \(x = \frac{\widehat{\mu}_0 + \widehat{\mu}_1}{2}\).
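The estimated boundary can be sketched with simulated data (the class means, sample sizes, and test point here are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical samples from two Gaussian classes (illustration only)
rng = np.random.default_rng(0)
x0 = rng.normal(loc=-1.25, scale=1.0, size=20)  # class 0
x1 = rng.normal(loc=1.25, scale=1.0, size=20)   # class 1

# Estimated boundary: midpoint of the two sample means
boundary = (x0.mean() + x1.mean()) / 2

# Predict 1 for a new point if it falls above the boundary
prediction = int(2.0 > boundary)
```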

Number of Parameters

  • \(K = 2\) is special case, since lots of things cancel out, but in general need to estimate:
English Notation How Many? Formula
Prior for class \(k\) \(\widehat{\pi}_k\) \(K - 1\) \(\widehat{\pi}_k = n_k / n\)
Estimated mean for class \(k\) \(\widehat{\mu}_k\) \(K\) \(\widehat{\mu}_k = \displaystyle \frac{1}{n_k}\sum_{\{i \mid y_i = k\}}x_i\)
Estimated (shared) variance \(\widehat{\sigma}^2\) 1 \(\widehat{\sigma}^2 = \displaystyle \frac{1}{n - K}\sum_{k = 1}^{K}\sum_{i:y_i = k}(x_i - \widehat{\mu}_k)^2\)
Total: \(2K\)
  • (Keep in mind for fancier methods! This may blow up to be much larger than \(n\))
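The three estimators in the table can be sketched as one function (a minimal illustration of the \(2K\) parameter estimates, not a full LDA fit):

```python
import numpy as np

def lda_estimates(x, y):
    """Estimate the 2K LDA parameters: pi-hat_k, mu-hat_k, shared sigma-hat^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    classes = np.unique(y)
    n, K = len(x), len(classes)
    pis = np.array([np.mean(y == k) for k in classes])   # pi-hat_k = n_k / n
    mus = np.array([x[y == k].mean() for k in classes])  # mu-hat_k = class mean
    # Pooled variance, with the (n - K) denominator from the table
    ss = sum(((x[y == k] - x[y == k].mean()) ** 2).sum() for k in classes)
    sigma2 = ss / (n - K)
    return pis, mus, sigma2
```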

LDA with Multiple Features (Here \(p = 2\))

  • Within each class \(k\), values of \(\mathbf{x}\) are (multivariate) normally distributed:

\[ \left( \begin{bmatrix}X_1 \\ X_2\end{bmatrix} \middle| ~ Y = k \right) \sim \mathcal{N}_2(\boldsymbol\mu_k, \mathbf{\Sigma}) \]

  • Increasing \(p\) to 2 and \(K\) to 3 means more parameters, but still linear boundaries. It turns out: shared variance (\(\sigma^2\) or \(\mathbf{\Sigma}\)) will always produce linear boundaries 🤔
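The claim about shared variance can be seen already in the one-feature case by writing out the log of the numerator in Bayes’ rule:

\[ \log\left[ \pi_k f_k(x) \right] = \log \pi_k - \log\left( \sqrt{2\pi}\sigma \right) - \frac{x^2}{2\sigma^2} + \frac{\mu_k x}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} \]

The terms \(-\log(\sqrt{2\pi}\sigma)\) and \(-\frac{x^2}{2\sigma^2}\) are identical across classes (shared \(\sigma^2\)), so they cancel from every comparison \(\delta_i(x) = \delta_j(x)\), leaving an equation linear in \(x\). If each class gets its own \(\sigma_k^2\), the \(x^2\) term survives and the boundaries become quadratic (QDA).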

ISLR Figure 4.6: Like before, dashed lines are “true” boundaries while solid lines are boundaries estimated from data

Quadratic Class Boundaries

  • To achieve non-linear boundaries, estimate covariance matrix \(\mathbf{\Sigma}_k\) for each class \(k\):

\[ \left( \begin{bmatrix}X_1 \\ X_2\end{bmatrix} \middle| ~ Y = k \right) \sim \mathcal{N}_2(\boldsymbol\mu_k, \mathbf{\Sigma}_k) \]

  • Pros: Non-linear class boundaries! Cons: Many more parameters to estimate, and QDA does worse than LDA when the true boundary is linear (or nearly linear).
  • Deciding factor: do you think the DGP produces normal classes with the same variance?

ISLR Figure 4.9: Dashed purple line is “true” boundary (Bayes decision boundary), dotted black line is LDA boundary, solid green line is QDA boundary

So… What’s the Catch?

  • Why not just use QDA all the time?
  • (Hint: overfitting)

Explosion in # Parameters

  • For more than a few features, fitting a QDA model can become prohibitive: we have to estimate a separate, ever-bigger covariance matrix for each class

QDA

  • \(p\) predictors \(\Rightarrow \frac{p^2 + p}{2}\) unique covariance matrix entries

\[ \mathbf{\Sigma}^{\text{QDA}}_{p=2} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}, \mathbf{\Sigma}_{p=3} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{bmatrix}, \ldots \]

  • \(K\) classes \(\Rightarrow \frac{Kp^2 + Kp}{2}\) parameters to estimate

Naïve Bayes Assumption

  • Independent features \(\Rightarrow\) diagonal \(\mathbf{\Sigma}\): no need to estimate relationships between features!

\[ \mathbf{\Sigma}^{\text{NB}}_{p=2} = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}, \mathbf{\Sigma}_{p=3} = \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix}, \ldots \]

  • \(\leadsto\) Reduction from \(K\frac{p^2 + p}{2}\) to \(Kp\) covariance parameters!
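The two counts can be sketched as a quick sanity check:

```python
def qda_cov_params(p, K):
    # QDA: one covariance matrix per class, (p^2 + p) / 2 unique entries each
    return K * (p**2 + p) // 2

def nb_cov_params(p, K):
    # Naive Bayes: diagonal covariance per class, just p variances each
    return K * p
```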

…Naïve Bayes to the Rescue!

  • With the Naïve Bayes assumption, we can visualize each feature’s distribution by class:

[Figure: a \(2 \times 3\) grid of estimated density plots \(\widehat{f}_k(X_j)\), one row per class \(k \in \{1, 2\}\) and one column per feature \(X_1, X_2, X_3\)]

Classifying New Points

  • For a new point, compare how likely each feature value \(x_j\) is under \(k = 1\) vs. \(k = 2\)
  • Example: \(\mathbf{x} = (0.4, 1.5, 1)^{\top}\)
\(k = 1\): \(\widehat{f}_1(0.4) = 0.368\), \(\widehat{f}_1(1.5) = 0.484\), \(\widehat{f}_1(1) = 0.226\)
\(k = 2\): \(\widehat{f}_2(0.4) = 0.030\), \(\widehat{f}_2(1.5) = 0.130\), \(\widehat{f}_2(1) = 0.616\)
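Using the density values above (and assuming equal priors, which the slide does not specify), the Naïve Bayes comparison is just a per-class product of the per-feature densities:

```python
import numpy as np

# Per-feature density values from the slide, for x = (0.4, 1.5, 1)
f1 = [0.368, 0.484, 0.226]  # f-hat_1(x_j) under class k = 1
f2 = [0.030, 0.130, 0.616]  # f-hat_2(x_j) under class k = 2
priors = [0.5, 0.5]         # assumed equal priors (not given on the slide)

# Independence assumption => multiply the per-feature densities
scores = [priors[0] * np.prod(f1), priors[1] * np.prod(f2)]
predicted_class = int(np.argmax(scores)) + 1  # classes labeled 1 and 2
```

Class 1’s product dominates, so the new point is classified as \(k = 1\): its features “look more like” class 1’s data overall, even though \(x_3 = 1\) alone favors class 2.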

Which Class’s Data “Looks More Like” \(X\)?

Quiz Review
