Weeks 8-9: Shift in focus from regression to classification (though we still use lessons from regression!)
Last Week (W08): SVMs as a new method with this focus
Emphasis on the boundary between classes: 2.5 hours on separating hyperplanes, in the original feature space (Max-Margin Classifiers, SVCs) or in derived feature spaces (SVMs)
Now: Wait, didn’t we discuss a classification method before, though its name confusingly had “regression” in it? 🤔
Take logistic regression but use Bayes’ rule to “flip” from a regression[+thresholding] task to a class-separation task (think of SVM’s max-width-of-“slab” objective!)
Logistic Regression Refresher
We don’t have time for a full refresher, but just remember how it involves learning coefficient values $\beta_0, \beta_1$ to minimize loss with respect to the predicted probabilities $p(x) = \Pr(Y = 1 \mid X = x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
And then, if we want to classify rather than just predict, we apply a threshold $t$ (typically $t = 0.5$): $\widehat{y} = 1$ if $\widehat{p}(x) \geq t$, otherwise $\widehat{y} = 0$
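A minimal sketch of this fit-then-threshold workflow (assuming scikit-learn and a made-up 1-feature dataset; the names `X`, `y` are just illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1-feature, 2-class data (purely illustrative)
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(-1.25, 1, 50), rng.normal(1.25, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X, y)        # learns beta_0, beta_1 by minimizing log-loss
p_hat = model.predict_proba([[0.4]])[0, 1]    # predicted Pr(Y = 1 | X = 0.4)
y_hat = int(p_hat >= 0.5)                     # apply the 0.5 threshold to classify
print(f"p_hat = {p_hat:.3f}, predicted class = {y_hat}")
```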
Intuition
Logistic regression is called a discriminative model, since we are learning parameters that best produce a predicted class from features…
We’re modeling $\Pr(Y = k \mid X = x)$ (for two classes, $k = 0$ and $k = 1$), hence the LHS of the Logistic Regression formula
But there are cases where we can do better¹ by instead modeling (learning parameters for) $\Pr(X = x \mid Y = k)$, for each class $k$, then using Bayes’ rule to “flip” back to $\Pr(Y = k \mid X = x)$! This is the idea behind the LDA, QDA, and Naïve Bayes classifiers
Linear Discriminant Analysis (LDA)
Not to be confused with the NLP model called “LDA”!
In that case LDA = “Latent Dirichlet Allocation”
Bayes’ Rule
First things first, we generalize from 2 to $K$ possible classes (labels), since the notation for $K$ classes here is not much more complex than for 2 classes!
We label the pieces using ISLR’s notation to make our lives easier: $\Pr(Y = k \mid X = x) = p_k(x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$, where $\pi_k = \Pr(Y = k)$ is the prior for class $k$ and $f_k(x) = \Pr(X = x \mid Y = k)$ is the density of $X$ within class $k$
So if we do have only two classes, the pieces reduce to $\pi_1$ and $\pi_2 = 1 - \pi_1$, and $f_1(x)$ and $f_2(x)$
Priors can be estimated as $\widehat{\pi}_k = \dfrac{n_k}{n}$ (the proportion of training observations in class $k$). The hard work is in modeling $f_k(x)$! With estimates of these two “pieces” for each class $k$, we can derive a classifier
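A minimal sketch of the “easy” piece, estimating $\widehat{\pi}_k = n_k / n$ from an (assumed) array of training labels `y`:

```python
import numpy as np

y = np.array([0, 0, 1, 2, 1, 0, 2, 2, 2, 1])   # hypothetical labels for K = 3 classes

classes, counts = np.unique(y, return_counts=True)
pi_hat = counts / len(y)                        # pi_hat_k = n_k / n
for k, p in zip(classes, pi_hat):
    print(f"pi_hat_{k} = {p:.2f}")              # 0.30, 0.30, 0.40
```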
The LDA Assumption (One Feature, $p = 1$)
Within each class $k$, values of $X$ are normally distributed: $f_k(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\dfrac{1}{2\sigma^2}(x - \mu_k)^2\right)$, i.e. $X \mid Y = k \sim \mathcal{N}(\mu_k, \sigma^2)$, with the variance $\sigma^2$ assumed to be shared across all $K$ classes
Plugging back into the (notationally-simplified) classifier, we get $p_k(x) = \dfrac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}$
Gross, BUT taking logs and dropping terms that don’t depend on $k$ shows that picking the class with the largest $p_k(x)$ is the same as picking the class with the largest “linear” discriminant $\delta_k(x)$: $\delta_k(x) = x \cdot \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$, which is linear in $x$
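A minimal sketch of that discriminant for $p = 1$ (the $\widehat{\mu}_k$, $\widehat{\sigma}^2$, $\widehat{\pi}_k$ values below are made up, roughly echoing ISLR Figure 4.4):

```python
import numpy as np

def lda_discriminant(x, mu_k, sigma2, pi_k):
    """delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)"""
    return x * mu_k / sigma2 - mu_k**2 / (2 * sigma2) + np.log(pi_k)

# Hypothetical estimates for two classes with shared variance
mu_hat = {1: -1.25, 2: 1.25}
sigma2_hat = 1.0
pi_hat = {1: 0.5, 2: 0.5}

x_new = 0.4
deltas = {k: lda_discriminant(x_new, mu_hat[k], sigma2_hat, pi_hat[k]) for k in (1, 2)}
print(deltas, "-> predict class", max(deltas, key=deltas.get))
```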
Decision Boundaries
The boundary between two classes $k$ and $\ell$ will be the point at which $\delta_k(x) = \delta_\ell(x)$
For two classes with $\pi_1 = \pi_2$, we can solve for $x$ to obtain $x = \dfrac{\mu_1 + \mu_2}{2}$
To derive a boundary from data: predict class 1 if $\widehat{\delta}_1(x) > \widehat{\delta}_2(x)$, otherwise predict class 2; with equal estimated priors this reduces to comparing $x$ against the midpoint $\dfrac{\widehat{\mu}_1 + \widehat{\mu}_2}{2}$
ISLR Figure 4.4: Estimating the Decision Boundary from data. The dashed line is the “true” boundary $x = \dfrac{\mu_1 + \mu_2}{2} = 0$, while the solid line in the right panel is the boundary estimated from data as $x = \dfrac{\widehat{\mu}_1 + \widehat{\mu}_2}{2}$.
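A minimal sketch of estimating that midpoint boundary from simulated data (class means of $\pm 1.25$ are assumed, echoing the figure):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(-1.25, 1.0, size=20)    # simulated class-1 observations
x2 = rng.normal(1.25, 1.0, size=20)     # simulated class-2 observations

mu1_hat, mu2_hat = x1.mean(), x2.mean()
boundary_hat = (mu1_hat + mu2_hat) / 2  # estimated boundary (assuming equal priors)
print(f"estimated boundary: {boundary_hat:.3f}  (true boundary: 0)")
```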
Number of Parameters
$K = 2$ is a special case, since lots of things cancel out, but in general (still $p = 1$) we need to estimate:
| English | Notation | How Many | Formula |
|---|---|---|---|
| Prior for class $k$ | $\widehat{\pi}_k$ | $K$ | $\widehat{\pi}_k = \dfrac{n_k}{n}$ |
| Estimated mean for class $k$ | $\widehat{\mu}_k$ | $K$ | $\widehat{\mu}_k = \dfrac{1}{n_k}\displaystyle\sum_{i:\, y_i = k} x_i$ |
| Estimated (shared) variance | $\widehat{\sigma}^2$ | 1 | $\widehat{\sigma}^2 = \dfrac{1}{n - K}\displaystyle\sum_{k=1}^{K}\sum_{i:\, y_i = k} (x_i - \widehat{\mu}_k)^2$ |

Total: $2K + 1$
(Keep in mind for fancier methods! This parameter count may blow up to be much larger than $n$)
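For instance, a quick (made-up) sanity check of the count above with $K = 3$ classes and still $p = 1$:

$$\underbrace{3}_{\text{priors } \widehat{\pi}_k} + \underbrace{3}_{\text{means } \widehat{\mu}_k} + \underbrace{1}_{\text{shared } \widehat{\sigma}^2} \;=\; 7 \;=\; 2K + 1 \text{ parameters}$$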
LDA with Multiple Features (Here $p \geq 2$)
Within each class $k$, values of $X = (X_1, \ldots, X_p)$ are (multivariate) normally distributed: $f_k(x) = \dfrac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\dfrac{1}{2}(x - \mu_k)^\top \boldsymbol{\Sigma}^{-1} (x - \mu_k)\right)$, i.e. $X \mid Y = k \sim \mathcal{N}(\mu_k, \boldsymbol{\Sigma})$, with a shared covariance matrix $\boldsymbol{\Sigma}$
Increasing $p$ to 2 and $K$ to 3 means more parameters, but still linear boundaries. It turns out: shared variance ($\sigma^2$ or $\boldsymbol{\Sigma}$) will always produce linear boundaries 🤔
ISLR Figure 4.6: Like before, dashed lines are “true” boundaries while solid lines are boundaries estimated from data
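A minimal sketch using scikit-learn’s `LinearDiscriminantAnalysis` on made-up $p = 2$, $K = 3$ data (class means and counts are arbitrary):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
means = [(-2, -2), (0, 2), (2, -1)]            # hypothetical class means (p = 2, K = 3)
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in means])
y = np.repeat([0, 1, 2], 50)

lda = LinearDiscriminantAnalysis().fit(X, y)   # one shared covariance => linear boundaries
x_new = [[0.0, 0.0]]
print(lda.predict(x_new))                      # predicted class for a new point
print(lda.predict_proba(x_new).round(3))       # estimated posterior Pr(Y = k | X = x)
```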
Quadratic Class Boundaries
To achieve non-linear boundaries, estimate a separate covariance matrix $\boldsymbol{\Sigma}_k$ for each class $k$: $X \mid Y = k \sim \mathcal{N}(\mu_k, \boldsymbol{\Sigma}_k)$, so that the discriminant $\delta_k(x) = -\dfrac{1}{2}(x - \mu_k)^\top \boldsymbol{\Sigma}_k^{-1} (x - \mu_k) - \dfrac{1}{2}\log|\boldsymbol{\Sigma}_k| + \log(\pi_k)$ is quadratic in $x$
Pros: Non-linear class boundaries! Cons: Many more parameters to estimate, and QDA tends to do worse than LDA if the true boundary is linear (or nearly linear).
Deciding factor: do you think the DGP (data-generating process) produces normal classes with the same variance?
ISLR Figure 4.9: Dashed purple line is “true” boundary (Bayes decision boundary), dotted black line is LDA boundary, solid green line is QDA boundary
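A minimal sketch contrasting LDA and QDA fits on the same (made-up) data with scikit-learn; with a per-class covariance the QDA boundary can curve:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(4)
# Two classes with *different* covariance structure (illustrative only)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=100)
X2 = rng.multivariate_normal([2, 2], [[2.0, 0.8], [0.8, 0.5]], size=100)
X = np.vstack([X1, X2])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared Sigma      -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # Sigma_k per class -> quadratic boundary
print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```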
Key Advantage of Generative Model
You get an actual “picture” of what the data looks like!
Classifying New Points
For new feature values $x^*$, compare how likely this value is under $\widehat{f}_1(x^*)$ vs. $\widehat{f}_2(x^*)$, weighting by the priors: predict the class $k$ with the largest $\widehat{\pi}_k \widehat{f}_k(x^*)$
Example:
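A minimal sketch of this comparison for a single feature, using `scipy.stats.norm` and made-up estimates $\widehat{\mu}_k$, $\widehat{\sigma}$, $\widehat{\pi}_k$:

```python
from scipy.stats import norm

mu_hat = {1: -1.25, 2: 1.25}   # hypothetical per-class mean estimates
sigma_hat = 1.0                # hypothetical shared standard deviation
pi_hat = {1: 0.5, 2: 0.5}      # hypothetical priors

x_star = 0.8                   # new observation to classify
scores = {k: pi_hat[k] * norm.pdf(x_star, loc=mu_hat[k], scale=sigma_hat)
          for k in (1, 2)}     # pi_hat_k * f_hat_k(x*), proportional to the posterior
print(scores, "-> predict class", max(scores, key=scores.get))
```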
Quiz Time!
Appendix: Fuller Logistic Derivation
References
Footnotes
(The more normally-distributed the features are within each class, the more likely these generative methods are to “beat” Logistic Regression)↩︎