DSAN5000: Data Science and Analytics
(Addendum to Week 07)
2025-01-22
Supervised Learning: You want the computer to learn the existing pattern of how you are classifying observations
Unsupervised Learning: You want the computer to find patterns in a dataset, without any prior classification info
Supervised Learning: Dataset has both explanatory variables (“features”) and response variables (“labels”)
home_id | sqft | bedrooms | rating |
---|---|---|---|
0 | 1000 | 1 | Disliked |
1 | 2000 | 2 | Liked |
2 | 2500 | 1 | Liked |
3 | 1500 | 2 | Disliked |
4 | 2200 | 1 | Liked |
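To make the supervised setup concrete, here is a minimal scikit-learn sketch on the table above; the classifier choice and the query row are illustrative assumptions, not a prescribed model:

```python
# Supervised learning on the table above: features (sqft, bedrooms) plus
# labels (rating). The DecisionTreeClassifier choice is illustrative only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

homes = pd.DataFrame({
    "sqft":     [1000, 2000, 2500, 1500, 2200],
    "bedrooms": [1, 2, 1, 2, 1],
    "rating":   ["Disliked", "Liked", "Liked", "Disliked", "Liked"],
})

X = homes[["sqft", "bedrooms"]]   # explanatory variables ("features")
y = homes["rating"]               # response variable ("labels")

model = DecisionTreeClassifier(random_state=0).fit(X, y)
# Predict the rating for a new, unseen home (hypothetical query)
print(model.predict(pd.DataFrame({"sqft": [1800], "bedrooms": [2]})))
```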
Unsupervised Learning: Dataset has only explanatory variables (“features”)
home_id | sqft | bedrooms |
---|---|---|
0 | 1000 | 1 |
1 | 2000 | 2 |
2 | 2500 | 1 |
3 | 1500 | 2 |
4 | 2200 | 1 |
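For contrast, a minimal unsupervised sketch on the same features: with no `rating` column, a clustering algorithm (k-means with k=2 here, an assumed choice) can only group similar homes together.

```python
# Unsupervised learning on the same features: no labels, so the algorithm
# can only group similar rows together (k=2 clusters is an assumed choice).
import pandas as pd
from sklearn.cluster import KMeans

homes = pd.DataFrame({
    "sqft":     [1000, 2000, 2500, 1500, 2200],
    "bedrooms": [1, 2, 1, 2, 1],
})

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(homes)
print(labels)  # one cluster id per home; "Liked"/"Disliked" never enters the picture
```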
So… what’s wrong here?
How well does our model represent the world? \(\mathsf{Correspondence}(y_{obs}, \theta)\)
\(P\left(y_{obs}, \theta\right)\), \(P\left(\theta \; | \; y_{obs}\right)\), \(P\left(y_{obs} \; | \; \theta\right)\)
Maximum Likelihood Estimation?
\[ \begin{align*} \mathsf{Correspondence}(y_{obs}, \theta) &\equiv P(y = y_{obs}, \theta) \\ P(y = y_{obs}, \theta) &= P(y=y_{obs} \; | \; \theta)P(\theta) \\ &\propto P\left(y = y_{obs} \; | \; \theta\right)\ldots \implies \text{(maximize this!)} \\ \end{align*} \]
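A toy numeric illustration of the idea (the Bernoulli coin-flip data below is an assumed example, not from the slides): sweep over candidate \(\theta\) values and keep the one that maximizes \(P(y_{obs} \; | \; \theta)\); the maximizer is the sample proportion.

```python
# Toy MLE sketch (assumed Bernoulli coin-flip example): evaluate the
# log-likelihood of the observed data over a grid of candidate thetas
# and keep the maximizer.
import numpy as np

y_obs = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # assumed observations
thetas = np.linspace(0.01, 0.99, 99)

heads = y_obs.sum()
tails = len(y_obs) - heads
log_lik = heads * np.log(thetas) + tails * np.log(1 - thetas)

theta_mle = thetas[np.argmax(log_lik)]
print(theta_mle, y_obs.mean())   # maximizer = sample proportion (6/8 = 0.75)
```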
\[ \begin{align*} \mathsf{Precision} &= \frac{\# \text{true positives}}{\# \text{predicted positive}} = \frac{tp}{tp+fp} \\[1.5em] \mathsf{Recall} &= \frac{\# \text{true positives}}{\# \text{positives in data}} = \frac{tp}{tp+fn} \\[1.5em] F_1 &= 2\frac{\mathsf{Precision} \cdot \mathsf{Recall}}{\mathsf{Precision} + \mathsf{Recall}} = \mathsf{HMean}(\mathsf{Precision}, \mathsf{Recall}) \end{align*} \]
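These three metrics are just arithmetic on the confusion counts; a quick sketch (the counts below are made up for illustration):

```python
# Precision, recall, and F1 straight from the confusion counts.
# (tp, fp, fn are made-up numbers for illustration.)
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)        # true positives / predicted positive
recall    = tp / (tp + fn)        # true positives / positives in data
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(precision, recall, round(f1, 3))   # 0.8, 0.666..., 0.727
```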
\[ \frac{\partial F_1(weights)}{\partial weights} = \ldots \; ? \; \ldots 💀 \]
\[ \mathcal{L}_{CE}(y_{pred}, y_{obs}) = -(y_{obs}\log(y_{pred}) + (1-y_{obs})\log(1-y_{pred})) \]
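A direct NumPy translation of this loss (the small clipping constant is an added numerical safeguard, not part of the formula):

```python
# Binary cross-entropy, as in the formula above; the eps clip keeps
# log() away from exactly 0 or 1 (the 1e-12 value is an assumption).
import numpy as np

def cross_entropy(y_pred, y_obs, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_obs * np.log(y_pred) + (1 - y_obs) * np.log(1 - y_pred))

print(cross_entropy(np.array([0.9, 0.1]), np.array([1, 0])))   # confident & right: small loss
print(cross_entropy(np.array([0.1, 0.9]), np.array([1, 0])))   # confident & wrong: large loss
```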
Once we’ve chosen a loss function, the learning algorithm has what it needs to proceed with the actual learning
Notation: Bundle all the model’s parameters together into \(\theta\)
The goal: \[ \min_{\theta} \mathcal{L}(y_{obs}, y_{pred}(\theta)) \]
What would this look like for the random-lines approach?
Is there a more efficient way?
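One standard answer is gradient descent: instead of guessing lines at random, step \(\theta\) against \(\partial \mathcal{L} / \partial \theta\). A minimal sketch under assumed synthetic data, a one-feature linear model, and a squared-error loss:

```python
# A minimal gradient-descent sketch (assumed setup): fit y = theta0 + theta1*x
# with a squared-error loss by repeatedly stepping theta along -gradient.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y_obs = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=50)   # synthetic data

theta = np.zeros(2)     # [intercept, slope], bundled as on the slide
lr = 0.01               # learning rate (assumed)

for _ in range(5000):
    y_pred = theta[0] + theta[1] * x
    err = y_pred - y_obs
    grad = np.array([err.mean(), (err * x).mean()])   # d(loss)/d(theta)
    theta -= lr * grad                                # step downhill

print(theta)   # ends up close to the true (3.0, 2.0)
```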
\[ f \in C([a,b],[a,b]) \implies \forall \epsilon > 0, \; \exists p \in \mathbb{R}[x] \; : \; \forall x \in [a, b] \; \left|f(x) - p(x)\right| < \epsilon \]
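This is (a form of) the Weierstrass approximation theorem: any continuous function on \([a,b]\) can be approximated to arbitrary precision by a polynomial. A quick numerical sketch (the target function and degrees are arbitrary illustrative choices):

```python
# Weierstrass in action: least-squares polynomial fits to a continuous
# function; the worst-case error shrinks as the degree grows.
import numpy as np

x = np.linspace(0, 1, 500)
f = np.sin(2 * np.pi * x)                     # a continuous function on [0, 1]

for degree in (1, 3, 5, 9):
    p = np.polynomial.Polynomial.fit(x, f, degree)
    print(degree, np.max(np.abs(f - p(x))))   # sup-norm error decreases
```

The flip side: a model family flexible enough to approximate anything is also flexible enough to chase noise, which is what the complexity penalty below guards against.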
Original optimization: \[ \theta^* = \underset{\theta}{\operatorname{argmin}} \mathcal{L}(y_{obs}, y_{pred}(\theta)) \]
New optimization: \[ \theta^* = \underset{\theta}{\operatorname{argmin}} \left[ \mathcal{L}(y_{obs}, y_{pred}(\theta)) + \mathsf{Complexity}(\theta) \right] \]
\[ \mathsf{Complexity}(y_{pred} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3) > \mathsf{Complexity}(y_{pred} = \beta_0 + \beta_1 x) \]
\[ \mathsf{Complexity} \propto \frac{|\text{AmplifiedFeatures}|}{|\text{ShrunkFeatures}|} \]
\[ \beta^*_{LASSO} = {\underset{\beta}{\operatorname{argmin}}}\left\{{\frac {1}{N}}\left\|y-X\beta \right\|_{2}^{2}+\lambda \|\beta \|_{1}\right\} \]
\[ \beta^*_{EN} = {\underset {\beta }{\operatorname {argmin} }}\left\{ \|y-X\beta \|_{2}^{2}+\lambda _{2}\|\beta \|_{2}^{2}+\lambda _{1}\|\beta \|_{1} \right\} \]
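Both penalized problems are available directly in scikit-learn; a short sketch on synthetic sparse data (the data and the `alpha` / `l1_ratio` values are illustrative assumptions):

```python
# LASSO and Elastic Net, matching the two argmin problems above.
# (Synthetic data; alpha and l1_ratio are assumed for illustration.)
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # mostly-zero coefficients
y = X @ beta_true + rng.normal(0, 0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                      # L1 penalty only
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2

print(np.round(lasso.coef_, 2))   # irrelevant features shrunk to (or near) 0
print(np.round(enet.coef_, 2))
```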
DSAN5000 Extra Slides: Machine Learning