Week 6: Multivariate Distributions

DSAN 5100: Probabilistic Modeling and Statistical Computing
Section 01

Purna Gamage, Amineh Zadbood, and Jeff Jacobs

Tuesday, September 30, 2025

Distributions in Continuous World

  • Discrete = “Easy mode”: Based (intuitively) on sets
  • \(\Pr(A)\): Four equally-likely marbles \(\{A, B, C, D\}\) in box, what is probability I pull out \(A\)?

\[ \Pr(A) = \underset{\mathclap{\small \text{Probability }\textbf{mass}}}{\boxed{\frac{|\{A\}|}{|\Omega|}}} = \frac{1}{|\{A,B,C,D\}|} = \frac{1}{4} \]

  • Continuous = “Hard mode”: Based (intuitively) on areas
  • \(\Pr(A)\): Throw dart at random point in square, what is probability I hit \(\require{enclose}\enclose{circle}{\textsf{A}}\)?

\[ \Pr(A) = \underset{\mathclap{\small \text{Probability }\textbf{density}}}{\boxed{\frac{\text{Area}(\{A\})}{\text{Area}(\Omega)}}} = \frac{\pi r^2}{s^2} = \frac{\pi \left(\frac{1}{4}\right)^2}{4} = \frac{\pi}{64} \]
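A quick way to build trust in the area-based definition is a Monte Carlo sketch in Python. The geometry below is an assumption (the slide only fixes the areas): a square \([0,2] \times [0,2]\), so \(s = 2\), and a circle of radius \(r = \tfrac{1}{4}\) centered at \((1,1)\).

```python
# Monte Carlo sanity check of Pr(A) = Area(A) / Area(Omega) for the dart example.
# Assumed geometry (not stated on the slide): square [0,2] x [0,2], circle of
# radius 1/4 centered at (1,1); only the two areas matter for the probability.
import numpy as np

rng = np.random.default_rng(5100)
n = 1_000_000
x = rng.uniform(0, 2, size=n)  # uniformly random dart throws over the square
y = rng.uniform(0, 2, size=n)
hit_A = (x - 1) ** 2 + (y - 1) ** 2 < 0.25 ** 2  # did the dart land inside A?

print(hit_A.mean())   # simulated Pr(A), ~0.049
print(np.pi / 64)     # exact answer from the slide, ~0.0491
```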

Moving to Continuous World

  • Intuitions from discrete world do translate into good intuitions for continuous world, in this case!
  • Can “move” discrete table into continuous space like Riemann sums “move” discrete sums into integrals:

“Smoothing” Our Example

  • Instead of discrete \(G\) with \(\mathcal{R}_G = \{10, 11, 12\}\), we have continuous \(G\) with \(\mathcal{R}_G = [10,12] \subset \mathbb{R}\) (HS progress)
  • Instead of discrete \(H\) with \(\mathcal{R}_H = \{0, 1\}\), now we keep track of continuous “honors spectrum” \(H\) with \(\mathcal{R}_H = [0, 1] \subset \mathbb{R}\)
  • Student near beginning of 10th grade who is at “high end” of honors spectrum: \(G = 10.03\) and \(H = 0.95\)

“Smoothing” Our Tables

| | Discrete World | Continuous World |
|---|---|---|
| How do we get probabilities? | Sums: \(\Pr(X = 1) + \Pr(X = 2) = \frac{2}{6}\) | Integrals: \(\displaystyle\int_{x=20}^{x=21}f_X(x)\,\mathrm{d}x = 0.4\) |
| What makes a valid distribution? | \(\displaystyle\sum_{x \in \mathcal{R}_X}\Pr(X = x) = 1\) | \(\displaystyle\int_{x \in \mathcal{R}_X}f_X(x)\,\mathrm{d}x = 1\) |
| How do we model uncertainty about multiple variables? | Frequency table | Joint pdf |

Continuous Joint pdfs

  • The volume under this Hershey-Kiss-shaped joint pdf \(f_{X,Y}\) is exactly \(1\)
  • Integrating over a region \(C\) gives us

\[ \frac{\text{Volume above } C}{\text{Volume}(\text{Hershey Kiss})} = \Pr((X,Y) \in C) \]

Figure 3.11 from DeGroot and Schervish (2013)

Conditional pdfs

  • Now, if we learn that \(Y = y_0\), can take “slice” at \(y = y_0\)
  • Total area of this slice is not \(1\), so \(f_{X,Y}(x, y_0)\) is not a valid pdf
  • Dividing by the total area of the slice would generate a valid pdf. What is this area? \(f_Y(y_0)\), the marginal pdf of \(Y\) evaluated at \(y_0\)
  • \(\implies\) \(f_{X \mid Y = y_0}(x \mid y_0) = \frac{f_{X,Y}(x, y_0)}{f_Y(y_0)}\) is a valid (conditional) pdf

Figure 3.20 from DeGroot and Schervish (2013)

Working Backwards in Continuous World

  • While in discrete world we could easily provide a table, in continuous world we often/usually have to work backwards; we may just be given:
  • \(G \sim \mathcal{U}(10, 12)\)
  • \(H \sim \ddot{\mathcal{N}}(\mu = 0.5, \sigma = 0.1, a = 0, b = 1)\), and
  • \(G \perp H\) (so \(f_{G \mid H} = f_G\) and \(f_{H \mid G} = f_H\))
  • (i.e., marginal distributions = conditional distributions)

The Marginal pdfs of \(G\) and \(H\)

  • Since we know \(G \sim \mathcal{U}(10,12)\), we know (or we could look up) that \(G\) has pdf

\[ f_G(g) = \frac{1}{12 - 10} = \frac{1}{2} \quad \text{for } g \in [10, 12] \]

  • \(H\) has a slightly fancier distribution: truncated normal
  • Nonetheless a pdf we can derive from (a) knowing the pdf of the normal distribution and (b) knowing what we’ve just talked about regarding conditional distributions

The Truncated Normal Distribution

  • \(\ddot{\mathcal{N}}\) may look scary, but \(X \sim \ddot{\mathcal{N}}(\mu, \sigma, a, b)\) just means \(X\) can be constructed as:

\[ X \sim \mathcal{N}(\mu, \sigma) \leadsto [X \mid a < X < b] \sim \ddot{\mathcal{N}}(\mu, \sigma, a, b) \]

Adapted from simstudy package documentation

The Truncated Normal pdf

  • Since we see a conditioning bar on the previous slide, we can infer what the pdf of this conditional distribution would look like. If \(X \sim \ddot{\mathcal{N}}(\mu, \sigma, a, b)\), then \(X\) has pdf

\[ f_X(x) = \frac{1}{\sigma}\cdot \frac{\varphi\left(\frac{x - \mu}{\sigma}\right)}{\Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right)} \approx \frac{\Pr(X \approx x, a < X < b)}{\Pr(a < X < b)} \]

  • \(\varphi\) is the pdf of \(\mathcal{N}(0,1)\)
  • \(\Phi\) is the CDF of \(\mathcal{N}(0,1)\)
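A minimal numeric check of this formula, assuming SciPy is available (`scipy.stats.truncnorm` parameterizes its bounds in standardized units \((a-\mu)/\sigma\) and \((b-\mu)/\sigma\)):

```python
# Compare the "by hand" truncated-normal pdf above with scipy.stats.truncnorm,
# using the slide's parameters mu = 0.5, sigma = 0.1, a = 0, b = 1.
import numpy as np
from scipy.stats import norm, truncnorm

mu, sigma, a, b = 0.5, 0.1, 0.0, 1.0
x = np.linspace(0.2, 0.8, 7)

by_hand = (norm.pdf((x - mu) / sigma) / sigma) / (
    norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
)
via_scipy = truncnorm.pdf(x, (a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)

print(np.allclose(by_hand, via_scipy))  # True
```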

Back to Working Backwards

  • By definition of independence, we can obtain joint pdf \(f_{G,H}(g, h)\) by multiplying marginal pdf \(f_G(g)\) and marginal pdf \(f_H(h)\):

\[ \begin{align*} f_{G,H}(g, h) &= f_G(g) \cdot f_H(h) \\ &= \frac{1}{2} \cdot \left[ \frac{1}{0.1} \cdot \frac { \varphi\left(\frac{h - 0.5}{0.1}\right) } { \Phi\left(\frac{1-0.5}{0.1}\right) - \Phi\left(\frac{0 - 0.5}{0.1}\right) } \right] \end{align*} \]

Moving Forwards Again

  • We’ve arrived at the case we had in discrete world, where we know the joint distribution!
  • We can integrate wherever we took sums in the discrete case to obtain marginal pdfs:

\[ \begin{align*} f_G(g) &= \int_{0}^{1}f_{G,H}(g,h)\mathrm{d}h = \frac{1}{2}, \\ f_H(h) &= \int_{10}^{12}f_{G,H}(g, h)\mathrm{d}g = \frac{1}{0.1} \cdot \frac { \varphi\left(\frac{h - 0.5}{0.1}\right) } { \Phi\left(\frac{1 - 0.5}{0.1}\right) - \Phi\left(\frac{0 - 0.5}{0.1}\right) } \end{align*} \]
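To see this numerically, here is a sketch (SciPy assumed) that builds the joint pdf from the slide's parameters and recovers both marginals by one-dimensional numerical integration:

```python
# Recover the marginal pdfs by numerically integrating the joint pdf
# f_{G,H}(g, h) = f_G(g) * f_H(h), with G ~ U(10, 12) and H truncated normal on [0, 1].
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma, a, b = 0.5, 0.1, 0.0, 1.0
Z = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)  # truncation constant

def f_H(h):            # marginal pdf of H (truncated normal)
    return (norm.pdf((h - mu) / sigma) / sigma) / Z

def f_GH(g, h):        # joint pdf on [10, 12] x [0, 1], using independence
    return 0.5 * f_H(h)

g0, h0 = 11.0, 0.6
print(quad(lambda h: f_GH(g0, h), 0, 1)[0])    # ~0.5, i.e. f_G(g0)
print(quad(lambda g: f_GH(g, h0), 10, 12)[0])  # ~f_H(h0)
print(f_H(h0))                                 # direct evaluation, for comparison
```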

Conditional Distributions in Continuous World

  • We can compute conditional pdfs by renormalizing the joint pdf
  • The denominator is no longer the integral of the distribution over all possible values (which is just the number \(1\)), but the relevant marginal pdf value, so the conditional pdf is a ratio of joint to marginal values like:

\[ f_{H \mid G}(h | g) = \frac{f_{G,H}(g, h)}{f_G(g)}. \]

  • At first this is magic… we can use the exact same equation from discrete world?!?
  • But then you remember, continuous world was set up as a “hack”, using area under the curve between two points rather than height of the curve at a point to “port” discrete formulas into continuous world
  • Think of how, when mathematicians hit the “wall” of solving \(x^2 + 1 = 0\)… They set up a new number \(i\) to work with these polynomials using “standard” algebra. Here we set up continuous pdf \(f_X(x)\) to work with continuous RVs using “standard” probability theory!
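As a quick numeric illustration of the renormalizing idea (same setup as the earlier sketch, SciPy assumed): take a slice of the joint pdf at a fixed \(g\), divide by \(f_G(g)\), and check that the result integrates to \(1\) over \(h\):

```python
# Renormalizing a slice of the joint pdf produces a valid conditional pdf.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 0.5, 0.1
Z = norm.cdf((1 - mu) / sigma) - norm.cdf((0 - mu) / sigma)
f_H  = lambda h: (norm.pdf((h - mu) / sigma) / sigma) / Z  # marginal of H
f_GH = lambda g, h: 0.5 * f_H(h)                           # joint (independence)
f_G  = lambda g: 0.5                                       # marginal of G on [10, 12]

g0 = 11.0
f_H_given_G = lambda h: f_GH(g0, h) / f_G(g0)  # the "renormalized slice"
print(quad(f_H_given_G, 0, 1)[0])              # ~1.0, so it is a valid pdf
```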

You Don’t Have To Stare Helplessly At Scary Math!

  • Try to link the continuous equations back to their simpler discrete forms
  • Work with the discrete forms to develop intuition, then
  • Convert sums back into integrals once you’re ready

A Concrete Strategy

  • Start by discretizing (“binning”) the possible values of a continuous RV to obtain a discrete RV:
    • Split \([10,12]\) into three equal-length bins, \([0,1]\) into two equal-length bins
    • Simulate \((G, H)\) pairs, sort into bins, plot joint / marginal / conditional distributions of binned data
  • As you make bins thinner and thinner… 🤔
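A sketch of this strategy (using NumPy/SciPy, with the slide's assumed distributions for \(G\) and \(H\)):

```python
# Simulate (G, H) pairs, bin them, and tabulate binned joint / marginal / conditional
# distributions: G ~ U(10, 12) in three bins, H ~ truncated normal on [0, 1] in two bins.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(5100)
n = 100_000
g = rng.uniform(10, 12, size=n)
h = truncnorm.rvs((0 - 0.5) / 0.1, (1 - 0.5) / 0.1,
                  loc=0.5, scale=0.1, size=n, random_state=rng)

counts, _, _ = np.histogram2d(g, h, bins=[3, 2], range=[[10, 12], [0, 1]])
joint = counts / n                   # binned joint distribution (entries sum to ~1)

print(joint)
print(joint.sum(axis=1))             # binned marginal of G: ~[1/3, 1/3, 1/3]
print(joint[0] / joint[0].sum())     # binned conditional of H, given G in the first bin
```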

Thinking Through Specific Multivariate Distributions

  • From \(2\) to \(N\) variables!

The Multinoulli Distribution

  • May seem like a weird/contrived distribution, but is perfect for building intuition, as your first \(N\)-dimensional distribution (\(N > 2\))

  • \(\mathbf{X}\) is a six-dimensional Vector-Valued RV, so that

    \[ \mathbf{X} = (X_1, X_2, X_3, X_4, X_5, X_6), \]

    where \(\mathcal{R}_{X_1} = \{0, 1\}, \mathcal{R}_{X_2} = \{0, 1\}, \ldots, \mathcal{R}_{X_6} = \{0, 1\}\)

  • But, \(X_1, X_2, \ldots, X_6\) are not independent! In fact, they are so dependent that if one has value \(1\), the rest must have value \(0\) \(\leadsto\) we can infer the support of \(\mathbf{X}\):

    \[ \begin{align*} \mathcal{R}_{\mathbf{X}} = \{ &(1,0,0,0,0,0),(0,1,0,0,0,0),(0,0,1,0,0,0), \\ &(0,0,0,1,0,0),(0,0,0,0,1,0),(0,0,0,0,0,1)\} \end{align*} \]

  • Lastly, need to define the probability that \(\mathbf{X}\) takes on any of these values. Let’s say \(\Pr(\mathbf{X} = \mathbf{v}) = \frac{1}{6}\) for all \(\mathbf{v} \in \mathcal{R}_{\mathbf{X}}\). Do we see the structure behind this contrived case?

  • (For math major friends, there is an isomorphism afoot… For the rest, it’s an extremely inefficient way to model outcomes from rolling a fair die)
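In code, the Multinoulli is exactly one draw from a multinomial distribution with \(n = 1\): each sample is a one-hot vector from \(\mathcal{R}_{\mathbf{X}}\). A minimal NumPy sketch:

```python
# Draw from the (fair) Multinoulli distribution: each sample is a one-hot vector.
import numpy as np

rng = np.random.default_rng(5100)
samples = rng.multinomial(1, [1/6] * 6, size=10)  # ten draws of the vector-valued RV X

print(samples)               # each row is one of the six vectors in R_X
print(samples.sum(axis=1))   # always 1: exactly one component equals 1 per draw
```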

The Multivariate Normal Distribution

  • We’ve already seen the matrix notation for writing the parameters of this distribution: \(\mathbf{X}_{[k \times 1]} \sim \mathcal{N}_k\left(\boldsymbol\mu_{[k \times 1]}, \Sigma_{[k \times k]}\right)\)
  • Now we get to crack open the matrix notation for writing its pdf:

\[ f_\mathbf{X}(\mathbf{x}_{[k \times 1]}) = \underbrace{\left(\frac{1}{\sqrt{2\pi}}\right)^k \frac{1}{\sqrt{\det(\Sigma)}}}_{\text{Normalizing constants}} \exp\left(-\frac{1}{2}\underbrace{(\mathbf{x} - \boldsymbol\mu)^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol\mu)}_{\text{Quadratic form}}\right) \]

  • Try to squint your eyes while looking at the above and compare with…

pdf of \(\mathcal{N}(\mu,\sigma)\)

\[ f_X(v) = \frac{1}{\sigma\sqrt{2\pi}}\bigexp{-\frac{1}{2}\left(\frac{v - \mu}{\sigma}\right)^2} \]

Structure of \(\Sigma\)

\[ \begin{align*} \mathbf{\Sigma} &= \begin{bmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_2\sigma_1 & \sigma_2^2\end{bmatrix} \\[0.1em] \implies \det(\Sigma) &= \sigma_1^2\sigma_2^2 - \rho^2\sigma_1^2\sigma_2^2 \\ &= \sigma_1^2\sigma_2^2(1-\rho^2) \end{align*} \]

Quadratic Forms

  • Quadratic forms will seem scary until someone forces you to write out the matrix multiplication!
  • Start with the 1D case: \(\mathbf{x} = [x_1]\), \(\boldsymbol\mu = [\mu_1]\), \(\Sigma = [\sigma^2]\). Then

\[ (\mathbf{x} - \boldsymbol\mu)^\top \Sigma^{-1} (\mathbf{x - \boldsymbol\mu}) = (x_1 - \mu_1)\frac{1}{\sigma^2}(x_1 - \mu_1) = \left(\frac{x_1 - \mu_1}{\sigma}\right)^2 ~ 🤯 \]

The 2D Case

  • Let \(\mathbf{x} = \left[\begin{smallmatrix}x_1 \\ x_2\end{smallmatrix}\right]\), \(\boldsymbol\mu = \left[ \begin{smallmatrix}\mu_1 \\ \mu_2 \end{smallmatrix}\right]\), \(\Sigma\) as in previous slide. Then \(\mathbf{x} - \boldsymbol\mu = \left[ \begin{smallmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{smallmatrix} \right]\)
  • Using what we know about \(2 \times 2\) matrix inversion,

\[ \Sigma^{-1} = \frac{1}{\det(\Sigma)}\left[ \begin{smallmatrix} \sigma_2^2 & -\rho \sigma_2\sigma_1 \\ -\rho \sigma_1\sigma_2 & \sigma_1^2\end{smallmatrix} \right] = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)}\left[ \begin{smallmatrix} \sigma_2^2 & -\rho \sigma_2\sigma_1 \\ -\rho \sigma_1\sigma_2 & \sigma_1^2\end{smallmatrix} \right] \]

  • With \(\widetilde{\rho} \equiv 1 - \rho^2\), we can write everything as a bunch of matrix multiplications:

\[ \begin{align*} &(\mathbf{x} - \boldsymbol\mu)^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol\mu) = \frac{1}{\sigma_1^2\sigma_2^2 \widetilde{\rho}} \begin{bmatrix}x_1 - \mu_1 & x_2 - \mu_2\end{bmatrix} \cdot \begin{bmatrix} \sigma_2^2 & -\rho \sigma_2\sigma_1 \\ -\rho \sigma_1\sigma_2 & \sigma_1^2\end{bmatrix} \cdot \begin{bmatrix}x_1 - \mu_1 \\ x_2 - \mu_2\end{bmatrix} \\ &= \frac{1}{\sigma_1^2\sigma_2^2 \widetilde{\rho}} \begin{bmatrix}(x_1-\mu_1)\sigma_2^2 - (x_2-\mu_2)\rho\sigma_1\sigma_2 & (x_2 - \mu_2)\sigma_1^2 - (x_1 - \mu_1)\rho\sigma_2\sigma_1 \end{bmatrix} \cdot \begin{bmatrix}x_1 - \mu_1 \\ x_2 - \mu_2\end{bmatrix} \\ &= \frac{1}{\sigma_1^2\sigma_2^2 \widetilde{\rho}} \left( (x_1-\mu_1)^2 \sigma_2^2 - (x_1 - \mu_1)(x_2 - \mu_2)\rho\sigma_1\sigma_2 + (x_2 - \mu_2)^2 \sigma_1^2 - (x_1 - \mu_1)(x_2 - \mu_2)\rho\sigma_2\sigma_1 \right) \\ &= \boxed{ \frac{1}{\widetilde{\rho}} \left( \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 - 2\rho\frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1\sigma_2} \right) } \end{align*} \]
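If you don't feel like re-checking the algebra by hand, here is a quick numeric sanity check (parameter values chosen arbitrarily) that the matrix form equals the boxed scalar expression:

```python
# Verify numerically that (x - mu)^T Sigma^{-1} (x - mu) matches the boxed scalar form.
import numpy as np

sigma1, sigma2, rho = 1.3, 0.7, 0.4
mu = np.array([1.0, -2.0])
x  = np.array([1.5, -1.2])
d  = x - mu

Sigma = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2            ]])

matrix_form = d @ np.linalg.inv(Sigma) @ d
scalar_form = ((d[0] / sigma1)**2 + (d[1] / sigma2)**2
               - 2 * rho * d[0] * d[1] / (sigma1 * sigma2)) / (1 - rho**2)

print(np.isclose(matrix_form, scalar_form))  # True
```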

The 2D Case In Its FINAL FORM

\[ f_{\mathbf{X}}(\mathbf{x}) = C \bigexp{ -\frac{1}{2(1-\rho^2)} \left( \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 - 2\rho\frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1\sigma_2} \right) } \]

where

\[ C = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}. \]
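And a final check (SciPy assumed, arbitrary parameter values) that this 2D formula agrees with `scipy.stats.multivariate_normal`:

```python
# Evaluate the 2D normal pdf "by hand" and compare against scipy.stats.multivariate_normal.
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2, sigma1, sigma2, rho = 0.0, 1.0, 1.5, 0.5, -0.3
x1, x2 = 0.7, 1.2

C = 1 / (2 * np.pi * sigma1 * sigma2 * np.sqrt(1 - rho**2))
Q = ((x1 - mu1)**2 / sigma1**2 + (x2 - mu2)**2 / sigma2**2
     - 2 * rho * (x1 - mu1) * (x2 - mu2) / (sigma1 * sigma2)) / (1 - rho**2)
by_hand = C * np.exp(-Q / 2)

Sigma = [[sigma1**2, rho * sigma1 * sigma2],
         [rho * sigma1 * sigma2, sigma2**2]]
via_scipy = multivariate_normal(mean=[mu1, mu2], cov=Sigma).pdf([x1, x2])

print(np.isclose(by_hand, via_scipy))  # True
```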

References

DeGroot, Morris H., and Mark J. Schervish. 2013. Probability and Statistics. Pearson Education.