Week 5A: Moments and Covariance

DSAN 5100: Probabilistic Modeling and Statistical Computing
Section 01

Purna Gamage, Amineh Zadbood, and Jeff Jacobs

Monday, September 22, 2025

Moments of a Distribution: Expectation and Variance

Expectations = Weighted Means

  • We already know how to find the (unweighted) mean of a list of numbers:

\[ \begin{align*} \begin{array}{|p{1cm}||p{1cm}|p{1cm}|p{1cm}|}\hline X & \orange{4} & \orange{10} & \orange{8} \\\hline\end{array} \implies \overline{X} &= \frac{\orange{4} + \orange{10} + \orange{8}}{\purp{3}} = \purp{\left(\frac{1}{3}\right)} \cdot \orange{4} + \purp{\left( \frac{1}{3} \right)} \cdot \orange{10} + \purp{\left( \frac{1}{3} \right)} \cdot \orange{8} \\ &= \frac{22}{3} \approx 7.33 \end{align*} \]

  • Discrete distributions: just lists of numbers and their probability of occurring!

\[ \begin{align*} \begin{array}{|p{1cm}|p{1cm}|p{1cm}|p{1cm}|}\hline X & \orange{4} & \orange{10} & \orange{8} \\\hline \Pr(X) & \purp{0.01} & \purp{0.01} & \purp{0.98}\\\hline\end{array} \implies \overline{X} &= \purp{\left( \frac{1}{100} \right)} \cdot \orange{4} + \purp{\left( \frac{1}{100} \right)} \cdot \orange{10} + \purp{\left( \frac{98}{100} \right)} \cdot \orange{8} \\ &= \frac{798}{100} = 7.98 \end{align*} \]
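For instance, computing this probability-weighted mean takes one line of R (a quick sketch using the values and probabilities from the table above):

Code
# Expectation as a probability-weighted mean of the values in the table
x <- c(4, 10, 8)
p <- c(0.01, 0.01, 0.98)
sum(p * x)  # 7.98, vs. the unweighted mean (4 + 10 + 8) / 3 = 7.33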

Different Types of “Averages”

  • (This will seem like overkill now, but will help us later!)
  • To avoid confusion, we denote the “regular” (arithmetic) mean function as \(M_1(\cdot)\)
    • If \(V = \{v_1, \ldots, v_n\}\), \(M_1(V) \definedas \frac{v_1+\cdots+v_n}{n}\).
  • Then \(\overline{V}\) will denote the number which results from applying \(M_1\) to the set \(V\).
  • Other functions that get called “averages”, like the median, the harmonic mean (\(M_{-1}\)), the geometric mean (\(M_0\)), and the Hadamard product \(\odot\), pop up surprisingly often in Data Science/Machine Learning!
  • The things we’re averaging also take on weird forms: bits, logical predicates, vectors, tensors (Hence Google’s ML platform, TensorFlow), …

Definition

  • For a discrete RV \(X\):

\[ \expect{X} = \sum_{x \in \mathcal{R}_X}x P(x) \]

  • For a continuous RV \(X\):

\[ \expect{X} = \int_{-\infty}^{\infty}xf(x)dx \]
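Both definitions are straightforward to sketch in R: the discrete case is a probability-weighted sum over the support, and the continuous case can be approximated numerically with integrate() (illustrated here with an assumed Exponential(rate = 2) density, whose true mean is 1/2):

Code
# Discrete case: E[X] = sum over the support of x * P(x)
x_vals <- c(4, 10, 8)
p_vals <- c(0.01, 0.01, 0.98)
sum(x_vals * p_vals)  # 7.98

# Continuous case: E[X] = integral of x * f(x) dx
# (assumed example: Exponential(rate = 2), whose true mean is 1/2)
integrate(function(x) x * dexp(x, rate = 2), lower = 0, upper = Inf)$value  # ~0.5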

Important Properties

  • For RVs \(X\), \(Y\), and \(a, b \in \mathbb{R}\):
Linear

\[ \expect{aX} = a\expect{X} \]

Additive

\[ \expect{X + Y} = \expect{X} + \expect{Y} \]

Affine

\[ \expect{aX + b} = a\expect{X} + b \]

  • LOTUS:

\[ \expect{g(X)} = \int_{-\infty}^{\infty}g(x)f(x)dx \]

  • Not Multiplicative: in general, \(\expect{X \cdot Y} \neq \expect{X} \cdot \expect{Y}\); equality does hold, though, whenever \(X\) and \(Y\) are independent:

\[ X \perp Y \implies \expect{X \cdot Y} = \expect{X} \cdot \expect{Y} \]
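A quick Monte Carlo sketch of these properties, with arbitrarily chosen constants \(a = 2\), \(b = 5\) and independent standard normal draws standing in for \(X\) and \(Y\) (sample means only approximate the true expectations):

Code
set.seed(5100)  # arbitrary seed, for reproducibility
n <- 100000
x <- rnorm(n); y <- rnorm(n)   # independent N(0,1) draws
a <- 2; b <- 5
mean(a * x + b)   # ~ a * E[X] + b = 5  (affine property)
mean(x + y)       # ~ E[X] + E[Y] = 0   (additivity)
mean(x^2)         # LOTUS with g(x) = x^2: ~ E[X^2] = 1
mean(x * y)       # ~ E[X] * E[Y] = 0, since x and y are independent here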

Variance: Motivation

  • We’ve now got a “measure of central tendency”, the expectation \(\expect{X}\), with some nice properties. We can use it to produce point estimates.
  • Now, how do we describe and communicate the spread of the data in a dataset? Similarly, how can we describe our uncertainty about a point estimate?
  • Let’s try to develop a function, \(\text{Spread}\), that takes in a set of values and computes how spread out they are
  • (Hint: can use arithmetic mean, applied to differences between points rather than points themselves)

First Attempt

  • What properties should \(\text{Spread}(\cdot)\) have?
    • Should be \(0\) if every data point is identical, then increase as they spread apart
  • How about: average difference between each point and the overall (arithmetic) mean? \[ \text{Spread}(X) = M_1(X - \overline{X}) = \frac{(x_1 - \overline{X}) + (x_2 - \overline{X}) + \cdots + (x_n - \overline{X})}{n} \]
Code
library(tidyverse)
library(latex2exp)
# The plotting constants and theme below are defined in the slides' shared
# setup chunk; minimal stand-ins are provided here so the chunk runs on its own
g_linesize <- 1
g_pointsize <- 3
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
dsan_theme <- function(...) theme_minimal()
N <- 10
x <- seq(1, N)
y <- rnorm(N, 0, 10)
mean_y <- mean(y)
spread <- y - mean_y
df <- tibble(x=x, y=y, spread=spread)
ggplot(df, aes(x=x, y=y)) +
  # Dashed horizontal line at the arithmetic mean of y
  geom_hline(aes(yintercept=mean_y, linetype="dashed"), color="purple", size=g_linesize) +
  # Segments from each point to the mean, colored by the sign of the deviation
  geom_segment(aes(xend=x, yend=mean_y, color=ifelse(y > mean_y, "Positive", "Negative")), size=g_linesize) +
  geom_point(size=g_pointsize) +
  scale_linetype_manual(element_blank(), values=c("dashed"="dashed"), labels=c("dashed"=unname(TeX(c("$M_1(X)$"))))) +
  dsan_theme("half") +
  scale_color_manual("Spread", values=c("Positive"=cbPalette[3], "Negative"=cbPalette[6])) +
  scale_x_continuous(breaks=seq(0,10,2)) +
  theme(legend.spacing.y=unit(0.1, "mm")) +
  labs(
    title=paste0(N, " Randomly-Generated Points, N(0,10)"),
    x="Index",
    y="Value"
  )

The result? To ten decimal places:

Code
spread_fmt <- sprintf("%0.10f", mean(df$spread))
writeLines(spread_fmt)
-0.0000000000

😞 What happened?

Avoiding Cancellation

  • How to avoid positive and negative deviations cancelling out? Two ideas:
    • Absolute value \(\left|X - \overline{X}\right|\)
    • Squared error \(\left( X - \overline{X} \right)^2\)
  • Ghost of calculus past: which is differentiable everywhere?
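A quick numeric check of the two candidate fixes (a sketch on ten simulated points; the signed deviations cancel, while the absolute and squared deviations do not):

Code
set.seed(5100)  # arbitrary seed
y <- rnorm(10, 0, 10)
mean(y - mean(y))        # ~0: positive and negative deviations cancel out
mean(abs(y - mean(y)))   # mean absolute deviation: no cancellation
mean((y - mean(y))^2)    # mean squared deviation: no cancellation (and differentiable)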

We’ve Arrived at Variance!

\[ \Var{X} = \bigexpect{ \left(X - \expect{X}\right)^2 } \]

  • And, we can apply what we know about \(\expect{X}\) to derive:

\[ \begin{align*} \Var{X} &= \bigexpect{ \left(X - \expect{X}\right)^2 } = \bigexpect{ X^2 - 2X\expect{X} + \left( \expect{X} \right)^2 } \\ &= \expect{X^2} - \expect{2 X\expect{X}} + \left( \expect{X} \right)^2 \\ &= \expect{X^2} - 2\expect{X}\expect{X} + \left(\expect{X}\right)^2 \\ &= \expect{X^2} - \left( \expect{X} \right)^2 \; \; \green{\small{\text{ (we'll need this in a minute)}}} \end{align*} \]
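The shortcut form on the last line is easy to verify numerically (a sketch on a simulated sample; we compute the population version by hand, since R's built-in var() uses the \(n - 1\) denominator):

Code
set.seed(5100)
x <- rnorm(10000, mean = 3, sd = 2)
mean((x - mean(x))^2)    # definition: E[(X - E[X])^2]
mean(x^2) - mean(x)^2    # shortcut:   E[X^2] - (E[X])^2 -- identical value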

Standard Deviation

  • When we squared the deviations, we lost the units of our datapoints!
  • To see spread, but in the same units as the original data, let’s just undo the squaring!

\[ \text{SD}[X] = \sqrt{\Var{X}} \]

  • But, computers don’t care about the unit of this measure (just minimizing it). No reason to do this additional step if humans aren’t looking at the results!

Properties of Variance

  • Recall that Expectation was an affine function:

\[ \mathbb{E}[aX + b] = a\mathbb{E}[X] + b \]

  • Variance has a similar (but not identical) property: it is homogeneous of degree 2, which means

\[ \Var{aX + b} = a^2\Var{X} \; \underbrace{\phantom{+ b}}_{\mathclap{\text{(Something missing?)}}} \]

What Happened to the \(b\) Term?

Mathematically:

\[ \begin{align*} \Var{aX + b} \definedas \; &\mathbb{E}[(aX + b - \mathbb{E}[aX + b])^2] \\ \definedalign \; &\expect{(aX \color{orange}{+ b} - a\expect{X} \color{orange}{- b})^2} \\ \definedalign \; &\expect{a^2X^2 - 2a^2X\expect{X} + a^2\expectsq{X}} \\ \definedalign \; &a^2\left( \expect{X^2} - 2\expectsq{X} + \expectsq{X} \right) = a^2\left( \expect{X^2} - \expectsq{X} \right) \\ \definedas \; & a^2\Var{X} \end{align*} \]
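A quick simulation check of this property, with arbitrarily chosen \(a = 2\) and \(b = 5\):

Code
set.seed(5100)
x <- rnorm(100000)   # X ~ N(0,1), so Var[X] = 1
a <- 2; b <- 5       # arbitrary constants
var(a * x + b)       # ~ a^2 * Var[X] = 4: the additive b has no effect on spread
var(x + b)           # ~ Var[X] = 1: shifting alone changes nothing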

What Happened to the \(b\) Term?

  • Visually (Assuming \(X \sim \mathcal{N}(0,1)\))

\[ \begin{align*} \expect{{\color{lightblue}X + 1}} = \expect{{\color{orange}X}} + 1, \; \; \Var{{\color{lightblue}X + 1}} = \Var{{\color{orange}X}} \\ \expect{{\color{green}X - 3}} = \expect{{\color{orange}X}} - 3, \; \; \Var{{\color{green}X - 3}} = \Var{{\color{orange}X}} \end{align*} \]
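A minimal ggplot sketch of the picture described here (three shifted copies of the same \(\mathcal{N}(0,1)\) density, which differ in their means but share exactly the same spread):

Code
library(tidyverse)
x_grid <- seq(-6, 4, length.out = 500)
df_shift <- tibble(
  x = rep(x_grid, 3),
  # Density of X, X + 1, and X - 3 when X ~ N(0,1)
  density = c(dnorm(x_grid), dnorm(x_grid - 1), dnorm(x_grid + 3)),
  rv = rep(c("X", "X + 1", "X - 3"), each = length(x_grid))
)
ggplot(df_shift, aes(x = x, y = density, color = rv)) +
  geom_line() +
  labs(title = "Shifting X changes E[X] but not Var[X]", color = "RV")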

Generalizing Expectation/Variance: The Moment-Generating Function (MGF)

Generalizing Expectation and Variance

  • It turns out that expectation and variance are just two “levels” of a hierarchy of information about a distribution!
  • In calculus: knowing \(f(x)\) is sufficient information for us to subsequently figure out \(f'(x)\), \(f''(x)\), …
  • In probability/statistics: knowing \(M_X(t)\) is sufficient information for us to figure out \(\expect{X}\), \(\Var{X}\), …

Not a Metaphor!

  • This calculus \(\leftrightarrow\) statistics connection is not a metaphor: differentiating \(M_X(t)\) literally gives us \(\expect{X}\), \(\Var{X}\), …
  • Let’s look at the MGF for \(X \sim \text{Bern}(\param{p})\), and try to derive \(\expect{X}\):

\[ \begin{align*} M_X(t) &= (1 - p) + pe^t \\ M'_X(t) &= pe^t,\text{ and }\expect{X} = M'_X(0) = \green{p} \; ✅ \end{align*} \]

  • \(\Var{X}\)?

\[ \begin{align*} M''_{X}(t) &= pe^t,\text{ and }\expect{X^2} = M''_X(0) = p \\ \Var{X} &\definedas{} \expect{X^2} - (\expect{X})^2 = p - p^2 = \green{p(1-p)} \; ✅ \end{align*} \]
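A numeric sanity check of this derivation (a sketch with an arbitrary choice of \(p = 0.3\), using finite-difference approximations to the derivatives of \(M_X\) at \(t = 0\)):

Code
p <- 0.3                                  # arbitrary choice of p
mgf <- function(t) (1 - p) + p * exp(t)   # MGF of X ~ Bern(p)
h <- 1e-4                                 # small step for finite differences
(mgf(h) - mgf(-h)) / (2 * h)              # ~ M'_X(0)  = E[X]   = 0.3
(mgf(h) - 2 * mgf(0) + mgf(-h)) / h^2     # ~ M''_X(0) = E[X^2] = 0.3
# Var[X] = E[X^2] - (E[X])^2 = 0.3 - 0.09 = 0.21 = p * (1 - p)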

MGF in Econometrics

BEWARE ☠️

As we saw last week (the Dreaded Cauchy Distribution):

  • Not all random variables have moment-generating functions.
  • Worse yet, not all random variables have well-defined variances
  • Worse yet, not all random variables have well-defined means
  • (This happens in non-contrived cases!)
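A quick sketch of the last point: the running mean of Cauchy draws never settles down, no matter how large the sample, while the running mean of Normal draws converges as the Law of Large Numbers promises:

Code
set.seed(5100)
n <- 100000
cauchy_means <- cumsum(rcauchy(n)) / seq_len(n)   # running mean of Cauchy draws
normal_means <- cumsum(rnorm(n)) / seq_len(n)     # running mean of N(0,1) draws
tail(cauchy_means, 1)   # typically still far from 0, and changes wildly with the seed
tail(normal_means, 1)   # ~0, as expected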

Moving Beyond One RV

Multivariate Distributions (W02)

  • The bivariate normal distribution represents the distribution of two normally-distributed RVs \(\mathbf{X} = [\begin{smallmatrix} X_1 & X_2\end{smallmatrix}]^\top\), which may or may not be correlated:

\[ \mathbf{X} = \begin{bmatrix}X_1 \\ X_2\end{bmatrix}, \; \boldsymbol{\mu} = %\begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix} \begin{bmatrix}\smash{\overbrace{\mu_1}^{\mathbb{E}[X_1]}} \\ \smash{\underbrace{\mu_2}_{\mathbb{E}[X_2]}}\end{bmatrix} , \; \mathbf{\Sigma} = \begin{bmatrix}\smash{\overbrace{\sigma_1^2}^{\text{Var}[X_1]}} & \smash{\overbrace{\rho\sigma_1\sigma_2}^{\text{Cov}[X_1,X_2]}} \\ \smash{\underbrace{\rho\sigma_2\sigma_1}_{\text{Cov}[X_2,X_1]}} & \smash{\underbrace{\sigma_2^2}_{\text{Var}[X_2]}}\end{bmatrix} % \begin{bmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_2\sigma_1 & \sigma_2^2 \end{bmatrix} % = \begin{bmatrix}\text{Var}[X_1] & \text{Cov}[X_1,X_2] \\ \text{Cov}[X_2,X_1] & \text{Var}[X_2] \end{bmatrix} \]

  • By squishing all this information into matrices, we can specify the parameters of multivariate-normally-distributed vectors of RVs similarly to how we specify single-dimensional normally-distributed RVs:

\[ \begin{align*} \overbrace{X}^{\mathclap{\text{scalar}}} &\sim \mathcal{N}\phantom{_k}(\overbrace{\mu}^{\text{scalar}}, \overbrace{\sigma}^{\text{scalar}}) \tag{Univariate} \\ \underbrace{\mathbf{X}}_{\text{vector}} &\sim \boldsymbol{\mathcal{N}}_k(\smash{\underbrace{\boldsymbol{\mu}}_{\text{vector}}}, \underbrace{\mathbf{\Sigma}}_{\text{matrix}}) \tag{Multivariate} \end{align*} \]
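A sketch of sampling from a bivariate normal in R (using MASS::mvrnorm, with arbitrarily chosen \(\boldsymbol{\mu}\), \(\sigma_1\), \(\sigma_2\), and \(\rho\)), then checking that the empirical means and covariance matrix recover \(\boldsymbol{\mu}\) and \(\mathbf{\Sigma}\):

Code
library(MASS)
set.seed(5100)
mu <- c(0, 2)                              # E[X1], E[X2] (arbitrary choices)
sigma1 <- 1; sigma2 <- 2; rho <- 0.5       # arbitrary choices
Sigma <- matrix(c(sigma1^2,             rho * sigma1 * sigma2,
                  rho * sigma2 * sigma1, sigma2^2), nrow = 2)
X <- mvrnorm(n = 100000, mu = mu, Sigma = Sigma)
colMeans(X)   # ~ mu
cov(X)        # ~ Sigma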

Visualizing 3D Distributions: Projection

  • Since most of our intuitions about plots come from 2D plots, it is extremely useful to be able to take a 3D plot like this and imagine “projecting” it down into different 2D plots:

Adapted (and corrected!) from LaTeX code in this StackExchange thread

Visualizing 3D Distributions: Contours

From Prof. Hickman’s slides!

Visualizing 3D Distributions: Contours

Also from Prof. Hickman’s slides!

Bivariate Distributions

DeGroot and Schervish (2013, 118) | DSPS Sec. 3.4

  • We generalize the concept of the distribution of a random variable to the joint distribution of two random variables.
  • In doing so, we introduce the joint pmf for two discrete random variables, the joint pdf for two continuous variables, and the joint CDF for any two random variables.
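As a tiny preview of the discrete case: a joint pmf is just a table of probabilities summing to 1, and the marginal pmfs are its row and column sums (a sketch with a made-up \(2 \times 3\) table):

Code
# A made-up joint pmf for discrete RVs X (rows) and Y (columns)
joint <- matrix(c(0.10, 0.20, 0.10,
                  0.15, 0.30, 0.15), nrow = 2, byrow = TRUE,
                dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2", "y3")))
sum(joint)       # 1: a valid joint pmf
rowSums(joint)   # marginal pmf of X: 0.40, 0.60
colSums(joint)   # marginal pmf of Y: 0.25, 0.50, 0.25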

References

DeGroot, Morris H., and Mark J. Schervish. 2013. Probability and Statistics. Pearson Education.
Hansen, Lars Peter. 1982. “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica: Journal of the Econometric Society, 1029–54.

Appendix Slides

Appendix I: The Hadamard Product

  • Used in nearly all neural NLP algorithms, as the basis of the LSTM (see the LSTM equations below)
  • Subscripts on harmonic mean \(M_{-1}\), geometric mean \(M_0\), and arithmetic mean \(M_1\) come from the definition of the generalized mean:

\[ M_p(V) = \left( \frac{1}{n} \sum_{i=1}^n v_i^p \right)^{1/p} \]
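A sketch of the generalized mean in R, with the \(p = 0\) case handled as its limit (the geometric mean):

Code
# Generalized (power) mean M_p; p = 0 is defined as the limit, the geometric mean
M_p <- function(v, p) {
  if (p == 0) exp(mean(log(v))) else mean(v^p)^(1 / p)
}
v <- c(4, 10, 8)
M_p(v, -1)   # harmonic mean   ~ 6.32
M_p(v,  0)   # geometric mean  ~ 6.84
M_p(v,  1)   # arithmetic mean ~ 7.33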

\[ \begin{align*} f_t &= \sigma(W_f [h_{t - 1}, x_t] + b_f) \\ i_t &= \sigma(W_i [h_{t - 1}, x_t] + b_i) \\ \tilde{C}_t &= \tanh(W_C [h_{t - 1}, x_t] + b_C) \\ C_t &= f_t \odot C_{t - 1} + i_t \odot \tilde{C}_t \\ o_t &= \sigma(W_o [h_{t - 1}, x_t] + b_o) \\ h_t &= o_t \odot \tanh(C_t) \\ \hat{y} &= \text{softmax}(W_y h_t + b_y) \end{align*} \]
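In R, the Hadamard (elementwise) product of two matrices is just *, in contrast to the ordinary matrix product %*% (a minimal sketch):

Code
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
A * B     # Hadamard product: elementwise multiplication
A %*% B   # ordinary matrix product, for comparison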

Appendix II: Continuous RV Support

In most cases, for continuous RVs, the definition

\[ \mathcal{R}_X = \{x \in \mathsf{Domain}(f_X) \given f_X(x) > 0\} \]

works fine. But, to fully capture all possible continuous RVs, the following formal definition is necessary:

\[ \mathcal{R}_X = \left\{x \in \mathbb{R} \given \forall r > 0 \left[ \Pr\left(X \in B(x,r)\right) > 0 \right] \right\}, \]

where \(B(x,r)\) is an open ball around \(x\) with radius \(r\) (in one dimension, just the interval \((x - r, x + r)\)).