
Probabilistic Reasoning

What is a Probability?

First things first, we need to get some very basic notions of probability down on paper before we can move into the real meat of the section, Bayesian statistics. For the purposes of this course, a probability $\Pr(X)$ is just a number between 0 and 1 representing how likely we think some event $X$ is[1]. If we're flipping a coin, for example, and we have no reason to believe that the coin is biased in any way, then we can create and adopt a model of the coin called $\fair{}$ in which $\Pr(\Heads{} \mid \fair{}) = 0.5$ (read this as ``the probability of seeing heads given the $\fair{}$ model'') and $\Pr(\Tails{} \mid \fair{}) = 1 - \Pr(\Heads{} \mid \fair{}) = 0.5$.

As a quick but important aside, the reason we only have to define $\Pr(\Heads{} \mid \fair{})$ here (with $\Pr(\Tails{} \mid \fair{})$ being automatically derived as a result) is that the full \textbf{event space}, or set of all possible events, is $\Omega = \{\Heads{}, \Tails{}\}$. By the rules of probability, $\Pr(\Omega)$, the probability of \textit{anything} happening:

\begin{align*} \Pr(\text{Event 1 in }\Omega\text{ or Event 2 in }\Omega\text{ or Event 3 in }\Omega\text{ or }\ldots) \end{align*}

must equal 1 for a model to be valid\footnote{Hence the equations in this paragraph look like just $\Pr(\textsf{thing})$ rather than $\Pr(\textsf{thing} \mid \textsf{model})$!}. Then, knowing that the logical connectives ``and'' and ``or'' for events in a valid model must satisfy (where the ``and'' rule below holds for \textit{independent} events $X$ and $Y$)

\begin{align*} \Pr(X\text{ and }Y) &= \Pr(X)\cdot \Pr(Y), \\ \Pr(X\text{ or }Y) &= \Pr(X) + \Pr(Y) - \Pr(X\text{ and }Y), \end{align*}

we know that in any of our models we must have

\begin{align*} \Pr(\Omega) = \Pr(\Heads{}\text{ or }\Tails{}) &= \Pr(\Heads{}) + \Pr(\Tails{}) - \Pr(\Heads{}\text{ and }\Tails{}) \\ &= \Pr(\Heads{}) + \Pr(\Tails{}) - 0 = 1, \end{align*}

so that

\begin{align*} \Pr(\Tails{}) = 1 - \Pr(\Heads{}), \end{align*}

allowing us to immediately derive $\Pr(\Tails{} \mid \fair{})$ from our model's assertion that $\Pr(\Heads{} \mid \fair{}) = 0.5$\footnote{Scrupulous readers will notice that I actually snuck a model assumption into the $\Pr(\Omega)$ line above, namely, that $\Pr(\Heads{}\text{ and }\Tails{}) = 0$. However, if you're that scrupulous, hopefully you also know that the ``atomic'' events within $\Omega$ must be mutually exclusive...}. In fact, this gives us a third rule that a valid probability model must satisfy:

\begin{align*} \Pr(\text{not }X) = 1 - \Pr(X), \end{align*}

where ``not $X$'' is shorthand for ``the event $X$ does \textit{not} happen''. In our case, since the only two possible events are $\Heads{}$ and $\Tails{}$,

\begin{align*} \Pr(\Tails{} \mid \fair{}) = \Pr(\text{not }\Heads{} \mid \fair{}) = 1 - \Pr(\Heads{} \mid \fair{}) = 1 - 0.5 = 0.5. \end{align*}

In the back of our heads, however, we can construct an alternate model of the coin called $\biased{}$, in which $\Pr(\Heads{} \mid \biased{}) = 0.8$ and $\Pr(\Tails{} \mid \biased{}) = 1 - 0.8 = 0.2$. Then, the beauty of Bayesian statistics is that we can go out into the world and see which model best comports with what we observe. So, if we notice that the coin keeps coming up heads a suspiciously large number of times, we can change the model we believe from $\fair{}$ to $\biased{}$. Mathematically, we would want to do so if $\likelihood{\data{} \mid \biased{}} > \likelihood{\data{} \mid \fair{}}$. This $\mathscr{L}$ symbol literally just means ``likelihood'', and it's mathematically the same as a probability but written differently to emphasize an unusual property of this calculation: usually when we write $\Pr(X \mid Y)$ we mean that we're computing it because we want to know how likely $X$ is in a world where we know that $Y$ happened, but in this case we're working ``in reverse'', computing $\Pr(\data{} \mid \textsf{model})$ not because we're interested in that quantity in and of itself but only because we want to see how \textit{likely} this outcome was given the (varying) model on the right-hand side of the conditional bar.
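To make this concrete, here is a minimal sketch in Python of the likelihood comparison (the flip data and the function name here are my own invention, chosen just for illustration):

```python
# Compare the likelihood of the same coin-flip data under two models:
# fair (P(heads) = 0.5) and biased (P(heads) = 0.8).

def likelihood(flips, p_heads):
    """Probability of seeing this exact flip sequence, given P(heads)."""
    result = 1.0
    for flip in flips:
        result *= p_heads if flip == "H" else (1 - p_heads)
    return result

# Hypothetical observed data: 8 heads and 2 tails in 10 flips.
data = ["H"] * 8 + ["T"] * 2

L_fair = likelihood(data, 0.5)    # (0.5)^10 = 0.0009765625
L_biased = likelihood(data, 0.8)  # (0.8)^8 * (0.2)^2, roughly 0.0067

print(L_biased > L_fair)  # True: these data favor the biased model
```

Note that the individual likelihoods are tiny (any \textit{specific} sequence of 10 flips is unlikely); what matters for choosing a model is only which likelihood is larger.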

Optimal Queueing: Why Whole Foods is (Sadly but Truly) the Future

Now let's apply what we learned in the previous section. Ever notice how, in grocery stores that let you choose which checkout line to wait in, the other lines always seem to go faster than the one you're in? There's a probability-theoretic reason for this! Let's start from the simplest (non-trivial) case: a grocery store with two lines $A$ and $B$. There are only two possibilities here: either $A$ moves faster than $B$, or vice-versa. Mathematically, we can represent this situation by writing out the \textbf{event space} as $\{AB, BA\}$, where the first element represents the case where $A$ moves faster than $B$ and the second element represents the case where $B$ moves faster than $A$. Without knowing anything about the cashiers or the customers in line, the best model we can develop \textit{a priori} is that these two outcomes are equally likely: $\Pr(AB) = \Pr(BA) = 0.5$. So, if you choose to enter line $A$, there's a 50\% chance that your line moves the fastest, and a 50\% chance that the other line $B$ moves the fastest.

So far so good -- the probability that your line moves fastest is the same as the probability that some other line moves fastest. But what happens when we move to a grocery store with 3 lines, $A$, $B$, and $C$? In this case, the possible line orderings (from fastest to slowest) are $\{ABC, ACB, BAC, BCA, CAB, CBA\}$. As before, we consider all of these outcomes as equally likely. So, now what is the probability that the line you choose will be the fastest? If you choose line $A$, your line is the fastest in only two of the six possible outcomes: $ABC$ and $ACB$. Since each outcome is equally likely, each has probability $1/6$, and thus the probability that your line moves fastest, $\Pr(ABC\text{ or }ACB) = \Pr(ABC) + \Pr(ACB) - \Pr(ABC\text{ and }ACB)$, is $1/6 + 1/6 - 0 = 2/6$\footnote{Note again that $\Pr(ABC\text{ and }ACB) = 0$, since $ABC$ and $ACB$ are disjoint events -- it can't both be the case that line $B$ moved faster than line $C$ \textit{and} line $C$ moved faster than line $B$.}. Then, using our ``not'' rule from above, the probability that your line does \textit{not} move fastest is

\begin{align*} 1 - \Pr(ABC\text{ or }ACB) = 1 - (2/6) = 4/6. \end{align*}

So, even with only three lines in the store, we already see that there's only a 33.3\% chance that the line we choose will go the fastest, versus a 66.7\% chance that we will see (at least) one of the other two lines moving faster...

To quickly look at the case of four lines $A$, $B$, $C$, and $D$, note that now the possible events are

\begin{align*} \Omega &= \{ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, \\ &\phantom{= \{}BACD, BADC, BCAD, BCDA, BDAC, BDCA, \\ &\phantom{= \{}CABD, CADB, CBAD, CBDA, CDAB, CDBA, \\ &\phantom{= \{}DABC, DACB, DBAC, DBCA, DCAB, DCBA\}, \end{align*}

so that whichever line you enter, the probability of it being fastest is only $25\%$, and the probability that you will see another line go faster is $75\%$. The logic continues in this way, such that in general if there are $N$ lines the probability that you see another line moving faster than yours is $(N-1)/N$. I used to work as a cashier at a grocery store with 10 checkout lines, putting the probability of frustration for a given customer at an astounding $9/10 = 90\%$... though obviously they all rationally calculated this in their heads and never yelled at me upon seeing one of the 9 other lines moving faster.
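This general pattern can be checked by brute force: enumerate every equally-likely ordering of the lines and count the orderings in which your chosen line is not first. A minimal sketch (the function name is mine; by symmetry we can assume you always enter line 0):

```python
from itertools import permutations
from fractions import Fraction

def prob_another_line_faster(n_lines):
    """Probability that some other line finishes before yours,
    assuming all orderings of the n lines are equally likely."""
    total = 0
    frustrating = 0
    for ordering in permutations(range(n_lines)):
        total += 1
        if ordering[0] != 0:  # your line (line 0) is not fastest
            frustrating += 1
    return Fraction(frustrating, total)  # reduces to (N-1)/N

print(prob_another_line_faster(2))  # 1/2
print(prob_another_line_faster(3))  # 2/3
print(prob_another_line_faster(4))  # 3/4
```

Exact `Fraction` arithmetic is used here instead of floats so the reduced answers match the $(N-1)/N$ formula literally.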

Bayesian Statistics: A Scary Term for an Intuitive Concept

One of the scariest sentiments in politics, to me, is the notion that someone or some group “knows” that they’re “right” about something. One of the core principles that sets Marxism or anti-capitalism apart from religion and superstition[2] is the notion that we can come to hold these beliefs by looking out into the world, observing and measuring and comparing things, and updating our beliefs to incorporate whatever we learn. This is the basic intuition behind Bayesian reasoning, which puts this into practice via an “update equation” (which you don’t have to memorize!) specifying exactly how much one should “nudge” their degree of belief in some proposition, thus updating their “mental model” of the world, given observed evidence for or against it.

As simple as this seems, it turns out that Bayesian reasoning is the optimal method for drawing inferences about social phenomena, in a mathematically precise sense that we will discuss. For example, a Bayesian gambler will always beat a non-Bayesian gambler given a sufficient number of bets. More on that later.

Probabilistic Graphical Models

“Probabilistic Graphical Model” or PGM is just a fancy term for a statistical tool which operationalizes an intuitive idea: when trying to understand a complex phenomenon with lots of “moving parts” interacting with one another, a good way to start analyzing it is often to break it down into its constituent parts and then specify how these parts work together to give rise to the phenomenon. With this in mind, a PGM is a collection of nodes (drawn as circles) representing variables and edges (drawn as arrows) representing relationships of influence between nodes, codified as “Conditional Probability Tables”. So, if we wanted to model the relationship between weather and a person’s choice of whether to go out and party or stay in and watch a movie on a given Saturday evening, we could use

  1. A variable $w$ representing the weather, which can take on values in $\{\Sunny{}, \Rainy{}\}$,

  2. A variable $a$ representing the person's action, which can take on values in $\{\GoOut{}, \StayIn{}\}$, and

  3. An edge $e$ from $w$ to $a$ which encodes the intuition that one is more likely to go out if it's sunny than if it's rainy via the probability distribution $\Pr(a = \GoOut{} \mid w = \Sunny{}) = 0.8$, $\Pr(a = \StayIn{} \mid w = \Sunny{}) = 0.2$, $\Pr(a = \GoOut{} \mid w = \Rainy{}) = 0.1$, and $\Pr(a = \StayIn{} \mid w = \Rainy{}) = 0.9$, which we can also represent as a simple Conditional Probability Table:

    \begin{center} \begin{tabular}{cc} \hline Weather (Value of $w$) & Probability of Going Out $\Pr(a = \GoOut{} \mid w)$ \\ \hline\hline \Sunny{} & 0.8 \\ \Rainy{} & 0.1 \\ \hline \end{tabular} \end{center}

We need just one more thing before our PGM is complete, however: while we can use this Conditional Probability Table to obtain any information we want about $a$, notice that the table depends on information about $w$. Thus, to fully parameterize our PGM, we'll need to supply a non-conditional probability table giving the initial distribution over the weather. In this case, let's just say that there's a 50/50 chance of rain or sunshine, so that $\Pr(w = \Sunny{}) = 0.5$.
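At this point the whole PGM is just these two tables. As a sketch, we can write them down as plain Python dictionaries (the variable and value names here are my own) and recover the marginal probability of going out by summing over the weather:

```python
# The two tables that fully parameterize the weather -> action PGM.
prior_w = {"Sunny": 0.5, "Rainy": 0.5}  # Pr(w)
cpt_a = {                               # Pr(a | w)
    "Sunny": {"GoOut": 0.8, "StayIn": 0.2},
    "Rainy": {"GoOut": 0.1, "StayIn": 0.9},
}

# Marginalize out the weather: Pr(a = GoOut) = sum_w Pr(a = GoOut | w) Pr(w)
p_go_out = sum(cpt_a[w]["GoOut"] * prior_w[w] for w in prior_w)
print(p_go_out)  # 0.45
```

This marginal, $\Pr(a = \GoOut{}) = 0.45$, is exactly the denominator that will appear when we apply Bayes' Rule below.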

Now we have everything we need! The resulting PGM, in graphical form, is presented in Figure 3. Pretty boring, but it gets the job done.

\begin{figure}[!ht] \centering
\begin{tikzpicture}[
  every path/.style={thick},
  outer/.style={draw,circle,minimum size=1cm,inner sep=0},
  obs/.style={fill=lightgray},
  latent/.style={fill=white},
  ar/.style={->,>=latex},
  node distance=0.8cm,
  auto
]
\node[outer] (w) {$w$};
\node[outer, right=of w] (a) {$a$};
\draw[ar] (w) -- (a);
\end{tikzpicture}
\caption{A basic PGM, representing the relationship between $w$, the weather, and $a$, the subsequent action of a person deciding whether to go out or stay in for the night.}
\label{fig:pgm-noshade}
\end{figure}

The beautiful thing about PGMs, though (and the primary reason to use them), is that you can then use this model to make inferences about the world in the face of incomplete information -- i.e., the situation in pretty much every real-world problem. The key tool here is the separation of nodes into two categories: observed (represented graphically as a shaded node) and latent (represented graphically as an unshaded node). Thus we can now use our model as a weather-inference machine: if we observe that the person we’re modeling is out at a party with us, what can we infer from this information about the weather outside? We can draw this situation as a PGM with shaded and unshaded nodes, as in Figure 4, and then use Bayes’ Rule to perform calculations over the network, to see how the observed information about the person at the party “flows” back into the node representing the weather.

\begin{figure}[!ht] \centering
\begin{tikzpicture}[
  every path/.style={thick},
  outer/.style={draw,circle,minimum size=1cm,inner sep=0},
  obs/.style={fill=lightgray},
  latent/.style={fill=white},
  ar/.style={->,>=latex},
  node distance=0.8cm,
  auto
]
\node[outer] (w) {$w$};
\node[outer, obs, right=of w] (a) {$a$};
\draw[ar] (w) -- (a);
\end{tikzpicture}
\caption{The same situation as in Figure \ref{fig:pgm-noshade}, except that the node for variable $a$ is now shaded, indicating a situation where we have observed the person's action ($a = \GoOut{}$) but still only have a probability distribution over the weather $w$.}
\label{fig:pgm-shaded}
\end{figure}

Keeping in mind that Bayes' Rule tells us, for any two events $A$ and $B$, how to use information about $\Pr(B \mid A)$ to obtain information about $\Pr(A \mid B)$:

\begin{align*} \Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B)}, \end{align*}

we can now apply this rule to obtain our new probability distribution over the weather, taking into account the new information that the person has chosen to go out:

\begin{align*} \Pr(w = \Sunny{} \mid a = \GoOut{}) &= \frac{\Pr(a = \GoOut{} \mid w = \Sunny{})\Pr(w = \Sunny{})}{\Pr(a = \GoOut{})} \\ &= \frac{\Pr(a = \GoOut{} \mid w = \Sunny{})\Pr(w = \Sunny{})}{\Pr(a = \GoOut{} \mid w = \Sunny{})\Pr(w = \Sunny{}) + \Pr(a = \GoOut{} \mid w = \Rainy{})\Pr(w = \Rainy{})}. \end{align*}

And now we simply plug in the information we already have from our two probability tables to obtain our new (conditional) probability of interest:

\begin{align*} \Pr(w = \Sunny{} \mid a = \GoOut{}) = \frac{(0.8)(0.5)}{(0.8)(0.5) + (0.1)(0.5)} = \frac{0.4}{0.4 + 0.05} = \frac{0.4}{0.45} \approx 0.89. \end{align*}

So we have now learned something interesting from our observation! Namely: now that we've observed the person out at a party, the probability that it is sunny out jumps from 0.5 (called the ``prior'' estimate of $w$, i.e., our best guess without any other relevant information) to 0.89 (called the ``posterior'' estimate of $w$).
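The whole update above fits in a few lines of code. A minimal sketch, reusing the numbers from the Conditional Probability Table (the variable names are mine):

```python
# Bayes' Rule for the weather PGM: compute Pr(w = Sunny | a = GoOut).
prior = {"Sunny": 0.5, "Rainy": 0.5}              # Pr(w)
likelihood_go_out = {"Sunny": 0.8, "Rainy": 0.1}  # Pr(a = GoOut | w)

# Denominator: Pr(a = GoOut), marginalizing over the weather.
evidence = sum(likelihood_go_out[w] * prior[w] for w in prior)

# Posterior = likelihood * prior / evidence.
posterior_sunny = likelihood_go_out["Sunny"] * prior["Sunny"] / evidence
print(round(posterior_sunny, 2))  # 0.89
```

The same three-line pattern (likelihood times prior, divided by the marginal ``evidence'') works for any discrete two-node PGM; only the tables change.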

Footnotes
  1. If you've taken a stats class in high school or undergrad, you probably learned that a probability $P(X)$ represents ``how likely the event $X$ is''. That definition reflects the ``frequentist'' philosophy of probability, but in this book we instead adopt a ``Bayesian'' philosophy, which foregrounds the model-builder (you) by treating a probability $\Pr(X)$ as representing ``how likely \textit{we think} the event $X$ is''. Notice the subtle but important difference in wording.

  2. I said it explicitly in the first sentence, but from here on out basically just add “to me” to the beginning of all these opinionated statements, in your head. Disclaimer complete.