Berkson’s Paradox Example
To see Berkson’s paradox in action, consider a simplified model of hospital admissions:
- There are two independently occurring diseases, each of which occurs with the same probability
- The healthcare system functions in such a way that anyone who is found to have either disease is immediately admitted to a specialized hospital, which treats only these two diseases
The Data-Generating Process, in this case, looks as follows:
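- Generate two independent exogenous noise variables, one for each disease
- Set the first disease indicator from the first noise variable
- Set the second disease indicator from the second noise variable
- Set the hospital-admission indicator to 1 if either disease indicator is 1, 0 otherwise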
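The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the names `d1`, `d2`, and `h`, the uniform noise thresholded at `p`, and the value `p = 0.1` are all hypothetical choices, not taken from the original model.

```python
import random


def sample_person(p, rng):
    """Draw one person from the simplified hospital-admission DGP.

    Assumption: each disease indicator is 1 when its uniform noise
    variable falls below p (a hypothetical functional form).
    """
    u1 = rng.random()  # exogenous noise for the first disease
    u2 = rng.random()  # exogenous noise for the second disease
    d1 = int(u1 < p)   # first disease indicator
    d2 = int(u2 < p)   # second disease indicator
    h = int(d1 == 1 or d2 == 1)  # admitted if either disease is present
    return d1, d2, h


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [sample_person(0.1, rng) for _ in range(100_000)]

admitted_share = sum(h for _, _, h in population) / len(population)
print(f"share admitted: {admitted_share:.3f}")
```

Because admission is a deterministic function of the two disease indicators, all of the randomness lives in the two noise variables.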
We can therefore represent the connections between the variables using the following probabilistic graphical model (PGM), in which each disease has an arrow into hospital admission:
From this DGP (or just from the earlier fact that the diseases occur independently), we immediately have the two facts:
Analyzing Hospital Admissions Data
Now, let’s say we are analyzing data from the hospital, so that every record in our dataset comes from an admitted patient.
The first step, which is not yet an example of Berkson’s paradox (just an application of Bayes’ theorem), is to compute the new disease probabilities given the observation that the person has been admitted:
and by symmetry we also have
These two quantities do fit our intuition, generally, since we can reason that we’re more likely to encounter a person with disease
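The Bayes' theorem step described in this section can be checked with exact arithmetic. This is a sketch under assumptions: the function name and the value `p = 1/10` are hypothetical, and both diseases are assumed to share the same probability `p`.

```python
from fractions import Fraction


def p_disease_given_admitted(p):
    """P(disease = 1 | admitted = 1) for one of the two diseases.

    Uses Bayes' theorem with P(admitted = 1 | disease = 1) = 1, since
    having either disease guarantees admission. Requires 0 < p <= 1.
    """
    p = Fraction(p)
    p_admitted = 1 - (1 - p) ** 2  # admitted unless both diseases are absent
    return p / p_admitted


print(p_disease_given_admitted(Fraction(1, 10)))  # 10/19
```

With this hypothetical `p`, the probability of having a given disease rises from 1/10 in the general population to 10/19 among admitted patients.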
Computing the Joint pdf
There are many ways we could proceed to “build up to” having the full joint pdf of the two disease indicators and the hospital-admission indicator.
As a reminder here, in looking for the joint pdf, we’re looking for the missing values in the following table. I’ve started by placing a 0 in the logically-impossible rows:
- Since having either disease guarantees admission into the hospital, any row where a disease indicator is 1 but the admission indicator is 0 is not possible
- Since the hospital only treats these two diseases, admission is not possible when both disease indicators are 0
First disease | Second disease | Admitted | Probability |
0 | 0 | 0 | |
0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | |
1 | 0 | 0 | 0 |
1 | 0 | 1 | |
1 | 1 | 0 | 0 |
1 | 1 | 1 | |
From this table, we see that there are only four quantities we need to compute:
Let’s try tackling these one-by-one. First:
Next:
By symmetry, we also have
Or, if we want to compute it directly for sanity:
Thus our final pdf table is:
First disease | Second disease | Admitted | Probability |
0 | 0 | 0 | |
0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | |
1 | 0 | 0 | 0 |
1 | 0 | 1 | |
1 | 1 | 0 | 0 |
1 | 1 | 1 | |
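The table-filling procedure can be sketched by enumerating all eight rows. This is an illustration under assumptions: the function name and the value `p = 1/10` are hypothetical, both diseases are assumed to share the same probability `p`, and the column order (first disease, second disease, admitted) is inferred from the pattern of impossible rows.

```python
from fractions import Fraction
from itertools import product


def joint_pmf(p):
    """Joint probability of (first disease, second disease, admitted).

    A row gets probability 0 whenever the admission indicator disagrees
    with "admitted if and only if at least one disease is present".
    """
    p = Fraction(p)
    table = {}
    for d1, d2, h in product((0, 1), repeat=3):
        prob_d1 = p if d1 else 1 - p
        prob_d2 = p if d2 else 1 - p
        consistent = h == int(d1 or d2)
        table[(d1, d2, h)] = prob_d1 * prob_d2 if consistent else Fraction(0)
    return table


for row, prob in joint_pmf(Fraction(1, 10)).items():
    print(*row, prob, sep=" | ")
```

Only four rows receive nonzero probability, and those four entries sum to 1.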
Berkson’s Paradox
Now, the point where Berkson’s Paradox enters the picture is when we try to evaluate the independence of the two diseases, solely on the basis of the hospital admissions data!
To see this, let’s now look at whether observing
The numerator value of
For the denominator, we can sum the probabilities across every row where
This means that the full result, dividing the numerator by the denominator, is
This reveals the issue: that if we only ever observe data on hospital patients, i.e., data where
in other words, we may easily be “tricked” into concluding that observing
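The numerator-over-denominator computation in this section can be reproduced with a short sketch. The names `joint` and `prob` and the value `p = 1/10` are hypothetical, and both diseases are assumed to share the same probability `p`.

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 10)  # hypothetical disease probability

# Joint pmf over (first disease, second disease, admitted).
joint = {
    (d1, d2, h): (
        (p if d1 else 1 - p) * (p if d2 else 1 - p)
        if h == int(d1 or d2)
        else Fraction(0)
    )
    for d1, d2, h in product((0, 1), repeat=3)
}


def prob(event):
    """Total probability of the rows for which event(d1, d2, h) is truthy."""
    return sum(pr for row, pr in joint.items() if event(*row))


# Among admitted patients, before looking at the second disease:
p_d1_given_h = prob(lambda d1, d2, h: d1 and h) / prob(lambda d1, d2, h: h)
# Among admitted patients who are known to have the second disease:
p_d1_given_d2_h = (
    prob(lambda d1, d2, h: d1 and d2 and h) / prob(lambda d1, d2, h: d2 and h)
)

print(p_d1_given_h)     # 10/19
print(p_d1_given_d2_h)  # 1/10
```

Within the admitted population, learning that a patient has one disease lowers the probability of the other from 10/19 to 1/10, even though the diseases are independent in the general population.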
The do-Operator
Now, let’s re-compute these probabilities, using
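- Generate two independent exogenous noise variables, one for each disease
- Set the first disease indicator from the first noise variable
- Set the second disease indicator from the second noise variable
- Set the hospital-admission indicator to 1 if either disease indicator is 1, 0 otherwise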
By applying this
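- Generate two independent exogenous noise variables, one for each disease
- Set the first disease indicator from the first noise variable
- Set the second disease indicator from the second noise variable
- Set the hospital-admission indicator to the value fixed by the intervention, regardless of the disease indicators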
And from this post-
and thus we have causal independence:
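The post-intervention computation can be checked with a small sketch. It assumes the intervention forces the admission indicator to 1 for everyone, which is an assumption about the intended intervention. The function names and the value `p = 1/10` are likewise hypothetical.

```python
from fractions import Fraction
from itertools import product


def post_intervention_pmf(p, h_forced=1):
    """Joint pmf after forcing the admission indicator to h_forced.

    Admission no longer depends on the diseases, so each row's mass is
    just the product of the two independent disease probabilities.
    """
    p = Fraction(p)
    return {
        (d1, d2, h): (
            (p if d1 else 1 - p) * (p if d2 else 1 - p)
            if h == h_forced
            else Fraction(0)
        )
        for d1, d2, h in product((0, 1), repeat=3)
    }


def conditional(pmf, target, given):
    """P(target | given) computed by summing rows of the pmf."""
    num = sum(pr for row, pr in pmf.items() if target(*row) and given(*row))
    den = sum(pr for row, pr in pmf.items() if given(*row))
    return num / den


pmf = post_intervention_pmf(Fraction(1, 10))
has_d1 = lambda d1, d2, h: d1 == 1
print(conditional(pmf, has_d1, lambda d1, d2, h: d2 == 1))  # 1/10
print(conditional(pmf, has_d1, lambda d1, d2, h: True))     # 1/10
```

Under the forced admission, learning about the second disease leaves the probability of the first unchanged, in contrast to conditioning on observed admission.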