Week 3: Causality and the Scientific Method

DSUA111: Data Science for Everyone, NYU, Fall 2020

TA Jeff, jpj251@nyu.edu

https://github.com/jpowerj/dsua111-sections

Zoom Room

https://nyu.zoom.us/j/6821254378

Same URL for both sections (listed here in case you want a reminder of your Lab/Section number):

  • Lab 007, Fridays 3:30pm-4:20pm
  • Lab 004, Fridays 4:55pm-5:45pm

Outline

  1. Doing HW1
  2. Recap: Data? Science?
    • The Scientific Method
    • Measurement Issues
  3. Correlation vs. Causation
    • Correlation Pitfalls
  4. Causality and Counterfactuals
    • The Fundamental Problem of Causal Inference
  5. Experiments: How Do They Help Us?

1. Recap: Data? Science?

Data

  • More than numbers (think "Symbolic Systems")
  • It does not "speak for itself", but must be interpreted
In [9]:
%%HTML
<video width="80%" controls>
      <source src="looked_at_the_data.mp4" type="video/mp4">
</video>

Science

  • Doubt and skepticism
  • Disproving theories
    • "In so far as a scientific statement speaks about reality, it must be falsifiable; and in so far as it is not falsifiable, it does not speak about reality." (Popper)
  • Asking questions
  • Humility: any one study is a (usually very) imperfect snapshot

Scientific Method

  • Data <-> Theory
  • Theory <-> Data

  • No proving!
  • Collective endeavor, over decades/centuries!
    • "Standing on the shoulders of giants"

Zooming In

2. Measurement Issues

What does this look like in practice?

(Berger, Daniel, William Easterly, Nathan Nunn, and Shanker Satyanath. 2013. "Commercial Imperialism? Political Influence and Trade during the Cold War." American Economic Review, 103 (2): 863-96.)

The Variables

How Are They Measured?

Crucial point here (if you remember nothing else!):

  • Before we can even start thinking about the relationship between two variables $X$ and $Y$, we need to know exactly how each of them is measured
  • So when you read/hear some claim like "New study finds [] causes []", make sure you know exactly what's going in those blanks!

3. Correlation vs. Causation

  • Correlation: $X$ and $Y$ change "together" -- higher values of $X$ tend to "co-occur" with higher values of $Y$ (positive correlation) or with lower values of $Y$ (negative correlation)

  • Causation: ...it's trickier than this. Why? Let's find out.
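To make "change together" concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The data (`hours_studied`, `exam_score`) are made up for illustration and do not come from any study; the helper name `pearson_r` is also just a choice for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much X and Y move together around their means (numerator of r)
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own (used to rescale r into [-1, 1])
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

# Made-up data for six hypothetical students (not from any real study)
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 60, 61, 70, 74, 83]
points_missed = [100 - s for s in exam_score]

r_positive = pearson_r(hours_studied, exam_score)     # close to +1
r_negative = pearson_r(hours_studied, points_missed)  # close to -1
print(round(r_positive, 2), round(r_negative, 2))
```

Note that nothing in this calculation says which variable (if either) is doing the causing: the formula is symmetric in $X$ and $Y$.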

4. Correlation Pitfalls

Spurious Correlations

Omitted variables

  • [Umbrellas] cause [car accidents]! (Days with high umbrella use also have high car accident frequency! The omitted variable is rain, which drives both.)
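The umbrella example can be sketched as a small simulation. Everything here is an assumption made for illustration: the 30% chance of rain, the umbrella and accident counts, and the variable names. In this made-up world, rain raises both umbrella use and accidents, while umbrellas have no effect on accidents at all.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

rng = random.Random(0)  # fixed seed so the simulation is repeatable

rain, umbrellas, accidents = [], [], []
for _ in range(5000):
    rainy = rng.random() < 0.3  # assumed: 30% of days are rainy
    rain.append(rainy)
    # Rain drives BOTH variables; they are drawn independently given the weather
    umbrellas.append(rng.gauss(80 if rainy else 10, 5))
    accidents.append(rng.gauss(12 if rainy else 6, 2))

# Ignoring the weather: umbrellas and accidents look strongly related
r_all_days = pearson_r(umbrellas, accidents)

# Looking only at rainy days: the relationship mostly disappears
rainy_umbrellas = [u for u, is_rainy in zip(umbrellas, rain) if is_rainy]
rainy_accidents = [a for a, is_rainy in zip(accidents, rain) if is_rainy]
r_rainy_days = pearson_r(rainy_umbrellas, rainy_accidents)

print("all days:  ", round(r_all_days, 2))
print("rainy days:", round(r_rainy_days, 2))
```

Comparing days with the same weather is one simple way to "account for" an omitted variable once you know what it is. The hard part in practice is knowing which variables you have omitted.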

5. Causality and Counterfactuals

  • Causality: the holy grail of science
  • Causal statements require counterfactuals: What would have happened?
  • "Easy" mode: an experiment (in the lab or "in nature")
  • Hard mode: observational data

The Fundamental Problem of Causal Inference

  • Remember John Snow: the causal effect of [drinking "bad" water] on the [infection status] for a particular person on a given day is the difference between:
    • (a) Their infection status after drinking the water, and
    • (b) The infection status they would have had on the same day had they not drunk "bad" water.
  • Fundamental Problem of Causal Inference:
    • We never get to see both scenarios for the same unit (person) at the same time, and so
    • We can never know the causal effect with certainty!
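A small sketch of why this is a "fundamental" problem. In a simulation we can generate both potential outcomes for each person, but the data a researcher would actually see contains only one of them. The probabilities (0.6, 0.1, 0.5) and field names below are invented for illustration and are not estimates from Snow's data.

```python
import random

rng = random.Random(1)  # fixed seed so the simulation is repeatable

people = []
for person_id in range(8):
    # Both potential outcomes exist here only because this is a simulation
    infected_if_bad_water = int(rng.random() < 0.6)    # 1 = infected, 0 = not
    infected_if_clean_water = int(rng.random() < 0.1)
    drank_bad_water = rng.random() < 0.5
    people.append((person_id, infected_if_bad_water,
                   infected_if_clean_water, drank_bad_water))

# What a researcher actually observes: one outcome per person, never both
observed = []
for person_id, y_bad, y_clean, drank in people:
    observed.append({
        "person": person_id,
        "drank_bad_water": drank,
        "infected_if_bad_water": y_bad if drank else None,       # None = unobservable
        "infected_if_clean_water": None if drank else y_clean,
    })

for row in observed:
    print(row)
```

Every row has exactly one `None`, so the individual-level difference (a) minus (b) can never be computed from observed data alone.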

So, what is to be done?!? Two options...

  • Forget Everything And Run?

Face Everything And Rise

  • Find good comparison cases: Treatment and Control
  • Without a control group, you cannot make inferences!
    • (Snow needed at least some people who did not drink the pump water... why?)

6. Controlled Experiments: How Do They Help Us?

  • Random Assignment: Vietnam War/Second Indochina War Draft
    • Key point: makes treatment and control groups similar, on average, without us having to do any work!
    • (e.g., don't need to worry about "pairing up" similar treatment+control units)
  • No more Selection Effects
  • Omitted variables are balanced, on average, across BOTH Treatment and Control groups, so they cannot explain a difference between the groups
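The points above can be sketched with a simulation in which treatment is assigned by coin flip. All of the numbers are assumptions for illustration: a hypothetical background variable `health`, a true treatment effect of 2.0, and 10,000 units. The researcher never looks at `health` when assigning treatment, yet the two groups end up with nearly the same average.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the simulation is repeatable
TRUE_EFFECT = 2.0        # assumed effect of treatment, chosen for this sketch

units = []
for _ in range(10000):
    health = rng.gauss(50, 10)    # background variable we never "control for"
    treated = rng.random() < 0.5  # random assignment: a fair coin flip
    outcome = 0.5 * health + TRUE_EFFECT * treated + rng.gauss(0, 1)
    units.append((health, treated, outcome))

treated_health = [h for h, t, _ in units if t]
control_health = [h for h, t, _ in units if not t]
treated_outcome = [y for _, t, y in units if t]
control_outcome = [y for _, t, y in units if not t]

# Balance check: the groups are similar on average, with no pairing needed
health_gap = mean(treated_health) - mean(control_health)

# Simple difference in means between Treatment and Control
estimated_effect = mean(treated_outcome) - mean(control_outcome)

print("gap in average health:", round(health_gap, 2))
print("estimated effect:     ", round(estimated_effect, 2))
```

Because `health` is balanced across the groups, the difference in average outcomes lands close to the assumed true effect.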

Complications: Selection

  • TL;DR: Why did this person (unit) end up in the treatment group? Why did this other person (unit) end up in the control group?
    • Are there systematic differences?
  • Vietnam/Indochina Draft: Why can't we just study [men who join the military] versus [men who don't], and take the difference as a causal estimate?
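A sketch of what goes wrong when units choose their own group. The setup is hypothetical: a background variable `health` affects the outcome, the true treatment effect is assumed to be 2.0, and healthier units are assumed to be much more likely to opt in (80% versus 20%). None of these numbers come from the draft data.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the simulation is repeatable
TRUE_EFFECT = 2.0       # assumed effect of treatment, chosen for this sketch

units = []
for _ in range(10000):
    health = rng.gauss(50, 10)
    # Self-selection: healthier units are far more likely to take the treatment
    treated = rng.random() < (0.8 if health > 50 else 0.2)
    outcome = 0.5 * health + TRUE_EFFECT * treated + rng.gauss(0, 1)
    units.append((health, treated, outcome))

treated_health = [h for h, t, _ in units if t]
control_health = [h for h, t, _ in units if not t]
treated_outcome = [y for _, t, y in units if t]
control_outcome = [y for _, t, y in units if not t]

# Systematic difference between the groups BEFORE any treatment effect
health_gap = mean(treated_health) - mean(control_health)

# Naive comparison mixes the treatment effect with the health difference
naive_estimate = mean(treated_outcome) - mean(control_outcome)

print("gap in average health:", round(health_gap, 2))
print("naive estimate:", round(naive_estimate, 2), "vs. true effect:", TRUE_EFFECT)
```

The naive difference overstates the effect because the treated group was healthier to begin with. The same logic applies to comparing men who join the military with men who don't.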

Complications: Compliance

  • We ideally want people assigned to the treatment to take the treatment, and people assigned to the control to take the control.
  • "Compliance" is the degree to which this is actually true in your experiment
    • High compliance = most people actually took what they were assigned
    • Low compliance = lots of people who were assigned to treatment actually took control, and vice versa
  • What problems might there be with compliance in the Draft example?
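One simple way to quantify compliance, sketched with ten made-up units. The lists `assigned` and `received` are hypothetical ("T" = treatment, "C" = control), and the share-that-matches definition is just one reasonable way to put a number on the idea above.

```python
# Hypothetical experiment with ten units; "T" = treatment, "C" = control
assigned = ["T", "T", "T", "T", "T", "C", "C", "C", "C", "C"]
received = ["T", "T", "T", "C", "C", "C", "C", "C", "C", "T"]

pairs = list(zip(assigned, received))

# Overall compliance: share of units that took what they were assigned
overall_compliance = sum(a == r for a, r in pairs) / len(pairs)

# Compliance within each assigned group
treatment_group = [r for a, r in pairs if a == "T"]
control_group = [r for a, r in pairs if a == "C"]
treatment_compliance = treatment_group.count("T") / len(treatment_group)
control_compliance = control_group.count("C") / len(control_group)

print("overall:  ", overall_compliance)
print("treatment:", treatment_compliance)
print("control:  ", control_compliance)
```

Splitting compliance by assigned group matters because non-compliance is often one-sided: for example, people assigned to treatment may avoid it more often than people assigned to control seek it out.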

7. The Biggest Complication: Observational Data

  • In observational studies, researchers have no control over assignment to treatment/control 😨
  • On the one hand... Forget Everything And Run, if you can.
  • On the other hand... statisticians over the last ~4 centuries have developed fancy causal inference tools/techniques to help us Face Everything And Rise

Causal Terminology for Observational Studies