Week 10: Text-as-Data

DSAN 5650: Causal Inference for Computational Social Science
Summer 2025, Georgetown University

Author: Jeff Jacobs
Published: Wednesday, July 23, 2025


Schedule

Today’s Planned Schedule:

| | Start | End | Topic |
|---|---|---|---|
| Lecture | 6:30pm | 6:45pm | Final Projects: Dependent vs. Independent Vars |
| | 6:45pm | 7:10pm | Instrumental Variables Lite |
| | 7:10pm | 8:00pm | Text-as-Data Part 1: TAD in General |
| Break! | 8:00pm | 8:10pm | |
| | 8:10pm | 9:00pm | Text-as-Data Part 2: Causal Text Analysis |

Final Project Timeline

  • Proposal (Abstract on Notion):
    • Submitted to instructors by Tuesday, July 15th, 6:30pm EDT ✅
    • Approved by an instructor by Tuesday, July 22nd, 6:30pm EDT ✅
  • Final Draft:
    • Submitted to instructors for review by Friday, August 1st, 5:59pm EDT
    • Approved by an instructor by Monday, August 4th, 11:59pm EDT
  • Final Submission:
    • Submitted via Canvas by Friday, August 8th, 5:59pm EDT

Independent vs. Dependent Variables

  • Starting point: puzzle in social world!
    • Why are fertility rates dropping in these countries?
    • What explains the move from “quietism” to “Political Islam”?
    • What behaviors produce positive health outcomes in old age?

Independent Variable / Treatment \(D\)

  • What happens when…
    • [Someone gets a degree/internship]
    • [There are casualties in a conflict]
  • Start with independent var \(\Rightarrow\) “What are the effects of this cause?”

Dependent Variable / Outcome \(Y\)

  • “I wonder what explains this?”
    • [Differences in earnings]
    • [Probability of news story]
  • Start with dependent var \(\Rightarrow\) “What are the causes of this effect?”

Instrumental Variables

If randomization works for obtaining causal effects…

…then find something that is *already* random in the causal system, and use it (e.g., via matching) to “force” the same as-if-randomized scenario!

General form: \(\text{Effect}(D \rightarrow Y) = \frac{\text{Effect}(Z \rightarrow Y)}{\text{Effect}(Z \rightarrow D)}\) (Try “plugging in” \(Z\) = Coin Flip!)

\[ \beta_{\text{IV}}^{\text{Wald}} = \frac{ \mathbb{E}[Y_i \mid Z_i = 1] - \mathbb{E}[Y_i \mid Z_i = 0] }{ \mathbb{E}[D_i \mid Z_i = 1] - \mathbb{E}[D_i \mid Z_i = 0] }, \; \beta_{\text{IV}} = \frac{\text{Cov}[Y, Z]}{\text{Cov}[D,Z]} \]
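To see the Wald form in action, here is a minimal simulation sketch (all variable names and parameter values are illustrative, not from the course materials): a binary instrument \(Z\) shifts treatment \(D\), a hidden confounder biases the naive comparison, and the ratio of reduced-form to first-stage differences recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(seed=5650)
n = 100_000

u = rng.normal(size=n)                      # unobserved confounder
z = rng.integers(0, 2, size=n)              # instrument: literally a coin flip
# Treatment uptake depends on the instrument AND the confounder
d = (0.8 * z + 0.5 * u + rng.normal(size=n) > 0.4).astype(int)
y = 2.0 * d + 1.5 * u + rng.normal(size=n)  # true effect of D on Y is 2.0

# Wald form: ratio of reduced-form to first-stage differences in means
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
# Equivalent covariance form: Cov(Y, Z) / Cov(D, Z)
cov_ratio = np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

print(f"Naive difference in means: {y[d == 1].mean() - y[d == 0].mean():.2f}")  # biased
print(f"Wald estimate:             {wald:.2f}")       # close to 2.0
print(f"Covariance-ratio estimate: {cov_ratio:.2f}")  # identical up to rounding
```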

Birthday as Instrument

  • Mini-Lab Time! (A simulated sketch of the idea follows below.)
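A hedged preview of the idea, in the spirit of the classic quarter-of-birth design (the data and all coefficients here are simulated, not the mini-lab's actual materials): birth timing nudges years of schooling, schooling affects earnings, and the covariance ratio isolates the schooling effect even when unobserved ability confounds OLS.

```python
import numpy as np

rng = np.random.default_rng(seed=1991)
n = 50_000

ability = rng.normal(size=n)            # unobserved: confounds schooling & wages
born_q1 = rng.integers(0, 2, size=n)    # instrument: an accident of birth timing
# First stage: Q1 births end up with slightly less schooling
schooling = 12 + 0.3 * ability - 0.25 * born_q1 + rng.normal(scale=0.5, size=n)
# True "return to schooling" on log wages is 0.08
log_wage = 0.08 * schooling + 0.2 * ability + rng.normal(scale=0.3, size=n)

# OLS is confounded by ability; the IV covariance ratio is not
ols = np.cov(log_wage, schooling)[0, 1] / np.var(schooling, ddof=1)
iv = np.cov(log_wage, born_q1)[0, 1] / np.cov(schooling, born_q1)[0, 1]

print(f"OLS estimate: {ols:.3f}")  # noticeably above 0.08
print(f"IV estimate:  {iv:.3f}")   # close to the true 0.08
```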

Text-as-Data Part 1: TAD in General

  • Computers don’t exactly “read” text! They process numeric representations of some feature(s) of the text
    • Ex: sentiment, topic, embedding in semantic space
  • \(\Rightarrow\) When we do causal inference with text, we’re not studying \(D \rightarrow Y\) itself! Instead, we study:
    • Text as Outcome: \(D \rightarrow g(Y)\) and/or
    • Text as Treatment: \(g(D) \rightarrow Y\) (see the sketch below for a toy \(g\))
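A minimal sketch of what a representation \(g(\cdot)\) can look like (the toy lexicon and sentences are invented purely for illustration): we never regress on raw text, only on numeric features of it, here a crude sentiment score.

```python
import numpy as np

# Toy open-ended responses from treated (D=1) and control (D=0) respondents
responses = [
    ("the debate was divisive and the candidates seemed dishonest", 1),
    ("polarization is terrible and trust is broken",                1),
    ("taxes are fine and the economy seems good",                   0),
    ("immigration policy is fair and sensible",                     0),
]

# A toy lexicon: the feature g() is a crude sentiment score, nothing more
POSITIVE = {"fine", "good", "fair", "sensible"}
NEGATIVE = {"divisive", "dishonest", "terrible", "broken"}

def g(text: str) -> int:
    """Map a document to one number: (# positive words) - (# negative words)."""
    words = text.split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Text-as-outcome: the estimand involves g(Y), never the raw strings
scores = np.array([g(text) for text, _ in responses])
d = np.array([treat for _, treat in responses])
print(f"E[g(Y) | D=1] - E[g(Y) | D=0] = {scores[d == 1].mean() - scores[d == 0].mean():.1f}")
```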

Text-as-Data Part 2: Causal Inferences with Text

(The necessity of sample splitting!)

  • Recall the media effects example from Week 3; here, consider an experiment where:
    • The treatment group (\(D_i = 1\)) watches a presidential debate (the control group doesn’t watch anything)
    • Outcome \(Y_i\): we estimate a topic model of each respondent’s verbal answer to “what do you think are the most important issues in US politics today?”
| | \(Y_i \mid \textsf{do}(D_i \leftarrow 1)\) | \(Y_i \mid \textsf{do}(D_i \leftarrow 0)\) |
|---|---|---|
| Person 1 | Candidate’s Morals | Taxes |
| Person 2 | Candidate’s Morals | Taxes |
| Person 3 | Polarization | Immigration |
| Person 4 | Polarization | Immigration |

Table 1: From Egami et al. (2022)

“Discovered” Topics Depend on the Data 😟

| | \(Y_i \mid \textsf{do}(D_i \leftarrow 1)\) | \(Y_i \mid \textsf{do}(D_i \leftarrow 0)\) |
|---|---|---|
| Person 1 | Candidate’s Morals | Taxes |
| Person 2 | Candidate’s Morals | Taxes |
| Person 3 | Polarization | Immigration |
| Person 4 | Polarization | Immigration |

Table 2: From Egami et al. (2022)

| | Actual Assignment | Outcome \(Y_i\) |
|---|---|---|
| Person 1 | \(D_1 = 1\) | Morals |
| Person 2 | \(D_2 = 1\) | Morals |
| Person 3 | \(D_3 = 0\) | Immigration |
| Person 4 | \(D_4 = 0\) | Immigration |

Table 3: Realized assignments and outcomes in World 1

| | Actual Assignment | Outcome \(Y_i\) |
|---|---|---|
| Person 1 | \(D_1 = 1\) | Morals |
| Person 2 | \(D_2 = 0\) | Taxes |
| Person 3 | \(D_3 = 1\) | Polarization |
| Person 4 | \(D_4 = 0\) | Immigration |

Table 4: Realized assignments and outcomes in World 2

Same four people, same potential outcomes, yet a topic model fit to the realized responses would “discover” only {Morals, Immigration} in World 1 but {Morals, Taxes, Polarization, Immigration} in World 2: the measurement function \(g\) itself depends on the random treatment assignment!

The Solution? Sample Splitting!

  • Machine learning ran into this problem long ago: the goal is a model that generalizes, not one that memorizes!
  • The fix: randomly split the sample, discover the topics (the measurement function \(g\)) on one half, then freeze \(g\) and estimate treatment effects on the other half (sketched below)
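A minimal sample-splitting sketch (the 50/50 split, the LDA model, and the placeholder corpus are my assumptions, not Egami et al.'s exact procedure): topics are discovered on the discovery half only, then the frozen model measures outcomes on the estimation half.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Placeholder corpus: open-ended responses plus treatment indicators
docs = ["the candidate's morals worry me", "taxes are far too high",
        "polarization keeps getting worse", "immigration needs real reform"] * 25
d = np.array([1, 0, 1, 0] * 25)

# Split FIRST, so topic discovery never sees the estimation half
docs_disc, docs_est, d_disc, d_est = train_test_split(
    docs, d, test_size=0.5, random_state=5650)

# Discovery half: learn the measurement function g (here, an LDA topic model)
vec = CountVectorizer(stop_words="english")
X_disc = vec.fit_transform(docs_disc)
lda = LatentDirichletAllocation(n_components=2, random_state=5650).fit(X_disc)

# Estimation half: apply the FROZEN g, then estimate effects on topic shares
theta_est = lda.transform(vec.transform(docs_est))   # per-document topic shares
effect = theta_est[d_est == 1].mean(axis=0) - theta_est[d_est == 0].mean(axis=0)
print("Estimated effect of D on each topic's share:", effect.round(3))
```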

Topic Models

  • The intuition: model latent topics “underlying” the observed words
| Section | Keywords |
|---|---|
| U.S. News | state, court, federal, republican |
| World News | government, country, officials, minister |
| Arts | music, show, art, dance |
| Sports | game, league, team, coach |
| Real Estate | home, bedrooms, bathrooms, building |
  • Already, just by classifying articles according to these keyword counts (a sketch follows the table below), we get decent accuracy:
| | Arts | Real Estate | Sports | U.S. News | World News |
|---|---|---|---|---|---|
| Correct | 3020 | 690 | 4860 | 1330 | 1730 |
| Incorrect | 750 | 60 | 370 | 1100 | 590 |
| Accuracy | 0.801 | 0.920 | 0.929 | 0.547 | 0.746 |
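A sketch of a keyword-count classifier like the one behind this table (the keyword sets come from the table above; the scoring rule and tie-breaking are illustrative assumptions): assign each article to the section whose keywords appear most often.

```python
# Keyword lists from the table above; scoring and tie-breaking are illustrative
KEYWORDS = {
    "U.S. News":   {"state", "court", "federal", "republican"},
    "World News":  {"government", "country", "officials", "minister"},
    "Arts":        {"music", "show", "art", "dance"},
    "Sports":      {"game", "league", "team", "coach"},
    "Real Estate": {"home", "bedrooms", "bathrooms", "building"},
}

def classify(article: str) -> str:
    """Assign each article to the section whose keywords it mentions most."""
    words = article.lower().split()
    counts = {sec: sum(w in kws for w in words) for sec, kws in KEYWORDS.items()}
    return max(counts, key=counts.get)  # ties go to the first section listed

print(classify("The team won the league game thanks to their coach"))  # Sports
print(classify("Officials say the minister left the government"))      # World News
```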

Topic Models as PGMs

[Figure: the graphical model underlying topic models, from Blei (2012)]
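Blei (2012) presents LDA as a generative probabilistic model; here is a minimal simulation of that generative story (the vocabulary, topic count, and Dirichlet hyperparameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=2012)

vocab = np.array(["court", "state", "music", "dance", "game", "team"])
n_topics, alpha, eta = 2, 0.5, 0.5   # made-up hyperparameters

# Each topic is a Dirichlet-distributed distribution over the vocabulary
topics = rng.dirichlet([eta] * len(vocab), size=n_topics)

def generate_document(n_words: int = 8) -> str:
    """LDA's generative story: draw the doc's topic mixture, then for each
    word draw a latent topic and then a word from that topic."""
    theta = rng.dirichlet([alpha] * n_topics)         # document's topic shares
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)             # latent topic assignment
        words.append(rng.choice(vocab, p=topics[z]))  # observed word
    return " ".join(words)

for _ in range(3):
    print(generate_document())
```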

…Unlocks a world of social modeling through text!

Cross-Sectional Analysis

[Figure from Blaydes, Grimmer, and McQueen (2018)]

Influence Over Time

[Figure from Barron et al. (2018)]

Textual Influence Over Time

References

Barron, Alexander T. J., Jenny Huang, Rebecca L. Spang, and Simon DeDeo. 2018. “Individuals, Institutions, and Innovation in the Debates of the French Revolution.” Proceedings of the National Academy of Sciences 115 (18): 4607–12. https://doi.org/10.1073/pnas.1717729115.
Blaydes, Lisa, Justin Grimmer, and Alison McQueen. 2018. “Mirrors for Princes and Sultans: Advice on the Art of Governance in the Medieval Christian and Islamic Worlds.” The Journal of Politics 80 (4): 1150–67. https://doi.org/10.1086/699246.
Blei, David M. 2012. “Probabilistic Topic Models.” Communications of the ACM 55 (4): 77–84. https://doi.org/10.1145/2133806.2133826.
Egami, Naoki, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. 2022. “How to Make Causal Inferences Using Texts.” Science Advances 8 (42): eabg2652. https://doi.org/10.1126/sciadv.abg2652.