The question that motivated this list was: What is “Text-As-Data”? Why should I care?

My original, lazy answer was that, at this point, there are entire excursuses[1] on exactly this question in basically any field you could imagine. Then I decided I’d better give a quick overview of these resources (to show that I’m standing on the shoulders of giants), and that led to this list, which I will only skim in the workshop:

Social Science

  • Overviews
    • The paper that, in my mind, solidified text-as-data as a “legitimate” social-scientific tool is Roberts, Stewart, and Airoldi (2016)[2].
    • However, slides from a slightly more novice-friendly presentation of the work are available here.
    • These are probably less cited than they should be…
    • As for actual works rather than just surveys, my favorite text-as-data paper in social science is unquestionably Blaydes, Grimmer, and McQueen (2018) (noticing a trend in these authors?), but I’m biased as a political theorist doing text-as-data stuff…
  • Political Science
    • A canonical citation for text-as-data work in Political Science is Grimmer and Stewart (2013). More recently, I like Wilkerson and Casas (2017), which has an interestingly different take on the philosophy and “scientific” basis of text-as-data (TAD).
    • If you’re a political theorist (99% chance you’re not but… dare to dream) then I’d highly recommend London (2016). That’s the paper I cite in basically all of my work, along with Skinner (1969), to try and explain what a “Text-as-data approach to Cambridge School historiography” means.
    • Then for International Relations there’s a lot of cool stuff. I have two papers on this front that I can talk about, but for a sampling of what’s out there see Nielsen (2013).
  • Sociology
    • I’m less aware of “canonical” citations in Sociology, obviously, but there are a couple of Annual Review of Sociology articles that give great overviews, e.g., Evans and Aceves (2016).
    • And there is a ton of amazing text-as-data work in Sociology, but my favorite is probably Rule, Cointet, and Bearman (2015). Full disclosure I’m friends with the first author Alix :P.
  • Videos

Humanities

  • Skipping over a more detailed rant about the history (can an AI be trained to do GADAMERIAN HERMENEUTICS?), I’ll condense to just saying that Moretti (2013) is the book I always turn to for an “interface” and a vocabulary for “translating” between the humanities and text-as-data worlds[3].
  • Nowadays, however, I’m geeking out about the work coming out of David Mimno’s group, for example Baumer et al. (2017). You can watch a brilliant video presentation of it by David himself here.

STEM (Computer Science)

  • I won’t spend too much time on this, but if you’re interested in the algorithmic and statistical “roots” of text-as-data the foundational works for different methods are as follows:
    • Topic Modeling: Blei, Ng, and Jordan (2003) (the paper that “mainstreamed” topic modeling and thus text-as-data more generally[4]).
    • Word Embeddings: Mikolov et al. (2013), though the REAL foundation is JR Firth’s 1957 exclamation “You shall know a word by the company it keeps!” (Firth (1957), or see here on page 11), which gave rise to the “Firthian distributional hypothesis” that word embeddings simply… implement :P
    • Neural Networks: This is the technology that… makes 2018 an interesting time to be alive. For example, all of Google Translate’s improvements in translation accuracy over its first 12 years (with hundreds of full-time researchers and software engineers) were DWARFED by simply switching the system from the complex statistical model they had developed to… dropping shit into a neural net and seeing what comes out. We still don’t fully understand what neural nets are doing internally or why they’re so effective. I’ll leave you with Andrej Karpathy’s aptly-named “The Unreasonable Effectiveness of Recurrent Neural Networks” (Generate infinitely many new Neural Bibles! Neural Irish folk songs! Neural Stir-Fry Recipes! …a deep rabbithole for sure).
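To make the Firthian idea from the Word Embeddings bullet a bit more concrete, here is a minimal sketch in Python. It is NOT the method from Mikolov et al. (2013), which learns dense vectors by training a predictive model; this is just a count-based illustration of “the company a word keeps”. The toy corpus, the window size, and the function names (`cooccurrence_vectors`, `cosine`) are all made up for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, invented purely for illustration.
toy_corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the dog chases the ball",
    "the puppy chases the ball",
]

vectors = cooccurrence_vectors(toy_corpus)
print("king vs queen:", cosine(vectors["king"], vectors["queen"]))
print("king vs dog:  ", cosine(vectors["king"], vectors["dog"]))
```

In this toy corpus “king” and “queen” appear in exactly the same contexts, so their similarity comes out at (essentially) 1, while “king” and “dog” only share the ubiquitous “the” and score lower. Real embeddings replace these raw counts with learned, low-dimensional vectors, but the intuition is the same.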

Bibliography

Baumer, Eric P. S., David Mimno, Shion Guha, Emily Quan, and Geri K. Gay. 2017. “Comparing Grounded Theory and Topic Modeling: Extreme Divergence or Unlikely Convergence?” Journal of the Association for Information Science and Technology 68 (6): 1397–1410. doi:10.1002/asi.23786.

Blaydes, Lisa, Justin Grimmer, and Alison McQueen. 2018. “Mirrors for Princes and Sultans: Advice on the Art of Governance in the Medieval Christian and Islamic Worlds.” The Journal of Politics 80 (4): 1150–67. doi:10.1086/699246.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (March): 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937.

Evans, James A., and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social Theory.” Annual Review of Sociology 42 (1): 21–50. doi:10.1146/annurev-soc-081715-074206.

Firth, John Rupert. 1957. Papers in Linguistics, 1934–1951. Oxford University Press.

Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97. https://www.jstor.org/stable/24572662.

London, Jennifer A. 2016. “Re-Imagining the Cambridge School in the Age of Digital Humanities.” Annual Review of Political Science 19 (1): 351–73. doi:10.1146/annurev-polisci-061513-115924.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” arXiv:1310.4546 [Cs, Stat], October. http://arxiv.org/abs/1310.4546.

Moretti, Franco. 2013. Distant Reading. London: Verso.

Nielsen, Richard Alexander. 2013. “The Lonely Jihadist: Weak Networks and the Radicalization of Muslim Clerics,” September. https://dash.harvard.edu/handle/1/11124850.

Roberts, Margaret E., Brandon M. Stewart, and Edoardo M. Airoldi. 2016. “A Model of Text for Experimentation in the Social Sciences.” Journal of the American Statistical Association 111 (515): 988–1003. doi:10.1080/01621459.2016.1141684.

Rule, Alix, Jean-Philippe Cointet, and Peter S. Bearman. 2015. “Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790–2014.” Proceedings of the National Academy of Sciences 112 (35): 10837–44. doi:10.1073/pnas.1512221112.

Skinner, Quentin. 1969. “Meaning and Understanding in the History of Ideas.” History and Theory 8 (1): 3–53. doi:10.2307/2504188.

Wilkerson, John, and Andreu Casas. 2017. “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges.” Annual Review of Political Science 20 (1): 529–44. doi:10.1146/annurev-polisci-052615-025542.


  1. Yes, apparently the plural of “excursus” is “excursuses”… I had been pretentiously saying “excursi” for years :(.

  2. You’ll probably quickly notice that this paper (and the presentation) are kind of auxiliary to the “main event”: their R package called stm (Structural Topic Modeling). I’ll talk about it in the workshop, but basically (1) I think stm provides a sort of semantic “framing”/vocabulary for text-as-data that is perfect for working at the intersection of social science and computer science, and yet (2) I have never been able to get stm to work :| so I’ll be using a different set of libraries. But just know that, IN THEORY, stm is the way I would recommend thinking about your models and their estimation.

  3. I actually probably spend more time reading papers and books from the Humanities than from the Social Sciences. At the end of the day, in my view, there’s a lot more text (and, more importantly, a lot more interesting text) in the corpora of various Humanities subfields than in any social science… Not a sermon, just a thought. 10 points if you get that reference.

  4. Actually, a graph of the frequency of text-as-data papers before and after this publication would be fascinating… 10 bonus points to anyone who does it!