Intro to Text Analysis

Last month I did a workshop on text analysis in Python for a computational text analysis group we started in Duke Sociology (see workshop GitHub page), which was a follow-up to a workshop I did for the UCSB Broom Center for Demography last spring. The tutorial covers some basics of parsing text in SpaCy and using matrices to manipulate document representations for analyses.

I had two thoughts while creating this workshop: (1) around 60% of most text analysis projects are the same. The key is to come up with a system and design pattern that works for you. (2) There aren’t that many new tools for text analysis in the social sciences. Most algorithms we’ve picked up are simply more efficient or fancier (read: Bayesian) versions of very old algorithms. Now I’ll elaborate.

Most text analysis projects are the same. By this I mean that most projects require the same or similar boilerplate tasks: preprocessing might involve removing links, HTML tags, or other artifacts from raw text; tokenization involves decisions about which tokens to include, how to deal with named entities, stopword removal, and hyphenation collapsing; document representation storage involves placing the parsed dataset into a database (plug for my package doctable), spreadsheet, pickled file, or some other storage medium that can be accessed later. Then an (the?) algorithm operates on those document representations to create models, which are again stored in a database or other file for creating visualizations or running statistical models against. There may be more aspects to this: hand-coding, metadata analysis, etc. But most projects follow this pattern, and the exceptions are a minority (and could probably be put into this form if you try hard enough anyway).
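That boilerplate pipeline (preprocess, tokenize, store) is short enough to sketch. Here is a minimal pure-Python illustration of the pattern, with a toy stopword list and an in-memory SQLite table standing in for whatever storage medium you prefer (this is a sketch of the general shape, not a recommendation of these particular regexes):

```python
import re
import sqlite3

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # toy stopword list

def preprocess(raw):
    """Strip HTML tags and links from raw text."""
    text = re.sub(r"<[^>]+>", " ", raw)        # html tags
    text = re.sub(r"https?://\S+", " ", text)  # links
    return text

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# store parsed documents so later analyses can read them back
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, tokens TEXT)")

raw_docs = ["<p>The cat sat.</p>", "See https://example.com for the dog."]
for raw in raw_docs:
    toks = tokenize(preprocess(raw))
    conn.execute("INSERT INTO docs (tokens) VALUES (?)", [" ".join(toks)])

stored = [row[0] for row in conn.execute("SELECT tokens FROM docs")]
```

In a real project the tokenizer would be SpaCy and the storage layer something like doctable, but the three-stage shape stays the same.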

The point of learning these tools is to develop a series of design patterns. Most of us (speaking to social scientists here) are not software engineers, nor have we ever even been paid to write code that someone else will read or use. If we were, we would know that coding design patterns are all about predictability. Despite the incredibly large range of possible algorithm designs that could be used, the software engineer seeks to be consistent: consistent project structure, consistent design architecture, and even consistent use of syntax for basics like loops and conditionals. Someone has to read that code, so the goal is to make it as easily understood as possible. For us, we are (often) the only ones to read our code. So learning to write research code is about developing our way of doing the same boilerplate tasks and organizing projects in a way that we can recognize easily in the future.

There aren’t that many new tools for text analysis in the social sciences. Let’s start with an easy one: Latent Dirichlet Allocation (LDA). This one algorithm spurred a (perceived) revolution of social scientists doing text analysis for interpretation. It hit everywhere: sociology (see the Poetics special issue), corpus linguistics (see my long-ish discussion), and the digital humanities especially. Right now, Word2Vec is HOT. Same deal, different algorithm. Now, I’m not against this: in fact, I’ve used both of these tools in research projects, and I think they can be incredibly useful and provide important insights. My argument is not that they are not useful – only that they are not particularly novel for social scientists.

While LDA has become by far the most popular topic modeling algorithm, it does pretty much the same thing as its matrix factorization equivalent, Nonnegative Matrix Factorization (NMF). They both start with the same model of the texts: documents are bags of words represented as rows in a document-term matrix. NMF is similar to Singular Value Decomposition (SVD) except that it constrains the matrix and its factors to be nonnegative. The thing is, NMF is really old. For whatever reason, it was LDA that spurred interest in topic modeling in the social sciences. The same goes for word embeddings. While Word2Vec became hugely popular, subsequent work showed that a simple Pointwise Mutual Information (PMI) calculation based on word co-occurrence frequencies, followed by SVD, could produce results similar to those from Word2Vec – but PMI-SVD has been around for a really long time. With the exception of parse-tree and named entity extraction, I’m not totally convinced that we’ve seen anything new that really changes the way we can analyze texts.
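The PMI-plus-SVD recipe is short enough to sketch directly. Here is a toy NumPy illustration using a made-up 3×3 word-context co-occurrence matrix (illustrative only – real pipelines build the counts from a corpus and use a truncated, sparse SVD):

```python
import numpy as np

# toy word-context co-occurrence counts (rows: words, cols: context words)
counts = np.array([
    [10.0, 2.0, 0.0],
    [ 3.0, 8.0, 1.0],
    [ 0.0, 1.0, 9.0],
])

total = counts.sum()
p_wc = counts / total                      # joint probabilities p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)      # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)      # marginal p(c)

# positive PMI: log(p(w,c) / (p(w) p(c))), clipped at zero
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)

# truncated SVD of the PPMI matrix gives dense word vectors
U, S, Vt = np.linalg.svd(ppmi)
k = 2
word_vectors = U[:, :k] * S[:k]  # each row is a k-dimensional embedding
```

Nothing here postdates Word2Vec: PMI weighting and SVD-based dimensionality reduction were standard distributional-semantics tools decades earlier.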

If most text analysis pipelines are very similar and we’re using basically the same tools as we’ve had for the last few decades, where does that leave text analysis researchers? It leaves us with substance. The fact that these algorithms are so readily available and we have so many tools for working with texts means that we can focus less on methods and more on substantive analyses. In my opinion, the value in doing computational text analysis is more than the cool factor: we can answer questions that were difficult to answer before. In an increasingly digital society the study of digital texts has never been so important. We just have to be willing to ask the questions before we pick up python.

