Last month I ran a workshop on text analysis in Python for a computational text analysis group we started in Duke Sociology (see the workshop GitHub page), a follow-up to a workshop I gave for the UCSB Broom Center for Demography last spring. The tutorial covers the basics of parsing text with spaCy and of using matrices to manipulate document representations for analysis.
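The matrix side of that workflow can be shown with a toy document-term matrix. This is a generic sketch in plain Python (not the workshop's actual code), with a made-up two-document corpus standing in for parsed text:

```python
from collections import Counter

# Toy corpus standing in for parsed, tokenized documents.
docs = {
    "doc1": "security strategy national security".split(),
    "doc2": "national economy trade economy".split(),
}

# Build a shared vocabulary; its terms become the matrix columns.
vocab = sorted({tok for toks in docs.values() for tok in toks})

# Document-term matrix: one row of raw term counts per document.
dtm = []
for name, toks in docs.items():
    counts = Counter(toks)
    dtm.append([counts[term] for term in vocab])

print(vocab)  # column labels
print(dtm)    # row for doc1 counts 'security' twice
```

Once documents live in a matrix like this, comparisons between administrations reduce to row operations (sums, normalizations, similarities).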
I had two thoughts while creating this workshop: (1) roughly 60% of any text analysis project is the same as every other; the key is to settle on a system and design pattern that works for you. (2) There aren't that many genuinely new tools for text analysis in the social sciences; most algorithms we've picked up are simply more efficient or fancier (read: Bayesian) versions of very old ones. Now I'll elaborate.
I created a public GitHub repo to share cleaned plain-text versions of the US National Security Strategy documents. Each presidential administration since 1987 has been required to produce at least one document per term, so you can easily compare the documents by administration or party. By hosting them in a public repo, I hope to make them easier to use for text analysis demos. Use the download_nss() function in the example script to download and read all or some of the NSS documents into Python.
The choice of NSS documents was motivated by one of my all-time favorite articles, co-authored by my former advisor John Mohr with Robin Wagner-Pacifici and Ronald Breiger. In addition to the documents analyzed in that piece, I copy/pasted the text of the Trump 2017 NSS document to add it to the collection.
Mohr, J. W., Wagner-Pacifici, R., & Breiger, R. L. (2015). Toward a computational hermeneutics. Big Data & Society, (July–December), 1–8. (link)
I just published a Python package called “EasyText”, funded as part of a UCSB undergraduate instructional development grant with John W. Mohr. The project came about as a follow-up to Dr. Mohr's Introduction to Computational Sociology course, which I helped with in Spring 2016 (more about that). The goal is to bring a broad range of text analysis tools into a single interface, particularly one that can be run from the command line or with a minimal amount of Python code. Try it out using pip: “pip install easytext” (see the PyPI page).
The command line interface is particularly focused on generating spreadsheets that students can view and manipulate in a spreadsheet program like Excel or LibreOffice. Students can perform interpretive analysis by moving between EasyText output spreadsheets and the original texts, or feed the output into a quantitative analysis program like R or Stata. The program supports simple word counting, noun phrase detection, named entity recognition, noun-verb pair detection, entity-verb detection, prepositional phrase extraction, basic sentiment analysis, topic modeling, and the GloVe word embedding algorithm.
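To give a sense of the simplest of these features, here is a minimal dictionary-based sentiment scorer. This is an illustration of the general word-counting approach, not EasyText's actual implementation, and the word lists are made up for the example:

```python
# Made-up sentiment lexicons for illustration only.
POSITIVE = {"peace", "cooperation", "prosperity"}
NEGATIVE = {"threat", "conflict", "instability"}

def sentiment_score(tokens):
    """Return (#positive - #negative) / #tokens; 0.0 for empty input."""
    if not tokens:
        return 0.0
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return (pos - neg) / len(tokens)

tokens = "the threat of conflict outweighs cooperation".split()
print(sentiment_score(tokens))  # (1 - 2) / 6 ≈ -0.167
```

Output like this, one score per document, is exactly the kind of column that drops naturally into a spreadsheet for students to inspect alongside the original texts.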
I recently created my first Python library, DocTable, which I've been using for most of my text analysis projects. I've uploaded it to PyPI, so anyone can install it with the command “pip install doctable”. You can think of DocTable as a class-based interface for working with tables of data.
Using DocTable, the general flow of my text analysis projects involves (1) parsing and formatting raw data and metadata into a DocTable, (2) preprocessing the text into token lists at the document or sentence level, (3) storing the processed Python objects back into the DocTable, and (4) querying the DocTable entries for ingestion into word embedding, topic modeling, sentiment analysis, or some other algorithm. This process makes storing large-ish datasets fairly easy, combining the ease of working with Pandas DataFrames with the advanced query and storage capabilities of SQLite.
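The same four-step pattern can be sketched with nothing but the standard library. This is not DocTable's actual API, just an illustration of the underlying parse → preprocess → store → query logic using sqlite3 and pickle, with made-up document names and text:

```python
import pickle
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (name TEXT PRIMARY KEY, year INTEGER, toks BLOB)")

# Stand-in raw data: (name, year, text) triples.
raw = [
    ("nss_1987", 1987, "national security strategy"),
    ("nss_2017", 2017, "america first security"),
]

# Steps (1)-(3): parse metadata, tokenize text, store pickled token lists.
for name, year, text in raw:
    toks = text.lower().split()  # stand-in for real preprocessing
    con.execute("INSERT INTO docs VALUES (?, ?, ?)",
                (name, year, pickle.dumps(toks)))

# Step (4): query a subset back out for downstream modeling.
rows = con.execute("SELECT name, toks FROM docs WHERE year >= 2000").fetchall()
corpus = {name: pickle.loads(blob) for name, blob in rows}
print(corpus)  # {'nss_2017': ['america', 'first', 'security']}
```

The SQL WHERE clause is what makes this nicer than a pile of pickle files: you pull only the documents a given model run needs, without loading the whole corpus into memory.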
This fall, John Mohr and I ran a pilot program for teaching sociology undergraduates how to use topic modeling in their projects. The pilot lasted only about four weeks: students were asked to prepare a text corpus of approximately 100 documents using LexisNexis (or copy-paste from the web) and perform their analyses in Excel or Google Sheets. Past mentoring projects of John's and mine showed that undergraduates can come up with some pretty creative ways to use these computational analysis tools, even if they can't write the code themselves (see my summer mentorship project). Beyond the technical hurdles, the most challenging part of this work is getting students to think about what information they can extract from large corpora and how to use the tools to answer questions of interest. It is clear that the era of Big Data and internet access has changed how social processes occur at scale (think Fake News), so we need to train social scientists to use new tools and to think about data differently.