Last month I did a workshop on text analysis in Python for a computational text analysis group we started in Duke Sociology (see workshop GitHub page), which was a follow-up to a workshop I did for the UCSB Broom Center for Demography last spring. The tutorial covers the basics of parsing text with spaCy and using matrices to manipulate document representations for analysis.
I had two thoughts while creating this workshop: (1) around 60% of most text analysis projects are the same. The key is to come up with a system and design pattern that works for you. (2) There aren’t that many new tools for text analysis in the social sciences. Most algorithms we’ve picked up are simply more efficient or fancier (read: Bayesian) versions of very old algorithms. Now I’ll elaborate.
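To give a sense of the matrix-based document representations the workshop covers, here is a minimal document-term matrix in plain Python. The tokenizer and example texts are stand-ins for illustration, not the workshop's actual code:

```python
from collections import Counter

def doc_term_matrix(docs):
    """Build a vocabulary and a count matrix with one row per document."""
    tokenized = [doc.lower().split() for doc in docs]  # stand-in tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix

docs = ["the state pursues security", "security of the state"]
vocab, mat = doc_term_matrix(docs)
```

Once documents are rows in a matrix like this, most downstream analyses (topic models, embeddings, similarity measures) are matrix manipulations.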
Continue reading “Intro to Text Analysis”
I created a public GitHub repo to share cleaned, plain-text versions of the US National Security Strategy documents. Each presidential administration since 1987 is required to produce at least one document per term, so you can easily compare the documents by administration or party. Hosting them publicly should make them easier to use for text analysis demos. Use the download_nss() function in the example script to download and read all or some of the NSS documents into Python.
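Once the documents are loaded, comparing them by administration or party is mostly bookkeeping. A minimal sketch, assuming each text is keyed by president, year, and party (this metadata scheme is hypothetical, not the repo's actual layout):

```python
# Hypothetical metadata: (president, year, party) -> document text.
nss_docs = {
    ("Reagan", 1987, "R"): "...",
    ("Clinton", 1994, "D"): "...",
    ("Obama", 2010, "D"): "...",
    ("Trump", 2017, "R"): "...",
}

def docs_by_party(docs):
    """Group document texts by party, sorted by year, for comparison."""
    groups = {}
    for (president, year, party), text in docs.items():
        groups.setdefault(party, []).append((year, president, text))
    return {party: sorted(entries) for party, entries in groups.items()}

groups = docs_by_party(nss_docs)
```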
The choice of NSS documents was motivated by one of my all-time favorite articles, co-authored by my former advisor John Mohr with Robin Wagner-Pacifici and Ronald Breiger. In addition to the documents analyzed in that piece, I copied and pasted the text of the Trump 2017 NSS document to create a plain-text version of the newest release.
Mohr, J. W., Wagner-Pacifici, R., & Breiger, R. L. (2015). Toward a computational hermeneutics. Big Data & Society, (July–December), 1–8. (link)
I just published a Python package called “EasyText” that was funded as part of a UCSB undergraduate instructional development grant with John W. Mohr. The project came about as a follow-up to Dr. Mohr’s Introduction to Computational Sociology course, which I helped with in Spring 2016 (more about that). The goal was to bring a broad range of text analysis tools into a single interface, particularly one that can be run from the command line or with a minimal amount of Python code. Try it out using pip: “pip install easytext” (see PyPI page).
The command line interface is particularly focused on generating spreadsheets that students can then view and manipulate in a spreadsheet program like Excel or LibreOffice. Students can perform interpretive analysis by going between EasyText output spreadsheets and the original texts, or feed the output into a quantitative analysis program like R or Stata. The program supports simple word counting, noun phrase detection, Named Entity Recognition, noun-verb pair detection, entity-verb detection, prepositional phrase extraction, basic sentiment analysis, topic modeling, and the GloVe word embedding algorithm.
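The spreadsheet-oriented output can be illustrated with the standard library alone. This is a sketch of the simplest feature (word counting written to a CSV that opens in Excel or LibreOffice), not EasyText's actual implementation; the file name and example texts are made up:

```python
import csv
from collections import Counter

def counts_to_csv(docs, path):
    """Write a document-by-word count spreadsheet, one row per document."""
    counters = {name: Counter(text.lower().split()) for name, text in docs.items()}
    vocab = sorted({w for c in counters.values() for w in c})
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["doc"] + vocab)
        for name, counts in counters.items():
            writer.writerow([name] + [counts.get(w, 0) for w in vocab])

docs = {"nss2017": "america first security", "nss2010": "global security engagement"}
counts_to_csv(docs, "counts.csv")
```

A table like this, with documents as rows and words as columns, is exactly the kind of artifact students can eyeball in a spreadsheet program or load into R or Stata.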
Continue reading “EasyText Python Package”
I recently created my first Python library, DocTable, which I’ve been using for most of my text analysis projects. I’ve uploaded it to PyPI, so anyone can install it with the command “pip install doctable”. You can think of DocTable as a class-based interface for working with tables of data.
Using DocTable, the general flow of my text analysis projects involves (1) parsing and formatting raw data and metadata into a DocTable, (2) preprocessing the text into token lists at the document or sentence level, (3) storing the processed Python objects back into the DocTable, and (4) querying the DocTable entries for ingestion into word embedding, topic modeling, sentiment analysis, or some other algorithm. This process makes storing large-ish datasets fairly easy, combining the ease of working with Pandas DataFrames with the advanced query and storage capabilities of SQLite.
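The store-then-query pattern can be sketched with the standard library directly. This illustrates the underlying idea (token lists serialized into a SQLite table, then queried back out), not DocTable's actual API; the table schema and example corpus are made up:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, tokens TEXT)")

# Preprocess each document into a token list, then store it as JSON
# alongside its metadata.
corpus = {"nss2017": "America first security", "nss2010": "global security engagement"}
for title, text in corpus.items():
    tokens = text.lower().split()  # stand-in for real preprocessing
    conn.execute("INSERT INTO docs (title, tokens) VALUES (?, ?)",
                 (title, json.dumps(tokens)))

# Query the entries back out for ingestion into a downstream algorithm.
rows = conn.execute("SELECT title, tokens FROM docs ORDER BY title").fetchall()
parsed = {title: json.loads(tokens) for title, tokens in rows}
```

The payoff is that preprocessing runs once; every downstream model just issues a query against the stored token lists.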
Continue reading “DocTable Python Package”