I just published a Python package called “EasyText” that was funded as part of a UCSB undergraduate instructional development grant with John W. Mohr. The project came about as a follow-up to Dr. Mohr’s Introduction to Computational Sociology course, which I helped with in Spring 2016 (more about that). Its goal is to bring a broad range of text analysis tools into a single interface, particularly one that can be run from the command line or with a minimal amount of Python code. Try it out using pip: “pip install easytext” (see PyPI page).
The command line interface is particularly focused on generating spreadsheets that students can then view and manipulate in a spreadsheet program like Excel or LibreOffice. Students can perform interpretive analysis by going between EasyText output spreadsheets and the original texts, or feed the output into a quantitative analysis program like R or Stata. The program supports features for simple word counting, noun phrase detection, Named Entity Recognition, noun-verb pair detection, entity-verb detection, prepositional phrase extraction, basic sentiment analysis, topic modeling, and the GloVe word embedding algorithm.
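To make the spreadsheet idea concrete, here is a minimal sketch of the simplest feature, word counting, written with only the Python standard library. It is not EasyText's own code or interface; the document names, the `count_words` and `write_count_table` helpers, and the toy texts are all assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical toy corpus: file name -> raw text (illustrative only).
docs = {
    "doc1.txt": "The cat sat. The cat slept.",
    "doc2.txt": "A dog barked at the cat.",
}

def count_words(text):
    """Lowercase the text and count runs of letters/apostrophes as words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# One Counter per document, plus a shared, sorted vocabulary for the columns.
counts = {name: count_words(text) for name, text in docs.items()}
vocab = sorted(set().union(*counts.values()))

def write_count_table(counts, vocab, out):
    """Write one row per document and one column per word, as CSV."""
    writer = csv.writer(out)
    writer.writerow(["document"] + vocab)
    for name, doc_counts in counts.items():
        # Counter returns 0 for words a document never uses.
        writer.writerow([name] + [doc_counts[word] for word in vocab])

# Write to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
write_count_table(counts, vocab, buffer)
print(buffer.getvalue())
```

The resulting table opens directly in Excel or LibreOffice, and the same file can be loaded into R or Stata for quantitative work.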
I recently created my first Python library, DocTable, which I’ve been using for most of my text analysis projects. I’ve uploaded it to PyPI, so anyone can install it using the command “pip install doctable”. You can think of DocTable as a class-based interface for working with tables of data.
Using DocTable, the general flow of my text analysis projects involves (1) parsing and formatting raw data and metadata into a DocTable, (2) preprocessing the text into token lists at the document or sentence level, (3) storing the processed Python objects back into the DocTable, and (4) querying the DocTable entries for ingestion into word embedding, topic modeling, sentiment analysis, or some other algorithm. This process makes storing large-ish datasets fairly easy, combining the ease of working with Pandas DataFrames with the advanced query and storage capabilities of SQLite.
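The workflow above can be sketched with nothing but Python's built-in `sqlite3` and `pickle` modules. This is not DocTable's API, just a rough illustration of the same parse, preprocess, store, and query pattern; the table layout, the `tokenize` helper, and the sample documents are all hypothetical.

```python
import pickle
import sqlite3

# Hypothetical raw documents with metadata (illustrative only).
raw_docs = [
    {"title": "doc_a", "year": 2016,
     "text": "Topic models find themes. They need many documents."},
    {"title": "doc_b", "year": 2018,
     "text": "Word embeddings place similar words near each other."},
]

def tokenize(text):
    """Rough split into sentences, then lowercase words.
    A real project would use a proper tokenizer here."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [[word.lower() for word in s.split()] for s in sentences]

conn = sqlite3.connect(":memory:")  # use a file path to persist the table
conn.execute(
    "CREATE TABLE docs ("
    "id INTEGER PRIMARY KEY, title TEXT, year INTEGER, text TEXT, tokens BLOB)"
)

# Parse raw data and metadata into the table.
conn.executemany(
    "INSERT INTO docs (title, year, text) VALUES (?, ?, ?)",
    [(d["title"], d["year"], d["text"]) for d in raw_docs],
)

# Preprocess each text into sentence-level token lists and store the
# resulting Python objects back into the table as pickled blobs.
rows = conn.execute("SELECT id, text FROM docs").fetchall()
for doc_id, text in rows:
    conn.execute(
        "UPDATE docs SET tokens = ? WHERE id = ?",
        (pickle.dumps(tokenize(text)), doc_id),
    )
conn.commit()

# Query a subset by metadata and unpickle it for a downstream algorithm.
recent = [
    pickle.loads(blob)
    for (blob,) in conn.execute(
        "SELECT tokens FROM docs WHERE year >= ? ORDER BY id", (2017,)
    )
]
print(recent)
```

Keeping the metadata in ordinary columns is what makes the final step cheap: the SQL query selects the subset, and only those rows are unpickled into memory. Pickled blobs should only be loaded from databases you created yourself.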
This Fall, John Mohr and I ran a pilot program for teaching Sociology undergraduates how to use topic modeling in their projects. The pilot program lasted only about 4 weeks, and students were asked to prepare a text corpus of approximately 100 documents using LexisNexis (or copy-paste from the web) and perform analysis using Excel or Google Sheets. Past mentoring projects that John and I have each run showed that undergraduates can come up with some pretty creative ways to use these computational analysis tools, even if they can’t write the code to do it themselves (see my summer mentorship project). Beyond the technical, the most challenging part of this work is getting students to think about what information they can get from large corpora and how to use the tools to answer questions of interest. It is clear that the era of Big Data and access to the internet has changed the way social processes occur on a large scale (think Fake News), so we need to train social scientists to use new tools and think about data differently.
Last October I visited Erlangen, Germany to attend a workshop set up by Dr. Tim Griebel and Prof. Dr. Stefan Evert called “Texts and Images of Austerity in Britain. A Multimodal Multimedia Analysis”. Tim and Stefan are leading this ongoing project aimed at analyzing 20k news articles from The Telegraph and The Guardian starting in 2010 and leading up to Brexit. I’m working alongside 21 other researchers with backgrounds in discourse analysis, corpus linguistics, computational linguistics, multimodal analysis, and sociology to explore discourse between the two news sources across time from different perspectives.
I’ve been thinking recently about how we think and talk about the relationship between theory and methods in computational text analysis. I argue that an assumed inseparability of theory and methods leads us to false conclusions about the potential for Topic Modeling and other machine learning (ML) approaches to provide meaningful insight, and that it holds us back from developing more systematic interpretation methods and addressing issues like reproducibility and robustness. I’ll respond specifically to Dr. Andrew Hardie’s presentation “Exploratory analysis of word frequencies across corpus texts” given at the 2017 Corpus Linguistics conference in Birmingham. Andrew makes some really good points in his critique about shortcomings and misunderstandings of tools like Topic Modeling, and I hope to contribute to this conversation so that we can further improve these methods – both in how we use them and how we think about them.