This fall, John Mohr and I ran a pilot program for teaching Sociology undergraduates how to use topic modeling in their projects. The pilot program lasted only about four weeks, and students were asked to prepare a text corpus of approximately 100 documents using LexisNexis (or copy-paste from the web) and perform analysis using Excel or Google Sheets. Past mentoring projects of both John and me showed that undergraduates can come up with some pretty creative ways to use these computational analysis tools, even if they can’t write the code to do it themselves (see my summer mentorship project). Beyond the technical, the most challenging part of this work is getting students to think about what information they can get from large corpora and how to use the tools to answer questions of interest. It is clear that the era of Big Data and access to the internet has changed the way social processes occur on a large scale (think Fake News), so we need to train social scientists to use new tools and think about data differently.
Last October I visited Erlangen, Germany to attend a workshop set up by Dr. Tim Griebel and Prof. Dr. Stefan Evert called “Texts and Images of Austerity in Britain. A Multimodal Multimedia Analysis”. Tim and Stefan are leading this ongoing project aimed at analyzing 20k news articles from The Telegraph and The Guardian starting in 2010 and leading up to Brexit. I’m working alongside 21 other researchers with backgrounds in discourse analysis, corpus linguistics, computational linguistics, multimodal analysis, and sociology to explore discourse between the two news sources across time from different perspectives.
I’ve been thinking recently about how we think and talk about the relationship between theory and methods in computational text analysis. I argue that an assumed inseparability of theory and methods leads us to false conclusions about the potential for Topic Modeling and other machine learning (ML) approaches to provide meaningful insight, and that it holds us back from developing more systematic interpretation methods and addressing issues like reproducibility and robustness. I’ll respond specifically to Dr. Andrew Hardie’s presentation “Exploratory analysis of word frequencies across corpus texts” given at the 2017 Corpus Linguistics conference in Birmingham. Andrew makes some really good points in his critique about shortcomings and misunderstandings of tools like Topic Modeling, and I hope to contribute to this conversation so that we can further improve these methods – both in how we use them and how we think about them.
Word2Vec models that have been pre-trained on large corpora are invaluable because they pack a great deal of semantic and contextual information into a lookup dictionary of only a few million words. They tend to perform well on synonym and analogy tests at around 300 dimensions, and can be applied to a number of machine learning applications. The challenge with these large models is that they take a long time to load into memory when your program starts, and the lookup algorithms are computationally intensive to the point where you may not want to run them on your desktop computer. I’ve written a Python library called word2vecserver that allows one to load a pre-trained model onto a server and use the client library to make requests for vector representations or analogy tests from another computer.
To use the library, download the pre-trained Google News file and load it into memory using Gensim.
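As a rough sketch of the loading step only, here is how the pre-trained file might be read with Gensim's `KeyedVectors.load_word2vec_format`. The file name `GoogleNews-vectors-negative300.bin` and the helper names `load_google_news_vectors` and `analogy` are assumptions for illustration; this does not show the word2vecserver API itself.

```python
from pathlib import Path

# Assumed file name for the downloaded Google News vectors; adjust to your copy.
GOOGLE_NEWS_PATH = "GoogleNews-vectors-negative300.bin"


def load_google_news_vectors(path=GOOGLE_NEWS_PATH):
    """Load pre-trained word2vec vectors from a binary file using Gensim."""
    model_path = Path(path)
    if not model_path.is_file():
        raise FileNotFoundError(f"Pre-trained vector file not found: {model_path}")
    # Imported here so the path check above runs even if Gensim is missing.
    from gensim.models import KeyedVectors
    # binary=True because the Google News release is in word2vec binary format.
    return KeyedVectors.load_word2vec_format(str(model_path), binary=True)


def analogy(vectors, a, b, c, topn=1):
    """Answer 'a is to b as c is to ?' using Gensim's most_similar."""
    return vectors.most_similar(positive=[b, c], negative=[a], topn=topn)


# Example (loading is slow and memory-hungry, so it is left commented out):
# vectors = load_google_news_vectors()
# print(analogy(vectors, "man", "king", "woman"))
```

Loading the full file can take a while and needs several gigabytes of memory, which is the cost a server-side model is meant to pay only once.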
I’ll add updates as I begin to use it in different contexts. Feel free to update as needed – if you make useful commits I’ll accept them!
Intuitive demonstrations of Word2Vec like synonym generation and analogy tests provide compelling evidence that semantic representations are not only possible but also meaningful. While these models may hold many opportunities for machine learning on text data, little work has gone into exploring texts on a smaller scale. If one is interested in comparing how texts use concepts in different contexts, small and contextually sparse corpora may be sufficient. In this work I propose several methodological advancements for comparative semantic analysis and look at some of the biggest challenges that have yet to be addressed.