Last October I visited Erlangen, Germany to attend a workshop set up by Dr. Tim Griebel and Prof. Dr. Stefan Evert called “Texts and Images of Austerity in Britain. A Multimodal Multimedia Analysis”. Tim and Stefan are leading this ongoing project aimed at analyzing 20k news articles from The Telegraph and The Guardian starting in 2010 and leading up to Brexit. I’m working alongside 21 other researchers with backgrounds in discourse analysis, corpus linguistics, computational linguistics, multimodal analysis, and sociology to explore discourse between the two news sources across time from different perspectives.
I’ve been thinking recently about how we think and talk about the relationship between theory and methods in computational text analysis. I argue that an assumed inseparability of theory and methods leads us to false conclusions about the potential for Topic Modeling and other machine learning (ML) approaches to provide meaningful insight, and that it holds us back from developing more systematic interpretation methods and addressing issues like reproducibility and robustness. I’ll respond specifically to Dr. Andrew Hardie’s presentation “Exploratory analysis of word frequencies across corpus texts” given at the 2017 Corpus Linguistics conference in Birmingham. Andrew makes some really good points in his critique about shortcomings and misunderstandings of tools like Topic Modeling, and I hope to contribute to this conversation so that we can further improve these methods – both in how we use them and how we think about them.
I spent the summer of 2017 with my colleague Marcelle Cohen living in Colombia and studying the conflict and peace process there. Our objective was to explore how political discourse as cultural practice creates entrenched ideologies and contentious politics, and how those discourses relate to other populist movements happening around the world. From a methodological perspective, I’m interested in seeing how we can use interview data in tandem with computational text analysis and quantitative network methods. We performed interviews with politicians and diplomats, attended political rallies in Bogota and more rural communities, and made connections with some local peace organizations and universities. Our interviews will allow us to give agency to the political elite and understand discourse at a point of production, as it is embedded in a political institution. Ultimately I had a great experience that allowed me to test the lenses of cultural and political theory, learn about qualitative methods, and dive deeper into the political culture of Colombia.
This article is more about my meta-impressions – see the academic presentation Political Culture in Colombia for some depth.
This summer I had the opportunity to work with sociology undergraduate student Emma Kerr as part of her summer research internship with the UCSB IGERT program. Emma proposed a project investigating whether news coverage of Betsy DeVos was more focused on her personal life or her policy initiatives, relative to coverage of previous Secretaries of Education. The summer program is designed to introduce big data and network science to students with interdisciplinary backgrounds. Emma had taken a computational sociology class at UCSB with John Mohr that involved Twitter analysis and really enjoyed it, so I thought she would be a good fit for the program.
Word2Vec models that have been pre-trained on large corpora are invaluable because they distill semantic and contextual information into a lookup table of only a few million word vectors. They tend to perform well on synonym and analogy tests at around 300 dimensions, and can be applied to a number of machine learning tasks. The challenge with these large models is that they take a long time to load into memory when your program starts, and the lookup and similarity computations are demanding enough that you may not want to run them on your desktop computer. I’ve written a python library called word2vecserver that lets you load a pre-trained model onto a server and use the client library to request vector representations or analogy tests from another computer.
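To illustrate the client–server split, here is a minimal standard-library sketch of serving vector lookups over TCP. This is not word2vecserver’s actual API – the JSON request format, the `VectorHandler` class, and the `lookup` helper are all illustrative assumptions, and a toy dictionary stands in for a loaded model:

```python
import json
import socket
import socketserver
import threading

# Toy vector store standing in for a loaded Word2Vec model.
VECTORS = {"king": [0.9, 0.8, 0.1], "queen": [0.9, 0.1, 0.8]}

class VectorHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One JSON request per line, e.g. {"word": "king"}.
        request = json.loads(self.rfile.readline())
        reply = {"vector": VECTORS.get(request["word"])}
        self.wfile.write((json.dumps(reply) + "\n").encode())

def lookup(host, port, word):
    """Client side: ask the server for a word's vector."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps({"word": word}) + "\n").encode())
        return json.loads(sock.makefile().readline())["vector"]

# Bind to an ephemeral port and serve requests in a background thread.
server = socketserver.TCPServer(("localhost", 0), VectorHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(lookup("localhost", server.server_address[1], "king"))  # [0.9, 0.8, 0.1]
```

The point of the design is that the expensive load happens once, on the server, while clients pay only the cost of a small network round-trip per query.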
To use the library, download the pre-trained Google News vectors and load them into memory using Gensim.
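A minimal sketch of that workflow, assuming the standard GoogleNews-vectors-negative300.bin file: the Gensim load call is commented out because the file is several gigabytes, and toy 3-dimensional vectors stand in for the real 300-dimensional embeddings so the cosine-similarity arithmetic behind an analogy query can run on its own:

```python
import math

# Loading the real pre-trained vectors (requires downloading the
# ~1.5 GB GoogleNews-vectors-negative300.bin file first):
# from gensim.models import KeyedVectors
# model = KeyedVectors.load_word2vec_format(
#     "GoogleNews-vectors-negative300.bin", binary=True)

# Toy 3-d vectors standing in for the 300-d embeddings.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.0],
    "woman": [0.5, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding inputs."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman", toy_vectors))  # prints "queen"
```

With the real model loaded, the equivalent query is Gensim’s `model.most_similar(positive=["king", "woman"], negative=["man"])`.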
I’ll add updates as I begin to use it in different contexts. Contributions are welcome – if you make useful commits I’ll merge them!