Word2Vec models pre-trained on large corpora are invaluable: they compress a great deal of semantic and contextual information into a lookup table covering a few million words. At around 300 dimensions they tend to perform well on synonym and analogy tests, and they can feed a range of machine learning applications. The challenge with these large models is that they take a long time to load into memory when your program starts, and the lookup computations are heavy enough that you may not want to run them on your desktop computer. I’ve written a Python library called word2vecserver that lets you load a pre-trained model onto a server and use the client library to request vector representations or analogy tests from another machine.
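As a rough illustration of what such an analogy request computes, here is a toy sketch: analogy tests reduce to vector arithmetic plus a cosine-similarity lookup. The three-dimensional vectors below are invented for the example; a real pre-trained model stores hundreds of dimensions for millions of words.

```python
import math

# Toy 3-dimensional vectors, invented for illustration; a real
# pre-trained model stores ~300 dimensions for millions of words.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.7, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))  # -> queen
```

The server does the same arithmetic, just over a far larger vocabulary, which is exactly why offloading it to a dedicated machine pays off.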
Word2VecServer GitHub Page
To use the library, download the pre-trained Google News file and load it into memory using Gensim.
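That loading step looks something like the following minimal sketch, assuming gensim is installed and the GoogleNews-vectors-negative300.bin file (the filename of Google's standard release) has been downloaded to the working directory:

```python
import os

PATH = "GoogleNews-vectors-negative300.bin"  # Google's pre-trained release

def load_news_vectors(path=PATH):
    """Load the pre-trained Google News vectors with gensim.

    Assumes gensim is installed; binary=True matches the .bin format
    the vectors are distributed in.
    """
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)

# Only attempt the (slow, memory-hungry) load if the file is present.
if os.path.exists(PATH):
    model = load_news_vectors()
    print(model.most_similar("computer", topn=3))
```

Expect the load itself to take a while and several gigabytes of RAM, which is the motivation for keeping the model resident on a server.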
I’ll add updates as I begin to use it in different contexts. Feel free to update as needed – if you make useful commits I’ll accept them!
Intuitive demonstrations of Word2Vec, like synonym generation and analogy tests, provide compelling evidence that semantic representations are not only possible but also meaningful. While these models open many opportunities for machine learning on text data, little work has explored texts at a smaller scale. For anyone interested in comparing how texts use concepts in different contexts, small and contextually sparse corpora may be sufficient. In this work I propose several methodological advancements for comparative semantic analysis and examine some of the biggest challenges that have yet to be addressed.
Continue reading “Comparative Semantic Analysis: Operational Strategies”
I’ve recently become interested in Word2Vec as a way to represent semantic relationships between words in a corpus. In particular, I’m interested in making comparisons between corpora: how do different texts organize concepts differently? Here I attempt to sketch a theoretical basis for Word2Vec, drawing on early structural linguistics and sociology. Then I examine some basic results from training a Word2Vec model on the Gutenberg texts built into the nltk Python library. Might this approach help us understand how authors organize different concepts in a text?
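In outline, that training step could be sketched as below. This is a hypothetical sketch, not the post's actual code: it assumes gensim and nltk are installed, that the Gutenberg corpus has been fetched with nltk.download("gutenberg"), and the hyperparameters shown are illustrative defaults.

```python
def train_models(fileids):
    """Train one Word2Vec model per Gutenberg text so their semantic
    spaces can be compared side by side."""
    from gensim.models import Word2Vec   # assumes gensim is installed
    from nltk.corpus import gutenberg    # assumes nltk + gutenberg corpus
    models = {}
    for fileid in fileids:
        sents = gutenberg.sents(fileid)  # sentences come pre-tokenized
        models[fileid] = Word2Vec(sents, vector_size=100, window=5,
                                  min_count=5, workers=2)
    return models

# e.g. compare how Austen and Melville situate the same concept:
# models = train_models(["austen-emma.txt", "melville-moby_dick.txt"])
# for name, m in models.items():
#     print(name, m.wv.most_similar("man", topn=5))
```

Training one model per text, rather than one pooled model, is what makes cross-author comparison of a concept's neighbourhood possible.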
Continue reading “Word2Vec for Comparative Semantic Spaces”
Last year I came across a working paper for AJS on Belief Network Analysis by Andrei Boutyline. The paper examines American National Election Survey data and tests two theories of political opinion formation: Lakoff’s theory of moral politics and Campbell’s theory of political identity. This project, in collaboration with Sujaya Maiyya, focused on extending BNA to the American National Election Survey time series data to test some of the claims Boutyline made in response to an investigation by Delia Baldassarri using Relational Class Analysis. The original work analyzed survey data from the year 2000, but we argue that no claims can be made about this process without a longitudinal investigation.
Continue reading “Belief Network Timeseries Analysis”
Several of my projects over the last few months have leaned toward longitudinal studies. The question every sociologist asks is “how did we get here?”, so it makes sense to explore how things changed leading up to the present. My conclusion is that if networks offer meaningful answers to the kinds of questions we are trying to ask, then we need to understand how those networks change over time.
This Python library is useful because it allows one to move easily from traditional networkx objects to simple time series representations in pandas and NumPy. I’ve already used it in several projects and I hope you can use it too!
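NetworkXTimeseries' own API isn't shown here, but the underlying idea can be sketched with plain networkx and pandas (both assumed installed): collect one graph snapshot per time step, then tabulate a per-node statistic, such as degree, as a time series. The snapshots below are toy data invented for illustration.

```python
import networkx as nx
import pandas as pd

# One graph snapshot per time step (toy data, invented for illustration).
edge_lists = [
    [("a", "b")],
    [("a", "b"), ("a", "c")],
    [("a", "c")],
]
graphs = [nx.Graph(edges) for edges in edge_lists]

# Rows are time steps, columns are nodes, values are degrees;
# nodes absent from a snapshot get degree 0.
degree_df = (pd.DataFrame([dict(g.degree()) for g in graphs])
             .fillna(0)
             .astype(int))
print(degree_df)
```

From a table like this, the usual pandas and NumPy machinery (rolling windows, differencing, plotting) applies directly to the longitudinal questions above.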
Continue reading “NetworkXTimeseries: A Python Library for Network Timeseries Data Structures”