I’ve recently become interested in Word2Vec as a way to represent semantic relationships between words in a corpus. In particular, I’m interested in making comparisons between corpuses: how do different texts organize concepts differently? Here I attempt to sketch a theoretical basis for word2vec drawing from early structural linguistics and sociology. Then I examine some basic results from training a word2vec model on the Gutenberg texts built into the nltk python library. Might this approach have utility for understanding how authors organize different concepts in a text?
Early Structural Linguistics and Sociology
One of the original conceptions of ‘culture’ came with Ernst Cassirer’s ‘relational’ approach (1910). He posited, in response to the scientific inquiry of his time, that true meaning cannot be organized into any reasonable taxonomy; instead he suggested a relational approach – one that offers no more than the ability to compare concepts with one another. Thus, our cultural world is made up of relationships between concepts, and variations in cultures are essentially differences in the organization of those concepts.
Six years later, Ferdinand de Saussure sketched, in his Course in General Linguistics (1916), the idea of the linguistic sign composed of the signifier (sound-image) and the signified (concept), and proposed that shifts in language occur through “a shift in the relationship between the signified and the signifier.” Saussure argued, though, that language cannot be studied simply by examining the relationship between signifier and signified in isolation; “it is from the interdependent whole that one must start and through analysis obtain its elements.” Relationships among all signs constitute that whole, organized by tensions of distinction and similarity. Despite the post-structuralist movement that emerged a few decades after Saussure’s lectures, some of these fundamental concepts may still hold merit – particularly if new quantitative methods can measure this systematic whole.
Word2Vec Examined: Dimensions of Meaning
While others have given more precise descriptions of word2vec, I aim to examine how, intuitively, the method might fit within the theoretical basis of early structural linguistics. Essentially, word2vec is an unsupervised machine-learning method that, given a corpus of text, projects every word (a qualitative signifier) into an n-dimensional “semantic” vector space (a quantitative signifier space). From an algorithmic perspective, each word starts at a random position in semantic space and is continuously moved closer to the words that appear more frequently in the same sentence, until an “optimal” position is reached. Assuming sentences constitute some level of contextual delineation, this creates a kind of semantic or contextual space. Because the algorithm attempts to optimize all of the distances at once, two words with high co-occurrence may or may not end up close to one another, depending on the other contexts in which each word appears. In this way, a word’s position in semantic space carries information about all of the contexts in which that word is used, and, indirectly, about the other words that appear in similar contexts (constrained by the number of dimensions, a model parameter).
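The intuition above can be sketched in a few lines of plain Python. This is a deliberately crude caricature, not the real skip-gram objective (which uses a prediction loss and negative sampling): words simply drift toward their in-sentence neighbours, so words that share contexts end up near one another.

```python
import random
import math

random.seed(0)

# A toy corpus: the first two "sentences" share context words.
sentences = [
    ["king", "rules", "castle"],
    ["queen", "rules", "castle"],
    ["dog", "chases", "ball"],
]

dims = 2
vocab = sorted({w for s in sentences for w in s})
# every word starts at a random position in the space
vec = {w: [random.uniform(-1, 1) for _ in range(dims)] for w in vocab}

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for _ in range(200):  # repeated passes over the corpus
    for sent in sentences:
        for w in sent:
            for c in sent:
                if c == w:
                    continue
                # nudge w a small step toward its in-sentence neighbour c
                vec[w] = [x + 0.05 * (y - x) for x, y in zip(vec[w], vec[c])]

# "king" and "queen" never co-occur directly, but both co-occur with
# "rules" and "castle", so they converge into the same region.
print(dist(vec["king"], vec["queen"]), dist(vec["king"], vec["ball"]))
```

Even in this caricature, similarity emerges from shared context rather than direct co-occurrence, which is the property the essay leans on.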
Theoretically, if one were to train a word2vec model on all texts ever created, the semantic space would represent every multiplicity of context in which each word might be used. Given that the dimensionality of the semantic space is arbitrary (provided as a parameter to the algorithm), it could always expand into more dimensions to capture more contexts (an important analogy I’ll discuss later). It is also important to note that computer scientists and engineers love vector spaces (I should know – I am one haha). In 2013, Google released a massive word2vec model trained on the entire Google News corpus: it contained over a million words projected into a three-hundred-dimensional space attempting to capture all of the contexts in which words are used (it was so big I had to build a server program to load it all into memory). Engineers love this vector-space representation because of the ease with which linear operations can be performed: dot products (cosine similarity) for a measure of contextual similarity, addition for combining contextual multiplicities, and subtraction for differences in contextual similarity. The famous example is the simple semantic arithmetic: “king” – “man” + “woman” yields a vector very close to that of “queen”. This implies that the vector for “man” captures the word’s relevance to each of the abstract dimensions in the space, and its direction traces a path through that space along some kind of scale of contextual similarity. That scale only has meaning to us because we know the position of the word “man”.
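The arithmetic is easy to demonstrate with hand-built toy vectors, where one axis loosely stands for gender and the other for royalty. These vectors are assumptions for illustration only, not learned embeddings – but they show why the analogy works when a real model happens to encode those directions.

```python
import numpy as np

# Hand-built 2-d toy vectors: axis 0 ≈ gender, axis 1 ≈ royalty.
vecs = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.0, -1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" - "man" + "woman": swap the gender component, keep royalty
target = vecs["king"] - vecs["man"] + vecs["woman"]

# nearest remaining word by cosine similarity
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

In a trained model the axes are not labeled, of course; the arithmetic only works because the learned space happens to align such contrasts along consistent directions.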
Contextual Multiplicity: Dimensions of Meaning
The problem of contextual multiplicity was one of the major issues that popularized the post-structuralist focus on meaning through discourse, championed by Foucault and later Bourdieu. This movement reframes the question for word2vec: not how we can find a representation for each word, but how we can understand, through that representation, all the contexts in which words are used and how they relate to other words (themselves measured by context). Saussure, however, would argue that this is not enough: we need to be able to examine the holistic structure by which all the words are organized. Thus, we are left with antagonistic approaches from the structuralist and post-structuralist periods: how can we understand differences between texts at both the contextual and holistic levels?
Given the method for construction, you might expect (rather accurately), that a word2vec model would be biased towards the corpus it was trained on. Semantic positions would represent the multiplicity of contexts in which the particular texts used them, but perhaps no more. I believe then, that word2vec may be useful for comparing texts by comparing their semantic or contextual organization of words. This seems intuitive with powerful implications, but there are subtleties to the method that make it quite challenging to understand how this might work.
Case Example: The Gutenberg Corpus
Exact word positions in the space have no meaning because the projections are arbitrary (read: rotation-invariant). What really matters is the distances between words. Depending on what questions are asked about corpus differences, a simple collection of distances may be appropriate – and the easiest to compute. Word2Vec libraries often contain fast lookup algorithms for finding the n words closest to a given word.
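A minimal version of that lookup, over random stand-in vectors, might look like the sketch below; gensim’s KeyedVectors.most_similar implements the same idea with optimized search structures.

```python
import numpy as np

# Random stand-in vectors for a toy vocabulary (a trained model would
# supply real embeddings here).
rng = np.random.default_rng(42)
vocab = ["love", "heart", "dear", "sword", "battle"]
vecs = {w: rng.normal(size=5) for w in vocab}

def most_similar(word, n=2):
    """Return the n words closest to `word` by cosine similarity."""
    v = vecs[word]
    sims = {w: float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in vecs.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:n]

print(most_similar("love"))
```

Since the positions are only meaningful relative to one another, this kind of neighbour list is one of the few outputs that survives the rotation-invariance problem.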
I used the Gutenberg Corpus that comes built-in as part of the nltk corpuses to try some preliminary analysis. I used three Jane Austen works: Persuasion, Emma, and Sense and Sensibility, and three Chesterton works: The Man Who Was Thursday, Father Brown stories, and The Ballad of the White Horse. I took each of these pieces and divided them into collections of sentences, from which I removed ‘stop words’ and special characters. I trained a Word2Vec model for each corpus (using the python gensim library), then simply observed the five words closest to the word ‘love’. Implicit here is the parameter describing the number of dimensions into which the words are projected; I tried a range of options from 50 to 150 to understand the sensitivity of the results to this parameter (displayed here are results from 50 and 150 dimensions).
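The preprocessing step might be sketched roughly as follows. The stop-word set here is a tiny stand-in for nltk’s stopwords corpus, and the sentence splitter is deliberately naive; the closing comment marks where gensim’s Word2Vec class would take over.

```python
import re

# Tiny stand-in for nltk.corpus.stopwords.words("english")
STOP = {"the", "a", "and", "of", "to", "was", "in", "it"}

def preprocess(text):
    """Split text into sentences, strip special characters and stop words."""
    sentences = re.split(r"[.!?]+", text)
    cleaned = []
    for s in sentences:
        # keep only alphabetic tokens (drops punctuation like '),')
        words = [w for w in re.findall(r"[a-z']+", s.lower())
                 if w not in STOP]
        if words:
            cleaned.append(words)
    return cleaned

sample = "It was a truth universally acknowledged. She loved the garden!"
print(preprocess(sample))
# each inner list is one sentence, ready to feed gensim:
#   Word2Vec(cleaned, vector_size=50, ...)
```

Note that the ‘),’ token appearing in the results below suggests my actual special-character filtering was leakier than this sketch.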
words closest to "love" in 50 dimensions:

chesterton-ball: ['would', 'said', 'could', 'round', 'got']
chesterton-brown: ['said', 'close', 'got', '),', 'either']
chesterton-thursday: ['cannot', 'line', 'got', 'gave', 'said']
austen-emma: ['able', 'bad', 'would', 'perhaps', 'done']
austen-persuasion: ['would', 'could', 'lady', 'though', 'might']
austen-sense: ['would', 'much', 'though', 'perhaps', 'gave']

words closest to "love" in 150 dimensions:

chesterton-ball: ['us', 'one', 'even', 'could', 'whole']
chesterton-brown: ['beyond', 'us', 'one', 'garden', 'table']
chesterton-thursday: ['us', 'saw', 'could', 'one', 'like']
austen-emma: ['time', 'made', 'mind', 'beyond', 'without']
austen-persuasion: ['one', 'could', 'woman', 'might', 'first']
austen-sense: ['could', 'one', 'mother', 'time', 'mind']
As you can see, very few words are in both lists – meaning that the semantic projections, at least around a single topic, are highly sensitive to this parameter. However, due to the possibly-skewed distribution of distances from “love”, it could simply be that this one measure of consistency is weak for considering sensitivity of the overall structure.
Another blogger also examined the use of word2vec for clustering books in the Gutenberg project. The clustering algorithm was applied to a network where distances are measured using the similarity of top n results from every word in the space. The results from that experiment were actually quite compelling: the Jane Austen and Arthur Conan Doyle works ended up in their own cluster and other clusters were also intuitively sensible. The results I found from my simpler analysis showed a possible sensitivity to the number of dimensions, but given that every book was modeled using the same number of dimensions, the sensitivity may not have been as high as I expected. Overall, it seems to be a straightforward and productive method of analysis.
Beyond the Micro
While I’ve found the top-n metric to be popular among many studies (and experiments), I think we can do better to understand these contextual spaces on a holistic level. Saussure would argue that the meaning of one sign can only be understood by looking at the totality of other signs – each one organizing the others. At a more abstract level, one might be interested in the words that have the most “influence” in organizing other words: are some words more foundational to the overall structure? If so, how can we quantify this difference?
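One speculative way to quantify that “influence”: count how often each word appears in other words’ top-n neighbour lists, a kind of in-degree centrality over the semantic space. The vectors below are random stand-ins for a trained model, so the specific winner is meaningless – the point is the measure itself.

```python
import numpy as np

# Random stand-in embeddings: 20 words in a 10-d space, unit-normalized.
rng = np.random.default_rng(7)
vocab = [f"w{i}" for i in range(20)]
V = rng.normal(size=(20, 10))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# All pairwise cosine similarities; exclude self-similarity.
sims = V @ V.T
np.fill_diagonal(sims, -np.inf)

# Each word's 3 nearest neighbours.
top3 = np.argsort(-sims, axis=1)[:, :3]

# "Influence" = how many top-3 lists a word appears in (in-degree).
counts = np.bincount(top3.ravel(), minlength=len(vocab))
print(vocab[int(np.argmax(counts))], int(counts.max()))
```

A word that sits in many neighbour lists organizes many local regions of the space; comparing these in-degree distributions across corpora might be one route from the micro (single neighbour lists) toward Saussure’s holistic structure.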
Beyond quantifying the differences between words, what exactly are the differences in the contexts where they are used? If examining the top-n closest words is sensitive to noise or parameter variation, what other options are there? Should we search in a direction determined by another concept instead of within a hypersphere? Should we look at raw co-occurrence, or to other algorithms for extracting narrative?
Finally, I think it is fair to say that context is not meaning; what do these context-driven structures tell us about a text, and how can we use that intuition to develop methods for comparison of semantic projections?
Cassirer, E. (1910). Substance and Function.
Saussure, F. de. (1916). Course in General Linguistics.