Intuitive demonstrations of Word2Vec like synonym generation and analogy tests provide compelling evidence that semantic representations are not only possible but also meaningful. While these methods hold many opportunities for machine learning on text data, little work has gone into exploring texts on a smaller scale. If one is interested in comparing how texts use concepts in different contexts, small and contextually sparse corpora may be sufficient. In this work I propose several methodological advancements for comparative semantic analysis and examine some of the biggest challenges that have yet to be addressed.
Meaning from Context
Because the Word2Vec semantic space is actually built on contextual relationships, it attempts to capture all of the contextual multiplicities in which each word may be used in hopes that this captures the meanings of the words. One can first imagine performing a comparison between two arbitrary documents using this approach. As the two documents become larger and more contextually diverse, their associated semantic spaces capture more and more of those multiplicities; as they approach infinity, they might both contain all of the same contexts and thus end up with the same semantic spaces. This is the kind of behavior that is desired for Word2Vec: to capture the essence of words by capturing contexts.
More specifically, three assumptions are typically made by Word2Vec practitioners: a) text windows can capture context, b) meanings of words are entirely captured through contexts, and c) the training dataset contains enough contextual variety and quantity to capture all of the versatility of the words. In Word2Vec, text windows are typically sentences or proximity windows of a specified size, chosen to satisfy the first assumption. These assumptions are necessary if one attempts to go from word windows all the way to semantic meanings of words, but fewer assumptions may be needed if one simply desires to compare how words are used differently across texts. It is exactly this kind of application that may be most helpful for social scientists and digital humanists – all the power of these algorithms with fewer theoretical assumptions; but what exactly can they tell us about texts?
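Assumption (a) – that text windows can capture context – can be made concrete with a small sketch. The `window_cooccurrence` helper and toy tokens below are hypothetical illustrations, not the actual Word2Vec pipeline:

```python
from collections import Counter

def window_cooccurrence(tokens, window=5):
    """Count how often word pairs appear within the same text window.
    A backward-looking window of `window` words stands in for 'context';
    each unordered pair is counted once per co-occurrence."""
    counts = Counter()
    for i, center in enumerate(tokens):
        for neighbor in tokens[max(0, i - window): i]:
            pair = tuple(sorted((center, neighbor)))
            counts[pair] += 1
    return counts

tokens = "the cat sat on the mat near the cat".split()
cooc = window_cooccurrence(tokens, window=2)
print(cooc[("cat", "the")])  # how often "cat" and "the" share a window
```

Word2Vec's training objective is built on exactly these window co-occurrences; the semantic space it learns is a geometric compression of a table like this one.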
Substance and Function
While Saussure’s conception of language also described language as a process, the metaphor used here is derived from the unsupervised learning process used in Word2Vec. The algorithm works by randomly placing words in the semantic space and then continuously updating them until small distances correspond to high co-occurrence frequency in the text windows. This process can be thought of as a series of point masses (words) connected by springs, moving until a global equilibrium is reached. Shorter springs are used where co-occurrence is larger, and no spring is ever pushed past its f = 0 equilibrium point; springs only pull words together. In that sense, a series of forces organizes the structure – to understand the structure, we need to know the forces that designed it. These forces are measurable as the window co-occurrence and do not correspond directly to distances because the whole system is taken into account for those measures.
The most common method for measuring semantic similarity is cosine similarity, calculated as the dot product of two normalized semantic vectors. In geometric terms, this means that semantic vectors are projected onto a unit hypersphere centered on the origin (coordinates (0,0,…,0)) before the comparison is made. The Word2Vec algorithm itself, however, does not enforce any sort of vector normalization – vector norm information is totally lost in one of the most common metrics used by practitioners. Some evidence (Schakel 2015) suggests vector norms could tell us a significant amount about the structure of the semantic space when combined with word frequency. Smaller vector norms correspond to more geometrically central terms; vector norms are positively related to word frequency but inversely related to contextual diversity (see Figure 1). If a word has a small vector norm relative to its appearance frequency, it is more important to the organization of the whole system (Schakel 2015).
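A minimal pure-Python sketch of the point: cosine similarity only sees direction, so two vectors pointing the same way compare as identical no matter how different their norms are. The toy vectors here are illustrative assumptions:

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(u, v):
    """Dot product of the two vectors after projecting onto the unit
    hypersphere; all magnitude (vector-norm) information is discarded."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

u = [3.0, 4.0]   # norm 5.0
v = [0.3, 0.4]   # norm 0.5 – same direction, very different length
print(cosine_similarity(u, v))   # ≈ 1.0: the norm difference is invisible
print(norm(u), norm(v))          # the information cosine similarity drops
```

This is why the norm-based measures discussed next have to be computed separately: they recover exactly the information that the standard similarity metric throws away.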
Figure 1: Vector norms vs frequency for each word in the arXiv corpus – Figure 3 from (Schakel 2015). The curve turns downward starting at a frequency somewhere between 20 and 50. Observed results from the Gutenberg texts (later in this article) do not show this distribution, but that could be due to corpus size. Vectors that are smaller than average for their frequency bin are considered to be more significant by this measure.
We now have a structure of relations in semantic space with measurable centrality of words (vector norms) and measurable determining forces between them (text-window co-occurrence). It is now possible, in some sense, to develop an idea of how words or terms may be structurally organized – both in a global and a local sense. The missing element from these formulations is the focus on discourse – how these relationships are articulated. It will be especially important to be able to draw text windows that are representative, in a sense, of the discourse that led to the particularly strong organizing forces. Discourse is the basis of any structural conception of a text; without an interpretive sense of this discourse, the meaning behind the structure is lost.
Setting theoretical implications aside for a moment, several methodological strategies are proposed to a) get a brief idea of the totality of these semantic structures, b) semantically compare texts, and c) to investigate deeply the structural position and discourse around a single word or phrase. The strategies all build on two major questions: what forces drove the formation of the semantic structures as they are, and how does context compose those forces? The first question is founded in structuralism and hermeneutics (I discussed this in an earlier post), while the second is an attempt to go beyond relationships and representations to look at important discourse in the texts.
Macro-level features attempt to capture information about the totality of organization of the texts. The primary goals of this mode are to a) get a quick glance at important structural/discourse features or b) compare texts for different types of similarity – like, say, if one wanted to group texts by similarity or develop some abstract continuum each text might lie on.
If one is interested in understanding the individual semantic structures, these metrics could be most useful:
- words most important to the semantic structure of each text
- underlying substructures determining the organization of the whole system
- text windows most exemplary of the strongest forces in the structure
Words most important to the overall structure could easily be measured using the vector norm vs frequency metric (Schakel 2015), but it could also be possible to perform a sensitivity analysis: find out which words, when perturbed, would result in the most structural change. Underlying substructures would be built from strong semantic relationships between important words. Exemplary text windows would need to capture the ways that these substructures work together to form the basis of the semantic structures.
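The norm-vs-frequency measure could be sketched as follows. The binning scheme (crude log2 frequency bins) and the toy vectors are assumptions for illustration, not the exact procedure in (Schakel 2015):

```python
import math
from collections import defaultdict

def norm_importance(words, freq, vec):
    """Schakel-style significance sketch: within each (log2-spaced)
    frequency bin, words whose vector norm falls below the bin's mean
    norm score as more structurally important (positive = below mean)."""
    norms = {w: math.sqrt(sum(x * x for x in vec[w])) for w in words}
    bins = defaultdict(list)
    for w in words:
        bins[int(math.log2(freq[w]))].append(w)  # crude log2 frequency bins
    scores = {}
    for members in bins.values():
        mean_norm = sum(norms[w] for w in members) / len(members)
        for w in members:
            scores[w] = mean_norm - norms[w]
    return scores

# Toy data: three words that land in the same frequency bin.
words = ["said", "father", "clothes"]
freq = {"said": 40, "father": 35, "clothes": 36}
vec = {"said": [1.0, 0.0], "father": [3.0, 0.0], "clothes": [2.0, 0.0]}
scores = norm_importance(words, freq, vec)
```

With a real model, `vec` would come from trained Word2Vec embeddings and `freq` from corpus counts; the scores would then rank words for the kind of anecdotal inspection shown in Figures 3–6.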
If one is interested in a comparative analysis between texts at the macro level, the following comparative metrics might be useful:
- distances between each pair of words
- vector norm vs frequency distributions
- analysis of a difference structure created from two individual structures
- discourse analysis exemplary of differences between texts
While distances between each pair of words are straightforward (albeit computationally expensive), they don’t necessarily capture why two texts are different. Semantic distances could differ for a variety of reasons, but the cause may have more to do with small differences in underlying substructures than with, say, the words with the maximum difference in distances to the other words in the structure. Vector norm vs frequency distributions aim to capture the ‘importance’ of words in this totality, but they are somewhat untested – and what does ‘importance’ mean in this sense anyway?
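A sketch of the pairwise-distance comparison, under the assumption that each text gets its own semantic space and only within-space distances are compared (coordinates from separately trained models are not directly comparable). The toy vectors and the `pairwise_distance_shift` helper are hypothetical:

```python
import math
from itertools import combinations

def pairwise_distance_shift(vec_a, vec_b):
    """For every word pair in the shared vocabulary, measure how much its
    distance differs between the two semantic spaces; large shifts flag
    the pairs whose relationship changed most between the texts."""
    shared = sorted(set(vec_a) & set(vec_b))
    shifts = {}
    for w1, w2 in combinations(shared, 2):
        shifts[(w1, w2)] = abs(math.dist(vec_a[w1], vec_a[w2])
                               - math.dist(vec_b[w1], vec_b[w2]))
    return shifts

# Toy spaces: "queen" sits much farther from "king" in the second text.
vec_a = {"king": (0.0, 0.0), "queen": (1.0, 0.0), "castle": (0.0, 2.0)}
vec_b = {"king": (0.0, 0.0), "queen": (4.0, 0.0), "castle": (0.0, 2.0)}
shifts = pairwise_distance_shift(vec_a, vec_b)
```

Comparing within-space distances rather than raw coordinates sidesteps the fact that two independently trained Word2Vec runs place words in arbitrarily rotated spaces.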
Interpretive discourse analysis is important for any text comparison, but the text window samples would need to somehow be both important and diverse in their function and meanings. They need to somehow link together some of the major themes from the texts in all their varieties and contexts. Only then can a total comparative analysis be performed.
The micro features are those that help one examine the organization of a small portion of the semantic structure – usually the neighborhood around a single word (see my last post as an example). The simplest and most common micro feature is the synonym generator: provide a word and an algorithm will return the n closest (and presumably semantically similar) words. To really understand the semantic structure around a particular word, it is important to examine what forces contributed to the positioning of the word and how discourses compose these forces.
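A minimal synonym generator might look like the sketch below; the toy vector table is a hypothetical stand-in, and a real implementation would read vectors from a trained Word2Vec model instead:

```python
import math

def nearest_words(query, vectors, n=3):
    """Toy synonym generator: return the n words whose vectors have the
    highest cosine similarity to the query word's vector."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    q = vectors[query]
    ranked = sorted((w for w in vectors if w != query),
                    key=lambda w: cos(vectors[w], q), reverse=True)
    return ranked[:n]

# Hypothetical 2-d embeddings; real ones have hundreds of dimensions.
vectors = {"cat": (1.0, 0.1), "kitten": (0.9, 0.15),
           "car": (0.1, 1.0), "truck": (0.05, 0.9)}
print(nearest_words("cat", vectors, n=2))
```

Note that this, like all cosine-based neighbor queries, ignores vector norms entirely – which is exactly the limitation the micro strategies below try to get past.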
To answer these two questions, four specific strategies are proposed:
- words that most significantly organize the word in question
- words that are most significantly organized by the word in question
- co-occurrence frequencies that most determined the position of the word
- sentences which best exemplify the semantic relationships positioning this word
The difference between the first two strategies is subtle and lies in the type of sensitivity analysis. In theory, the words that organize the word in question would be identified by examining the sensitivity of its placement to perturbations of the other words. The words organized by the word in question would be identified as those whose positions are most sensitive to perturbations of the word in question.
Somewhat similar are the co-occurrence frequencies (forces) that strongly contributed to the placement of the word. This would examine how major co-occurrences in the total structure contribute to the word’s positioning. In essence, this would mean a sensitivity analysis that perturbs co-occurrence frequencies to see how the word’s position changes.
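As a toy version of this sensitivity analysis, consider the spring metaphor with a single free word whose neighbors are held fixed: its equilibrium is the co-occurrence-weighted mean of the neighbor positions, and perturbing one co-occurrence weight moves it by a measurable amount. The helpers and numbers below are illustrative assumptions, not the proposed algorithm:

```python
import math

def equilibrium_position(weights, positions):
    """One-word spring equilibrium: with neighbor positions held fixed,
    the free word settles at the co-occurrence-weighted mean of them."""
    total = sum(weights.values())
    dims = len(next(iter(positions.values())))
    return tuple(sum(weights[w] * positions[w][d] for w in weights) / total
                 for d in range(dims))

def cooccurrence_sensitivity(weights, positions, neighbor, delta=1.0):
    """Perturb one co-occurrence weight and report how far the word moves."""
    before = equilibrium_position(weights, positions)
    perturbed = dict(weights, **{neighbor: weights[neighbor] + delta})
    after = equilibrium_position(perturbed, positions)
    return math.dist(before, after)

# Toy setup: a word pulled toward "king" (strongly) and "castle" (weakly).
positions = {"king": (0.0, 0.0), "castle": (4.0, 0.0)}
weights = {"king": 3.0, "castle": 1.0}
shift = cooccurrence_sensitivity(weights, positions, "castle", delta=4.0)
```

In a full implementation the whole system would re-equilibrate after each perturbation; holding the neighbors fixed is the simplification that keeps this sketch closed-form.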
If language has meaning through context, discourse analysis should provide a few representative samples from each of the major semantic categories in which the words are used. An algorithm would need to a) identify different semantic categories to which discourse might belong and then b) produce sample text windows that illustrate their use. This measure needs to capture how the word is being used relative to major themes in the text; how the word fits into the total structure of meaning found in the text. Admittedly this portion of the strategy is less developed; more experimentation is needed.
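As a placeholder for steps (a) and (b), one crude sketch buckets text windows by hand-picked category seed words rather than learned semantic categories – an assumption standing in for the still-undeveloped clustering step:

```python
def representative_windows(target, windows, category_seeds):
    """Crude stand-in for semantic-category sampling: bucket the text
    windows containing `target` by which category seed word co-occurs,
    then return the first window found for each category."""
    samples = {}
    for window in windows:
        if target not in window:
            continue
        for category, seeds in category_seeds.items():
            if category not in samples and any(s in window for s in seeds):
                samples[category] = window
    return samples

# Toy windows for the classically ambiguous word "bank".
windows = [["the", "bank", "approved", "the", "loan"],
           ["we", "sat", "on", "the", "river", "bank"],
           ["the", "bank", "raised", "interest", "rates"]]
category_seeds = {"finance": ["loan", "interest"],
                  "geography": ["river", "shore"]}
samples = representative_windows("bank", windows, category_seeds)
```

A real implementation would discover the categories from the semantic space itself (e.g. by clustering the context vectors) instead of relying on seed lists.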
Empirical Results on Gutenberg Texts
As a precursor to the actual implementation of these strategies, I wanted to pursue the “word importance” measurement based on vector norm relative to word frequency (Schakel 2015). Figure 1 shows a distribution that peaks at a word frequency around 20–50; the authors state that the downward decline occurs because high-frequency words appear in a larger variety of contexts. The upward curve occurs simply because low-frequency words do not have a strong relevance to any one context. While my analysis (shown in Figure 2) did not find this drastic downward curve, that could be due to the text under analysis; they analyzed a single subject class in the arXiv database.
Figure 2: Scatter-plot of all words in the chesterton-brown text from the Gutenberg archive.
The method suggests that words are more important if they have a relatively low vector norm for their frequency bin. Words that appear on the bottom portion of the curve are then considered to be more important. The frequency axis on which the words lie should also be taken into consideration. Because the words with the highest frequency are “the” and “if”, it might be fair to say that words used frequently and in many contexts are more abstract, like function words or verbs; there is evidence for this in (Schakel 2015) based on part-of-speech tagging as well.
Figures 3–6 show zoomed-in versions of Figure 2 in four locations: the low-frequency lower side, low-frequency upper side, high-frequency lower side, and high-frequency upper side. These are displayed with labels to anecdotally examine the possibility of using this importance measure.
Figures 3 and 4 show the more and less important sides (respectively) of the low-frequency part of the curve in Figure 2. The supposedly more important side shows words like “felt”, “yes”, “alone”, “part”, “fact”, “nose”, “no”, “believe”, and “speak”. The supposedly less important words are “clothes”, “trick”, “captain”, “itself”, “near”, and “followed”. This quick examination suggests that more abstract words may actually be on the upper end of the curve.
Figure 3: Low frequency words on the lower side of the curve are considered to be more concrete and more important.
Figure 4: Low frequency words on the upper side of the curve are considered to be more concrete and less important.
Figures 5 and 6 show words on the lower and upper sides of the high-frequency area of the curve. Supposedly more important words are “said”, “father”, “is”, “as”, “had”, “there”, “man”, “one”, and “none”. Supposedly less important words are “down”, “into”, “up”, “from”, “their”, “back”. These words do not immediately strike one as more or less important, but the purpose of this analysis is for the comparison of texts. If “said” and “father” are more important in the chesterton-brown corpus, might they be less important in another text? And might that difference be meaningful? All of these are open questions.
Figure 5: High frequency words on the lower side of the curve are considered to be more abstract and more important.
Figure 6: High frequency words on the upper side of the curve are considered to be more abstract and less important.
While this is purely anecdotal evidence, it was convenient to examine the words through a simple visual analysis. It is unclear what the relative positions of words on this curve actually mean, but it could be useful when combined with some of the other methods proposed in this post – particularly the structural importance measures and the discourse analysis.
Open Methodological Questions
Finally, there are several open methodological questions surrounding the pre-processing of texts and Word2Vec configuration parameters:
- what effect does the dimensionality parameter have on structure?
- how does corpus size affect the metrics described here?
- how would window sizes and skip-gram models affect structural organization or discourses?
Schakel, A. M. J., & Wilson, B. J. (2015). Measuring Word Significance using Distributed Representations of Words. arXiv:1508.02297 [cs.CL]. Retrieved from http://arxiv.org/abs/1508.02297