EasyText Python Package

I just published a Python package called “EasyText” that was funded as part of a UCSB undergraduate instructional development grant with John W. Mohr. The project came about as a follow-up to Dr. Mohr’s Introduction to Computational Sociology course that I helped with in Spring 2016 (more about that). The goal of the project is to bring a broad range of text analysis tools into a single interface, particularly one that can be run from the command line or with a minimal amount of Python code. Try it out using pip: “pip install easytext” (see the PyPI page).

The command line interface is particularly focused on generating spreadsheets that students can then view and manipulate in a spreadsheet program like Excel or LibreOffice. Students can perform interpretive analysis by moving between EasyText output spreadsheets and the original texts, or feed the output into a quantitative analysis program like R or Stata. The program supports simple word counting, noun phrase detection, Named Entity Recognition, noun-verb pair detection, entity-verb detection, prepositional phrase extraction, basic sentiment analysis, topic modeling, and the GloVe word embedding algorithm.
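
The full list of subcommands and their options can be browsed from the terminal itself; assuming the CLI follows the usual argparse conventions (the post does not spell this out, so treat this as a sketch), the standard help flags should print them:

python -m easytext --help
python -m easytext topicmodel --help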

While there are debates about the role of topic modeling and other algorithmic approaches to text analysis requiring interpretation (see one of my responses here), our undergraduate students have shown enthusiasm and diligence in considering the limitations and strengths of such tools (see an example from a student I mentored). In many ways, their experiences with text analysis algorithms have forced them to think beyond the familiarity of p-values and confidence intervals to establish different kinds of patterns in the social world – ones that may be partially out of reach for classical sociological research methods.

The full introduction to the Python API can be found on the README.md page, and the command line interface documentation can be found in Command_Reference.md.

As an example scenario, suppose you have a spreadsheet of document names and texts called “mytextdata.xls”. Let’s assume the column of document names is called “title” and the column of texts is simply “text”. To run a topic model of this text data with 10 topics and write the output to “mytopicmodel.xls”, we would use the following command:

python -m easytext topicmodel -n 10 mytextdata.xls --doclabelcol "title" --textcol "text" mytopicmodel.xls

The topic model output spreadsheet contains four sheets: “doc_topic”, “topic_words”, “doc_summary”, and “topic_summary”. 

The doc_topic sheet contains documents as rows and topic probabilities as columns, and topic_words contains topics as rows and word probabilities as columns. The doc_summary and topic_summary sheets are meant to assist with interpretation: they list the topics most closely associated with each document and the words most closely associated with each topic, respectively.
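
Because the output is an ordinary Excel workbook, it can also be pulled straight into Python for quantitative follow-up. Here is a minimal sketch, assuming pandas is installed along with an engine that can read .xls files (such as xlrd):

import pandas as pd

# read all four sheets of the topic model output into a dict of DataFrames
sheets = pd.read_excel("mytopicmodel.xls", sheet_name=None)

doc_topic = sheets["doc_topic"]      # documents as rows, topic probabilities as columns
topic_words = sheets["topic_words"]  # topics as rows, word probabilities as columns

# glance at the two interpretation aids
print(sheets["doc_summary"].head())
print(sheets["topic_summary"].head())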

Any topic model interpretation of course relies on referring back to the original documents themselves, but this spreadsheet is designed to help with the process of linking the statistical topic model to the content and form of the texts.
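
For instance, a few lines of pandas can pull up the documents most strongly associated with a given topic and preview their original texts. This is only a sketch: it assumes the first column of the doc_topic sheet holds the document labels, and the actual column layout should be checked against the spreadsheet itself.

import pandas as pd

# original titles and texts, indexed by document title
texts = pd.read_excel("mytextdata.xls").set_index("title")

# document-topic probabilities; assume the first column holds the document labels
doc_topic = pd.read_excel("mytopicmodel.xls", sheet_name="doc_topic")
doc_topic = doc_topic.set_index(doc_topic.columns[0])

# take the first topic column and find its five most representative documents
first_topic = doc_topic.columns[0]
top_titles = doc_topic[first_topic].nlargest(5).index

for title in top_titles:
    print(title, "->", texts.loc[title, "text"][:200])  # first 200 characters of each text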

Further documentation is needed to turn this into a full instructional tool, but this is a good first step toward that end.
