I recently created my first Python library, DocTable, which I’ve been using for most of my text analysis projects. I’ve uploaded it to PyPI, so anyone can install it with the command “pip install doctable”. You can think of DocTable as a class-based interface for working with tables of data.
Using DocTable, the general flow of my text analysis projects involves (1) parsing and formatting raw data and metadata into a DocTable, (2) preprocessing the text into token lists at the document or sentence level, (3) storing the processed Python objects back into the DocTable, and (4) querying the DocTable entries for ingestion into word embedding, topic modeling, sentiment analysis, or some other algorithm. This process makes storing large-ish datasets fairly easy, combining the ease of working with Pandas DataFrames with the advanced query and storage capabilities of sqlite.
DocTable works by providing a single base class to be subclassed (see the example “Articles” class below). The user-created class defines the schema, index tables, and other table information for the underlying database table. It also inherits the .add(), .update(), .delete(), .get(), and .getdf() methods from DocTable for storing and retrieving data.
Two aspects of this library make it useful for text analysis: (1) it handles the storage of Python objects, using the pickle library to convert between Python objects and the “blob” type supported by sqlite, and (2) queries can be returned as Pandas DataFrames that fit within a regular text analysis workflow.
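To make the first point concrete, here is a minimal sketch of a pickle-to-blob round trip using only the standard sqlite3 and pickle modules. It illustrates the general technique and is not DocTable's own code; the table name "docs" and the column name "tokens" are made up for the example.

```python
import pickle
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(':memory:')
conn.execute('create table docs (id integer primary key, tokens blob)')

# A processed Python object, e.g. a list of tokenized sentences.
tokens = [['hello', 'world'], ['text', 'analysis']]

# Serialize to bytes; sqlite3 stores bytes objects as BLOB values.
conn.execute('insert into docs (tokens) values (?)', (pickle.dumps(tokens),))

# Read the bytes back and restore the original object.
(blob,) = conn.execute('select tokens from docs where id = 1').fetchone()
restored = pickle.loads(blob)
conn.close()
```

The restored object compares equal to the original list, so nested Python structures survive the trip through the database. As with any use of pickle, it is only safe to load blobs from database files you trust.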
The example below shows a DocTable subclass called Articles. You can see from the first line that it inherits from the DocTable class, and that its constructor calls the base class __init__ (the super().__init__() call) to define the database file name, the table name, and the column schema. The column schema parameter is the exact string that will be used in the sqlite table creation query, allowing for transparent access to features of sqlite. The colschema also allows for “blob” type fields, which are automatically converted to and from Python objects when adding, updating, and querying data.
from doctable import DocTable

class Articles(DocTable):
    def __init__(self, fname):
        '''
        Class for interfacing with DocTable object.
        '''
        tabname = 'articles'
        super().__init__(
            fname=fname,
            tabname=tabname,
            colschema='id integer primary key autoincrement, \
                file_id int, \
                category string, \
                raw_text string, \
                UNIQUE(file_id), \
                ',
        )
        self.query("create index if not exists idx1 on " \
            + tabname + "(category)")
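Since the column schema is passed through to sqlite, it can help to see roughly what the resulting SQL does. The sketch below uses the standard sqlite3 module directly and assumes the schema string is placed inside a "create table" statement; that statement form is my assumption, not DocTable's actual code. The trailing comma after the UNIQUE clause is left off here, since how DocTable treats it is not shown in this post.

```python
import sqlite3

tabname = 'articles'
colschema = (
    'id integer primary key autoincrement, '
    'file_id int, '
    'category string, '
    'raw_text string, '
    'UNIQUE(file_id)'
)

conn = sqlite3.connect(':memory:')
# Assumed form of the statement the schema string ends up in.
conn.execute('create table if not exists {} ({})'.format(tabname, colschema))
# Same index statement as in the Articles constructor.
conn.execute('create index if not exists idx1 on ' + tabname + '(category)')

conn.execute(
    "insert into articles (file_id, category, raw_text) values (1, 'news', 'first')")
try:
    # Same file_id again: the UNIQUE(file_id) constraint rejects it.
    conn.execute(
        "insert into articles (file_id, category, raw_text) values (1, 'news', 'dup')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# List the indexes sqlite now holds for the table.
index_names = [row[0] for row in conn.execute(
    "select name from sqlite_master where type = 'index' and tbl_name = 'articles'")]
conn.close()
```

The second insert fails with an IntegrityError, and idx1 shows up in sqlite_master next to the index sqlite creates automatically to enforce the UNIQUE constraint.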
To use the DocTable, simply make an instance of the class with the desired filename (the sole parameter of the Articles constructor). This instance maintains a single sqlite database connection and creates a new cursor for every query call. If no existing database file is found, it will create a new .db file with one empty table (called “articles” in this example).
art = Articles('articles.db')
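The connection behavior described above (one long-lived connection, a fresh cursor per query, and a database file created on first use) can be sketched with the standard sqlite3 module. MiniTable is a hypothetical stand-in written for this post, not DocTable's implementation, and its query method is only assumed to resemble the real one.

```python
import os
import sqlite3
import tempfile

class MiniTable:
    '''Hypothetical stand-in, not DocTable's actual code.'''
    def __init__(self, fname, tabname, colschema):
        self.tabname = tabname
        # sqlite3.connect creates the file if it does not exist yet.
        self.conn = sqlite3.connect(fname)
        self.query('create table if not exists {} ({})'.format(tabname, colschema))

    def query(self, sql, params=()):
        # A fresh cursor per call, all sharing the one connection.
        cur = self.conn.cursor()
        cur.execute(sql, params)
        rows = cur.fetchall()
        self.conn.commit()
        return rows

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, 'articles.db')
    existed_before = os.path.exists(fname)
    table = MiniTable(fname, 'articles',
                      'id integer primary key autoincrement, category string')
    exists_after = os.path.exists(fname)
    row_count = table.query('select count(*) from articles')[0][0]
    table.conn.close()
```

The file does not exist before the instance is created, exists afterwards, and holds one empty table. Creating a second instance with the same filename would reuse the existing table because of the "if not exists" clause.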
P.S. Shoutout to Joel Barmettler for their tutorial on publishing to PyPI.