Dr. Stephen Wittek, Postdoctoral Fellow, McGill
i. Project Goals and Description
ii. Corpus Building
iii. Analysis and Visualizations: Voyant
iv. Topic Modeling: The Topic Modeling Tool
v. Topic Visualization: Paper Machines
The following is a brief description of some of the work I have been doing over the past five months in my capacity as a digital humanities researcher for the Early Modern Conversions project at McGill University. My academic specialization is in literature—specifically, early modern drama and early modern news. I hope the following record of my recent experiments will be of interest to academics with a similar background and similar research interests.
i. Project Goals and Description
My project began with the basic goal of conducting textual analysis on a vast corpus of playtexts and news documents from 1590 to 1630 in pursuit of patterns or other discoveries that might lead to more detailed research questions. In addition, I simply wanted to get a hands-on sense of the benefits computer-based textual analysis might have to offer. Generally speaking, I am happy with the results I have achieved on both fronts, but there is clearly a very long way to go.
The tight, forty-year delimitation for my project (1590-1630) brings exclusive focus to a period of extraordinary evolution in news culture, beginning with the first instance of serialized news in England (John Wolfe’s semi-regular series of reports on the French wars of the 1590s), and ending in the decade that saw the emergence of news as a regular, permanent feature of the marketplace (a development ushered in by the corantos and news pamphlets published by Nathaniel Butter and Nicholas Bourne in the 1620s). Notably, this period of expansion in news culture coincided with an era of theatrical production and innovation, an era that has left us with the plays of William Shakespeare, Ben Jonson, and Thomas Middleton.
In my book, Shakespeare and the Idea of News, I argue that commercial theatre and commercial news followed a symbiotic, rather than mutually exclusive, course of development. Both forms helped to build a culture market based on new ways of knowing and new ways of being social, and both helped to usher in one of the most important literary innovations of the past four hundred years: the modern idea of news. Rudimentary concepts of news had existed for millennia before Shakespeare’s time, of course, but the categorical distinction between ‘news’ and ‘history’ did not emerge until regular mechanisms of dissemination, including theatre, gave rise to the notion of a massive, ongoing conversation among strangers—a rush of information constantly pouring into a public ‘now’.
To augment my research, I wanted to use digital tools to get a macroscopic view of a corpus that would take several years to analyze by conventional means. Franco Moretti has famously described this method of textual analysis as “distant reading” [1]. I had already done a little bit of very basic research along these lines using the Shakespeare Concordance, a remarkably useful website that facilitates simple textual analysis of Shakespeare’s works. (See here for an example of how I integrated the results into a critical argument). For my digital humanities project, I envisioned something similar, but significantly bigger in terms of complexity and scope.
My intuition was that the subtle confluence between news and drama might become more apparent with the help of tools that could identify semantic resonance and track it over time. The validity of this hypothesis is still an open question, but the modest success I have achieved thus far offers grounds for cautious optimism—as I hope the following description will demonstrate.
ii. Corpus Building
Corpus building is the process of gathering, formatting, and configuring the body of electronic files that will become the subject of textual analysis. Early modern texts present a special challenge for this process in a number of areas.
First, full-text transcriptions are not available for a significant portion of surviving materials. Only 40,061 (roughly 31%) of the 127,606 texts available through Early English Books Online (EEBO) exist in a machine-readable, full-text version. The remaining 69% exist only as GIF files (or electronic photographs). I had hoped that Optical Character Recognition (OCR) software might offer a convenient solution to this difficulty, but the challenge is much more significant than I imagined. Graphic irregularities, severe degradation, and the generally poor quality of the photographs combine to make early modern texts impervious to OCR. For a quick illustration, here is an example of the sort of text that I might want to transcribe:
Here it is again as translated by my OCR tool, ABBYY FineReader Express:
In short, editing the automatic transcriptions would require almost as much work as transcribing the same documents manually, and quite possibly more.
For the record, I should note that in all likelihood the availability of full-text versions of early modern texts will no longer be an issue ten years from now, thanks in large degree to the efforts of the Text Creation Partnership, an initiative that aims to create standardized, accurate XML/SGML-encoded electronic text editions of all documents on EEBO. On a similar note, OCR software for early modern texts is also likely to get much better in the near future with the benefit of research from the Early Modern OCR Project.
Unfortunately, however, I cannot afford to wait ten years.
Originally, my plan was to develop two corpora: one comprising all commercially printed drama from 1590 to 1630, and another comprising all commercially printed news products from 1620 to 1625 (the first five years of regular, serialized news). As it turned out, full-text versions of everything I needed for the drama corpus were available on EEBO, but there was very little full-text material for news. After spending a good deal of time trying to devise a viable solution to this dilemma, I eventually decided to expand the scope of the second corpus from ‘Commercial News 1620-25’ to ‘Commercial Print 1620-25’, thereby moving to a collection of everything EEBO had to offer in full text for the five-year period: ballads, pamphlets, sermons, official proclamations, corantos, drama, etc. This compromise shifted the research focus from ‘conventional news products’ to ‘news as it became manifest across commercial print in general’. The adjustment actually suits me quite well because my argument hinges on the assertion that the idea of news evolved in conjunction with interconnectivity and transference among multiple forms of public discourse. Thus, from a theoretical perspective, there is no good reason why ballads and sermons could not count as ‘news’.
The process of harvesting materials from EEBO was extremely labor-intensive. The site is set up to make automated mass downloading difficult (although a researcher with more technical savvy could probably find a way). Unable to come up with a better solution, I downloaded each document individually, following an intensely tedious workflow. Here are the steps, for those of you who may want to try something similar:
-From the EEBO search page, select “items with keyed full text” in the ‘LIMIT TO’ field and set the temporal parameters in the ‘LIMIT BY DATE’ field.
-Each item in the list of search results will have a check box in the top left-hand corner. Click the box for the item you want to download in order to add it to your ‘marked list’.
-Select ‘MARKED LIST’ from the toolbar at the top of the screen. This will take you to a list of the documents you have checked. The title for each document will be a hyperlink. Control-click on the hyperlink (or right-click on a PC) and select ‘Download linked file’. This will save an HTML version of the full-text document to your downloads folder.
-You should find an HTML file entitled “full_rec” in your downloads folder. To rename it, select and copy the title of your document from the Marked List in EEBO and paste it into the title field for the HTML file in your Downloads folder. (For a scripted alternative to this manual renaming, see the sketch after this list.)
-Repeat 1,173 times (or as required).
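The renaming step, at least, can be partially automated. Here is a minimal Python sketch assuming a hypothetical CSV file (eebo_titles.csv) that maps each downloaded file to the title copied from the Marked List; the filenames and CSV layout are my own assumptions, not part of EEBO.

```python
# Hypothetical helper: batch-rename downloaded EEBO files using a CSV that
# maps each download (e.g., "full_rec(37).html") to its document title.
# The CSV columns ("filename", "title") are assumptions for illustration.
import csv
import pathlib

downloads = pathlib.Path.home() / "Downloads"

with open("eebo_titles.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        src = downloads / row["filename"]
        # Strip characters that are illegal in filenames before renaming.
        safe_title = "".join(c for c in row["title"] if c.isalnum() or c in " -_")
        if src.exists():
            src.rename(downloads / (safe_title[:120] + ".html"))
```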
By the time I had finished harvesting, I had compiled two corpora:
But the corpus-building phase of my project was by no means complete. The next task on the agenda was to convert all of the HTML files to plain text, which is the best format for textual analysis. I managed to get this job done quite easily and quickly with the help of the Document Orderly Converter, a program from the Mac App Store that enables batch conversions for files in a variety of formats.
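For researchers without access to that particular app, a minimal batch conversion can also be scripted. Here is a sketch using the BeautifulSoup library; the folder names are assumptions.

```python
# A minimal HTML-to-plain-text batch pass using BeautifulSoup, offered as an
# alternative to a dedicated converter app. Folder names are assumptions.
import pathlib
from bs4 import BeautifulSoup  # pip install beautifulsoup4

src_dir = pathlib.Path("corpus_html")
out_dir = pathlib.Path("corpus_txt")
out_dir.mkdir(exist_ok=True)

for html_file in src_dir.glob("*.html"):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    text = soup.get_text(separator="\n")  # drop tags, keep readable line breaks
    (out_dir / (html_file.stem + ".txt")).write_text(text, encoding="utf-8")
```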
The files also required some tidying up. All text files on EEBO include cataloguing information at the top of the document (for example, “6Kb, A Text Creation Partnership digital edition TCP Phase II Added to EEBO in June 2011”). This material has to be stripped out or it will seriously skew the word frequency results. I was able to get rid of most of it automatically using a program called TextSweep, but a little manual editing was also required.
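A pass of this sort can also be scripted. The sketch below assumes the catalogue line always resembles the example quoted above; that pattern is an assumption that would need checking against the actual files, which is exactly why some manual editing remains necessary.

```python
# A rough header-stripping pass. The regular expression is an assumption
# based on the catalogue line quoted above; manual spot-checks still needed.
import pathlib
import re

HEADER = re.compile(
    r"^\s*\d+Kb,.*Text Creation Partnership.*Added to EEBO.*$",
    re.IGNORECASE | re.MULTILINE,
)

for txt in pathlib.Path("corpus_txt").glob("*.txt"):
    cleaned = HEADER.sub("", txt.read_text(encoding="utf-8"))
    txt.write_text(cleaned, encoding="utf-8")
```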
The next challenge to overcome was spelling variance. Standardized spelling had yet to emerge in early modernity—writers had the freedom to spell however they pleased. To take a famous example, the name ‘Shakespeare’ has 80 different recorded spellings, including ‘Shaxpere’ and ‘Shaxberd.’ As one might imagine, variance on this scale seriously undermines the efficacy of frequency analysis. How is it possible to track the incidence of a specific word, or group of words, if any given word could appear in an unknown number of variant forms?
Fortunately, Dr. Alistair Baron at Lancaster University in the UK has developed software that directly addresses the problem of spelling variance in early modern texts, and he was kind enough to share it with me for research purposes. His program, VARD 2, helps to improve the accuracy of textual analysis by finding candidate modern-form replacements for spelling variants in historical texts. As with conventional spell-checkers, the user can choose to process texts manually (selecting a candidate replacement offered by the system), automatically (allowing the system to use the best candidate replacement found), or semi-automatically (training the tool on a sample of the corpora). I ran both of my corpora through VARD 2 and found that it corrected spelling variance to a remarkable (though not entirely perfect) degree. Thank you very much, Dr. Baron!
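To give a sense of the underlying idea (though emphatically not of VARD 2’s actual machinery, which is far more sophisticated), here is a crude dictionary-lookup sketch in Python; the variant mappings are illustrative assumptions.

```python
# VARD 2 is far more sophisticated, but the basic idea can be sketched as a
# lookup from variant spellings to modern forms. This mapping is an
# illustrative assumption, not VARD 2's actual dictionary.
VARIANTS = {
    "shaxpere": "shakespeare",
    "shaxberd": "shakespeare",
    "newes": "news",
    "vertue": "virtue",
}

def normalize(text: str) -> str:
    # Replace each token with its modern form where one is known.
    return " ".join(VARIANTS.get(tok.lower(), tok) for tok in text.split())

print(normalize("Shaxberd wrote newes of vertue"))
# -> "shakespeare wrote news of virtue"
```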
This is what the interface for VARD 2 looks like:
Having met the challenges of harvesting, converting, editing, and correcting for spelling variance, the only corpus-building task that remained was the relatively straightforward job of configuring various versions of the corpora to meet the demands of different sorts of research questions. For example, I created one version of the drama corpus that organized the files into folders according to author, another that organized the files according to performance year, and yet another that organized them according to publication year. In addition, for each corpus, I used a concatenation tool called Gelatin to merge all of the files into a single ‘super file’.
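For readers without a concatenation tool to hand, the ‘super file’ step can be approximated in a few lines of Python; the folder layout assumed below (one folder per performance year) is hypothetical.

```python
# A simple stand-in for the concatenation step: merge every text file in a
# per-year folder into one 'super file'. The folder layout is an assumption.
import pathlib

corpus = pathlib.Path("drama_by_year")  # e.g., drama_by_year/1624/*.txt
for year_dir in sorted(p for p in corpus.iterdir() if p.is_dir()):
    merged = "\n".join(f.read_text(encoding="utf-8")
                       for f in sorted(year_dir.glob("*.txt")))
    (corpus / f"{year_dir.name}_super.txt").write_text(merged, encoding="utf-8")
```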
Thus, the task of corpus building was finally complete.
But before moving on to a description of my preliminary analysis, I should briefly note some features of my corpora that necessarily delimit the sorts of conclusions textual analysis might support. First, it is important to remember that, despite my orientation toward macroscopicity, any view of early modern print culture is inevitably partial. This point is particularly pertinent in regard to inexpensive, ephemeral material such as playtexts and news products, which survive in relatively small numbers representing no more than an unknown fraction of the total body of texts in circulation during the period. Blanket generalizations such as “early modern printed texts mentioned Mercury more often than any other planet” are therefore impossible to support. On a similar note, my ‘Commercial Print’ corpus is arbitrarily limited by the availability of full-text documents on EEBO. This situation will undoubtedly improve, but as circumstance would have it, my only option was to proceed with the corpus I had, not the corpus I wanted. Finally, it is important to remain mindful of a certain degree of bibliographical inaccuracy in regard to authorship and date. Drama presents a special problem here because a play usually has two dates (one for performance and one for publication) and is very often the product of collaboration. To keep matters simple, my corpora use all bibliographical information as given on EEBO, even though I am certain the information is not 100% correct.
iii. Analysis and Visualizations: Voyant
My first stop in the analytical process was Voyant, a web-based reading and analysis environment for digital texts developed by Stéfan Sinclair, digital humanities professor here at McGill and one of my colleagues on the Early Modern Conversions project.
Voyant is an extremely powerful suite of analytical and visualization tools. Despite its deep complexity, however, the knowledge threshold for getting started is very low. One can begin to get results immediately, without reading any instructions or having any prior experience with textual analysis. Uploading a corpus is no more difficult than adding an attachment to an email.
For my first test, I uploaded a special version of my drama corpus that concatenated the files by performance year (thus creating a total of forty ‘super’ text files, one for each year from 1590 to 1630). When the files had uploaded, I went to Settings to enter a list of stop words specially tailored for use with early modern texts. ‘Stop words’ are terms that one wants the software to ignore—little words like ‘and’ or ‘the’ that obscure the results of frequency analysis. You can view my list of early modern stop words here.
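To make the effect of a stop list concrete, here is a toy Python sketch of frequency counting that skips listed terms. The handful of stop words shown is a tiny assumed sample, not my full early modern list.

```python
# How a stop-word list shapes frequency results: a minimal count that skips
# listed terms. The stop words shown are a small assumed sample.
from collections import Counter
import re

STOP_WORDS = {"and", "the", "thee", "thou", "hath", "doth"}

def word_frequencies(text: str) -> Counter:
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(word_frequencies("The knight hath come, and the knight doth love").most_common(3))
# -> [('knight', 2), ('come', 1), ('love', 1)]
```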
The following are the results of my first test:
Beginning in the top left corner, one sees a word cloud—a fairly common visualization that represents the frequency of keywords in terms of font size. (Click here for a larger image). At a glance, it shows that the highest-frequency words in my drama corpus were ‘good’ and ‘come’. More interestingly, perhaps, the words ‘love’, ‘isle’, and ‘king’ also figure very prominently.
Below the word cloud, there is a summary that lists statistics for basic categories such as word count, vocabulary density, word frequency, etc. (Click here for a larger image). Most interestingly, the summary also lists words that have a notably high frequency for each year: ‘Rome’ and ‘death’ appeared with particular frequency in 1594, while ‘virtue’ and ‘envy’ stood out in 1612. One can also begin to recognize the influence of individual plays in this data. For example, ‘Iago’ and ‘Cassio’ are the top distinctive words for 1603 (Othello is one of nine plays in the corpus for that year).
Moving to the bottom left corner, one sees an ordered list of frequencies for each word in the corpus accompanied by a thumbnail graph that tracks the frequency of words over the forty-year delimitation. At a glance, the tool shows a significant spike for the word ‘knight’ in 1624. What could be going on?
To find out, all one has to do is click on the word ‘knight’ and consult the ‘Corpus Reader’ tool in the middle of the page. This very handy application enables users to drill down into the corpus to examine the context for particular terms. After selecting ‘1624’ from the multicolored vertical strip on the right (which looks something like a paint swatch), I was able to scroll through the document and survey highlighted instances of the word ‘knight’. Within a few minutes, I had solved the mystery: 1624 was the performance year for Middleton’s A Game at Chess, which features a number of characters with the word ‘Knight’ in their name (The Black Knight, The White Knight, The Black Knight’s Pawn, etc.). Here is an example:
To the right of the screen, the ‘Word Trends’ tool shows a larger version of the frequency graph. By selecting multiple terms, a user can compare the frequencies of target words across the corpus. For example, here’s a graph that shows the relative frequencies of the words ‘knight,’ ‘true’, and ‘rich’:
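Anyone who wants to reproduce this sort of trend data outside of Voyant could compute it directly. Here is a rough Python sketch that tallies relative frequencies (per 10,000 words) for selected terms across the yearly super files; the file paths and naming scheme are my own assumptions.

```python
# Relative frequency (per 10,000 words) for selected terms across the yearly
# 'super files', mirroring what Word Trends plots. Paths are assumptions.
import pathlib
import re
from collections import Counter

TERMS = ["knight", "true", "rich"]

for super_file in sorted(pathlib.Path("drama_by_year").glob("*_super.txt")):
    tokens = re.findall(r"[a-z']+", super_file.read_text(encoding="utf-8").lower())
    counts = Counter(tokens)
    rates = {t: 10000 * counts[t] / max(len(tokens), 1) for t in TERMS}
    print(super_file.stem, rates)
```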
My examples thus far are really only the tip of the iceberg—Voyant can do much, much more. See here for a full list of all available tools.
iv. Topic Modeling: The Topic Modeling Tool
Topic modeling works something like the frequency analysis in Voyant, but rather than tracking a single word, it identifies groups of words that frequently cluster together. ‘Topic’ is a misleading term for these clusters, in my opinion, but that does not make the results any less astounding. For an eerily impressive example of what topic modeling can do, consider the following list of topics that Matt Jockers harvested from a collection of 4,342 British, Irish, and American novels [2]:
It does not take much imagination to see that Topic 12 centers on Aboriginal Americans, and Topic 64 centers on Ireland. These results are exciting because they point toward a possible strategy for identifying the ‘subtle confluences between news and drama’ that I set out to study.
Before popping the cork on a bottle of Champagne, however, it is important to note that the results of topic modeling analysis are not always as coherent as Jockers’ example may suggest, and the process is considerably more complex than a straightforward frequency search. Most crucially, there are a number of parameters that one must tinker with in order to tease out meaningful clusters. Settings such as the number of topics, the number of iterations, the number of topic words printed, the topic proportion threshold, etc. all have an impact on the eventual outcome—and the optimal configuration of settings varies from corpus to corpus. The process is a little bit like adjusting the graphic equalizer on a stereo to get the perfect setting for a specific song.
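To make these parameters concrete, here is a minimal, generic LDA sketch using scikit-learn. To be clear, this is neither the Topic Modeling Tool nor its MALLET backend, and the toy documents and settings are assumptions chosen purely for illustration.

```python
# A scriptable LDA sketch showing the kinds of knobs the Topic Modeling Tool
# exposes: topic count, iterations, and words printed per topic. This is
# generic scikit-learn LDA, not MALLET; documents and settings are toy values.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the king sails for spain", "the prince returns to england",
        "faith and grace in christ", "money makes the fool a knight"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,   # number of topics
                                max_iter=200,     # number of iterations
                                random_state=0)
lda.fit(X)

words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:5]]  # 5 words per topic
    print(f"Topic {i}: {', '.join(top)}")
```

As the analogy above suggests, the same corpus run with a different topic count or iteration limit can produce noticeably different clusters, which is why the settings deserve experimentation.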
I conducted my first cluster-frequency experiments with David Newman’s Topic Modeling Tool, a Java-based graphical interface to MALLET’s implementation of LDA (Latent Dirichlet Allocation) topic modeling. Here are the results I got from the ‘Commercial Print 1620-25’ corpus, using my early modern stop words list and the default settings (# of iterations: 200; # of topic words printed: 10; topic proportion threshold: 0.05):
These results are fairly encouraging for an initial trial run. One can begin to see the contours of a coherent theme in at least seven of the ten topics: Latin (topic 1), Religion (topics 2, 4, 6, and 8), the monarchy (topic 5), and money (topic 7). Topic 5 is of particular interest to me because it seems to bear a close relation to news discourse of the period. Is this a semantic cluster for discussion surrounding the beginning of the Thirty Years War?
Note that each title in the list of topics is a hyperlink. By clicking on a link, users can navigate to a list of top-ranked documents for the selected topic. Here are the results for Topic 5:
Not surprisingly, the documents at the top of the list are mostly political pamphlets by polemicists such as Thomas Scott, Edmund Bolton, and John Reynolds. But one does not have to go too far down the list to find a play by one of the newsiest dramatists of the period, Ben Jonson. Other plays by Middleton and Massinger show up not too far thereafter. (Click here for the full list).
By clicking on a title, a user can get data for a specific document. Here are the results for Jonson’s Epicoene, or the Silent Woman:
As in Voyant, the user can drill down to the source document, but this capability seems less useful when one is dealing with semantic clusters rather than individual words. In addition, one cannot scroll through the entire document, as in Voyant, and the topic words are not highlighted. Overall however, the results of my experiments with the Topic Modeling Tool were very encouraging. I started to look around for answers to two very pressing questions: 1) Is there a better way to look at this data? and 2) Is it possible to chart and compare the progression of topics over time?
v. Topic Visualization: Paper Machines
Paper Machines is a suite of visualization tools that functions something like Voyant, but also has topic-modeling capabilities. It works in conjunction with Zotero, a bibliographic management program that enables users to collect, organize, cite, and share research sources with ease.
One of the great benefits of Zotero is that it allows for the creation and storage of extensive metadata for each text file in the corpus. ‘Metadata’ is all of the extra information that helps to keep documents organized: publication details, catalogue numbers, ISBN codes, notes, tags, etc. Recall that when I was using Voyant, I had to create different configurations of my corpora and re-title the files accordingly depending on the sort of question I wanted to ask. Such measures are less necessary with Paper Machines because Zotero keeps track of metadata automatically—the information is not stored in the file itself or the file title, but in a small, appended file that Zotero uses to keep the file organized. (It may help to think of the metadata file as an analog to the label on the spine of a library book.)
Of course, in my case, all of the files for both corpora were metadata-free, which meant I would have to tackle the chore of entering publication information for a thousand or so individual files. Here are the steps I followed:
-Upload files to Zotero library
-Control-click on the name of a file and select ‘Create Parent Item’.
-A field for entering data to the parent item (or metadata file) will appear in the column on the right. A user can enter information for the author, publication date, call number, publisher, etc. In addition, by clicking one of the four tabs that run across the column on the right, a user can also enter metadata such as tags or notes.
-Because I had more than a thousand files to process, I only entered the publication year for each document and a tag to identify it as ‘PRINT’ or ‘DRAMA’. (A scripted alternative is sketched below.)
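For collections of this size, bulk metadata entry can also be scripted against the Zotero web API. The sketch below uses the pyzotero library; pyzotero is a real library, but the credentials are placeholders and the specific field names should be verified against its documentation before use.

```python
# A sketch of bulk metadata entry via the Zotero web API using pyzotero.
# LIBRARY_ID and API_KEY are placeholders; field names should be checked
# against pyzotero's documentation.
from pyzotero import zotero

zot = zotero.Zotero("LIBRARY_ID", "user", "API_KEY")

template = zot.item_template("book")     # blank parent-item template
template["title"] = "A game at chess"
template["date"] = "1624"                # publication year only, as above
template["tags"] = [{"tag": "DRAMA"}]

zot.create_items([template])             # push the parent item to the library
```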
There was now one main collection in my Zotero library entitled ‘CORPORA’. Within this collection, there were two subcollections entitled ‘1.PRINT.all’ and ‘2.DRAMA.all’. I copied the material from these subcollections to create a third subcollection entitled ‘3.PRINT and DRAMA 1620-25.’
Next, I ran Paper Machines topic modeling ‘By Time’ on Subcollection #3 (PRINT and DRAMA 1620-25), using the default settings in conjunction with my list of early modern stop words. Here are the results, shown as a stream graph:
The stream graph is quite striking, but the results are difficult to read. There are also some complications with the timeline on the x-axis: the column for 1624-25 is missing, and in its place we get a mysterious extra column for '19-1620' (?!). Another difficulty is that Paper Machines will only produce topics of three terms. Ideally, I would prefer topics of ten or more terms, as in the example from the Topic Modeling Tool above. Generally speaking, Paper Machines leaves something to be desired in terms of functionality and user-friendliness.
Difficulties notwithstanding, I found that the results from Paper Machines came closer than anything else I had experimented with to answering my central research questions. Here is a list of the top fifty topics. The graph above represents #1 in blue, #2 in orange, #3 in green, #4 in red, and #5 in purple. Note that Paper Machines uses ‘stemming’, which accounts for the truncation of some terms (e.g., ‘princ’ instead of ‘prince’).
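To see stemming in action, here is a quick illustration using the Porter stemmer from the NLTK library, which happens to produce the same truncations visible in the topic list; whether Paper Machines uses this particular stemmer is an assumption I have not verified.

```python
# Stemming in action: the Porter stemmer truncates words to a common root,
# producing exactly the forms seen in the topic list ('princ', 'natur', etc.).
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["prince", "princes", "nature", "majesty"]:
    print(word, "->", stemmer.stem(word))
# prince -> princ, princes -> princ, nature -> natur, majesty -> majesti
```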
There are a number of intriguing possibilities to follow up on in this list, but the topic that I was most curious about was #9: ‘princ, Spain, England’, which seems as though it may correspond to discourse surrounding Prince Charles’ trip to Spain—the most important news story of the period. I brought the topic into the graph by clicking on it, and then switched to the ‘line graph’ visualization to get a better sense of how the topic developed over time:
The brown line represents ‘princ, spain, england’. As I had suspected (or hoped), it shows a sharp rise in the topic beginning around 1623. Unfortunately, Paper Machines does not enable users to drill down to see the documents associated with a specific topic, which is really a very disappointing limitation. Instead, by clicking on the column for any given year, one can view a list of the documents from the year featuring the topics represented. (The usefulness of such a list is not immediately obvious.)
Paper Machines also enables users to view topics organized by tag. Recall that I tagged all of the files in my Zotero library to identify each item as either ‘PRINT’ or ‘DRAMA.’ The results for Subcollection #3 (PRINT and DRAMA 1620-25) are shown as a bar graph below. The topics are different from, but very similar to, the topics in the test above. For example, the top topic is now ‘heaven, mind, fair’, rather than ‘love, fair, mind’. This variance worries me because I am unsure how to account for it, but it presumably has something to do with the shift in parameters.
The five topics represented above are: #1 ‘heaven, mind, fair’ (blue), #2 ‘love, natur, affect’ (orange), #3 ‘majesti, person, kingdom’ (green), #4 ‘poor, fool, money’ (red), and #5 ‘christ, faith, iesus’ (purple). Print is on the left, and drama is on the right. The bar graph is not quite as visually arresting as the stream graph, but it does a better job of showing the relative frequencies of topics across file sets. For example, one can see that the topic ‘heaven, mind, fair’ (blue) is more prominently manifest in drama than it is in print.
Finally, here is a stream graph for all drama from 1590 to 1630:
The five topics represented above are: #1 ‘fool, delight, fate’ (blue), #2 ‘sword, queen, victory’ (orange), #3 ‘triumph, london, flower’ (green), #4 ‘money, fool, whore’ (red), and #5 ‘dauid, israel, saul’ (pink). One can notice much more variance in the frequency of topics over time by looking at forty years rather than five. Note, for example, that ‘dauid, israel, saul’ (pink) has a spike around 1595, while ‘triumph, london, flower’ (green) has spikes around 1615 and 1625.
As noted above, my experiments thus far are encouraging, but also point to an enormous amount of work yet to be done. The list of major remaining tasks is as follows:
-Develop a full-text corpus specifically for news documents from 1620-1625 (something based on Dahl’s Bibliography of English Corantos and Periodical Newsbooks 1620-1642 [3]).
-Complete comprehensive training with VARD 2 to improve corrections for spelling variance.
-Add more metadata to the Zotero library to enable functionalities such as topic modeling by author, publisher, or performance date.
As one might imagine, I am also looking forward to working with tools of even greater sophistication. My ideal program would combine the user-friendliness of Voyant with the functionality of the Topic Modeling Tool and the visualization capabilities of Paper Machines. Surely something along these lines will emerge at some point in the not-too-distant future.
Finally, there is the most important question any researcher must eventually address: So what? … Is there a practical application for these graphs and lists of words? How does the methodology cash out?
I have given these questions a good deal of thought over the past few months. It seems to me that the practical applications break down into two basic categories. First, the ‘macro-view’ offered by distant reading might point toward trends or peculiarities that a researcher could follow up on using more conventional methodologies. In other words, textual analysis may not be able to identify confluence between news and drama as precisely as I had hoped, but it can make solid predictions about likely places to look. On a similar note, a visualization could also serve as a conceptual aid to help a researcher come to terms with a large amount of textual material. In a scholarly essay or book chapter, this style of application would render the analytical process more-or-less invisible.
Second, in a wide variety of projects, there is tremendous potential to use statistics or visualizations derived from textual analysis to complement a critical argument. In this style of application, the process of analysis would be much more visible.
In the final analysis, digital tools do not strike me as fundamentally different from any other tool, such as a hammer, a pencil, or a paintbrush: results primarily depend on the imagination and skill of the user, not on any sort of magic inherent in the tool itself. Although my own skills are still in the early stages of development, I already feel as though I would want to use computer-based textual analysis in the beginning phases of any big research project, just to see the results that turned up. That may not seem like enough of a payoff to some researchers, but as far as I am concerned, the relatively minor investment of time and effort I have put into these methodologies has been extremely worthwhile.
[1] Moretti, Franco. Distant Reading. London: Verso, 2013.
[2] Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press, 2013.
[3] Dahl, Folke. A Bibliography of English Corantos and Periodical Newsbooks, 1620-1642. London: Bibliographical Society, 1952.