modVers: Lessons Learned from Building
I’m wrapping up a first round of work on a tool, currently called modVers, that creates an HTML interface for versioned texts encoded with TEI. To allow for flexible use, the tool is implemented as a jQuery plugin. A working prototype is available on GitHub, along with technical documentation. Comments and questions to firstname.lastname@example.org are always appreciated.
This point in the development process seemed like a good time to think about some broad lessons learned and future directions for work.
1. TEI needs visualization. TEI makes visualization difficult.
The diversity and form of the information encoded in TEI files point to the need for visualization. In part, the XML tagging structure that TEI relies on does not present information in a way that is conducive to easy human understanding. While this structure is necessary for presenting a text using a logical structure that can be manipulated by a computer, this simultaneously interferes with the human act of reading. Further, while a text can encode a large amount of semantic information related to a range of topics, it is not always desirable to deal with all these properties at the same time. Thus, one of the main goals of visualization is to reduce that information, creating a visual representation that is useful to a viewer. Because this reduction of information is highly dependent on context, creating visualization software that meets the needs of a wide range of researchers is a challenge. For example, one researcher might want to see relationships between the various people and places in a text without being distracted by information about the text’s editorial history; another may care only about the editorial changes irrespective of the people and places in the text.
A related issue stems from the size of the TEI element set. Compared to a markup vocabulary like HTML, the TEI defines a relatively large number of elements and attributes, all of which are presumably needed for various editors to describe the texts they work with. This quantity presents significant problems for visualization. Given the restricted visual vocabulary available, especially in web projects that rely on CSS, limits are quickly reached on the number of elements that can be visually represented at once. For example, additions to the text might be rendered in green and deletions in red, but only a handful of further colors remain easy to tell apart. Further, visualization methods will always need to seek a balance between the quantity of information rendered visually and the overall comprehensibility of the visualization produced.
The Mandala Browser, an application that visualizes various features of XML files as colored circles, uses a very limited visual vocabulary but allows users a wide choice of what elements from a text to display. In this way, the problem of filtering information is given to the user to solve as is appropriate in a given context. The modVers prototype takes a different approach, greatly reducing the elements that are recognized and made visually distinct. Unlike the Mandala Browser, the prototype visualizes text as text—this seems to favor the recreation of some visual properties of the original text while also, perhaps, reducing the utility of drawing attention to semantic entities such as geographic locations. Viewed from the distant perspective that the Mandala Browser provides, these dispersed entities can form larger patterns. Viewed at the near-facsimile level of the modVers prototype, the visual properties of the text and the ability to reveal small-scale difference seem of primary concern.
However, modVers does allow users to modify the way entities are visualized by translating all TEI elements into HTML elements that are unstyled but can be selected. For example, a researcher with a special interest in geographic locations (or any of TEI’s other specialized entities) could add CSS rules for that purpose.
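The translation of TEI markup into selectable HTML can be illustrated with a minimal Python sketch. This is not the modVers implementation (which is a jQuery plugin); the `tei-` class prefix, the function names, and the sample sentence are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.split("}", 1)[1] if tag.startswith("{") else tag


def tei_to_html(elem) -> str:
    """Wrap every TEI element in an unstyled <span> whose class names the element.

    The 'tei-' class prefix is a hypothetical convention, not taken from modVers.
    """
    parts = [f'<span class="tei-{local_name(elem.tag)}">']
    if elem.text:
        parts.append(escape(elem.text))
    for child in elem:
        parts.append(tei_to_html(child))
        if child.tail:
            parts.append(escape(child.tail))
    parts.append("</span>")
    return "".join(parts)


# Invented sample: one paragraph containing a place name.
sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    "She sailed from <placeName>Dover</placeName> at dawn.</p>"
)
html_out = tei_to_html(ET.fromstring(sample))
print(html_out)
```

With output of this shape, a researcher interested in places could add a single rule such as `.tei-placeName { text-decoration: underline; }` without touching anything else in the interface.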
Data visualization outside the humanities (and often within) is an art of drawing dots and lines. The visual limitations of these systems tend to be related to color and shape differences that can be easily identified by a user, and these same problems certainly arise in projects like modVers. (I would be grateful, for instance, to anyone who would like to share research on generating color sets that are both distinct and not hideous.) Still, because visualization of texts encoded in TEI also attempts, at times, to recreate or demarcate physical features of texts, I would suggest that there are problems of visualization unique to this domain. Because modVers visualizes text as text, it becomes difficult to distinguish between the text being visualized and any text that refers to that text, such as an annotation. ModVers approaches these problems by, first, not attempting to recreate exact physical appearance. By standardizing some features and using iconic representations of some physical traits (for example, inserted text is indicated with an arrow pointing up rather than by placement), the system attempts to make visualization manageable. Second, the system uses shifts in visual aesthetics (see, for example, the screenshot below that includes an icon indicating the presence of an annotation) to mark different levels of textuality.
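On the question of color sets raised above, one common starting point is to space hues evenly around the color wheel while holding lightness and saturation fixed. The Python sketch below shows that heuristic only; it is not something modVers is known to do, and the function name and default values are illustrative assumptions. Whether the result counts as "not hideous" remains a matter of taste.

```python
import colorsys


def distinct_colors(n: int, lightness: float = 0.6, saturation: float = 0.55) -> list[str]:
    """Return n hex colors with hues spaced evenly around the color wheel.

    Moderate lightness and saturation (assumed defaults) keep the colors
    from being either washed out or garish.
    """
    colors = []
    for i in range(n):
        # colorsys takes hue, lightness, saturation in that order, each in 0..1.
        r, g, b = colorsys.hls_to_rgb(i / n, lightness, saturation)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


# For example, one color per witness in a six-witness edition.
palette = distinct_colors(6)
print(palette)
```

Evenly spaced hues become hard to tell apart as `n` grows, which is the same limit on visual vocabulary discussed above.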
2. Encoding methods have consequences.
The current iteration of modVers was intended to be a working prototype: a way to work through real problems, with the knowledge that not all of the insights gained would be implemented as part of a finished product. To that end, I’m going to suggest in the following few paragraphs that there are good reasons to consider a method of encoding textual variants other than parallel segmentation, even though modVers currently requires that editors use parallel segmentation.
Parallel segmentation is a method of encoding multiple textual versions in one file such that the complete text of any witness can be easily reconstructed. All variants are included within an <app> element, making the concept of a base text unnecessary. This arrangement is also fairly simple for editors to write and read; however, this ease of use is accompanied by limitations to precision. According to the TEI guidelines, parallel segmentation “will become less convenient as traditions become more complex and tension develops between the need to segment on the largest variation found and the need to express the finest detail of agreement between witnesses” (“12 Critical Apparatus – TEI P5”). Parallel segmentation has been used because it is relatively easy to implement and encourages a vision of the textual editor that has some commonalities with the pre-digital tradition.
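To make the reconstruction property concrete, here is a minimal Python sketch that rebuilds the text of a single witness from a parallel-segmentation fragment. The sample line, the witness sigla (`#A`, `#B`, `#C`), and the function name are invented for illustration, and the fragment omits the TEI namespace for brevity; a real file would declare it.

```python
import xml.etree.ElementTree as ET


def witness_text(elem, wit: str) -> str:
    """Rebuild one witness's text from parallel-segmentation markup.

    Text outside <app> is shared by every witness; inside <app>, only the
    reading whose wit attribute lists the requested witness is kept.
    """
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "app":
            for rdg in child:
                if wit in (rdg.get("wit") or "").split():
                    out.append(witness_text(rdg, wit))
        else:
            out.append(witness_text(child, wit))
        out.append(child.tail or "")
    return "".join(out)


# Invented sample: one line attested by three witnesses.
sample = (
    "<l>The "
    '<app><rdg wit="#A">quick</rdg><rdg wit="#B #C">swift</rdg></app>'
    " fox "
    '<app><rdg wit="#A #B">jumps</rdg><rdg wit="#C">leaps</rdg></app>'
    ".</l>"
)
line = ET.fromstring(sample)
texts = {w: witness_text(line, w) for w in ("#A", "#B", "#C")}
print(texts["#A"])  # The quick fox jumps.
print(texts["#C"])  # The swift fox leaps.
```

No witness is privileged here: each is recovered by the same walk over the file, which is why the method does not need a base text.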
The alternative to the parallel segmentation method is the double end-point attachment method (or the location-referenced method, although the two are fairly similar—I’ll just refer to the double end-point attachment method below). This method requires that a base text be identified and permits textual variants to be recorded in either the same or in an external file. Like the parallel segmentation method, the double end-point attachment method allows for the complete reconstruction of the text of any witness. The major advantage of the double end-point attachment method is that it allows overlapping variants and, in general, provides a higher degree of precision. This is achieved by inserting location references in the base text, a process that has the potential to create a complicated file that is extremely difficult for humans to read or write. The TEI guidelines note, “Because creation and interpretation of double end-point attachment apparatus will be lengthy and difficult it is likely that they will usually be created and examined by scholars only with mechanical assistance” (“12 Critical Apparatus – TEI P5”). While there is currently no software designed to help editors work with complex texts of this kind, the inclusion of this method in the TEI guidelines does indicate the potential of different methods.
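A rough Python sketch may help show how double end-point attachment handles overlapping variants. It assumes a flat base line whose only child elements are `<anchor>` markers, and an apparatus held separately whose `<app>` elements point at those anchors with `from` and `to`. The sample texts, anchor ids, sigla, and function names are invented, and the TEI namespace is again omitted for brevity.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def flatten_base(elem):
    """Return the base text and a map from anchor id to character offset.

    Assumes the element's only children are empty <anchor> markers.
    """
    text, offsets = elem.text or "", {}
    for child in elem:
        if child.tag == "anchor":
            offsets[child.get(XML_ID)] = len(text)
        text += child.tail or ""
    return text, offsets


def apply_witness(base, offsets, apparatus, wit):
    """Replace each anchored span with the reading the given witness attests."""
    spans = []
    for app in apparatus.iter("app"):
        for rdg in app.iter("rdg"):
            if wit in (rdg.get("wit") or "").split():
                start = offsets[app.get("from").lstrip("#")]
                end = offsets[app.get("to").lstrip("#")]
                spans.append((start, end, rdg.text or ""))
    # Apply right to left so earlier offsets stay valid after each replacement.
    for start, end, reading in sorted(spans, reverse=True):
        base = base[:start] + reading + base[end:]
    return base


base_xml = (
    '<l>The <anchor xml:id="a1"/>quick <anchor xml:id="a2"/>brown'
    '<anchor xml:id="a3"/> fox<anchor xml:id="a4"/> jumps.</l>'
)
apparatus_xml = """
<listApp>
  <app from="#a1" to="#a3"><rdg wit="#B">swift red</rdg></app>
  <app from="#a2" to="#a4"><rdg wit="#C">grey wolf</rdg></app>
</listApp>
"""
base, offsets = flatten_base(ET.fromstring(base_xml))
apparatus = ET.fromstring(apparatus_xml)
print(apply_witness(base, offsets, apparatus, "#B"))  # The swift red fox jumps.
print(apply_witness(base, offsets, apparatus, "#C"))  # The quick grey wolf jumps.
```

The two variants overlap on the word "brown", which nested `<app>` elements in parallel segmentation cannot express directly. The cost is visible even in this toy case: the base line is already cluttered with anchors.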
The system design implied or allowed for by the double end-point attachment method is quite different from that encouraged by the parallel segmentation method. Because textual variants can be divided among multiple files and refer to a base text that is not directly modified, the double end-point attachment method makes it possible for witnesses to be distributed, either in a database or as separate files, and combined as needed to form new digital editions. Combined with the method’s reliance on a base text for standardization, this distributed quality would also allow individual editors to contribute more easily to an edition or repository of textual variants. A system designed in this way would produce editions that bring together the work of multiple editors, and encoded texts could be reused in multiple editions without additional labor. For example, researchers at various institutions, working from one base text, might produce a repository of fifteen versions of a play by Shakespeare. Individual editors could then combine these versioned texts in flexible ways. A complete edition of all the versions could be produced by one editor, while an edition focusing on the changes across just two versions could be produced by another. The main idea here is that encoded texts, rather than serving as a final product that belongs to one researcher, could be used in many ways, and the labor of producing these editions could be shared more easily. This model questions the conception of the textual editor as a lone worker, devoting significant time to the work of gathering and collating witnesses.
3. Interfaces are temporary instantiations.
One of the assumptions that influenced the production of modVers is that TEI files should have more than one life. We should produce them not like bespoke editions but as data—albeit data that is necessarily the product of interpretation—that can be used in various ways.
The assumption that TEI files would be used to generate multiple interfaces also led me to question what purpose the interface would (and could) serve. The obvious answer was that modVers would be useful for creating digital editions, but I also thought that it would be interesting to experiment with using TEI files to create small examples that could be set within an academic article. So, instead of the versioned texts standing alone, or with annotations, sections of the texts could be inserted into and referenced by a larger, structured argument. ModVers includes the option to output an interface that aims to be functional within a column of text. It also allows users to specify which parts of a text and which versions to display, so that an excerpt can be drawn from a complete TEI text.
At some basic level, there is functionality in presenting versions side by side that cannot be recreated in a single text block. While the implementation of this feature in modVers might or might not be immediately useful, the possibility of using this kind of presentation within articles seems worth pursuing.
“12 Critical Apparatus – TEI P5: Guidelines for Electronic Text Encoding and Interchange.” Web. 18 Feb. 2013.