Digitizing Dictionaries: Introduction

Introduction

Now that you've learned the basics of XML, we have some good news and some bad news for you. The good news is that you are ready to start marking up your dictionaries. The bad news is that XML on its own is not enough if you want to make sure that your dictionary follows the best practices established in the retrodigitization community.

If you are not fundamentally opposed to the art of stretching a metaphor, you could think of XML as a language without words. XML comes with a basic set of grammatical rules on what words should or shouldn't look like, and how you could combine them, but it is up to you to invent the words themselves.

Making up your own words is great if you are planning only to talk to yourself. For instance, you could decide that each dictionary entry should be wrapped in a tag called <gobbledygook> and each lemma in a tag called <hullabaloo>. As long as your XML was well-formed, you would be ok. But if you wanted your dictionary to "talk" to other dictionaries, you would have to invest a great deal of effort into translation. One person's <gobbledygook> would be another person's <smorgasbord>.
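
To make the translation problem concrete, here is a minimal sketch of the same one-word entry as two different projects might encode it. The tag names <gobbledygook>, <hullabaloo> and <smorgasbord> come from the example above; the lemma tag <kerfuffle> and the headword "cat" are hypothetical, added purely for illustration.

```xml
<!-- Project A: invented vocabulary from the example above -->
<gobbledygook>
  <hullabaloo>cat</hullabaloo>
</gobbledygook>
```

```xml
<!-- Project B: a different invented vocabulary; <kerfuffle> is a hypothetical name -->
<smorgasbord>
  <kerfuffle>cat</kerfuffle>
</smorgasbord>
```

Both fragments are well-formed XML, so a parser would accept either one. But nothing in the markup tells a program that <hullabaloo> and <kerfuffle> both mean "lemma"; someone would have to write that mapping by hand for every pair of projects.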

Another question is that of consistency. How do you make sure that your dictionary uses the same encoding rules throughout? We've mentioned before that you can impose certain constraints on your XML structures in terms of where certain elements and attributes go. Wouldn't it be nice if you didn't have to write these rules from scratch every time you started a new dictionary project?

By the mid-1980s, academics, librarians and archivists from North America and Europe realized that they needed a common vocabulary and a common ruleset for encoding texts in the humanities. That's how the Text Encoding Initiative (TEI) was born.

The TEI was established in 1987 as an international research project to develop a set of guidelines to "facilitate the creation, exchange, and integration of textual data in machine-readable form." The goal was to create a mechanism that would support the encoding of "all kinds of texts, in every human language, from every historical or social context." A challenging goal!

The TEI Guidelines are updated continuously, and major releases are published occasionally. These major releases are numbered sequentially, from TEI P1 (1990) to the latest, TEI P5 (2007). Since 2011, TEI has also had its own registered media type, application/tei+xml (RFC 6129).

TEI is not a standard in the sense of being prescribed by an international standardization body such as ISO or W3C. Since the first draft of the TEI Guidelines was released in 1990, however, TEI has developed into a community-driven infrastructure and one of the most important de facto standards within the humanities. It has been used in the preparation of countless digital editions of literary and dramatic texts, historical documents, manuscripts, corpora and dictionaries.

The first TEI Guidelines (P1 to P3) were based on SGML, while the more recent revisions – TEI P4 (June 2002) and TEI P5 (November 2007) – have used XML.

In the rest of this unit, we will learn how to use TEI in general and how to use it to encode dictionaries in particular.

Further reading

Lou Burnard, The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure, in Journal of the Text Encoding Initiative, 2013, http://jtei.revues.org/811

Lou Burnard, What is the Text Encoding Initiative, 2014, http://books.openedition.org/oep/426?lang=en

Nancy M. Ide and C. M. Sperberg-McQueen, The Text Encoding Initiative: Its History, Goals, and Future Development, http://www.cs.vassar.edu/~ide/papers/teiHistory.pdf

Last modified: Sunday, 19 March 2017, 12:03 PM