Introduction

A model is a selective representation of an object or a process with an essentially epistemological goal: to use “a likeness to gain knowledge of its original” (McCarty 2007: 392). Following Geertz, however, McCarty distinguishes between two different types of models -- models of and models for:

A model of something is an exploratory device, a more or less ‘poor substitute’ for the real thing (Groenewold 1961:98). We build such models-of because the object of study is inaccessible or intractable, like poetry or subatomic whatever-they-are. In contrast a model for something is a design, exemplary ideal, archetype or other guiding preconception. Thus we construct a model of an airplane in order to see how it works; we design a model for an airplane to guide its construction. (McCarty 2002: 393)


The two types of models have different trajectories: the former starts from an existing object and creates its likeness, whereas the latter envisions an object that is yet to be created. One is an interpretation; the other is a projection. In either case, a model is an abstraction from the object it represents: it can never be equal to the object itself.

A data model of a dictionary is both a model-of and a model-for: it is an interpretation of the way the lexicographic source is structured, and it is a projection of some of the functionalities that an electronic edition of the given dictionary could have. For instance, if you want your users to eventually be able to retrieve all words of Latin origin in your dictionary, you have to make sure that entries containing etymological information mark up Latin etymons consistently, in a way that allows them to be retrieved automatically.

Computers are incredibly efficient machines, but not very smart on their own. Without additional input, a computer can't tell a lemma from a cross-reference or an etymology: for the computer, all words in a dictionary entry are just strings of characters. Whether "Latin" refers to the name of an ancient language, a part of the Americas, or a particular quarter in Paris is not exactly self-evident to non-humans. If we want to get meaningful results from old dictionaries, we need to teach the computer to distinguish between the things that we ourselves recognize in them. But let's not be too harsh on poor old computers either: what we recognize in dictionary entries is not intrinsic, intuitive knowledge; we had to acquire it ourselves too.
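
To make this concrete, here is a minimal Python sketch with a few invented plain-text entries. To the computer, "Latin" is just a sequence of characters, so a naive string search cannot tell an etymology label from a place name:

    # Invented plain-text entries: to the computer, each one is just a
    # string of characters, with no indication of what any word "is".
    entries = [
        "fenêtre, n.f. window. From Latin fenestra.",
        "quartier, n.m. quarter, district: le Quartier latin, the Latin Quarter.",
        "salsa, n.f. salsa, a dance of Latin American origin.",
    ]

    # A naive search for "Latin" matches all three entries, although
    # only the first one actually records a Latin etymon.
    hits = [entry for entry in entries if "latin" in entry.lower()]
    print(len(hits))  # 3

A search like this over-generates, because nothing in the plain text says which occurrence of "Latin" plays which role.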

Imagine you were an alien from the Hollywood blockbuster "Arrival". Amy Adams, who plays a linguist trying to decipher your extraterrestrial language, presents you with a page from a dictionary. What would you make of it? Probably very little. All you would see would be different shapes without any meaning.

[Image: Aliens in "Arrival" use logograms to communicate]

Unlike an alien, a computer will know that your dictionary file consists of words -- because the text editor in which you open your dictionary has been programmed to recognize words as strings of characters between spaces. It will probably also be smart enough not to count punctuation marks as parts of words, so that if you double-click a word followed by a comma, it will select only the word, not the comma. But that's about it. If all you have is the plain text of a dictionary, all the words in that file will be treated the same by the computer. And you know very well that a dictionary entry is a hierarchical space: different positions in this hierarchy are associated with different types of information. Only if you tell the computer "a Latin etymon starts here and ends there" or "a translation equivalent starts here and ends there" will you be able to perform type-specific searches and retrieve entries based on their Latin origins or their translation equivalents alone.
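
Here is a minimal sketch of what such "telling the computer" might look like, using Python's standard xml.etree.ElementTree module. The element names (entry, orth, etym, lang, mentioned) are illustrative, loosely inspired by TEI conventions rather than a prescribed schema, and the two entries are invented; the point is only that once boundaries are explicitly marked, a type-specific search becomes possible:

    # A sketch, not a prescribed encoding: element names are illustrative,
    # loosely inspired by TEI conventions; the two entries are invented.
    import xml.etree.ElementTree as ET

    dictionary = ET.fromstring("""
    <dictionary>
      <entry>
        <orth>fenêtre</orth>
        <etym>From <lang>Latin</lang> <mentioned>fenestra</mentioned>.</etym>
      </entry>
      <entry>
        <orth>quartier</orth>
        <sense>quarter, district: le Quartier latin, the Latin Quarter</sense>
      </entry>
    </dictionary>
    """)

    # Type-specific search: look only inside <etym> elements, where the
    # markup says "a language label starts here and ends there".
    latin_origin = [
        entry.findtext("orth")
        for entry in dictionary.iter("entry")
        if any(lang.text == "Latin" for lang in entry.findall("etym/lang"))
    ]
    print(latin_origin)  # ['fenêtre'] -- the Latin Quarter no longer interferes

The same query could not be restricted to etymologies in the plain-text version of these entries, because nothing in plain text marks where an etymology begins and ends.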

In the rest of this Unit, we'll look at mark-up and text encoding as ways of capturing dictionary structure and content.
