by David Kelly
We think of structured text and XML as being for content that changes rapidly, because that’s what those tools are good for. So why did we get involved in a conversion project where the content is from millennia-old papyri written in ancient Greek?
On a gray spring day, Simon Bate and I found ourselves in the small Papyrology Room in a sub-basement of Perkins Library at Duke University. I scanned the spines of old papyrological journals while we viewed arcane diagrams on a large flat-panel screen and listened to the meeting’s organizer. The screen flickered and went solid black. Cell phones came out, people went running out of the room…
We were a motley group, assembled for obscure purposes, hungry for information about the dirty task at hand. Leaning intently over the table were the Scottish professor from Heidelberg, a specialist from London College, and our hosts, scholars of Classical Studies from Duke University and New York University. The rest came from around the States with a variety of software backgrounds. An unlikely group, indeed. What ancient secrets were we poised to unleash?
The story deepened as we listened. In 1931 in Leiden, the Netherlands, a select group of classical scholars met to establish conventions for indicating the condition of a text when transcribing it from papyri to type (http://en.wikipedia.org/wiki/Leiden_Conventions). Later, Dr. David Woodley Packard (http://en.wikipedia.org/wiki/David_Woodley_Packard) would devise a form of SGML for transcribing the Leiden conventions into digital media, even building his own computers to store them (yes, he was the son of THAT David Packard). The SGML was given to Columbia University, and from there it went through an obscure trail, including merges and matches with other forms of digital markup from other universities.
Now our hosts were talking about a more generalized form of markup, the EpiDoc (http://idp.atlantides.org/trac/idp/wiki/EpiDoc) standard of the Text Encoding Initiative (TEI) (http://www.tei-c.org/index.xml). The activities are organized under the rubric of Integrating Digital Papyrology (IDP) (http://idp.atlantides.org/), which is described as “a collaboration between the Duke Data Bank of Documentary Papyri (DDBDP), the Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens (HGV), the Advanced Papyrological Information System (APIS), and several leading research institutions.” Various parts of the project are funded by the Andrew W. Mellon Foundation and the U.S. National Endowment for the Humanities.
Our heads were swimming.
NOTE: The following image is of a letter on papyrus from Oxyrhynchus, written in Greek dating from the second century AD. This image is in the public domain and was reproduced from this location on Wikipedia: http://hsb.wikipedia.org/wiki/Dataja:Ac_papyrus.png. Please see this website for additional information about this papyrus.
Despite all the signs of an archeological pulp-story adventure, we were there for a typical business reason. Huge amounts of tagged text needed cleanup after having been converted to EpiDoc.
One of the big problems was with Greek numbers. The ancient Greeks had a curious numeric notation system before those neat Arabic numbers came along. The complexity of numbers, combined with the complexity of the Leiden textual markup conventions, had given fits to an earlier conversion of the markup. It was time to call in the Pros from Dover. (Insert film clip of David and Simon as Hawkeye and Trapper John entering the operating room in a military hospital in Japan, golf bags in tow, umbrella at the ready.)
In ancient Greek, numbers are represented by the letters of the alphabet – with a twist. The character used depends on which place the digit occupies. For example, the numeral one in the ones place is a lowercase alpha (α). In the tens place it is a lowercase iota (ι), which looks like a one. In the hundreds place it becomes a lowercase rho (ρ). And in the thousands place, it becomes an uppercase Alpha (Α). So the number 1111 is Αρια. Pity the poor schoolchildren.
It gets worse: the character for the ones-place number 6 doesn’t occur in the standard alphabet, but is called Stigma, and you have to install special fonts to represent it. A 90 is a Qoppa, which lives in Unicode’s Greek and Coptic block – and which has four characters associated with it, depending on whether it is being used as a number or a letter. A 900 is a Sampi, sort of a left-leaning crescent moon with two strokes in the middle of the inside curve.
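The place-value scheme described above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function name `to_greek` and the lookup tables are made up for the sketch, the uppercase-letter thousands follow the description in this post (printed editions often mark thousands differently), and it handles only 1 through 9999.

```python
# Hypothetical lookup tables, indexed by digit (index 0 means "nothing written").
# Stigma (6), the numeral koppa (90) and sampi (900) are given as escapes
# because many fonts lack them.
ONES = ["", "α", "β", "γ", "δ", "ε", "\u03db", "ζ", "η", "θ"]
TENS = ["", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "\u03df"]
HUNDREDS = ["", "ρ", "σ", "τ", "υ", "φ", "χ", "ψ", "ω", "\u03e1"]


def to_greek(n: int) -> str:
    """Write n (1..9999) with one letter per non-zero decimal place."""
    if not 0 < n < 10000:
        raise ValueError("this sketch only covers 1..9999")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    # Assumption taken from the post: thousands reuse the ones letters, uppercased.
    return ONES[thousands].upper() + HUNDREDS[hundreds] + TENS[tens] + ONES[ones]


print(to_greek(1111))  # Αρια
print(to_greek(18))    # ιη
print(to_greek(1800))  # Αω
```

Note that 18 and 1800 share no letters at all, which is why losing track of place value matters so much in what follows.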
One kind of problem comes when Leiden convention characters are introduced into a number. For example, the original typographical markup in Leiden conventions might have been:
Αω[μζ]
In Arabic numbers this is 1847, with the “[μζ]” part being supplied by the editor. (The reason for being able to supply the “47” might be internal clues within the document or references to other documents.)
Now the SGML conversion takes place, resulting in the following:
The square brackets still appear, and because there is no space between the <num> elements and the open square bracket, the syntax still represents the fact that this is a single string of numerals, i.e., 1847. But the place values associated with the Greek characters have been “lost in translation.” Look what happens when this number gets converted to EpiDoc in the first round:
<num value="18">ιη</num><supplied reason="lost" cert="high"><num value="47">μζ</num></supplied>
The EpiDoc conversion is based on the value of each <num> element, so the 18 gets incorrectly converted to an iota-eta, and the 47 gets correctly converted to a mu-zeta. The correct value for the “18” is actually 1800, so the Greek should have been, as in the original, Alpha-omega, or Αω. For scholars, we were told, the iota-eta is egregious nonsense.
Our transform had to find all instances where a <num> element was followed by (or contained) one or more <num> elements and did not have an intervening space. Having found these instances, the transform then reconstructed the initial number value and supplied the correct Greek characters with the correct tagging sequence. The original markup in Leiden conventions encoded the single number 1847. So the correct solution for EpiDoc would be as follows:
<num value="1847">Αω<supplied reason="lost">μζ</supplied></num>
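For the curious, here is a rough Python sketch of that kind of repair, using the standard library's `xml.etree.ElementTree`. It is not the transform we actually ran. The function name `merge_split_number`, the `<ab>` wrapper element, and the rule for rebuilding the value (gluing the decimal digits of the two `value` attributes together, "18" + "47") are all assumptions made for illustration, and the sketch only handles the one pattern shown above: a `<num>` immediately followed by a `<supplied>` that holds a single `<num>`.

```python
import xml.etree.ElementTree as ET

ONES = ["", "α", "β", "γ", "δ", "ε", "\u03db", "ζ", "η", "θ"]
TENS = ["", "ι", "κ", "λ", "μ", "ν", "ξ", "ο", "π", "\u03df"]
HUNDREDS = ["", "ρ", "σ", "τ", "υ", "φ", "χ", "ψ", "ω", "\u03e1"]


def to_greek(n: int) -> str:
    """Greek letters for 1..9999, thousands written as uppercase ones letters."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    return ONES[thousands].upper() + HUNDREDS[hundreds] + TENS[tens] + ONES[ones]


def merge_split_number(parent: ET.Element) -> None:
    """Merge <num>A</num><supplied><num>B</num></supplied> into a single <num>."""
    children = list(parent)
    for first, second in zip(children, children[1:]):
        inner = list(second)
        if not (first.tag == "num" and second.tag == "supplied"
                and not first.tail                     # nothing between the two
                and len(inner) == 1 and inner[0].tag == "num"
                and not second.text and not inner[0].tail):
            continue
        # Assumption: the two values are digit runs of one number (18|47 -> 1847).
        low = int(inner[0].get("value"))
        total = int(first.get("value") + inner[0].get("value"))
        first.set("value", str(total))
        first.text = to_greek(total - low)   # surviving part, e.g. 1800
        second.remove(inner[0])
        second.text = to_greek(low)          # supplied part, e.g. 47
        second.attrib.pop("cert", None)      # the target markup above omits cert
        parent.remove(second)
        first.tail, second.tail = second.tail, None
        first.append(second)                 # nest <supplied> inside <num>


src = ('<ab><num value="18">ιη</num><supplied reason="lost" cert="high">'
       '<num value="47">μζ</num></supplied></ab>')
root = ET.fromstring(src)
merge_split_number(root)
print(ET.tostring(root, encoding="unicode"))
```

Run against the first-round markup, this should print the corrected `<num value="1847">` element inside the wrapper. A real transform would also have to cope with cases this sketch ignores, such as a supplied part with a leading zero or several `<num>` elements in a row.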
In fact, I copied this string from a report that was prepared to show the proposed transforms before we applied them to the text. Once the report was approved, the transforms were applied to (deep breath here) 60,000 XML files representing the combined papyrological scholarship of the entire planet for the last 78 years.
We HAD to get it right – or the consequences could be millennial.
Stay tuned for Parts β and γ of this serial blog – coming soon to a Palimpsest near you.