XML Manuel de Codage for hieroglyphic texts

These pages are devoted to the discussion of a new Manuel de Codage for Egyptian hieroglyphic texts in XML. We want to end up with a decent, clean, extensible standard that would be usable by all programs that deal with hieroglyphs. Contributions are welcome.

To contribute, subscribe to hieroxml (send a mail to hieroxml-request@iut.univ-paris8.fr with subject "subscribe" or "subscribe you@your.address").

You can get help for hieroxml simply by mailing a request for help (the subject should be "help").

Temporary document. Date : Thu Mar 2 13:19:25 CET 2000

Why a new Manuel de Codage?
Why XML?
The current proposal
Particular issues
A note for computer scientists
Documentation

Why a new Manuel de Codage?

The Manuel de Codage is a standard for encoding Egyptian hieroglyphic texts. It was written in the 1980s and was largely inspired by the most usable program available at that time, which was called glyph.

But the Manuel is getting old. For quite a long time now, nobody has really been using the Manuel as such. Instead, people use the extensions proposed by the different hieroglyphic typesetting systems, like Winglyph or Macscribe. Their formats are extensions of the old Manuel. These extensions are needed to address fine typographical points, like sign positioning, or to correct a number of weak points in the original Manuel, like hatching.

This would be fine if these extensions were compatible with one another, which is not currently the case (but should be fixed in the next version), and if the Manuel were easily and logically extensible, which is not really the case either. The first problem is a serious one, even if it gets fixed one day: it makes program development difficult if one has compatibility in mind.

There's one thing the current Manuel is fine for: hand-made encoding. The Manuel allows a rather terse representation of a text, and for simple things, that's ok. For communicating a simple hieroglyphic text by means of ASCII codes, you can't compete with the Manuel. It's a strength, but also a weakness. Hand encoding means that errors are made. It's like writing a computer program: here and there, you'll forget something. The problem is well known for web pages: most HTML code on the web is broken, and web browsers have to include error-correcting mechanisms to deal with these broken pages.

This causes two problems. First, it's difficult to write good error correction. Second, with a good formalization, it's quite likely that two different programs will have the same idea of the "meaning" of the same correct text, but for badly coded text, such an agreement is impossible to achieve.

So, if we want to have an encoding of hieroglyphic texts which

is here to stay
allows text sharing between programs

we don't need a terse format, we need a precise one. "Precise" means that both its syntax and semantics will be defined.

Why XML?

XML is a format that allows one both to describe an encoding and to write encoded files.

It was chosen for a number of reasons. First, it's easy to extend an XML format. Second, it's easy to parse an XML file, and there are a lot of tools for it: people will be able to manipulate XMLMCD files without holding a degree in computer science. Third, XML is being used for a growing number of applications, web browsers for instance. Fourth, there's a user community for XML in the philological world: two interesting examples are the Text Encoding Initiative and the recent conference on XML and Ancient Near East.
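As a small illustration of the second point, here is a minimal sketch in Python that relies only on the standard library module xml.etree.ElementTree. The element name hieroglyph and its code attribute are taken from the proposal discussed on this page; the enclosing text element and the function name sign_codes are assumptions made for the example. It also shows the other side of the coin discussed earlier: an XML parser does not try to repair broken input, it rejects it.

```python
import xml.etree.ElementTree as ET


def sign_codes(xml_source):
    """Return the sign codes found in an XML fragment, in document order.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.fromstring(xml_source)
    return [sign.get("code") for sign in root.iter("hieroglyph")]


# <text> is a hypothetical root element, used only for this sketch.
well_formed = '<text><hieroglyph code="A1"/><hieroglyph code="i"/></text>'
print(sign_codes(well_formed))

# Unclosed elements are rejected outright: the parser does not guess.
broken = '<text><hieroglyph code="A1"/><hieroglyph code="i">'
try:
    sign_codes(broken)
except ET.ParseError as error:
    print("rejected:", error)
```

Because every conforming parser rejects the second input, two programs cannot silently disagree about what a broken file "means".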

Let's illustrate these points. In the current MCD, data about an individual sign is scattered around it. Consider for example:

      =A1\\r1 -i

It means "Gardiner sign A1", marked as both a grammatical ending and a word ending, reversed and rotated, followed by the sign i. Fine positional data, colour data, and more are hard to add. On the other hand, the current proposal would represent the same sequence as

      <hieroglyph code="A1" gramend="y" wordend="y" rot="90" reversed="y"/>
      <hieroglyph code="i"/>

Of course, it's much longer. But the format is not supposed to be directly manipulated by humans, so it's not a real issue. The important point is that it's possible to add data to the signs without breaking the whole encoding.
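The last point can be sketched in a few lines of Python, using only the standard library. The attribute names code and reversed come from the example above; the enclosing text element, the colour attribute and the function name read_signs are invented for this illustration and are not part of the proposal. The reader below was written without any knowledge of colour, and it keeps working when that attribute appears.

```python
import xml.etree.ElementTree as ET

# The two signs from the example above, wrapped in a hypothetical <text>
# root element. The "colour" attribute is an invented extension, used only
# to show that a reader unaware of it is not disturbed.
SAMPLE = """
<text>
  <hieroglyph code="A1" gramend="y" wordend="y" rot="90" reversed="y" colour="red"/>
  <hieroglyph code="i"/>
</text>
"""


def read_signs(xml_source):
    """Return (code, is_reversed) pairs, ignoring attributes it does not know."""
    root = ET.fromstring(xml_source)
    return [
        (sign.get("code"), sign.get("reversed", "n") == "y")
        for sign in root.iter("hieroglyph")
    ]


print(read_signs(SAMPLE))  # [('A1', True), ('i', False)]
```

With the positional syntax of the old Manuel, adding a new kind of information usually means inventing a new piece of punctuation that older programs will misread. Here an older program simply never asks for the attribute.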

The current proposal

You will find the current working documents at http://webperso.iut.univ-paris8.fr/~rosmord/DTD. These documents will be the result of the group's work. For now, they are only a starting point. The interesting files are the DTD and a test XML-encoded file.

Hans van den Berg, from the CCER, has also created a format for Winglyph 2.0. He presented it at the conference mentioned above.

The goal is not to propose two competing standards, but to set in motion a process that improves the candidate formats.

Particular issues

The goal of the new standard is twofold: on the one hand, it must be as expressive as the previous one; on the other hand, it must, or at least might, extend it in a number of areas.

Epigraphic comments and sign encoding

The most annoying problem with Egyptian encoding is that the list of signs is not a closed list: "new" signs may appear in a text. Moreover, we are dealing with epigraphy: some sign readings may be doubtful. The current Manuel is oriented toward printing, which is perfectly logical. But XML would be an opportunity to improve its semantic power, while keeping the possibility of printing the text in a standard way.

Philological comments

There's also a need for ways of representing philological comments. This is mostly true for printing. I think that for database purposes, philological comments should be external to the text representation. The rationale for this is simple: an external representation allows multiple sets of comments for one text, while an internal representation doesn't, at least not in an extensible way.
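To make the idea of external comments concrete, here is a hedged sketch in Python (standard library only). Nothing in it comes from the working DTD: the id and target attributes, the comments and comment elements and the function collect_comments are all hypothetical. The only assumption that matters is that each sign can be given an identifier, so that separate files can point at it.

```python
import xml.etree.ElementTree as ET

# Hypothetical text file: each sign carries an "id" so it can be referenced.
TEXT = '<text><hieroglyph id="s1" code="A1"/><hieroglyph id="s2" code="i"/></text>'

# Two independent, hypothetical comment files pointing at the same text.
COMMENTS_A = '<comments><comment target="s1">Note by editor A</comment></comments>'
COMMENTS_B = (
    '<comments>'
    '<comment target="s1">Note by editor B</comment>'
    '<comment target="s2">Another note by editor B</comment>'
    '</comments>'
)


def collect_comments(text_xml, *comment_sources):
    """Map each sign id in the text to the list of comments that target it."""
    signs = ET.fromstring(text_xml).iter("hieroglyph")
    by_id = {sign.get("id"): [] for sign in signs}
    for source in comment_sources:
        for comment in ET.fromstring(source).iter("comment"):
            target = comment.get("target")
            if target in by_id:  # ignore comments aimed at unknown signs
                by_id[target].append(comment.text)
    return by_id


print(collect_comments(TEXT, COMMENTS_A, COMMENTS_B))
```

The text file is never modified: a third commentator only has to publish one more comment file.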

Graphical extensions

Hieroglyphs are drawings. There's a need to be able to represent a drawing in a hieroglyphic text, at two levels:

Low level drawing

Unknown/unread signs may need to be represented (think for example of P. BM. 10411, LRLC plates 1-4, in which part of the text concerns two amulets: the word for these amulets is replaced by a drawing). The sign representation may be internal to the document, or perhaps external. In either case, the way to reference it and the format used should be documented.

Scene representation

When an Egyptian text appears along with a representation, it is well known that the relation between the two is rather close. For example, a full-scale representation often stands as a determinative for some words in the text. In a number of cases, the relation between the scene and the text is even more complex.

In any case, it seems difficult to document a scene without stating which text refers to which scene element. A standard way of referencing would also be interesting here.

References

There should also be provision for multiple reference systems and even for multiple representations of a text. One might think, for example, that a text from a temple scene would be displayed as a near facsimile when one browses the temple's texts, but that an excerpt of the same text, taken from the same file, would appear in a calibrated left-to-right form when quoted in a grammatical study.
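A rough sketch of what "two representations from one file" could mean for a program, again in Python with the standard library only. The direction attribute, the column element and both function names are assumptions made for this illustration, not elements of the current proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of a short text written in two columns.
TEXT = """
<text direction="right-to-left">
  <column><hieroglyph code="A1" reversed="y"/><hieroglyph code="i"/></column>
  <column><hieroglyph code="n"/></column>
</text>
"""


def facsimile_view(xml_source):
    """Keep the original layout: writing direction and column grouping."""
    root = ET.fromstring(xml_source)
    columns = [
        [sign.get("code") for sign in column.iter("hieroglyph")]
        for column in root.iter("column")
    ]
    return {"direction": root.get("direction"), "columns": columns}


def normalised_view(xml_source, start=0, stop=None):
    """Flatten an excerpt into a plain left-to-right sequence of sign codes."""
    root = ET.fromstring(xml_source)
    codes = [sign.get("code") for sign in root.iter("hieroglyph")]
    return codes[start:stop]


print(facsimile_view(TEXT))
print(normalised_view(TEXT, 0, 2))
```

Both views read the same source; only the amount of layout information they keep differs.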

A note for computer scientists

Computer scientists usually think of XML, SGML, and even in the early days HTML as semantic markup (old timers, do you remember the flame wars about Netscape's additions to the language and the <BLINK> stuff?). Roughly, you represent the meaning of a text, and it's the program's task to build a nice graphical presentation for this meaning.

The point is that the representation of hieroglyphic text should not follow this principle. What we want to represent first is the original document. Of course, it's not possible to achieve a completely faithful electronic representation (for that, a scan would be better).

Documentation

On SGML and the humanities, the TEI documents on the web are a very interesting source. Apart from the TEI site itself, I would suggest having a look at Talks and Papers on the TEI. One particularly interesting article, of direct concern for Egyptological texts, is Textual Criticism and the Text Encoding Initiative.

On XML, the best starting points are the W3C and OASIS sites.