TKSESH

A hieroglyphic database system

Serge Rosmorduc

Équipe EC.art, laboratoire LRIA, Université Paris 8

Abstract

We introduce tksesh, a multiplatform hieroglyphic editor, text database and dictionary. Tksesh is intended to be both a toolkit for building philological applications and an example of such an application. Its core is a hieroglyphic editor which understands "Manuel de Codage" encodings. Edited texts can be saved in a database and referenced from the dictionary via hyperlinks. The dictionary can handle complex definitions by multiple authors. Most of the text in the dictionary has a precise meaning, not only for the reader but also for the computer, which allows automated processing and potentially complex searches. An important feature of the dictionary is that it can contain references to the text database in a readable form, and that clicking on one of them pops up the referenced text. Exhaustive searches in the text database are of course also possible. Tksesh is written in the Tcl/Tk language, which is freely available for Windows, Macintosh, and Unix systems, and this makes it very portable.

Introduction

The system we introduce here, called tksesh, is a multiplatform system (it works under Windows 95 and Unix, and should work on Macintosh as well) built around a hieroglyphic editor (compatible with the Manuel de Codage) and a database engine.

We started working on it quite a long time ago, but at that time it was a rather secondary project. It nevertheless turned out to be potentially useful, so we decided to develop it further. What is presented here is a preliminary version of the software. We look forward to opinions and criticism that will help improve it.

Context and goals of the system

While we were working on our computer-science thesis, whose subject was automatic syntactic analysis applied to Middle Egyptian, the question of what was to be done with the texts we worked on was left in the background. However, we ended up with the idea of an integrated environment in which one could store most of the information one reads and produces while working on a text, share it, and, most importantly, retrieve it. The system could also be a testbed for the Natural Language Processing systems we worked on.

The goal of tksesh is both to help build a complete database of Ancient Egyptian texts and to be a working tool for its user, in which one can keep notes, lexical files, and so on.

The components of the system

Tksesh is currently an integrated system : a number of quite different elements bound together in one program. We will now proceed to describe these elements, starting with the editor.

The editor

The hieroglyphic editor is the central element of tksesh; its use is described here, and information on its code will be given in the system's documentation. The editor was primarily designed for typing texts into databases, not for printing. Our main goal was therefore ease and speed of typing rather than accuracy of the printed representation.

Typing text

Like any hieroglyphic editor, tksesh allows signs to be entered from a menu or by code. Simple sign grouping is done with the Manuel de Codage symbols ":" and "*", which allows fast typing; complex grouping is done by menu, so that it is impossible to enter incorrect codes in the system. The strong point of tksesh as far as typing goes is that it is tolerant about codes. When a transliteration is typed and the sign displayed is not the expected one, pressing the spacebar proposes another sign. For example, if I type "mr", I get a first sign [sign image]. If I want the pyramid sign (O24), I press the spacebar a few times until it appears [sign image]. The next time I enter "mr", the system will remember this and propose O24 first. Even better, when the list of possible signs is exhausted, the system looks in the dictionary for words having that transliteration. Thus, in the present state of the dictionary, typing "iw" and then the spacebar will propose [image of the proposed spellings].
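
The proposal mechanism described above can be pictured as an ordered candidate list per transliteration, in which the sign last accepted by the user moves to the front. The following Python sketch is only an illustration of that idea, not tksesh's actual code (which is written in Tcl and C). The class name `SignProposer`, its methods, and the candidate lists in the usage note are hypothetical; only the code O24 for the pyramid sign comes from the text.

```python
class SignProposer:
    """Cycle through candidate signs for a typed transliteration (illustrative)."""

    def __init__(self, sign_table, dictionary):
        # Copy the lists so that reordering does not touch the caller's data.
        self.sign_table = {k: list(v) for k, v in sign_table.items()}
        self.dictionary = dictionary

    def candidates(self, translit):
        # Individual signs come first; dictionary spellings are only reached
        # once the list of signs is exhausted.
        signs = self.sign_table.get(translit, [])
        return signs + list(self.dictionary.get(translit, []))

    def propose(self, translit, presses=0):
        """Candidate shown after `presses` presses of the spacebar, or None."""
        options = self.candidates(translit)
        return options[presses] if presses < len(options) else None

    def accept(self, translit, code):
        """Remember the user's choice: it is proposed first next time."""
        signs = self.sign_table.get(translit, [])
        if code in signs:
            signs.remove(code)
            signs.insert(0, code)
```

With a hypothetical table such as `{"mr": ["U6", "N36", "O24"]}`, `propose("mr", 2)` yields O24, and after `accept("mr", "O24")` the same sign is the first proposal. What happens once every candidate has been shown is not specified in the text; the sketch simply returns `None`.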

Grammatical information

Supporting the grammatical codes of the Manuel de Codage was essential for a system whose primary goal was text databases. However, manipulating these codes raises a problem. The display, and more generally the way a user manipulates the text, is cadrat-oriented, whereas word separations and grammatical separations are sign-oriented. Hence we have two problems: the first is to design how the user will add this information, and the second is how the system will extract words and the like. At present, grammatical markers are indicated by sign colors: blue signs indicate word endings, yellow signs grammatical markers, and green signs word endings that are also grammatical markers (see Figure 1).

FIGURE 1 : WORDS ENDINGS IN LOUVRE C14 STELA

Currently, the user can only change the status of the last sign of a cadrat, typing "/" to make it a word ending and "=" to make it a grammatical ending. If the marking is done while the text is being entered, this is not a problem. It becomes one when the marking is done afterwards: one then has to break the cadrats and rebuild them after marking, which is not convenient. The next version will include a sign-by-sign navigation mode, which will make it possible to move one sign at a time and thus to change the status of the current sign.
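
The tension between cadrat-oriented editing and sign-oriented marking can be made concrete with a small data model. The Python sketch below is an assumption about how such markers could be represented, not a description of tksesh's internals: the names `Sign`, `mark_last` and `split_words` are hypothetical, the color of unmarked signs ("black") is a guess, and the text does not say whether "/" and "=" can be combined on one sign (the sketch assumes they can, which gives the green case).

```python
from dataclasses import dataclass


@dataclass
class Sign:
    code: str
    word_end: bool = False   # set with "/"
    gram_end: bool = False   # set with "="

    def color(self):
        """Display color following the convention of Figure 1."""
        if self.word_end and self.gram_end:
            return "green"
        if self.word_end:
            return "blue"
        if self.gram_end:
            return "yellow"
        return "black"  # assumption: the color of unmarked signs is not stated


def mark_last(cadrat, key):
    """Apply "/" or "=" to the last sign of a cadrat (a list of Sign)."""
    last = cadrat[-1]
    if key == "/":
        last.word_end = True
    elif key == "=":
        last.gram_end = True
    else:
        raise ValueError(f"unknown marker key: {key!r}")


def split_words(cadrats):
    """Extract words sign by sign, ignoring cadrat boundaries."""
    words, current = [], []
    for cadrat in cadrats:
        for sign in cadrat:
            current.append(sign.code)
            if sign.word_end:
                words.append(current)
                current = []
    if current:  # trailing signs without a word-ending marker
        words.append(current)
    return words
```

Because `split_words` walks the signs and not the cadrats, a word may end in the middle of a cadrat; this is exactly the case that `mark_last` alone cannot produce and that a sign-by-sign navigation mode would address.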

A related problem is the difficulty of cutting and pasting a word. This possibility is highly desirable, since it would allow us, for example, to add commands like "find the current word in the dictionary", "enter the current word in the dictionary", and so on. To do this properly, we have to

be able to designate the current word (and for this, the right unit is the sign, not the cadrat);
be able to build a cadrat from parts of a cadrat.

This is not currently possible, but should be soon.

References

As our goal was to build an "intelligent" text database system, with hypertext links all over the place, we needed a way to refer to particular points in a text in an efficient way.

We thought that readable references, very much akin to those used when referring to paper editions, would be appropriate. This has several advantages. First, it allows the references to be used outside the database: we support things like "O. DM 1567 verso, 2". Second, long texts are not always entered from the first line of the first page onward. You can start by typing an interesting part of a text, enter information about its contents in the database, and some time later decide, for completeness' sake, to type the rest. Our current system therefore allows the position of the current part of the text to be given explicitly. For example, in Figure 2, the content of P. L2 was not entered in full. But as the first line is explicitly "page 2, ligne 6", references to parts of the text will still be exact, even if we type the first pages afterwards.

FIGURE 2 : EXPLICIT REFERENCES IN A TEXT

In order to create this system while keeping our files "Manuel de Codage"-compliant, we used the comment system. The first lines of L2 look like this once saved :

      ++TKSESH DATABASE FILE+s
      ++NAME Ptahhotep,L2+s
      ++COORDS=page 2, ligne 6+s
      -i-r-wn:n-n:k\-m-s-.-sSm
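
Since the header is stored in ordinary Manuel de Codage comments, it can be read back with very little code. The Python sketch below assumes that the file format is exactly what the three header lines above show (a comment is delimited by "++" and "+s", NAME is followed by a space and COORDS by "="); anything beyond that sample, including the function name `read_header`, is hypothetical.

```python
import re

# A Manuel de Codage comment line, as in the sample: ++ ... +s
COMMENT = re.compile(r"^\s*\+\+(.*?)\+s\s*$")
# Assumption: only NAME and COORDS carry data, as in the sample above.
FIELD = re.compile(r"^(NAME|COORDS)[ =]\s*(.*)$")


def read_header(lines):
    """Return (header fields, remaining text lines) of a saved file."""
    header, body = {}, []
    for line in lines:
        comment = COMMENT.match(line)
        if comment is None:
            if line.strip():
                body.append(line.strip())
            continue
        field = FIELD.match(comment.group(1))
        if field:
            header[field.group(1)] = field.group(2)
    return header, body
```

Applied to the four lines above, this gives the name "Ptahhotep,L2" and the coordinates "page 2, ligne 6", and leaves the encoded text line untouched.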

Our main problem with this reference system now is to make it really usable by the end user. The interface to the reference-setting system is probably not very user-friendly. As an example, Figure 3 shows the system for the stela Berlin 1157, from an example file of Winglyph. The stela has four zones: three called A, B and C, and the main text below, which is not designated by a letter. The text is thus divided into "lines" (which could be columns), grouped into zones. The lines are simply numbered (the value NUM for coord 1), and the zones (a level usually used for pages) are named A, B, etc. Hence the numbering system is A1, A2, B1, C1, 1, 2, etc.
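
The two-level numbering can be reproduced in a few lines. In the Python sketch below, a zone is a pair (name, number of lines), with an empty name for the unlettered main text; the function name `zone_labels` and the line counts per zone are assumptions chosen to reproduce the sequence given above, not data taken from the stela.

```python
def zone_labels(zones):
    """Build readable line references from (zone name, line count) pairs."""
    labels = []
    for name, count in zones:
        # Lines are simply numbered from 1 inside each zone.
        for number in range(1, count + 1):
            labels.append(f"{name}{number}")
    return labels


# Hypothetical line counts: two lines in A, one in B and C, two in the main text.
berlin_1157 = [("A", 2), ("B", 1), ("C", 1), ("", 2)]
labels = zone_labels(berlin_1157)  # A1, A2, B1, C1, 1, 2
```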

FIGURE 3 : REFERENCE CREATING SYSTEM

Further needs

The main shortcomings of tksesh's editor lie in the domain of presentation. Improved cadrat rendering and column handling would be welcome. More seriously, a font editor is absolutely necessary. We have already written one, but it runs only under UNIX and thus cannot be integrated into the whole system.

Another interesting addition would be support for multiple reference systems. It might be useful, for example, to be able to choose between a reference to the original source and a reference to an edition (that is, for example, between P. Leyde I 350 verso 13 and KRI II,813,3).

The dictionary

Introduction

When we decided to transform tksesh into a work environment for studying texts, the need for a linked dictionary arose naturally. In the first version of the dictionary, entries were quite simple, with three fields: "transliteration", "spelling" and "translation", the last being free text. We also included the possibility of adding hypertext references to the text database.

However, a real dictionary entry is both complex and highly structured. Entering it as free text is not a very good option, because the structure is lost to the computer: the human reader might be able to reconstruct it, but the system cannot. Many kinds of automated processing that would be possible with a well-structured lexicon then become impossible.

On the other hand, giving a very precise and rigid form to the dictionary would also be a problem, because it would force an artificial structure on all definitions.

Last, but not least, the structure proposed should be extensible, but no extension should break existing data.

Hence the structure we are now going to describe. This structure is supported by the editor built into the dictionary, which prevents unstructured entries from being made. It allows many different styles of dictionary entry to be entered while keeping the maximum amount of structural information. We tested it by entering definitions from GARDINER's lexicon, FAULKNER, HANNIG, and P. WILSON's A Ptolemaic Lexikon.

Structure of the dictionary

The fields in the dictionary are of roughly three types: base fields, which contain one and only one type of information (for example, a transliteration); complex fields, which can contain mixed information (for instance text in transliteration and hieroglyphs); and the group and comment fields.

The group field

The group field is the main organizational device of the dictionary. Groups can be nested, to represent sub-meanings of a word, derived words, and so on. The basic point is that if a group contains multiple fields of the same kind, say multiple transliterations, they are supposed to be variants. In the case of translations, this would mean near-synonymous meanings. In Figure 4, all spellings for ip.t are supposed to be equivalent. In such a case, the completion system described above should be able to propose all these writings.

FIGURE 4 : REPRESENTATION OF VARIANT SPELLINGS

When some important information (transliteration, spelling, for example) is not available in a group, it is supposed to be inherited from its parent group.
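
The inheritance rule amounts to a lookup that climbs from a group to its ancestors until the requested kind of field is found. The Python sketch below illustrates that rule under assumptions of our own: the class name `Group`, the field names and the placeholder values are hypothetical, and the example only loosely follows the Awi entry of Figure 5.

```python
class Group:
    """A dictionary group; several values of one kind are variants."""

    def __init__(self, parent=None, **fields):
        self.parent = parent
        self.fields = {kind: list(values) for kind, values in fields.items()}

    def get(self, kind):
        """Values of a field, inherited from the nearest ancestor if absent."""
        group = self
        while group is not None:
            if group.fields.get(kind):
                return group.fields[kind]
            group = group.parent
        return []


# Head word, then a sub-group that overrides spelling and translation only.
head = Group(transliteration=["Awi"],
             spelling=["<head spelling>"],          # placeholder value
             translation=["<head translation>"])    # placeholder value
be_long = Group(parent=head,
                spelling=["<spelling with other determinatives>"],
                translation=["be long"])
```

Here `be_long.get("transliteration")` returns the head word's transliteration, while `be_long.get("translation")` returns its own value. A lookup of this kind leaves existing data untouched when new field kinds are added, since an unknown kind simply yields an empty list.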

Let's take, for example, the entry for Awi (Figure 5). The groups (indicated by French quotes << >>) delimit a number of new entries. The first one is for the adjective-verb meaning "be long", which has the same transliteration as the head word but different determinatives. It inherits its transliteration, but changes the translation and the spelling.

Then comes the expression "ib=f Aw". Expressions and composite words are a tricky problem, and their representation might need some improvement. At present, we have a number of tags that can be used in expressions to represent the word currently being defined, an animate, or an inanimate. The other words may be free text or explicit transliterations of words, in which case they are indexed. For instance, in the present case, a search for the word "ib" will retrieve this definition.

FIGURE 5 : COMPLEX DEFINITIONS

The comment field

In some cases, a dictionary definition can be a true little monograph on a word. In this case, the dictionary entry structure is not very efficient. This is the reason for the comment field's existence. It is there for anything that can't fit in a definition. Many kinds of fields can appear in it, as in a small text editor.

References

References are hypertext links to the text database. They are readable by the human reader, like in the example on the right, taken from Amenemope.

These links are made quite easily, by using the "copy reference" menu option in the editor, and pasting the reference in the dictionary. Afterwards, a click on the reference will load the text at the proper place.

Signature

An important feature for further use of the system will be the possibility to share texts and dictionary entries, and conversely to identify the author of these entries. The Signature field is supposed to be used for this. A further development would be to fill it automatically -- at least, at data exchange time.

Indexing and search

The system builds an index for each dictionary entry, into which it writes references for each transliteration, spelling, and translation. Any field of these types in dictionary entries is indexed, so even sub-definitions are entered in the database.
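
One way to picture this is an inverted index that walks every group of an entry, sub-definitions included. The Python sketch below is a simplified assumption: it keeps a single global index and represents groups as nested dictionaries, whereas the actual storage used by tksesh is not described here. The names `walk`, `build_index` and `INDEXED_KINDS` are hypothetical.

```python
from collections import defaultdict

# The three kinds of fields that the text says are indexed.
INDEXED_KINDS = ("transliteration", "spelling", "translation")


def walk(group):
    """Yield a group and, recursively, all of its sub-groups."""
    yield group
    for sub in group.get("groups", []):
        yield from walk(sub)


def build_index(entries):
    """Map (kind, value) to the set of entries in which the value occurs."""
    index = defaultdict(set)
    for entry_id, root in entries.items():
        for group in walk(root):
            for kind in INDEXED_KINDS:
                for value in group.get(kind, []):
                    index[(kind, value)].add(entry_id)
    return index
```

With an entry "Awi" whose sub-group carries the translation "be long", looking up `("translation", "be long")` in the index leads back to the head entry.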

An important practical point is that text is normalized to ease searching. For example, hieroglyphs are saved as Gardiner codes, whatever their original form, and only the list of signs is saved. So someone looking for "p*t:pt" and typing "p:t-pt" will still find the word. The transformation could be improved further by suppressing redundant phonetic complements and the like, but this is future work.

Transliterations are also simplified for searches: every 'j' is turned into 'i', all points are suppressed, etc. Note that this is only done at search time. Any point entered in the dictionary entries is retained.
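
Both normalizations are easy to sketch. The Python code below is an illustration under stated assumptions, not the system's implementation: `sign_list` and `search_key` are hypothetical names, the `GARDINER` table is a three-entry illustrative subset of the phonetic-code table, only the grouping symbols "-", ":", "*" and parentheses are handled, and of the transliteration rules only the two named in the text are applied.

```python
import re

# Illustrative subset of the phonetic codes and their Gardiner numbers.
GARDINER = {"p": "Q3", "t": "X1", "pt": "N1"}


def sign_list(mdc):
    """Reduce an encoded group to its bare list of Gardiner codes."""
    # Drop the layout information: cadrat separators and grouping symbols.
    tokens = [t for t in re.split(r"[-:*()\s]+", mdc) if t]
    # Codes missing from the table are kept unchanged.
    return [GARDINER.get(t, t) for t in tokens]


def search_key(translit):
    """Simplify a transliteration for comparison at search time only."""
    return translit.replace("j", "i").replace(".", "")
```

Both "p*t:pt" and "p:t-pt" reduce to the same list of three signs, so either typing finds the word. Likewise "jp.t" and "ip.t" share the search key "ipt", while the stored entry keeps its original points.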

Extensions

Looking at the current state of the dictionary, we see a possible generalization: the same mechanism can be used for freer text, for example for notes and the like. So we intend to reuse the dictionary code to allow the editing of general notes, which will benefit from the dictionary's indexing mechanism.

Another interesting extension would be to add a reference system for parts of the dictionary. This would allow one definition to be referenced from another. It would be particularly useful in expression definitions, as it would make it possible to reference precisely and explicitly the words which appear in the definition.

The transliteration and translation editor

This facility will be a central working point of the system once finished. What we present now is just a model. It allows the translation and the transliteration of a text to be edited in parallel and saved separately. The advantage is that multiple translations can be edited for one text.

FIGURE 6 : TRANSLATION EDITOR

In the editor, all texts are synchronized : the edited line is always displayed in translation, transliteration, and hieroglyphs. We worked a little with the model, and it seems to be quite suitable. We linked it with the dictionary, and it is now possible to look for the word selected in the hieroglyphic window.

Search facilities

One of the most important facilities a database can provide is the ability to retrieve the information it contains. Our database therefore makes it possible to find a word (given in transliteration) in the texts. For texts which have been manually transliterated, the search should use the man-made transliteration, because it is supposed to be accurate. But (as we'll see later) we also have an automatic transliteration program that produces a rough transliteration. It is quite fast on a basic Pentium computer, and can be used if no time is available to write a transliteration. The search made in this case is neither complete nor reliable: some occurrences might be missed, and some of the words found can be errors. Yet it can give a fast initial working base, and it should improve along with our transliteration system. Figure 7 shows the result of a search for the word Axt, and an example of a solution.
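
The search policy just described (prefer the man-made transliteration, fall back on the automatic one and flag the result as uncertain) can be sketched as follows. This Python code is a hypothetical illustration: the function name `search`, the layout of the `texts` mapping and the `auto_transliterate` callback stand in for components whose real interfaces are not described here, and words are assumed to be separated by spaces.

```python
def search(word, texts, auto_transliterate):
    """Find a transliterated word; return (text name, position, reliable)."""
    hits = []
    for name, text in texts.items():
        manual = text.get("transliteration")
        if manual is not None:
            # A man-made transliteration is supposed to be accurate.
            words, reliable = manual.split(), True
        else:
            # Rough automatic transliteration: hits may be wrong or missing.
            words = auto_transliterate(text["hieroglyphs"]).split()
            reliable = False
        for position, candidate in enumerate(words):
            if candidate == word:
                hits.append((name, position, reliable))
    return hits
```

Keeping the `reliable` flag in each hit lets an interface present the uncertain occurrences separately, so that the user knows which ones need checking against the hieroglyphic text.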

FIGURE 7 : SEARCH RESULT

Natural Language Processing

The system includes (currently only in its UNIX version) a Prolog interpreter which allows us to use natural language processing techniques more easily. We had sketched a transliteration system in [ROS94]. After we had worked further on syntactic analysis problems, a student, L. KERBOUL, worked on the subject [KER97]. As the result was interesting, we decided to work on transliteration again and, if possible, to end up with a usable system. A detailed technical description does not fit here; let us only say that the basic principle is still the one described previously, but the performance has much improved. The main remaining problem is word segmentation, which is difficult. For this, the best solution would be an interactive one, with the system proposing word divisions and the user correcting them. Interestingly, this is the solution used by some for editing Asian languages (see [MCB97] for an example for Thai).

Further developments

The main lines of development of the system will be:

improvement of the editing system
improvement of the natural language processing system -- ultimately with the incorporation of grammatical information
creation of an exchange module --- to allow easy information sharing and exchange, using disks or even the net.

Conclusions

The system I've just described is still in its infancy. Its foundations are nevertheless quite sound, and it works well. What it needs now is users, to bring it to life and to propose friendlier interfaces and useful additions.

Appendix

TCL and TK

Tcl/Tk is a programming language developed by John OUSTERHOUT at the University of California, Berkeley (USA) and then in the SUN research department. It is a simple, powerful, cross-platform language, specially designed to be embeddable in programs written in compiled languages like C or Pascal. Tksesh is based on a number of extensions to Tcl/Tk, written in C. It runs under UNIX and Windows 95. Owing to the flexible nature of Tcl, it is possible to use tksesh to write small applications that need to display hieroglyphs.

Availability

The system will be available free of charge, as "textware": if the system is of some use to you, please contribute some texts. It would definitely be better to ask which texts are needed before sending them. At present, the software is quite young and has had very few users. For this reason, I am not releasing it by simply putting it on an FTP server: you will have to register first. Details about obtaining the system will be available by next autumn at http://www.iut.univ-paris8.fr/~rosmord/EgyptienE.html.

References

BIL95 S. BILLET, 1995, Apports à l'acquisition interactive de connaissances contextuelles, Thèse de doctorat de l'Université Montpellier II (another approach to automated transliteration).
KER97 F. KERBOUL, 1997, Translittération automatique des hiéroglyphes, Rapport de stage de l'ENSTA.
MCB97 S. MEKNAVIN, P. CHAREONPORNSAWAT and B. KIJSIRIKUL, 1997, Feature-based Thai Word Segmentation, in Natural Language Processing Pacific Rim Symposium 1997, Phuket, Thailand.
ROS94 S. ROSMORDUC, 1996, Traitement automatique du langage naturel en moyen égyptien, in Robert VERGNIEUX, editor, Xième conférence Informatique et Égyptologie.
ROS96 S. ROSMORDUC, 1996, Analyse morpho-syntaxique de textes non ponctués, Thèse de doctorat, École normale supérieure de Cachan.

Serge Rosmorduc