Muller's Digital CJK Dictionaries

The Structure and Function of the Interlinked Electronic CJK-English and Buddhist CJK-English Dictionaries

A. Charles Muller, Professor of Humanities, Toyo Gakuen University

May 13, 1998

Follow-up report, January 18, 1999

Our current age offers us dramatic new possibilities for the exchange and development of textual research resources, as we can now gather and transmit information with an ease and rapidity that was inconceivable through earlier media. Although most people will no doubt always prefer to have a hard copy of a book or lengthy article to sit down and read, lexicographical and encyclopedic reference materials, which have relatively brief and compartmentalized data formats, are extremely well suited to the digital domain, as they can be furnished with search and retrieval capabilities which are impossible in paper form. The twin CJK lexicons presented here stand as an early model of the possibilities of such digital reference materials. They have special relevance in that they were not created by a professional software company, nor are they the fruit of a large, endowed research team. Rather, they have been developed to their current state almost exclusively by a lone humanities scholar possessed of only the most minimal programming skills, using simple HTML techniques.

History

The two digital lexicons are (1) a dictionary of pre-modern East Asian characters and compounds, compiled primarily from the literary, philosophical and historical fields of inquiry (hereafter abbreviated as CJKDict) and (2) a dictionary of East Asian Buddhist terms (hereafter abbreviated as BDict), both compiled with the same structural format. I began the compilation of these two dictionaries in 1986 at the outset of my graduate studies. Since my focus was East Asian Buddhism, I needed to read original Buddhist and non-Buddhist documents in Classical Chinese. Also, since I had a fairly strong interest in Confucianism and Taoism and was studying all of the fundamental classical texts of these traditions, I was necessarily exposed to large amounts of non-Buddhist vocabulary.

I was well aware as an undergraduate student of the extreme dearth of lexicographical materials available in English for my fields. For Classical Chinese, there were the Giles and Mathews dictionaries, both venerable compilations but clearly outdated and limited in many ways. Similarly, for Buddhism, there was the Soothill dictionary, also over fifty years old, in many ways quite uneven, and lacking in pronunciations and scholarly references. There were also a few more recently published Buddhist dictionaries, but these were generally limited in scope to a particular tradition, or sparse in terms of scholarly explanation. Because of the relatively small scope and inferior quality of these compilations, the English-speaking researcher in East Asian philosophy and religion is, to the present day, forced to rely exclusively on East Asian reference materials. While the ability to work with East Asian secondary materials and reference works should always be a basic requirement in our field, the paucity of English-language reference works seems rather incommensurate with the present level of scholarship and general interest in these areas.

With this awareness in mind, I decided at the outset of my graduate studies that it might not be a bad idea to write down and save everything I looked up, especially since, as a relative beginner, I would be investigating a lot of basic terms that an advanced specialist would not even think about. Working in this way, after just one year of study I had accumulated a considerable glossary, which I entered into digital format for fast access and compact storage. I continued to enter all Buddhist and non-Buddhist CJK terms into a single compilation like this for about four years. In 1990 at SUNY Stony Brook, I was asked to teach an undergraduate course in Classical Chinese. With the Mathews and the Fenn Five-Thousand dictionaries both out of print at the time, I was in the awkward position of having to teach such a course without the benefit of a comprehensive dictionary. But reflecting on my own compilation, I realized that I already had more than enough Classical Chinese terms on hand to suffice for such an introductory course, so I took the important step of separating the non-Buddhist terms from the Buddhist terms and creating two separate compilations. I then checked through the draft of the Classical Chinese lexicon and supplemented it with all the characters which were going to appear in my introductory course texts.

The following summer, I completed my translation of five Chinese Classics,1 and in the course of this work added all the KSC-representable2 CJK logographs in these works. It was at this point that the compilation really turned, for the first time, into a full-fledged dictionary. At the same time, I was continuing the work of research and translation of a wide range of East Asian Buddhist classical texts, and from these I continued to add to both dictionaries. Compilation proceeded in this fashion past my graduation until the summer of 1995, when I first became aware of the World Wide Web and HTML publication. At this point I made my first primitive conversion of BDict into HTML format and installed it on my newly established web site, where it has remained since that time.3 CJKDict was during this time being prepared for commercial CD-ROM publication, a process which helped greatly to improve its overall quality.4 One year later, sensing a lack of interest on the part of the publisher in further updating and developing the project, I installed it on the Web site in HTML format for public access, alongside BDict.

Since then, I have continued to work towards the further development of both dictionaries, in terms of both quantity and quality of content as well as technical improvement, especially in regard to indexing and hyperlinking. As time passes, the degree of internal hyperlinking continues to increase. Of equal importance, however, is the matter of external hyperlinking, not only between the two sister compilations, but between them and similar CJK lexicons around the Web. During the past several months, the CJK dictionary has been interlinked on an individual character basis with Jim Breen's WWW JDIC Server, Rick Harbaugh's Etymological Dictionary and Chuck Polisher's I Ching Lexicon. More such linkups are expected in the near future.

Setting aside for the moment their special digital capabilities, both dictionaries already surpass many of their hard-copy counterparts in terms of basic content. The number of single characters with full information contained in CJKDict is already (to date) 8,000--more than is contained in Mathews (although Mathews still has many more compound entries). The content of the definitions themselves is, for the most part, far more extensive and precise than that of any other current CJK-English dictionary with a pre-modern focus, being derived from a wide range of authoritative Chinese, Korean and Japanese lexicons as well as from the direct reading of primary textual sources. The dictionaries also possess the unique aspect of providing readings in all three prominent northeast Asian languages--Chinese, Korean and Japanese.

If BDict were printed out today with single spacing at 12 pt. on A4 paper it would come out to a little over 600 pages, which far exceeds Soothill or any of the other English language dictionaries which treat East Asian Buddhism--and it is continuing to grow rapidly. It is also distinguished by being the only dictionary in this area which deliberately strives for a balanced treatment of Chinese, Korean and Japanese Buddhism. And like CJKDict, it provides readings in all three languages.

Beyond this basic superiority in content, both dictionaries are able to take advantage of digital functionality--most important of which is hyperlinking. Here, there are two general categories: the hyperlinking from the indexes, and the hyperlinks attached to important terms within the definitions themselves. Hyperlink-based indexes are far faster than normal hard-copy indexes, since all one has to do is click--there is no searching for pages and then terms within a page. The hyperlinks attached to the important definition terms also allow one to go immediately to check related concepts--and one does not need to keep one's fingers stuck in three places in the dictionary to save the location of other terms still under investigation. Given this combination of attributes then, these dictionaries surpass already existent lexicons in many ways.

Structure of the Dictionaries

Both dictionaries are based on the same general format, the only major difference being that the master file of CJKDict is stored in a database program, while BDict is stored in marked-up text format. However, when CJKDict is generated for publication, it first passes through a stage of text markup almost identical to the storage format of BDict. The two storage formats have their respective advantages and disadvantages. The database allows for quick sorting, for selection of certain sets of information for comparison, and for ready identification of errors of structure and content. It is extremely limited, however, where major global changes are necessary, where extensive work needs to be done with the definitions, or where definitions are unusually long. Since CJKDict is fairly well established at this time in terms of format, accuracy and hyperlinks, and generally does not contain overly long definitions, I have found the database format to be the most useful. Also, working within Microsoft Office 97, it is easy to have direct and fast access to individual entries contained in Access while working within Word via VBA macros. Thus it can be used for reference and be updated instantaneously, as long as such amendments are relatively minor.

In the case of BDict, however, the writing of any of the definitions is rarely a closed case, as most of the Buddhist terms, place names, school names, text names and personal names it contains can always benefit from some kind of new information and editing. There are always new places to be found to add hyperlinks, and since these are not added by typing, but through macros, development of the compilation inside a database has proven to be impractical. Therefore it is stored in the text markup format which I describe below. The structure presently used, while not yet SGML-valid, is based on SGML principles such that it can readily be set for SGML validity as the need arises.5 That is, structural and semantic elements of the text are tagged with numerous on <element> and off </element> bracket structures, which may be devised without limit by the user, as long as they are implemented consistently. Of course, this is the way that HTML markup is done, and many of the tags used are already HTML tags. With this kind of tagging structure within text format, the material can be easily converted into the desired publication format, whether it be hard copy, database, HTML or (hopefully soon) XML.

Let us now take a look at a sample entry at the text-encoding level, to identify its components (the circled numbers (e.g. ①) included are temporarily placed for reference purposes and are not part of the original document).

------------------------------------------------

①<entry ID="07501672C-061056027">② <gph>本性</gph> ③[w] <pron lang="ch-wg">pen-hsing</pron> [p] <pron lang="ch-py">ben3xing4</pron> [k] <pron lang="kmr">pons&obrev;ng</pron> [j] <pron lang="jp-rm">honsh&omacron;</pron> ④ <sense>"⑤<ind1=trm>Original nature</ind1=trm>," or "⑤<ind1=trm>inherent nature</ind1=trm>" (⑤<ind1=skt>prak&rdotblw;ti</ind1=skt>). An originally present fundamental quality of something, often equivalent to the concept of "self-nature" (⑥<linc>自性</linc> - Skt. ⑤<ind1=skt>svabh&amacron;va</ind1=skt>; Pali ⑤<ind1=pal>sabh&amacron;va</ind1=pal>). Buddhism, and especially Mah&amacron;y&amacron;na, generally rejects the concept of an inherent nature as being a mistaken perception. But on the other hand, in accordance with the general Chinese philosophical perception of the human nature as being originally good, certain texts will allude to the mind's inherent purity or quiescence. For one discussion of original nature, see the ⑤<ind2=txt-chn><wg>Yüan-chüeh ching</wg></ind2=txt> ⑥<linc>圓覺經</linc> at <cancol>T</cancol> 842.17.913c.</sense>⑦<ref>iwa750 ZGD1164b naka1263c</ref>⑧<resp>acm(entry)</resp></entry>

-------------------------------------------------------

The above entry from the master text contains many of the typical elements. Let us go through the most important of these.

① In accordance with database and SGML principles, each dictionary entry has a unique ID number. In the example above, the ID for the character 本 is 07501672C. Both of my dictionaries use the same ID construction system, a nine-character string for each character. The first three digits (075) represent the traditional radical (部首) number; the next two (01) indicate the number of strokes (画数) after the radical; and the final four are the character's Unicode hex number (672C). In the sample entry, the ID of the compound consists of the IDs of its two characters joined by a hyphen. The structure of this number allows for sorting by traditional radical and stroke, my preferred method of searching and arranging CJK lexical information. The inclusion of the Unicode hex number allows for linking with external dictionaries, as the usage of the Unicode hex number has become standardized among our small group of interlinked internet dictionaries.

② The next element, indicated by the tags <gph></gph>, demarcates the character graph(s) of the entry. This tag may be converted to a font value or anchor at the time of HTML publication, or to a field for database import. Also, when I use the dictionary for my own research work, I normally use the text-master version (rather than the HTML version) by means of an extensive array of macros, so it is also necessary for the head word to be distinctively tagged to enable the proper function of search macros.

③ The third section is that of the romanized readings, of which there are presently four: Chinese Wade-Giles (still quite popular in the field of English-language Buddhology), Pinyin, Korean (McCune-Reischauer) and Japanese (Hepburn). If the necessary unified font system can be used in the future, I will add Korean han'gul and Japanese kana. There is also a strong possibility of Vietnamese readings being added in the future.

④ Next is the main explanatory portion of the entry, bracketed by the TEI-recommended tags <sense></sense>. Included in this section are all meanings, synonyms, Indic and Tibetan equivalents and cross references.

⑤ A large number of the elements of the "sense" section are marked with tags for indexing purposes: person names, place names, English renderings of technical terms, Indic and Tibetan terms. These are also cross-classified by the cultural/linguistic regions of Indic, Chinese, Korean and Japanese. Since there is no single sample entry which includes all of these tags, we have selected one which provides a few: English equivalent, Sanskrit and Pali. Index tags are subdivided into two general categories: "ind1" and "ind2", the difference in number indicating the indexing level. Level one indicates that the indexed term is a direct reference/synonym or translation of the head word, whereas level two just marks the fact that it is a technical term appearing in the definition. The difference between the two can be compared to hard-print lexicographical indexes which differentiate by boldface and plain-face text.

⑥ One of the most important facets of this kind of dictionary is the possibility of hyperlinking. The method by which this is handled at the level of the master document is to mark terms which are to be hyperlinked with <linc></linc> tags. When the document is published in HTML format, a macro is run which searches for these tagged terms and replaces them with the appropriate hyperlinks, provided in a separate character ID properties table. This approach helps to keep the master document simple and readable. Also, since it is quite likely that the eventual form of the master document will be XML/SGML, this type of tagging leaves those options open. If, after investigation, an XML format seems to be most efficient for long-term storage and development, these links can be permanently changed so that they link to ID numbers according to SGML rules.

⑦ Following the <sense> division of the entry, we come to the references section, in which all occurrences of the entry term in the major Buddhist reference works are included. I was a little bit lax about including these during the earlier stages of my work, and so am now going back and filling many of these in. In the above sample, we have references to three works: the Iwanami Bukkyou Jiten 岩波: 仏教辞典 (p. 750), the Taishuukan Zengaku Dai Jiten 大修館: 禅学大辞典 (p. 1164b) and Hajime Nakamura's Bukkyogo Dai Jiten 中村: 佛教語大辞典 (p. 1263c).6

⑧ The final part of the entry indicates the person(s) responsible for its contents, using initials which are identified in the front matter of the work. In the case of this term, it was I who created the entry, and no other persons edited or added to any aspect of it, so no other names are included. There are other entries, however, which have as many as three names attached.7
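The ID construction system described above can be illustrated with a short sketch. The Python below is not part of the dictionaries' actual tooling (which, as noted, consists of VBA macros); the names `CharID`, `parse_char_id` and `parse_entry_id` are hypothetical. It assumes that every character contributes exactly nine characters to the ID (three radical digits, two stroke digits, four hex digits) and that the characters of a compound are joined by hyphens, as in the sample entry.

```python
from typing import List, NamedTuple


class CharID(NamedTuple):
    """One nine-character ID: radical number, strokes after the radical, code point."""

    radical: int
    strokes: int
    codepoint: int

    @property
    def char(self) -> str:
        # The character itself, recovered from its Unicode code point.
        return chr(self.codepoint)


def parse_char_id(text: str) -> CharID:
    # Assumed layout: RRR SS HHHH (3 decimal digits, 2 decimal digits, 4 hex digits).
    if len(text) != 9:
        raise ValueError(f"expected 9 characters, got {len(text)}: {text!r}")
    return CharID(int(text[:3]), int(text[3:5]), int(text[5:], 16))


def parse_entry_id(entry_id: str) -> List[CharID]:
    # Assumption: a compound's ID is its characters' IDs joined by hyphens.
    return [parse_char_id(part) for part in entry_id.split("-")]


ids = parse_entry_id("07501672C-061056027")
print(ids[0])                        # fields of the first character's ID
print("".join(c.char for c in ids))  # head word rebuilt from the code points

# Tuples compare field by field, so sorting on the parsed ID orders entries
# by radical, then by strokes after the radical, then by code point.
entries = ["061056027", "07501672C", "075006728"]
print(sorted(entries, key=parse_char_id))
```

Because every field has a fixed width, an ordinary string sort of the raw IDs yields the same radical-and-stroke order for IDs like these, which is presumably part of the appeal of the scheme.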

HTML Publication

The dictionary is stored, edited and added to in the above format, sorted by the ID numbers. Conversion of the entire dictionary is done by running a complex macro which changes all tags as necessary to support and optimize HTML display, adds hyperlinks and creates the indexes. At its present stage, the entire production takes about an hour and a half using VBA macros on a 300MHz system. The underlying code in the above presented sample, after the HTML publication macro, looks like this:

------------------------------

<entry ID="07501672C-061056027"><a name="本性">本性</a> [w] pen-hsing [p] ben3xing4 [k] ponsông [j] honshou. "Original nature," or "inherent nature" (prakr.ti). An originally present fundamental quality of something, often equivalent to the concept of "self-nature" (<a href="132.htm#自性">自性</a> - svabhaava). Buddhism, and especially Mahaayaana, generally rejects the concept of an inherent nature as being a mistaken perception. But on the other hand, in accordance with the general Chinese philosophical perception of the human nature as being originally good, certain texts will allude to the mind's inherent purity or quiescence. For one discussion of original nature, see the Yüan-chüeh ching <a href="031.htm#圓覺經">圓覺經</a> at T 842.17.913c.[References] iwa750 ZGD1164b naka1263c[Responsible] acm(entry)

-----------------------------------

In HTML display, this appears as:

-------------------------

本性 [w] pen-hsing [p] ben3xing4 [k] ponsông [j] honshou. "original nature," or "inherent nature (prakr.ti)." An originally present fundamental quality of something, often equivalent to the concept of "self-nature" (自性 - svabhaava). Buddhism, and especially Mahaayaana, generally rejects the concept of an inherent nature as being a mistaken perception. But on the other hand, in accordance with the general Chinese philosophical perception of the human nature as being originally good, certain texts will allude to the mind's inherent purity or quiescence. For one discussion of original nature, see the discussion in the 圓覺經 at T 842.17.913c. [References] iwa750 ZGD1164b naka1263c [Responsible] acm(entry)

---------------------
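The hyperlinking step, in which <linc> tags in the master text become links in the published HTML, might look roughly like the following. This is a sketch in Python rather than the VBA macro actually used. The `LINK_TABLE` dictionary stands in for the separate character ID properties table, and its contents, like the function name `resolve_links`, are assumptions made for illustration. The file names follow the pattern visible in the published sample, where the target file appears to be named after the radical number of the term's first character.

```python
import re

# Hypothetical stand-in for the character ID properties table:
# first character of a term -> HTML file holding that character's entries.
LINK_TABLE = {
    "自": "132.htm",
    "圓": "031.htm",
}

# Non-greedy match so that several tagged terms on one line stay separate.
LINC = re.compile(r"<linc>(.*?)</linc>")


def resolve_links(text: str) -> str:
    """Replace each <linc>term</linc> with an HTML anchor pointing at the term."""

    def to_anchor(match: "re.Match") -> str:
        term = match.group(1)
        page = LINK_TABLE.get(term[:1])
        if page is None:
            # No known target: drop the tag but keep the term as plain text.
            return term
        return f'<a href="{page}#{term}">{term}</a>'

    return LINC.sub(to_anchor, text)


print(resolve_links('"self-nature" (<linc>自性</linc>)'))
```

Keeping the link targets in a table outside the master text means that a change of file layout requires regenerating the HTML, not editing the entries themselves.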

The generation of the Classical Chinese dictionary follows a similar process, except that it is first exported from the database and converted into the above tagged form. CJKDict also differs in nature in that it aims to deliver much more specific information on the characters themselves, and therefore has many more detailed fields containing stroke, radical, variant, coding and readings information, etc.

HTML vs. CGI

A prominent difference between these dictionaries and others of similar type available on the internet is that they rely completely on a text-generated hyperlinked format, as opposed to a CGI-forms input format. The initial reason for this difference was simple: the compilers of those lexicons had sufficient technical knowledge and support to implement a CGI-forms arrangement, while I did not. But now the reason is quite different, as it is fully within my means to create such a setup if I so desired. I have decided not to, for a few important reasons. First of all, the dictionaries were originally designed for the kind of research where one can investigate a broad range of knowledge concerning a topic. Or, when perusing the dictionaries lightly, with no special aim in mind, one can just continue to point and click--to points of new curiosity without end. So with HTTP "travel," when a person comes across the term shown in our example, "本性," he or she can also naturally see on that page, with just a little bit of scrolling, all the other compounds available that start with the character 本. When information is delivered by a CGI script, it is usually a fragmentary and disconnected piece of information. The CGI method also tends to move away from the graphical nature of the Han character. It is quite often the case, when we are looking up a Han character which we have not seen or written for some time, that we do not have a precise recollection of its graphical form or pronunciation, but can quickly recognize it once seen. In this sense, various kinds of graphics- and pronunciation-based indexes are much easier to use than the rigid form-entry system. Another related virtue of the HTML approach is precisely its technical simplicity: the researcher with a minimum of computer skills may easily download the entire dictionary to his or her local system and run it as is. Except in the case of a person with considerable computing expertise, this is not possible with CGI-forms.

The Next Stage: XML

Of course, HTML has plenty of its own limitations, which is why the master version of the dictionary is not stored in HTML. HTML presently just does not have enough complexity and flexibility to adequately support a compilation of this sort. SGML, along with the guidelines of the Text Encoding Initiative (TEI), on the other hand, has everything we need for such a lexicon and more. The problem with SGML, though, is that it is presently used only by a small group of specialists, and is supported only by a few little-known browsers which the average humanities researcher cannot possibly have the time, energy, know-how and money to implement on his/her own. It is for this reason that up till now I have not made an effort to maintain the dictionaries in valid SGML.

It appears that this situation will change in the near future, as the World Wide Web Consortium (W3C) is preparing to open up the functionality of the internet considerably through the introduction of XML (Extensible Markup Language). Like HTML, XML is a kind of subset of SGML, but it is a subset which makes much greater use of SGML's power. It does this mainly by giving publishers the freedom to make the parameters of their documents self-defining, and by providing functions which permit much more database-like operation. Thus, in using XML we will have much more leeway in designing our fonts, types of links, embedded programs and indexing functions, and in working with large documents. Along with this will be included all of HTML's present functions (most importantly, hyperlinking), so nothing will be lost in the process. The increased functionality of XML will also allow for more complete interoperability with popular word-processing software. In fact, it is conceivable that a fully XML-based word processor could be developed (if those companies should choose to take advantage of the fact). The implementation of XML is well in progress, and it is expected that version 5 of the two major browsers will fully support XML.

If XML implementation becomes widespread as expected, it is quite likely that I will convert the storage format into "well-formed" XML. This means that users will still be able to download the dictionaries to their own systems as fully functional digital lexicons, but with much greater functionality. The adoption of a practical, interoperable and widely used format will also be of great importance for the further development and usage of such dictionaries, since it will encourage the adoption of similar formats by those who are creating related compilations, which will in turn enhance the possibilities for mutual cooperation in large projects. For example, Classical Chinese and Buddhist Classical Chinese dictionaries which are composed in Japanese, Korean, Chinese, or any other language could be easily integrated by the mere fact of being structured in XML and using a compatible ID system.
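To make the integration scenario concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. Everything about the data is assumed: the `<dict>` wrapper, the `lang` attribute on `<sense>`, and the Japanese glosses are invented for illustration, and only the `ID` attribute and the `<entry>` and `<sense>` element names echo the sample entry shown earlier. The point is simply that two well-formed files sharing an ID system can be joined with an ordinary lookup on the ID.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Two tiny, invented dictionaries that share the same entry ID system.
ENGLISH = """<dict>
  <entry ID="07501672C-061056027"><sense lang="en">Original nature</sense></entry>
</dict>"""

JAPANESE = """<dict>
  <entry ID="07501672C-061056027"><sense lang="ja">本来の性質</sense></entry>
  <entry ID="07501672C"><sense lang="ja">もと</sense></entry>
</dict>"""


def merge_by_id(*documents: str) -> Dict[str, List[Tuple[str, str]]]:
    """Collect (language, sense) pairs from several dictionaries, keyed by entry ID."""
    merged: Dict[str, List[Tuple[str, str]]] = {}
    for document in documents:
        root = ET.fromstring(document)
        for entry in root.iter("entry"):
            for sense in entry.iter("sense"):
                merged.setdefault(entry.get("ID"), []).append(
                    (sense.get("lang"), sense.text or "")
                )
    return merged


for entry_id, senses in merge_by_id(ENGLISH, JAPANESE).items():
    print(entry_id, senses)
```

An entry present in only one source simply appears with a single sense, so partial overlap between compilations is not an obstacle.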

Notes

1. See http://www.hm.tyg.jp/~acmuller/mullertext.htm#fiveclassics

2. At this point I was working with an early Korean word-processing program developed by IBM-Korea called KWP.

3. Presently located at http://www.hm.tyg.jp/~acmuller/index.html

4. Published by EAST KK, Tokyo Japan, in August 1996. At the time of this writing still available for order through the internet at http://www.est.co.jp

5. The reason I do not yet keep the work in fully compliant SGML markup is simply a lack of need, as many SGML markup structures can be represented more simply when an SGML browser is not in use. If indeed XML support becomes standardized in the major Web browsers in the near future, that will be enough incentive to make the whole work fully XML-compliant (XML being very close to SGML).

6. The references for the Iwanami and Taishuukan dictionaries are taken from the extensive index of Buddhist dictionaries developed by the IRIZ project at http://www.iijnet.or.jp/iriz/irizhtml/irizhome.htm

7. For the initial establishment of much of the above structure I am indebted to Dr. Christian Wittern of Goettingen University (http://www.gwdg.de/~cwitter/), who found my dictionary on the Web when it was in its earliest and most primitive state. Christian converted the entire work into SGML format and also provided many Pinyin readings. Since that time I have taken some of the tags out of strict SGML format for the sake of simplicity. In the event of return to SGML/XML format, these can be easily reconverted.