Digital Editor's Preface to the Web Publication of the Descriptive Catalogue

Charles Muller

October 20, 2004

Table of Contents

1. The Obstacles to Digitization
2. Lewis Lancaster
3. Makeup of the Descriptive Catalogue
4. The Digitization Process
5. The Present and Future Status of the Project
6. Acknowledgments

1. The Obstacles to Digitization

At the time of the completion of this project of the digitization and web release of the Descriptive Catalogue (October 2004), the Internet, in its status as the emerging universal library for cultural and literary information, stands at a somewhat ambiguous juncture. On one hand, it is clear that a decade after the birth of the scholarly resource as web site, both the volume of reliable information available on the Web and the sophistication of the techniques for its presentation are continuing to grow at a rapid pace. Hence the task of trying to put together a complete and well-organized guide to sites offering serious and reliable information, even limited to the field of Buddhist Studies, would already be daunting. Resources already online are being well used by academic researchers, their students, and people from all sorts of backgrounds and occupations. For many types of research projects, it is perfectly appropriate for a student to begin the task with a keyword search on the Web.

Yet on the other hand, at least from the perspective of those who are more personally involved in the development of professional-quality web resources, the pace at which they are making their way into cyberspace can at times seem excruciatingly slow. Although most versions of the Buddhist canons have now made it on to the Web, along with a smattering of reference works, the bulk of the dictionaries, encyclopedias and so forth that Buddhologists need for their everyday work remain lodged in book form, with most of them having little prospect for digitization and ready accessibility on the web in the near future.

The main obstacle here lies in problems associated with intellectual property rights. Most dictionaries, indexes, concordances, maps and so forth are owned by companies, institutions, or individuals who have invested time and energy into their production, and who, according to the economic model that has become part and parcel of modern scholarly research, need to continue to see a profit from their distribution. Thus, there are relatively few reference works that may legally be digitized and freely published, and those that are free for these purposes are usually dated materials.

Even in cases where IP issues regarding a particular text have been resolved, the labor involved in the proper digital preparation of a significant reference work is considerable in scope. The scanning and OCRing of source materials must be done with care, and subsequent editing must be done by skilled persons who understand the content of the materials with which they are working. And since any digitization task that stops at the stage of the mere rendering of documents into a simple plain text format risks the forfeiture of most of its value, the vitally important task of markup must be carried out thoroughly and accurately. This work can only be done by persons who not only understand the content of the subject text, but who have also gained a sufficient measure of aptitude in a computer markup language such as XML.

2. Lewis Lancaster

Leaving aside for the moment his work in compiling The Korean Buddhist Canon: A Descriptive Catalogue, Lewis Lancaster will no doubt go down in history as the leading pioneer in our time in terms of his energy and foresight in encouraging, organizing, and networking a wide range of the first generation of digitized Buddhist Studies textual and cultural resources. As the key organizer of the Electronic Buddhist Text Initiative (EBTI) in the early 1990's, Prof. Lancaster helped to create a forum wherein various digitization projects (initially, mostly projects focused on the digitization of the Buddhist canon) could share ideas, technology, funding, and energy. Through the auspices of bleeding-edge techno-academic gatherings like EBTI, and later, the Pacific Neighborhood Consortium (PNC) and then the Electronic Cultural Atlas Initiative (ECAI), Lancaster's stimulating energies moved on from digitization of the basic set of canonical texts, to the development of a wide range of textual, graphic, multimedia, GIS-based, and other resources that spill far out beyond the confines of Buddhist studies, to the cultures of every region.

Thus, there is probably not a single person who could have understood the value of the digitization and Internet presentation of the Descriptive Catalogue better than its own author. As it turns out (and again, much is due here to Lancaster's own influence) the Korean Canon itself (under the title Tripiṭaka Koreana) was one of the first Buddhist canons to be fully digitized, and the efforts made in this pioneering task also contributed much to the digitization of the other East Asian canon which is in great part based on the Tripiṭaka Koreana — the Japanese Taishō.1 As Lancaster explains in the introduction to the Descriptive Catalogue, the texts compiled in the Tripiṭaka Koreana woodblocks hold a pivotal role in the development of the East Asian canon as we know it today.

3. Makeup of the Descriptive Catalogue

As I mentioned in my introduction to the digitization project we completed last year, that of the Soothill-Hodous Dictionary of Chinese Buddhist Terms, the task of digitization forces the editor-in-charge to read every line of the subject work with a kind of attention not required of the average reader. One is also, in constructing the markup scheme, forced to fully apprehend the original author's organizing principles. Essentially, one has to follow the traces of the author's development of the project. This sort of scrutiny will expose possible missteps and errors, but will also, in the case of a careful and exhaustive work such as the Descriptive Catalogue, lead one to a great appreciation for the authors' degree of mastery over the content, and for the great attention to detail that goes into the production of a well-done reference work. Having completed the task of digitizing the Descriptive Catalogue, I came away with a deep sense of respect for Lancaster, Sung Bae Park, and all the other collaborators in this project for their painstaking efforts.

Providing detailed descriptions of the provenance of some 1600 canonical works, including authors, translators, compilers, dating, geographical sites, monasteries, and equivalent and related titles in Sanskrit, Pali, and Tibetan, the Descriptive Catalogue has been for over two decades an indispensable source for those requiring information on the texts contained not only in the Korean Canon itself, but also in the larger East Asian canon. The Descriptive Catalogue was also one of the earliest works in the West to seriously acknowledge the existence of the Korean Buddhist tradition at all, much less its vitally important canon.

My own interest in digitizing the Descriptive Catalogue came in connection with the continued development of my own Buddhist studies reference work—one that has been in digital format from the outset and has been published only on the Web—the Digital Dictionary of Buddhism (DDB). While seeking to provide as broad a coverage as possible of information on Buddhist terms, personages, texts, and so forth, the de facto focus up to this point, based on the character of my own research inclinations towards translation and exegesis, has been on technical terms. The assimilation of the data from the above-mentioned Dictionary of Chinese Buddhist Terms did much to round out our already sizeable collection of East Asian terms, additionally allowing us to offer the guarantee of the full coverage of a well-known, albeit dated, reference work.

Despite the fact that in the process of our work on the DDB to date, we had already succeeded in providing a significant amount of coverage of canonical East Asian Buddhist texts (several times more than any comparable English-language dictionary), it has always been my belief that the ideal aim of a reference work of this type should be, wherever possible, to guarantee complete coverage of certain areas, such that the user knows with some certainty before consulting the reference work whether or not the information sought should be there. With this ideal in mind, as the DDB grew in scope, it seemed that it would be a great enhancement to the project if we could eventually provide complete coverage of the texts contained in the East Asian canon, the core Chinese part of which can be said to be well covered in the Korean Canon. For this purpose, the Descriptive Catalogue would be an ideal initial resource, since, although it does not treat the thematic content of each work, it does provide the basic facts of the provenance, translation, and references related to the East Asian versions of all the most important texts. With this information as a starting point, we can, over time, develop the thematic information. However, the task of assimilating this information to the DDB in any sort of systematic and efficient manner could never be carried out until the Descriptive Catalogue was made available in digital format, including some sort of markup scheme that could be reasonably well mapped to that in use in the DDB.

The well-conceived organization of the data, along with the wealth of indexed information, makes the Descriptive Catalogue a perfect text for digitization and Web publication. It was, from a 21st century perspective, a textual database, waiting to be put into its most useful format. Prof. Lancaster was no doubt the first person to realize that, and was already seeking to get it into digital form in the early EBTI days of the 90's, and thus he and I discussed the possibilities on a number of occasions. From 1996 to 2001, I was lucky enough to be able to get a string of JSPS grants,2 which enabled me to digitize a wide range of East Asian Buddhism-related indexes and reference materials. In addition to the digitization of his own Descriptive Catalogue (for which he now held the copyright), Lew felt strongly that the public-domain Soothill dictionary should also be digitized. Knowing well of my prior digitization work, he encouraged me to go for another grant and digitize both. Along with the obvious fact that the Internet publication of the Descriptive Catalogue and the Soothill dictionary as stand-alone documents would be extremely useful for researchers in our field, it was the above-outlined intent to use their data to develop the DDB that brought me to the point where I decided to digitize them. Thus, I applied for and obtained one more grant from the JSPS to digitize, mark up, and publish the Descriptive Catalogue, along with the Soothill-Hodous dictionary.

4. The Digitization Process

Scanning, OCR, Editing

Since a text-digitized version of the Catalogue was not available, we had to start from scratch by scanning the pages, which we did using the Japanese OCR program OK Reader.3 Applied to mixed roman and CJK scripts, this program produced text with a fair recognition rate for the Chinese characters (approx. 90% or better) and a so-so recognition rate (80-90%) for the Latin characters, the difficulties with the latter being compounded by the large number of words containing diacritical marks. Pages 1-89 were scanned, OCRed, and first-time edited by Megumi Katahira (an undergraduate student assistant at Tōyō Gakuen University). Pages 90-499 were scanned, OCRed, and first-time edited by Maki Miyaji (my wife). Maki also did the full scanning, OCRing, and editing of the front matter and bibliography. All in all, it was an exhausting job.

After I ran a macro to add in computer-generated Hangul readings for the Chinese titles, the second-stage proofing on pages 1-200 was done by Tom Dreitlein (a Shingon monk based at the time in Koyasan), while pages 301-500 were handled by Dane Alston and his wife Heajin Park-Alston (graduate students living and studying in Tokyo). At this stage the entire text was rechecked for errors, and all diacritics were added for Sanskrit, Tibetan, Wade-Giles, and McCune-Reischauer romanization. Again, a painstaking job. Dane and Heajin, fluent in Korean, also proofed the macro-generated Hangul for their pages, while I did this for pages 1-200.
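
The macro itself is not reproduced here. As a rough sketch of the general idea only, assuming a simple character-to-reading table, one might generate first-pass readings as follows. The table contents and the function name `hangul_reading` are hypothetical and are not the macro actually used in the project.

```python
from typing import Dict, List, Tuple

# Hypothetical reading table, for illustration only. A usable table would
# need to cover several thousand characters.
HANJA_TO_HANGUL: Dict[str, str] = {
    "大": "대", "方": "방", "廣": "광", "佛": "불",
    "華": "화", "嚴": "엄", "經": "경",
}


def hangul_reading(title: str,
                   table: Dict[str, str] = HANJA_TO_HANGUL) -> Tuple[str, List[str]]:
    """Return (reading, unknown_chars) for a title in Chinese characters.

    Characters missing from the table are rendered as "?" and collected,
    so that a human proofreader can find and fix them.
    """
    syllables: List[str] = []
    unknown: List[str] = []
    for ch in title:
        if ch in table:
            syllables.append(table[ch])
        else:
            syllables.append("?")
            unknown.append(ch)
    return "".join(syllables), unknown


if __name__ == "__main__":
    print(hangul_reading("大方廣佛華嚴經"))
    print(hangul_reading("般若經"))
```

A character-by-character lookup of this kind cannot choose between alternative readings of the same character (若, for example, is read differently in 般若 than in most other compounds), which is one reason the generated Hangul still had to be proofed by fluent readers.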


Until a reference work of this sort has had thorough content markup applied to it, it probably realizes less than ten percent of its potential value. It is the markup that allows for meaningful searches, generation of indexes, precise and uniform web presentation via style sheets, and efficient usage of the data in interaction with other resources. There is, of course, nothing that matches XML in terms of usefulness for this purpose, and so from this point, I gradually worked through the text, first applying general structure through macros and search/replaces, then proceeding with a lot of painstaking manual input. As with our other recent projects, we used the guidelines established by the Text Encoding Initiative (TEI) to establish the basic tag set, adding just three additional tags to the teixlite DTD to achieve satisfactory results. Of course, the process of XML markup provides another opportunity for the correction of textual errors, both through the visual process of looking at the texts and through the validation errors that are turned up in the process of finishing off an XML document. Then, with a few minor tweaks to some XSLT sheets that were developed for prior purposes, we were able to generate these web pages in fairly short order. About 80% of the XML tagging and validation work was done with TEI-Emacs, with the rest being handled by the <oXygen/> XML editor, both running on a Debian Linux desktop. After full markup, an entry in the Catalogue looked something like this at the XML level, which in turn transforms into HTML like this.4
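
To make the point about index generation concrete, here is a minimal sketch in Python. It assumes a deliberately simplified, invented structure: the element names `catalogue`, `entry`, `title`, and `translator`, and all of the sample values, are placeholders and do not reproduce the Catalogue's actual TEI-based markup or data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample data with placeholder names; not actual Catalogue entries.
SAMPLE = """\
<catalogue>
  <entry id="ex1">
    <title>Example Sutra A</title>
    <translator>Translator X</translator>
  </entry>
  <entry id="ex2">
    <title>Example Sutra B</title>
    <translator>Translator Y</translator>
  </entry>
  <entry id="ex3">
    <title>Example Sutra C</title>
    <translator>Translator X</translator>
  </entry>
</catalogue>
"""


def translator_index(xml_text):
    """Map each translator name to a list of (entry id, title) pairs."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for entry in root.iter("entry"):
        title = entry.findtext("title", default="").strip()
        for tr in entry.findall("translator"):
            name = (tr.text or "").strip()
            if name:  # skip empty <translator/> elements
                index[name].append((entry.get("id"), title))
    return dict(index)


if __name__ == "__main__":
    for name, works in sorted(translator_index(SAMPLE).items()):
        print(name, works)
```

With plain text, the same index would have to be pieced together by pattern matching on punctuation and layout. Once each translator is tagged as such, it falls out of a few lines of code.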

5. The Present and Future Status of the Project

Among the many characteristics of digital documents that make them different from paper, one thing always stands out: they are never finished. Or, more precisely, they don't have to be finished if one does not want them to be. As Prof. Lancaster is quick to acknowledge, there were many typos to be seen in the Descriptive Catalogue the day it came into print. We have caught and corrected many of these already, but their number may not yet surpass that of the new errors that may have been introduced during the digitization process. For sure, then, during the upcoming months, we will be correcting these as we find them ourselves and receive reports from users, and therefore will want to republish the document from time to time.

Beyond the matter of simple typos, however, will also be the question of amendments and annotations to be supplied based on research that has occurred in the twenty-five years since the Descriptive Catalogue was first published. These will no doubt be significant, but whether or not they will be applied to our source text and whether it will again be periodically republished remains to be seen. Someone also might be inclined to add additional fields to the various readings data, such as a Wylie field for Tibetan, or a Pinyin field for Chinese, or a field for the new Korean romanization system being promulgated by South Korea's Ministry of Culture.
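
As a hedged illustration of how such a field might be added once the data is in XML, the sketch below appends a Pinyin reading to an entry. The `entry`, `title`, and `reading` elements, the `type` attribute, and the lookup table are all hypothetical and are not taken from the Catalogue's actual markup. The sample readings are illustrative only.

```python
import xml.etree.ElementTree as ET

# Simplified, invented entry structure for illustration only.
SAMPLE = """\
<entry id="ex1">
  <title>大方廣佛華嚴經</title>
  <reading type="wade-giles">Ta fang-kuang fo hua-yen ching</reading>
</entry>
"""

# Hypothetical lookup table from title to Pinyin reading.
PINYIN = {"大方廣佛華嚴經": "Da fangguang fo huayan jing"}


def add_reading(entry, reading_type, lookup):
    """Append <reading type=reading_type> to entry; return True if added.

    Nothing is added when the title is not in the lookup table or when
    a reading of that type is already present.
    """
    title = entry.findtext("title", default="").strip()
    if title not in lookup:
        return False
    for existing in entry.findall("reading"):
        if existing.get("type") == reading_type:
            return False
    new = ET.SubElement(entry, "reading", {"type": reading_type})
    new.text = lookup[title]
    return True


if __name__ == "__main__":
    entry = ET.fromstring(SAMPLE)
    add_reading(entry, "pinyin", PINYIN)
    print(ET.tostring(entry, encoding="unicode"))
```

Because the function does nothing when a reading of the requested type already exists, it could be rerun safely each time the document is republished.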

6. Acknowledgments

In addition to the above-named assistants who worked diligently in bringing this project to completion, I would like to thank a number of other people without whose help its realization would have been impossible. First, I would like to mention my relationship with the original compilers of the Descriptive Catalogue. Among the causes and conditions that led to this project, Sung Bae Park was instrumental. He was Lancaster's graduate student at Berkeley during the 70's and foremost collaborator on the Descriptive Catalogue. He would end up going on to become professor of Buddhist Studies and Korean Studies at SUNY Stony Brook, where I would later on have the good fortune to encounter him and enjoy the privilege of having him as my own mentor, which makes me Prof. Lancaster's "grand-student," as it were. When I first met Lew in 1995 at an AAR meeting, we talked about digital resources immediately, and have been close friends ever since. I would like to thank Lew for bringing me into the fold of EBTI, and for supporting and advising my projects since that time.

I would also like to give special thanks to my longtime XML-TEI mentors, Michael Beddow and Christian Wittern. Fortunately, I did not need to bother either one of them extensively to help me through this particular project (although I did have to ask Michael for a bit of help with the XSLT at the very end), but their guidance through my earlier attempts at XML/XSL and computing in general had much to do with my ability to carry out this present project. In general, the core TEI members who generously offer their steady help on the TEI list, including Syd Bauman, Lou Burnard, Sebastian Rahtz, and Wendell Piez, have also been extremely helpful. I would also like to acknowledge the extensive help I received from Tetsuya Katsumura ("TK") in securing numerous JSPS grants, and in general, in learning what a foreigner should do to succeed in the Japanese academic world.

Most important of all, however, has been the untiring support of my wife Maki, who has not only contributed greatly to the physical work involved in the accomplishment of this and prior projects, but who has also been patient and supportive for many years during my long hours in front of the computer screen. I could have done none of this without her help.


1. The Japanese Taishō is now available to Buddhist scholars in versions edited by both the SAT Project and CBETA.

2. My ability to continue to secure successive grants during this period owes much to the support and discipline I received from Professor Katsumura Tetsuya of the Jinbunken, who expended much of his valuable time with me in the writing and rewriting (and rewriting, rewriting, rewriting) of these grant applications so that they would accord with the demands of the JSPS referees. Since Prof. Katsumura has now passed away, I will never have the chance to thank him adequately for the incredible assistance he gave to me at a critical juncture in my professional career.

3. Unfortunately, this fine OCR product seems to have reached EOL status, as neither upgrades nor significant new information have appeared since 2000.

4. For tutorials and sample packages for experimenting with TEI, please see the TEI web site, and especially the very helpful "A Very Gentle Introduction to the TEI," as well as my packages of tools and samples for writing articles and bibliographies in XML.

Copyright © Charles Muller, 2009