Digitizing specialized language dictionaries


[The following is a guest post by David Dettmann.  The "Schwarz Uyghur dictionary" to which he refers in the third paragraph is this:  Henry G. Schwarz, An Uyghur-English dictionary (Bellingham, Washington:  Center for East Asian Studies, Western Washington University, 1992).]


It is a bit of a nerdy obsession of mine to customize my computers to comfortably use languages that I've studied.

About 10 years ago, I became relatively proficient with optical character recognition (OCR) software and scanner hardware. Any time I found an essential dictionary for a language I was studying, I converted it to a Unicode OCR scan in PDF format (i.e., I converted images of the pages to text). I later used that data to create dictionary content files that work with the Mac OS dictionary application. I did this with several dictionaries that I relied on while I studied Kazakh, Uzbek, and Uyghur.

This process was particularly useful for the Schwarz Uyghur dictionary. I could not get used to the alphabetical order that he favored (which was different from typical Latin order AND Uyghur Arabic script order). As a result, any lookup would just take forever. That said, the formatting of each page was quite pleasant, and there were some nice illustrations of plants of traditional Uyghur medicine as well as handy keys at the bottom of each page to explain abbreviations.

To maintain that page structure and bring the Schwarz dictionary content to the Mac OS dictionary app, I took a long weekend to edit a complete list of headwords from each paper page, paired simply with the page number as the definition. Then I created indexes by converting Schwarz’s alphabet (i.e., changing the letters he favored such as ş, ç, ä, ñ) to a Latin script that was easily entered on a US keyboard (sh, ch, ae, ng) AS WELL AS a conversion to the Unicode Uyghur Arabic script (ش،چ،ە،ڭ). Finally, in the code for the Mac OS dictionary I changed the page number definition to instead point to an image of each respective page.
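The index conversion described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code actually used: the two tables contain just the four letter pairs quoted in the post (the full alphabet is not given there), and the function names `transliterate` and `index_keys` are hypothetical.

```python
import unicodedata

# Letter pairs quoted in the post. The full alphabet is not given there,
# so both tables are deliberately incomplete.
ASCII_MAP = {"ş": "sh", "ç": "ch", "ä": "ae", "ñ": "ng"}
ARABIC_MAP = {"ş": "ش", "ç": "چ", "ä": "ە", "ñ": "ڭ"}


def transliterate(headword, table):
    """Replace each letter found in `table`; pass other characters through."""
    # NFC makes "s" + combining cedilla compare equal to precomposed "ş".
    text = unicodedata.normalize("NFC", headword).lower()
    return "".join(table.get(ch, ch) for ch in text)


def index_keys(headword):
    """Return the distinct lookup keys for one headword, original first."""
    original = unicodedata.normalize("NFC", headword)
    keys = [original, transliterate(headword, ASCII_MAP)]
    return list(dict.fromkeys(keys))  # drop duplicates, keep order


# "şähär" is only an illustrative string, not an entry copied from the dictionary.
print(index_keys("şähär"))  # ['şähär', 'shaehaer']
```

An Arabic-script key could be produced the same way by passing `ARABIC_MAP` to `transliterate`, but only once that table covers every letter; with the four pairs shown here the result would mix scripts.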

This may be best explained with some screenshots. See below for how the dictionary responds to a lookup of an example word in Arabic script and Latin script:


I also typically use that Schwarz dictionary for reverse lookups. I keep a PDF copy that contains OCR text of just the English definitions. That way I can quickly locate words in Uyghur that I would not be able to find by thumbing through the book. Here is an example search using the word “pepper”:

I find this backwards lookup much more useful than dictionaries meant as "English-Uyghur" (which are perhaps most useful to Uyghur learners of English). Typically, those books don't include words relating to Uyghur culture (for example, "long pepper" above, or pilpil in Uyghur).
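The reverse lookup amounts to a full-text search over the English side of the OCR text. Here is a minimal sketch of that idea, assuming the OCR text is available as one string per page; the page numbers and sample text are made up for illustration (only "pilpil" and "long pepper" come from the post), and `pages_mentioning` is a hypothetical helper name.

```python
import re

# Hypothetical OCR output: page number -> English text recognized on that page.
# The page numbers and wording are made up for illustration.
ocr_pages = {
    101: "pilpil long pepper (a medicinal plant)",
    102: "a kind of flat bread; a baker",
    103: "black pepper, ground; peppermint",
}


def pages_mentioning(term, pages):
    """Return the sorted page numbers whose text contains `term` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(page for page, text in pages.items() if pattern.search(text))


print(pages_mentioning("pepper", ocr_pages))  # [101, 103]
```

Matching on word boundaries keeps "peppermint" from counting as a hit for "pepper", which matters when the search term is a common English word.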

You can see how crazy I got with my dictionaries in the image below of the list in my Mac dictionary app. The Kazakh, Mongolian, Uzbek, Ottoman Turkish and Uyghur dictionaries are ones that I processed to work natively in the app.

Using full-page images for the Schwarz dictionary was something of an exception to my usual method, which was to work with compiled tables of headwords and definitions.
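Both approaches, a page image per headword or a compiled table of headwords and definitions, come down to generating one entry per headword. The sketch below shows one way that could look. It assumes the XML source format used by Apple's Dictionary Development Kit (`d:entry` with `d:index` children), which should be checked against the kit's own documentation; the entry IDs, the image file naming, and all function names are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def make_entry(entry_id, headword, keys, body_html):
    """Build one dictionary entry that can be found under every key in `keys`."""
    index_lines = "\n".join(
        f"  <d:index d:value={quoteattr(key)}/>" for key in keys
    )
    return (
        f"<d:entry id={quoteattr(entry_id)} d:title={quoteattr(headword)}>\n"
        f"{index_lines}\n"
        f"  <h1>{escape(headword)}</h1>\n"
        f"  {body_html}\n"
        f"</d:entry>"
    )


def page_image_body(page_number):
    """Entry body that shows a scanned page (hypothetical file naming)."""
    return f'<img src="Images/page_{page_number:04d}.png" alt="page {page_number}"/>'


def definition_body(definition):
    """Entry body for the usual headword-and-definition table."""
    return f"<p>{escape(definition)}</p>"


# Page-image style entry; the page number 1 is a placeholder, not a real location.
print(make_entry("e0001", "pilpil", ["pilpil"], page_image_body(1)))
# Table style entry, using the gloss mentioned in the post.
print(make_entry("e0002", "pilpil", ["pilpil"], definition_body("long pepper")))
```

Keeping the body separate from the index keys means the same list of lookup keys (original spelling, keyboard-friendly Latin, Arabic script) can point at either kind of entry.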


P.S. by VHM:  The hefty (1,105 pages; 10" X 7.5" X 2.75"), green-covered Schwarz Uyghur-English dictionary is one of my prized possessions.  I treasure its lovingly compiled contents, including the abundance of little drawings of plants, animals, and all sorts of things, plus tiny but still useful maps, indication of Persian, Arabic origin, calligraphic headings, etc.  Like David, I find the arrangement (order of entries) in the Schwarz dictionary to be extremely frustrating, but I usually am willing to spend lots of time leafing through its pages to find what I'm after.  Unfortunately, I am severely computer challenged, so I would never in a million years be able to do what David can accomplish with his scanner, OCR manipulation, and apps in a matter of weekends.  All that I can do is look on in awe, and occasionally humbly ask him to find something for me.

Here's a brief description of how Schwarz made his dictionary:

An Uyghur-English Dictionary by Henry G. Schwarz: As the initial foundation of my dictionary I chose the Uyurçä-xänzuçä luät which, in its manuscript form and after 1982 in published form, had been my steady companion in Xinjiang. If I ever considered doing nothing more than translating it into English, I rejected this option almost from the start. For one thing, I often found the dictionary to be inaccurate in many respects, both large and small. More importantly, a direct translation from Uyghur to English reduces the number and the degree of infelicities that smudge the interface between two languages. My course of action thus lay before me: after an initial translation of the Uyurçä-xänzuçä luät into English, a task completed by the time my Tokyo apartment had turned pleasantly warm in the spring of 1985, I went to work to hunt down every Uyghur word, phrase, sentence, and saying in a large number of Uyghur sources published between 1954 and 1985. If I could not find it, I dropped it. If I could not verify a usage contained in the Chinese dictionary, I dropped it. On the other hand, if I could verify a usage but not in identical language, I kept the language used in the Chinese dictionary. If I discovered a new word, phrase, or saying, I added it to my collection. This stage of my labors took most of my spare time between March 1985 and the summer of 1991.

(Source)

BTW, David Dettmann is not only a consummate, nerdy lexicographer, he is also an outstanding foodie.  For impressive examples of his findings in that field, see "Asian Markets of Philadelphia".  I'm sure that one of the reasons he was so fascinated by the Schwarz dictionary is its inclusion of abundant, detailed materials concerning foodstuffs and foodways.



5 Comments

  1. Philip Taylor said,

    April 30, 2019 @ 11:13 am

    Oddly enough I was attempting to perform OCR on typeset text just today, but in my case the language was Hanyu Pinyin, and whilst the integrated OCR engine of Adobe Acrobat CC was perfectly able to handle hanzi (both simplified and traditional) I was unable to get it to handle tonal Pinyin with any success at all. Can anyone recommend a good OCR tool for tonal Pinyin, ideally one that will run in a Windows 7 64-bit environment ?

  2. George Lane said,

    April 30, 2019 @ 11:37 am

    I just find it comforting to know I'm not the only person in the world with a "nerdy obsession…to customize my computers to comfortably use languages that I've studied" (Currently running Ubuntu 18.04, configured for English, Japanese, Spanish, Korean, and Greek).

  3. Antonio L Banderas said,

    April 30, 2019 @ 12:22 pm

    @Philip Taylor
    That's a long-lasting issue
    https://forum.ocrsdk.com/thread/2420-finereader-pinyin-diacritics/

  4. David Dettmann said,

    April 30, 2019 @ 3:14 pm

    @Philip Taylor: about 10 years ago when I did OCR processing for most of my dictionaries, I found ABBYY's software most useful for custom characters in Latin and Cyrillic scripts. There used to be a function (hopefully this is still possible with more recent editions of the software) to teach the software new characters by finding the anomalies that the OCR wasn't recognizing and associating those with Unicode character counterparts. I know of no such tool for Acrobat's OCR. I was able to use that method in ABBYY to do OCRs for Uzbek and Kazakh texts and dictionaries using an otherwise Russian-oriented OCR.

  5. Philip Taylor said,

    May 5, 2019 @ 4:24 pm

    Thank you Antonio & David for your comments and recommendations.
