Archive for Information technology

Literary Sinitic / Classical Chinese dependency parsing

We are keenly aware that, while advances in machine translation of Vernacular Sinitic (VS) (Mandarin) are quite impressive and fundamentally serviceable, they cannot be applied directly to the translation of Literary Sinitic / Classical Chinese (LS/CC). That would be like using an Italian translation program for Latin, a Hindi translation program for Sanskrit, or a Modern Greek translation program for Classical Greek. It would probably be even less useful than in these parallel cases, because the whole structure and nature of LS/CC and VS are different from each other.

However, there is now available an LS/CC parsing program that takes us a major step toward a functional system for the machine translation of the literary / classical written language (it is only a written / book language, not a spoken language). It was developed by YASUOKA Koichi 安岡 孝一 of Kyoto University's Institute for Research in Humanities (Jinbun kagaku kenkyūjo 人文科学研究所) and is available here.


Automated transcription-cum-translation

Marc Sarrel received the following message on his voicemail:


A virus that fixes your grammar

In today's Dilbert strip, Dilbert is puzzled about why the company mission statement looks so different, and Alice diagnoses what has happened: the Elbonian virus that has been corrupting the company's computer systems has fixed all the grammar and punctuation errors the statement formerly contained.

That'll be the day. Right now, computational linguists with an unlimited budget (and unlimited help from Elbonian programmers) would be unable to develop a trustworthy program that could proactively fix grammar and punctuation errors in written English prose. We simply don't know enough. The "grammar checking" programs built into word processors like Microsoft Word are dire, even risible, catching only a limited list of shibboleths and being wrong about many of them. Flagging split infinitives, passives, and random colloquialisms as if they were all errors is not much help to you, especially when many sequences are flagged falsely. Following all of Word's suggestions for changes would create gibberish. Free-standing tools like Grammarly are similarly hopeless. They merely read and note possible "errors", leaving you to make corrections. They couldn't possibly be modified into programs that would proactively correct your prose. Take the editing error in this passage, which Rodney Huddleston recently noticed in a quality newspaper, The Australian:

There has been no glimmer of light from the Palestinian Authority since the Oslo Accords were signed, just the usual intransigence that even the wider Arab world may be tiring of. Yet the West, the EU, nor the UN, have never made the PA pay a price for its intransigence.


Is there a practical limit to how much can fit in Unicode?

A lengthy, important article by Michael Erard recently appeared in the New York Times Magazine:

"How the Appetite for Emojis Complicates the Effort to Standardize the World's Alphabets:  Do the volunteers behind Unicode, whose mission is to bring all human languages into the digital sphere, have enough bandwidth to deal with emojis too?" (10/18/17)

The article brought back many vivid memories.  It reminded me of my old friend, Joe Becker, who was the seminal designer of the phenomenal Xerox Star's multilingual capabilities in the mid-80s and instrumental in the organization and foundation of the Unicode Consortium in the late 80s and early 90s.  Indeed, it was Becker who coined the word "Unicode" to designate the project.


Easy versus exact

Ever since people started inputting Chinese characters in computers, I've had an intense interest in how they do it, which systems are more efficient, and why they choose the particular ones they adopt.  For the first few decades, because all inputting systems presented significant obstacles and challenges, I remained pretty much of an onlooker because I didn't want to waste my time struggling with cumbersome methods.  It's only after I discovered how simple and fast it is to use Google Translate as my chief inputting method that I became very active in entering Chinese character texts.


Awesome / sugoi すごい!


Information content of text in English and Chinese

Terms and concepts related to "letters" and "characters" were used at spectacularly crossed purposes in many of the comments on Victor Mair's recent post "Twitter length restrictions in English, Chinese, Japanese, and Korean". I'm not going to intervene in the tangled substance of that discussion, except to reference some long-ago LLOG posts on the relative information content of different languages/writing systems. The point of those posts was to abstract away from the varied, complex, and (here) irrelevant details of character sets, orthographic conventions, and digital encoding systems, and to look instead at the size ratios of parallel (translated) texts in compressed form. The idea is that compression schemes try precisely to get rid of those irrelevant details, leaving a better estimate of the actual information content.

My conclusions from those exercises are two:

  1. The differences among languages in information-theoretic efficiency appear to be quite small.
  2. The direction of the differences is unclear — it depends on the texts chosen, the direction of translation, and the method of compression used.

See "One world, how many bytes?", 8/5/2005; "Comparing communication efficiency across languages", 4/4/2008; "Mailbag: comparative communication efficiency", 4/5/2008; "Is English more efficient than Chinese after all?", 4/28/2008.
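
For readers who want to try this kind of comparison themselves, here is a minimal sketch in Python. It is not the procedure used in the posts cited above; the function names (`compressed_sizes`, `size_ratios`), the choice of UTF-8 as the raw encoding, and the three standard-library compressors are all assumptions made for illustration. The sample pair is the first sentence of Article 1 of the Universal Declaration of Human Rights in English and Chinese, used here only as a toy input.

```python
import bz2
import lzma
import zlib


def compressed_sizes(text: str, encoding: str = "utf-8") -> dict[str, int]:
    """Return the byte length of `text`, raw and under three compressors."""
    raw = text.encode(encoding)
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw, 9)),
        "bz2": len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
    }


def size_ratios(text_a: str, text_b: str, encoding: str = "utf-8") -> dict[str, float]:
    """Return size(text_a) / size(text_b) for the raw and compressed forms."""
    a = compressed_sizes(text_a, encoding)
    b = compressed_sizes(text_b, encoding)
    return {name: a[name] / b[name] for name in a}


# Toy parallel pair (illustrative only; far too short for a real estimate).
english = "All human beings are born free and equal in dignity and rights."
chinese = "人人生而自由，在尊严和权利上一律平等。"

for method, ratio in size_ratios(english, chinese).items():
    print(f"{method:>5}: English/Chinese size ratio = {ratio:.2f}")
```

On a pair this short, the numbers are dominated by each compressor's fixed overhead, so they say nothing about the languages. A meaningful comparison needs long parallel texts, and, as noted above, the result can still shift with the texts chosen, the direction of translation, and the compression method.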



Veggies for cats and dogs

This video was passed on by Tim Leonard, who remarks, "real-time video translation at its best":


More Sinological suffering

[This is a guest post by Brendan O'Kane. See "Sinological suffering", 3/31/17, for background.]


I snapped this picture at the library today:


Siri and flatulence

An acquaintance of mine has a new iPhone, which he carries in a pocket that is (relevantly) below waist level. He has discovered something that dramatically illustrates the difference between (i) responding to speech and (ii) responding to speech as humans do, on the basis of knowing that it is speech.


The miracle of reading and writing Chinese characters

We have the testimony of a colleague whose ability to write Chinese characters has been adversely affected by her not being able to visualize them in her mind's eye.  See:

"Aphantasia — absence of the mind's eye" (3/24/17)

This prompts me to ponder:  just how do people who are literate in Chinese characters recall them?


Pick a word, any word

To access an article in the Financial Times yesterday I found myself confronted with a short market-research survey about laptops, tablets, and smartphones. Answer three or four layers of click-the-box questions, and I could get free access to the article I wanted to look at. A reasonable bargain: clearly some company was prepared to pay the FT for access to its online readers' opinions. And at the fourth layer down I faced a question which asked me to choose a single word that comes into my mind when I think of a certain Microsoft product.

My choice, from all the tens of thousands of words at my disposal, and the word I picked would go straight into the market research department of the one corporation, above all others, for whose products I have the greatest degree of contempt. Just choose that one evocative word and type it in, and I would be through to my article. A free choice. Which word to pick?


Kazakhstan HQ for the Buffett Foundation

I received an exciting email this afternoon from Perry Alexis, the chief accountant for the Warren Buffett Foundation. It seems I have been picked to receive a $1,500,000 donation — not a grant for research or anything, but a donation. And I notice it came from an email address in Kazakhstan.
