Archive for Parsing

Mandarin tongue twister

Trending on Weibo, a Chinese microblogging website:

[So as not to give anything away, all syllables are separated and not divided into words.]

Nǐ de huò lā lā lā bù lā lā bù lā duō? Huò lā lā lā bù lā lā bù lā duō yào kàn nǐ de huò lā dé duō bù duō. Rú guǒ lā dé bù duō jiù lā nǐ de lā bù lā duō, rú guǒ lā dé duō jiù bù lā nǐ de lā bù lā duō.

"你的货拉拉拉不拉拉不拉多?货拉拉拉不拉拉不拉多要看你的货拉得多不多。如果拉得不多就拉你的拉不拉多,如果拉得多就不拉你的拉不拉多。"

Google Translate:

"Your cargo pulls, pulls, pulls, pulls, pulls, pulls, pulls, pulls, pulls, pulls, pulls, pulls, pulls, pulls, pulls more? If you pull too much, it won’t pull you."

Before turning the page, if you know Mandarin, try to parse and translate the above sentences.


Dependency Grammar v. Constituency Grammar

Edward Stabler, "Three Mathematical Foundations for Syntax", Annual Review of Linguistics 2019:

Three different foundational ideas can be identified in recent syntactic theory: structure from substitution classes, structure from dependencies among heads, and structure as the result of optimizing preferences. As formulated in this review, it is easy to see that these three ideas are completely independent. Each has a different mathematical foundation, each suggests a different natural connection to meaning, and each implies something different about how language acquisition could work. Since they are all well supported by the evidence, these three ideas are found in various mixtures in the prominent syntactic traditions. From this perspective, if syntax springs fundamentally from a single basic human ability, it is an ability that exploits a coincidence of a number of very different things.

The mathematical distinction between constituency (or "phrase-structure") grammars and dependency grammars is an old one. Most people in the trade view the two systems as notational variants, differing in convenience for certain kinds of operations and connections to other modes of analysis, but basically expressing the same things. That's essentially true, as I'll illustrate below in a simple example. But Stabler is also right to observe that the two formalisms focus attention on two different insights about linguistic structure. (I'll leave the third category, "optimizing preferences", for another occasion…)
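One way to make the "notational variants" claim concrete is a small sketch that reads dependency arcs off a constituency tree. This is not the example from the post itself. It assumes that every phrase marks one child as its head (the `head` index below is a hypothetical annotation), and the `Node` class, the function names, and the sample sentence are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                       # phrase category (e.g. "NP") or POS tag on a leaf
    word: Optional[str] = None       # set only on leaves
    children: List["Node"] = field(default_factory=list)
    head: int = 0                    # index of the head child (assumed annotation)


def lexical_head(node: Node) -> str:
    """Follow head children down to the word that heads this constituent."""
    while node.word is None:
        node = node.children[node.head]
    return node.word


def dependencies(node: Node) -> List[Tuple[str, str]]:
    """Return (head word, dependent word) pairs implied by the tree."""
    if node.word is not None:
        return []
    head_word = lexical_head(node)
    edges = []
    for i, child in enumerate(node.children):
        if i != node.head:
            # Each non-head child hangs off the head word of its parent phrase.
            edges.append((head_word, lexical_head(child)))
        edges.extend(dependencies(child))
    return edges


# "the dog chased a cat" as [S [NP the dog] [VP chased [NP a cat]]]
tree = Node("S", head=1, children=[
    Node("NP", head=1, children=[Node("D", "the"), Node("N", "dog")]),
    Node("VP", head=0, children=[
        Node("V", "chased"),
        Node("NP", head=1, children=[Node("D", "a"), Node("N", "cat")]),
    ]),
])

for head_word, dependent in dependencies(tree):
    print(f"{head_word} -> {dependent}")
```

The sketch identifies words by their spelling, so a sentence with a repeated word would need positions instead. Going in the other direction, from dependencies back to phrases, means grouping each head with everything it dominates, which is where the two notations start to emphasize different things.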

This distinction has come up in two different ways for me recently. First, ling001 has gotten to the (just two) lectures on syntax, and because of the recent popularity of dependency grammar, I need to explain the difference to students with diverse backgrounds and interests, some of whom find any discussion of syntactic structure opaque. And second, someone recently asked me whether anyone had used dependency grammar in analyzing music. (The answer seems to be "mostly not", though see this paper, but the relevant question really is what the advantages of dependency models in this application might be.)


Are you in the book today?

[This is a guest post by Nathan Hopson, who sent along the two screen shots with which it begins.]

Another splendid example of why punctuation matters and why machine translation is dumb…


Vietnamese without diacritics

From Reddit:


Words without vowels

Our recent discussions about syllabicity ("Readings" below) made me wonder whether it's possible to have syllables, words, and whole sentences without vowels.  That led me to this example from Nuxalk on Omniglot:

Sample

clhp'xwlhtlhplhhskwts' / xłp̓χʷłtłpłłskʷc̓

IPA transcription

xɬpʼχʷɬtʰɬpʰɬːskʷʰt͡sʼ

Translation

Then he had had in his possession a bunchberry plant.

This is an example of a word with no vowels, something that is quite common in Nuxalk.

Source: Nater, Hank F. (1984). The Bella Coola Language. Mercury Series; Canadian Ethnology Service (No. 92). Ottawa: National Museums of Canada.


Automatic Pinyin annotation — state of the art

[This is a guest post by Gábor Ugray]

Back in 2018 your post Pinyin for phonetic annotation planted an idea in my head that I’ve been gradually expanding ever since. I am now at a stage where I routinely create annotated Chinese text for myself; this (pdf) is what one such document looks like.


HouseHold GarBage

Dick Margulis saw this in a waiting room at the University of Hong Kong Shenzhen Hospital:


Literary Sinitic / Classical Chinese dependency parsing

We are keenly aware that, while advances in machine translation of Vernacular Sinitic (VS) (Mandarin) are quite impressive and fundamentally serviceable, they cannot be applied directly to the translation of Literary Sinitic / Classical Chinese (LS/CC). That would be like using an Italian translation program for Latin, a Hindi translation program for Sanskrit, or a Modern Greek translation program for Classical Greek. It would probably be even less useful than in these parallel cases, because the whole structure and nature of LS/CC and VS are different from each other.

However, there is now an LS/CC parsing program that takes us a major step toward a functional system for the machine translation of the literary / classical written language (it is only a written / book language, not a spoken language). It was developed by YASUOKA Koichi 安岡 孝一 of Kyoto University's Institute for Research in Humanities (Jinbun kagaku kenkyūjo 人文科学研究所) and is available here.


The challenging importance of spacing in Korean

Fascinating article from BLARB (Blog // Los Angeles Review of Books):

"Our Language Battle: Korea’s Surprisingly Addictive Game Show of Vocabulary, Expressions, and Proper Spacing", by Colin Marshall (9/1/19)

This is the second paragraph of the article:

Having found myself living in the genuinely foreign country of Korea, I’ve lately also found myself watching Our Language Battle (우리말 겨루기), a game show that has aired every Monday evening on KBS since 2003. Though it occasionally invites celebrities, and this past July even brought on members of the National Assembly, it usually pits four everyday Koreans (or four teams of two, usually family) against each other in a test of their knowledge of the Korean language. It begins simply enough, with the contestants buzzing in to guess the words or phrases that fill in a crossword-style board, but soon the challenges get dramatically harder: separating folk spellings and regional variations from the officially standard, filling in words missing from old television and newspaper clips, and — most difficult of all, even for contestants who otherwise dominate the game — properly re-spacing a text whose words all run together.


The importance of proper parsing and punctuation

Currently circulating on Facebook and on Chinese social media are seemingly impenetrable sentences with the same character repeated numerous times.  When you first look at them, your eyes glaze over and you can't make any sense of them.  But if you slow down and think about such sentences, you usually can figure them out without too much effort.  In fact,  I could read some of the following right off upon first encounter.  Others required more effort before I was able to crack them.

Although it looks formidable, of the six sample sentences treated in this post, this one was easiest for me.  I could understand it at one go.  [N.B.:  In my treatment of these sentences, I first give the Pinyin with spaces between each syllable, then repeat the Pinyin with requisite parsing and punctuation.]

1.

míng míng míng míng míng bái bái bái xǐ huān tā dàn tā jiù shì bù shuō

明明明明明白白白喜欢他但他就是不说

Míngmíng míngmíng míngbái Báibái xǐhuān tā, dàn tā jiùshì bù shuō.

"Mingming clearly knew that Baibai liked him, but he just wouldn't say it."

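The step from spaced syllables to parsed words in sentence 1 can be imitated with a toy forward maximum-matching segmenter. This is only a sketch: the `max_match` function and the five-entry `LEXICON` are hypothetical and were chosen to fit this one sentence, and greedy matching is an assumption here, not the method used in the post.

```python
def max_match(text, lexicon, max_len=2):
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical mini-lexicon covering only the two-character words in sentence 1.
LEXICON = {"明明", "明白", "白白", "喜欢", "就是"}

sentence = "明明明明明白白白喜欢他但他就是不说"
print(" ".join(max_match(sentence, LEXICON)))
# prints: 明明 明明 明白 白白 喜欢 他 但 他 就是 不 说
```

The segmenter recovers the word boundaries, but it cannot tell that the first 明明 is the name Míngmíng and the second is the adverb míngmíng ("clearly"), or that 白白 is a name. That part of the parse still depends on the reader.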

"and himself jail"

In "More Cohen Businesses Coming to Light," on Talking Points Memo, Josh Marshall writes:

The biggest taxi operator in New York, Evgeny “Gene” Friedman, now manages Cohen’s 30+ NYC medallions or at least did the last time we spoke to him. Friedman has been struggling for the last year to keep his taxi businesses out of bankruptcy and himself jail.

The final three words of that last sentence present a weird, and dare I say unusual, case of double ellipsis. The semantic content communicated by those three words (in the context of the sentence) is richer than you'd think could be expressed by only three words, especially given that one of them is merely the conjunction and. That content can be represented as follows, with the bracketed text standing for the content that the reader must infer:

Friedman has been struggling for the last year to keep his taxi businesses out of bankruptcy and [to keep] himself [out of] jail.

There's nothing unusual about the first omission; I don't see anything wrong with the clause to keep his taxi businesses out of bankruptcy and himself out of jail. But the omission of out of strikes me as very strange, and what's even stranger is that to my ear, the clause is worse if to keep is put back:

* Friedman has been struggling for the last year to keep his taxi businesses out of bankruptcy and to keep himself jail.


Pinyin in 1961 propaganda poster art

From Geoff Dawson:

On display in a current exhibition at the National Library of Australia.


A polysyllabic character that can be read in two different ways

Photo taken in Hangzhou by Nikita Kuzmin's Chinese teacher:
