Language Log

Vignettes of quality data impoverishment in the world of PRC AI

February 23, 2023 @ 8:00 am · Filed by Victor Mair under Artificial intelligence, Computational linguistics, Data bases

Some snippets:

Limited data sets a hurdle as China plays catch-up to ChatGPT

Lack of high-quality Chinese texts on Internet a barrier to training AI models.

Ryan McMorrow, Nian Liu, Eleanor Olcott, and Madhumita Murgia, FT, Ars Technica (2/21/23)

…

Baidu struggled with its previous attempt at a chatbot, known as Plato, which analysts said could not even answer a simple question such as: “When is Alibaba co-founder Jack Ma’s birthday?”

Analysts point to the lack of high-quality Chinese-language text on the Internet and in other data sets as a barrier for training AI software.

GPT, the program underlying ChatGPT, sucked in hundreds of thousands of English academic papers, news articles, books, and social media posts to learn the patterns that form language. Meanwhile, Baidu’s Ernie has been trained primarily on Chinese-language data as well as English-language data from Wikipedia and Reddit.

…

Spaceless pinyin

February 22, 2023 @ 8:47 pm · Filed by Victor Mair under Alphabets, Punctuation, Writing

From the importer's label, carefully placed to obscure the safety instructions (the "do"s and "do not"s) of an electronic gas igniter:

Multi-modal writing among Hong Kong teens

February 22, 2023 @ 8:44 pm · Filed by Victor Mair under Writing, Writing systems

From Jenny Chu:

Knowing your interest in multi-modal writing systems, I thought you might be amused by the attached screencap. It is from a WhatsApp group chat of S6 (final year) students in Hong Kong; one of them is asking the others what they would like to do on the afternoon of their last day of classes:

A crook that protects your belongings

February 22, 2023 @ 8:30 pm · Filed by Victor Mair under Lost in translation

Do not major in the changing room

February 22, 2023 @ 7:56 pm · Filed by Victor Mair under Lost in translation

Uh-oh! DeepL in the classroom; it's already here

February 22, 2023 @ 1:22 pm · Filed by Victor Mair under Artificial intelligence, Computational linguistics

Yesterday in my Classical Chinese class, we were reading Ouyang Xiu's (1007-1072) "Discussion on 'Biographies of Eunuchs'" in the New History of the Five Dynasties (written 1036-1039, published 1072). Here's the relevant passage:

Móu zhī ér bùkě wéi. Wéi zhī ér bùkě chéng. Zhì qí shén zé jù shāng ér liǎng bài. ——“Xīn wǔdài shǐ huàn zhě chuán lùn”

謀之而不可為。為之而不可成。至其甚則俱傷而兩敗。 ——《新五代史宦者傳論》

[Because of the special circumstances of this post, I will not adhere to my usual custom of providing Pinyin Romanization, Hanzi transcription, and English translation all three together.]

Baozi: The stuffed, steamed bun becomes a meme

February 21, 2023 @ 6:15 pm · Filed by Victor Mair under Emojis and emoticons, Memes

So everybody knows what we're talking about:

Baozi (Chinese: 包子), Pao-tsih or bao, is a type of yeast-leavened filled bun in various Chinese cuisines. There are many variations in fillings (meat or vegetarian) and preparations, though the buns are most often steamed. They are a variation of mantou from Northern China.

(source)

Early on in his presidency, Xi Jinping picked this up as one of his nicknames, like Winnie the Pooh, both from his puffy shape. Both fall under the category of "rǔ bāo 辱包" ("disgracing the dumpling").

ChatGPT: Theme and Variations

February 21, 2023 @ 5:29 am · Filed by Victor Mair under Artificial intelligence, Computational linguistics

[This is a guest post by Conal Boyce]

Here I’ll recount some recent exchanges I had with ChatGPT. Given the scope of ChatGPT, and the fact that it’s in a self‑described intermediate state, our various impressions of it as of February 2023 must be like those of the three blind men examining an elephant — except the elephant is running. In the heart of the professional programmer, ChatGPT creates existential dread since it can spit out in a few seconds a page of code which would have required hours or days for him/her to write and debug — and that only after a lifetime of coding. For the rest of us, for the moment at least, it just provokes curiosity perhaps.

How to use "Six Skins" in a slogan to solicit business in the PRC

February 20, 2023 @ 10:11 am · Filed by Victor Mair under Language and business, Slogans

From the Twitter account of the famous popular science writer and muckraker, Fang Zhouzi / Fang Shimin:

先把外资都赶跑、吓跑了，再死皮赖脸地请回来？ pic.twitter.com/FepFOsJnpY

— 方舟子 (@fangshimin) February 20, 2023

Closestools, crappers, and horse buckets

February 19, 2023 @ 11:11 pm · Filed by Victor Mair under Lexicon and lexicography

Big news from China yesterday:

"2,200-year-old flush toilet — oldest ever found — unearthed at palace ruins in China"

Aspen Pflughoeft, Miami Herald / Yahoo
Thu, February 16, 2023 at 5:37 PM EST

What a gift to humanity!

All the terms in the title of this post mean one or another kind of toilet, but function differently and date from different times and places. We've talked about many types of toilets on Language Log before (for a few see "Selected readings" below). Here I want to focus on two Chinese models, one dating to two millennia ago that was recently discovered archeologically, so we don't have a proper name for it yet, and an archaic-sounding one, mǎtǒng 馬桶 ("horse bucket"), that is the current, conventional, common term for the toilet across China.

Ivan Enraged

February 19, 2023 @ 11:24 am · Filed by Victor Mair under Etymology, Names, Titles

A Russian friend of mine told me that "Terrible" is a common, well nigh universal, mistranslation for the nickname of Ivan IV Vasilyevich (Russian: Иван Васильевич; 25 August 1530 – 28 March [O.S. 18 March] 1584). He says that a closer translation would be "Enraged".

The English word terrible is usually used to translate the Russian word Грозный (grozny) in Ivan's nickname, but this is a somewhat archaic translation. The Russian word Грозный reflects the older English usage of terrible as in "inspiring fear or terror; dangerous; powerful" (i.e., similar to modern English terrifying). It does not convey the more modern connotations of English terrible such as "defective" or "evil". Vladimir Dal defines grozny specifically in archaic usage and as an epithet for tsars: "courageous, magnificent, magisterial and keeping enemies in fear, but people in obedience". Other translations have also been suggested by modern scholars, including formidable.

(source)

Chutzpah in Mandarin

February 18, 2023 @ 8:50 am · Filed by Victor Mair under Translation

Klaus Nuber stumbled upon this opinion piece in the Austrian newspaper Der Standard:

"Shoot 'em down – Ooops, einige Ballons waren doch keine chinesischen Spionageballons"

10 hours ago

Klaus says "It's about the downed balloons over Alaska. At the end the author asks a question":

"Gibt es einen Ausdruck in Mandarin für 'Chuzpe'?"

Is there an expression in Mandarin for chutzpah?

Vocalizations of wolves and justices

February 17, 2023 @ 3:49 pm · Filed by Mark Liberman under Computational linguistics

Tessa Koumoundouros, "Adorable Study Tests How Dogs Respond to Wild Wolf Calls – And, Yes, There's Footage", ScienceAlert 2/12/2023:

Without convenient access to phones or pens for letter-writing, wolves must rely on howls to communicate long distances. These woeful wails allow the social mammals to maintain their territories as well as keep track of and stay in synchrony with other pack members. […]

A new study exposes family dogs to wolf howls to better understand why some of our canine companions no longer seem to bother with this seemingly important form of dog communication.

