Language Log

Another chapter in the history of the Chinese typewriter

August 14, 2021 @ 8:22 pm · Filed by Victor Mair under Information technology, Language and computers, Typography

Brian Merriman ran into this article and device when researching electronic typewriters from the 1980s:


Tortured phrases

August 14, 2021 @ 5:50 am · Filed by Victor Mair under Artificial intelligence, Language and computers, Language and science, Translation

Article by Holly Else in Nature (8/5/21):

"‘Tortured phrases’ give away fabricated research papers

Analysis reveals that strange turns of phrase may indicate foul play in science"

Here are the beginning and a few other selected portions of the article:

In April 2021, a series of strange phrases in journal articles piqued the interest of a group of computer scientists. The researchers could not understand why researchers would use the terms ‘counterfeit consciousness’, ‘profound neural organization’ and ‘colossal information’ in place of the more widely recognized terms ‘artificial intelligence’, ‘deep neural network’ and ‘big data’.

Further investigation revealed that these strange terms — which they dub “tortured phrases” — are probably the result of automated translation or software that attempts to disguise plagiarism. And they seem to be rife in computer-science papers.

Research-integrity sleuths say that Cabanac* and his colleagues have uncovered a new type of fabricated research paper, and that their work, posted in a preprint on arXiv on 12 July¹, might expose only the tip of the iceberg when it comes to the literature affected.

[*VHM: Guillaume Cabanac, a computer scientist at the University of Toulouse, France]


4-digit numbers versus 5-digit numbers

July 27, 2021 @ 12:39 pm · Filed by Victor Mair under Language and computers, Writing systems

Phil H wrote these comments to "Uncommon words of anguish" (7/18/21):

The anguish is very real. My wife had a character in her name that most computers will not reproduce ([石羡]), despite it being relatively common in names in our part of the world, and has been refused bank accounts, credit cards, and a mortgage because of it. In the end she changed her name rather than continue to deal with the hassle. The character is in the standard, but it was too late for us.

…there have always been ways to get the character onto a computer, but any given piece of bank software might not recognise it, and any given bank functionary might be unfamiliar with them. We then had trouble when some organisations used the pinyin XIAN in place of the character, but that then made their documentation inconsistent with her national ID card (which had the right character on it) and so yet further bodies would not accept them… It was the standard "mild computer snafu + large inflexible bureaucracy = major headache" equation.

An anonymous correspondent, a computer scientist, sent in the following remarks:

Phil H is talking about a character which is in a "supplementary plane" in Unicode (and similarly in GB-18030). Unfortunately, an awful lot of software was only ever tested on Basic Multilingual Plane characters.
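To make the correspondent's point concrete, here is a minimal Python sketch of why a supplementary-plane character trips up software tested only on the Basic Multilingual Plane. The helper name `describe` is hypothetical, and U+20000 (the first code point of CJK Extension B) is only a stand-in: the post does not give the code point of the character in Phil H's wife's name.

```python
def describe(ch):
    """Summarise how a single character is stored in common encodings."""
    cp = ord(ch)
    return {
        # BMP code points print as 4 hex digits, supplementary ones as 5 or 6.
        "code_point": f"U+{cp:04X}",
        # Plane 0 is the Basic Multilingual Plane; planes 1-16 are supplementary.
        "plane": cp >> 16,
        # UTF-16 needs a surrogate pair (2 code units) outside the BMP.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
        "utf8_bytes": len(ch.encode("utf-8")),
        "gb18030_bytes": len(ch.encode("gb18030")),
    }


bmp_char = "\u77f3"        # 石, an ordinary BMP character
supp_char = "\U00020000"   # stand-in supplementary-plane ideograph (assumption)

print(describe(bmp_char))
print(describe(supp_char))
```

Code that assumes one 16-bit unit per character, or a two-byte GB encoding, handles the first character correctly and can truncate or reject the second, which is roughly the failure the correspondent describes.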


Uncommon words of anguish

July 18, 2021 @ 5:31 am · Filed by Victor Mair under Diglossia and digraphia, Information technology, Language and computers, Lost in translation, Writing systems

From a manual for a thermal printer:

Dǎyìn kòngzhì bǎn nèizhì GB18030 Zhōngwén zìkù, chèdǐ miǎnchú shēngpì zì de kǔnǎo

打印控制板内置 GB18030 中文字库,彻底免除生僻字的苦恼

Printer control panel built-in GB18030 Chinese character, thoroughly remove the uncommon words of anguish

(courtesy of Amy de Buitléir)

A more accurate English translation would be:

Printer control panel with built-in GB18030 Chinese character font, thoroughly removing the anguish brought about by uncommon / obscure characters

"GB" stands for "guóbiāo 国标" ("national standard"), and is used for many technical terms in the PRC (another instance of encroaching digraphia, for which see here and here [with extensive bibliography]).


Character confusion: three-child policy

June 1, 2021 @ 7:58 am · Filed by Victor Mair under Alphabets, Errors, Language and computers, Miswriting, Writing, Writing systems

生育还是生肓？ ("Is it 生育 [shēngyù, 'childbearing'] or 生肓?") pic.twitter.com/GBPx07QSaT

— Chenyu_Liang (@chenyuliang) May 31, 2021


Annual wave of Anti-English sentiment in the PRC

March 21, 2021 @ 6:14 am · Filed by Victor Mair under Language and computers, Language and politics, Language teaching and learning

Article in official CCP media source:

"Chinese lawmaker proposes removing English as core subject"

By Liu Caiyu, Global Times (3/5/21)

Coming from GT, the hyper-nationalistic tabloid, this attack on English is not unexpected. Similar anti-English proposals come up every year around the time of the Liǎnghuì 兩會 (Two Sessions), the annual plenary meetings of the National People's Congress and the National Committee of the Chinese People's Political Consultative Conference, which have just concluded in Beijing (March 4-11).

Here we go again:

Is English really that important? A Chinese lawmaker at the two sessions has proposed removing English as a core subject for Chinese students receiving compulsory education, triggering heated discussion on Chinese social media.

The proposal was made by Xu Jin, a member of the Central Committee of the Jiusan Society and also a member of the Chinese People's Political Consultative Conference (CPPCC). It has also been proposed by other lawmakers in previous years.


Sinographic inputting: "it's nothing" — not

February 22, 2021 @ 5:34 am · Filed by Victor Mair under Alphabets, Language and computers, Topolects, Writing

Last week in our Dunhuangology seminar, a student wanted to type "wǔ 武" ("martial; military") into the chat box, but instead out popped "nián 年" ("year"). I immediately said to her, "I'll bet you were using a shape-based inputting system", which left her a bit surprised.

Ever since information technologists began to wrestle with the problem of inputting, ordering, and retrieving Chinese characters in computers during the 1970s, I have been intensely interested in the theoretical and practical obstacles they faced. To better understand the overall situation with regard to characters in computers, I organized an international conference at Penn in 1990 on the computerization of Chinese characters, which resulted in Victor H. Mair and Yongquan Liu, eds., Characters and Computers (Amsterdam, Oxford, Washington, Tokyo: IOS, 1991).


Ted Cruz in big trouble

February 20, 2021 @ 2:13 pm · Filed by Victor Mair under Computational linguistics, Language and computers, Parsing

Ben Hull writes:

In our Computational Linguistics class we were discussing different methods of segmenting Chinese character texts. Today I came across a terrific example of the problems of segmenting left to right, in the first sentence of the attached image. I hope you find it as amusing as I did.
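For readers unfamiliar with the problem Ben Hull mentions, the following is a minimal sketch of greedy left-to-right segmentation (often called forward maximum matching). The sentence and the tiny lexicon are a textbook-style illustration chosen here as an assumption; they are not the example from the image in the post, and the function name `forward_max_match` is hypothetical.

```python
def forward_max_match(text, lexicon):
    """Segment text left to right, always taking the longest lexicon match."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (assumption): "research", "graduate student", "life", "origin".
LEXICON = {"研究", "研究生", "生命", "起源"}
SENTENCE = "研究生命起源"  # intended reading: 研究 / 生命 / 起源

print(forward_max_match(SENTENCE, LEXICON))
```

Because the greedy pass grabs 研究生 ("graduate student") first, it strands 命 on its own and never recovers the intended 研究 / 生命 / 起源 ("research the origin of life"). Practical segmenters typically add statistical scoring or backtracking to avoid this.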


I'm milk

February 10, 2021 @ 11:31 am · Filed by Victor Mair under Artificial intelligence, Language and computers, Lost in translation

This has been making the rounds:

1. Go to Google Translate.
2. Set the input language to Spanish.
3. Paste in "soy milk"
4. Set the output language to English or X language.
5. Hilarity ensues.

The obligatory screen shot:


Google Translate Sabotage, part 2

January 17, 2021 @ 8:10 pm · Filed by Victor Mair under Language and computers, Lost in translation, Translation

This is all over the Chinese internet:

(source)


Translation loops

December 10, 2020 @ 8:23 pm · Filed by Victor Mair under Language and computers, Lost in translation, Translation

From Jeff DeMarco:

I’m sure you’ve seen the Facebook translation artifact where it repeats “and I’m going to go to the middle of the day.” This post does that and something similar with “of the 912th.” I keep advising Facebook that these are unintelligible, but they seem to be a low priority.


Thanks wasabi

October 21, 2020 @ 11:00 pm · Filed by Victor Mair under Language and computers, Lost in translation

Jonathan Silk wonders how this mistranslation from Latin to Dutch in Google Translate occurred the same way in English:


Alphabetical storage, ordering, and retrieval

October 18, 2020 @ 6:57 am · Filed by Victor Mair under Alphabets, Dictionaries, Information technology, Language and computers, Lexicon and lexicography

We just had a good discussion about a Sinitic language written with an alphabet:

"The look, feel, and sound of Dungan language" (10/15/20)

Under "Selected readings" below, there are listed additional earlier posts about writing Sinitic languages with Romanization.

One of the major advantages of the alphabet over a morphosyllabic / logographic ideopicto-phonetic writing system like the Sinographic script is that it is very easy to order and find / retrieve the entire lexicon with the former, whereas carrying out these tasks with the latter is toilsome at best and torturesome at worst. See:

Victor H. Mair, "The Need for an Alphabetically Arranged General Usage Dictionary of Mandarin Chinese: A Review Article of Some Recent Dictionaries and Current Lexicographical Projects", Sino-Platonic Papers, 1 (February, 1986), pp. 1-31.


Archive for Language and computers
