Archive for Language and computers

Language is not script and script is not language

Trying to clear up the confusion between the two is a battle we have been waging for decades, and nowhere is the problem more severe than in the study of Sinitic languages and the Sinographic script.  The crisis (not a "danger + opportunity"!) has come to the surface again this month with the appearance of a new book by Jing Tsu titled Kingdom of Characters: The Language Revolution That Made China Modern (Riverhead Books, 2022).

The publication of Tsu's book has generated a lot of excitement, publicity, and reviews.  Here I would like to call attention to the brief remarks of an anonymous correspondent (a famous, reclusive linguist) that are right on target:

Reimagining "antiquated" Chinese

Reproduced below is the text of a book review in Science that you may not have seen. It is classified as "Linguistics", though the reviewer is a historian at Cal State Poly, Pomona. Notice that Chinese is assumed to be "antiquated" and in need of being "reimagined"!  There is simply no sign of Science understanding the difference between a human language and a writing system. This is consistent with the way they have always treated linguistics; they have no idea what the subject really is.

Read the rest of this entry »

Comments (19)

AI cat and mouse robot censorship war

Now it's getting interesting:

"China’s internet police losing man-versus-machine duel on social media"

Stephen Chen, SCMP (11/14/21)

    Hordes of bot accounts using clever dodging tactics are causing burnout among human censors, police investigative paper finds
    Authorities may respond by raising a counter-army of automated accounts or even an AI-driven public opinion leader

Read the rest of this entry »

Comments (3)

rime-cantonese, a Cantonese lexicon for building keyboards and more

The following is a guest post by Mingfei Lau. A short intro about the author:

My name is Mingfei Lau, a member of The Linguistic Society of Hong Kong Jyutping Workgroup. I am a language engineer at Amazon and I work on different projects on Cantonese resource development in my spare time.


Today, Pinyin is undoubtedly the most popular way to type Mandarin. But what about Cantonese? This wasn’t easy until rime-cantonese, the normalized Cantonese Jyutping lexicon, appeared. Lo and behold, you can now type Cantonese in Jyutping just like typing Mandarin in Pinyin.

Read the rest of this entry »

Comments (4)

Massive long-term data storage

News release in EurekAlert, Optica (10/28/21):

"High-speed laser writing method could pack 500 terabytes of data into CD-sized glass disc:  Advances make high-density, 5D optical storage practical for long-term data archiving"

Caption

Researchers developed a new fast and energy-efficient laser-writing method for producing nanostructures in silica glass. They used the method to record 6 GB data in a one-inch silica glass sample. The four squares pictured each measure just 8.8 X 8.8 mm. They also used the laser-writing method to write the university logo and mark on the glass.

Credit

Yuhao Lei and Peter G. Kazansky, University of Southampton

Read the rest of this entry »

Comments (9)

Difficult languages and easy languages, part 3

There may well be a dogma out there stating that all languages are equally complex, but I don't believe it, especially not if it has to be "drummed" into our minds.  I have learned many languages.  Some of them are exceedingly hard (because of their complexity) and some of them are relatively easy (because they are comparatively simple).  I have often said that Mandarin is the easiest language I ever learned to speak, but the hardest to read and write in characters (though very easy in Romanization).  And remember these posts:

"Difficult languages and easy languages" (3/4/17)

"Difficult languages and easy languages, part 2" (5/28/19)

Read the rest of this entry »

Comments (33)

The implications of Chinese for AI development, part 2

With this post, we are already acquainted with Inspur's Yuan 1.0, "one of the most advanced deep learning language models that can generate coherent Chinese texts."  Now, with the present article, we will delve more deeply into the potentials and pitfalls of Inspur's deep learning language model:

"Inspur unveils GPT-3 equivalent for Chinese language", by Wei Sheng, TechNode (10/26/21)

The model is trained with 245.7 billion parameters—the number of weights in an artificial neural network, according to the company. This is more than the Elon Musk-backed GPT-3 language model for English, which has 175 billion parameters. Inspur said the Yuan model was trained with 5 terabytes of datasets.

Read the rest of this entry »

Comments (4)

The implications of Chinese for AI development

New article in EnterpriseAI (October 21, 2021):

"Language Model Training Gets Another Player: Inspur AI Research Unveils Yuan 1.0",  by Todd R. Weiss

From Pranav Mulgund:

This article introduces an interesting new advance in an artificial intelligence (AI) model for Chinese. As you probably know, Chinese has been long held as one of the hardest languages for AI to crack. Baidu and Google have both been trying for a long time, but have had a lot of difficulty given the complexity of the language. But the company Inspur just came out with a model called Yuan 1.0 that shows significant advances from previous companies' AIs.

Read the rest of this entry »

Comments (5)

Characterless Sinitic

Valerie Hansen is Director of Undergraduate Studies for East Asian Studies at Yale.  Yesterday she was talking to a sophomore who had taken 1st and 2nd year Mandarin online and is about to start 3rd year.  Valerie writes:

After a while, she told me that she did have one worry about taking 3rd year: she had never written a single character and she wondered if her teacher would expect her to know how to write characters.

She can read Chinese and uses the computer to write essays. So in essence she knows pinyin and can identify the characters she needs when she writes something.
 
Is this the future of Chinese? Only computers will know characters?

Read the rest of this entry »

Comments (19)

Another chapter in the history of the Chinese typewriter

Brian Merriman ran into this article and device when researching electronic typewriters from the 1980s:

Read the rest of this entry »

Comments (2)

Tortured phrases

Article by Holly Else in Nature (8/5/21):

"‘Tortured phrases’ give away fabricated research papers

Analysis reveals that strange turns of phrase may indicate foul play in science"

Here are the beginning and a few other selected portions of the article:

In April 2021, a series of strange phrases in journal articles piqued the interest of a group of computer scientists. The researchers could not understand why researchers would use the terms ‘counterfeit consciousness’, ‘profound neural organization’ and ‘colossal information’ in place of the more widely recognized terms ‘artificial intelligence’, ‘deep neural network’ and ‘big data’.

Further investigation revealed that these strange terms — which they dub “tortured phrases” — are probably the result of automated translation or software that attempts to disguise plagiarism. And they seem to be rife in computer-science papers.

Research-integrity sleuths say that Cabanac* and his colleagues have uncovered a new type of fabricated research paper, and that their work, posted in a preprint on arXiv on 12 July, might expose only the tip of the iceberg when it comes to the literature affected.

[*VHM:  Guillaume Cabanac, a computer scientist at the University of Toulouse, France]

Read the rest of this entry »

Comments (28)

4-digit numbers versus 5-digit numbers

Phil H wrote these comments to "Uncommon words of anguish" (7/18/21):

The anguish is very real. My wife had a character in her name that most computers will not reproduce ([石羡]), despite it being relatively common in names in our part of the world, and has been refused bank accounts, credit cards, and a mortgage because of it. In the end she changed her name rather than continue to deal with the hassle. The character is in the standard, but it was too late for us.

…there have always been ways to get the character onto a computer, but any given piece of bank software might not recognise it, and any given bank functionary might be unfamiliar with them. We then had trouble when some organisations used the pinyin XIAN in place of the character, but that then made their documentation inconsistent with her national ID card (which had the right character on it) and so yet further bodies would not accept them… It was the standard "mild computer snafu + large inflexible bureaucracy = major headache" equation.

An anonymous correspondent, a computer scientist, sent in the following remarks:

Phil H is talking about a character which is in a "supplementary plane" in Unicode (and similarly in GB-18030).  Unfortunately, an awful lot of software was only ever tested on Basic Multilingual Plane characters.
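To make the "4-digit versus 5-digit" distinction concrete, here is a minimal Python sketch. It is only an illustration, not the correspondent's own code: the helper name `describe` and the two sample characters are assumptions chosen for the example (U+4E2D 中 from the Basic Multilingual Plane, and U+20000, the first code point of CJK Extension B, standing in for any supplementary-plane character). It shows why software that assumes one UTF-16 unit, or a two-byte GB code, per character can stumble.

```python
def describe(ch: str) -> dict:
    """Summarize how a single character is represented in common encodings."""
    cp = ord(ch)
    return {
        # BMP code points fit in four hex digits; supplementary ones need five or six.
        "code_point": f"U+{cp:04X}",
        "in_bmp": cp <= 0xFFFF,
        # UTF-16 uses one 16-bit unit for BMP characters and a surrogate pair otherwise.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
        "utf8_bytes": len(ch.encode("utf-8")),
        # GB 18030 uses a four-byte form for supplementary-plane characters.
        "gb18030_bytes": len(ch.encode("gb18030")),
    }


bmp_char = "\u4e2d"        # 中, a common BMP character
supp_char = "\U00020000"   # first character of CJK Extension B (illustrative stand-in)

for ch in (bmp_char, supp_char):
    print(describe(ch))
```

A program that stores "one character" in a single 16-bit slot, or that validates names by counting UTF-16 units, will handle the first character correctly and mangle or reject the second, which is the kind of failure Phil H describes.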

Read the rest of this entry »

Comments (16)

Uncommon words of anguish

From a manual for a thermal printer:

Dǎyìn kòngzhì bǎn nèizhì GB18030 Zhōngwén zìkù, chèdǐ miǎnchú shēngpì zì de kǔnǎo

打印控制板内置 GB18030 中文字库,彻底免除生僻字的苦恼

Printer control panel built-in GB18030 Chinese character, thoroughly remove the uncommon words of anguish

(courtesy of Amy de Buitléir)

A more accurate English translation would be:

Printer control panel with built-in GB18030 Chinese character font, thoroughly removing the anguish brought about by uncommon / obscure characters

"GB" stands for "guóbiāo 国标" ("national standard"), and is used for many technical terms in the PRC (another instance of encroaching digraphia, for which see here and here [with extensive bibliography]).

Read the rest of this entry »

Comments (14)

Character confusion: three-child policy

Read the rest of this entry »

Comments (13)