Alignment charts and other low-dimensional visualizations



"Will life be better in the coming year?"

So asks the Chinese colleague who sent me this photograph:



A museum for the languages of Taiwan

Language Log readers will be aware that "Chinese", i.e., "Mandarin" (Guóyǔ 國語), is not the only language on the island.  Indeed, it is a Johnny-come-lately: it became the official language of the Republic of China on Taiwan in 1945 and was strongly enforced as such after 1949, when the retreating mainland KMT armies of Chiang Kai-shek occupied the island.

The earliest indigenous languages of Taiwan (Formosa) were Austronesian.  And we should not forget that there was a period of partial Dutch rule (1624-1662), especially in the south, and Spanish Formosa (Formosa Española) was a small colony of the Spanish Empire established in the northern part of the island from 1626 to 1642.  Consequently, both Dutch and Spanish had an impact on the linguistic development of Taiwan during the 17th century.  The first Europeans to take notice of Taiwan, however, were the Portuguese who, passing Taiwan in 1544, recorded in a ship's log the name of the island as Ilha Formosa ("Beautiful Island").

Taiwan was a dependency of Japan from 1895 to 1945, during which period Japanese was the official language.  As such, it was important for the development of language on the island, and its significance lasts till today.

The influence of English in Taiwan has been enormous during the last two centuries.

See "Languages of Taiwan".



The Tocharian A word for "rug" and Old Sinitic reconstructions

There's a Chinese character 罽 (Mandarin jì, Old Sinitic *kràts), which means "rug, carpet; woolen textile; fish net".  On the basis of its sound, meaning, place, and date of occurrence, it would seem to be related to Toch. A kratsu "rug".

This raises two questions:

1. Does this Tocharian word have cognates in other IE languages?

2. Who borrowed it from whom?   Sinitic from Tocharian or Tocharian from Sinitic?



Most-hyphen-admired-space-men

Val Ross writes:

I am less scandalized by the fact Obama and Trump tied than I am by the hyphenation of most-admired. Have you ever written on this vexed issue of hyphens?



The League of Disappointing Authors



Living fossils: Taiwan tea and salmon

Two articles in Chinese (here and here) recently brought news of an indigenous type of tea and referred to it as a rare type of salmon.  Trying to figure that out led to two linguistic puzzles:

1. Making sense of the unusual name for the salmon:  yīnghuā gōu wěn guī 櫻花鉤吻鮭 (lit., "cherry-hook-kiss / mouth-salmon"; i.e., the Formosan landlocked salmon).

2. Understanding how, even metaphorically, a kind of tea would be referred to as a type of salmon.



Throes?

"Dave Barry's Year in Review 2019"

… which begins with the federal government once again in the throes (whatever a “throe” is) of a partial shutdown, which threatens to seriously disrupt the lives of all Americans who receive paychecks from the federal government. 

Consulting the OED on throe (entry updated 2017), we learn that its orthographic history is interesting:

Of uncertain origin. Perhaps a variant or alteration of another lexical item. […]
The range of forms attested for this word is difficult to account for. […]
The current standard spelling throe […] is a 16th-cent. alteration of throw, throwe […] (compare with similar alteration the current forms of roe (earlier row , rowe ), hoe (earlier how , howe ), etc.), perhaps motivated by a desire to differentiate this word from throw.



New Years party themes

Today's xkcd:

The mouseover title: "'Off-by-one errors' isn't the easiest theme to build a party around, but I've seen worse."



An 8th-century Chinese epitaph written by a Japanese courtier

Here's news of a remarkable discovery:

"Ancient Chinese epitaph penned by Japanese found in China", THE ASAHI SHIMBUN (December 26, 2019 at 19:00 JST).

The article includes a photograph of a rubbing of the last line of the epitaph with the following kanji:

日本國朝臣備書

I can read that easily as Sino-Japanese "Nihonkoku chōshin Bi sho", which would mean "written by the Japanese courtier [Ki]bi".  The article says that the last line of the epitaph reads "Nihonkoku Ason Bi Sho", so it would appear that I am reading "朝臣" incorrectly as "chōshin" instead of as "ason".



Meanest pun of the year

From "Who's Bill This Time", Wait Wait…Don't Tell Me! 12/21/2019:

Peter Sagal: Mayor- Mayor Pete has been getting some heat.
I don't know if you saw this.
He attended a big fundraiser in Napa
at a winery with a, quote, "wine cave."
And everybody was so mad that he did this.
But why would you be mad about a wine cave?
It celebrates the two things Democrats are known for, whining and caving.

 



Sweethoney dessert

Maidhc Mac Roibin sent in this photograph of the front of a dessert shop in Cupertino from Fintano's flickr site:

201908-PSP-R4-33 Sweethoney Dessert, SJ CA



Standardized Project Gutenberg Corpus

Martin Gerlach and Francesc Font-Clos, "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics", arXiv 12/19/2018:

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10⁹ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on 3 different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
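
For readers who want a concrete picture of the three levels of granularity mentioned in the abstract, here is a minimal sketch in Python. It is not the SPGC code: the function names `tokenize` and `word_counts` are made up for illustration, and the tokenizer (lowercased runs of ASCII letters) is a deliberately crude stand-in for whatever pre-processing the authors actually use.

```python
import re
from collections import Counter


def tokenize(raw_text: str) -> list[str]:
    """Turn raw text into an ordered sequence of word tokens.

    Assumption: a token is a lowercased run of ASCII letters. This is a
    simplification for illustration, not the SPGC tokenizer.
    """
    return re.findall(r"[a-z]+", raw_text.lower())


def word_counts(tokens: list[str]) -> Counter:
    """Collapse the token sequence into per-word frequency counts."""
    return Counter(tokens)


# Level 1: raw text (a toy sentence standing in for a whole book).
raw = "The cat sat on the mat. The mat stayed."

# Level 2: the "timeseries" of word tokens, in order of occurrence.
tokens = tokenize(raw)

# Level 3: counts of words, with ordering information discarded.
counts = word_counts(tokens)
```

Each level throws away some information relative to the one before it: the token sequence loses punctuation and capitalization, and the counts lose word order. That is presumably why a standardized corpus would publish all three rather than only the most compact one.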
