Language Log

Archive for Data bases

Chinese Text Project augmented by AI translation

March 7, 2026 @ 7:01 pm· Filed by Victor Mair under Artificial intelligence, Corpus linguistics, Data bases, Translation

For those who don't know what the "Chinese Text Project" (CTP) is, here's a brief introduction:

The Chinese Text Project is an online open-access digital library that makes pre-modern Chinese texts available to readers and researchers all around the world. The site attempts to make use of the digital medium to explore new ways of interacting with these texts that are not possible in print. With over thirty thousand titles and more than five billion characters, the Chinese Text Project is also the largest database of pre-modern Chinese texts in existence.

You may wish to read more about the project, view the pre-Qin and Han, post-Han or Wiki tables of contents, or consult the instructions, FAQ, or list of tools. If you're looking for a particular Chinese text, you can search for texts by title across the main textual sections of the site.

(from the CTP homepage)

Language variation writ large

October 29, 2025 @ 5:05 pm· Filed by Victor Mair under Artificial intelligence, Data bases, Language and society, Language extinction, Sociolinguistics, Variation

The Vastness of Language Variation Across the Globe
Panel. AAAS 2026 Annual Meeting. Coming in February 2026.

Organizer: Lenore Grenoble, University of Chicago, Chicago, IL
Co-Organizer: Jeff Good, University at Buffalo, Buffalo, NY
Moderator: Jeff Good, University at Buffalo, Buffalo, NY

Panelists

"Multilingual Language Ecologies and Linguistic Diversity",
Wilson de Lima Silva, Linguistics, University of Arizona, Tucson, AZ

"AI Approaches to the Study of Gesture, Prosody, and Linguistic Diversity",
Kathryn Franich, Linguistics, Harvard University, Cambridge, MA

"Sometimes Big Questions Call for Small Data",
Gareth Roberts, Linguistics, University of Pennsylvania, Philadelphia, PA

Digital Hittite

March 30, 2025 @ 3:16 pm· Filed by Victor Mair under Data bases, Language and technology, Writing systems

Cuneiforms: New digital tool for translating ancient texts, University of Würzburg, ScienceDaily (March 26, 2025)

Summary: Major milestone reached in digital Cuneiform studies: Researchers present an innovative tool that offers many new possibilities

We usually associate cuneiform (Classical Latin cuneus [“wedge”] + fōrma) with Sumerian and Akkadian, but this logo-syllabic script was actually used for many languages in the ancient world: Sumerian, Akkadian, Eblaite, Elamite, Hittite, Hurrian, Luwian, Urartian, Palaic, Aramaic, Old Persian. In this post, we focus on its use for writing Hittite, the earliest attested Indo-European language, as described in the article cited above.

A new look at sperm whale communication

May 8, 2024 @ 11:20 am· Filed by Victor Mair under Animal communication, Data bases, Language and animals

For as long as I can remember, I've been aware that whales, dolphins, porpoises, and other large mammals of the seas (the cetaceans) make whistles, clicks, calls, groans, songs, and other sounds / noises. These vocalizations are manifestly complex and nuanced, leading people to believe that they are communicating content, emotions, and so forth. What exactly they are conveying and how they do it have remained a mystery, but researchers never stop trying to figure out cetacean "language". A new study at MIT claims to have made progress in analyzing sperm whale sound systems.

Scientists document remarkable sperm whale 'phonetic alphabet'
By Will Dunham, Reuters (May 7, 2024)
[with 2:58 video]

I was hesitant to read this article at all because of the mention of a "phonetic alphabet". Even with the quotation marks around it, attributing this ability to sperm whales was a bit much for me.

Yet, since it was "scientists" doing the documenting, I forced myself to read the first two paragraphs:

The various species of whales inhabiting Earth's oceans employ different types of vocalizations to communicate. Sperm whales, the largest of the toothed whales, communicate using bursts of clicking noises – called codas – sounding a bit like Morse code.

A new analysis of years of vocalizations by sperm whales in the eastern Caribbean has found that their system of communication is more sophisticated than previously known, exhibiting a complex internal structure replete with a "phonetic alphabet." The researchers identified similarities to aspects of other animal communication systems – and even human language.

The language of spices

January 6, 2024 @ 6:49 am· Filed by Victor Mair under Announcements, Data bases, Language and food, Philology

Sino-Platonic Papers is pleased to announce the publication of its three-hundred-and-thirty-eighth issue:

“Mapping the Language of Spices: A Corpus-Based, Philological Study on the Words of the Spice Domain,” by Gábor Parti.

ABSTRACT

Most of the existing literature on spices is to be found in the areas of gastronomy, botany, and history. This study instead investigates spices on a linguistic level. It aims to be a comprehensive linguistic account of the items of the spice trade. Because of their attractive aroma and medicinal value, at certain points in history these pieces of dried plant matter have been highly desired, and from early on, they were ideal products for trade. Cultural contact and exchange and the introduction of new cultural items beget situations of language contact and linguistic acculturation. In the case of spices, not only do we have a set of items that traveled around the world, but also a set of names. This language domain is very rich in loanwords and Wanderwörter. In addition, it supplies us with myriad cases in which spice names are innovations. Still more interesting is that examples in English, Arabic, and Chinese—languages that represent major powers in the spice trade at different times—are here compared.

Data, information, knowledge, insight, wisdom, and Conspiracy Theory, part 2

April 15, 2023 @ 6:27 am· Filed by Victor Mair under Data bases, Information technology, Language and art, Language and philosophy, Language play, Linguistics in the comics, Logic, Memes

From Phillip Remaker:

Loved your deep dive on finding the provenance of the "conspiracy theory" image.

The one that claimed authorship clipped the edge of the unicorn tail.

The only version I have found that doesn't clip the edge of the unicorn tail is this one from farhan.

I don't know if that means I found the original or if the author touched it up. The page is not archived on the Internet Archive.

It seems consistent with his other art.

Vignettes of quality data impoverishment in the world of PRC AI

February 23, 2023 @ 8:00 am· Filed by Victor Mair under Artificial intelligence, Computational linguistics, Data bases

Some snippets:

Limited data sets a hurdle as China plays catch-up to ChatGPT

Lack of high-quality Chinese texts on Internet a barrier to training AI models.

Ryan McMorrow, Nian Liu, Eleanor Olcott, and Madhumita Murgia, FT, Ars Technica (2/21/23)

…

Baidu struggled with its previous attempt at a chatbot, known as Plato, which analysts said could not even answer a simple question such as: “When is Alibaba co-founder Jack Ma’s birthday?”

Analysts point to the lack of high-quality Chinese-language text on the Internet and in other data sets as a barrier for training AI software.

GPT, the program underlying ChatGPT, sucked in hundreds of thousands of English academic papers, news articles, books, and social media posts to learn the patterns that form language. Meanwhile, Baidu’s Ernie has been trained primarily on Chinese-language data as well as English-language data from Wikipedia and Reddit.

…

Sinitic ideophones

May 14, 2022 @ 1:30 pm· Filed by Victor Mair under Data bases, Lexicon and lexicography, Phonetics and phonology, Tones

I have always felt that binoms are a key to studying early vernacular Sinitic. (See "Selected readings" below for useful references on this topic.) Now we have a valuable research tool for access to and analysis of premodern Sinitic binoms, which fall within the purview of the tabulated listings introduced here: Thomas van Hoey and Arthur Lewis Thompson, The Chinese Ideophone Database (CHIDEOD), Cahiers de Linguistique Asie Orientale, 26 Oct 2020.

Arabic and the vernaculars, part 3

March 9, 2022 @ 5:27 am· Filed by Victor Mair under Data bases, Topolects, Vernacular

For Arabic diglossia references, see the works of Mohamed Maamouri, e.g., here, here, here, here, here, here, and here (pdf).

Also consult the various Arabic datasets of the LDC (Linguistic Data Consortium), both MSA and colloquial.

An important point to make is that the regional Arabic "colloquials" have been developing in separate directions nearly as long as the regional Romance varieties have. So Moroccan Arabic is roughly as different from Gulf Arabic as (say) French is from Portuguese….

Language meets literature; rationality vs. experience; fiction vis-à-vis nonfiction

December 30, 2021 @ 8:24 am· Filed by Victor Mair under Changing times, Data bases, Environment and ecology, Evolution of language, Language and culture, Language and science, Language and society, Language change

New article in PNAS (Proceedings of the National Academy of Sciences of the United States of America), "The rise and fall of rationality in language", Marten Scheffer, Ingrid van de Leemput, Els Weinans, and Johan Bollen (12/21/21)

118 (51) e2107848118; https://doi.org/10.1073/pnas.2107848118
