Archive for Data bases

Data, information, knowledge, insight, wisdom, and Conspiracy Theory, part 2

From Phillip Remaker:

The one that claimed authorship clipped the edge of the unicorn tail.

 
The only version I have found that doesn't clip the edge of the unicorn tail is this one from farhan.
 
I don't know if that means I found the original or if the author touched it up. The page is not archived on the Internet Archive.
 
It seems consistent with his other art.


Vignettes of quality data impoverishment in the world of PRC AI

Some snippets:

Limited data sets a hurdle as China plays catch-up to ChatGPT

Lack of high-quality Chinese texts on Internet a barrier to training AI models.

Ryan McMorrow, Nian Liu, Eleanor Olcott, and Madhumita Murgia, FT, Ars Technica (2/21/23)

Baidu struggled with its previous attempt at a chatbot, known as Plato, which analysts said could not even answer a simple question such as: “When is Alibaba co-founder Jack Ma’s birthday?”

Analysts point to the lack of high-quality Chinese-language text on the Internet and in other data sets as a barrier for training AI software.

GPT, the program underlying ChatGPT, sucked in hundreds of thousands of English academic papers, news articles, books, and social media posts to learn the patterns that form language. Meanwhile, Baidu’s Ernie has been trained primarily on Chinese-language data as well as English-language data from Wikipedia and Reddit.


Sinitic ideophones

I have always felt that binoms are a key to studying early vernacular Sinitic.  (See "Selected readings" below for useful references on this topic.)  Now we have a valuable research tool for access to and analysis of premodern Sinitic binoms, which fall within the purview of the tabulated listings introduced here:

The Chinese Ideophone Database (CHIDEOD)
L’ensemble de données des idéophones chinois (CHIDEOD)

In: Cahiers de Linguistique Asie Orientale (Brill)

Authors: Thomas VAN HOEY and Arthur Lewis THOMPSON

Online Publication Date: 26 Oct 2020


Arabic and the vernaculars, part 3

For Arabic diglossia references, see the works of Mohamed Maamouri, e.g., here, here, here, here, here, here, and here (pdf).

Also consult the various Arabic datasets of the LDC (Linguistic Data Consortium), both MSA and colloquial.
 
An important point to make is that the regional Arabic "colloquials" have been developing in separate directions nearly as long as the regional Romance varieties have. So Moroccan Arabic is roughly as different from Gulf Arabic as (say) French is from Portuguese….


Language meets literature; rationality vs. experience; fiction vis-à-vis nonfiction

New article in PNAS (Proceedings of the National Academy of Sciences of the United States of America), "The rise and fall of rationality in language", Marten Scheffer, Ingrid van de Leemput, Els Weinans, and Johan Bollen (12/21/21)
