Vignettes of quality data impoverishment in the world of PRC AI
Some snippets:
Limited data sets a hurdle as China plays catch-up to ChatGPT
Lack of high-quality Chinese texts on Internet a barrier to training AI models.
Ryan McMorrow, Nian Liu, Eleanor Olcott, and Madhumita Murgia, FT, Ars Technica (2/21/23)
…
Baidu struggled with its previous attempt at a chatbot, known as Plato, which analysts said could not even answer a simple question such as: “When is Alibaba co-founder Jack Ma’s birthday?”
Analysts point to the lack of high-quality Chinese-language text on the Internet and in other data sets as a barrier for training AI software.
GPT, the program underlying ChatGPT, sucked in hundreds of thousands of English academic papers, news articles, books, and social media posts to learn the patterns that form language. Meanwhile, Baidu’s Ernie has been trained primarily on Chinese-language data as well as English-language data from Wikipedia and Reddit.
…