Language Log

Archive for Corpus linguistics

Chinese Text Project augmented by AI translation

March 7, 2026 @ 7:01 pm· Filed by Victor Mair under Artificial intelligence, Corpus linguistics, Data bases, Translation

For those who don't know what the "Chinese Text Project" (CTP) is, here's a brief introduction:

The Chinese Text Project is an online open-access digital library that makes pre-modern Chinese texts available to readers and researchers all around the world. The site attempts to make use of the digital medium to explore new ways of interacting with these texts that are not possible in print. With over thirty thousand titles and more than five billion characters, the Chinese Text Project is also the largest database of pre-modern Chinese texts in existence.

You may wish to read more about the project, view the pre-Qin and Han, post-Han or Wiki tables of contents, or consult the instructions, FAQ, or list of tools. If you're looking for a particular Chinese text, you can search for texts by title across the main textual sections of the site.

(from the CTP homepage)

The language of a money laundering forum

September 10, 2025 @ 8:17 am· Filed by Victor Mair under Corpus linguistics, Language and economics

"Linguistic Mechanisms of Knowledge-Exchange in a Dark-Web Money Laundering Forum." Chiang, Emily. PLOS ONE 20, no. 8 (August 5, 2025): e0329777

Abstract

Money laundering facilitates serious crime, enables the expansion of criminal operations, and destabilises economies. Extant scholarship is largely concerned with anti-money laundering approaches, with far less attention being paid to the language and behaviours of the individuals who engage in money laundering. ‘Dark-web’ discussion fora are prime loci for illicit knowledge exchange and key enablers of money laundering, yet, are underexplored as sites for understanding the online activities and behaviours of users.

Spoken vs. written Sinitic

December 13, 2024 @ 12:47 am· Filed by Victor Mair under Corpus linguistics, Language teaching and learning, Orality, Second language

The gap between spoken and written Sinitic is enormous. In my estimation, it is greater than for any other language I know. The following are some notes by Ľuboš Gajdoš about why this is so.

"The Discrepancy Between Spoken and Written Chinese — Methodological Notes on Linguistics", Comenius University in Bratislava, Department of East Asian Studies

The issue of choosing language data on which synchronous linguistic research is being done appears in many ways not only to be relevant to the goal of the research, but also to the validity of the research results. The problem which particularly concerns us here is the discrepancy between speech on the one hand and written language on the other. In this context, we have often encountered in the past a situation where the result of the research conducted on a variety of the Chinese language has been generalized to the entire synchronous state of the language, i.e. to all other varieties of the language, while ignoring the mentioned discrepancy between the spoken and written forms. The discrepancy between the spoken and written forms is likely to be present in any natural language with a written tradition, but the degree of difference between languages is uneven: e.g. compared to the Slovak language, it may be stated that the situation in Chinese is in this respect extraordinary. Nevertheless, it is surprising that the quantitative (qualitative) research on discrepancies between different varieties of the language has not yet aroused the attention of Chinese linguistics to such an extent as would have been adequate for the unique situation of this natural language.
