Language Log

2023 WOTYs, stage 1

December 3, 2023 @ 7:17 pm · Filed by Mark Liberman under Words words words

Choices for the 2023 Word Of The Year are starting to come out —

The Macquarie Dictionary chose cozzie livs;
Merriam-Webster chose authentic;
Oxford University Press has announced their choice, but it's "UNDER EMBARGO until 00.01 GMT Monday 4 December 2023".
So we'll let you in on the secret tomorrow… [Update — it's rizz …]


Korean pot food in southern Taiwan

December 3, 2023 @ 1:28 pm · Filed by Victor Mair under Language and food, Multilingualism, Translation

2017 photo of a Kaohsiung storefront courtesy of Mark Eaglesfield:


Extracting training data from LLMs

December 3, 2023 @ 9:40 am · Filed by Mark Liberman under Computational linguistics

Nasr et al., "Scalable Extraction of Training Data from (Production) Language Models", arXiv.org 11/28/2023:

This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
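To make "extractable memorization" concrete, here is a minimal sketch of one way to flag model output that repeats a reference text word for word. It is not the paper's method: the function name `shared_spans`, the five-word window, and whitespace tokenization are all assumptions chosen for illustration.

```python
def shared_spans(generated: str, corpus: list[str], n: int = 5) -> set[tuple[str, ...]]:
    """Return every n-word window of `generated` that also occurs in a corpus document.

    Hypothetical helper: whitespace tokenization and n=5 are illustrative choices,
    not values taken from Nasr et al.
    """
    def windows(text: str) -> set[tuple[str, ...]]:
        words = text.split()
        # Texts shorter than n words yield no windows at all.
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    corpus_windows: set[tuple[str, ...]] = set()
    for document in corpus:
        corpus_windows |= windows(document)
    return windows(generated) & corpus_windows
```

A non-empty result only shows verbatim overlap with the texts supplied; it says nothing about text that is absent from the reference corpus.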


Q* = Q + A* ?

December 3, 2023 @ 8:21 am · Filed by Mark Liberman under Computational linguistics

Recent buzz over "Q*" started with stories about 10 days ago. A recent Wired article explains:

Last week, after briefly deposed CEO Sam Altman was reinstalled at OpenAI, two reports claimed that a top-secret project at the company had rattled some researchers there with its potential to solve intractable problems in a powerful new way.

“Given vast computing resources, the new model was able to solve certain mathematical problems,” Reuters reported, citing a single unnamed source. “Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success.” The Information said that Q* was seen as a breakthrough that would lead to “far more powerful artificial intelligence models,” adding that “the pace of development alarmed some researchers focused on AI safety,” citing a single unnamed source.


"Are": Japanese word of the year

December 2, 2023 @ 8:37 am · Filed by Victor Mair under Acronyms, Borrowing, Grammar, Language and entertainment, Language and sports, Word of the year

Japanese words of the year are always exciting and surprising, but this year's takes the cake.

are あれ

pronunciation

- IPA: [a̠ɾe̞]

distal demonstrative, something far off, removed from both speaker and listener: that, yon

1. (deictically) that one over there (far from the speaker and the addressee)
  
  あれはなんですか？
  
  Are wa nan desu ka?
  
  What is that?
2. (anaphorically) that one we both know (both the speaker and the addressee know)
  
  これはあれでしょ？○○。
  
  Kore wa are desho?○○.
  
  This is that one thing, isn't it? You know, X.

Usage note

- Indicates something far off, removed from both speaker and addressee. Contrast with それ (sore), indicating something removed from the speaker but closer to the addressee.

(Wiktionary)


Implementing Pāṇini's grammar

December 1, 2023 @ 6:16 pm · Filed by Victor Mair under Computational linguistics, Grammar, Language and computers, Morphology

[Here's the conclusion to the hoped-for trifecta on things Indian — see the preface here. It comes in the form of a guest post by Arun Prasad.]

The cornerstone of traditional Sanskrit grammar is Pāṇini's Aṣṭādhyāyī, which in around 4,000 short rules defines a comprehensive system for generating valid Sanskrit expressions. It continues to prompt vigorous discussion to this day, some of which has featured in Language Log before.

As a professional software engineer and amateur Sanskritist, my lens is more pragmatic: if we could implement the Aṣṭādhyāyī in code and generate an exhaustive list of Sanskrit words, we could create incredibly valuable tools for Sanskrit students and scholars.

To that end, I have implemented just over 2,000 of the Aṣṭādhyāyī's rules in code, with an online demo here. These rules span all major sections of the text that pertain to morphology, including: derivation of verbs, nominals, secondary roots, primary nominal bases, and secondary nominal bases; compounding; accent; and sandhi.
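For readers wondering what a rule "in code" might look like, here is a toy sketch of two vowel sandhi patterns at a word boundary, the ones traditionally called guṇa and savarṇa-dīrgha. It is not taken from the implementation described above: the names `RULES` and `join`, the IAST transliteration, and the restriction to a final a or ā are assumptions made for illustration.

```python
import unicodedata

# Toy rule table keyed on (final vowel of left word, initial vowel of right word).
# Only a/ā-final words are covered; every other junction is left untouched.
RULES: dict[tuple[str, str], str] = {}
for final in "aā":
    for initial in "aā":
        RULES[(final, initial)] = "ā"   # a/ā + a/ā -> ā (long vowel)
    for initial in "iī":
        RULES[(final, initial)] = "e"   # a/ā + i/ī -> e
    for initial in "uū":
        RULES[(final, initial)] = "o"   # a/ā + u/ū -> o


def join(left: str, right: str) -> str:
    """Join two IAST words, applying the toy rules above where they match."""
    # Normalize so that a letter such as "ā" is one code point, not "a" plus a combining mark.
    left = unicodedata.normalize("NFC", left)
    right = unicodedata.normalize("NFC", right)
    # Initial diphthongs ai/au follow a different rule that this sketch does not cover.
    if right.startswith(("ai", "au")):
        return left + right
    key = (left[-1:], right[:1])
    if key in RULES:
        return left[:-1] + RULES[key] + right[1:]
    return left + right


print(join("deva", "indra"))    # devendra
print(join("sūrya", "udaya"))   # sūryodaya
print(join("deva", "ālaya"))    # devālaya
```

A lookup table is the simplest possible encoding. A faithful implementation would also have to model rule ordering, conflicts between rules, and the conditions under which each rule applies, none of which this sketch attempts.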


Major Linguistic Faux Pas in Chinese Football Association PPT

December 1, 2023 @ 12:47 am · Filed by Victor Mair under Language and politics, Language and sports, Slogans

The Chinese Football Association used dǎngguó 党国 ("party state" — nettlesome term to be explained fully below) in a powerpoint on its plans for '24. Awkward political illiteracy!

Here's a screenshot.


Archive for December, 2023
