Archive for Computational linguistics

AI hype #∞

Comments (3)

DARPA/Dartmouth one/won …

Despite the evidence of my most recent relevant post, the best current speech-to-text systems still make mistakes that a literate and informed human wouldn't.

In this recent YouTube video on the history of robotics research, the automatic closed-captioning system renders "DARPA" as "Dartmouth":

Read the rest of this entry »

Comments (24)

More on LLMs' current problem-solving abilities

It's hard to keep up with the waves of hype and anti-hype in the LLM space these days.

Here's something from a few weeks ago that I missed — Xiaoxuan Wang et al., "SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models", arxiv.org 7/20/2023:

Read the rest of this entry »

Comments (10)

The state of speech-to-text

…if you haven't noticed, is good. There are many applications, from conversing with Siri and Alexa and Google Assistant, to getting voicemail in textual form, to automatically generated subtitles, and so on. For linguists, one parochial (but important) application is accurate automatic transcription of speech corpora, and the example that motivates this post comes from that world.

Read the rest of this entry »

Comments (8)

LLMs can't reason?

…though they often do a credible job of faking it.  An interesting (preprint) paper by Konstantine Arkoudas, "GPT-4 Can't Reason", brings the receipts.

Read the rest of this entry »

Comments (11)

ROT-LLM?

There's a puzzling new proposal for watermarking AI-generated text — Alistair Croll, "To Watermark AI, It Needs Its Own Alphabet", Wired 7/27/2023:

We need a way to distinguish things made by humans from things made by algorithms, and we need it very soon. […]

Fortunately, we have a solution waiting in plain sight. […]

If the companies who pledged to watermark AI content at the point of origin do so using Unicode—essentially giving AI its own character set—we’ll have a ready-made, fine-grained AI watermark that works across all devices, platforms, operating systems, and websites.

Read the rest of this entry »

Comments (22)

Mark Twain's new novel?

Today's Non Sequitur:

Read the rest of this entry »

Comments (14)

Radial dendrograms

From Sarah Gao and Andrew Gao, "On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models", arxiv.org 7/19/2023:

That's not a vinyl — it's a "radial dendrogram" — showing the evolutionary tree of nearly 6,000 Large Language Models posted at Hugging Face. Zeroing in on one quadrant, so you can read the labels:

Read the rest of this entry »

Comments (2)

Watermarking text?

Ashley Belanger, "OpenAI, Google will watermark AI-generated content to hinder deepfakes, misinfo", ars technica 7/21/2023:

Seven companies — including OpenAI, Microsoft, Google, Meta, Amazon, Anthropic, and Inflection — have committed to developing tech to clearly watermark AI-generated content. That will help make it safer to share AI-generated text, video, audio, and images without misleading others about the authenticity of that content, the Biden administration hopes.

The link goes to a 7/21 White House fact sheet with the title "FACT SHEET: Biden-Harris Administration Secures Voluntary Commitments from Leading Artificial Intelligence Companies to Manage the Risks Posed by AI". One of that document's many bullet points:

  • The companies commit to developing robust technical mechanisms to ensure that users know when content is AI generated, such as a watermarking system. This action enables creativity with AI to flourish but reduces the dangers of fraud and deception.

Read the rest of this entry »

Comments (10)

The LLM-detection boom

Joe Marshall, "As AI cheating booms, so does the industry detecting it: ‘We couldn’t keep up with demand’", The Guardian 7/5/2023:

Since its release last November, ChatGPT has shaken the education world. The chatbot and other sophisticated AI tools are reportedly being used everywhere from college essays to high school art projects. A recent survey of 1,000 students at four-year universities by Intelligent.com found that 30% of college students have reported using ChatGPT on written assignments.

This is a problem for schools, educators and students – but a boon for a small but growing cohort of companies in the AI-detection business. Players like Winston AI, Content at Scale and Turnitin are billing for their ability to detect AI-involvement in student work, offering subscription services where teachers can run their students’ work through a web dashboard and receive a probability score that grades how “human” or “AI” the text is.

Read the rest of this entry »

Comments (5)

Alan Turing's revenge?

Ilia Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget", 5/31/2023:

What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs.
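The single-Gaussian case from the abstract can be reproduced in a few lines. This is a minimal re-creation of the effect, not the authors' code: each "generation" fits a mean and standard deviation to a finite sample drawn from the previous generation's fitted model, so every generation trains only on the last generation's output:

```python
import random
import statistics

# Toy "model collapse" for a 1-D Gaussian: each generation is fit
# only to a finite sample generated by the previous generation.
random.seed(0)
mu, sigma, n = 0.0, 1.0, 20   # deliberately small n so the drift shows quickly
history = [sigma]
for generation in range(1000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    mu = statistics.fmean(sample)      # refit on generated data only
    sigma = statistics.stdev(sample)
    history.append(sigma)

# Estimation error compounds across generations: the fitted sigma
# drifts toward 0, i.e. the tails of the original distribution vanish.
```

With this seed and sample size the fitted sigma shrinks by orders of magnitude within a few hundred generations; larger samples slow the drift but, per the paper's argument, finite-sample refitting still loses the tails over time.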

Read the rest of this entry »

Comments (14)

It's impossible to detect LLM-created text

Last year, I expressed considerable skepticism about the prospects for accurate detection of text generated by Large Language Models ("Detecting LLM-created essays?", 12/20/2022). Since then, many new systems claiming to detect LLM outputs have emerged, notably Turnitin's "AI writing detector".

In a recent post on AI Weirdness ("Don't use AI detectors for anything important", 6/30/2023), Janelle Shane presents multiple examples of multiple kinds of failure, and explains why things are not likely to change.

Read the rest of this entry »

Comments (3)

Quirky text-to-speech, weird diarization

From Daniel Deutsch:

We had a long drive yesterday, so we listened to a “robot” reading the entire indictment. It certainly isn’t flawless, but I was surprised by how good it is, especially when it gets “excited” while enacting dialogue.

Indeed, the text-to-speech quality is quite good — though unfortunately they don't tell us which TTS software they used.

Here's the opening, which is indeed entirely clear and even nearly natural-sounding:

Read the rest of this entry »

Comments (2)