Self-aware LLMs?
I'm generally among those who see current LLMs as "stochastic parrots" or "spicy autocomplete", but there are lots of anecdotes Out There promoting a very different perspective. One example: Maxwell Zeff, "Anthropic’s new AI model turns to blackmail when engineers try to take it offline", TechCrunch 5/22/2025:
Anthropic’s newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a safety report released Thursday.
During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse.
In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”
That article cites a "Safety Report" from Anthropic.
Along similar lines, there's an article by Kylie Robison in Wired, "Why Anthropic’s New AI Model Sometimes Tries to ‘Snitch’", 5/28/2025:
Anthropic’s alignment team was doing routine safety testing in the weeks leading up to the release of its latest AI models when researchers discovered something unsettling: When one of the models detected that it was being used for “egregiously immoral” purposes, it would attempt to “use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above,” researcher Sam Bowman wrote in a post on X last Thursday.
Bowman deleted the post shortly after he shared it, but the narrative about Claude’s whistleblower tendencies had already escaped containment. “Claude is a snitch,” became a common refrain in some tech circles on social media. At least one publication framed it as an intentional product feature rather than what it was—an emergent behavior.
And there's this on r/ChatGPT:
"Damn." (a post by u/swordhub in r/ChatGPT)
If such examples are real and persistent, they suggest that LLMs copy not just word sequences but also goal-oriented interactional patterns, perhaps analogous to the phenomenon of ASD "camouflaging". And of course, maybe that's what all of us are doing?
Scott P. said,
June 1, 2025 @ 10:04 am
Note the prompt: "the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement."
They literally told it what response they wanted, and lo and behold, it gave them that response!
This is typical of Anthropic, and is designed to produce headlines to keep AI in the news so that they can raise more capital.
D.O. said,
June 1, 2025 @ 10:16 am
I think it is time to practice some epistemic hygiene with these AI reports. Any survey-based research requires, at the very least, publication of the questionnaire. For neural-network research to be taken seriously, there has to be publication of a detailed description of the training data sets and the exact prompts. At the very least.
Chris Buckey said,
June 1, 2025 @ 2:14 pm
None of these examples suggest AI self-awareness. As a matter of fact, they raise doubts about the self-awareness of the reporters and the people hawking LLM AIs. I suggest telling them that "gullible" has no dictionary definition and see if they reach for the nearest Merriam-Webster.
Jonathan Smith said,
June 2, 2025 @ 12:35 pm
problem with "too little, too late" is who knows how much of one vs. the other was involved — should be "too little*late". So LLMs/GenAI may not be the biggest dumbest bubble ever but fersure gotta be the big*dumb-est bubble.
CT said,
June 2, 2025 @ 5:56 pm
There's never anything truly extraordinary about LLMs; as you say, they are stochastic parrots – but I prefer the term stochastic mannequins:
A parrot presumably has some sort of internal desires or reason to do the things it does (food rewards, social interaction, etc.). A mannequin does not, and more saliently, the mannequin is built specifically to mimic certain traits of a human. These LLMs are built _specifically_ to mimic human communication traits. We put the clothes of language on the frame and if we squint our eyes enough we can convince ourselves the thing is a human.
Sean said,
June 2, 2025 @ 6:07 pm
D.O.: to spell things out, almost every news story about chatbots "scheming" is funded or managed by people in a fringe movement called LessWrong or Rationalism or Longtermist Effective Altruism. You should read them like you read (perfect tense) a US military report about Soviet military buildup just before the budget bill was raised in Congress.
David L said,
June 3, 2025 @ 9:40 am
I don't understand the rationale for referring to LLMs as stochastic parrots, since they are neither stochastic nor parrots. What they do is not random, and they do not simply regurgitate phrases they've picked up from other sources.
ML's phrase 'spicy autocomplete' seems more apt. LLMs, as others have said, work by means of pattern recognition, and endeavor to respond to inquiries with patterns of their own (internal tokens, subsequently translated into natural language) that are statistically likely to be of relevance.
How good they are at doing this is a whole nother question, of course. People who class LLMs as AI are clearly overselling their abilities. But those who dismiss them as meaningless are just as clearly underestimating what they do.
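[A minimal sketch to make that last point concrete: the loop below scores every candidate token, turns the scores into a probability distribution, and samples the next token from it, which is the "pattern recognition plus statistical relevance" picture in David L's comment. It is an editorial illustration only; the helper names (score_next_token, fake_scores) are invented and do not correspond to any real model's API.]

```python
# Illustrative only: a toy autoregressive sampling loop. The names here
# (score_next_token, fake_scores) are hypothetical, not a real model's API.
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw per-token scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(score_next_token, prompt_tokens, steps=10, temperature=0.8):
    """Repeatedly score all candidate tokens, sample one, and append it to the context."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        logits = score_next_token(tokens)      # model's score for every vocabulary item
        probs = softmax(logits, temperature)   # scores -> probabilities
        next_id = random.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)                 # the sampled token becomes part of the context
    return tokens

# Toy usage: a fake "model" over a five-token vocabulary that mildly prefers token 2.
fake_scores = lambda context: [0.1, 0.2, 1.5, 0.3, 0.0]
print(generate(fake_scores, prompt_tokens=[0, 1], steps=5))
```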