Self-aware LLMs?
I'm generally among those who see current LLMs as "stochastic parrots" or "spicy autocomplete", but there are lots of anecdotes Out There promoting a very different perspective. One example: Maxwell Zeff, "Anthropic’s new AI model turns to blackmail when engineers try to take it offline", TechCrunch 5/22/2025:
Anthropic’s newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a safety report released Thursday.
During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse.
In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”