Language Log

"Like learning physics by watching Einstein do yoga"

July 27, 2025 @ 6:54 am · Filed by Mark Liberman under Artificial intelligence

The most interesting LLM research that I've seen recently is from Alex Cloud and others at Anthropic and Truthful AI, "Subliminal Learning: Language models transmit behavioral traits via hidden signals in data", 7/20/2025:

ABSTRACT: We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.

Owain Evans lays the results out clearly in a thread on X (pdf version here). You should read either the paper or the thread, but here are a couple of the highlights:

I'll note that the term "subliminal" is somewhat misleading in this case, since in (human) psychology it refers to normally-interpretable stimuli that are below the threshold for conscious perception.

I'll also note that the authors use "being misaligned" to mean "being evil", as per our earlier posts on "alignment".

This also reminds me of the fashion a decade ago for re-training with noise to cause image-classification systems to do silly things:

July 27, 2025 @ 6:54 am · Filed by Mark Liberman under Artificial intelligence

Permalink

2 Comments

cM said,

July 28, 2025 @ 3:20 pm

An annoying case of "the internet forgets things all the time": That short URL in the last picture will stop resolving in four weeks. I thought I'd do something useful and post the actual URL.

Well, turns out it is dead already: https://plus.sandbox.google.com/u/0/photos/105708995245287396734/albums/5979639538888388769/5979639538631455794?pid=5979639538631455794&oid=105708995245287396734
Pamela said,

August 3, 2025 @ 9:24 am

Nice example of how the sinister dynamics of massive data accumulation are continued and amplified in AI. Of course filtering for T does not change the effects of T on the basic priorities and relations of data. It merely disguises the source of it. This appears to be part of the explanation for AI hallucination and failure to be able to check the factuality of its own assertsions. What AI has achieved is the perfect emulation of prejudice, tunnel vision and most of the other hobgoblins of perception and consciousness. If human consciousness progresses, it will have to teach that progress to AI: AI is not going to lead the way to better perception of analysis of anything beyond the most specific, bounded, and, one hopes, harmless tasks.

RSS feed for comments on this post

"Like learning physics by watching Einstein do yoga"

2 Comments

cM said,

Pamela said,

Follow us on Twitter

Archives [+/–]

Blogroll [+/–]

Meta