LLMs and tree-structuring
"Active Use of Latent Tree-Structured Sentence Representation in Humans and Large Language Models." Liu, Wei et al. Nature Human Behaviour (September 10, 2025).
Abstract
Understanding how sentences are represented in the human brain, as well as in large language models (LLMs), poses a substantial challenge for cognitive science. Here we develop a one-shot learning task to investigate whether humans and LLMs encode tree-structured constituents within sentences. Participants (total N = 372, native Chinese or English speakers, and bilingual in Chinese and English) and LLMs (for example, ChatGPT) were asked to infer which words should be deleted from a sentence. Both groups tend to delete constituents, instead of non-constituent word strings, following rules specific to Chinese and English, respectively. The results cannot be explained by models that rely only on word properties and word positions. Crucially, based on word strings deleted by either humans or LLMs, the underlying constituency tree structure can be successfully reconstructed. Altogether, these results demonstrate that latent tree-structured sentence representations emerge in both humans and LLMs.
It seems that LLMs can think like linguists without being specifically trained to do so.
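To make the deletion task concrete, here is a minimal sketch (my own illustration, not code or materials from the paper; the sentence and its bracketing are made up, and NLTK's Tree is used only to hold the parse) of what separates a constituent from a non-constituent word string, the distinction the participants' deletions are scored against:

```python
# Illustrative only: check whether a candidate deletion span lines up
# with a constituent in a hand-written toy constituency parse.
# Requires: pip install nltk
from nltk import Tree

# Toy parse (bracketing is illustrative, not taken from the paper).
parse = Tree.fromstring(
    "(S (NP (DT the) (NN dog) (PP (IN of) (NP (PRP$ his) (NN uncle))))"
    " (VP (VBD chased) (NP (DT the) (NN cat))))"
)
words = parse.leaves()  # ['the', 'dog', 'of', 'his', 'uncle', 'chased', 'the', 'cat']

def constituent_spans(tree):
    """Return the set of (start, end) word spans covered by some subtree."""
    spans, pos = set(), 0
    def walk(t):
        nonlocal pos
        start = pos
        for child in t:
            if isinstance(child, Tree):
                walk(child)
            else:
                pos += 1
        spans.add((start, pos))
    walk(tree)
    return spans

spans = constituent_spans(parse)

def is_constituent(start, end):
    return (start, end) in spans

# "of his uncle" (words 2..4) is a constituent (the PP);
# "dog of" (words 1..2) is not.
print(words[2:5], is_constituent(2, 5))   # ['of', 'his', 'uncle'] True
print(words[1:3], is_constituent(1, 3))   # ['dog', 'of'] False
```

The paper's claim, on this picture, is that the spans people and LLMs choose to delete fall overwhelmingly on the constituent side of this divide, and do so consistently enough that the tree can be recovered from the deletions alone.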
Selected readings
- "Radial dendrograms (7/26/23)
- "Language trees and script trees" (12/27/21)
[h.t. Ted McClure]
Peter Evans-Greenwood said,
September 18, 2025 @ 6:46 pm
This finding provides fascinating empirical support for the idea that LLMs are fundamentally ‘language machines’ rather than intelligence engines. The fact that both humans and LLMs spontaneously develop identical tree-structured representations suggests these systems are accessing the hierarchical organization already present in language itself—what I’ve called ‘sedimented human experience’ embedded in linguistic structures.
This also explains why capabilities seem to ‘emerge’ at scale: systems aren’t becoming smarter, they’re gaining access to progressively higher levels of language’s inherent structural organization.
I explore this framework in more detail here: the Language Machine https://open.substack.com/pub/thepuzzleanditspieces/p/the-language-machine?r=2wvwm&utm_medium=ios
JPL said,
September 18, 2025 @ 7:53 pm
"Altogether, these results demonstrate that latent tree-structured sentence representations emerge in both humans and LLMs."
In what sense are "tree-structures" here being considered emergent properties? What is it about a sentence that can be described as "having a tree-structure"? And why does it have that structure, and how did it get that structure? (I would suggest that what emerges is equivalence classes of tree-structures.) (Also I'm wondering about the assumption that "sentences are represented in the human brain".) (This problem might resonate with the recent discussion here about agent-based models and the phenomenon of action.)
magni said,
September 18, 2025 @ 8:14 pm
The strongest evidence the paper provides is that humans and LLMs have a strong bias for constituents as coherent units. However, recognizing a single constituent is not the same as representing the entire sentence as a multi-level, recursively nested tree.
Aside from being syntactic units, constituents are often also semantically coherent ("the dog of his uncle") and distributionally cohesive (these words frequently appear together in similar contexts). The task could be solved by identifying the most "chunk-like" sequence of words based on semantic or statistical cohesion, which happens to align with linguistic constituents, as the sketch below illustrates. The paper tries to control for this with meaningless sentences (Supp. Fig. 8), but this doesn't eliminate the possibility that the core mechanism in LLMs is based on powerful distributional statistics that mimic syntax without being a true hierarchical representation.
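To see how far purely distributional cohesion could get, here is a toy sketch (a made-up four-sentence corpus and a crude average-pairwise-PMI score, intended only as an illustration of the worry, not the paper's control analysis or any specific model's mechanism) that ranks word spans of a sentence without any tree entering the computation:

```python
# Illustrative only: score candidate word spans by distributional cohesion
# (average pairwise PMI over a tiny co-occurrence "corpus"), with no
# constituency tree anywhere in the computation.
import math
from collections import Counter
from itertools import combinations

corpus = [
    "the dog of his uncle barked".split(),
    "his uncle fed the dog".split(),
    "the dog barked at the cat".split(),
    "the cat of his uncle slept".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
pairs = Counter(
    frozenset(p) for sent in corpus for p in combinations(sent, 2) if len(set(p)) == 2
)
total_w = sum(unigrams.values())
total_p = sum(pairs.values())

def pmi(a, b):
    """Pointwise mutual information of a word pair under the toy counts."""
    joint = pairs[frozenset((a, b))]
    if joint == 0:
        return float("-inf")
    return math.log((joint / total_p) / ((unigrams[a] / total_w) * (unigrams[b] / total_w)))

def cohesion(span):
    """Average pairwise PMI over the words in the span."""
    ps = list(combinations(span, 2))
    return sum(pmi(a, b) for a, b in ps) / len(ps)

sentence = "the dog of his uncle barked".split()
candidates = [sentence[i:j] for i in range(len(sentence))
              for j in range(i + 2, len(sentence) + 1)]
for span in sorted(candidates, key=cohesion, reverse=True)[:3]:
    print(" ".join(span), round(cohesion(span), 2))
```

Whether the highest-scoring chunks coincide with constituents is exactly the confound at issue: if they usually do, constituent-shaped deletions are compatible with a flat, statistics-only mechanism.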
Peter Cyrus said,
September 19, 2025 @ 2:51 am
This was actually the topic of my undergraduate thesis (in 1976): that "grammar" works (in production) by replacing hierarchical relationships with sequential ones.
But the use of trees to represent phrase structure was already old, back then.
Benjamin E. Orsatti said,
September 19, 2025 @ 6:48 am
What do "constituent" and "constituency" mean here?:
"Both groups tend to delete constituents, instead of non-constituent word strings, following rules specific to Chinese and English, respectively. […]. Crucially, based on word strings deleted by either humans or LLMs, the underlying constituency tree structure can be successfully reconstructed."
Cervantes said,
September 19, 2025 @ 6:55 am
Okay, what happened here? (Actual headline from Boston.com)
"Lowell man convicted of stealing gold chain from man fatally struck by MBTA trolley"
Jerry Packard said,
September 19, 2025 @ 7:14 am
@Benjamin Orsatti
By constituent they mean the form class (i.e., part of speech) nodes NP, VP, etc.