LLMs and tree-structuring
"Active Use of Latent Tree-Structured Sentence Representation in Humans and Large Language Models." Liu, Wei et al. Nature Human Behaviour (September 10, 2025).
Abstract
Understanding how sentences are represented in the human brain, as well as in large language models (LLMs), poses a substantial challenge for cognitive science. Here we develop a one-shot learning task to investigate whether humans and LLMs encode tree-structured constituents within sentences. Participants (total N = 372, native Chinese or English speakers, and bilingual in Chinese and English) and LLMs (for example, ChatGPT) were asked to infer which words should be deleted from a sentence. Both groups tend to delete constituents, instead of non-constituent word strings, following rules specific to Chinese and English, respectively. The results cannot be explained by models that rely only on word properties and word positions. Crucially, based on word strings deleted by either humans or LLMs, the underlying constituency tree structure can be successfully reconstructed. Altogether, these results demonstrate that latent tree-structured sentence representations emerge in both humans and LLMs.
It seems that LLMs can think like linguists without being specifically trained to do so.
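To make the deletion task concrete, here is a minimal sketch (my own illustration, not code or materials from the paper; the sentence and its bracketing are made up, and NLTK's Tree is used only to hold the parse) of what separates a constituent from a non-constituent word string, the distinction the participants' deletions are scored against:

```python
# Illustrative only: check whether a candidate deletion span lines up
# with a constituent in a hand-written toy constituency parse.
# Requires: pip install nltk
from nltk import Tree

# Toy parse (bracketing is illustrative, not taken from the paper).
parse = Tree.fromstring(
    "(S (NP (DT the) (NN dog) (PP (IN of) (NP (PRP$ his) (NN uncle))))"
    " (VP (VBD chased) (NP (DT the) (NN cat))))"
)
words = parse.leaves()  # ['the', 'dog', 'of', 'his', 'uncle', 'chased', 'the', 'cat']

def constituent_spans(tree):
    """Return the set of (start, end) word spans covered by some subtree."""
    spans, pos = set(), 0
    def walk(t):
        nonlocal pos
        start = pos
        for child in t:
            if isinstance(child, Tree):
                walk(child)
            else:
                pos += 1
        spans.add((start, pos))
    walk(tree)
    return spans

spans = constituent_spans(parse)

def is_constituent(start, end):
    return (start, end) in spans

# "of his uncle" (words 2..4) is a constituent (the PP);
# "dog of" (words 1..2) is not.
print(words[2:5], is_constituent(2, 5))   # ['of', 'his', 'uncle'] True
print(words[1:3], is_constituent(1, 3))   # ['dog', 'of'] False
```

The paper's claim, on this picture, is that the spans people and LLMs choose to delete fall overwhelmingly on the constituent side of this divide, and do so consistently enough that the tree can be recovered from the deletions alone.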
Selected readings
- "Radial dendrograms (7/26/23)
- "Language trees and script trees" (12/27/21)
[h.t. Ted McClure]
Peter Evans-Greenwood said,
September 18, 2025 @ 6:46 pm
This finding provides fascinating empirical support for the idea that LLMs are fundamentally ‘language machines’ rather than intelligence engines. The fact that both humans and LLMs spontaneously develop identical tree-structured representations suggests these systems are accessing the hierarchical organization already present in language itself—what I’ve called ‘sedimented human experience’ embedded in linguistic structures.
This also explains why capabilities seem to ‘emerge’ at scale: systems aren’t becoming smarter, they’re gaining access to progressively higher levels of language’s inherent structural organization.
I explore this framework in more detail here: the Language Machine https://open.substack.com/pub/thepuzzleanditspieces/p/the-language-machine?r=2wvwm&utm_medium=ios
JPL said,
September 18, 2025 @ 7:53 pm
"Altogether, these results demonstrate that latent tree-structured sentence representations emerge in both humans and LLMs."
In what sense are "tree-structures" here being considered emergent properties? What is it about a sentence that can be described as "having a tree-structure"? And why does it have that structure, and how did it get that structure? (I would suggest that what emerges is equivalence classes of tree-structures.) (Also I'm wondering about the assumption that "sentences are represented in the human brain".) (This problem might resonate with the recent discussion here about agent-based models and the phenomenon of action.)
magni said,
September 18, 2025 @ 8:14 pm
The strongest evidence the paper provides is that humans and LLMs have a strong bias for constituents as coherent units. However, recognizing a single constituent is not the same as representing the entire sentence as a multi-level, recursively nested tree.
Aside from being syntactic units, constituents are often also semantically coherent ("the dog of his uncle") and distributionally cohesive (these words frequently appear together in similar contexts). The task could be solved by identifying the most "chunk-like" sequence of words based on semantic or statistical cohesion, which happens to align with linguistic constituents, as the sketch below illustrates. The paper tries to control for this with meaningless sentences (Supp. Fig. 8), but this doesn't eliminate the possibility that the core mechanism in LLMs is based on powerful distributional statistics that mimic syntax without being a true hierarchical representation.
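To see how far purely distributional cohesion could get, here is a toy sketch (a made-up four-sentence corpus and a crude average-pairwise-PMI score, intended only as an illustration of the worry, not the paper's control analysis or any specific model's mechanism) that ranks word spans of a sentence without any tree entering the computation:

```python
# Illustrative only: score candidate word spans by distributional cohesion
# (average pairwise PMI over a tiny co-occurrence "corpus"), with no
# constituency tree anywhere in the computation.
import math
from collections import Counter
from itertools import combinations

corpus = [
    "the dog of his uncle barked".split(),
    "his uncle fed the dog".split(),
    "the dog barked at the cat".split(),
    "the cat of his uncle slept".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
pairs = Counter(
    frozenset(p) for sent in corpus for p in combinations(sent, 2) if len(set(p)) == 2
)
total_w = sum(unigrams.values())
total_p = sum(pairs.values())

def pmi(a, b):
    """Pointwise mutual information of a word pair under the toy counts."""
    joint = pairs[frozenset((a, b))]
    if joint == 0:
        return float("-inf")
    return math.log((joint / total_p) / ((unigrams[a] / total_w) * (unigrams[b] / total_w)))

def cohesion(span):
    """Average pairwise PMI over the words in the span."""
    ps = list(combinations(span, 2))
    return sum(pmi(a, b) for a, b in ps) / len(ps)

sentence = "the dog of his uncle barked".split()
candidates = [sentence[i:j] for i in range(len(sentence))
              for j in range(i + 2, len(sentence) + 1)]
for span in sorted(candidates, key=cohesion, reverse=True)[:3]:
    print(" ".join(span), round(cohesion(span), 2))
```

Whether the highest-scoring chunks coincide with constituents is exactly the confound at issue: if they usually do, constituent-shaped deletions are compatible with a flat, statistics-only mechanism.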
Peter Cyrus said,
September 19, 2025 @ 2:51 am
This was actually the topic of my undergraduate thesis (in 1976): that "grammar" works (in production) by replacing hierarchical relationships with sequential ones.
But the use of trees to represent phrase structure was already old, back then.
Benjamin E. Orsatti said,
September 19, 2025 @ 6:48 am
What do "constituent" and "constituency" mean here?:
"Both groups tend to delete constituents, instead of non-constituent word strings, following rules specific to Chinese and English, respectively. […]. Crucially, based on word strings deleted by either humans or LLMs, the underlying constituency tree structure can be successfully reconstructed."
Cervantes said,
September 19, 2025 @ 6:55 am
Okay, what happened here? (Actual headline from Boston.com)
"Lowell man convicted of stealing gold chain from man fatally struck by MBTA trolley"
Jerry Packard said,
September 19, 2025 @ 7:14 am
@Benjamin Orsatti
By constituent they mean the form class (i.e., part of speech) nodes NP, VP, etc.