Latent trees


There's been some buzz recently about how syntactic structures are implicit in Large Language Models — most recently, the Liu et al. paper noted yesterday by Victor, and an accepted ms by Futrell and Mahowald at Behavioral and Brain Sciences, "How Linguistics Learned to Stop Worrying and Love the Language Models". Futrell and Mahowald recognize something that Liu et al. mostly ignore, namely that constituent structure is obviously implicit in statistical patterns of sequential data, at least if the sequences were generated by a constituency-sensitive process — and that algorithms taking advantage of that fact have been Out There for 70 years or more.

From the Futrell and Mahowald paper:

In fact, there is a long tradition of statistical approaches to language, and these traditions were deeply involved in the early development of language models.

Perhaps the most well-known such contribution is distributional semantics: the idea that the semantics of a word is related to the distribution over contexts in which that word appears, often cited to Firth (1957, p. 11), and developed more systematically by Harris (1954). This idea originates from the school of structuralist linguistics (Saussure, 1916; Bloomfield, 1926). Structuralist linguistics had as one theoretical aim the development of discovery procedures, which were formal procedures that could be applied to bodies of text in order to discover (and even define) linguistic structures (Harris, 1951). The most well-developed of these discovery procedures were statistical in nature. For example, Harris (1955) developed a theory of words and morphemes based on statistical co-occurrence patterns, including a procedure for discovering morpheme boundaries by effectively calculating transitional probabilities, an idea taken up again much later in the psycholinguistics of language learning (Saffran et al., 1996), and closely related to tokenization methods such as byte-pair encoding (Shibata et al., 1999; Tanaka-Ishii, 2021, Ch. 11). At issue was the relationship between grammatical structure and the observable statistical structure of a corpus of language—one of the core questions that linguists working on language models are interested in again today.

For a trivial version of this simple idea, see my 2003 post "Parsers that count", which shows that (when there's enough training data) constituent relations can be inferred from simple bigram counts:

Bracketing   Example                            First bigram   Second bigram
[NN]N        sickle cell anemia                       10,561           2,422
N[NN]        rat bile duct                               203          22,366
[NA]N        information theoretic criterion             112               5
N[AN]        monkey temporal lobe                         16          10,154
[AN]N        giant cell tumour                         7,272           1,345
A[NN]        cellular drug transport                     262             746
[AA]N        small intestinal activity                 8,723             120
A[AN]        inadequate topical cooling                    4             195

The numbers are just counts of how often each adjacent pair of words occurs in (our 2003 local version of) the Medline corpus (which has about a billion words of text overall). Thus the sequence "sickle cell" occurs 10,561 times, while the sequence "cell anemia" occurs 2,422 times. In each case the bracketing shown is the one whose internal bigram is the more frequent of the two: "sickle cell" beats "cell anemia", so we get [sickle cell] anemia, while "bile duct" beats "rat bile", so we get rat [bile duct].
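That decision rule takes only a few lines to write down. Here's a minimal Python sketch (not the 2003 parser's actual code), with the counts copied from the table above standing in for the full Medline bigram table:

    from collections import Counter

    def bigram_counts(tokens):
        """Count adjacent word pairs in a tokenized corpus."""
        return Counter(zip(tokens, tokens[1:]))

    def bracket_trigram(w1, w2, w3, counts):
        """Bracket a three-word compound as [w1 w2] w3 if the first bigram
        is at least as frequent as the second, otherwise as w1 [w2 w3]."""
        if counts[(w1, w2)] >= counts[(w2, w3)]:
            return f"[{w1} {w2}] {w3}"
        return f"{w1} [{w2} {w3}]"

    # Counts taken from the table above (the corpus itself isn't included here).
    counts = Counter({
        ("sickle", "cell"): 10561, ("cell", "anemia"): 2422,
        ("rat", "bile"): 203, ("bile", "duct"): 22366,
    })

    print(bracket_trigram("sickle", "cell", "anemia", counts))  # [sickle cell] anemia
    print(bracket_trigram("rat", "bile", "duct", counts))       # rat [bile duct]

The same comparison, applied row by row, reproduces all eight bracketings shown in the table.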

In realistic applications the numbers are often much smaller, and it also makes sense to look at larger contexts (as Harris 1955 proposed to do). You can see both of those algorithmic additions in Richard Sproat, Chilin Shih, William Gale, and Nancy Chang, "A stochastic finite-state word-segmentation algorithm for Chinese", 1994. Or in a 2015 replication that started as an algorithm for finding phrases in music, and was then checked, just for fun, on text: Sascha Griffiths, Matthew Purver, and Geraint Wiggins, "From phoneme to morpheme: A computational model", 2015.
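One standard way to cope with small counts is to normalize them, for instance by comparing forward transitional probabilities rather than raw frequencies, and to posit a boundary wherever the probability dips. The sketch below is a generic illustration of that Harris/Saffran-style idea on made-up character data; it is not the algorithm in either of the papers just cited:

    from collections import Counter

    def transitional_probabilities(sequences):
        """Estimate P(next symbol | current symbol) from a list of
        symbol sequences (characters, phonemes, morphs, ...)."""
        unigrams, bigrams = Counter(), Counter()
        for seq in sequences:
            unigrams.update(seq[:-1])            # count only positions that have a successor
            bigrams.update(zip(seq, seq[1:]))
        return {(x, y): c / unigrams[x] for (x, y), c in bigrams.items()}

    def boundaries(seq, tp, threshold=0.5):
        """Posit a boundary after position i wherever the forward
        transitional probability falls below an (arbitrary) threshold."""
        return [i + 1 for i, pair in enumerate(zip(seq, seq[1:]))
                if tp.get(pair, 0.0) < threshold]

    # Toy corpus built from the "words" the, dog, cat in varying orders.
    corpus = ["thedogcatthe", "catdogthedog", "dogthecatdog"]
    tp = transitional_probabilities(corpus)
    print(boundaries("thedogcat", tp))   # [6]: a dip after "thedog"; a corpus this tiny misses some boundaries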

There's some interesting and relevant work by Gašper Beguš, oddly ignored by both Liu et al. and Futrell et al. — e.g. this and this.

And there's more to say about the various stages of innovation in "language models", from Alan Turing in the Enigma project to today — the fact that there's a history doesn't mean that there's no progress — but that's all I have time for this morning.

2 Comments

  1. Jerry Packard said,

    September 19, 2025 @ 7:22 am

    If it is mere constituency per se then I agree that the stochastic nature of the parser does the job. Where it falls short, I think, is for higher order elements such as anaphora, agreement and pragmatic/thematic relations.

  2. Mark Liberman said,

    September 19, 2025 @ 8:48 am

    @Jerry Packard: "Where it falls short, I think, is for higher order elements such as anaphora, agreement and pragmatic/thematic relations"

    There have been many statistical approaches to those issues, which also suggest reasons that aspects of (the extension of) the concepts involved should be implicit in the enormous network of matrix multiplications that make up a "large language model".

    But the current discussion is about what Liu et al. call "Active use of latent tree-structured sentence representation in humans and large language models"…

