Back to Bacon
January 24, 2024
The implicit slogan of language-model research is J.R. Firth's dictum, "You shall know a word by the company it keeps", from his 1957 paper "A synopsis of linguistic theory, 1930-1955":
His theory that "you shall know a word by the company it keeps" / "a word is characterized by the company it keeps" inspired works on word embedding hence add [sic] a major impact in natural language processing. Many techniques were designed to build dense vectors representing words semantics based on their neighbors (e.g. Word2vec, GloVe).
Firth's 1957 paragraph footnotes Wittgenstein's Philosophical Investigations, but the cited passages deal with more general questions about the nature of meaning, based on analogies to games and so on. The phrase "you shall know a word by the company it keeps" seems more strikingly reminiscent of the old legal maxim "noscitur a sociis". Thus from Broom's 1845 Legal Maxims:
That's Sir Francis Bacon, the father of empiricism…
The same idea has been taken up many times since, e.g. in Maxwell's 1875 On the Interpretation of Statutes: "When two or more words, susceptible of analogous meaning, are coupled together, noscuntur a sociis; they are understood to be used in their cognate sense. They take, as it were, their colour from each other."
AntC said,
January 24, 2024 @ 8:23 am
“They sought it with thimbles, they sought it with care; / They pursued it with forks and hope.” (Lewis Carroll, “The Hunting of the Snark”)
Cervantes said,
January 24, 2024 @ 9:43 am
It is possible to label categories of speech acts (e.g. interrogative, expressive, representative, directive) quite reliably (kappas above .8 agreement with human coders) using a "bag of words" method — just vectors of the individual words contained in units of text. Topics — subject matter — can be labeled with similar reliability, perhaps less surprisingly. But that is very far from elucidating actual meaning.
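For readers unfamiliar with the term, a "bag of words" representation can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the function name `bag_of_words`, the simple regex tokenizer, and the toy utterances are all hypothetical, and none of it reproduces the coding scheme or classifier behind the kappa figures mentioned above.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z']+")  # assumed tokenizer: lowercase letter runs


def bag_of_words(text, vocabulary):
    """Return a count vector for `text` over a fixed `vocabulary`.

    Word order is discarded; only how often each word occurs is kept.
    """
    counts = Counter(TOKEN_PATTERN.findall(text.lower()))
    return [counts[word] for word in vocabulary]


# Hypothetical toy utterances, not data from any real coding study.
utterances = ["Where does it hurt?", "Take this twice a day."]

# The vocabulary is every distinct token seen, in sorted order.
vocabulary = sorted(
    {tok for u in utterances for tok in TOKEN_PATTERN.findall(u.lower())}
)

vectors = [bag_of_words(u, vocabulary) for u in utterances]

# Two sentences with opposite meanings get the same vector,
# because the representation ignores word order.
same = bag_of_words("dog bites man", ["bites", "dog", "man"]) == \
    bag_of_words("man bites dog", ["bites", "dog", "man"])
```

Vectors like these could be passed to any standard classifier to predict a label such as "interrogative" or "directive". The last comparison shows the limitation: "dog bites man" and "man bites dog" are indistinguishable in this representation, which is one way of seeing why reliable labeling falls short of capturing meaning.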
Topher Cooper said,
January 24, 2024 @ 9:53 am
Re: AntC
“I said it in Hebrew—I said it in Dutch—
I said it in German and Greek:
But I wholly forgot (and it vexes me much)
That English is what you speak!”
Also from "Hunting of the Snark" (Fit The Fourth — AntC's verse is repeated throughout).