Back to Bacon


The implicit slogan of language-model research is J.R. Firth's dictum, "You shall know a word by the company it keeps", from his 1957 paper "A synopsis of linguistic theory, 1930-1955":

As the Wikipedia article explains,

His theory that "you shall know a word by the company it keeps" / "a word is characterized by the company it keeps" inspired works on word embedding hence add [sic] a major impact in natural language processing. Many techniques were designed to build dense vectors representing words semantics based on their neighbors (e.g. Word2vec, GloVe).

Firth's 1957 paragraph footnotes Wittgenstein's Philosophical Investigations, but the cited passages deal with more general questions about the nature of meaning, based on analogies to games and so on. The phrase "you shall know a word by the company it keeps" seems more strikingly reminiscent of the old legal maxim "noscitur a sociis". Thus from Broom's 1845 Legal Maxims:

That's Sir Francis Bacon, the father of empiricism…

The same idea has been taken up many times since, e.g. in Maxwell's 1875 On the Interpretation of Statutes: "When two or more words, susceptible of analogous meaning, are coupled together, noscuntur a sociis; they are understood to be used in their cognate sense. They take, as it were, their colour from each other."
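As a rough illustration of the "company it keeps" idea behind the embedding techniques mentioned in the Wikipedia passage, here is a minimal sketch that represents each word by counts of its neighbors and compares words by cosine similarity. The toy corpus, the window size, and the function names are all assumptions made up for this example; Word2vec and GloVe learn dense vectors by quite different means, and this shows only the raw co-occurrence intuition.

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]

vectors = cooccurrence_vectors(toy_corpus)
print("cat ~ dog :", cosine(vectors["cat"], vectors["dog"]))
print("cat ~ milk:", cosine(vectors["cat"], vectors["milk"]))
```

In this toy corpus "cat" and "dog" keep nearly the same company ("the", "drinks", "chases"), so they come out more similar to each other than "cat" is to "milk", even though "cat" and "milk" occur in the same sentence.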



  1. AntC said,

    January 24, 2024 @ 8:23 am

    “They sought it with thimbles, they sought it with care; / They pursued it with forks and hope.” (Lewis Carroll, “The Hunting of the Snark”)

  2. Cervantes said,

    January 24, 2024 @ 9:43 am

    It is possible to label categories of speech acts (e.g. interrogative, expressive, representative, directive) quite reliably (kappas above .8 agreement with human coders) using a "bag of words" method — just vectors of the individual words contained in units of text. Topics — subject matter — can be labeled with similar reliability, perhaps less surprisingly. But that is very far from elucidating actual meaning.

  3. Topher Cooper said,

    January 24, 2024 @ 9:53 am

    Re: AntC

    “I said it in Hebrew—I said it in Dutch—
    I said it in German and Greek:
    But I wholly forgot (and it vexes me much)
    That English is what you speak!”

    Also from "Hunting of the Snark" (Fit The Fourth — AntC's verse is repeated throughout).

  4. Philip Taylor said,

    January 24, 2024 @ 5:02 pm

    The last line of "I said it in Hebrew" as quoted above jars horribly for me — the scansion of the last line seems completely wrong, yet the only version I can locate online that lacks this flaw is at, where it reads:

    I said it in Hebrew – I said it in Dutch –
    I said it in German and Greek:
    But I wholly forgot (and it vexes me much)
    That English is what you must speak!

    I shall have to see if I have an early copy of Snark in my library …
    That version, for me, reads far, far better.

  5. Philip Taylor said,

    January 24, 2024 @ 5:03 pm

    Sorry, last two lines of the above transposed for reasons I wot not.

  6. AntC said,

    January 24, 2024 @ 8:35 pm

    Thank you Snarkologists all. But my point was _contra_ myl that with a zeugma (follow the link) sometimes

    a single word is used with two other parts of a sentence but must be understood differently in relation to each.

    Now an interesting Linguistical question: a zeugma has to be understood as deliberate to work (it's a form of pun). How does a competent speaker recognise it as such? Rather than the usual "analogous meaning" myl is drawing attention to.

    (I can see, from the replies here, my humorous example didn't get so recognized.)

    (@PT that last line is as given in Martin Gardner's 'Annotated Snark'. The scansion works for me by pronouncing Germ'n as a single syllable.)

  7. Topher Cooper said,

    January 25, 2024 @ 2:33 am

    The text I copied appears in the first published edition from 1876. It appears thus in a copy in Google Books containing a stamp from The British Museum indicating that they processed the copy in July of that year (where Wikipedia lists it as having been actually published 3 months earlier). So any version earlier could only have been from a prepublication manuscript.

    On the other hand, although I am not bothered by the scansion of the version I quote, I do think that the one you quote works at least as well to my ear. THOTS, though, is not very consistent in its meter, so it would seem that that was not of prime concern to Carroll.

  8. Jason M said,

    January 27, 2024 @ 12:45 pm

    What is the issue with the scansion? It seems a typical four-stress English nursery-rhyme meter, where the total number of syllables, or of syllables intervening between the stressed ones, is inconstant and it’s all about prominent vowel stress. For example:

    Humpty Dumpty had a great fall (stress on Hump Dump had fall)
    All the king’s horses and all the king’s men (stress on king horses king men).

    The nursery rhyme meter came out of Old English, right (e.g., Beowulf)?

    So for Lewis Carroll:

    said Heb said Dutch
    said Ger man Greek
    whol got vex much
    Eng what you speak

    The last line can also just be a 3-stress closer, where “you” is unstressed, as I think the closing line can be one stress shorter than the rest of the verse. I’m no expert but did find a paper when I was trying to refresh my memory on this:

  9. Rodger Cunningham said,

    January 27, 2024 @ 1:21 pm

    Surely the stress pattern is 4, 3, 4, 3?

  10. Jason M said,

    January 27, 2024 @ 2:06 pm

    @Rodger Cunningham 4 3 4 3 works most naturally for sure though, of course, forcing slightly unnatural emphases can be a poetic conceit, too. Anyway, the point is the meter is not iambs or trochees or anapests or dactyls, just good rhymed old English stress meter, like a nursery rhyme.

  11. Michael Watts said,

    January 28, 2024 @ 6:14 am

    Checking over the beginning of the poem, I don't get the sense that it's based on stress beats with more or less free variation in how many weak syllables might appear between strong ones. I get the sense that it's supposed to be written in regular anapests, but the construction is clumsy.

    The final line "what I tell you three times is true" seems especially off to me, with its four strong syllables in the pattern ..-.--.-

  12. Michael Watts said,

    January 28, 2024 @ 6:26 am

    On the other hand, I should note that the final line that bothered me so much:

    what I tell you three times is true

    feels to me like it scans identically to the line that bothered Philip Taylor (and me):

    that English is what you speak

    (where "you" is ambiguously strong or weak)

    And for these lines, that feeling occurs because I perceive an early beat (tell / Eng) that is followed at a longer remove than usual by a run of (almost) uninterrupted beats, with "what you speak", all strong, matching "three times (is) true", with "is" weak and the other three strong. This analysis definitely relies on free variation in the number of weak syllables between strong ones, with weak "what I" matching weak "that", weak "you" matching weak "lish is", and weak "is" matching nothing.

    Even so, the lines that don't jump out as not belonging in the poem are in a very regular anapestic layout with various glitches scattered through them. The only variation in those anapests over the first five stanzas is that the first foot of any line may omit one or both weak syllables, and the last foot of a four-foot line may omit one, but not both, weak syllables… and that the penultimate foot of the first line omits one weak syllable, but this does not reoccur and my ear flags the line as incorrectly constructed.

  13. Michael Watts said,

    January 28, 2024 @ 6:33 am

    (For completeness, what I mean by "glitches": two lines I already mentioned don't seem to scan correctly to me. During scansion, I also marked a number of syllables in other lines where my sense of natural weight of the syllable conflicted with the rhythm required by the poem. This is clumsy. But in all of those cases, the number of syllables is compatible with the analysis I gave above, where feet except for (often) the first foot, and (much less often) the last foot of a long line, are required to be formal anapests. It's just that the rhythm is off.)

  14. Jason M said,

    January 28, 2024 @ 6:42 pm

    Fun following Carroll’s footsteps, as it were, and wondering how he wanted us to stress the “you” of which he spoke about speaking. If I had some time, (and pigs had wings) I’d go back through other Carroll verse as some context or control group for where these current lines in question fit.

    Did he usually tend to anapesting, or was he more of a Humpty-Dumpty-er?

  15. JPL said,

    January 29, 2024 @ 1:59 am

    Well, thanks for putting the Firth text in front of us; it reminds us that we shouldn't just forget the old texts, but read them from our present understanding. (I had never read any writings by Firth himself.) One of Wittgenstein's overarching concerns in Philosophical Investigations is the idea that our (and in particular Philosophy's) customary ways of talking about language get in the way of our gaining an understanding of language as a human phenomenon, and that an analysis and clarification and critique of this referential language is necessary in order to make any progress. Reading Firth's account of his approach to the description of language, I was struck by his constant concern that the customary ways linguists talk about language are preventing or holding back progress in understanding the significance of language and how it works in the life of human societies. In fact, one could say that Firth seems to have been inspired by Wittgenstein in the same way as we could say that Chomsky seems to have been inspired by Carnap, and that this accounts for their contrasting views about what an adequate understanding of language requires. Chomsky takes a model theoretic approach to the relation between formal language and metalanguage (Carnap's "syntax language"), while Firth takes the speech situation, the act of language use in context, as fundamental, and aims to describe and analyze it directly, referring to the same objects as the "empirical" linguists of the day, but using different terms from a different point of view, and eschewing the "modelling" approach. From my present point of view, I was struck by the mixture of outdated understanding and neglected or forgotten valuable insights, but I found some of the inverted viewpoints useful. E.g., while the engineers may have latched on to his notion of 'collocation', a pure inquirer might find his terms 'colligation' and 'exponence' more interesting (p. 13-17). 
    'Colligation' seems to describe a categorical syntactic structure that could be called a "schema", to use Weyl's term, which contributes semantic distinctions to the sentence over and above the contributions of the lexicon and other elements of the repertoire, and whose order relation is not serial or sequential, but that of logical inclusion or hierarchy. The schema, minus the semantics, is relatable to what Chomsky was modelling with his phrase structure rules. The term 'exponence/exponent' is relatable to the more familiar linguistic notions of 'marking' or 'signification', but allows an interesting foregrounding of the functional relation of effective expression. This little summary is inadequate, I'm really tired and I just started typing in the little box to ward off sleepiness, not intending to post anything, but I'm going to continue to try to "renew my connection in experience" with this text, so thanks for that.

    BTW, wrt the Baconian notion of collocation, recall the "cloze test" that was (and still is?) used in language proficiency testing.
