Alan Turing's revenge?

Ilia Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget", 5/31/2023:

What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs.

This strikes me as related to a problem that Alan Turing and his colleagues solved in developing and applying the very first example of a "language model", namely the method used for rapid automatic evaluation of possible machine-generated decryptions of German Enigma messages in WW II.

You can find a detailed description of his method in some lecture notes I wrote a couple of decades ago, "Statistical estimation for Large Numbers of Rare Events". The general problem is a simple one: how to predict the likelihood of future event-types, based on their relative frequency in a sample of the past, where some (unknown) fraction of future event-types never occurred in your past sample. As I wrote:

It often happens that scientists, engineers and other biological organisms need to predict the relative probability of a large number of alternatives that don't individually occur very often. This is especially troublesome in cases where many of the things that happen have never happened before: where "rare events are common".

The simple "maximum likelihood" method for predicting the future from the past is to estimate the probability of an event-type that has occurred r times in N trials as r/N. This generally works well if r is fairly large (and if the world doesn't change too much). But as r gets smaller, the maximum likelihood estimate gets worse. And if r is zero, it may still be quite unwise to bet that the event-type in question will never occur in the future. Even more important, the fraction of future events whose past counts are zero may be substantial.

There are two problems here. One is that the r/N formula divides up all of the probability mass — all of our belief about the future — among the event-types that we happen to have seen. This doesn't leave anything for the unseen event-types (if there are any). How can we decide how much of our belief to reserve for the unknown? And how should we divide up this "belief tax" among the event-types that we've already seen?
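
A toy simulation makes the size of the problem concrete. The sketch below is mine (an arbitrary Zipf-like distribution, not anything from the lecture notes): it draws a "past" and a "future" sample from the same source, and measures how much of the future falls on event-types that the past never saw, i.e. types to which the r/N rule assigns probability zero.

    import random
    from collections import Counter

    random.seed(0)
    vocab = range(50_000)
    weights = [1 / (rank + 1) for rank in vocab]   # heavy Zipf-like tail

    past = Counter(random.choices(vocab, weights, k=10_000))
    future = random.choices(vocab, weights, k=10_000)

    # Maximum likelihood gives every unseen type r/N = 0/N = 0, yet a
    # substantial share of future events are exactly those types:
    unseen = sum(1 for w in future if w not in past) / len(future)
    print(f"fraction of future events with past count zero: {unseen:.1%}")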

Alan Turing (and colleagues) came up with a simple and general solution to this problem, which they could not publish in connection with its original context of use, since the Enigma decryption project was still secret at the time of Turing's death. But the method is general enough that I.J. Good could publish it as an application in mathematical ecology: "The Population Frequencies of Species and the Estimation of Population Parameters", Biometrika 40(3-4): 237-264, December 1953.
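
The recipe itself is short enough to state as code. Here is a minimal, unsmoothed sketch in Python (my paraphrase, not Good's 1953 formulation, which also smooths the count-of-count values N_r): the total probability reserved for unseen types is N1/N, where N1 is the number of types seen exactly once, and a type seen r times gets the adjusted count r* = (r+1)N_{r+1}/N_r, hence probability r*/N.

    from collections import Counter

    def good_turing(sample):
        N = len(sample)
        counts = Counter(sample)          # r: how often each type was seen
        Nr = Counter(counts.values())     # N_r: how many types were seen r times

        p_unseen = Nr[1] / N              # total belief reserved for unseen types

        def prob(word):
            # r* = (r + 1) * N_{r+1} / N_r, so the estimate is r*/N.
            # Returns 0 when N_{r+1} = 0, which is why practical versions
            # smooth the N_r sequence first (as Good's paper does).
            r = counts[word]
            return (r + 1) * Nr[r + 1] / (Nr[r] * N)

        return p_unseen, prob

    p0, prob = good_turing("to be or not to be that is the question".split())
    print(p0)           # 0.6: six of the ten tokens are singletons
    print(prob("or"))   # ~0.067, discounted from the maximum-likelihood 0.1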

Why is this relevant to "the curse of recursion"? Well, if you expand your training data by recursive application of a generative model trained on its own output, without accurate estimation and distribution of a Good-Turing "belief tax", progressive Model Collapse "where tails of the original content distribution disappear" seems inevitable. Do current LLM training methods do the taxation and redistribution adequately? Apparently not.
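
A miniature version of the recursion shows the mechanism (an illustrative sketch of the general idea, not Shumailov et al.'s experimental setup): fit a distribution to a sample by maximum likelihood, generate a new sample from the fit, refit, and repeat.

    import random
    from collections import Counter

    random.seed(0)
    vocab = list(range(10_000))
    weights = [1 / (rank + 1) for rank in vocab]   # the original content distribution

    for generation in range(10):
        sample = random.choices(vocab, weights, k=5_000)
        counts = Counter(sample)
        # Refit by maximum likelihood: types unseen in this sample get
        # zero probability, so they can never reappear later.
        vocab = list(counts)
        weights = [counts[w] for w in vocab]
        print(f"generation {generation}: {len(vocab)} event-types survive")

The printed counts can only fall: with no belief reserved for the unseen, a type taxed to zero in one generation is gone from every later one, and the tail goes first.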

For more on the Good-Turing method and its history, you can watch this 2015 lecture video by Alon Orlitsky.

See also "Good is dead" (5/29/2009) and "Counting hierarchical kinds" (8/24/2011).



14 Comments

  1. .mau. said,

    July 5, 2023 @ 3:28 pm

    For thirty years or so I've struggled to find that reference! Back at the end of the 80s I was working on (not so large) LMs, and a colleague of mine found an article which explained this technique. We dubbed the belief tax with the fake-German motto "Einmal ist keinmal" ("once is never"). I then left that field, but every now and then I searched for the article we used. Trouble is, I remembered Turing's name but not Good's… Thank you very much :-)

  2. GeorgeW said,

    July 5, 2023 @ 3:43 pm

    I have been wondering about this as well. As LLM-produced texts become a larger portion of the LLM input, what happens? How do the LLMs ensure a fresh supply of language generated by humans and exclude LLM-produced material? Can LLMs, like humans, coin new words? Create novel grammar?

  3. Mark Liberman said,

    July 5, 2023 @ 4:25 pm

    @GeorgeW: It's easy to create new words (compounds, derivations, blends, abbreviations, etc.) — we don't need systems as complex as today's LLMs to do that, though such capabilities could be plugged into them. Many kinds of "new grammar" are also easy, though choosing plausible innovations is presumably harder.

    What Shumailov et al. are warning about is not lack of innovation, but rather loss of diversity: the disappearance of the tails of the original content distribution.

  4. AntC said,

    July 5, 2023 @ 4:26 pm

    The codebreakers at Bletchley Park made it even more difficult for themselves. I listened to a lecture from Peter Hilton: top brass were worried that junior officers with low-level clearance might be spies (this bizarre suspicion arose because many who knew German were of Jewish background, as was Hilton); so the tentative translations were chopped up into short sections, each distributed to a different junior; the juniors had to guess whether they'd cracked the settings on the encoding machine (or merely tried a random number) from the plausibility that a short string of letters might be German.

    One day Hilton got 'RIECHENLA'. Other juniors thought this was just garbage/couldn't be German. Hilton guessed 'GRIECHENLAND': this was the order to invade Greece.

  5. Doctor Science said,

    July 5, 2023 @ 5:45 pm

    @AntC:
    this bizarre suspicion was because many who knew German were Jewish background, as was Hilton
    … I feel like I'm missing a step here. More likely to be spies for HITLER because they were JEWISH?!?! WTgold-platedF? Or more likely to be spies "you know, in general"? Maybe for the Soviets (Jews->Communists->USSR, would be the "logic" — which lets you know how the Cambridge 5 got away with so much, they were Our Kind, Dear)?

  6. AntC said,

    July 5, 2023 @ 6:14 pm

    @DrScience: "More likely to be spies for HITLER because they were JEWISH"

    Read as … because they weren't 'one of us'. A cuppla historical points:

    Mosley was all for rapprochement with Hitler: H was welcome to claim Europe with all those greasy wops, provided he didn't threaten Britain. M had many sympathisers in the ruling classes.

    The existence of the 'final solution' was not well-publicised by Churchill during the war, for the same fear of a reaction against British lives being laid down for greasy wops. And indeed it wasn't until after the war that the full extent of the 'solution' became clear.

    But yeah, what you said. A cynic might observe the Allies won the war only because they were slightly less bumblingly incompetent and prejudiced than the Germans.

  7. GeorgeW said,

    July 5, 2023 @ 6:18 pm

    @Mark Liberman: So, it is not an issue that portions (potentially substantial portions) of LLM input, at some point, are from previous LLM output? In time, the LLM would be recycling its own output and processing less and less novel input from humans.

  8. Mark Liberman said,

    July 5, 2023 @ 6:52 pm

    @GeorgeW: "So, it is not an issue that portions (potentially substantial portions) of LLM input, at some point, are from previous LLM output?"

    "Model Collapse" seems like a Bad Thing, to me…

  9. Haamu said,

    July 5, 2023 @ 10:38 pm

    The idea that "models start forgetting improbable events over time" seems to describe some aspects of human language evolution as well. It's worth asking why human language use doesn't appear to be subject to model collapse — or at least not catastrophic model collapse, as the human language model overall seems to be in a constant process of incremental collapse and regeneration. What exactly do LLMs lack in this regard? A vague notion of "creativity" is one (not very satisfying) suggestion; some form of embodied cognition is another. It seems like the concept of model collapse could be fruitful not just for improving how LLMs are trained but for focusing our thinking on what's really necessary for general intelligence.

  10. AntC said,

    July 6, 2023 @ 12:20 am

    "lack … some form of embodied cognition"

    Words and clauses denote things and situations, which is why humans use language as a _tool_, not as an end in itself [sorry, Chomsky].

    So if a bunch of words fail to denote; or denote something ludicrously unlikely; or denote something so trivially obvious no one would waste their breath pointing it out [**], humans immediately get suspicious. OTOH human cognition is notoriously liable to 'groupthink' and to jumping to conclusions that fit the expected norm. 'RIECHENLA' can't be a string in German.

    [**] Except EU bureaucrats/politicians.

  11. Yuval said,

    July 6, 2023 @ 5:46 am

    Whether training methods for LLMs take into account unseen sequences is an interesting question, but at least for generation this is part of what the "temperature" settings are meant to handle.
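
    For concreteness, temperature in a generic softmax sampler works roughly like this (a minimal sketch, not any particular LLM's implementation): the logits are divided by T before normalizing, so T > 1 flattens the distribution and shifts mass toward the tail, while T < 1 sharpens it.

        import math

        def sample_probs(logits, temperature=1.0):
            # Softmax over logits / T; subtract the max for numerical stability.
            scaled = [z / temperature for z in logits]
            m = max(scaled)
            exps = [math.exp(z - m) for z in scaled]
            total = sum(exps)
            return [e / total for e in exps]

        logits = [4.0, 2.0, 0.0]   # one head token, two tail tokens
        for T in (0.5, 1.0, 2.0):
            print(T, [round(p, 3) for p in sample_probs(logits, T)])

    With these toy logits, the rarest token's probability grows from about 0.0003 at T = 0.5 to about 0.09 at T = 2, so temperature does push mass toward the tail, though it only reweights the distribution rather than estimating how much belief the unseen tail deserves.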

  12. Peter Grubtal said,

    July 6, 2023 @ 6:45 am

    AntC: this isn't the place to deal with your political prejudices, but since your "historical points" are coloured by them:
    An explanation offered not long after WW2 was that the reputation of British information policy was so tarnished after it became clear in the twenties that in WW1 they had sometimes peddled stuff that wasn't correct. It was felt then that publicity about the final solution would be dismissed as British propaganda. In Ireland, this was the case even after the facts became incontrovertible.

    And although some reports were reaching the allies, there would have been hesitation about making public these allegations, astounding at the time, until they could be proven up to the hilt.
    Roosevelt had just as much information as Churchill but also didn't go public.

  13. AntC said,

    July 6, 2023 @ 8:56 pm

    Thanks @PeterG, yes this is LLog not WWII Alternative History Log, so I'll be brief.

    "some reports were reaching the allies"? "Some", huh? "Kristallnacht sparked international outrage," says wp.

    It was a political judgment whether to 'peddle' what was known; I wasn't there; I'm not going to criticise whether revealing it would have encouraged or discouraged the war effort. Your lack-of-proof claim sounds like post-hoc justification.

    Would revealing it have brought America into the war earlier? The blocking of the 1939 Wagner-Rogers Bill (U.S. equivalent to Kindertransport, supported by Roosevelt N.B.) by a South Carolina Senator tells us, I think.

    Plenty of civilians were worried enough about the treatment of Jewry to risk their lives in rescues — which must have been based on secure knowledge: not only the Kindertransport, but also the rescue of the Danish Jews in 1943 (see wp).

    In the international/German-dominated/Jewish-intellectual-dominated Mathematics community: Peter Hilton was there, see my report above; Hilbert (German, not Jewish) said (wp) the Nazi purges destroyed the Mathematics Institute at Göttingen; Turing's war contribution didn't rescue him from prejudices against 'otherness' in 1953/54.

    And I think Dr.Science's take above of the much different attitude to the Cambridge 5 is spot-on. (Evidently Guy Francis de Moncy Burgess, an Eton boy, was more adept at concealing that specific 'otherness'.)

  14. Francis Boyle said,

    July 8, 2023 @ 2:52 am

    @Haamu

    Wouldn't the reason that this doesn't happen with human-generated language simply be the fact that, as noted in the OP, "rare events are common", and therefore humans have a need to refer to them?
