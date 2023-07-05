

Ilia Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget", 5/31/2023:

What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs.

This strikes me as related to a problem that Alan Turing and his colleagues solved in developing and applying the very first example of a "language model", namely the method used for rapid automatic evaluation of possible machine-generated decryptions of German Enigma messages in WW II.

You can find a detailed description of his method in some lecture notes I wrote a couple of decades ago, "Statistical estimation for Large Numbers of Rare Events". The general problem is a simple one: how to predict the likelihood of future event-types, based on their relative frequency in a sample of the past, where some (unknown) fraction of future event-types never occurred in your past sample. As I wrote:

It often happens that scientists, engineers and other biological organisms need to predict the relative probability of a large number of alternatives that don't individually occur very often. This is especially troublesome in cases where many of the things that happen have never happened before: where "rare events are common".

The simple "maximum likelihood" method for predicting the future from the past is to estimate the probability of an event-type that has occurred r times in N trials as r/N. This generally works well if r is fairly large (and if the world doesn't change too much). But as r gets smaller, the maximum likelihood estimate gets worse. And if r is zero, it may still be quite unwise to bet that the event-type in question will never occur in the future. Even more important, the fraction of future events whose past counts are zero may be substantial.

There are two problems here. One is that the r/N formula divides up all of the probability mass — all of our belief about the future — among the event-types that we happen to have seen. This doesn't leave anything for the unseen event-types (if there are any). How can we decide how much of our belief to reserve for the unknown? And how should we divide up this "belief tax" among the event-types that we've already seen?
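To make the first problem concrete, here is a minimal Python sketch of the r/N estimate. The function name `mle_probabilities` and the toy word sample are illustrative assumptions, not taken from the lecture notes.

```python
from collections import Counter


def mle_probabilities(sample):
    """Maximum likelihood estimate: an event-type seen r times in N trials gets r/N."""
    counts = Counter(sample)
    n = len(sample)
    return {event: r / n for event, r in counts.items()}


# Toy sample (an assumption for illustration): N = 7 trials, 4 event-types.
sample = ["the", "the", "the", "of", "of", "cat", "dog"]
probs = mle_probabilities(sample)

# All of the probability mass goes to the event-types we happened to see ...
print(sum(probs.values()))
# ... so any event-type with r = 0 is implicitly assigned probability zero.
print(probs.get("aardvark", 0.0))
```

The estimates for the seen event-types sum to 1 (up to floating-point rounding), which is exactly why nothing is left over for "aardvark" or any other unseen type.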

Alan Turing (and colleagues) came up with a simple and general solution to this problem — which they were not able to publish in relation to its context of use, since the Enigma decryption project was still secret at the time of Turing's death. But the method is general enough that I.J. Good could publish it as an application in mathematical ecology, "The Population Frequencies of Species and the Estimation of Population Parameters", Biometrika 40(3-4) 237-264, December 1953.
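In rough outline, the Good-Turing estimate works from the "frequencies of frequencies": if N_r is the number of event-types seen exactly r times, then the probability reserved for unseen event-types is about N_1/N, and a type seen r times is treated as if it had been seen r* = (r+1) N_(r+1) / N_r times. The sketch below is a bare-bones illustration of those two formulas. The function name `good_turing`, the toy sample, and the fallback to the raw count when N_(r+1) is zero are my own simplifications; practical versions smooth the N_r values first and renormalize.

```python
from collections import Counter


def good_turing(sample):
    """Unsmoothed Good-Turing sketch.

    Returns (unseen_mass, adjusted_counts), where unseen_mass = N_1 / N and
    adjusted_counts[event] = (r + 1) * N_{r+1} / N_r.
    """
    counts = Counter(sample)
    n = len(sample)
    # freq_of_freq[r] = N_r, the number of event-types seen exactly r times.
    freq_of_freq = Counter(counts.values())

    # The "belief tax": probability mass reserved for never-seen event-types.
    unseen_mass = freq_of_freq[1] / n

    adjusted_counts = {}
    for event, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq[r + 1]  # Counter returns 0 for missing keys
        if n_r_plus_1 > 0:
            adjusted_counts[event] = (r + 1) * n_r_plus_1 / n_r
        else:
            # Simplifying assumption: keep the raw count when N_{r+1} = 0.
            adjusted_counts[event] = float(r)
    return unseen_mass, adjusted_counts


# Toy sample (an assumption): N = 8, with N_1 = 3, N_2 = 1, N_3 = 1.
sample = ["the"] * 3 + ["of"] * 2 + ["cat", "dog", "emu"]
unseen_mass, adjusted = good_turing(sample)
print(unseen_mass)      # N_1 / N = 3 / 8
print(adjusted["cat"])  # r = 1: 2 * N_2 / N_1 = 2 / 3
```

With so small a sample the raw adjusted counts for larger r are noisy, which is why real applications smooth the N_r before using them.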

Why is this relevant to "the curse of recursion"? Well, if you expand your training data by recursive application of a generative model trained on its own output, without accurate estimation and distribution of a Good-Turing "belief tax", progressive Model Collapse "where tails of the original content distribution disappear" seems inevitable. Do current LLM training methods do the taxation and redistribution adequately? Apparently not.
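A toy simulation illustrates the mechanism. This is not the experiment from the Shumailov et al. paper and involves no LLM: it simply refits r/N estimates to samples drawn from the previous generation's estimates, with no mass reserved for unseen types. The function name `resample_generations`, the Zipf-like weights, and all of the sizes are assumptions chosen for illustration.

```python
import random
from collections import Counter


def resample_generations(population, weights, sample_size, generations, seed=0):
    """Repeatedly sample from the current model, then refit by raw counts.

    Returns the number of distinct event-types surviving in each generation.
    """
    rng = random.Random(seed)
    support_sizes = []
    for _ in range(generations):
        sample = rng.choices(population, weights=weights, k=sample_size)
        counts = Counter(sample)
        # Refit with r/N (up to normalization): unseen types get weight zero
        # and can never be generated again.
        population = list(counts.keys())
        weights = [counts[event] for event in population]
        support_sizes.append(len(population))
    return support_sizes


# A Zipf-like "true" distribution over 1000 event-types (an assumption).
vocab = list(range(1000))
zipf_weights = [1 / (rank + 1) for rank in vocab]
sizes = resample_generations(vocab, zipf_weights, sample_size=2000, generations=10)
print(sizes)
```

Because each generation can only draw from the types that survived the previous one, the count of surviving types can never increase, and the rare types in the tail are the first to vanish.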

For more on the Good-Turing method and its history, you can watch this 2015 lecture video by Alon Orlitsky.

See also "Good is dead" (5/29/2009) and "Counting hierarchical kinds" (8/24/2011).

