Date: 2022-08-29

The mantra of machine learning, as Fred Jelinek used to say, is "The best data is more data" — because in many areas, there's a Long Tail of relevant cases that are hard to analyze without either a valid theory or enough examples.

But a recent meta-analysis of machine-learning work in digital medicine shows, convincingly, that more data can lead to poorer reported performance. The paper is Visar Berisha et al., "Digital medicine and the curse of dimensionality", npj Digital Medicine (2021), and one of the pieces of evidence they present is shown in the figure reproduced below:



This analysis considers two types of models: (1) speech-based models for classifying between a control group and patients with a diagnosis of Alzheimer’s disease (Con vs. AD; blue plot) and (2) speech-based models for classifying between a control group and patients with other forms of cognitive impairment (Con vs. CI; red plot).

This effect is basically a form of "Graduate Student Descent", a semi-pun on the optimization method of "gradient descent". The description in this 2020 article offers a flow chart for this method:

The "random_changes" are not, of course, always random. Changes in the machine-learning architecture, or additional "features" derived from the available data, may well be based on plausible hypotheses. But such explorations necessarily lead to some amount of over-fitting. And smaller datasets lead to more over-fitting, and therefore to better estimated performance and worse generalization to new observations.

A maximally simple simulation illustrates this effect. We create 50 random features for each of a set of N "subjects", and look at the performance of plain old linear discriminant analysis, as N varies from 50 to 800:



(R code is here.)
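For readers who prefer Python, here is a rough analogue of that simulation. It is a minimal sketch and not a port of the R code: the function names, the alternating class labels, the pseudo-inverse used when the covariance matrix is singular, and the number of replications are all assumptions of mine. It fits a two-class linear discriminant to pure noise and scores it on the same data it was trained on, so any accuracy above 50% is over-fitting.

```python
import numpy as np


def lda_training_accuracy(n_subjects, n_features=50, rng=None):
    """Fit two-class LDA to pure noise and score it on the same data."""
    rng = np.random.default_rng() if rng is None else rng
    # Random features unrelated to the labels: true accuracy is 50%.
    X = rng.standard_normal((n_subjects, n_features))
    y = np.arange(n_subjects) % 2  # alternating, balanced class labels

    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Pooled within-class covariance.
    centered = X - np.where(y[:, None] == 0, mu0, mu1)
    pooled_cov = centered.T @ centered / (n_subjects - 2)
    # Pseudo-inverse (an assumption of this sketch), because the covariance
    # is singular when n_subjects is close to n_features.
    w = np.linalg.pinv(pooled_cov, rcond=1e-10) @ (mu1 - mu0)
    threshold = w @ (mu0 + mu1) / 2
    predicted = (X @ w > threshold).astype(int)
    return float((predicted == y).mean())


def mean_training_accuracy(n_subjects, n_reps=20, seed=0):
    """Average the training accuracy over several random datasets."""
    rng = np.random.default_rng(seed)
    runs = [lda_training_accuracy(n_subjects, rng=rng) for _ in range(n_reps)]
    return float(np.mean(runs))


if __name__ == "__main__":
    for n in (50, 100, 200, 400, 800):
        print(n, round(mean_training_accuracy(n), 3))
```

If the sketch behaves as intended, the printed training accuracy should be close to perfect at N = 50 and drift down toward chance as N grows, even though the features carry no information at all. Passing a larger `n_features` to `lda_training_accuracy` is one way to explore the effect of adding features for a fixed number of subjects.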

More realistic simulations would involve leave-one-out cross validation (or some other train/test division), various more sophisticated ML methods, etc. But the same basic pattern emerges — it's easier to fool yourself (and others) when your system is classifying a smaller number of observations.

Adding more random "features" for a given number of subjects generally yields a similar effect.
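The train/test variant mentioned above can be sketched in the same spirit. This is a hypothetical illustration, not the author's method: random feature subsets stand in for the "random_changes" of Graduate Student Descent, the 50/50 split and the names `lda_fit` and `best_heldout_accuracy` are assumptions, and the "reported" number is simply the best held-out accuracy found across all attempts. Because the held-out half is reused to choose among attempts, it stops being an honest test set.

```python
import numpy as np


def lda_fit(X, y):
    """Return the weight vector and threshold of a two-class LDA."""
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 0, mu0, mu1)
    pooled_cov = centered.T @ centered / (len(y) - 2)
    w = np.linalg.solve(pooled_cov, mu1 - mu0)
    return w, w @ (mu0 + mu1) / 2


def best_heldout_accuracy(n_subjects, n_features=50, subset_size=5,
                          n_tries=200, rng=None):
    """Try many feature subsets; report the best held-out accuracy."""
    rng = np.random.default_rng() if rng is None else rng
    # Pure noise again: no feature subset can truly beat 50%.
    X = rng.standard_normal((n_subjects, n_features))
    y = np.arange(n_subjects) % 2
    half = n_subjects // 2
    X_train, y_train = X[:half], y[:half]
    X_test, y_test = X[half:], y[half:]

    best = 0.0
    for _ in range(n_tries):
        # One "random change": pick a different subset of features.
        cols = rng.choice(n_features, size=subset_size, replace=False)
        w, threshold = lda_fit(X_train[:, cols], y_train)
        predicted = (X_test[:, cols] @ w > threshold).astype(int)
        best = max(best, float((predicted == y_test).mean()))
    return best


def mean_best_heldout_accuracy(n_subjects, n_reps=5, seed=0):
    """Average the best-of-many held-out accuracy over several datasets."""
    rng = np.random.default_rng(seed)
    runs = [best_heldout_accuracy(n_subjects, rng=rng) for _ in range(n_reps)]
    return float(np.mean(runs))


if __name__ == "__main__":
    for n in (50, 200, 800):
        print(n, round(mean_best_heldout_accuracy(n), 3))
```

Under these assumptions, the best held-out score should sit well above chance for the smallest N and much closer to 50% for the largest, since a small test set gives each attempt a noisier score and the maximum over many noisy scores is biased upward.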

How can we avoid this? In the foundational work on Human Language Technology, roughly from 1985 to 2010, the practice was to keep "Evaluation Test" data secret from researchers until final testing time; use it only once (at least ideally); and change the dataset and the task details regularly. And even more important, the dataset sizes were large enough that the simplest kinds of over-fitting were avoided.

A key problem with clinical applications is that the dataset sizes are typically tiny. This is due to the culture of the field, which uses genuine concerns about privacy and confidentiality as cover for more selfish motives in guarding each group's proprietary data. Small datasets make it very hard to avoid the kind of over-fitting illustrated above. Another consequence is that each dataset tends to have its own set of locally-specific uncontrolled covariates.

So in conclusion, it's still true that more data is better — and the fact that getting data from more subjects can lead to lower estimated performance is actually evidence for the value of getting data from more subjects.

