Attached is a locally autocorrelated map based on the percent of um vs uh (i.e. um/(um+uh)) in a few billion words of geocoded tweets from 2013 (about 40,000 tokens each). Red are areas where "uh" is relatively more common and blue are areas where "um" is more common. Quite a clear pattern, and probably the clearest Midland (only?) lexical pattern I've ever found.
So there's significant geographical variation, as well as variation by sex, age, years of education, autism diagnosis, …
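The measure in Jack's email, um/(um+uh), is easy to make concrete. What follows is only a minimal sketch, not Jack's actual pipeline: the function name `um_share`, the region labels, and the sample tweets are all hypothetical, the tokenization is deliberately crude, and the local autocorrelation step behind the map is not attempted.

```python
from collections import Counter

def um_share(tweets_by_region):
    """Return um/(um+uh) per region; regions with neither word are skipped."""
    shares = {}
    for region, tweets in tweets_by_region.items():
        counts = Counter()
        for text in tweets:
            for token in text.lower().split():
                # Crude cleanup: strip common punctuation from token edges.
                word = token.strip(".,!?\"'")
                if word in ("um", "uh"):
                    counts[word] += 1
        total = counts["um"] + counts["uh"]
        if total:  # avoid dividing by zero where neither word occurs
            shares[region] = counts["um"] / total
    return shares

# Hypothetical toy data, for illustration only.
sample = {
    "region_a": ["Um, yeah, go Golden Dragons", "uh what", "um ok"],
    "region_b": ["uh, no", "uh huh", "well um"],
    "region_c": ["nothing here"],
}
shares = um_share(sample)
```

A share above 0.5 means "um" outnumbers "uh" in that region. With real data, regions with very few tokens would give noisy shares, which is presumably part of why a smoothed (locally autocorrelated) map is used rather than raw proportions.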
[Update — Jack recently wrote:
The map/legend is right though. It's my email explanation that is backwards. Um region in red, Uh region in blue. […]
The maps could be improved too. It's only based on about 1/3 of the corpus there, and there is also various noisy data in there that we need to strip out (e.g. blogs, retweets, Spanish). But overall the basic map is right. I'll get you some cleaned up data when I have a chance.
(One caveat among many: only a few of the functions of UM and UH are likely to be used on Twitter, I imagine.)
For some background, see:
(Jack's result may help to explain the differences that I noted between the Switchboard and Fisher datasets, since the subjects for Switchboard were recruited at TI in Dallas.)
Meanwhile, Joe Fruehwald has some evidence from the Philadelphia Neighborhood Corpus about real time vs. apparent time changes in these variables. More on all this later…
Update: Perhaps all of these patterns are just social amplification of random fluctuations in cultural signifiers. In fact, at some level that has to be true. But it's also possible that what's really happening involves variation at the level of different conversational functions of what we transcribe as UM and UH, and the meaningful variation might be that some people tend to vocalize certain conversational functions differently, or that some people tend to perform certain conversational functions more often.
Some of this is obvious: we know that older people are more likely to take longer to find a word than younger people, and so the tendency for older people to use UH more often than younger people might be for this reason.
But some (of the many) possible stories of this type are less obvious. For example, there's a usage that we might call the "Awkward UM", exemplified by Alice's response "Um, yeah, go Golden Dragons" in the first panel of the 8/12/2014 Dumbing of Age strip:
[Alice and Billie were high-school cheerleaders together; Alice has moved on in college in a way that Billie apparently hasn't.]
So maybe Midland people are on average less likely to signal interpersonal awkwardness, or at least to signal it with phrase-initial UM. This is probably not true; but it's a plausible example of the kind of thing that might be behind the complex demographics of these simple words. [I need to get the explanation of the colors correct, though…]