UM / UH geography

From Jack Grieve, a few minutes after we discussed this issue at the 10.30 coffee break here at Methods in Dialectology XV in Groningen:

Attached is a locally autocorrelated map based on the percent of um vs uh (i.e. um/(um+uh)) in a few billion words of geocoded tweets of 2013 (about 40,000 tokens each). Red are areas where "uh" is relatively more common and blue are areas where "um" is more common. Quite a clear pattern, and probably the clearest Midland (only?) lexical pattern I've ever found.

So there's significant geographical variation, as well as variation by sex, age, years of education, autism diagnosis, …

[Update — Jack recently wrote:

The map/legend is right though. It's my email explanation that is backwards. Um region in red, Uh region in blue. […]

The maps could be improved too. It's only based on about 1/3rd of the corpus there and there is also various noisy data in there that we need to strip out (e.g. blogs, retweets, Spanish). But overall the basic map is right. I'll get you some cleaned up data when I have a chance.]


(One caveat among many: only a few of the functions of UM and UH are likely to be used on Twitter, I imagine.)
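
To make "locally autocorrelated" a little more concrete, here is a minimal sketch of one local statistic, the Getis-Ord Gi* z-score, applied to um/(um+uh) shares. The post does not say which statistic Jack actually used, so this choice is an assumption, as are the 10 x 10 grid of hypothetical "counties", the edge-sharing neighbor rule, the 0/1 weights, and the made-up token counts. It is an illustration of the general idea, not a reconstruction of his method.

```python
import math
import random


def um_share(um, uh):
    """Proportion of um among all um + uh tokens, i.e. um/(um+uh)."""
    return um / (um + uh)


def grid_neighbors(rows, cols):
    """Hypothetical geography: each cell neighbors itself and edge-sharing cells."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            cells = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            nbrs[(r, c)] = [(a, b) for a, b in cells
                            if 0 <= a < rows and 0 <= b < cols]
    return nbrs


def getis_ord_gi_star(values, nbrs):
    """Gi* z-score per cell, with weight 1 for listed neighbors and 0 otherwise."""
    n = len(values)
    mean = sum(values.values()) / n
    s = math.sqrt(sum(v * v for v in values.values()) / n - mean * mean)
    scores = {}
    for cell, cells in nbrs.items():
        w_sum = len(cells)  # sum of weights (and of squared weights) for 0/1 weights
        local = sum(values[j] for j in cells)
        denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores[cell] = (local - mean * w_sum) / denom
    return scores


rows, cols = 10, 10
nbrs = grid_neighbors(rows, cols)

# Made-up random counts: no real geographic structure.
rng = random.Random(0)
noise = {cell: um_share(rng.randint(1, 30), rng.randint(1, 30)) for cell in nbrs}

# Made-up regional split: left half um-heavy, right half uh-heavy.
pattern = {(r, c): (0.7 if c < cols // 2 else 0.3) for r, c in nbrs}

z_noise = getis_ord_gi_star(noise, nbrs)
z_pattern = getis_ord_gi_star(pattern, nbrs)

print("largest |z|, random grid:   ", max(abs(z) for z in z_noise.values()))
print("largest |z|, patterned grid:", max(abs(z) for z in z_pattern.values()))
```

On the patterned grid, cells well inside the um half score about +2.28 and cells well inside the uh half about -2.28, while cells along the boundary are pulled toward zero. Rerunning the random grid with different seeds gives a rough sense of how large the scores can get by chance alone.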

For some background, see:

(Jack's result may help to explain the differences that I noted between the Switchboard and Fisher datasets, since the subjects for Switchboard were recruited at TI in Dallas.)

Meanwhile, Joe Fruehwald has some evidence from the Philadelphia Neighborhood Corpus about real time vs. apparent time changes in these variables. More on all this later…

Update: Perhaps all of these patterns are just social amplification of random fluctuations in cultural signifiers. In fact, at some level that has to be true. But it's also possible that what's really happening involves variation at the level of different conversational functions of what we transcribe as UM and UH, and the meaningful variation might be that some people tend to vocalize certain conversational functions differently, or that some people tend to perform certain conversational functions more often.

Some of this is obvious: we know that older people are more likely to take longer to find a word than younger people, and so the tendency for older people to use UH more often than younger people might be for this reason.

But some (of the many) possible stories of this type are less obvious. For example, there's a usage that we might call the "Awkward UM", exemplified by Alice's response "Um, yeah, go Golden Dragons" in the first panel of the 8/12/2014 Dumbing of Age strip:

[Alice and Billie were high-school cheerleaders together; Alice has moved on in college in a way that Billie apparently hasn't.]

So maybe Midland people are on average less likely to signal interpersonal awkwardness, or at least to signal it with phrase-initial UM. This is probably not true; but it's a plausible example of the kind of thing that might be behind the complex demographics of these simple words. [I need to get the explanation of the colors correct, though…]


  1. Yerushalmi said,

    August 13, 2014 @ 5:45 am

    I'm just waiting for someone to compile all of these indicators into one analysis that concludes that the one person in the US most likely to use "uh" is (name) of (hometown). Alert the media so they can camp on his doorstep waiting for an interview, and display in the bottom right corner of the screen during the interview a running counter of the number of times "uh" is said.

  2. Mike said,

    August 13, 2014 @ 6:22 am

    This comment system needs a "Like" button (for Yerushalmi's comment).

  3. leoboiko said,

    August 13, 2014 @ 6:52 am

    Well there's finally a Unicode character for "thumbs up" now:

  4. Rodger C said,

    August 13, 2014 @ 6:54 am

    This illuminates John Boehner's (I think) remark that the Republican Party has a national image problem because it's dominated by Southerners who say "Uh, uh, uh."

  5. leoboiko said,

    August 13, 2014 @ 7:57 am

    ahaha apparently the Unicode thumbs up character (U+1F44D) not only was eaten by the commenting system, but also caused the rest of the comment to disappear! be careful with that thumb.

  6. Brett said,

    August 13, 2014 @ 8:53 am

    I'm unclear on what "local spatial autocorrelation" means on that plot, and why any kind of spatial autocorrelation would be useful for measuring rates of word usage.

    [(myl) It's basically a kind of spatial smoothing — see here for some details. Jack later on sent me the raw-data map before smoothing:

    (The truth about the legend seems to be that blue is areas where UH is relatively more common, while red marks areas where UM is relatively more common…)

    I think that the only real point here is that there's apparently some significant geographical variation in this feature…]

  7. D.O. said,

    August 13, 2014 @ 12:41 pm

    "Um's are from Texas, uh's are from Florida" or better "you can't compare Florida uh's to Texas um's".

    BTW Texas is among the youngest states (median age 33.6, 2010 Census) and Florida among the oldest (40.7). Anyone surprised? You can look at this pdf report from the Census bureau.

  8. J. W. Brewer said,

    August 13, 2014 @ 12:42 pm

    I find it odd for a version of the U.S. "Midland" in regional-dialect-variation terms to include all of Del./Md. but exclude all of SE Pa. and South Jersey. In fact, having grown up only a few miles on the southern side of the border between New Castle Co., Del. (light blue) and Delaware Co., Pa. (pink), I find it odd for that particular bit of curved state line to be an isogloss of any sort for any linguistic phenomenon.

    [(myl) The raw map (displayed in response to the previous comment) is a better place to look for details of that kind, and remember anyhow that the geographical "atom" in Jack's analysis is the county. And since these maps were based on only a couple of billion words of tweets, the total number of tweets containing the relevant features was only 50,000 or so, so that the proportions given for individual counties are probably pretty noisy.]

  9. J. W. Brewer said,

    August 13, 2014 @ 1:09 pm

    Yeah, so once you remove the "smoothing" the prior boundary that struck me funny isn't there and the whole Phila to Balt and environs area is a bit of a muddle, but perhaps a cohesive muddle (i.e. perhaps so close to 50/50 throughout the whole region that which counties are 53/47 in which direction is just random noise).

  10. JW Mason said,

    August 13, 2014 @ 2:20 pm

    What kind of pattern would this smoothing algorithm produce applied to random data? Maybe someone can generate a couple examples so we can better judge how significant this apparent pattern is.

  11. AntC said,

    August 14, 2014 @ 2:21 am

    Wait! When people are tweeting, and pause because they can't bring a word to mind, they actually key in uh/um?!?

    (I think I'd be using all my brain power to find the word. Or is this a case of the devil makes work for idle thumbs?)

  12. david said,

    August 14, 2014 @ 7:31 am

    um, perhaps it's more like Alice in the comic strip above. At least that's why I do it.

    The pattern that was called "a kind of spatial smoothing" has been replaced with (probably) the percent um = um x 100/(um+uh). The original "local spatial autocorrelation" is a measure of the tendency to cluster i.e. if you are in a high "uh" county, how likely is the next county also "uh". Random data would give a random plot with values near zero.

  13. Ben Zimmer said,

    August 14, 2014 @ 8:48 am

    Along with "Awkward UM," in online use we should also consider "Dismissive UM" or "Snotty UM." Forum administrators on the Television Without Pity boards (before they were shut down) would actually ban commenters who used phrase-initial UM. The very first rule in the "TWoP Forum Dos and Don'ts" was: "DON'T use 'um,' be snotty to another poster, or make the argument personal." This rule was explained in the FAQ:

    Q: Why can't I start my posts with the word "um," be a snotty jerk, or present my views as God's TV gospel?
    A: Don't start your posts with "um" or "uh" or words like that because nine times out of ten, those words precede a snotty correction directed at another poster. It's rude and dismissive and it drives the staff nuts, so please, don't do it. The same goes for "sorry, but…" and "excuse me, but…" and, really, any other snitty post-starter.
    If you can't talk to other people as if they're intelligent, you can't post. Don't talk down to your fellow posters, don't lecture them, and don't state your opinion as fact. And please don't think we're going to argue technicalities of whether you said "uh" or "um" at the beginning of the post; we can tell when you're being snide and snotty about other people's opinions.
    If you're having a problem keeping your temper under control, get it under control, or post somewhere else. It's supposed to be fun. It's not combat. It's not necessary for it to become personal.
    If you want to point out an error, that's fine, but please find a way to do it that isn't the written equivalent of an eye-roll.

  14. Alex said,

    August 14, 2014 @ 10:21 am

    "Uh" at the end of phrases is often used to signal that the speaker wants to retain the turn, and is not done speaking. That sort of signalling is particularly useful in an impoverished medium like the telephone, where you don't get the visual cues. Tweets don't even have tone of voice to help guide conversation. Maybe people on the coasts where Twitter is more established are sending more multipart tweets that continue from one to the next?

    For the spoken data, if males and older people are using "uh" more often, especially at the end of phrases, I'd think it's because they are monopolizing the conversation. Which also accounts for the high usage of "uh-huh" by women.

    [(myl) I believe that we can rule this explanation out, based on the facts about (phonetic-) phrase-final UH vs. UH overall. In the Fisher data, the overall male rate of UH usage is 1.19%, and the overall female rate is 0.47%. 17.3% of the male UHs are phrase-final, and 17.1% of the female UHs are phrase-final. So neither the overall proportion of phrase-final UHs nor the male-female difference in that proportion is consistent with your suggestion.]

  15. Philip (flip) Kromer said,

    August 14, 2014 @ 11:17 am

    A couple more links that will help anyone looking to understand and implement spatial autocorrelation:

    * walk through of calculating LISA and g-star

    * G-star autocorrelation with guidance on how to choose weighting

    * good visual presentation of local autocorrelation statistics

  16. BK said,

    August 18, 2014 @ 1:24 pm

    Do we have any reason to believe that writing 'UM' vs 'UH' in a tweet is at all correlated with the use of 'UM' vs 'UH' in speech?

  17. Trees and Tweets in the media | Digging into Data Challenge Phase 3 said,

    October 1, 2014 @ 9:38 am

    […] spatial methods for dialectology and produced some quick maps for the popular linguistics blog Language Log. The map show the significant geographical variation of using "um" and "uh" […]
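
The point made in the replies to comments 8 and 9, that county-level proportions built from "50,000 or so" tokens are probably noisy, can be put in rough numbers. The sketch below uses the usual binomial standard error for a proportion. It assumes, purely for illustration, that the tokens were spread evenly over about 3,100 counties; real tweet counts are heavily skewed toward populous counties, so sparse counties would be noisier than this and dense ones less so.

```python
import math


def proportion_se(p, n):
    """Binomial standard error of a proportion p estimated from n tokens."""
    return math.sqrt(p * (1 - p) / n)


def tokens_needed(p, margin, z=2.0):
    """Smallest n for which z standard errors fit inside the given margin."""
    return math.ceil(p * (1 - p) * (z / margin) ** 2)


TOTAL_TOKENS = 50_000  # "50,000 or so" from the reply to comment 8
N_COUNTIES = 3_100     # assumption: rough number of US counties

avg_tokens = TOTAL_TOKENS / N_COUNTIES
print("average tokens per county:", round(avg_tokens, 1))
print("standard error at 50/50:  ", proportion_se(0.5, avg_tokens))
print("tokens needed to resolve a 53/47 split:", tokens_needed(0.5, 0.03))
```

Under these assumptions a typical county has about 16 tokens and a standard error of roughly 0.12, so a 53/47 split in a single county is well inside the noise. Telling 53/47 apart from 50/50 at about two standard errors would take on the order of 1,100 tokens per county.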

