Twitter-based word mapper is your new favorite toy


At the beginning of 2016, Jack Grieve shared the first iteration of the Word Mapper app he had developed with Andrea Nini and Diansheng Guo, which let users map the relative frequencies of the 10,000 most common words in a big Twitter-based corpus covering the contiguous United States. (See: "Geolexicography," "Totally Word Mapper.") As the year comes to a close, Quartz is hosting a bigger, better version of the app, which now includes 97,246 words (all occurring at least 500 times in the corpus). It's appropriately dubbed "The great American word mapper," and it's hella fun (or wicked fun, if you prefer).

This is one of those moments, like the rollout of the Google Books Ngram Viewer or the New York Times dialect quiz, when a pleasingly designed interface allows users to interact with huge troves of linguistic data, letting people see language (and play with it) in a brand-new way. If you do a Twitter search on "Where Americans use," you can see how much fun people are having with this new data-visualization toy.

(As noted on Quartz, you can download the full dataset, with county-by-county breakdowns for each word, on this page, which also has lots of supporting documentation from Grieve et al.)



21 Comments

  1. C said,

    December 16, 2016 @ 3:30 am

    But does it adjust for the huge variation in population density across the USA? If not, and especially if it doesn't highlight that point, it's of limited usefulness.

    [Having heard Jack Grieve give a talk on this at Edinburgh recently, I think I can give a provisional answer (not technically accurate enough, but I think I can give a sense of what a full answer would say). To talk of adjusting for population density misses the point: we're not interested in the actual number of occurrences of a word flowing out of a given location's tweets, we're looking at the distribution. And far from being rendered "of limited usefulness" by variations in population density, this tool can actually measure them. Jack's work has shown (and it really is breathtaking to see it on the screen) that you can use specific sets of words to find where the wide-open spaces and hunting and farming areas can be found. Jack uses a smoothing algorithm — a completely automatic statistical procedure for sharpening up areas and their boundaries. When he shows you a comparison of smoothed maps for certain word frequencies with the map of where population density is low, and with the map of where people voted for Trump against Clinton, the three are so similar that the audience gasps. To reduce it to a single-word stereotype, where people's tweets show a higher frequency of words like "rifle" and "tractor" the population density is low and Trump won; where tweets show a higher frequency of words like "arugula" and "opera" the population density is high and they voted for Clinton. Other sets of words will pick out the boundary of the Confederacy almost exactly. This tool is not to be underestimated. If you get a chance to hear Jack Grieve do a presentation about it, cancel everything and go. —Geoff Pullum]

  2. Jack Grieve said,

    December 16, 2016 @ 4:34 am

    Hi C (if I may), it's mapping relative frequency not frequency. So the data is normalised based on the total number of words collected in each county, which is proportional to the population density of the county, as one would expect. If it were based on raw frequency every map would look basically the same–like a population density map. But try a few words and you'll see. Enjoy! Jack
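The normalisation described in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `relative_frequency`, the per-million scale, and the two counties with their counts are all hypothetical, not taken from the Word Mapper code or dataset.

```python
from typing import Dict


def relative_frequency(
    word_counts: Dict[str, int],
    total_words: Dict[str, int],
    per: int = 1_000_000,
) -> Dict[str, float]:
    """Return occurrences of one word per `per` words, county by county.

    word_counts: how often the word appears in each county's tweets.
    total_words: how many words were collected in each county overall.
    Counties with no collected words are skipped (no rate can be computed).
    """
    rates = {}
    for county, total in total_words.items():
        if total == 0:
            continue
        rates[county] = word_counts.get(county, 0) * per / total
    return rates


# Hypothetical numbers: a big county and a small one.
counts = {"Urban County": 5000, "Rural County": 50}
totals = {"Urban County": 100_000_000, "Rural County": 1_000_000}

rates = relative_frequency(counts, totals)
for county, rate in rates.items():
    print(f"{county}: {rate:.1f} per million words")
```

On raw counts the big county dominates by a factor of 100; on relative frequency the two hypothetical counties come out identical, which is why a raw-count map would mostly reproduce a population map.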

  3. Jack Grieve said,

    December 16, 2016 @ 4:35 am

    And thanks Ben for the post (and Mark for posts in the past) and Geoff for the nice comment. We really appreciate all the Language Log support!

  4. C said,

    December 16, 2016 @ 5:13 am

    That's brilliant: the site and your timely explanation. Hours of fun await, and I can justify it as educational.

    I hope a UK version is on the agenda – especially as you are based in the UK.

  5. Francois Lang said,

    December 16, 2016 @ 10:31 am

    I put in the word "sex" and was puzzled by the result. Concentrations (with medium smoothing) are
    (1) Northern California/Southern Oregon/NW Nevada,
    (2) SW Iowa, and
    (3) just about all the southeast United States

    Not quite what I'd expected.

    [(myl) Here's the plot:

    What does it mean? Maybe not what you think; we'd have to look at a sample of the underlying tweets. For example, maybe in 2014 (when the tweets were gathered) people in the southland were more concerned about the dangers of same-sex bathrooms than people in other regions? Or more likely, the key factor is the fact that 2014 was a turning point in terms of legalization of same-sex marriage:

    So it's not necessarily all about steamy southern (or Iowan) sexytimes.]

  6. Sarah said,

    December 16, 2016 @ 11:02 am

    Did this app have to become political? Those "unpopulated" areas have some of the finest people you'll ever meet. Many tweeting about "tractors" hold PhDs, and farm the land. That's all they've known, and they can't imagine leaving God's beautiful country. The cities voted for Clinton not because of people like me who appreciate arugula and the opera, but sadly because of the higher concentration of poor people who want their welfare checks from the government and who keep having babies so they can live off the money of other hard working people and not have to work. These areas that voted for Clinton also have the highest crime rates. There are other words we could use to show the same results besides arugula and opera. :)

  7. Jonathan Smith said,

    December 16, 2016 @ 11:54 am

    Awesome. Wish I could add and subtract maps Venn diagram style. Or define geographical areas and generate best-fit word(s). Apparently in Kentucky, aside from "Kentucky," we talk about "cats." Results for "verdad" are certainly true-looking…

    At the same time, it seems medium-to-high smoothing will always find areas of ("significantly"?) high concentration, even for words whose distributions one would expect to be rather even ("I", "like," or, uh, "cats"). So if I input a word I am genuinely curious about ("Kyoto"), it's hard to know if this is a western North Dakota/Minneapolis thing (i.e., verdad)… or probably not.

  8. Francois Lang said,

    December 16, 2016 @ 12:17 pm

    Thanks, MYL — great analysis, and thanks for posting the plot!

  9. Ellen K. said,

    December 16, 2016 @ 12:21 pm

    @Jonathan Smith: I suspect it's more accurate to say you talk about Cats (vs cats), specifically Wildcats, in Kentucky.

  10. Jonathan Smith said,

    December 16, 2016 @ 12:31 pm

    @Ellen K. Ah, of course! If not me precisely. In general, though, I doubt these stories will always have a basis in fact…

  11. D.O. said,

    December 16, 2016 @ 12:54 pm

    There are obviously words with huge regional variation and words with very little. There always are. As far as I can see these maps ignore that and map the distribution relative to the standard deviation for the word (or some other dispersion measure). If I am right, it removes the sense of scale. Maybe the Quartz people should add a number to their plots showing the extent of the variation. One number won't kill anybody.
    Otherwise, one might think that people are not using articles south of the 36th parallel.
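The concern in this comment can be made concrete with a small sketch. It assumes, as the commenter only guesses, that each word's map is standardised by that word's own standard deviation; whether the Quartz app actually does this is not confirmed here. The function names and the two rate lists are hypothetical. The coefficient of variation stands in for the "one number" the commenter asks for.

```python
from statistics import mean, pstdev
from typing import List


def standardize(rates: List[float]) -> List[float]:
    """Convert per-county rates to z-scores (mean 0, standard deviation 1)."""
    m = mean(rates)
    s = pstdev(rates)
    if s == 0:
        return [0.0] * len(rates)
    return [(r - m) / s for r in rates]


def coefficient_of_variation(rates: List[float]) -> float:
    """Standard deviation divided by the mean: a scale-free spread measure."""
    m = mean(rates)
    return pstdev(rates) / m if m else 0.0


# Hypothetical per-million rates in three counties.
regional_word = [10.0, 20.0, 30.0]     # varies threefold across counties
even_word = [999.0, 1000.0, 1001.0]    # nearly flat, like an article

z_regional = standardize(regional_word)
z_even = standardize(even_word)
cv_regional = coefficient_of_variation(regional_word)
cv_even = coefficient_of_variation(even_word)

print(z_regional, z_even)      # the two standardised "maps" look the same
print(cv_regional, cv_even)    # the spread measure tells them apart
```

After standardising, the strongly regional word and the nearly flat word produce the same pattern of highs and lows, so the map alone cannot show which one really varies. Reporting a single spread figure alongside the map would restore that information.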

  12. Rod Johnson said,

    December 16, 2016 @ 1:19 pm

    Sarah, what's giving you the idea that Geoff is suggesting that the use of "rifle" or "tractor" has anything to do with the moral standing of the users? There's a peculiar kind of touchiness out there in which even the mention of certain sensitive cultural motifs is equivalent to condemnation or "politicizing." One might almost see it as a kind of projection. If, indeed, the correlation between word use and voting patterns is real, that should be recognized, regardless of which side you're on. Sadly, you give the game away with bullshit claims about "poor people who want their welfare checks from the government and who keep having babies so they can live off the money of other hard working people and not have to work." It's not the app that is political, it's you.

  13. Jack Grieve said,

    December 16, 2016 @ 1:57 pm

    FTR I should have said that total words per county in the corpus is proportional to county population not population density.

  14. Jack Grieve said,

    December 16, 2016 @ 2:25 pm

    With the sex map, I'm not sure about what's happening in the Northwest, but in general Tweets from the Southeast are characterized by frequent usage of all sorts of words related to relationships, socializing, communication, sex, etc.

    [(myl) Interesting. Does your dataset allow you to look at the distribution of words preceding and following a given word (here "sex") in a particular region?]

  15. Jack Grieve said,

    December 17, 2016 @ 4:38 am

    Hi Mark, yes, the corpus does and it's an idea I've been kicking around–geo semantic vector spaces. It would be interesting to see when people tweet about a particular topic how they tweet about it apart from the frequency with which they tweet about it. But the data for word mapper doesn't have that info. What we have done is look at what types of words show similar patterns and what we can see is that most of the patterns are explained at least partially by broad topical and stylistic patterns. If anyone is interested here's a talk I gave a few months ago: https://dl.dropboxusercontent.com/u/99161057/AMES3.pdf

  16. Cervantes said,

    December 17, 2016 @ 5:59 am

    Thanks, Jack, and everyone else for commentary.

  17. Ray said,

    December 17, 2016 @ 12:04 pm

    I’m not a twitter user, so I’m curious how this app accounts for:

    1) people whose tweets are merely copy/pastes of online-generated media headlines/catchphrases/memes/hashtags/quotes etc.

    2) twitter bots, which replicate prepared texts verbatim and in volume (and geographically?) as if they were original and individual

    3) folks in pennsylvania for example who say “supper” but who don’t use twitter vs folks in pennsylvania who say “dinner” but who do use twitter (esp. when drawing conclusions about cultural, social, political connections)

    ie, to what extent is this app measuring twitter usage phenoms vs measuring real-world usage phenoms, and does that even matter and if so how is it adjusted for?

    really cool looking app!

  18. Jack Grieve said,

    December 18, 2016 @ 9:11 am

    It's just geocoded Twitter. WYSIWYG.

    So 1 and 2 are there but my opinion is they should be. Not everyone agrees with that but regardless those types of tweets really are in the minority. Like if you grab 100 tweets at random from the corpus the vast majority look like tweets that would belong even if we had filtered in various ways, which seems like a slippery slope to me anyway.

    As for 3 these are Twitter maps. They are only intended to represent regional variation in this register. So I think issues like that are largely irrelevant at least from my perspective. I mean the same is true in my opinion of any dialect study, including the ANAE, which strictly speaking is only representative of the register of telephone-based sociolinguistic interviews.

    Similarly a lot of people worry that Twitter user demographics don't match the general population. We know they don't, but that really isn't what anyone should be trying to represent when studying Twitter. It's the (mobile) Twitter demographics we should be worrying about representing IMO (and we do). And again the same could be said of any dialect survey, which for example often focus on NORMs.

    That all said I find that the maps on Twitter generally match the maps from surveys or from corpus analyses of other registers. More important, I find the most common regional patterns in any register and from any level of linguistic analysis tend to match very closely. And why shouldn't they? It's presumably the same external factors that are responsible–mountain ranges, cultural borders, etc.

  19. Justin said,

    December 19, 2016 @ 3:14 pm

    Search for falafel (with some smoothing) shows a big blue area on the border between Nebraska and Kansas. Turning the smoothing off shows this is entirely due to tweets in one county (in which there don't seem to be any urban areas).
    This leads me to two hypotheses:
    1. There's a great falafel restaurant in the middle of nowhere.
    2. Tweets that can't be geolocated more accurately than "United States" are attributed to the centre of the country.

  20. Jack Grieve said,

    December 21, 2016 @ 8:03 am

    Just the smoothing on really low frequency words–which I know the qz app doesn't really identify in any way, sort of related to Dan's point above–isn't very meaningful. Better to just look at the raw maps there. But there is someone there in that county tweeting a lot about falafel. We only take about 2% of tweets off the API–those w full long lat info.
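Why smoothing misleads on a rare word can be shown with a toy example. The sketch below uses plain neighbour averaging as a stand-in; it is not the smoothing procedure the app actually uses, and the five counties, their adjacency, and the rates are hypothetical.

```python
from typing import Dict, List


def smooth(
    values: Dict[str, float],
    neighbors: Dict[str, List[str]],
) -> Dict[str, float]:
    """Replace each county's value with the mean of itself and its neighbours."""
    smoothed = {}
    for county, value in values.items():
        group = [value] + [values[n] for n in neighbors.get(county, [])]
        smoothed[county] = sum(group) / len(group)
    return smoothed


# Five hypothetical counties in a row; only C has any tweets with the word.
raw = {"A": 0.0, "B": 0.0, "C": 90.0, "D": 0.0, "E": 0.0}
adjacent = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

smoothed = smooth(raw, adjacent)
print(raw)       # one county stands out
print(smoothed)  # the spike has spread to B and D as well
```

In the raw values a single county accounts for everything; after one pass of averaging, three counties share the same elevated value and the map shows a regional blob. With a rare word, one prolific user is enough to produce that effect, which is the case for looking at the unsmoothed map.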

  21. Ray said,

    December 23, 2016 @ 8:56 am

    it would be interesting to grab a corpus of twitter from 2016 and compare it to that of 2014.

    for instance, when I type in “neoliberal” “pantsuit” “neocon” “bigly” “posttruth” “duopoly” “radicalize” I get NO DATA from the 2014 map

    when I type in “server” I get a smattering in idaho, new mexico, and the mountains of wyoming

    when I type in “hack” I get a smattering in northern minnesota, eastern montana, and the northwest corner of arkansas

    and when I type in “woke” I get a southern belt

    “crooked” shows up in central oregon, northern colorado, tip of northern minnesota, southern virginia

    while “extremist” and “jihad” concentrate on the texas panhandle

    and “BLM” stretches along the land of the rockies, from idaho to arizona
