On "culturomics" and "ngrams"


I'm still mulling over the blockbuster "culturomics" paper published in Science last week and ably addressed here by Geoff Nunberg and Mark Liberman. I'll have more to say about aspects of the paper having to do with the size of the English lexicon, but in the meantime let me direct you to my latest Word Routes column on the Visual Thesaurus, which takes up the more superficial question of nomenclature: both culturomics and ngram (as in the Ngram Viewer) are less than transparent to non-specialists (and even trouble some specialists). An excerpt follows below.

The authors of the Science paper, "Quantitative Analysis of Culture Using Millions of Digitized Books" (free registration required), define culturomics as "the application of high-throughput data collection and analysis to the study of human culture." The culture part of culturomics is straightforward enough, but what about the -omics? Many observers in the wake of last week's publicity barrage have been stymied by that. The esteemed language expert David Crystal, for instance, initially surmised on his blog that culturomics is "presumably based on ergonomics, economics, and suchlike." Dan Clayton, a British language researcher (and friend of the VT), similarly speculated that the new word is "a blend of culture and economics, with a bit of linguistics thrown in."

Full disclosure: I was lucky enough to get a preview of the Science paper a couple of months ago from a presentation by the lead researchers, the young Harvard scholars Jean-Baptiste Michel and Erez Lieberman-Aiden, so by the time the paper was published last week I had advance warning about culturomics. And I already knew that it was intended to be pronounced with a long "o" (cultur-OH-mics), a clue that it has nothing to do with economics or ergonomics. Rather, the model is genomics: the study of organisms in terms of their full DNA sequences, or genomes.

Further disclosure: among my other comments on their paper presentation, I told Jean-Baptiste and Erez that I didn't think culturomics was the most felicitous choice for the new field of study they envisioned. The connection to genomics might be apparent to those in the biosciences who have already seen the proliferation of other words ending in -omics, such as proteomics, the study of the proteome (the full set of proteins encoded by a genome). This Wikipedia page lists a raft of other -omics topics, such as connectomics, interferomics, and transcriptomics. But despite the large number of -omics coinages in biology and allied sciences, a lay audience would not immediately pick up on the meaning of the suffix, especially if they only see the word in print rather than hearing the tell-tale long "o" sound.

Read the rest here.


  1. Natalie Binder said,

    December 23, 2010 @ 2:05 am

    I have been thinking about the new term coined in the paper as well. I've seen it rendered as "culturenomics," "cultronomics" and now "culturomics." I know from the incoming searches on my blog (where I recently criticized the Ngrams Viewer) that searchers are using both "culturenomics" and "cultronomics" in association with this project. The subtle differences in these terms suggest shades of meaning. "Culturenomics" is easier to say, and also benefits from an association with economics. This relationship is also suggested by another term I've heard of to describe this new discipline: "freakumanities." Despite the scientists' wishes, it seems like it's difficult to avoid the association with the social sciences.

    For my taste, I'd be just as happy if we called it "digital anthropology" or "digital humanities" and had done. The coining of a new term feels premature and a bit gimmicky, and the association with genomics is not very clear, considering how many people hear "-omics" and think "economics." I like the idea that we might someday be able to sequence the linguistic genome using something like ngrams, but another book-cover neologism isn't going to get us there.

  2. John Cowan said,

    December 23, 2010 @ 3:06 am

    Culturomics by any spelling would smell as yucky. They should have left it as n-gram search (with hyphen, please). Not every scientific advance is a new Kuhnian paradigm.

  3. Neil said,

    December 23, 2010 @ 4:12 am

    A great example of C. P. Snow's Two Cultures at work here: I can't think of many scientists who would have seen the word 'culturomics' without rolling their eyes and thinking 'Not another "-omics"!'

  4. Ian Tindale said,

    December 23, 2010 @ 4:54 am

    I often think about the effect of an “enforced” English literacy as an effect of the Internet spreading mainly English first, then other languages secondarily, to every specialist area of human endeavour throughout the accessible world.

    Under pre-network conditions, the way people learn about specialist topics would be either by book learning, or by classroom learning, and there would have been an opportunity for a distinct difference in pronunciation between the two categories, if someone had perhaps grown up to be “the only expert in the village”, with nobody else to talk to at that same level of specialisation, and had formed their own opinion of how things are pronounced (this is of course, assuming that their topic is not a linguistic or language-based topic, but perhaps a technical specialisation).

    The first time one encounters another in the same specialisation can be quite a shock, to learn that they pronounce a certain word entirely differently to the way one had been pronouncing it all these years.

    I think with the effect of the Internet spreading quite esoteric and in some cases, much arcane knowledge around, yet often quite thinly, it will result in an increase in the above-described effect, especially in areas or topics where the participants are perhaps not socially inclined anyway and tend not to gather in groups or go to classes, functions, meetings or parties.

    I think in the future, as a result of this effect, the accepted pronunciation of words will trend away from having a fixed and normalised format per word, and toward a permissible “say it the way I assume it should be said” norm, as the global long tail of people who do precisely that gather critical mass.

  5. Richard Sabey said,

    December 23, 2010 @ 5:33 am

    "n-gram" was also an infelicitous choice. "digram" already has a meaning: a sequence of two letters. Likewise "trigram", etc. Thus "n-gram" suggests a sequence of n letters, not words.

    [(myl) That's what it suggests to you, perhaps, but not to anyone familiar with the long-standing terminological practices of this field, where n-gram is taken to mean a sequence of n symbols, not letters. In constructing a statistical language model, it's appropriate to consider the symbols to be words. This has been the standard way of talking about statistical language models for many decades, and is not in any way a Google Labs innovation.

    (Also, the usual term in this field is "bigram", not "digram".) ]
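
The distinction drawn in this exchange can be made concrete with a minimal sketch. The helper name `ngrams` and the sample text are assumptions chosen for illustration, not anything taken from the Science paper or the Ngram Viewer. The same function yields word n-grams or letter n-grams depending only on what is passed in as the sequence of symbols.

```python
def ngrams(symbols, n):
    """Return every run of n consecutive symbols as a tuple."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


text = "the study of human culture"  # illustrative sample, not real corpus data

# Symbols are words: the sense used in statistical language modeling.
word_bigrams = ngrams(text.split(), 2)

# Symbols are letters: the sense the comment above has in mind.
letter_bigrams = ngrams(list("culture"), 2)

print(word_bigrams)
print(letter_bigrams)
```

Here "n-gram" names the window over the sequence, and the choice of symbol (word, letter, or anything else) is made separately by whoever prepares the input.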

  6. GeorgeW said,

    December 23, 2010 @ 6:29 am

    In my opinion, 'culturomics' was an awful choice. Neither the intended pronunciation nor the meaning is transparent (and both are potentially misleading).

    I would have opted for something like 'googlegram.'

  7. Mark Liberman said,

    December 23, 2010 @ 6:38 am

    For some earlier -omic commentary, see "-ome is where the art is", 10/27/2004. Terms of the form "X-ome" for "big list of everything in X", and "X-omics" for "systematic study of the big list of everything in (some area suggested by) X", have been common in the life sciences for years. And I contributed my mite to the process, in the form of an NSF-sponsored research project under the title "Mining the bibliome", which ran from 2002 to 2007.

  8. GeorgeW said,

    December 23, 2010 @ 8:20 am

    @myl: Perhaps 'culturomics' would work under this pattern if it were "the big list of everything" in culture. For the name to suggest that it represents a compilation of everything in culture is quite a reach.

  9. mgh said,

    December 23, 2010 @ 8:38 am

    I'm in a field where omics has become a standard suffix (or even stand-alone word) and so the choice of culturomics was really unremarkable to me.

    But, now that several people have fussed over it, I have to admit it is more troublesome than I thought at first. "-ome" is the localized collection of many or all members of a species, like genome and proteome, so at first I thought all this discussion would have been avoided if the authors had just referred to their database as the cultureome (I agree the "e" at the end of culture should be left on).

    But it's not really a cultureome — the collection of many or all cultures. What would it be — a wordome? a linguome? The problem of naming what it's a collection of is evident in the choice of N-gram, another bit of jargon, suggesting that the correct (but horrible) choice would be Ngramome.

  10. mgh said,

    December 23, 2010 @ 8:40 am

    To be clear: (1) if you're defining a new "omics" you need to have first defined the "ome", and (2) the problem here isn't just unfamiliarity of the -omics suffix, it's the collective nature of the word "culture" as opposed to "gene" or "protein".

    [(myl) But biologists are already guilty (or perhaps proud?) of similar ventures in -omical areas where "the localized collection of many or all members of a species" is a murky concept at best. Let's start with the "phenome", which was defined back in 1997 as the "physical totality of all traits of an organism or of one of its subsystems", and in 2003 as “the body of information describing an organism's phenotypes, under the influences of genetic and environmental factors”. I'd argue that the concept of "phenotype" is roughly as open-textured as the concept of "culture" is, and that attempts to define "phenome" by analogy to "genome" run into roughly the same problems that attempts to define "culturome" do. At least the Culturomicists have set up a well-defined proxy list, even if there are good reasons to be skeptical of its value and reach.]

  11. Rod Johnson said,

    December 23, 2010 @ 9:43 am

    Ha, I read "phenome" as related to "phenomena," which of course it is, though less directly, and liked it as "the big list of observables." I guess I was looking for "phenomenome."

  12. mgh said,

    December 23, 2010 @ 9:49 am

    myl, thanks for your reply but I'm not sure where you're coming from — you may be thinking of the classical usage of "phenotype" as the set of observable traits of an individual, but recent use tends to refer to a single trait, eg a fruit fly might show an eye color phenotype, a wing development phenotype, and a courtship phenotype, and that constellation of phenotypes would be what those authors tried to call its phenome. I'm not sure how well that one's caught on.

    [(myl) In the world of fruit-fly genetics, there is a standard list of well-individuated phenotypic traits, which (as I understand it) usually correspond nearly one-to-one with known genomic variations. In the real world (certainly the real world of humans, and I suspect the real world of flies as well), there is no standard list of well-individuated phenotypic traits, and even simple physical traits appear to be connected to many genes, most of them now unknown. What is the "phenome" (list of phenotypic traits) for ear shape? Once you get past the Mendelian trait for earlobe attachment, I don't think that there's much to say, even though the details of pinna shape are presumably under genetic control. What about cortical convolutions? The answers to these questions are no more well defined than the answers to questions like "What is the culturome (list of cultural traits) for marriage?"

    I'm not trying to suggest that these questions have no possible answers, or even that there are not some plausible candidate answers out there (though I don't think there are, in general). What I mean is that the phenotypic and cultural universes don't naturally decompose into a list of reasonably-well-separated things, as genes and proteins do.]

  13. Ellen K. said,

    December 23, 2010 @ 9:54 am

    Natalie: considering how many people hear "-omics" and think "economics."

    Actually, if we were to hear "-omics", we wouldn't think "economics" because the first vowel has a very different sound. The problem, as Ben Zimmer notes at the end of his post, is that we don't hear it, only see it, where it looks just like the end of "economics".

  14. chris said,

    December 23, 2010 @ 10:58 am

    For the name to suggest that it represents a compilation of everything in culture is quite a reach.

    Particularly since it clearly doesn't. LL recently had a post on the difficulty of writing down rap, but other things like dance are even harder to write. And unwritten customs are just that, unwritten, yet nobody would suggest that they aren't part of the culture.

    With all due respect to the linguistic orientation of this blog, much, perhaps even most of culture is not (usually) expressed in words at all and can't possibly be included in this alleged "culturome". The project itself is quite ambitious, but a name like that is more ambitious by far.

  15. mgh said,

    December 23, 2010 @ 1:38 pm

    myl, I appreciate your point about natural phenotypes being difficult to atomize, but (1) that's exactly what people who use the word "phenome" do, and (2) "phenome" hasn't caught on all that broadly anyway, at least not compared to genome, proteome, transcriptome.

    I thought the Google project was primarily based on Western literature ("cultural phenomena that were reflected in the English language between 1800 and 2000") although I suppose they are trying to position themselves as the first study in a new field that will expand to include other cultures (or they see English-language works from 1800 to 2000 as already comprising multiple cultures).

    I suppose you're right that the collection of many cultures under different time frames as the "cultureome" is not much worse than the collection of many phenotypes under different environmental conditions as the "phenome". As you say, I'm not sure whether they should be guilty or proud of that.

  16. Dan T. said,

    December 23, 2010 @ 1:56 pm

    The popular book "Freakonomics" probably helps reinforce the idea that "-omics" relates to economics.

  17. Peter Taylor said,

    December 23, 2010 @ 3:07 pm

    (Also, the usual term in this field is "bigram", not "digram".)

    Interesting. Would that be because digram is ambiguous, being used also for digraphs?

  18. Mark F. said,

    December 23, 2010 @ 4:07 pm

    Here are some alternate candidates:

    lexomics (but we're dealing with sequences of words, not just words)
    phraseomics (but most n-grams aren't phrases)
    ngramomics (already proposed; yeah, that'll fly)
    linguomics (sort of a category error, parallel to genotypeomics rather than genomics; also sounds funny)
    glottomics (like linguomics, but from a Greek root)

    I think lexomics is best. If you insist on n=1 it really is lexomics; allowing n>1 is generalized lexomics, or lexomics for short.

  19. Jonathan D said,

    December 23, 2010 @ 6:03 pm

    Is Richard Sabey thinking of "digraph" and "trigraph"?

  20. language hat said,

    December 24, 2010 @ 10:14 am

    In my opinion, 'culturomics' was an awful choice.

    It certainly was, and it is indeed a sign of the two cultures that people for whom this is unexceptionable are so immersed in their specialties they don't realize what a terrible word it is for everyone else. It reminds me of those computer nerds who created programs usable only by nerds like themselves, oblivious of the needs of the vast public who would be attempting to use them. To me, it clearly has the -omics of economics, and I intend to go on pronouncing it that way; if my pronunciation gives even a small measure of irritation to someone who thinks it "should" have a long o, it will not have been in vain.

    [(myl) I resemble that remark…]

  21. Dan Lufkin said,

    December 24, 2010 @ 10:25 am

    Could there be a problem with engram already being used for a (maybe copyrighted) Scientology concept (262 kGh)?

  22. Ray Dillinger said,

    December 26, 2010 @ 12:18 pm

    I think that if I were gathering a database of all (or all available) that has been written in a given language, I would probably call it the "Logosphere" of that language. The study of it would then be "Logospherics" or "Logospheronomy" or something like that.

    But, honestly? These people haven't attempted to gather the English Logosphere. A larger chunk of it than previous efforts, yes, but not the whole thing. What these guys are doing is just plain ordinary Corpus Linguistics, and the only difference is the size of the corpus. There is no fundamental breakthrough or change in methodology requiring a new term.

  23. John Cowan said,

    December 26, 2010 @ 1:45 pm

    Quite right, Bear, but quantity can have a quality all its own, as Lenin is said to have said (but probably didn't). Still, this isn't as big a change in scale as I thought when I first talked about it.

  24. David Bird said,

    August 12, 2011 @ 6:31 pm

    This topic is past its stale date, I suppose. FYEO, then. The culturome is an improvement on the speechome http://news.bbc.co.uk/2/hi/science/nature/4987880.stm.

    As I recall, my dictionary of word roots and combining forms defines "-ome" or "-oma" as a mass. Do those who see economics in culturomics also see it in carcinoma? That might make some sense, come to think of it.
