Google thinks Darwin is Freud


Or at least some automatically-derived Google thesaurus does:

For some searches including the term "Freud", a significant fraction of the hits (including the second one in the screenshot above) do not contain "Freud" (or derivatives like "Freudian") at all. At the same time, instances of "Darwin" in the displayed snippets are put into bold typeface, as if they were instances of one of the search terms.

Here's another example from the first page of returns for the query in question:

And a stretch of the second page is here.

What this means, I believe, is that the entry for "Freud" in the thesaurus that Google's search algorithm uses for query expansion (i.e. to add search terms to our queries) includes the term "Darwin". The thesaurus in question was almost surely constructed automatically, from some combination of co-occurrence patterns in query logs and web text.

I base this conclusion on three premises:

1. People have been studying automatic thesaurus construction for almost half a century (at least since Sparck Jones and Needham, "Automatic term classification and retrieval", Information Processing and Management 1968).
2. Google's engineering ethos is strongly weighted against human intervention in such things.
3. No sensible and informed human being would imagine that "Darwin" was a good term to add in general to queries involving "Freud".

The specific Freud → Darwin expansion appears to be new, though the only evidence for this is that no one noticed it before. It might have happened as an unintended consequence of a new thesaurus-construction algorithm, or (less likely) as a result of some new accidental congruence of query logs or web texts.
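To make the mechanism concrete, here is a minimal sketch of how a co-occurrence thesaurus could be built and then used for query expansion. Everything in it is an assumption for illustration: the toy documents, the names `build_thesaurus` and `expand_query`, the cosine-style score, and the 0.7 threshold are all hypothetical, and nothing here is known to resemble what Google actually does.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical toy "web text": each string stands in for one document.
TOY_DOCS = [
    "freud darwin barbarian horde",
    "freud darwin primal horde",
    "darwin freud horde",
    "freud dreams",
    "darwin finches",
]

def build_thesaurus(docs, min_score=0.7):
    """Relate terms that appear in the same documents unusually often.

    The score is co-occurrence count divided by the geometric mean of the
    two terms' document frequencies (a cosine-style measure, chosen here
    only because it is simple).
    """
    doc_freq = Counter()
    pair_freq = Counter()
    for doc in docs:
        terms = set(doc.lower().split())
        doc_freq.update(terms)
        # Sorting makes each unordered pair appear as one consistent tuple.
        pair_freq.update(combinations(sorted(terms), 2))

    thesaurus = {}
    for (a, b), together in pair_freq.items():
        score = together / math.sqrt(doc_freq[a] * doc_freq[b])
        if score >= min_score:
            thesaurus.setdefault(a, {})[b] = score
            thesaurus.setdefault(b, {})[a] = score
    return thesaurus

def expand_query(query, thesaurus):
    """Append every thesaurus neighbour of each query term to the query."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in sorted(thesaurus.get(term, {})):
            if related not in expanded:
                expanded.append(related)
    return expanded

thesaurus = build_thesaurus(TOY_DOCS)
print(expand_query("freud", thesaurus))  # ['freud', 'darwin', 'horde']
```

In this toy corpus "freud" and "darwin" share three of the four documents each appears in, so the score clears the threshold and "darwin" gets added to any query containing "freud". A measure this crude has no notion of whether two terms are interchangeable; it only sees that they keep company, which is the kind of accident the post is describing.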

As Ben Zimmer pointed out to me, this is similar to the issues that led Google Translate at one time to think that "Austria == Ireland" (3/28/2008).

[Hat tip to Victor Steinbok.]



25 Comments

  1. Matt Heath said,

    July 23, 2011 @ 5:25 pm

    If you search "Freud origin species" occurrences of "Darwin" aren't highlighted, so there must be something more complicated going on. Maybe Google's algorithm thinks "Darwin" means "Freud" specifically when accompanied by "barbarian horde" because the two are listed together somewhere as members of a 19th century barbarian horde that attacked traditional Western culture.

  2. Fernando Pereira said,

    July 23, 2011 @ 6:11 pm

    Thanks for the catch, I've reported the issue.

  3. Emily said,

    July 23, 2011 @ 7:00 pm

    I was entering "Freud" followed by Darwin-related terms (evolution, fish, finch, Beagle, Galapagos, natural selection) to see if this glitch would occur, and I found one possible cause– the company which markets those "Darwin fish" car decorations also has Freud fish:
    http://evolvefish.com/fish/emblems.html
    However, "Freud fish" doesn't seem to bring up Darwin results itself, and neither do the other words I tried.

  4. Ray Dillinger said,

    July 23, 2011 @ 7:02 pm

    I'm increasingly annoyed with Google, actually. When I enter search terms, I want a page that has those exact terms somewhere on it. When I enter a search term prefaced by a -, I want a page that doesn't have that term on it. If I bother to use AND or OR, it's because I'm looking for a very specific thing, and the "optimized" searches now seem to completely ignore boolean operators. If I put a phrase in quotation marks, it's because I remember that exact quotation being on the page I want, and I don't want the phrase interpreted or corrected. Grrrf. Another case of "designed to be used by idiots" that becomes useless to any non-idiot purpose.

    So I'm paying fees now to a half-dozen subscription search engines that still allow more specific, non-optimized searches and give you a reasonable sample of pages that actually match your search terms. You have to pay more to get people to *not* mess with the data so much, and most of them are fairly restricted in subject matter too.

    In fact, Google is getting increasingly useless for research or linguistic purposes; the results are filtered and massaged and based on stemmed versions of your search terms, etc, to such a degree that you can't make any reasonable conclusions about how common or accepted something is, especially for subtle variants in usage. I keep coming back to wondering if I need to create my own internet-crawler and maintain my own database to get simple full-text boolean searches back.

  5. Peter said,

    July 23, 2011 @ 7:06 pm

    @Fernando Pereira – is it possible for you to explain or for you to ask someone else to explain to us what it is that gets adjusted here? Presumably you have an automatically-derived thesaurus, as Mark says above, in which the weightings of the alternative terms are sensitive to their context.

    Are those weightings sensitive to the query in which the term is used, as Matt Heath suggests above?

    When the bold terms on the SERP are split between 'darwin' and 'freud', for example, are the results returned from different searches? It seems like this would provide some difficulty in ranking results from different queries on the same SERP – I understand that this thesaurus is not intended to connect such disparate entities.

  6. Eric P Smith said,

    July 23, 2011 @ 7:31 pm

    Indeed something complicated is going on.

    Charles Darwin and evolutionist sociologists of the nineteenth century used a term of Tartar origin, "primitive horde", "primal horde" or "barbarian horde", to refer to the simplest possible form of social formation in existence during prehistoric times. Sigmund Freud took up the theory, and Freud and Darwin are often mentioned together in that context. A Google search on "Freud" and "primitive horde" likewise highlights "Darwin".

    For some reason a Google search on "Darwin" and "Clement Freud" reports that there are 18,400,000 hits, but when you look to see what they are there are actually only about 750. Whatever gremlin thought there were 18,400,000 hits may have created or strengthened the Freud-Darwin association in Google's mind.

    The association may also have been strengthened with the death 3 days ago of Lucian Freud, brother of Clement and therefore also a grandson of Sigmund. There is an obituary on the website of WorldNews.com at http://article.wn.com/view/2011/07/21/Obituary_Lucian_Freud/, and the source code of that page includes 252 occurrences of 'Freud' and 160 occurrences of 'Darwin'.

  7. Richard Hershberger said,

    July 23, 2011 @ 7:41 pm

    I am with Ray Dillinger on this. I don't mind the various assists Google gives. I don't even mind their being the default. But there should be some reasonably transparent way to search on exactly what you type in, including Boolean operators. Without this, Google is really good for one type of search and absolutely useless for another, for no good reason.

  8. A. said,

    July 23, 2011 @ 8:13 pm

    I already found that irritating when a search for snagcheol etymology found pages which included jazz but not snagcheol.

  9. Dan said,

    July 23, 2011 @ 8:16 pm

    I reran the Google query, and Darwin is still in bold. But _this_ page is now the first one returned!

    One factor to consider is Google's battle against the algorithm gamers. Sometimes the relevance of a page to a query is influenced by the text associated with the pages that link to it–rather than the text actually on the page. This seems to be an attempt to counter the ploy of overloading a page with irrelevant keywords to influence search results. Google is not completely like a catalog search because it doesn't completely trust the content that it indexes, and reasonably so.

    If you look at the cached version of the Jean Shepard page, the display notes: "These search terms are highlighted: barbarian horde These terms only appear in links pointing to this page: freud." This text suggests the possibility that "Darwin" may not have influenced the existence or ranking of this page in the search results, but only the bolding of that word.

  10. Teresa G said,

    July 23, 2011 @ 11:13 pm

    Does the advanced search no longer work as advertised?

    http://www.google.com/advanced_search

  11. John Lamping said,

    July 23, 2011 @ 11:38 pm

    There is some official discussion of how Google's synonym system works at these links:
    http://googlepublicpolicy.blogspot.com/2008/03/making-search-better-in-catalonia.html
    http://googleblog.blogspot.com/2010/01/helping-computers-understand-language.html

  12. Paul McCann said,

    July 24, 2011 @ 12:32 am

    I agree with those here who feel Google has gotten overaggressive in its auto-correction, but some of this can be addressed by using magic keywords which, if documented properly somewhere, are well-hidden. A good one is "intext:", which only returns pages containing the words you actually want.

  13. Brian said,

    July 24, 2011 @ 12:48 am

    Putting a plus sign in front of a term also forces that word to actually be on the page.

  14. Ray Dillinger said,

    July 24, 2011 @ 9:18 am

    Last time I tried it, "advanced search" did in fact no longer work as advertised. Especially when trying to find exact quotations.

  15. Adrian said,

    July 24, 2011 @ 12:53 pm

    I agree with Ray and Richard. Paul may be right, but even if he is, Google could offer more overt assistance in helping us to tailor our searches.

  16. Ran Ari-Gur said,

    July 24, 2011 @ 1:46 pm

    @Teresa G: "Advanced Search" is just a helpful GUI for crafting a regular-search URL. For example, anything you put in the "this exact wording or phrase" just gets put in quotation marks, and has the same effect as putting a phrase in quotation marks in a regular Google search.

  17. AntC said,

    July 24, 2011 @ 4:52 pm

    I think we should be demanding that Google reimburses our subscription — just like Language Log offers to do from time to time ;-)
    Seriously: Google is not a controlled or guaranteed service (like, say, electricity or sewerage). You'd have a difficult time suing them for misleading, false or omitted search results — or even suing a dictionary or encyclopedia for which you've paid good money.
    Google's foibles are amusing, no more than that.

  18. JFM said,

    July 25, 2011 @ 6:56 am

    I did a very simple test search on three search engines a while ago — Google, Lycos and AltaVista — to see if they gave similar results. They didn't.
    http://jfmaho.wordpress.com/2010/08/16/search-engines-recalls-and-ratios/
    Ideally, and if search engines worked the way I had hoped they did, at least they would give similar ratios when comparing the results for two searches, even if the actual number of hits would differ. I find it very annoying.

  19. AntC said,

    July 26, 2011 @ 1:25 am

    @JFM: "…before trusting any figures provided by any of the search engines"; "…faith in Google".
    [And apologies for wandering off topic of linguistics.]
    On what rational basis could anyone have trust/faith in those search engines? Do you have a contract for a service? How have you paid for it? What recourse does the service provider offer for non-performance?
    GHits might be sufficient 'evidence' for a fatuous piece of pop journalism; but not (I hope) for any academically robust research.

  20. Will said,

    July 26, 2011 @ 4:48 am

    The second of John Lamping's links has this in it:

    Note that because our synonyms depend on the other words in your search and use many signals, you won't necessarily always see the word "photos" bolded for "pictures", only when our algorithms think it is useful and important to bold.

  21. JFM said,

    July 26, 2011 @ 4:59 am

    @AntC
    >On what rational basis could anyone have trust/faith in those search engines?

    None actually. I've really not used Google for any serious purposes, and I have to confess to have made the somewhat naive assumption that they perform "clean" searches of the interwebs (i.e. without excessive behind-the-scenes data massaging). That blog post merely exhibits my sober awakening, as it were. It gave me concrete numbers to look at.

  22. JFM said,

    July 26, 2011 @ 5:04 am

    I'd also like to add that it would be very useful to have a good and extensive database of actual writing, with the ability to search for misspellings and typos without having them auto-corrected for you.

    As far as I know, the professional corpuses/corpora only use edited texts.

    The internet *is* such a database, of course, but currently it seems impossible to make any reliable searches of it.

  23. Jeremy Hoffman said,

    July 27, 2011 @ 1:03 am

    I happen to work on query expansion at Google. Thanks for your interest, everyone! You've done a nice job reverse-engineering some of our work. :-) I thought I'd chime in with my two cents.

    I totally understand the problems that Ray, Richard, and JFM are talking about in trying to do linguistics research on the Internet itself using Google search. It's tempting to try because Google is fast and comprehensive, but the simple fact is, it's not a system that was built for linguistics research. A system for linguistics researchers would be designed and implemented differently, probably with different algorithms and with different hardware.

    So when Richard says, "Without this, Google is really good for one type of search and absolutely useless for another, for no good reason," I just want to point out that there are good reasons that Google search works the way it does. :-) Software design is all about tradeoffs, and a search engine is very, very complex software. We're constantly trying to make Google more helpful for the majority of cases while still letting you do what you need to do in the rare specific case. We really do worry about all of our users' experience, from the power user to the neophyte, and we have philosophical discussions, iterate on designs, run experiments, get feedback, analyze data, etc. to do the best we can.

    Anyway, feedback like this is always helpful, and I hope we can make you happier in the future!

  24. Ray Dillinger said,

    July 28, 2011 @ 3:11 pm

    Ugh. So I really do need to unleash my own crawler and maintain my own database to get plain old full-text boolean search back.

    No hard feelings, really; I understand, from a business perspective, why it is the way it is and what most people use it for. But I'm frustrated because the thing we want to do and can't is in fact much, much simpler and capable of a highly efficient implementation compared to what's available.

    I think if we're studying something and want to understand it, we need an index of it made without filters. I've been thinking for a long time now that we're being directed, by market pressures alone rather than any deep conspiracy, toward pages made by an increasingly narrow number of providers whenever possible, and that gives us an increasingly restricted view of what is out there.

  25. Frans said,

    August 14, 2011 @ 3:33 am

    Attila the Hung? (second screenshot)
