Hugh Jackilometresan


On Twitter, John Lewis shared a prime example of the perils of global search-and-replace: what happens when “km” gets expanded to “kilometres” in an edition of Trivial Pursuit.

(This card showed up about a week ago on Reddit, where it appeared in a photo taken from a slightly different angle. It looks legitimate — not a Photoshop job or anything.)

More examples of search-and-replace follies from the Language Log archives:

Some of the examples that have come up on Language Log (“Tyson Homosexual,” “clbuttic,” “particitrousers”) got mentioned in the thread of comments following John Lewis’s tweet. One example mentioned on Twitter also came up in the comments on the 6/1/12 “Nookd” post: when “crib” gets changed to “cot” in books or websites about baby care (in an attempt to turn American English copy into British or Australian English), “described” turns into “descoted.”

(That example comes from an Australian Fisher-Price site.)



51 Comments

  1. Y said,

    January 4, 2017 @ 12:44 am

    Him and Nicobaltle Kidecimetrean are both fine Austreetalian actors.

  2. Steven Marzuola said,

    January 4, 2017 @ 1:38 am

    This is so silly. Microsoft Word and most other editors allow a user to select only complete words when making this type of replacement.

  3. Stephen said,

    January 4, 2017 @ 5:29 am

    @Steven Marzuola

    Part, possibly a significant part, of the problem is that km is *not* a complete word and so will not reliably be delimited by white space or a punctuation mark, so ten kilometres could be any of ‘ten km’, ’10 km’, ’10km’, ‘ten km,’, ’10 km,’, ’10km,’, ‘ten km.’, ’10 km.’, ’10km.’.

    A better result can be obtained by using a regular expression
    https://en.wikipedia.org/wiki/Regular_expression .

    Using that you could search for ‘xkmy’ where x & y are not alphabetic characters.

    Even then there are limits to what can be done, ’10km,’ and ’10km.’ both need to be changed, but not to ’10kilometres,’ and ’10kilometres.’.
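    A minimal sketch of the 'xkmy' idea in Python, assuming "alphabetic" means ASCII letters only. The function name `expand_km` is made up for illustration. It also adds the missing space when "km" is glued to a digit, so '10km,' becomes '10 kilometres,' and not '10kilometres,'.

```python
import re

# "km" with no ASCII letter directly before or after it. Lookarounds are
# used so that "km" at the very start or end of the text also qualifies.
KM = re.compile(r"(?<![A-Za-z])km(?![A-Za-z])")


def expand_km(text: str) -> str:
    """Expand stand-alone "km" to "kilometres", leaving words like "Jackman" alone."""

    def repl(match: re.Match) -> str:
        start = match.start()
        # Insert a space when "km" follows a digit directly, as in "10km".
        glued = start > 0 and text[start - 1].isdigit()
        return (" " if glued else "") + "kilometres"

    return KM.sub(repl, text)


print(expand_km("Hugh Jackman ran 10km, then ten km."))
```

    This still pluralises "1 km" and cannot tell when "km" is being mentioned and not used, so the output needs checking by a person.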

    I was involved in a project where we had to make some global changes to computer code, so inherently more structured than a natural language. What we did was to write some code that picked up every line that contained the search string and wrote the relevant line and the name of the source to a database. This database was read by an editing function that listed every *unique* entry twice, with the second copy editable (to be used as the replacement text).

    Using this, the database was reviewed and updated manually and then another bit of code was used to make the relevant replacements.

    Even this was not always sufficient. When the search string contained multiple words it could be split over two lines. So in the editing function a person would spot this and realise that this needed to be treated as a special case. Automating that would have been very hard.
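    The collect, review, apply workflow described above might look roughly like this. It is a sketch, not the tooling from that project: a plain dict stands in for the database, files are passed in as a name-to-text mapping, and `collect` and `apply_reviewed` are hypothetical names.

```python
from collections import defaultdict


def collect(files: dict[str, str], needle: str) -> dict[str, list[str]]:
    """Map each unique line containing `needle` to the sources it appears in."""
    hits: dict[str, list[str]] = defaultdict(list)
    for name, text in files.items():
        for line in text.split("\n"):
            if needle in line:
                hits[line].append(name)
    return dict(hits)


def apply_reviewed(files: dict[str, str], reviewed: dict[str, str]) -> dict[str, str]:
    """Replace whole lines using the table a person has reviewed and edited.

    Lines missing from `reviewed` are left exactly as they were.
    """
    return {
        name: "\n".join(reviewed.get(line, line) for line in text.split("\n"))
        for name, text in files.items()
    }


files = {"a.txt": "run 10km\nHugh Jackman", "b.txt": "run 10km"}
print(collect(files, "km"))
# A reviewer edits only the entries that really need changing.
print(apply_reviewed(files, {"run 10km": "run 10 kilometres"}))
```

    Like the original approach, this works a line at a time, so a search string split over two lines never shows up in the review table.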

  4. Martin Ball said,

    January 4, 2017 @ 8:00 am

    I would hope that the answer to “Ireland was the RMS Titanic’s last port of call” is False; Ireland not being a port :(

  5. Alen Mathewson said,

    January 4, 2017 @ 8:16 am

    We had a similar mix-up in the UK when the Guardian decided to change the American sounding ‘poll’ to the more traditional British ‘turnout’ when reporting elections to the European Parliament some years ago. The successful Labour Party candidate in the South West of London became Anita Turnoutack!

  6. RP said,

    January 4, 2017 @ 8:22 am

    Even if “whole words only” is ticked, it is unforgivably lazy for an editor to do a replace-all without checking the results individually. Who is to say that the book about babies does not also at some stage use the word “crib” to mean “answer-sheet”? It may not be likely, but it is risky to make assumptions.

  7. Brendan said,

    January 4, 2017 @ 9:00 am

    Martin Ball’s comment deserves a much wider audience. Alas, Martin, I suspect that the answer is probably given as “true”.

    [At the risk of causing further irritation, the port was Cobh, Co. Cork, formerly called “Queenstown”.]

  8. Geoffrey K. Pullum said,

    January 4, 2017 @ 9:26 am

    Trivial Pursuit fans should keep an eye out for possible weirdness in questions that might have intended to mention blackmail, bookmobiles, brinkmanship, bunkmates, checkmates, hackmatacks, muskmelons, taskmasters, or workmen; and people called Beckman, Blackman, Heckman, Hickman, Pickman, or Sparkman; or of course Clackmannanshire or the town of Clackmannan.

  9. mark dowson said,

    January 4, 2017 @ 10:10 am

    @Steven Marzuola
    Steven is right that regular expressions are the way to go. I faced this some years back when management insisted that we change the IDs of our software/systems processes to change the prefix “TSS” – an acronym for the company no longer valid after a re-org. The processes and related documents (mostly in .doc) were richly cross referenced in their text, so ~1000 docs in a complex hierarchy, each with multiple cross references (which were not always properly formatted) needed to be changed.

    I did this using Word find and replace which supports regular expressions (a comprehensive and useful description is available at http://word.mvps.org/FAQs/General/UsingWildcards.htm ). Some programming was needed to apply find/replace to a file hierarchy, but it looks like there is now an existing utility which will do this ( http://www.thewindowsclub.com/find-replace-text-multiple-files-bulk-windows-pc ). I haven’t tried it. We did review the results (an admin searched the files for missed cases).

    As Steven notes, there are a lot of cases to be considered even for something as simple as “10 km” and he doesn’t include the possibility of leading or trailing parens and some other similar cases.

    Steven may have given up a bit too easily for the case of a search string with multiple words and possible line breaks. After cleaning up multiple successive space characters, multiple searches – one for each possible position of a line break instead of a space – should do it, unless it is so long that more than one line break is possible.
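    With a regular-expression engine that treats a line break as whitespace, the separate searches can be folded into one pattern. The sketch below uses Python's `re` module, not Word's find/replace, and `flexible_pattern` is a made-up helper name.

```python
import re


def flexible_pattern(phrase: str) -> re.Pattern:
    """Build a pattern matching `phrase` with any whitespace run between its words.

    In Python's `re`, \\s matches spaces, tabs and line breaks, so the phrase
    is found even when it is split over two or more lines.
    """
    words = phrase.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words))


pattern = flexible_pattern("port of call")
print(pattern.sub("harbour", "the last port\nof  call."))
```

    Note that the replacement swallows the line break inside the match, which may or may not be wanted.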

  10. C said,

    January 4, 2017 @ 11:01 am

    @Martin Ball
    I smiled at your comment, but colloquially, a ‘port of call’ is not necessarily a port, is it? Just a general location.

  11. Stephen said,

    January 4, 2017 @ 11:19 am

    @mark dowson

    Mr. Marzuola is a Steven, I am a Stephen.

    “and he doesn’t include the possibility of leading or trailing parens”
    I was including those under the heading of punctuation marks.

    “Steven may have given up a bit too easily for the case of a search string with multiple words and possible line breaks”
    Just as a datum. We did not give up on trying to automate the replacement, we never attempted it as (from experience) we knew that there were multiple variants of some of the source strings and we wanted to harmonise them. So in the editing function we changed A1, A2, A3 … A9 all to B1.

    Also we, correctly as it turned out, suspected that the small suite of tools we built would be useful in the future.

  12. J.W. Brewer said,

    January 4, 2017 @ 11:34 am

    Also these days a “port” may be defined more by a body of water (bay/harbor/what-have-you) than by the often multiple municipalities along its shores. So these days in Ireland you’ve got a conceptual and probably bureaucratic entity known as the https://en.wikipedia.org/wiki/Port_of_Cork, and ships calling on that port might end up tied up to a dock located in Cork city itself, or in Cobh (nee Queenstown), or any of several other points in the area. The usual docking location for passenger ships is apparently in Cobh but the itineraries may still plausibly list “Cork” as the port of call.

  13. Jonathon Owen said,

    January 4, 2017 @ 11:39 am

    One time a friend of mine was posting a personal story on a discussion board, and she decided to change some names before posting it. It would have been fine, except that some of us were puzzled by references to Billtmas. Someone asked about it, she explained how a find-and-replace had gone awry, and a new holiday was born.

  14. Robert Coren said,

    January 4, 2017 @ 11:48 am

    Apart from the silliness of the result, wouldn’t even a correctly-implemented blanket replacement of “km” by “miles” produce inaccurate results? I mean, the answers to “Name a city about 300 km southwest of Boston” and “Name a city about 300 miles southwest of Boston” would be different.

  15. Robert Coren said,

    January 4, 2017 @ 11:51 am

    @RP: What color is the sky on your planet? On the one I live on, the idea that there’s an “editor” who actually looks at the results of automatic replacement is a complete fantasy.

  16. Cervantes said,

    January 4, 2017 @ 11:58 am

    Apart from the silliness of the result, wouldn’t even a correctly-implemented blanket replacement of “km” by “miles” produce inaccurate results?

    Sure, but in this case they were apparently replacing “km” with “kilometers.”

  17. Brett said,

    January 4, 2017 @ 12:23 pm

    @Alen Mathewson: What does “turnout” mean in this sense? As an American, I’m not familiar with any sense that is synonymous with “poll.”

  18. Stephen said,

    January 4, 2017 @ 12:47 pm

    @Brett, in BrE turnout is often used for the number of people (absolute or %age) who have taken part in an election, see entry 1 at
    https://en.oxforddictionaries.com/definition/turnout

    which is the same as 4d at
    https://www.merriam-webster.com/dictionary/poll

  19. Cervantes said,

    January 4, 2017 @ 1:14 pm

    What does “turnout” mean in this sense? As an American, I’m not familiar with any sense that is synonymous with “poll.”

    It was in 1989. As election results came in, the term “percentage poll” was used to say what proportion of the electorate had actually voted in each constituency. At the Guardian the editors decided that in this instance “turnout” would be clearer …

    Anita Pollack’s name was printed incorrectly two days running before the paper issued a correction, blaming the error on “the startling simplicity of computer program logic.”

  20. bfwebster said,

    January 4, 2017 @ 1:21 pm

    A better result can be obtained by using a regular expression.

    Thus invoking the canonical response: http://xkcd.com/1171

    (And I write that as someone who uses them myself.)

    On a related topic, I was attempting to post a comment to an article at the International Business Times website and was told my comment was rejected because it used the word “semen”. Huh? A closer read showed that I had used the word “basement”. SMH.

  21. rcalmy said,

    January 4, 2017 @ 1:22 pm

    At least in my personal experience “turnout” is what is most commonly used for number of people voting in AmE. I have never heard of that definition for “poll” before. Definitions 4a and 4b from the Webster’s link are the ones I’m familiar with.

  22. Guy said,

    January 4, 2017 @ 2:13 pm

    The poll/turnout replacement strikes me as like the kill/die replacement I remember seeing somewhere. It’s especially baffling because it’s not only easy to imagine where it would cause an undesirable result, it’s actually very difficult to imagine any situation where it would cause a desirable result, so that you wonder what usages the person who did the replacement was even trying to deal with.

  23. Brett said,

    January 4, 2017 @ 2:14 pm

    @Alen Mathewson, Stephen, rcalmy: I suspect that this is one of those cases where a strange-sounding usage is believed to come from another variety of English, when in fact it is just obscure. “Turnout” is the natural word to use in American English, and I also have never encountered that meaning of “poll.”

  24. Stephen said,

    January 4, 2017 @ 2:21 pm

    @bfwebster

    Like your basement being rejected, years ago when an American ISP (AOL?) started up in the UK there was more than a little press coverage that some people were being rejected because they were said to be using profanity in their address details. The people lived in Scunthorpe
    https://en.wikipedia.org/wiki/Scunthorpe

    And I now see that the Wikipedia article not only mentions this but refers to the phenomenon as the ‘Scunthorpe problem’ and has an article that lists lots of cases of this (and allied issues).

  25. mark dowson said,

    January 4, 2017 @ 4:34 pm

    @ Steven
    Apologies, both for the “Stephen” and for appearing to suggest that you tried to handle the line break case and failed. But you did say that automating it would have been “very hard”. The advantage of the approach I described is that, so long as the target is a single document in Word (or can be opened in Word) however long, it can be implemented using native Word find/replace with regular expressions, and no programming is needed. Admittedly, finding the correct set of find/replace expressions can be fiddly and may need some experimentation.
    @Robert Coren
    Review of the results of an auto replacement doesn’t need an “editor”, and can be very simple. In the km->kilometer case it just needs a search for occurrences of “km” to see if any were missed, and for occurrences of “kilometer” to flag any stupid ones for correction. If your real world doesn’t have an admin available who could do this (after five minutes explanation) then it would have to be do-it-yourself – easy but boring.

  26. Stephen said,

    January 4, 2017 @ 5:23 pm

    @mark dowson

    You don’t need to apologise for calling me Stephen, that is after all my name.

    All I was saying about the work we did was that we never *tried* to change any of it automatically because of what we were doing.

    Suppose we were trying to change ‘The quick brown fox jumps over the lazy dog’ to ‘Pack my box with five dozen liquor jugs’. Then the source text might have been abbreviated (brown->brwn, dog->dg, etc) in some cases and there might have been extra spaces in some places. Plus there were dates involved which could be formatted in different fashions.

    There were quite a number of files to process (c. 20,000 from memory) so it was not feasible to determine all of the actual variants of the source text without some sort of automated scan. There were hundreds of variants of some of the source strings, even when they were on a single line.

    Trying to cope with that using a very large series of scan & replace commands (only a small % of which would apply to a given file) would have, IMO, been a lot harder than putting some human intelligence in the process. Dealing with line breaks on top of that would just have made it harder still.

    One detail. As I said this was computer source code, so I can never see *any* advantage to opening and modifying that in Word! :)

  27. Rick said,

    January 4, 2017 @ 5:23 pm

    @Robert Coren – switching between km and miles without adjusting the number would indeed be silly, but the common way of showing both figures can be just as silly. I often see expressions such as ‘a distance of about 10 miles (16.1 km)’, where the conversion factor has been applied with a pseudo precision clearly never intended.

    In fact, I’m not quite sure if there is any graceful way to handle conversions of this sort, where intended round number approximations, in English or metric units, don’t neatly fit the other system.)

  28. Rick said,

    January 4, 2017 @ 5:26 pm

    Or, indeed, to fix a hanging parenthesis, when you either forgot to add the other one or took it out without removing both.

  29. per incuriam said,

    January 4, 2017 @ 5:39 pm

    Cobh (nee Queenstown)

    Nee? Cobh (originally spelled Cove) is the earlier name. Queenstown was used between 1849 and 1920.

  30. Rubrick said,

    January 4, 2017 @ 6:07 pm

    I suspect employees of Wizards of the Coast are still told of the tragedy of the dawizard as a cautionary tale. Short version: “mage” was replaced with “wizard” in an ill-conceived search-and-replace. Hilarity ensued. And was printed in a rather expensive volume.

  31. mark dowson said,

    January 4, 2017 @ 6:28 pm

    @stephen
    I think I have your name right this time at least.
    Yes, I can see that the automated find/replace approach I discussed wouldn’t have worked for your situation, but is more likely to work in the km->kilometer case in text files, where a thoughtless simple km>kilometer find/replace produces lots of nonsensical results. Any automated approach has to start with a very careful case analysis similar to the one you provided in your original post. Then it’s possible to define a sequence of regular expression find/replaces that handle (at least most of) the cases. Often, a useful starting point is to collapse all sequences of multiple space characters into a single space, which reduces the number of cases which need to be handled, that is (for example), there may be 10km or 10km, but never 10km or 10km etc.

  32. mark dowson said,

    January 4, 2017 @ 6:33 pm

    @stephen
    The last line of my previous post got auto-corrected when posted to be meaningless. It should read:
    there may be 10km or 10km, but never 10km or 10km etc.

  33. mark dowson said,

    January 4, 2017 @ 6:41 pm

    One more try at correcting:
    there may be 10km or 10-1space-km, but never 10-2space-km or 10-3space-km etc.
    If this doesn’t work, I give up

  34. Adrian Morgan said,

    January 4, 2017 @ 7:29 pm

    I feel there ought to be a poem about “the fascinating lesson of the jackilometresan” but I have been unable to write it. In my head it is highly rhythmic with lots of clever rhymes, suitable for reading aloud as a fast-tempo chant.

  35. Lugubert said,

    January 4, 2017 @ 7:39 pm

    Another case of the “Scunthorpe problem”: On an Internet community, a post of mine was changed to refer to the Rembrandt painting “The Nigh****ch”.

  36. Mark S said,

    January 4, 2017 @ 8:14 pm

    My former pastor told me of a time when two elderly women parishioners died and had funerals in the same week. To save time, the church secretary created the order of service for the second funeral by a search-and-replace on the first one. This would all have been fine, except for the line of the Creed which became “Born of the Virgin Edna”.

  37. Graeme said,

    January 4, 2017 @ 8:52 pm

    Electoral lawyer here. ‘Poll’ and ‘turnout’ can be synonymous as verbs, but with ‘poll’ as a verb being unusual to Anglo usage.

    It’s more quantity than metric, but his Aussie compatriots often pronounce Jackman’s name as ‘Huge Ackman’, tongue partly in cheek.

  38. Guy said,

    January 5, 2017 @ 2:57 am

    Graeme, can you give an example of “turnout” used as a verb with a meaning relating to elections?

  39. Robert said,

    January 5, 2017 @ 8:13 am

    I wonder why they didn’t render “common” as comillimetreon.

  40. wtsparrow said,

    January 5, 2017 @ 8:49 am

    “Global search and replace” or “global search and destroy”?

  41. Joshua K. said,

    January 5, 2017 @ 9:17 am

    On Parade magazine’s web site a few years ago, the comments section was heavily censored by software. People who wanted to refer to the then-Vice President found that their comments were changed to “**** Cheney.” The governing document of the U.S. became the “Cons***ution.” And some people wanted to make reference to a country just west of India, but the software interpreted the first part of the country’s name as a racial slur, so it became “****stan.”

  42. Andreas Johansson said,

    January 5, 2017 @ 9:35 am

    mark dowson wrote:

    If your real world doesn’t have an admin available who could do this (after five minutes explanation) then it would have to be do-it-yourself – easy but boring.

    I’m trying to imagine getting our admin to do something like that. Convincing her it’s her job and explaining how to do it would probably each take longer than doing it oneself.

  43. Robert Coren said,

    January 5, 2017 @ 10:31 am

    @Cervantes: Right you are, carelessness on my part.

    @Rick: Yep. I claim (although I’ve heard arguments against it) that this is why “normal” human body temperature is described, with way excessive precision, as 98.6 Fahrenheit — which happens to be what you get if you convert 37 Celsius. Presumably whoever determined the average noted that it was (about) 37C, which should have been translated as (about) 99.

  44. mark dowson said,

    January 5, 2017 @ 2:50 pm

    Back to km->kilometres:

    The following – easy with regular expressions – should do it:
    Where “km” is immediately preceded and followed by an alphabetic character replace with “zyx”
    Replace all “km” with “ kilometres”
    Replace all “ kilometres” with “ kilometres”
    Where “ kilometres” is preceded by “ 1” replace it with “ kilometre”
    Where “ kilometres” is preceded by “ one” replace it with “ kilometre”
    Where “ kilometres” is preceded by “ One” replace it with “ kilometre”
    Replace all “zyx” with “km”

    Admittedly, this will give a bad result on the (unlikely) text “In this document, km is used as an abbreviation for kilometre”, but short of language understanding AI there is no way to handle such cases, which is why review is needed.
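    The steps above transcribe fairly directly into Python. This is a sketch under a few assumptions: "alphabetic" means ASCII letters, step 3 collapses a double space before "kilometres" into a single one, and the placeholder "zyx" does not already occur in the text. The function name is made up.

```python
import re


def km_to_kilometres(text: str) -> str:
    # Step 1: hide "km" inside words (letter on both sides) behind a placeholder.
    text = re.sub(r"(?<=[A-Za-z])km(?=[A-Za-z])", "zyx", text)
    # Step 2: expand every remaining "km", adding a leading space.
    text = text.replace("km", " kilometres")
    # Step 3: "10 km" has now become "10  kilometres"; drop the extra space.
    text = text.replace("  kilometres", " kilometres")
    # Steps 4-6: use the singular after " 1", " one" or " One".
    text = re.sub(r"( (?:1|one|One)) kilometres", r"\1 kilometre", text)
    # Step 7: restore the hidden occurrences.
    return text.replace("zyx", "km")


print(km_to_kilometres("Hugh Jackman ran 10km, then 1 km, then one km."))
```

    Because the singular rules require a leading space, a "1 km" at the very start of the text stays plural, which is one more case for the review pass to catch.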

  45. mark dowson said,

    January 5, 2017 @ 2:55 pm

    The double space killer struck again in my algorithm. In line 3 any double space preceding “kilometres” is replaced by a preceding single space

  46. Andrew said,

    January 5, 2017 @ 3:47 pm

    @Rick:

    This phenomenon drives me crazy on King Arthur Flour’s website. While they (very laudably and generously) provide all their recipes in a US volumetric (“1 cup sugar”), US weight (“7 oz sugar”) and metric weight (“199g sugar”) format, their metric conversions are excessively precise, to the point where you get measurements such as:

    397g [14 oz] sugar
    361g [12.75 oz] flour
    21g [0.75 oz] honey
    199g [7 oz] sugar, etc. — it’s clear that all their quantities are exact conversions from imperial weights (Google “3.5 oz to g” and you’ll get 99.2233g as the conversion.)

    There is absolutely no reason that these should not be reasonably rounded quantities of 400g sugar, 360g (or even 350g) flour, 20g honey, 200g sugar. I mean, I bake, and I know weighing is more precise than volumetric measurements, but really, 1 gram of sugar is going to make no difference in a recipe that calls for 199 or 200 grams.

    I know it’s probably not this simple, but it seems like you could write a reasonable algorithm that would round metric conversions to the right precision — so no rounding between 1g and 20g, for example, and rounding to every 2g between 21g and 50g, every 10g between 51g and 100g and to every 25g at 101g and beyond. (Or whatever — just as an example.)
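    As a rough sketch, the example thresholds above could be coded like this in Python. The function name `round_grams` is made up, the bands are just the illustrative ones given here, and "no rounding" is taken to mean rounding to the nearest whole gram.

```python
def round_grams(grams: float) -> int:
    """Round a converted weight to a step size that grows with the quantity."""
    if grams <= 20:
        step = 1      # small amounts: nearest gram
    elif grams <= 50:
        step = 2
    elif grams <= 100:
        step = 10
    else:
        step = 25
    # Python's round() sends exact ties to the nearest even multiple.
    return round(grams / step) * step


for exact in (397, 361, 199, 7):
    print(exact, "->", round_grams(exact))
```

    With these bands 397g becomes 400g, 361g becomes 350g and 199g becomes 200g, while 7g is left alone.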

    (I except “227g butter” and eighth-increments thereof from this complaint, since this is still a website targeted largely at US readers and butter in the US is sold and measured in 8 oz/227g sticks. For everything else, though, there’s absolutely no reason to not have reasonably-rounded metric equivalents.)

  47. mollymooly said,

    January 6, 2017 @ 10:12 am

    @J. W. Brewer:
    Wikipedia distinguishes “Port of New York and New Jersey” from “Port Authority of New York and New Jersey” but not “Port of Cork” (the non-contiguous plots of shoreland) from “Port of Cork” (the body corporate formerly known as “Cork Harbour Commissioners”). OTOH both Cork Harbour and New York Harbor have articles separate from those of the respective ports.

    Comparing all pages named “Port of…” with Category:Port authorities reveals similar untidiness for other ports.

  48. Robert Coren said,

    January 7, 2017 @ 11:35 am

    My recollection (from when I lived in New York City 50+ years ago) is that the Port Authority’s name, at least as it appeared on their terminal in midtown, used to be “Port of New York and New Jersey Authority”, clunky as that sounds. Most people just called it (both the agency and the terminal) “the Port Authority”, but a few people, possibly influenced by the legend at the entrance, called it “the Port of Authority”. (My paternal grandmother was one of them.)

  49. Cervantes said,

    January 7, 2017 @ 4:43 pm

    Robert, here’s the front of the building as I recall it from that era. On each corner face, and above the main entrance, the signage read “Port Authority Bus Terminal.”

  50. Graeme said,

    January 8, 2017 @ 5:51 am

    Guy. Sorry, I was referring to the concept in verb form. Yes, it almost invariably appears as a compound verb, as in ‘to turn out’. https://www.merriam-webster.com/dictionary/turn%20out
    Here’s an international story sourced in the US using the verb ‘to turn out’ in the past tense http://www.businessinsider.com.au/trump-voter-turnout-records-history-obama-clinton-2016-11?r=US&IR=T
    But you will find wonks using ‘turnout’ as a verb. A paywalled example from Canadian profs Rubenson, Blais et al in Acta Politica (39: 413) is “people did not turnout in the 2000 election because the result was a foregone conclusion”.

  51. Robert Coren said,

    January 8, 2017 @ 11:23 am

    @Cervantes: Yeah, well, memory. I’m pretty sure that I did hear my grandmother and others say “Port of Authority”, and maybe the rest of it is my mind’s invention in an attempt to explain how this came about.
