Soundex and Metaphone


One of the earliest and best photographers in China was John Zumbrun, but I have also seen his surname spelled in various other ways, including Zumbrum.  Some of his pictures may be seen here (this site is run by Thomas H. Hahn, digital archivist of old photographs).

As soon as I saw his surname, I suspected that it might be a variant of the Zumbrunnen among my own maternal relatives who were of Swiss German extraction.  When I mentioned to my sister Heidi (who does intense genealogical research on our family) that I thought Zumbrun might be a variant of Zumbrunnen, she replied, "Oh man, the variant spellings of Zumbrunnen are driving me batty.  I have even seen Zum Pwunnen.  Have you heard of the soundex?  It is a way to index names & deal with all of the variant spellings."

Upon looking up Soundex, I found that it was developed around 1918 and was used to index names in the 1880, 1900, 1910, and 1920 US Censuses.

Soundex is still very much in use today, and there is a neat Soundex converter that lets one quickly obtain the one-letter, three-digit alphanumeric code for any surname entered into it.

Essentially a phonetic algorithm for indexing names by sound, Soundex encodes homophonous names with the same alphanumeric representation so that they can be correlated despite differences in spelling.
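For readers who would like to see the mechanics, here is a minimal Python sketch of the commonly described American Soundex rules: keep the first letter, map the remaining consonants to six digit classes, collapse adjacent letters of the same class, and pad or truncate to three digits.  The names `soundex` and `CODES` are my own, and the handling of hyphens, spaces, and other non-letters (they are simply dropped) is an assumption; any particular online converter may differ in such edge cases.

```python
# Digit classes used by American Soundex; vowels, h, w and y get no digit.
CODES = {}
for digit, group in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return a one-letter + three-digit Soundex code, or "" if no letters."""
    # Assumption: anything that is not a-z (hyphens, spaces, "~", ".") is ignored.
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""

    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's class counts for collapsing
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters of the same class
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated class is coded again

    # Pad with zeros, then keep the letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex("Zumbrun"))      # Z516
print(soundex("Zumbrunnen"))   # Z516
print(soundex("Zum Pwunnen"))  # Z515
```

Under these rules Zumbrun and Zumbrunnen land on the same code, while the Zum Pwunnen spelling misses by one digit because P and B happen to share a class but the following letters do not line up.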

Metaphone is an improved version of Soundex that was invented in 1990 and that takes into account irregularities in English spelling and pronunciation.  The latest version, Metaphone 3, was brought out in 2009 and "achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States, having been developed according to modern engineering standards against a test harness of prepared correct encodings." (Wikipedia)

I thought that I'd give Soundex a try on a controlled body of material.  I've long been aware that there are numerous different ways to spell the name Shakespeare.  In an article entitled "The Spelling and Pronunciation of Shakespeare's Name", David Kathman brings together many of these variants.  Here are the Soundex results I obtained when I entered the variants into the software:

non-literary references

Shakespeare     S221
Shakespere      S221
Shakespear      S221
Shakspeare      S216
Shackspeare     S216
Shakspere       S216
Shackespeare    S221
Shackspere      S216
Shackespere     S221
Shaxspere       S216
Shexpere        S216
Shakspe~        S210
Shaxpere        S216
Shagspere       S216
Shaksper        S216
Shaxpeare       S216
Shaxper         S216
Shake-speare    S221
Shakespe        S221
Shakp           S210

literary references

Shakespeare     S221
Shake-speare    S221
Shakspeare      S216
Shaxberd        S216
Shakespere      S221
Shakespear      S221
Shak-speare     S216
Shakspear       S216
Shakspere       S216
Shaksper        S216
Schaksp.        S210
Shakespheare    S221
Shakespe        S221
Shakspe         S210

Since all of these spellings refer to the same name, ideally they should all have yielded the same alphanumeric code.  It is encouraging, though, that most of the variants come out as S221 or S216, while there are only 4 occurrences of S210.  I have not run all of the variants through Metaphone, though I presume that it would do an even better job than Soundex.

Nevertheless, we should not be so naive as to believe that Soundex and Metaphone can do our genealogical research for us, since they are only meant to recognize patterns that we might otherwise overlook.  For example, the alphanumeric Soundex code for "Mair" is M600, but the same code is also applied to the following long list of names: MAHAR | MAHER | MAHR | MAIER | MAIR | MARIA | MARIE | MARR | MARROW | MARY | MAURY | MAYER | MAYOR | MEIER | MERRIHEW | MERRY | MEYER | MIR | MOHR | MOIR | MOOR | MOORE | MORA | MORE | MOREAU | MOREY | MORR | MORROW | MOWER | MOWERY | MOWRY | MOYER | MUIR | MURIE | MURR | MURRAH | MURRAY | MURRIE | MURROW | MURRY | MYER | MYHRE.  I'm certain that these are not all variants of the same name.
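The M600 pile-up is easy to reproduce.  The sketch below assumes the standard American Soundex rules and uses helper names of my own (`soundex`, `bucket_by_code`); it groups a handful of names by code so that the false positives show up as one crowded bucket.  "Miller" is included only as a contrasting name that falls outside M600.

```python
from collections import defaultdict

# Compact American Soundex, restated here so the example runs on its own.
_GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
_CODE = {ch: str(i) for i, grp in enumerate(_GROUPS, start=1) for ch in grp}


def soundex(name):
    # Assumption: characters other than a-z are dropped before coding.
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    digits, prev = [], _CODE.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":  # h and w never separate same-class letters
            continue
        code = _CODE.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def bucket_by_code(names):
    """Map each Soundex code to the names that share it, in input order."""
    buckets = defaultdict(list)
    for name in names:
        buckets[soundex(name)].append(name)
    return dict(buckets)


names = ["Mair", "Maier", "Meyer", "Murray", "Morrow", "Merrihew", "Miller"]
for code, group in bucket_by_code(names).items():
    print(code, group)
# M600 ['Mair', 'Maier', 'Meyer', 'Murray', 'Morrow', 'Merrihew']
# M460 ['Miller']
```

Because every vowel, h, w, and y is discarded, any name of the shape M + vowels + R + vowels collapses to M600, which is why the bucket is so crowded.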

On the other hand, although "Mair" and "Maier" are variants of the same surname, that is not the end of the story either.  Before I went to my ancestral village of Pfaffenhofen, Austria in 1967, I had always assumed that "Mair" was an Anglicization of "Maier" or some other spelling of the German surname (e.g., Meyer, Meier, Mayer, Maier, Mier, Meir).  Indeed, many people used to ask me if I were related to Lucy Mair, the British anthropologist, but I knew that could not be so because her name was of Scots or English origin, while mine was of German derivation.  It is interesting that I am listed in Wikipedia as being a person with the surname Mair in a Scots context, though I'm sure that it won't be long after this post goes up that the Wikipedia editors shift me to the much smaller group of people named Mair in a German context.  In any event, when I went to Pfaffenhofen, I discovered that there were many individuals whose surname in the church record books and on tombstones was given as "Mair", and in the Innsbruck phonebook there were scores of people surnamed "Mair".  Even more surprising to me was that it was not uncommon for families to change their name from "Maier" (or some other spelling) to "Mair" and vice versa, depending upon fashion or personal preference.

For those who might be curious, the German surname "Mair" derives from Middle High German meiger, meaning "higher or superior", often used for stewards of landholders or great farmers or leaseholders; today a Meier is generally a dairy farmer. Meier and Meyer are used more often in Northern Germany, while Maier and Mayer are found more frequently in Southern Germany.  (This note is based upon this entry in genealogy.about.com.)

The main purpose of this post, however, is not to engage in genealogical investigations of the surname "Mair", but to bring the Soundex and Metaphone algorithms to the attention of Language Log readers and to suggest that they might prove useful for linguistic research quite apart from genealogy.



16 Comments

  1. Eric TF Bat said,

    February 5, 2012 @ 1:53 am

    My understanding of Soundex is that it should weed out duplicates, so S221 is theoretically invalid. I'm betting it's actually S2216-something, so if your implementation were following what I understand to be the correct algorithm, you'd get much better results.

  2. tudza said,

    February 5, 2012 @ 2:04 am

    Used to generate portions of driver's license numbers in some states wasn't it?

  3. Antariksh Bothale said,

    February 5, 2012 @ 2:46 am

    This is a major problem in India where the system of romanization is far from standardized. In fact, I am currently developing a phonetic code similar in spirit to Soundex (but quite different in implementation) that can be used for spell-checking Indian proper nouns.

  4. Anton Sherwood said,

    February 5, 2012 @ 2:47 am

    Yes, my Illinois DL number (issued in 1977) was S630-xxxy-yddd, where S630 is the Soundex code for SheRwooD; xxx are arbitrary digits; yy is the year of my birth, mod 100; and ddd is 31*(month-1)+dayofmonth.

  5. Rick Sprague said,

    February 5, 2012 @ 8:03 am

    I remember using Soundex some 30 years ago to build a consolidated mailing list for a state political party. The source lists came from diverse organizations and local chapters, so there was no consistency in ordering and abbreviating honorifics, street addresses, etc. and in particular many variant name spellings. Our job was to parse unstructured names and addresses into postal format and to eliminate duplicates.

    The Soundex algorithm itself is as Victor described it, generating just a letter + 3-digit code, and yields many false positives but far fewer false negatives. In our case we used a combination of Soundex codes for first and last name, street name, and city, plus encoded state (in those days standardized postal abbreviations for states were fairly new and not consistently used). With such combinations, you can eliminate virtually all the false positives without adding too many false negatives (which in our situation represented duplicate mailings to a household, so they were considered pretty harmless).

    At the time I thought it was ironic that we used an information-losing algorithm to improve our data quality, but in retrospect I see that this simply parallels what biological sensory systems regularly do, so it makes sense.

  6. Fred said,

    February 5, 2012 @ 8:09 am

    Libyan friends used to tell me that Shakespeare was almost certainly a Libyan called 'Sheik El-Sgbir', but my Egyptian friends would then disagree…

  7. Peter Christian said,

    February 5, 2012 @ 8:11 am

    Before anyone gets too enthusiastic about Soundex, could I draw your attention to my 1998 article, Soundex – can it be improved?, which looks at some of its failings.

    A much better system for English names is my colleague Steve Archer's Nominex, which attempts to incorporate the sound-to-spelling rules of English more comprehensively and systematically.

  8. BK said,

    February 5, 2012 @ 8:21 am

    Coincidentally, you've created an alternate spelling of David Kathman's name.

    [VHM: fixed now; was mistyped as Katham]

  9. Glenn Bingham said,

    February 5, 2012 @ 10:00 am

    My obviously British name of Bingham (a town in Nottinghamshire, supposedly named for the Norman Binne family and solidified back to at least 1362 when Richard de Bingham was the Sheriff of Nottingham) appears in US records in the 1700s and early 1800s as Bingham, Byngham, Bingam, Bingum, Bingman, Bingmen (in accord with my great-great-great-grandfather's signature), Bingaman, Bingeman, Bing#m#n (where # marks vowels written over other vowels), and, in the oldest supposed record, Bingorman. The oldest I can trace, at least tentatively, was in German-settled Reading, PA, USA, where there is still a Bingaman Street. It appears from various circumstantial evidence that the family moved across the Delaware River into NJ 1750ish, where British scribes, at work with a mostly illiterate population, attempted to spell the names phonetically, resulting in all the other forms.

    When I coupled the given names of the second known generation in NJ, the sons of John Bingham/Bingman, who were John, Peter, Jacob, and Matthias, with the disappearance of Johannes Bingaman in Reading, it became obvious that it was time to trade in my teacups for beer steins and change my family heritage. Uncle Harry, whose middle name is Souder, questions me every time he sees me, "So we are German and not British?" Two of the four brothers moved West, and most of my relatives in the West are named 'Bingman' today. The other Binghams are of British (CT > west) or Irish (VA > KY) descent.

    Soundex yields B525 for all the forms, including the one with written-over vowels, except "Bingorman," which yields B526. Pretty good! The Metaphone versions, however, show that newer is not necessarily better:

    Bingham, Byngham, BNM
    Bingam, Bingum, BNKM
    Bingman, Bingmen, Bingaman, Bing#m#n, BNKMN
    Bingeman, BNJMN (I have seen 'Benjamin' spelled 'Bingeman')
    Bingorman, BNKRMN

    Most of my German ancestors also suffered Anglicization in NJ: Wolpaert > Woolford; Bechtle > Beckley; Johannes Balthaser Harph > John Balzer Hurff; Kurmann > Carman, etc.

  10. John Roth said,

    February 5, 2012 @ 10:39 am

    Sherwood – that xxx is not arbitrary. It's a dictionary-based encoding of your first name. The rest of the code is as I remember it from when I worked on driver's licenses for an automobile insurer.

  11. julie lee said,

    February 5, 2012 @ 1:19 pm

    Re. Fred's mention about Shakespeare being Arab:
    Ali Mazrui wrote in one of his books that Homer may have been an African, "Homer" another spelling for "Omar".

  12. Dave M said,

    February 5, 2012 @ 6:47 pm

    As a fellow member of M600 [MAIER subdivision], I very much appreciated this post. It is indeed bizarre that Soundex thinks that Morrow, Murray and Merrihew belong among us. I remember one guy in high school who kept calling me "Miller," and that makes more sense to me than "Morrow."

    We've always wondered where the name came from (and as you say, Great-grandpa was indeed from south Germany), thinking possibly it had something to do with the month of May (= Mai). So it's actually from the word for "superior." Well, that makes sense, wouldn't you agree?

  13. tpr said,

    February 6, 2012 @ 6:33 am

    For the programmers among you, PHP has inbuilt functions for generating soundex and metaphone codes, as well as other functions for testing the similarity of strings.

  14. a George said,

    February 6, 2012 @ 12:30 pm

    I knew Hans Karlgren from SKRIPTOR, a company he founded, when he began using phonetic searches for similarities between pronounceable trademarks – so-called word marks. Some of the work was based on research by Benny Brodda. But I have to confess that Soundex was never mentioned when my ears were open.

  15. David Brooks said,

    February 6, 2012 @ 4:50 pm

    I used Soundex in my first job in 1971. It was a hospital appointment booking system; Soundex helped you find a patient in the system when you might have a mis-spelling in front of you, or heard on the phone.

  16. Jaem said,

    February 6, 2012 @ 10:24 pm

    For much earlier photography in China, check out Felice Beato, whose China work is from 1860. The Getty has a bunch of his work and did a book.
    http://www.getty.edu/art/exhibitions/beato/
