Phonetic Matching is based on the principal that pronunciation depends on the language. So the first step is to determine the language from the spelling of the name. Then the name is converted into a sequence of phonetic tokens using pronunciation rules specific to that particular language. And, finally, names are compared based on their phonetic-token sequence.
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as DB2, PostgreSQL, MySQL, Ingres,MS SQL Server and Oracle) and is often used (incorrectly) as a synonym for “phonetic algorithm”.
The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.
The correct value can be found as follows:
- Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
- Replace consonants with digits as follows (after the first letter):
- b, f, p, v → 1
- c, g, j, k, q, s, x, z → 2
- d, t → 3
- l → 4
- m, n → 5
- r → 6
- If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
- Iterate the previous step until you have one letter and three numbers. If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.
Using this algorithm, both “Robert” and “Rupert” return the same string “R163” while “Rubin” yields “R150”. “Ashcraft” and “Ashcroft” both yield “A261” and not “A226” (the chars ‘s’ and ‘c’ in the name would receive a single number of 2 and not 22 since an ‘h’ lies in between them). “Tymczak” yields “T522” not “T520” (the chars ‘z’ and ‘k’ in the name are coded as 2 twice since a vowel lies in between them). “Pfister” yields “P236” not “P123” (the first two letters have the same number and are coded once as ‘P’).
The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System (now a part of the New York State Division of Criminal Justice Services). It features an accuracy increase of 2.7% over the traditional Soundex algorithm.
The algorithm, as described in Name Search Techniques, is:
- Translate first characters of name: MAC → MCC, KN → N, K → C, PH, PF → FF, SCH → SSS
- Translate last characters of name: EE → Y, IE → Y, DT, RT, RD, NT, ND → D
- First character of key = first character of name.
- Translate remaining characters by following rules, incrementing by one character each time:
- EV → AF else A, E, I, O, U → A
- Q → G, Z → S, M → N
- KN → N else K → C
- SCH → SSS, PH → FF
- H → If previous or next is non-vowel, previous.
- W → If previous is vowel, A.
- Add current to key if current is not same as the last key character.
- If last character is S, remove it.
- If last characters are AY, replace with Y.
- If last character is A, remove it.
- Append translated key to value from step 3 (removed first character)
- If longer than 6 characters, truncate to first 6 characters. (only needed for true NYSIIS, some versions use the full key)
Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar sounding words should share the same keys.
Original Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The ‘0’ represents “th” (as an ASCII approximation of Θ), ‘X’ represents “sh” or “ch“, and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code. This table summarizes most of the rules in the original implementation:
- Drop duplicate adjacent letters, except for C.
- If the word begins with ‘KN’, ‘GN’, ‘PN’, ‘AE’, ‘WR’, drop the first letter.
- Drop ‘B’ if after ‘M’ at the end of the word.
- ‘C’ transforms to ‘X’ if followed by ‘IA’ or ‘H’ (unless in latter case, it is part of ‘-SCH-‘, in which case it transforms to ‘K’). ‘C’ transforms to ‘S’ if followed by ‘I’, ‘E’, or ‘Y’. Otherwise, ‘C’ transforms to ‘K’.
- ‘D’ transforms to ‘J’ if followed by ‘GE’, ‘GY’, or ‘GI’. Otherwise, ‘D’ transforms to ‘T’.
- Drop ‘G’ if followed by ‘H’ and ‘H’ is not at the end or before a vowel. Drop ‘G’ if followed by ‘N’ or ‘NED’ and is at the end.
- ‘G’ transforms to ‘J’ if before ‘I’, ‘E’, or ‘Y’, and it is not in ‘GG’. Otherwise, ‘G’ transforms to ‘K’.
- Drop ‘H’ if after vowel and not before a vowel.
- ‘CK’ transforms to ‘K’.
- ‘PH’ transforms to ‘F’.
- ‘Q’ transforms to ‘K’.
- ‘S’ transforms to ‘X’ if followed by ‘H’, ‘IO’, or ‘IA’.
- ‘T’ transforms to ‘X’ if followed by ‘IA’ or ‘IO’. ‘TH’ transforms to ‘0’. Drop ‘T’ if followed by ‘CH’.
- ‘V’ transforms to ‘F’.
- ‘WH’ transforms to ‘W’ if at the beginning. Drop ‘W’ if not followed by a vowel.
- ‘X’ transforms to ‘S’ if at the beginning. Otherwise, ‘X’ transforms to ‘KS’.
- Drop ‘Y’ if not followed by a vowel.
- ‘Z’ transforms to ‘S’.
- Drop all vowels unless it is the beginning.
It should be noted, however, that this table does not constitute a complete description of the original Metaphone algorithm, and the algorithm cannot be coded correctly from it. Original Metaphone contained many errors and was superseded by Double Metaphone, and in turn Double Metaphone and original Metaphone were superseded by Metaphone 3, which corrects thousands of miscodings that will be produced by the first two versions.
The Double Metaphone phonetic encoding algorithm is the second generation of this algorithm. Its implementation was described in the June 2000 issue of C/C++ Users Journal. It makes a number of fundamental design improvements over the original Metaphone algorithm.
It is called “Double” because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name “Smith” yields a primary code of SM0 and a secondary code of XMT, while the name “Schmidt” yields a primary code of XMT and a secondary code of SMT—both have XMT in common.
Double Metaphone tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origin. Thus it uses a much more complex ruleset for coding than its predecessor; for example, it tests for approximately 100 different contexts of the use of the letter C alone.
To implement Metaphone without purchasing a (source code) copy of Metaphone 3, the best guide would be the reference implementation of Double Metaphone, which may be found here.
- Metaphone 3
Metaphone 3 is the third generation of the Metaphone algorithm. It increases the accuracy of phonetic encoding from the 89% of Double Metaphone to 98%, as tested against a database of the most common English words, and names and non-English words familiar in North America. This produces an extremely reliable phonetic encoding for American pronunciations.
Metaphone 3 was designed and developed by Lawrence Philips, who designed and developed the original Metaphone and Double Metaphone algorithms.
A professional version was released in October 2009, developed by the same author, Lawrence Philips. It is a commercial product but is sold as source code. Metaphone 3 further improves phonetic encoding of words in the English language, non-English words familiar to Americans, and first names and family names commonly found in the United States. It improves encoding for proper names in particular to a considerable extent. The author claims that in general it improves accuracy for all words from the approximately 89% of Double Metaphone to 98%. Developers can also now set switches in to code to cause the algorithm to encode Metaphone keys 1) taking non-initial vowels into account, as well as 2) encoding voiced and unvoiced consonants differently. This allows the result set to be more closely focused if the developer finds that the search results include too many words that don’t resemble the search term closely enough. Metaphone 3 is sold as C++, Java, C#, PHP, Perl, and PL/SQL source, as well as Metaphone 3 for Spanish and German available as Java source. The latest revision of the Metaphone 3 algorithm is v2.5.2, released February 2015.