How similar are these two strings? Does string X sound like string Y? Could they be duplicates? Is the difference between string N and string M just a typo? There are many scenarios in enterprise software development where the answers to these questions can be highly significant.
In my search to answer such questions with code, the most helpful resource I’ve ever found was presented by George M. Brown on www.codeguru.com who implemented four well known and powerful fuzzy string matching algorithms in VBA for Access a few years ago. In an effort to convert these algorithms to C#, I found two alternatives that saved me some time (see below).
The four algorithms, with requisite Wikipedia links, are:
Each of the algorithms have been implemented here as extensions to the string and string array. Before we get to the code, let’s take a look at some results. Here are two very simplistic tests. The algorithms are not combined in any way. You can experiment with them and create your own secret sauce.
Name Matching for (mispelled deliberately): "Jensn"
The first test result set presents the raw output of the algorithms on a mispelled surname (mine) against a list of other surnames. I’ve highlighted the best score.
Dice Coefficient for Jensn:
.00000 against Adams
.46154 against Benson
.00000 against Geralds
.37500 against Johannson
.42857 against Johnson
.76923 against Jensen
.30769 against Jordon
.30769 against Madsen
.00000 against Stratford
.14286 against Wilkins
Levenshtein Edit Distance for Jensn:
4 against Adams
2 against Benson
5 against Geralds
5 against Johannson
3 against Johnson
1 against Jensen
4 against Jordon
4 against Madsen
8 against Stratford
6 against Wilkins
Longest Common Subsequence for Jensn:
.04000, s against Adams
.33333, ensn against Benson
.05714, es against Geralds
.08889, jnsn against Johannson
.17143, jnsn against Johnson
.56667, jensn against Jensen
.06667, jn against Jordon
.13333, en against Madsen
.02222, s against Stratford
.11429, ns against Wilkins
Double Metaphone for Jensn: ANSN
ATMS metaphone for Adams
PNSN metaphone for Benson
JRLT metaphone for Geralds
AHNS metaphone for Johannson
ANSN metaphone for Johnson
ANSN metaphone for Jensen
ARTN metaphone for Jordon
MTSN metaphone for Madsen
STTR metaphone for Stratford
FLKN metaphone for Wilkins
Address Matching for "2130 South Fort Union Blvd."
The second test is the same code run against multiple addresses trying to match a given address. Let’s see how it turned out.
Dice Coefficient for 2130 South Fort Union Blvd.:
.16000 against 2689 East Milkin Ave.
.10000 against 85 Morrison
.27273 against 2350 North Main
.07843 against 567 West Center Street
.66667 against 2130 Fort Union Boulevard
.61538 against 2310 S. Ft. Union Blvd.
.42553 against 98 West Fort Union
.12245 against Rural Route 2 Box 29
.05000 against PO Box 3487
.04444 against 3 Harvard Square
Levenshtein Edit Distance for 2130 South Fort Union Blvd.:
18 against 2689 East Milkin Ave.
22 against 85 Morrison
18 against 2350 North Main
22 against 567 West Center Street
11 against 2130 Fort Union Boulevard
8 against 2310 S. Ft. Union Blvd.
14 against 98 West Fort Union
19 against Rural Route 2 Box 29
22 against PO Box 3487
22 against 3 Harvard Square
Longest Common Subsequence for 2130 South Fort Union Blvd.:
.02116, 2 st in v. against 2689 East Milkin Ave.
.02020, son against 85 Morrison
.04444, 230 oth in against 2350 North Main
.01010, st t against 567 West Center Street
.25481, 2130 fort union blvd against 2130 Fort Union Boulevard
.25765, 230 s ft union blvd. against 2310 S. Ft. Union Blvd.
.25514, st fort union against 98 West Fort Union
.02593, out o against Rural Route 2 Box 29
.01347, o o against PO Box 3487
.01389, 3 hrvd against 3 Harvard Square
Double Metaphone for 2130 South Fort Union Blvd.: STFR
STML metaphone for 2689 East Milkin Ave.
MRSN metaphone for 85 Morrison
NRTM metaphone for 2350 North Main
SSNT metaphone for 567 West Center Street
FRTN metaphone for 2130 Fort Union Boulevard
SFTN metaphone for 2310 S. Ft. Union Blvd.
STFR metaphone for 98 West Fort Union
RRLR metaphone for Rural Route 2 Box 29
PPKS metaphone for PO Box 3487
RFRT metaphone for 3 Harvard Square
As you can see, the double metaphone algorithm may not be as useful on its own as the other algorithms. But when you put them together in creative ways, you’ll get a very powerful result.
Here’s how easy the algorithms, as extension methods, are to use.
private static void NameMatching()
{
//name matching
string input = "Jensn";
string[] surnames = new string[] {
"Adams",
"Benson",
"Geralds",
"Johannson",
"Johnson",
"Jensen",
"Jordon",
"Madsen",
"Stratford",
"Wilkins"
};
Console.WriteLine("Dice Coefficient for Jensn:");
foreach (var name in surnames)
{
double dice = input.DiceCoefficient(name);
Console.WriteLine("\t{0} against {1}",
dice.ToString("###,###.00000"), name);
}
Console.WriteLine();
Console.WriteLine("Levenshtein Edit Distance for Jensn:");
foreach (var name in surnames)
{
int leven = input.LevenshteinDistance(name);
Console.WriteLine("\t{0} against {1}", leven, name);
}
Console.WriteLine();
Console.WriteLine("Longest Common Subsequence for Jensn:");
foreach (var name in surnames)
{
var lcs = input.LongestCommonSubsequence(name);
Console.WriteLine("\t{0}, {1} against {2}",
lcs.Item2.ToString("###,###.00000"), lcs.Item1, name);
}
Console.WriteLine();
string mp = input.ToDoubleMetaphone();
Console.WriteLine("Double Metaphone for Jensn: {0}", mp);
foreach (var name in surnames)
{
string nameMp = name.ToDoubleMetaphone();
Console.WriteLine("\t{0} metaphone for {1}", nameMp, name);
}
}
Source References
In researching and finding these algorithms, I poured over what seemed like hundreds of articles, blogs and open source resources. In the end, I settled on three primary sources and made certain modifications to suite my own needs. Here they are.
Additional references:
I recommend you spend time with these sources and doing your own research. Undoubtedly you will come up with even better algorithms. When you do, please share them with us here.
Update: code now on GitHub