tsJensen

A quest for software excellence...

Four Functions for Finding Fuzzy String Matches in C# Extensions

How similar are these two strings? Does string X sound like string Y? Could they be duplicates? Is the difference between string N and string M just a typo? There are many scenarios in enterprise software development where the answers to these questions can be highly significant.

In my search to answer such questions with code, the most helpful resource I’ve ever found was presented by George M. Brown on www.codeguru.com who implemented four well known and powerful fuzzy string matching algorithms in VBA for Access a few years ago. In an effort to convert these algorithms to C#, I found two alternatives that saved me some time (see below).

The four algorithms, with requisite Wikipedia links, are:

Each of the algorithms have been implemented here as extensions to the string and string array. Before we get to the code, let’s take a look at some results. Here are two very simplistic tests. The algorithms are not combined in any way. You can experiment with them and create your own secret sauce. 

Name Matching for (mispelled deliberately): "Jensn"
The first test result set presents the raw output of the algorithms on a mispelled surname (mine) against a list of other surnames. I’ve highlighted the best score.

Dice Coefficient for Jensn:
    .00000 against Adams
    .46154 against Benson
    .00000 against Geralds
    .37500 against Johannson
    .42857 against Johnson
    .76923 against Jensen
    .30769 against Jordon
    .30769 against Madsen
    .00000 against Stratford
    .14286 against Wilkins

Levenshtein Edit Distance for Jensn:
    4 against Adams
    2 against Benson
    5 against Geralds
    5 against Johannson
    3 against Johnson
    1 against Jensen
    4 against Jordon
    4 against Madsen
    8 against Stratford
    6 against Wilkins

Longest Common Subsequence for Jensn:
    .04000, s against Adams
    .33333, ensn against Benson
    .05714, es against Geralds
    .08889, jnsn against Johannson
    .17143, jnsn against Johnson
    .56667, jensn against Jensen
    .06667, jn against Jordon
    .13333, en against Madsen
    .02222, s against Stratford
    .11429, ns against Wilkins

Double Metaphone for Jensn: ANSN
    ATMS metaphone for Adams
    PNSN metaphone for Benson
    JRLT metaphone for Geralds
    AHNS metaphone for Johannson
    ANSN metaphone for Johnson
    ANSN metaphone for Jensen
    ARTN metaphone for Jordon
    MTSN metaphone for Madsen
    STTR metaphone for Stratford
    FLKN metaphone for Wilkins

Address Matching for "2130 South Fort Union Blvd."
The second test is the same code run against multiple addresses trying to match a given address. Let’s see how it turned out.

Dice Coefficient for 2130 South Fort Union Blvd.:
    .16000 against 2689 East Milkin Ave.
    .10000 against 85 Morrison
    .27273 against 2350 North Main
    .07843 against 567 West Center Street
    .66667 against 2130 Fort Union Boulevard
    .61538 against 2310 S. Ft. Union Blvd.
    .42553 against 98 West Fort Union
    .12245 against Rural Route 2 Box 29
    .05000 against PO Box 3487
    .04444 against 3 Harvard Square

Levenshtein Edit Distance for 2130 South Fort Union Blvd.:
    18 against 2689 East Milkin Ave.
    22 against 85 Morrison
    18 against 2350 North Main
    22 against 567 West Center Street
    11 against 2130 Fort Union Boulevard
    8 against 2310 S. Ft. Union Blvd.
    14 against 98 West Fort Union
    19 against Rural Route 2 Box 29
    22 against PO Box 3487
    22 against 3 Harvard Square

Longest Common Subsequence for 2130 South Fort Union Blvd.:
    .02116, 2 st in v. against 2689 East Milkin Ave.
    .02020,  son against 85 Morrison
    .04444, 230 oth in against 2350 North Main
    .01010,  st t  against 567 West Center Street
    .25481, 2130 fort union blvd against 2130 Fort Union Boulevard
   .25765, 230 s ft union blvd. against 2310 S. Ft. Union Blvd.
    .25514,  st fort union against 98 West Fort Union
    .02593,  out  o  against Rural Route 2 Box 29
    .01347, o o  against PO Box 3487
    .01389, 3 hrvd against 3 Harvard Square

Double Metaphone for 2130 South Fort Union Blvd.: STFR
    STML metaphone for 2689 East Milkin Ave.
    MRSN metaphone for 85 Morrison
    NRTM metaphone for 2350 North Main
    SSNT metaphone for 567 West Center Street
    FRTN metaphone for 2130 Fort Union Boulevard
    SFTN metaphone for 2310 S. Ft. Union Blvd.
    STFR metaphone for 98 West Fort Union
    RRLR metaphone for Rural Route 2 Box 29
    PPKS metaphone for PO Box 3487
    RFRT metaphone for 3 Harvard Square

As you can see, the double metaphone algorithm may not be as useful on its own as the other algorithms. But when you put them together in creative ways, you’ll get a very powerful result.

Here’s how easy the algorithms, as extension methods, are to use.

private static void NameMatching()
{
	//name matching
	string input = "Jensn";
	string[] surnames = new string[] { 
		"Adams",
		"Benson",
		"Geralds",
		"Johannson",
		"Johnson",
		"Jensen",
		"Jordon",
		"Madsen",
		"Stratford",
		"Wilkins"
		};

	Console.WriteLine("Dice Coefficient for Jensn:");
	foreach (var name in surnames)
	{
		double dice = input.DiceCoefficient(name);
		Console.WriteLine("\t{0} against {1}", 
			dice.ToString("###,###.00000"), name);
	}

	Console.WriteLine();
	Console.WriteLine("Levenshtein Edit Distance for Jensn:");
	foreach (var name in surnames)
	{
		int leven = input.LevenshteinDistance(name);
		Console.WriteLine("\t{0} against {1}", leven, name);
	}

	Console.WriteLine();
	Console.WriteLine("Longest Common Subsequence for Jensn:");
	foreach (var name in surnames)
	{
		var lcs = input.LongestCommonSubsequence(name);
		Console.WriteLine("\t{0}, {1} against {2}", 
			lcs.Item2.ToString("###,###.00000"), lcs.Item1, name);
	}

	Console.WriteLine();
	string mp = input.ToDoubleMetaphone();
	Console.WriteLine("Double Metaphone for Jensn: {0}", mp);
	foreach (var name in surnames)
	{
		string nameMp = name.ToDoubleMetaphone();
		Console.WriteLine("\t{0} metaphone for {1}", nameMp, name);
	}
}

Source References
In researching and finding these algorithms, I poured over what seemed like hundreds of articles, blogs and open source resources. In the end, I settled on three primary sources and made certain modifications to suite my own needs. Here they are.

Additional references:

I recommend you spend time with these sources and doing your own research. Undoubtedly you will come up with even better algorithms. When you do, please share them with us here.

Update: code now on GitHub