Support DESK

Follow

Appendix B - Advanced Configuration Guide - Fuzzy Algorithms

Previous Article matchIT Hub Index Next Article 

matchIT Fuzzy

The default matchIT algorithm for fuzzy string matching allows for only 1 error of one of the following types: insertions, deletions, substitutions, transpositions, and wide transpositions. Where: a transposition is where two adjacent characters are swapped, and a wide transposition is where two non-adjacent characters are swapped.

The matchIT Fuzzy compare algorithm is equivalent to Damerau-Levenshtein with an edit distance of 1, except that Damerau-Levenshtein doesn't allow for wide transpositions.

Damerau-Levenshtein

This is an extension to the original Levenshtein distance algorithm that handles transpositions as well as insertions, deletions, and substitutions. Both algorithms are in the public domain.

The algorithms produce a string metric ("edit distance") for measuring the similarity between two strings (i.e. the number of changes to get from one string to the other). matchIT then produces a fuzzy score between 0 and 1, which is a measure of the edit distance in relation to the length of the longer string. Two configuration settings provide fine tuning of fuzzy matching Compare | Fuzzy | MaximumEditDistance and Compare | Fuzzy | MinumumScore.

Each string comparison must satisfy both constraints. See below for some examples that provide a rational for both constraints.

Examples

Using the default settings, these strings will be matched:

Type First word Second word Edit distance Score
Identical JO JO 0 1
Identical JOHN JOHN 0 1
Identical JONATHAN JONATHAN 0 1
Fuzzy (reversal) JO OJ 1 0.5
Fuzzy (insertion) AB ABC 1 0.667
Fuzzy (substitution) (#) A B 1 0.5
Fuzzy (substitution) JOHN JAHN 1 0.75
Fuzzy (transposition) JOHN JHON 1 0.75
Fuzzy (insertion) REBECCA REBECCCA 1 0.875

An exception is made for single-letter matches. matchIT only allows single-letter matches if they’re similar (i.e. G and J, S and F, M and N), and does not compare them using Damerau-Levenshtein.

Using the default settings, the following strings will not be matched – either because the edit distance exceeds the maximum (1) or the score hasn’t reached the minimum (0.5):

Type First word Second word Edit distance Score
Fuzzy (insertions) (*) ANTON ANTHONY 2 0.714
Fuzzy (insertions) (*) REBECCA REBBECCCA 2 0.778
Fuzzy (substitutions) (*) LEONARD LOENRAD 2 0.857
Fuzzy (reversal) (*) JON NOJ 2 0.333
Fuzzy (reversal) JOHN NHOJ 3 0.25
Different (+) JO ED 2 0
Different DAVE ED 3 0.25
Different JOHN DAVE 4 0
Different NOTTINGHAM GLOUCESTER 10 0
Containment NOTTINGHAM NOTTINGHAMSHIRE 5 0.667

By increasing the MinimumEditDistance to 2, the three fuzzy matches marked (*) will be found. However, the row marked with (+) still won’t be matched, even though its edit distance is 2, because its score is below the minimum of 0.5. Hence the need for both configuration options.

(Of course, some of these matches are going to be found because they’re phonetically identical or similar (e.g. Rebecca and Rebbeccca), these examples are just to illustrate what can be achieve by using Damerau-Levenshtein regardless of phoneticization.)

Previous Article matchIT Hub Index Next Article 
Was this article helpful?
0 out of 0 found this helpful

0 Comments

Please sign in to leave a comment.