Fast edit distance function for text comparison
Syntax: =\NGRAMS.DISTANCE(text1, text2, [n])
=LAMBDA(text1, text2, [n],
    LET(
        n_sel, IF(OR(ISOMITTED(n), n < 1), 2, n),
        ngrams_f, LAMBDA(text,
            MID(text, SEQUENCE(LEN(text) - n_sel + 1), n_sel)
        ),
        pad, REPT("_", n_sel - 1),
        b, VSTACK(ngrams_f(pad & text1 & pad), ngrams_f(pad & text2 & pad)),
        c, VSTACK(SEQUENCE(LEN(text1) + n_sel - 1, 1, 1, 0), SEQUENCE(LEN(text2) + n_sel - 1, 1, -1, 0)),
        1 - REDUCE(0, UNIQUE(b), LAMBDA(acc, val, acc + ABS(SUM(FILTER(c, b = val))))) / ROWS(c)
    )
)
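To use it, save the formula above in Name Manager under the name \NGRAMS.DISTANCE shown in the syntax line, then call it like any built-in function. The strings below are only illustrative; both calls return a score between 0 and 1, the first using the default bi-grams, the second using tri-grams:

=\NGRAMS.DISTANCE("kitten", "sitting")
=\NGRAMS.DISTANCE("kitten", "sitting", 3)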
Documentation
The purpose of this function is similar to \LEVENSHTEIN, but it runs much faster: O(n+m) time instead of the O(n²) required by unbounded Levenshtein.
The function computes the sets of n-grams (bi-grams, i.e. 2-grams, by default) for the texts being compared: it splits each string into overlapping runs of n consecutive characters (shingles) and scores the overlap of the two sets as the number of matching elements divided by the total number of elements, a Jaccard-style comparison.
This particular implementation takes multiple occurrences of each shingle into account (so the sets aren't strictly sets) and pads the beginning and end of each string: "dog" becomes "_d", "do", "og", "g_", and "mama" becomes "_m", "ma" ×2, "am", "a_".
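The shingling step can be tested on its own. The stand-alone formula below mirrors the ngrams_f helper together with the padding for the default n = 2 (the name padded is only for this illustration), applied to "dog" from the example above, and spills the bi-grams _d, do, og, g_ into a column:

=LET(
    padded, "_" & "dog" & "_",
    MID(padded, SEQUENCE(LEN(padded) - 1), 2)
)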
Because the comparison inside FILTER uses a simple = (equals) operator, the function is not case-sensitive.
NOTE: with n=2 (the default), the function treats strings that differ only in word order as identical, as long as the first and last words are the same, because reordering the interior words does not change the multiset of bi-grams.
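Both of these notes are easy to verify: with the formula as written, each of the pairs below produces exactly the same score as comparing a string with itself. In the first pair only the letter case differs; in the second, only the order of the interior words:

=\NGRAMS.DISTANCE("Excel", "EXCEL")
=\NGRAMS.DISTANCE("mary had a little lamb", "mary little a had lamb")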
Blog
This method yields surprisingly good results for text comparisons. The only reason it is not a viable replacement for Levenshtein in every use case is that the metric it returns is not as intuitively explainable as a count of edits.
The speed of the function opens the door to writing a fuzzy match function that works on large data sets.
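One possible sketch: the formula below scores a single lookup value against a column of candidate strings and returns the best-scoring candidate. It assumes the LAMBDA is saved as \NGRAMS.DISTANCE, treats the ranges A2 and B2:B10000 as placeholders for your own data, and relies on the fact that, with the formula as written above, a higher score means a closer match:

=LET(
    lookup_value, A2,
    candidates, B2:B10000,
    scores, MAP(candidates, LAMBDA(cand, \NGRAMS.DISTANCE(lookup_value, cand))),
    INDEX(candidates, XMATCH(MAX(scores), scores))
)

Wrapping this in BYROW over a whole column of lookup values could turn it into a simple fuzzy-lookup table.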