LookupCompound and WordSegmentation, algorithms built on Symmetric Delete/Deletion Neighborhoods, are pretty interesting.
I wonder how difficult it would be to adapt this to work on sounds, or on other frequent sources of typos and misspellings, instead of just characters. It seems it should be possible if you can define a decent "normalization" function.
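Here's a minimal sketch (in C, with a hypothetical normalize() hook; not any real SymSpell API) of what that could look like: build the deletion neighborhood over a normalized key rather than the raw word, so "sounds" or keyboard slips collapse to the same index entries:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical normalization hook: identity here, but you could swap
       in a phonetic code (Soundex etc.) or a keyboard-aware canonicalizer. */
    static void normalize(const char *in, char *out) {
        strcpy(out, in);
    }

    /* Emit all single-character deletions of w; a SymSpell-style index
       inserts these into a hash table (recursing for larger distances). */
    static void deletes1(const char *w) {
        size_t n = strlen(w);
        char buf[64];
        for (size_t i = 0; i < n; i++) {
            memcpy(buf, w, i);            /* keep w[0..i) */
            strcpy(buf + i, w + i + 1);   /* skip w[i], copy the tail */
            printf("%s\n", buf);          /* stand-in for an index insert */
        }
    }

    int main(void) {
        char key[64];
        normalize("word", key);
        deletes1(key);   /* prints: ord, wrd, wod, wor */
        return 0;
    }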
For cases where the sound of the word isn't right, you need something that checks for nearby keys (I guess it needs to know the keyboard layout; see the sketch after this list) and for missing or extra letters.
- I mistyped the word.
- I don't know how to spell the word.
The errors in these two cases look quite different and may need different correction methods.
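For the "nearby keys" half, here is a rough sketch of what a layout-aware check could look like (the adjacency table is a made-up partial QWERTY map, not any real library's API):

    #include <stdio.h>
    #include <string.h>

    /* Partial QWERTY adjacency table; a real one would cover every key. */
    static const char *adjacent(char c) {
        switch (c) {
        case 'q': return "wa";    case 'w': return "qase";
        case 'e': return "wsdr";  case 'r': return "edft";
        /* ... remaining keys elided ... */
        default:  return "";
        }
    }

    /* Two characters are "nearby" if either sits next to the other. */
    int nearby(char a, char b) {
        return strchr(adjacent(a), b) != NULL
            || strchr(adjacent(b), a) != NULL;
    }

    int main(void) {
        printf("%d %d\n", nearby('e', 'r'), nearby('e', 'p'));  /* 1 0 */
        return 0;
    }

A corrector could then weight substitutions between nearby keys as cheaper than arbitrary ones when ranking candidates.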
Soundex is a rough phonetic "spelling" used for such things. It's also something I never see mentioned.
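For reference, a minimal sketch of classic (American census) Soundex: keep the first letter, map the rest to digit classes, ignore H and W entirely, let vowels reset the previous code, and pad/truncate to four characters:

    #include <ctype.h>
    #include <stdio.h>

    /* Digit class for A..Z; vowels, H, W, and Y map to '0' (dropped). */
    static char code_of(char c) {
        static const char *map = "01230120022455012623010202";
        c = toupper((unsigned char)c);
        return (c >= 'A' && c <= 'Z') ? map[c - 'A'] : '0';
    }

    /* out must hold 5 bytes; assumes a non-empty alphabetic name. */
    void soundex(const char *name, char out[5]) {
        int len = 0;
        char prev = code_of(name[0]);
        out[len++] = toupper((unsigned char)name[0]);
        for (int i = 1; name[i] && len < 4; i++) {
            char c = toupper((unsigned char)name[i]);
            if (c == 'H' || c == 'W') continue;  /* transparent letters */
            char d = code_of(name[i]);
            if (d != '0' && d != prev) out[len++] = d;
            prev = d;  /* vowels reset prev, allowing repeated codes */
        }
        while (len < 4) out[len++] = '0';
        out[4] = '\0';
    }

    int main(void) {
        char a[5], b[5];
        soundex("Robert", a);  soundex("Rupert", b);
        printf("%s %s\n", a, b);  /* R163 R163: they "sound alike" */
        return 0;
    }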
The precision and recall are too poor to use it as a spell checker.
"argetina" suggests "retina", Argentina isn't in the list.
"nethrlands" suggests "fatherland", Netherlands isn't in the list.
"ukriane" suggests "hurricane", Ukraine isn't in the list.
It fails for every country I try.
That said, the technique is not wholly without merit, but it does carry certain "average-worst case" trade-offs related to latency in the memory/storage system, because of SymSpell's reliance upon large hash tables. For details see https://github.com/c-blake/suggest
EDIT: Also - I am unaware of other implementations even saving the large, slow-to-compute index; the ones I know of seem to rebuild it every time, which seems kind of lame. EDIT2: I guess there is a recent Rust one that is persistent, as well as the "mmap & go" Nim one. Still, what should be standard practice is very rare.
How about comparing it to a Levenshtein automaton or another state-of-the-art approach?
The TL;DR is that on a "worst case" basis, it may be 10x faster or even 10x slower than a non-indexed linear scan, depending upon the latency of the backing store. Physical memories in computers are pretty big these days, but this technique would have been a total non-starter just 10 years before the 2007 article by Norvig, while its linear-scan competitor would have worked fine. It is also not a great approach if you have a big dictionary of long words and want truly large edit distances (>>5, say). I would say mid-size edit distances of 2-4 are SymSpell's sweet spot.
I just found a solution to this problem.
You can supply a phonetic name, which helps Google match it. This seems a bit low-tech, though.
Does anyone have any insight into what's holding this back?
What open-source solutions have you been using that aren't working as well as you'd like?
 https://www.microsoft.com/en-us/research/blog/speller100-zer..., https://cloud.google.com/blog/products/g-suite/everyday-ai-b...
Edit: To determine the number of index records created by deleting up to e characters from a single string of length n, use this:

    numRecs = 0;
    for (r = 1; r <= e; r++)                           /* delete r of n chars */
        numRecs += fact(n) / (fact(r) * fact(n - r));  /* adds C(n, r) */
numRecs = -1 + 2^n - (n nCr e+1) * 2F1(1, e-n+1; e+2; -1)
where n nCr k is the binomial coefficient and 2F1 is the Gauss hypergeometric function; since e-n+1 <= 0, the series terminates.
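A quick sanity check of the closed form against the loop (a throwaway sketch; fact, nCr, and hyp2f1_1b are names I made up, and double-precision factorials limit it to small n):

    #include <math.h>
    #include <stdio.h>

    double fact(int n) { double f = 1; while (n > 1) f *= n--; return f; }
    double nCr(int n, int r) { return fact(n) / (fact(r) * fact(n - r)); }

    /* 2F1(1, b; c; z) for integer b <= 0: finite sum of -b + 1 terms. */
    double hyp2f1_1b(int b, int c, double z) {
        double sum = 0, term = 1;
        for (int k = 0; k <= -b; k++) {
            sum += term;
            term *= (double)(b + k) / (c + k) * z;
        }
        return sum;
    }

    int main(void) {
        int n = 9, e = 3;  /* e.g. a 9-letter word at max edit distance 3 */
        double loop = 0;
        for (int r = 1; r <= e; r++)
            loop += nCr(n, r);
        double closed = -1 + pow(2, n)
                      - nCr(n, e + 1) * hyp2f1_1b(e + 1 - n, e + 2, -1);
        printf("loop=%g closed=%g\n", loop, closed);  /* both print 129 */
        return 0;
    }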
Or sounds (like Shazam does)?