Hacker News new | past | comments | ask | show | jobs | submit login
SymSpell: 1M times faster spelling correction (github.com/wolfgarbe)
203 points by mci 5 months ago | hide | past | favorite | 32 comments

What they are calling "Symmetric Delete" seems to be the same as an older concept called "deletion neighborhoods". It is a term coined in an academic paper written by Thomas Bocek, Ela Hunt, and Burkhard Stiller from the University of Zurich, titled "Fast Similarity Search in Large Dictionaries"[1]. The work described there was expanded in a paper written by Daniel Karch, Dennis Luxen, and Peter Sanders from the Karlsruhe Institute of Technology, titled "Improved Fast Similarity Search in Dictionaries"[2]. Both of these papers deal with efficient searching for similar string values, given a query string.

LookupCompound and WordSegmentation, algorithms built on Symmetric Delete/Deletion Neighborhoods, are pretty interesting.

[1]https://fastss.csg.uzh.ch/ifi-2007.02.pdf [2]https://arxiv.org/abs/1008.1191v2

Amazing info, thank you.

This seems very focused on spelling. What I find is key for a good spell correction system is how the words sound. For example I find that Firefox often can't find the word I meant for it's suggestion list but pasting the misspelt word into Google gets the right result 99% of the time as the one provided option.

I wonder how difficult it would be to adapt this to work on sounds or other frequent typos and misspelling sources instead of just characters. It seems it should he possible if you can define a decent "normalization" function.

I used an old algorithm in a text search, called Metaphone[0]. It works relatively well, for English (There's a Spanish version, too). I used Double Metaphone.

[0] https://en.wikipedia.org/wiki/Metaphone

There's a pretty cool python library with a huge number of these if you want to experiment (GPLv3): https://github.com/chrislit/abydos

Metaphone is good if you typed in something that sounds like the correct word. But often spelling typos are missing letters, or pressing a nearby letter instead of the one you meant.

For those the sound of the word isn't right, you need something that checks for nearby keys (I guess it needs to know the keyboard layout), and for missing or extra letters.

There are definitely two related concerns here:

- I mistyped the word.

- I don't know how to spell the word.

The errors in these two look quite different and may need different methods of correcting.

Same here, Apple often auto correct my word into something else, or failed to even suggest the correct word. I had to use Google to get a correct spelling. ( I am not sure if I am alone in this sometimes I dont think I can spell any more. )

Comparing the spellcheck quality of Firefox to Google (search, not Chrome) might not be completely fair. The browser most likely uses a client-side spelling model with tight memory & performance constraints. The suggestions from Google come from high quality models hosted on their servers.


Soundex is a rough phonetic "spelling" used for such things. It's also something I never see mentioned.

Soundex was designed as a way to index last names before computers. It's primary benefit it that it's designed to do by hand. It performs pretty lousy as a general purpose natural language hashing scheme.

The precision and recall is too bad to use as a spell checker.

It's also pretty simple, I'm sure there are much better options readily available as libraries these days.

Firefox's spelling correction is very poor. For instance, try spelling the name of a country starting with a lowercase letter and omitting a single letter.

"argetina" suggests "retina", Argentina isn't in the list.

"nethrlands" suggests "fatherland", Netherlands isn't in the list.

"ukriane" suggests "hurricane", Ukraine isn't in the list.

It fails for every country I try.

As jamra correctly points out in a sibling comment, the entry point to this (which gets a lot of traction on HN) is indeed attacking a strawman tutorial-written-on-an-airplane-in-Python algorithm. So, the 1M speed-up is very over-hyped.

That said, the technique is not wholly without merit, but does carry certain "average-worst case" trade offs related to latency in the memory/storage system because of SymSpell's reliance upon large hash tables. For details see https://github.com/c-blake/suggest

EDIT: Also - I am unaware of other implementations even saving the large, slow-to-compute index. The others I am aware of seem to rebuild the large index every time which seems kind of lame. EDIT2 - I guess there is a recent Rust one that is persistent as well as the "mmap & go" Nim one. Still, what should be standard is very rare.

Links to the other implementations would be very helpful.

There are many such links in the main article linked here, actually. That's where I just saw the Rust one using RocksDB. (No link to the Nim one, though...which has no external dependencies beyond the Nim stdlib.)

This seems very suspicious to me. They’re comparing performance to a tutorial blog post that is extremely inefficient.

How about comparing it to. Levenshtein automaton or another state of the art approach?

You are absolutely right to be suspicious. https://news.ycombinator.com/item?id=30580011

The TL;DR is that on a "worst case" basis, it may be 10x faster or even 10x slower than a non-indexed linear scan depending upon latency of backing stores. Physical memories in computers are pretty big these days, but this technique would have been a total non-starter just 10 years before the 2007 article by Norvig. Meanwhile its linear scan competitor would have worked fine. It is also not a great approach if you have a big dictionary of long words and want truly large edit distances (>>5, say). I would say "mid-size edit distances of 2-4" are SymSpell's sweet spot.

Slightly OT but trying to look up friends names using Android auto speech recognition whilst driving. I have two Austrian friend "Viktor" and "Patric". If I say. "Hey google, call Viktor" google says. "Sorry you have no Victor in your phone book. Same with Patric. "Sorry you have no Patrick" in your phone book. I'm surprised that there is not even basic scoring done when looking up names in the phone book with the most likely one offered.

*EDIT* I just found a solution to this problem. https://support.google.com/assistant/thread/559644?hl=en&msg... You can supply a phonetic name and this helps google match. This seems a bit low tech though.

I don't really see a reason why the C# version couldn't be written in a similar functional style as the Haskell version. I don't see them using linq anywhere. I see shallow clones, double constructors, no use of records. So I won't be too quick to judge. Would love to see someone try their hand at optimizing the C# version, using the full availability of modern c#.

I'd love to see better open source spell checking. The state of the art spell checkers (in say, MS Word or Google Search) are excellent and more than good enough for my needs. And yet the spell checkers in browsers (e.g. Chrome and Firefox) and other apps tend to be terrible only catching very basic cases and often not being able to suggest the correct word.

Does anyone have any insight into what's holding this back?

I integrated JSymSpell[0] into KeenWrite[1], which is based on SymSpell[2]. The algorithm provides numerous suggestions for "plannin'", with "planning" being the first hit[3]. Despite the English lexicon[4] not including all words for all localizations (US/UK) the results have been adequate for my use so far.

What open-source solutions have you been using that aren't working as well as you'd like?

[0]: https://github.com/rxp90/jsymspell

[1]: https://github.com/DaveJarvis/keenwrite/

[2]: https://github.com/wolfgarbe/SymSpell

[3]: https://i.ibb.co/Bs0W6w2/screenshot.png

[4]: https://github.com/DaveJarvis/keenwrite/tree/master/src/main...

The state of the art models these days would not be able to be bundled with browsers. These are typically neural models [0] and require specialized hardware to run, and also have high memory requirements. The models in browsers are light-weight to be able to run on most devices. Might be similar to those used in smartphone keyboards like GBoard [1].

[0] https://www.microsoft.com/en-us/research/blog/speller100-zer..., https://cloud.google.com/blog/products/g-suite/everyday-ai-b...

[1] https://ai.googleblog.com/2021/10/grammar-correction-as-you-...


So, this is time vs memory?

Very much so.

Edit: To determine the number of records created from permuting a single string, use this:

  for (r = 1; r <= e; r++)
     numRecs += (fact(n) / (fact(r) * fact(n - r)));
where e = edit distance and n = length of string. fact() is your standard factorial function.

This actually has a closed form!

numRecs = -1 + 2^n + (n nCr e+1) * 2F1(1, e-n+1; e+2; -1)

n nCr k combinatorial choice, 2F1 is the hypergeometric function

In a dictionary with more than a few words, some "created corruptions" in the lookup table/index collide with each other. (EDIT: e.g. for "hand" and "and" the overlap of the deletion-corrupted sets is substantial.) You only know which/how many by running the algo against a concrete dictionary. So, these expressions are at best a rough guide.

It says it does fuzzy search as well. Wonder how it compares to fzf

Is this how modern spell checkers worked? I assumed they were more heuristic at this point. For example, Google's "did you mean" is based on mapping common misspellings to what people actually clicked on.

Can this be used to match DNA sequences?

Or sounds (like Shazam does)?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact