Here's an alternative that is more efficient at large scales (e.g. if you have one order of magnitude more entries than just an English dictionary): performing Approximate Nearest Neighbor search [1] over a pre-computed Levenshtein space.
- hash all your dictionary entries through a Levenshtein-locality-sensitive hash function [2] with a well-chosen hash size. Any two words within a Levenshtein distance D of each other will have the same hash (you pick D).
- store the hashes in an efficient DB for quick lookup. Each hash will map to one or several (Levenshtein-similar) dictionary entries.
- when presented with a new input, compute its hash and retrieve the entries stored under that hash in your DB (if any).
- finish by sorting the retrieved dictionary entries by Levenshtein distance to your input... you just recovered all dictionary entries closer than distance D to your input (approximately).
Pretty much unbeatable, and it works over spaces with hundreds of millions of dictionary entries. (A rough sketch of the lookup pipeline follows the footnotes below.)
[1] https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approx...
[2] Can't find one? Build your own by computing the pairwise distance matrix of your entries, then reducing this matrix through PCA to an appropriate number of components (the length of your hash). The coordinates of each entry in this new space are the entry's hash.
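A rough Python sketch of what that pipeline might look like (illustrative only: toy_hash, the word list, and the distance cutoff are all made up, and the hash is a crude placeholder since, as discussed further down the thread, a true Levenshtein-locality-sensitive hash is not an off-the-shelf item):

    from collections import defaultdict

    def levenshtein(a, b):
        # Standard O(len(a) * len(b)) dynamic program, two rows at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                     # deletion
                               cur[j - 1] + 1,                  # insertion
                               prev[j - 1] + (ca != cb)))       # substitution
            prev = cur
        return prev[-1]

    def toy_hash(word):
        # Placeholder "locality-sensitive" hash: bucket by first letter and a
        # coarse length band. Nearby misspellings often (not always) collide.
        return (word[:1], len(word) // 3)

    def build_index(dictionary):
        index = defaultdict(list)
        for word in dictionary:
            index[toy_hash(word)].append(word)
        return index

    def lookup(index, query, max_distance=2):
        # Fetch the bucket for the query's hash, then rank by exact distance.
        bucket = index.get(toy_hash(query), [])
        return sorted((levenshtein(query, w), w) for w in bucket
                      if levenshtein(query, w) <= max_distance)

    index = build_index(["hello", "help", "hollow", "world", "word"])
    print(lookup(index, "helo"))   # [(1, 'hello'), (1, 'help')]

A real version would swap toy_hash for something genuinely Levenshtein-sensitive (the anchor-word construction discussed later in the thread is one option) and back the index with an on-disk key-value store.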
> Build your own by computing the pairwise distance matrix of your entries, then reducing this matrix through PCA to an appropriate number of components (the length of your hash).
You are kidding, right? I can't imagine anyone seriously suggesting PCA on a half-million-by-half-million matrix, or even evaluating that matrix in the first place.
This is actually a fairly actively researched topic: whether edit distance can be embedded, even approximately, in a vector space. It has been surprisingly resistant to such efforts. Yes, there are some results, but nothing very satisfactory.
Edit distance is a metric, so it was believed it would be easy to map individual strings to vectors such that the distance between two vectors was the same as, or close to, the edit distance between the corresponding strings. What we want, then, is an (approximately) distance-preserving map between these two metric spaces. In addition, in the interest of size, we would like these vectors to be of small dimension; if the required dimension is high, it defeats the purpose, since the string-to-vector map needs to be stored. Unfortunately, edit distance seems resistant to embedding into a low-dimensional vector space. There is also the question of which metric to use in the vector space: Euclidean, Manhattan...? There are some results for L1, i.e. embedding into a vector space with the Manhattan distance.
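To make the embedding question concrete, here is a small illustrative experiment (assumes numpy; the word list is arbitrary): classical MDS on the pairwise edit-distance matrix of a handful of words, followed by a comparison of the embedded Euclidean distances against the true edit distances.

    import numpy as np

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    words = ["cat", "bat", "hat", "chart", "smart", "start"]
    n = len(words)
    D = np.array([[levenshtein(a, b) for b in words] for a in words], dtype=float)

    # Classical MDS: double-center the squared distance matrix and eigendecompose.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:2]            # keep the top 2 components
    # Negative eigenvalues of B are exactly the sign that this metric cannot be
    # embedded into Euclidean space without distortion.
    X = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))

    E = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    print(np.round(D, 1))   # true edit distances
    print(np.round(E, 1))   # embedded distances; the gap between the two is the distortion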
Fun story: I was an intern on Google's spelling correction team once upon a time, long ago. So I was explaining what I was doing, and the person I was explaining it to was thoroughly unimpressed: "So what's the big deal, you know/hear the word and write down the spelling". It made me realize how difficult spelling in English is compared to some other languages. The variability in the map from phonemes to spelling is quite high in English. This person was Italian; he told me Italian has a lot less variability in this regard.
PCA is basically just SVD, and computing an SVD of that size is not unreasonable, even if it's dense: 0.5e6^2 = 0.25e12 = 1TB of 32-bit floats. That's a lot of memory, but there are reasonably priced machines ($50k or so) with that much RAM these days. If it's sparse, this can be done on a laptop, and it might not even be necessary to materialize the matrix at all if you use Lanczos iteration.
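For what it's worth, the back-of-the-envelope arithmetic behind that 1TB figure:

    entries = 500_000                    # half a million dictionary entries
    bytes_per_float = 4                  # 32-bit floats
    print(entries ** 2 * bytes_per_float / 1e12)   # 1.0, i.e. about 1 TB for the dense matrix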
Let's put it this way, then: we disagree on what's a reasonable expenditure. 1TB of RAM? $50k? Could you fund me please :) OK, adding a pretty please.
Note it's not just memory, it's computation too. Every entry of that matrix is the solution to a dynamic program; if you don't store it, you have to recompute it. The DP scales as the product of the lengths of the two strings, and string lengths have a fat tail.
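For reference, a sketch of the dynamic program in question (standard Wagner-Fischer; the word pair is just an example): the work is one table cell per character pair, which is where the product-of-lengths scaling comes from.

    def edit_distance(a, b):
        # Full Wagner-Fischer table: (len(a)+1) x (len(b)+1) cells.
        rows, cols = len(a) + 1, len(b) + 1
        table = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            table[i][0] = i                      # delete all of a[:i]
        for j in range(cols):
            table[0][j] = j                      # insert all of b[:j]
        for i in range(1, rows):
            for j in range(1, cols):
                table[i][j] = min(table[i - 1][j] + 1,                          # deletion
                                  table[i][j - 1] + 1,                          # insertion
                                  table[i - 1][j - 1] + (a[i - 1] != b[j - 1])) # substitution
        return table[-1][-1], rows * cols        # distance, and cells of work done

    print(edit_distance("kitten", "sitting"))    # (3, 56)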
You could still get away with it if this matrix were close to a low-rank matrix with large gaps between adjacent singular values. It's not, and that's a problem. Note that you have to store the SVD factors too.
EDIT: Interesting! My comments seem to have annoyed someone. I did not expect downvotes on my comments here; I thought they were pretty uncontroversial. If you, dear reader, can leave some clues, it would be helpful and appreciated. I upvote those who explain what they found unpalatable about my comments, regardless of agreement.
It's a lot of computation, granted, but I just wanted to point out that it wasn't completely ludicrous. The proposed algorithm doesn't work anyway since there are no non-degenerate hash functions that map all words within D edits of each other to the same hash value (as I pointed out here: https://news.ycombinator.com/item?id=8158942).
I'm definitely not kidding. I'll point out that this is very much tractable (for instance, Spotify runs a matrix factorization job every few days on a users x songs matrix, maybe 50 million by 30 million, which is several orders of magnitude more). But anyway, if you want to do this on your laptop, you don't have to use the entire space to build a good function; a statistically representative sample of your dictionary will suffice.
Sparse matrix SVD (or, more likely, truncated/approximate SVD) like Spotify and the Netflix contestants used to do is easy. But anything dense is not quite as easy (StefanKarpinski gave a $50K estimate for a machine with enough memory to do that -- I don't consider that "easy").
Also, do you actually have any kind of experience with PCAing/SVDing an edit distance matrix? It's not easy to embed in a Euclidean geometry. If you have that experience, can you point to more reference material? If you don't, it would be prudent to say "I believe this might work though I have never tried anything similar".
The scope of this project is beyond huge. Similar things have been proposed for the last 250 years (some in jest, some seriously [1]). But the benefits do not seem sufficient given the costs.
Of course it is huge, but as something I have thought about for a long time it actually has a chance of working. Not that I can convince anyone here, but it is worth taking the time to consider in depth.
How can it be true that any two words within a given Levenshtein distance will have the same hash?
If we picked the threshold to be one, couldn't you come up with two words that are Levenshtein distance one apart, and then come up with a third word that's distance one from one of the words but not the other?
In that case, the first two would be in one bucket; where does the third go?
I agree. So you have to hash things into multiple buckets, i.e. each word needs to have multiple coordinates/buckets in hash space. I think this is what is meant by "well-chosen hash size" / "one or several", maybe. It makes sense if you think about the distance matrix / PCA sketch: PCA leaves you with a low-rank approximation of the distance matrix, and the rank would be the number of coordinates for each word in hash space.
There is no such magical hash. Any hash splits the space of words into equivalence classes, and unless there's only one equivalence class, there must be words that are edit distance one apart yet land in different classes, since you can get from any word to any other by a sequence of single edits.
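A toy illustration of that last step (the word pair is arbitrary): chain any word to any other through single-character deletions and insertions, so every adjacent pair in the chain is at distance exactly 1.

    def single_edit_chain(src, dst):
        # Delete src one character at a time, then build dst one character at a
        # time; adjacent entries in the chain differ by exactly one edit.
        chain = [src]
        for i in range(len(src) - 1, -1, -1):
            chain.append(src[:i])
        for j in range(1, len(dst) + 1):
            chain.append(dst[:j])
        return chain

    print(single_edit_chain("cat", "dog"))
    # ['cat', 'ca', 'c', '', 'd', 'do', 'dog'] -- a hash that merged every
    # distance-1 pair would have to merge this entire chain into one class.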
Any single hash will be suboptimal; the "magic" would be to use a bunch of different hashes to partition your space into suboptimal buckets, and to use all of them at once to get very good (though still approximate) query results.
This is what Spotify's excellent Annoy library does. [1]
I can't edit anymore, so I'll reply to my own post:
People are pointing out that there is no known hash function that will map every word within a given distance D to the same hash. That is correct; I was just meaning to explain the process at a high level. Levenshtein locality-sensitive hashing is not a solved problem in CS.
And you know what else in the algorithm I described is not a solved problem? Efficient nearest neighbor search. In a space with more than a few dimensions, we actually don't know any better way to perform exact NNS than brute forcing the search space.
And you know what? None of that matters, since we have inexact yet statistically excellent processes to perform both of these tasks. That's the really cool thing about engineering as opposed to math or computer science: it doesn't have to be exact to be useful.
A trivial way to build an inexact yet useful Levenshtein hash function is to pick a random sample of words from your dictionary, define them as hash values, and assign to any input the closest word from your sample. Now work with a large family (100+) of such grossly approximate hashes and you can build a statistically excellent algorithm. (And of course this is just an example; there are countless better ways to do the same thing.)
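A minimal sketch of that anchor-word idea (purely illustrative; the word list, sample size, and family size are made up):

    import random

    def levenshtein(a, b):
        # Standard O(len(a) * len(b)) dynamic program, two rows at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def make_anchor_hash(dictionary, sample_size, seed):
        # One member of the hash family: a random sample of anchor words;
        # the hash of any input is its nearest anchor.
        anchors = random.Random(seed).sample(list(dictionary), sample_size)
        return lambda word: min(anchors, key=lambda a: levenshtein(word, a))

    words = ["hello", "help", "hollow", "yellow", "world", "word", "sword"]
    family = [make_anchor_hash(words, 3, seed) for seed in range(100)]
    print({h("wrod") for h in family})   # the buckets a query would probe across the family

At query time you would take the union of the buckets from all members of the family and rank the candidates by exact Levenshtein distance, as in the pipeline sketched near the top of the thread.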
Shameless plug: I have a Java library that does this even more efficiently. It constructs a Levenshtein automaton of a word in linear time and then computes the intersection between a dictionary (stored in a deterministic acyclic minimal finite state automaton) and the Levenshtein automaton. It also has other features, such as:
- Perfect hash automata (an automaton that gives a unique hash for every word/sequence).
- String-keyed maps that use a finite state automaton to store keys.
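That library is Java and uses true Levenshtein automata over a minimal automaton; as a rough stand-in for the general idea, here is a simpler, related technique sketched in Python: walk a trie while carrying one row of the Levenshtein DP, pruning branches that cannot stay within the distance budget (the word list is just an example).

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.word = None          # set on terminal nodes

    def build_trie(words):
        root = TrieNode()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.word = w
        return root

    def fuzzy_search(root, query, max_dist):
        results = []
        first_row = list(range(len(query) + 1))

        def walk(node, ch, prev_row):
            # DP row for the trie path ending in character `ch`.
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                row.append(min(row[j - 1] + 1,                          # insertion
                               prev_row[j] + 1,                         # deletion
                               prev_row[j - 1] + (query[j - 1] != ch))) # substitution
            if node.word is not None and row[-1] <= max_dist:
                results.append((row[-1], node.word))
            if min(row) <= max_dist:              # otherwise no descendant can match
                for next_ch, child in node.children.items():
                    walk(child, next_ch, row)

        for ch, child in root.children.items():
            walk(child, ch, first_row)
        return sorted(results)

    trie = build_trie(["hello", "help", "hollow", "world", "word"])
    print(fuzzy_search(trie, "helo", 1))   # [(1, 'hello'), (1, 'help')]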
Also interesting: the cutting-edge work Lucene has done to make this possible in 4.x, using universal automata to speed up fuzzy, regex, and other searches.
The advantage of basic Levenshtein is that it's a heck of a lot quicker (last time I checked) than Damerau-L or Jaro-Winkler in real-world applications.
And I wouldn't go as far as to say it's useless. Basic Levenshtein only misses transpositions in comparison to Damerau-L, but it still catches them as insert/delete combos.
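A quick check of that transposition point (a sketch; the misspelling pair is just an example): the same DP with an optional adjacent-transposition case (the "optimal string alignment" variant of Damerau-Levenshtein) scores a swap as one edit instead of two.

    def edit_distance(a, b, allow_transpositions=False):
        # Wagner-Fischer table, optionally with the adjacent-transposition case
        # (the "optimal string alignment" flavour of Damerau-Levenshtein).
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = a[i - 1] != b[j - 1]
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
                if (allow_transpositions and i > 1 and j > 1
                        and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # swap counted as one edit
        return d[-1][-1]

    print(edit_distance("receive", "recieve"))                             # 2: the swap costs two edits
    print(edit_distance("receive", "recieve", allow_transpositions=True))  # 1: the swap costs one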
> searches contain mispelled words, and users will expect these searches to magically work
HELL NO, I expect searches to return "result not found". This is important information, as important as actual found results. I don't mind clearly labelled suggestions, but NEVER EVER give me what you think I want instead of what I actually asked for.
I don't think we actually know what the typical user wants. Or is one of us aware of some really good user research on this subject?
One of our common failings as programmers is making incorrect assumptions about what the user wants. rasz_pl is assuming that the user is like him, and values precision. Gigablah is assuming that the user is not like rasz_pl, and does not value precision. Both of those are simply assumptions.
I hear this all the time. "But of course the user will want to be able to choose exactly which columns of data are in the table!". "But of course the user would rather see one nice simple summary number than the actual details!". Both of those are scary.
What frustrates me is when searches "correct" things for you, and especially when there is no way to prevent the "correction".
Case in point: Google. As far as I can tell there is no way to search for something on Google without it "correcting" it for you. + used to work, but it no longer does. Quotation marks used to work, but they no longer do either. Sometimes it pops up "Search instead for <search term you actually entered>", but not always.
There are really two issues. That one displays the latter but not the former: it autocorrects and there is no way to turn off the autocorrection, but yes, it does display the "Search instead for" part.
I'll give you a ring next time I run across something that doesn't have the "search instead for" part.