
How do you spell boscodictiasaur? How we correct spelling and complete queries - jpschm
https://www.0x65.dev/blog/2019-12-08/how-do-you-spell-boscodictiasaur.html
======
markpapadakis
SymSpell is great for tiny datasets (its index size grows with the maximum edit
distance and the dictionary size). It's fast, but the generated index will
likely be very large; furthermore, you can't practically encode rewrite rules
and transition costs (e.g. a cost of zero from "f" to "ph", for getting better
results for [ifone]).
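A rough sketch of what such transition costs would look like (this is illustrative, not bestprice.gr's actual code): a Levenshtein variant extended with custom multi-character rewrite rules, so that "f" → "ph" can be made free and "ifone" matches "iphone" at distance 0.

```python
# Edit distance with optional (src, dst, cost) rewrite rules, e.g. a free
# "f" -> "ph" substitution. Hypothetical sketch, not a production corrector.
def weighted_edit_distance(source, target, rules=None):
    rules = rules or []
    n, m = len(source), len(target)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if d[i][j] == INF:
                continue
            cur = d[i][j]
            if i < n:  # deletion
                d[i + 1][j] = min(d[i + 1][j], cur + 1)
            if j < m:  # insertion
                d[i][j + 1] = min(d[i][j + 1], cur + 1)
            if i < n and j < m:  # match / substitution
                cost = 0 if source[i] == target[j] else 1
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + cost)
            for src, dst, cost in rules:  # custom rewrite rules
                if source[i:i + len(src)] == src and target[j:j + len(dst)] == dst:
                    ni, nj = i + len(src), j + len(dst)
                    d[ni][nj] = min(d[ni][nj], cur + cost)
    return d[n][m]

rules = [("f", "ph", 0), ("ph", "f", 0)]
print(weighted_edit_distance("ifone", "iphone", rules))  # -> 0
```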

An FST is the better choice. We
([https://www.bestprice.gr/](https://www.bestprice.gr/)) used SymSpell in the
past, but have since switched to an FST-based design, where the FST is used to
match prefixes. We also store all queries encoded in the FST, together with
their frequency, sorted by query.

For each distinct corrected prefix, we determine the range in that (query,
frequency) list using binary search, then iterate over that [first, last]
range and compute a score based on a language model and the edit cost from the
prefix to the corrected prefix (i.e. the noisy channel model).
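A sketch of that range lookup (toy data, not the production FST): since the queries are stored sorted, all completions of a corrected prefix form a contiguous slice found by binary search, and each candidate is scored as a language-model probability discounted by the edit cost.

```python
# Binary search over a sorted (query, frequency) list, then noisy-channel
# scoring. The query list and scores are illustrative.
import bisect
import math

queries = sorted([
    ("iphone 11", 900), ("iphone case", 400),
    ("ipad", 700), ("iphone", 1500), ("ipod", 120),
])
keys = [q for q, _ in queries]
total = sum(f for _, f in queries)

def complete(corrected_prefix, edit_cost, top_k=3):
    # [lo, hi) covers every stored query starting with the corrected prefix
    lo = bisect.bisect_left(keys, corrected_prefix)
    hi = bisect.bisect_left(keys, corrected_prefix + "\uffff")
    scored = []
    for query, freq in queries[lo:hi]:
        # unigram stand-in for the 5-gram model; edit cost acts as a penalty
        score = math.log(freq / total) - edit_cost
        scored.append((score, query))
    return [q for _, q in sorted(scored, reverse=True)[:top_k]]

print(complete("iphone", edit_cost=1.0))  # -> ['iphone', 'iphone 11', 'iphone case']
```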

We encode many different rules in the FST including language domain ones, and
building the index takes a few seconds for millions of queries. The language
model is based on 5-grams and Stupid Backoff.
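Stupid Backoff (Brants et al. 2007) can be illustrated in miniature: when the highest-order n-gram is unseen, back off to a lower order scaled by a fixed 0.4 multiplier instead of a properly discounted probability. The toy bigram/unigram counts below stand in for the 5-gram model mentioned above.

```python
# Minimal Stupid Backoff over toy counts (illustrative, not the real model).
ALPHA = 0.4  # the fixed backoff multiplier from the Stupid Backoff paper

unigrams = {"cheap": 5, "iphone": 10, "case": 6}
bigrams = {("cheap", "iphone"): 3, ("iphone", "case"): 4}
total_unigrams = sum(unigrams.values())

def stupid_backoff(word, context):
    """Score word given a tuple of preceding words, backing off on misses."""
    if context and (context[-1], word) in bigrams:
        return bigrams[(context[-1], word)] / unigrams[context[-1]]
    # unseen bigram: fall back to the scaled unigram relative frequency
    return ALPHA * unigrams.get(word, 0) / total_unigrams

print(stupid_backoff("iphone", ("cheap",)))  # bigram hit: 3/5
print(stupid_backoff("case", ("cheap",)))    # backoff to the unigram
```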

~~~
wolfgarbe
[Disclaimer: I'm the author of SymSpell]

>> "SymSpell is great for tiny datasets"

Dictionaries with one million word entries are no problem, even for a maximum
edit distance of 4. For comparison, the Oxford English Dictionary contains
301,100 main entries and 616,500 word-forms in total.
[https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078](https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078)

SymSpell can be augmented with a weighted edit distance, giving higher
priority to pairs that are close to each other on the keyboard layout or that
sound similar (e.g. Soundex or other phonetic algorithms that identify
different spellings of the same sound). There are two SymSpell implementations
with a weighted edit distance available:
[https://github.com/MighTguY/customized-symspell](https://github.com/MighTguY/customized-symspell)
[https://github.com/searchhub/preDict](https://github.com/searchhub/preDict)
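The keyboard-layout idea can be sketched as follows (the adjacency table and costs are illustrative, not taken from either linked implementation): substitutions between physically adjacent keys cost less than arbitrary ones.

```python
# Levenshtein distance with cheaper substitutions for neighboring keys.
# Partial QWERTY adjacency table for illustration only.
KEY_NEIGHBORS = {
    "q": "wa", "w": "qase", "e": "wsdr", "r": "edft", "t": "rfgy",
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "f": "drtgvc",
    "z": "asx", "x": "zsdc", "c": "xdfv", "v": "cfgb",
}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if b in KEY_NEIGHBORS.get(a, "") else 1.0

def keyboard_edit_distance(s, t):
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)
    for j in range(m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,       # deletion
                d[i][j - 1] + 1,       # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
            )
    return d[n][m]

print(keyboard_edit_distance("vat", "cat"))  # 0.5 ("v" and "c" are neighbors)
print(keyboard_edit_distance("qat", "cat"))  # 1.0 ("q" and "c" are not)
```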

~~~
markpapadakis
For the few million queries we needed to index, many of them close to 30
characters in length (some even longer), the generated index size for a
maximum edit distance of 3 was really large.

So we used three indices (one each for unigrams, bigrams, and trigrams) -- and
during query processing we would segment the input query. For each segment
we'd consult whichever of those three indices made sense and keep the top-K
results; then we'd proceed to the next segment, consider the n-gram sequence
of "suffix" and "prefix" matches between the carried top-K from the previous
segment and the current segment, and so on, until we had exhausted the query.
Segmentation was particularly hard to get right.
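The carry-forward scheme described above amounts to a beam search over segments. A rough sketch (the per-segment candidate lookup and all scores are stubbed stand-ins for the real SymSpell/n-gram indices):

```python
# Segment the query, fetch candidate corrections per segment, and carry only
# the top-K partial hypotheses forward. All names and scores are illustrative.
import heapq

# Stubbed per-segment candidates: (correction, log-score).
CANDIDATES = {
    "ifone": [("iphone", -1.0), ("ifone", -3.0)],
    "cse": [("case", -1.0), ("cse", -4.0)],
}

def pair_bonus(prev_word, word):
    # stand-in for scoring the join between the carried suffix and the next
    # segment's prefix; a real system would consult the language model
    return -0.5 if (prev_word, word) == ("iphone", "case") else -2.0

def correct(segments, k=2):
    beam = [(0.0, ())]  # (cumulative score, words so far)
    for seg in segments:
        nxt = []
        for score, words in beam:
            for cand, cscore in CANDIDATES.get(seg, [(seg, -5.0)]):
                bonus = pair_bonus(words[-1], cand) if words else 0.0
                nxt.append((score + cscore + bonus, words + (cand,)))
        beam = heapq.nlargest(k, nxt)  # keep only the top-K hypotheses
    return " ".join(max(beam)[1])

print(correct(["ifone", "cse"]))  # -> "iphone case"
```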

It was a fairly involved process, but it was what worked for us -- again,
SymSpell is _very_ fast for short tokens, but we had to execute maybe
thousands of such SymSpell lookups when processing a single input query, and
that adds up quickly. (We will probably open-source our SymSpell
implementation soon.)

------
dvh
I spell it B13R, you're welcome:
[https://ghost.sk/dinosaurs/](https://ghost.sk/dinosaurs/)

------
ma2rten
Have you considered neural network based techniques for both problems?

~~~
ssubu
Hi! Yes, we have played around with character- and trigram-level neural
network language models. We also experimented with training a supervised
neural network on a misspellings dataset for the corrector.

Unfortunately, we had trouble getting the performance to a point where they
could replace this system. It is definitely something we will revisit soon
though!

------
bashwizard
Mboscodictiasaur obviously.

------
kirse
Interesting tidbit from Cliqz
([https://cliqz.com/en/](https://cliqz.com/en/)):

 _Europe has failed to build its own digital infrastructure. US companies such
as Google have thus been able to secure supremacy. They ruthlessly exploit our
data. They skim off all profits. They impose their rules on us. In order not
to become completely dependent and end up as a digital colony, we Europeans
must now build our own independent infrastructure as the foundation for a
sovereign future._

Why not post this article to the European "HN"? Would hate for the US Hacker
News to be ruthlessly providing traffic to a company whose stated mission is
anti-US.

