Rebuilding the spellchecker (zverok.github.io)
151 points by zverok on Jan 15, 2021 | 24 comments



This is great. I read all three articles.

It's amazing how difficult it is to encode rules for dealing with natural language, considering how easy it is for a person to resolve ambiguities, misspellings, and the like. Of course, we forget how much knowledge we have encoded in our own brains.

I'm trying to do named entity recognition for chemicals and materials in quasi-natural language texts that we've developed over many years at work. It's brutal.


Thanks!


Since I'm not a native English speaker, I find the Grammarly extension quite good. This creates a problem in professional settings, though, since basically all the text you write gets shared with a third party.

Which open source projects, if any, implement the corresponding functionality? Also, does anybody have experience with them?


I believe that LanguageTool[0] is the closest open-source counterpart to Grammarly. In my experience, though, it is not half as useful... But it is multilingual and open-source.

I have a distant dream of doing to it what I did to Hunspell (write code and a series of articles explaining how it works and why it is so hard), but we'll see.

As far as I know, LanguageTool is based just on a huge set of rules (you can see them in the repo[1]), while Grammarly is a mix of rule-based and machine-learning suggestions (I heard a rumor that it is 99% rule-based and that the talk about ML is mostly marketing, but I don't know how reliable this rumor is).

0: https://languagetool.org

1: https://github.com/languagetool-org/languagetool/tree/master...
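For readers wondering what "just a huge set of rules" can look like in practice, here is a minimal sketch in Python. The two rules and the `check` function are hypothetical toys made up for illustration; they are not taken from LanguageTool or Grammarly, whose real rule formats are far richer.

```python
import re

# Hypothetical toy rules: (rule name, compiled pattern, message).
# Real checkers ship thousands of rules with exceptions and suggestions.
RULES = [
    ("REPEATED_WORD", re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE),
     "Possible repeated word"),
    ("A_BEFORE_VOWEL", re.compile(r"\ba\s+[aeiou]\w*", re.IGNORECASE),
     "Consider 'an' before a vowel letter"),
]


def check(text):
    """Return (rule name, offset, matched text, message) for every rule hit."""
    hits = []
    for name, pattern, message in RULES:
        for match in pattern.finditer(text):
            hits.append((name, match.start(), match.group(0), message))
    # Report hits in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    for hit in check("This is is a apple"):
        print(hit)
```

Even this toy shows why rule lists grow so large: the second rule already misfires on phrases like "a user" or "a one-off", so every rule tends to accumulate exceptions.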


Indeed, I have an installation of LanguageTool on my private server to avoid the privacy issues mentioned above. I have plugins for Thunderbird, Chromium and vim running. The browser plugin is by far the best.
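For anyone who wants to script against such a self-hosted instance, a minimal sketch using only the Python standard library follows. It assumes a LanguageTool HTTP server reachable at `http://localhost:8081` (an assumption, adjust host and port to your setup) and uses the `/v2/check` endpoint with form-encoded `text` and `language` fields. The helper names `build_request`, `summarize` and `check` are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: your own LanguageTool server listens here; change as needed.
LT_URL = "http://localhost:8081/v2/check"


def build_request(text, language="en-US", url=LT_URL):
    """Build a POST request carrying form-encoded 'text' and 'language'."""
    data = urllib.parse.urlencode({"text": text, "language": language})
    return urllib.request.Request(url, data=data.encode("utf-8"), method="POST")


def summarize(response):
    """Reduce a decoded JSON response to (offset, length, message, suggestions)."""
    results = []
    for match in response.get("matches", []):
        suggestions = [r["value"] for r in match.get("replacements", [])]
        results.append((match["offset"], match["length"],
                        match["message"], suggestions))
    return results


def check(text, language="en-US"):
    """Send text to the configured server and return summarized matches."""
    with urllib.request.urlopen(build_request(text, language)) as resp:
        return summarize(json.load(resp))
```

With the server running, `check("Helo world")` would return a list of tuples; the text only travels to the server you control.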


https://github.com/languagetool-org/languagetool works very well (although I use the online version).


Great tool, it was my first contact with spellcheckers. At the time I was working for a company that does translations powered by machine learning. I was a student and, as the article mentions, one of the naive ones who thought a spellchecker is an easy thing to build.

https://github.com/victorqribeiro/goSpellcheck

I originally wrote this in Python, then ported it to Go. Back then I had plans to improve it. I believe that most errors would be due to mispressed keys, so I was sketching an algorithm to find similar words given a dictionary. Soon I had to deal with other projects (from college), and I left spellchecking to the smart people.
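The "mispressed keys" idea can be sketched roughly like this in Python. This is not the code from goSpellcheck; the QWERTY table, the stagger assumption, and the names `build_adjacency` and `mispress_candidates` are all hypothetical and only illustrate the approach of replacing one letter with a physically adjacent key.

```python
# Hypothetical letter rows of a QWERTY layout (digits and punctuation omitted).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_adjacency():
    """Map each key to its neighbours, assuming each row is staggered
    slightly to the right of the row above it."""
    adjacent = {key: set() for row in ROWS for key in row}
    for r, row in enumerate(ROWS):
        for i, key in enumerate(row):
            spots = [(r, i - 1), (r, i + 1),          # same row
                     (r - 1, i), (r - 1, i + 1),      # row above
                     (r + 1, i - 1), (r + 1, i)]      # row below
            for rr, j in spots:
                if 0 <= rr < len(ROWS) and 0 <= j < len(ROWS[rr]):
                    adjacent[key].add(ROWS[rr][j])
    return adjacent


ADJACENT = build_adjacency()


def mispress_candidates(word, dictionary):
    """Dictionary words reachable by swapping one letter for an adjacent key."""
    found = set()
    for i, letter in enumerate(word):
        for neighbour in ADJACENT.get(letter, ()):
            candidate = word[:i] + neighbour + word[i + 1:]
            if candidate in dictionary:
                found.add(candidate)
    return sorted(found)


if __name__ == "__main__":
    print(mispress_candidates("wprd", {"word", "ward", "cord"}))
```

It only covers single substitutions; missed, doubled, or swapped keys would need extra candidate generators.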


Never thought about the problem of compound words like that before. Every time I read about this problem I'm impressed by the solutions, but I'm also reminded of a comment that I think I saw on Reddit.

> Learn how to spell, or use a spell-checker!

Why would I use a spell-checker, am I Gandalf?


Really, what we want isn't a spell-checker at all, but an intention-checker. "Does what you wrote locally seem to be consistent with the intention of your overall document?"

But of course, something that actually worked like that would be indistinguishable from magic.


I'm so used to autocompletion in Bash that sometimes I hit the Tab key after typing the first few characters of the argument to the `mkdir` command. The computer responds by beeping at me to remind me that it can't read my mind.


That certainly seems like a complex problem to solve :-)


Grammarly matches patterns. The differences between Grammarly and other grammar checkers are essentially in the sizes of their lists and their cosmetics: how many writing problems can they find and correct, and (far less important) how friendly is their interface to the writer?

I think Grammarly isn't even in the top 5 grammar-checking tools or the top 10 for helping improve style.

I think the problem for grammar checkers is that helping nonnative English speakers is a vastly larger problem than helping native English speakers, and no grammar checker I am aware of does much to help ESL speakers who are not already fluent in English.

Also, Rahul (founder of Superhuman) said there is currently no automatic spell/grammar check library that developers could integrate into the apps and software they create: "When you type, I would love to be able to autocorrect errors in your typing in the same way that macOS does natively." So there may be an opportunity out there! Don't forget me if you make it through x) and I'm excited to check out your upcoming writings!


> I think Grammarly isn't even in the top 5 grammar-checking tools or the top 10 for helping improve style.

What are the top 5 tools for checking grammar and improving style?


In my top 5 is one called Hemingway - I put my marketing materials in there and work them down to an 8th grade level :/


Lovely work, I was looking for something like this the other day and I'd like to thank you for sharing it! Especially since it's in Python I can understand it without too much hassle. What a good job on the documentation as well!!


Really great series of articles about spellcheckers. I wish there was a similar project written in Ruby. There is https://github.com/omohokcoj/ruby-spellchecker but it serves a slightly different purpose: doing safe autocorrections.

Are you considering the https://github.com/wolfgarbe/SymSpell algorithm for suggestions? If I recall correctly, native Hunspell suggestions are quite slow, and the same algorithm must be even slower in Python.
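For readers unfamiliar with SymSpell, its core trick is to generate only deletions, both for dictionary words (ahead of time) and for the input word (at lookup), and to match them through a hash table. A minimal Python sketch of that idea follows; the function names are made up here, and the sketch leaves out the edit-distance verification and frequency ranking that the real library applies to the candidates.

```python
from itertools import combinations


def deletes(word, max_distance=1):
    """The word itself plus every string made by removing up to
    max_distance characters from it."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_index(dictionary, max_distance=1):
    """Precompute a map: delete-variant -> dictionary words producing it."""
    index = {}
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index.setdefault(variant, set()).add(word)
    return index


def candidates(word, index, max_distance=1):
    """Dictionary words sharing at least one delete-variant with the input.
    Unverified: results may be further away than max_distance edits."""
    found = set()
    for variant in deletes(word, max_distance):
        found |= index.get(variant, set())
    return sorted(found)


if __name__ == "__main__":
    index = build_index({"hello", "help", "world"})
    print(candidates("helo", index))
    print(candidates("wrold", index))
```

The lookup cost depends on the length of the input word rather than on the size of the dictionary, which is where the speed comes from; the price is a much larger precomputed index.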


> I wish there was a similar project written in Ruby.

Hehe... Actually, Ruby is my primary language, but I have chosen Python for this project for a complicated set of reasons I tried to explain[0] in the first article.

> If I recall correctly, native Hunspell suggestions are quite slow, and the same algorithm must be even slower in Python.

Pretty slow, yes. But the current project's goal is to "uncover" how Hunspell works, so I implement it the Hunspell way. The next (several) parts of the series will explain a lot about suggest, including "why is it hard" and "why SymSpell might not be enough" ;)

0: https://zverok.github.io/blog/2021-01-05-spellchecker-1.html...


Isn't this a problem deep learning could solve without much implementation difficulty? Or am I just imagining that getting a list of misspellings of words, and phonetics for them, is tractable?


That's a huge topic, which I am planning to cover towards the end of the article series <s>please like and subscribe</s>, but in short: yes, my opinion is that spellchecking is actually a "machine learning problem in disguise", and most existing dictionaries are more a roundabout way of storing something-not-unlike-models than analytical data.

But the ML approach raises a question of data availability. What good will your "deep learning OSS spellchecker" do if there aren't good (and open) models for it that cover as many languages as existing Hunspell dictionaries do? And what if adding a bunch of new words requires laborious model retraining? It is not unsolvable, but it is non-trivial.

I believe all the giants have something like this inside (I don't think spelling correction in the Google search bar is handled with Hunspell, right?), but it is much harder to do as an open tool, ready for embedding into other software.

There are notable attempts, though: JamSpell for one (https://github.com/bakwc/JamSpell), which has open "free" models and more precise commercial ones; the source code is open (maybe also only for using "simplistic" models, I haven't dug deeper).


Maybe a bit off-topic, but are there good spellcheckers written in JavaScript?


As far as I know (I've mentioned it in the first part[0]), nspell[1] is the closest to a "port of (some of) Hunspell", and typo.js[2] ports even less (but might be enough for some; we used it at my previous company: it uses dictionaries for lookup, but has its own simplistic suggest, which I needed to tweak a lot).

The SymSpell algorithm (which is quite different, I'll go into it in the next part to some extent) is much easier to port, so there is a JS SymSpell port[3] (which seems abandoned, though).

0: https://zverok.github.io/blog/2021-01-05-spellchecker-1.html

1: https://github.com/wooorm/nspell

2: https://github.com/cfinke/Typo.js/

3: https://github.com/IceCreamYou/SymSpell


https://github.com/wolfgarbe/SymSpell lists 5 JS implementations (plus a Rust one that compiles to WebAssembly)


Ah, indeed :) I just googled the first one for "symspell+js"


Thanks for the information!



