In my first job I worked for Tasman in Leeds and produced a Word Processor for IBM PC compatibles in 8086 assembler with some help, and then a spelling checker.
For the spelling checker I did a whole load of analysis on a 70,000 word list from Collins and produced a list of tokens to represent common strings of letters. However, in the end I really had to cut the original word list down to get the whole thing onto a single 360K floppy.
After I left Tasman, I was lying in bed one night still thinking about it and realised where I had gone wrong. The tokenising thing, which someone else had put me onto, had blinded me. I had stared at word lists for months and hadn't pinned down the obvious pattern. All but 26 words in the 70,000 word list share the bulk of their characters with the word before.
So the solution was to use 5 bits of the first byte as a count of characters shared with the word before, and 3 bits to indicate common endings (ship, s, ing) or that the following bytes are tokens for the rest of the word. With this I got the word list compressed to less than 2 bytes per word.
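To make the scheme concrete, here is a toy sketch of the prefix-sharing idea in Python. It is not the original assembler: the ENDINGS table and the length-prefixed literal suffix are illustrative stand-ins, whereas the real format tokenised the remaining letters, which is what got it under 2 bytes per word.

```python
# Toy sketch of the prefix-sharing idea, not the original 8086 assembler.
# ENDINGS and the length-prefixed literal suffix are illustrative stand-ins.
ENDINGS = ["s", "ing", "ship", "ed", "er", "ly", "es"]  # hypothetical table, codes 1-7

def encode(sorted_words):
    out = bytearray()
    prev = ""
    for word in sorted_words:
        # low 5 bits: number of leading characters shared with the previous word
        shared = 0
        while (shared < min(len(prev), len(word), 31)
               and prev[shared] == word[shared]):
            shared += 1
        rest = word[shared:]
        if rest in ENDINGS:
            # high 3 bits: index of a common ending (1-7)
            out.append(((ENDINGS.index(rest) + 1) << 5) | shared)
        else:
            # ending code 0: a length-prefixed literal suffix follows
            out.append(shared)
            out.append(len(rest))
            out.extend(rest.encode("ascii"))
        prev = word
    return bytes(out)

def decode(data):
    words, prev, i = [], "", 0
    while i < len(data):
        shared, code = data[i] & 0x1F, data[i] >> 5
        i += 1
        if code:
            rest = ENDINGS[code - 1]
        else:
            n = data[i]
            rest = data[i + 1:i + 1 + n].decode("ascii")
            i += 1 + n
        prev = prev[:shared] + rest
        words.append(prev)
    return words
```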
I took this back to Tasman. They put all 70,000 words and the spelling checker onto a 175K floppy for the ZX Spectrum +3.
In comparison, here is a quote from the OP’s blog entry:
“Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being. Looking up a word in this hash table dictionary is a trivial expression, one built into the language. And that's it. Sure, you could come up with some ways to decrease the load time or reduce the memory footprint, but that's icing and likely won't be needed. The basic implementation is so mindlessly trivial that it could be an exercise for the reader in an early chapter of any Python tutorial.
That's progress.”
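For reference, the trivial version the quote describes is roughly this (a sketch, not the OP's actual code):

```python
# Roughly the "3-5 lines" approach the post is describing:
words = set(open("/usr/share/dict/words").read().split())
print("zebra" in words)
```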
But is a simpler, less efficient method progress? Sure, it allows words to be added or removed with ease, and I don't want to advocate over-optimization, but the solution you made for the Spectrum seems better because words don't change much. Why don't we use a similar specialized hash and compressed dictionary format to increase spellchecking speed and allow more words in less space? We could still produce that format from /usr/share/dict/words and the like.
Problems tend to have more than one solution. GP's solution should be documented, yes, but the alternative that won out was computers becoming capable of storing a million words or so in plaintext very easily; doing the same with their compression scheme just isn't worth the space saved nowadays.
Also, compressing could actually be slower on modern computers. Remember when compressing your hard disk made your PC faster, up until disks became faster, at which point it actually made it slower?
Today's CPUs are very fast, so the trend could have flipped again; that would be an interesting benchmark.
How many operations and objects? The method he's talking about would seem more efficient for the purpose, compared to the plain hash version, where all of the strings are still created even if never used.
They could have been non-copyrighted at the time. Database rights didn't apply in the UK until 1998, so I think it would have been fine to have an intern type just the words into the computer without infringing on anything.
I don't know/remember the terms under which this happened. I do remember that when I was trimming it down I found the list contained the trademarks of a competitor. These were removed.
In the case of the ZX Spectrum +3 there wasn't enough disk or memory to hold the uncompressed word list, never mind some form of tree structure.
The search method involved uncompressing each word (one at a time) until either the word was found or a word that should sort after it was reached.
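Continuing the toy format sketched above (again, not the original code), that lookup would look something like this:

```python
def contains(data, target):
    # Walk the compressed list, rebuilding one word at a time; because the
    # list is sorted we can stop as soon as we pass where the target would be.
    prev, i = "", 0
    while i < len(data):
        shared, code = data[i] & 0x1F, data[i] >> 5
        i += 1
        if code:
            rest = ENDINGS[code - 1]
        else:
            n = data[i]
            rest = data[i + 1:i + 1 + n].decode("ascii")
            i += 1 + n
        prev = prev[:shared] + rest
        if prev == target:
            return True
        if prev > target:
            return False
    return False
```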
However, I did intend to use a binary search of the word list on the floppy. I arranged it so any zeroes in the list indicated the start of a word from which decompression could start. Under maximum compression there would be only 26 zero bytes in the list, but by selecting short words at regular intervals I could sprinkle zeros throughout the list (approx. 1 per block). A binary search could scan for the zeros, decompress the associated word and find the section that should contain the target word.
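A rough sketch of that binary-search idea on the same toy format (the real version worked against floppy disk blocks rather than an in-memory buffer):

```python
def contains_fast(data, target, min_span=512):
    # In the toy format a zero byte can only be the first byte of a word that
    # shares nothing with its predecessor, so decompression can restart there.
    # Narrow the search to a span between such restart points, then fall back
    # to the sequential scan above.
    lo, hi = 0, len(data)
    while hi - lo > min_span:
        mid = (lo + hi) // 2
        z = data.find(b"\x00", mid, hi)   # nearest restart point after the midpoint
        if z == -1:
            break
        n = data[z + 1]                    # restart words are stored as literal suffixes
        word = data[z + 2:z + 2 + n].decode("ascii")
        if word <= target:
            lo = z
        else:
            hi = z
    return contains(data[lo:hi], target)
```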
Tasman didn't go for this. They sorted all the words to be checked in memory, then opened the file and did a complete scan.