In my first job I worked for Tasman in Leeds and produced a Word Processor for IBM PC compatibles in 8086 assembler with some help, and then a spelling checker.
For the spelling checker I did a whole load of analysis on a 70,000 word list from Collins and produced a list of tokens to represent common strings of letters. However, in the end I really had to cut the original word list down to get the whole thing onto a single 360K floppy.
After I left Tasman, I was lying in bed one night still thinking about it and realised where I had gone wrong. The tokenising thing, which someone else had put me onto, had blinded me. I had stared at word lists for months and hadn't pinned down the obvious pattern. All but 26 words in the 70,000 word list share the bulk of their characters with the word before.
So the solution was to use 5 bits of the first byte as a count of characters shared with the word before, and 3 bits to indicate common endings (ship, s, ing) or that the following bytes are tokens for the rest of the word. With this I got the word list compressed to less than 2 bytes per word.
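To make the scheme concrete, here is a toy sketch of the prefix-sharing idea in Python. It is not the original assembler: the ENDINGS table and the length-prefixed literal suffix are illustrative stand-ins, whereas the real format tokenised the remaining letters, which is what got it under 2 bytes per word.

```python
# Toy sketch of the prefix-sharing idea, not the original 8086 assembler.
# ENDINGS and the length-prefixed literal suffix are illustrative stand-ins.
ENDINGS = ["s", "ing", "ship", "ed", "er", "ly", "es"]  # hypothetical table, codes 1-7

def encode(sorted_words):
    out = bytearray()
    prev = ""
    for word in sorted_words:
        # low 5 bits: number of leading characters shared with the previous word
        shared = 0
        while (shared < min(len(prev), len(word), 31)
               and prev[shared] == word[shared]):
            shared += 1
        rest = word[shared:]
        if rest in ENDINGS:
            # high 3 bits: index of a common ending (1-7)
            out.append(((ENDINGS.index(rest) + 1) << 5) | shared)
        else:
            # ending code 0: a length-prefixed literal suffix follows
            out.append(shared)
            out.append(len(rest))
            out.extend(rest.encode("ascii"))
        prev = word
    return bytes(out)

def decode(data):
    words, prev, i = [], "", 0
    while i < len(data):
        shared, code = data[i] & 0x1F, data[i] >> 5
        i += 1
        if code:
            rest = ENDINGS[code - 1]
        else:
            n = data[i]
            rest = data[i + 1:i + 1 + n].decode("ascii")
            i += 1 + n
        prev = prev[:shared] + rest
        words.append(prev)
    return words
```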
I took this back to Tasman. They put all 70,000 words and the spelling checker onto a 175K floppy for the ZX Spectrum +3.
In comparison, here is a quote from the OP’s blog entry:
“Fast forward to today. A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being. Looking up a word in this hash table dictionary is a trivial expression, one built into the language. And that's it. Sure, you could come up with some ways to decrease the load time or reduce the memory footprint, but that's icing and likely won't be needed. The basic implementation is so mindlessly trivial that it could be an exercise for the reader in an early chapter of any Python tutorial.
That's progress.”
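For reference, the trivial version the quote describes is roughly this (a sketch, not the OP's actual code):

```python
# Roughly the "3-5 lines" approach the post is describing:
words = set(open("/usr/share/dict/words").read().split())
print("zebra" in words)
```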
But is a simpler, less efficient method progress? Sure, it allows words to be added or removed with ease, and I don't want to advocate over-optimization, but the solution you made for the Spectrum seems better because words don't change much. Why don't we use a similar specialized hash and compressed dictionary format to increase spellchecking speed and allow more words in less space? We could still produce that format from /usr/share/dict/words and the like.
Problems tend to have more than one solution. GP's solution should be documented, yes, but the alternative that won out was computers becoming capable of storing a million words or so in plaintext very easily; doing the same with their compression scheme just isn't worth the space saved nowadays.
Also, compressing could actually be slower on modern computers. Remember when compressing your hard disk made your PC faster, up until disks became faster, at which point it actually made it slower?
Today's CPUs are very fast, so the trend could have flipped again; that would be an interesting benchmark.
How many operations and objects? The method he's talking about would seem more efficient for the purpose, compared to the plain hash version, where all of the strings are still created even if never used.
They could have been non-copyrighted at the time. Database rights didn't apply in the UK until 1998, so I think it would have been fine to have an intern type just the words into the computer without infringing on anything.
I don't know/remember the terms under which this happened. I do remember that when I was trimming it down I found the list contained the trademarks of a competitor. These were removed.
In the case of the ZX Spectrum +3 there wasn't enough disk or memory to hold the uncompressed word list, never mind some form of tree structure.
The search method involved uncompressing each word (one at a time) until either the word was found or a word that should sort after it was reached.
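Continuing the toy format sketched above (again, not the original code), that lookup would look something like this:

```python
def contains(data, target):
    # Walk the compressed list, rebuilding one word at a time; because the
    # list is sorted we can stop as soon as we pass where the target would be.
    prev, i = "", 0
    while i < len(data):
        shared, code = data[i] & 0x1F, data[i] >> 5
        i += 1
        if code:
            rest = ENDINGS[code - 1]
        else:
            n = data[i]
            rest = data[i + 1:i + 1 + n].decode("ascii")
            i += 1 + n
        prev = prev[:shared] + rest
        if prev == target:
            return True
        if prev > target:
            return False
    return False
```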
However, I did intend to use a binary search of the word list on the floppy. I arranged it so any zeroes in the list indicated the start of a word from which decompression could start. Under maximum compression there would be only 26 zero bytes in the list, but by selecting short words at regular intervals I could sprinkle zeros throughout the list (approx. 1 per block). A binary search could scan for the zeros, decompress the associated word and find the section that should contain the target word.
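A rough sketch of that binary-search idea on the same toy format (the real version worked against floppy disk blocks rather than an in-memory buffer):

```python
def contains_fast(data, target, min_span=512):
    # In the toy format a zero byte can only be the first byte of a word that
    # shares nothing with its predecessor, so decompression can restart there.
    # Narrow the search to a span between such restart points, then fall back
    # to the sequential scan above.
    lo, hi = 0, len(data)
    while hi - lo > min_span:
        mid = (lo + hi) // 2
        z = data.find(b"\x00", mid, hi)   # nearest restart point after the midpoint
        if z == -1:
            break
        n = data[z + 1]                    # restart words are stored as literal suffixes
        word = data[z + 2:z + 2 + n].decode("ascii")
        if word <= target:
            lo = z
        else:
            hi = z
    return contains(data[lo:hi], target)
```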
Tasman didn't go for this. They sorted all the words to be checked in memory, then opened the file and did a complete scan.