
A Spellchecker Used to Be a Major Feat of Software Engineering - denzil_correa
http://prog21.dadgum.com/29.html?repost
======
localhost
I wrote a spellchecker in the 80's for the Commodore 64. Step 1 was writing an
efficient data transfer algorithm to move the bits off of the disk so that the
1541 drive could hit the sector interleave (it would miss it using its built-
in data transfer algorithm). This sped up IO by 500%. But for reference, that
was still only ~25 kbps, a bit slower than a 28.8 kbps modem. Step 2
was coming up with a data compression algorithm that exploited the
distribution of the top 30 or so 2/3-grams in English. I had a dictionary of
about 80K words.

I can confirm this is on the order of several months worth of work :)

~~~
fromdoon
This seems fascinating. Can you share a more detailed description?

Like, what kind of data compression algo did you choose at that time, and what
were the alternatives?

How did the techniques you developed at that time help you later, and what
significance/standing do they have in today's programming world?

How did your spell checker turn out in terms of performance and memory?

Were there better implementations around at that time? Which one was widely
accepted as the best one?

Thanks in advance!

~~~
localhost
Happy to. I was 15 at the time and had limited knowledge of algorithms, so I
invented my own :)

IIRC I packed each character into 6 bits, using the first 26 values for A-Z,
and the rest of the values for the common 2-3 grams that I found in the back
of a dictionary in the local library. I had some other scheme for building
words from stem + suffixes (eg flood, flooded).

What I learned most from that experience was about reverse-engineering and
perseverance. I had to figure out what the 1541 drive did by
disassembling its ROM and following the boot sequence. From comparing it to a
2040 drive (an older dual-drive variant that was much more expensive -
something like $2500 in 1981 dollars) I knew quite a bit about the sector
interleave spacing - the disk could do sustained 2-sector reads per
revolution. So I knew there was room for improvement if I could just get the
transfer rate between the disk and the C-64 to go faster. I did that by
improving how it transferred bits down the wire. To cut costs, Commodore had a
serial interface vs. the parallel IEEE-488 interface used in the 2040 /
Commodore PETs of the previous generation. They had a super inefficient
algorithm where they did a handshake for every bit that went down the wire.
This was due to the VIC chip in the C-64 stealing cycles from the 6510 CPU
whenever it needed to read from shared RAM. Since they weren't clever enough
to figure out when the VIC chip would steal cycles they did a handshake for
every bit.

I improved this by asynchronously transferring 2 bits at a time using the
clock and the data lines for ~3 bytes. I could read a register on the VIC chip
that would tell me what scan line was being processed and from that I could
tell when the VIC chip would steal the bus. Also on the vertical flyback I
would be able to transfer bits with impunity.

This speed boost made it possible for me to hit the interleave on the disk and
get the transfer back up to ~25Kbps.

These were pretty low level hacks (that worked quite well) to get things to go
fast. Not sure how any of that transfers to the things we do today. One
thing's for sure, I don't think I'll ever understand a modern computer as well
as I understood those computers of my childhood.

The spell checker was completely IO bound. The 1541 stored ~170KB of data. The
words on disk were sorted and I generated a unique sorted list of words in
memory. It was, at the time, the fastest spell checker for the Commodore 64,
and it was packaged with a popular word processor of the day called WordPro
64. Most
of the other spell checkers had much smaller dictionaries and took a LONG time
to read the disk. I could read 170KB in something like 70 seconds, which,
thanks to my data transfer improvements, was 5x faster than any other spell
checker at the
time. I'm pretty sure I had more words in my dictionary too. But things are a
bit fuzzy - that was nearly 30 years ago!

------
sehugg
_A program to load /usr/share/dict/words into a hash table is 3-5 lines of
Perl or Python ... That's progress._

That's progress in terms of cheap RAM and cheap CPU cycles. In terms of
software architecture, that's not progress -- that's brute force.

Assuming that code for efficiently indexing and compressing this kind of
dictionary was written 25 years ago, there doesn't seem to be any good reason
not to reuse it. The fact that there's impedance to doing so means that our
languages and our tools still need improving.

(25 years ago I'd have predicted that in 2012 some kind of AI-driven optimizer
could have figured out the correct data structures for this problem and
automatically converted the naive hash table lookup into a more efficient
structure)

Also, jeesh, use mmap() and a binary search or something.

~~~
baddox
> _Assuming that code for efficiently indexing and compressing this kind of
> dictionary was written 25 years ago, there doesn't seem to be any good
> reason not to reuse it._

What if the efficient code is extremely "clever" and hard to understand from
reading the code (and thus hard to fix bugs, add new features, etc.)? Doesn't
it make sense to revert to the simpler and technically less efficient version
if you _know_ that the hardware will be more than enough?

I suppose you could still argue that the more efficient version is better, at
least once it is mature and packaged in some library or framework.

~~~
gizmo686
What makes sense is creating a spell checker library, and then having other
programs use that instead of implementing spell check themselves.

Granted, if your program is such that a naive, 3-5 line python implementation
is sufficient, then it is probably not worth the effort to use a library. If
my web browser, word processor, e-mail client, and every other large program
with text editing each implement spell check themselves, I doubt I would get
performance comparable to what they'd get by all using a library designed
specifically for spell check and optimized by the many projects that have used
it over its history. On that subject, I would be interested to know whether
most of the large programs I mentioned actually implement spell check
themselves.

------
Dn_Ab
I wonder if someday in 30 years people will look back in the same way and say:
wow, you know there was once a time when search engines were major feats of
engineering - and Decentralized Networks were deemed impractical!

They, just like us, were resource limited. The modern constraints have merely
shifted. They came up with amazing tricks to get around physical bottlenecks -
just as we do now. We tend to call our current foci engineering instead of
'bags of tricks/hacks'. But when technology progresses until we take for
granted things that were once hard, we young ones look back with a mix of disbelief
and awe that people used to go through so much trouble for something that is
now so trivial.

~~~
objclxt
I think search engines will always be major feats of engineering - the big
difference between a spell-checker and a search engine is that in the former
the problem space rarely evolves.

An English dictionary, for example, only has a small number of words added
each year. A search engine is a different thing entirely: not only are new
items added all the time, but the _format_ of these items evolves. Today
search engines need to deal with non-textual formats such as videos, images,
and audio. This is a very difficult problem.

The fact that there are now APIs and libraries that let developers use things
like latent semantic mapping without having to roll their own algorithms and
solutions can certainly make writing an intelligent textual search engine a
lot easier than it used to be.

But the key difference here is that what we demand from a 'search engine' (and
by this I mean Google, or Bing) in ten years will be much greater than what we
demand now. We'll have new media to search for, new types of queries, new
expectations. But a spell checker will always be a spell checker.

At least, that's my opinion. I'm sure someone will disagree!

~~~
sp332
Sure, I'll disagree :) When you have enough memory bandwidth to compute the
relationships of all pages in just a few seconds, then you can re-index the
entire corpus according to whatever criteria you want (PageRank or face
detection or whatever) for each query. So you have some spiders that add
content to the database, and re-index the database for every query.

~~~
objclxt
Good point - I don't know if that solves the problem of getting different
forms of media into the system in the first place, though. I think as long as
people create new types of media there will be interesting problems in
searching them. Searching video based on text queries, for example, is a much
harder problem than simple text search!

------
clarkmoody
I would like to know some of the details of the algorithmic tricks that were
employed to solve this problem. As I was reading, I was thinking of a tree
structure encoding the dictionary with each letter (of a valid word) being a
node. If the next letter isn't found, or if you reach a leaf with letters left
in the candidate word, then you've misspelled something.

Does anyone have an idea of how much compression could be achieved through
such a structure?

~~~
henrik_w
You're describing a trie, <http://en.wikipedia.org/wiki/Trie>

~~~
joshuahedlund
Multiple software engineers independently arriving at the same obvious
solution? Sounds like something somebody would try to patent today...

~~~
sukuriant
Thankfully, many of the novel solutions created in the early days weren't
created by people who just wanted to make a buck.

------
darklajid
Actually I tend to disagree. Yeah, you can do that for a hobby project.

I'm working in OCR / ICR here. We often depend on dictionaries to fix
recognition errors. So you want to correct names, for example, by creating a
more or less complete list of all names (assuming you know all the names in
advance, e.g. for customers). The dictionary is huge.

"Looking up a word in this hash table dictionary is a trivial expression, one
built into the language. And that's it."

Well, no. A spellchecker that just answers "I don't know that word" would be
crap. You need to find similar words, which isn't part of any standard library
I am familiar with. In addition, it's still a challenge to find all candidates
that have a Levenshtein distance (or LD, or whatever metric you like) of N
from the input, for N > 2 or 3.

Think addresses. You have a list of all streets, which tend to be rather long
(Germany: "Strasse" = street; the pattern "<name> Strasse" is common, so let's
assume you expect 12-14+ chars). The longer the expected input, the higher I'd
set the tolerance for errors. Finding all matches with 3+
substitutions/deletions/additions on a large dataset, in a small amount of
time, is still interesting and a challenge.

In fact, we partner with one company that provides nothing but exactly this,
as their single (well-known, respected) product.

~~~
josephlord
You should probably also go Bayesian on word frequencies for the suggestion
ordering. Maybe also on error types (missing letter, repeated letter, nearby
(on a standard keyboard) letter).

~~~
darklajid
Your comment and the reply to it are good if we're talking spell checker
software for stuff people type.

What I have to deal with as well are recognition errors. From trivial
replacements (I/l/1, C/G, i/j, P/R, etc.) to 'the engine says there's some
character here according to the character segmentation results, but it has no
damn clue what character it could be', that's a different look at the same
problem.

Also, for typed content I see quite a lot of shortcuts. SwiftKey for Android,
nice as it is, seems to be too braindead to correct words that _start_ with
the wrong character (if I type vharacter, with just one single slip at the
start, it fails miserably and looks baaad imho). Assumptions like 'just check
adjacent characters on the keyboard (special points for assuming a keyboard
layout)' or 'ah, the user is certainly starting the jnput correctly and just
might misspress a letter in the middle of a word' are nice models to make the
developer's life easier, but they don't work for a good number of cases,
either.

------
humdumb
"That's progress."

Yes, but not in software. It's the hardware that's improved. So the sw can be
sloppy and inefficient and most people won't notice.

Hardware has improved. Software has not. This is hard for some folks to
accept. But it is a plain truth. Take away the added power of new hardware and
what's left? You wouldn't want to know.

It is easier to do the same basic computing tasks as in 1984, with less
effort, but where you can exceed what you could do in 1984 it's not because
you have better software. It's because you have better hardware and can afford
to be sloppy. 2-3 lines of Perl/Python, but how many lines are in the binary or
the libraries? (e.g. what if you discovered there was gratuitous use of
backtracking in _all_ of Perl's regexes, even for the simplest matching; who would
care?) There is no Moore's Law for software. To put it another way, most of
the computing tasks being done today are the same ones (albeit on a different
scale) and use the same methods (or even lazier ones) as in 1984. Hence the
example of spellcheck. It's one of many tasks performed in 1984 and today and
it will be performed in the future.

It's easy to mistake advances in hardware (more power) for advances in
software (novel and more efficient approaches to problems).

~~~
quantumstate
Trying to assess how software performance has improved in the way you are
looking at it doesn't really work. There is no point putting time into
optimising things that run fast enough. There are many algorithms that have
been improved dramatically. People use this kind of matching for DNA; nobody
is busy creating slower algorithms, they are only getting faster (e.g.
www-igm.univ-mlv.fr/~lecroq/articles/cpm2009fl.pdf). Prime factorisation,
optimisation, etc. are all being actively researched.

In other areas people are doing things which weren't even possible on old
hardware, think about 3D graphics. Real time lighting approximations are doing
really clever stuff, just look at the latest unreal engine demo. If you put
these into old hardware you would get the same quality render in less time.
Modern ray tracing engines are using more sophisticated algorithms, giving
better results in less time.

Beyond the algorithmic work optimising for modern hardware is different to old
hardware. An optimal program for old hardware isn't optimal for modern
hardware and vice versa. Now if a program needs data from RAM you have to wait
hundreds of CPU cycles. Even accessing the L3 cache, which sits on the
processor die, takes about 75 cycles on modern chips. So you can recalculate
things instead of fetching them from memory and find your program runs faster.

Similar things happen with branches. I was looking at some GPU shaders
recently that had a simple raytracing loop (parallax occlusion mapping); the
obvious way to do it is to stop when you intersect the surface. Instead it
is actually faster to always run the loop for 10 steps because this removes
the branch. So you do twice as many loop iterations on average but your code
runs faster.

Of course you can afford to be sloppy as well. People are 'sloppy' so that
their code is faster to write, easier to read and less likely to have bugs (3
lines of Python is pretty likely to be easy to read and bug free). I would say
software is advancing in many different ways, and hardware has enabled that to
happen.

~~~
humdumb
The numerous references to hardware in your comment only serve to illustrate
my point.

Of course there have been some new algos and improvements to existing ones
since 1984. But nothing approaching a Moore's Law. The dramatic changes the
blog post highlighted are due to hardware, not software.

Take away the hardware advances (hold the hardware as a constant), then look
at software as the variable. Then we can have a meaningful discussion of
software improvements over time. IOW, run new software on old hardware.

To measure advances in hardware, hold software as the constant. IOW, run old
software on new hardware.

If both are variables, it's difficult to assess how much software has improved
on its own, without the benefit of new hardware.

Linus Torvalds' comments on the future of RISC, CPU instructions, and
compatibility versus "cool new features" in the /. interview were interesting.

------
jones1618
Just because a problem has been "solved" for one platform (Fast PC's w/
Gigabytes of ram/disk) doesn't mean it is solved for all platforms (Web apps
for phones/tablets, etc).

John Resig (of jQuery and Khan Academy fame) had a really interesting thread
on creating a fast-loading (i.e. small, efficient) dictionary for a JavaScript
app. <http://ejohn.org/blog/revised-javascript-dictionary-search/>

The comments have some really clever ideas for encoding the dictionary and
Resig weighs in on his favorite solution with informative data. There's an
interesting "Succinct Trie" data structure that turns a 620K dictionary into a
220K string that doesn't need to be decoded to be searched.
<http://stevehanov.ca/blog/index.php?id=120>

------
ta12121
LOL, the URL is: <http://prog21.dadgum.com/29.html?repost>

That might explain why it's so familiar.

~~~
DanBC
Here are four links to threads for this url:

(<http://news.ycombinator.com/item?id=3466388>)

(<http://news.ycombinator.com/item?id=212221>)

(<http://news.ycombinator.com/item?id=1702445>)

(<http://news.ycombinator.com/item?id=2928102>)

------
IsTom
> For reference, on my MacBook, the standard dictionary in
> /usr/share/dict/words is 2,486,813 bytes

The Polish dictionary is about 3,500,000 /words/, and many more are generated
by the affix file. From some archaeology I did a while ago: around 1997 there
were custom scripts written by the creators/maintainers of the Polish ispell
dictionary because, as the files note, using the standard toolset would have
required over 1 GB of RAM (in 1997, mind you).

------
timr
_"Writing a spellchecker in the mid-1980s was a hard problem. Programmers came
up with some impressive data compression methods in response to the
spellchecker challenge."_

It's still a hard problem. Most working programmers today couldn't write a
good spellchecker if their life depended on it.

Sure, your average coder can load the dictionary file into RAM in a hash table
(built by someone else), and a few of those could maybe use a library
implementing a Trie or a DAWG (built by someone else)...and a percentage of
those might even understand what they're doing well enough to know how to use
those tools to make things better. But once you've got the data into memory,
what do you do with it? About the most anyone can do is point to Peter
Norvig's blog post.

Sadly, very few people understand the theory behind it well enough to make
improvements. So for everyone else, it's a _major feat_ of software
engineering -- on a bigger computer.

~~~
StavrosK
Sure, it's still a hard problem, like carpentry is a hard problem. I've never
had to do it because it's been solved for 20 years, but I'm fairly sure I'd
get great at it if it were my day job.

~~~
timr
Loading data into a hash table is a solved problem. Coming up with clever ways
of solving problems that are beyond the scale of today's computers is the
challenging, _interesting_ part.

------
onetwothreefour
Writing a spellchecker is still a major feat of software engineering. Loading
up some words into a hash table does not a spellchecker make.

~~~
kabdib
I wouldn't call it /major/, but a good one is still more than an afternoon's
worth of mucking around with an in-memory hash table.

~~~
wonderzombie
It's pretty serious, to hear Peter Norvig tell it:
<http://norvig.com/spell-correct.html>. Also, it's one thing to do it, and
another thing to make it _fast_.

Seems like it's one of those 80%/20% things, to butcher that expression, where
80% of the effort goes into the last 20% of speed/quality. :)

------
camiller
Wow 256K, the luxury!

Apple //e, 64K memory (possibly 128K if you had the memory expansion card)
(The Apple ][+ only had 48K)

You start with the 143K floppy containing the word processor in Drive 1
(S6,D1) and the floppy with your document in Drive 2 (S6,D2). Edit away,
remember to save occasionally.

When you're done with your rough draft you save your file, exit the word
processor, and swap in the floppy containing the spell check program. Point it
at your file and swap in the floppy with the dictionary. Work through the file
and approve/reject its suggestions.

Exit the spell check and swap in the word processor floppy...

Repeat as often as necessary.

------
DanBC
We have all this power and we use it in quite dull ways.

Not using a dictionary seems like more fun:

(<http://www.spellingsociety.org/journals/j20/spellchecking.php>)

> _Morris, Robert & Cherry, Lorinda L, 'Computer detection of typographical
> errors', IEEE Trans Professional Communication, vol. PC-18, no.1, pp54-64,
> March 1975._

------
jes5199
Wow, this article has no research and is just half-baked speculation. "there
were some very clever data structures" yeah? Like what? And the proposed
solution? Load the whole thing into a hash? Did you forget that spellcheckers
offer suggested corrections? How do you expect to implement that?

------
obsession
I'm no expert on data structures or algorithms but couldn't you just simply
use bloom filters?

~~~
finnw
I don't think a bloom filter would be very useful for a spell checker. Bloom
filters reduce disk reads when the majority of keys searched for are _not_ in
the dictionary. But in the context of a word processor, the majority of words
tend to be spelled correctly, so very few would be rejected by the filter.

------
prezjordan
If anyone is looking to refresh their programming chops for a problem like
this, here [1] is a fairly straightforward, simple implementation of a spell-
checker. Provided for a Python Course at Harvey Mudd (the class I TA uses this
new curriculum), the task is to memoize a recursive string-distance algorithm.
The results are actually quite lovely.

[1]: <https://www.cs.hmc.edu/twiki/bin/view/ModularCS1/SPAM>!

 _Note_ : copy the entire URL, the `!` breaks the hyperlinking.

------
larrys
Interesting, the README file in /usr/share/dict contains this:

--------------------------------------------------------

# $NetBSD: README,v 1.2 1997/03/26 07:14:32 mikel Exp $

# @(#)README 8.1 (Berkeley) 6/5/93

WEB ---- (introduction provided by jaw@riacs)

-------------------------

Welcome to web2 (Webster's Second International) all 234,936 words worth. The
1934 copyright has elapsed, according to the supplier. The supplemental
'web2a' list contains hyphenated terms as well as assorted noun and adverbial
phrases. The wordlist makes a dandy 'grep' victim.

--------------------------------------------------------

------
blago
A good spell checker is still a major feat of software engineering.

------
praptak
Large memory sizes did a similar thing to visibility determination in computer
graphics. Determining which parts of multiple intersecting 3D shapes are
visible used to be tricky and dependent on the shapes - triangle vs triangle
was different from triangle vs sphere, tube vs sphere, and so on. Now you can
just use z-buffering which determines the visibility on a per-pixel basis, not
to mention that all of this is implemented for you in hardware.

------
Radim
A better title: "An English-only spellchecker with a fixed, static vocabulary
used to be a major feat of software engineering".

(because other cases still pretty much are)

------
tomrod
Fascinating! I look forward to the day when data collection methods are
similarly looked upon as trivial by future standards. Dealing with PDF
documents one cannot scrape but still needs the data from gets to be
ridiculous!

------
Xurinos
Grate spellcheckers continue two bee a major feet of software engineering.

