
Soundex - Ivoah
https://en.wikipedia.org/wiki/Soundex
======
drawkbox
When I worked at eMarketing / Nomadic Agency I used soundex or SOUNDEX() in
Microsoft SQL Server many times. Very useful.

One big place was search on all the Kraftfoods sites: recipes, products and
brand sites. One use, from about 2000 to 2008, was ingredient lookups from
misspelled search terms; it is still there in the search function at
[http://www.kraftrecipes.com/](http://www.kraftrecipes.com/). When you put in
'chiken' you'll get 'chicken', for instance. Pretty useful for misspellings
back then and even today.
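
For anyone who has not seen the algorithm, the 'chiken' to 'chicken' match can
be sketched in Python. This is a minimal sketch of American Soundex as
described on the linked Wikipedia page, not SQL Server's SOUNDEX() itself,
whose edge-case behavior may differ. The names `soundex` and `CODES` are made
up for illustration.

```python
# Letter-to-digit table of American Soundex; vowels, y, h and w get no digit.
CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word: str) -> str:
    """Return the four-character American Soundex key of a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Adjacent letters with the same digit are encoded only once.
        if code and code != previous:
            key += code
        # h and w do not separate equal digits; vowels (and y) do.
        if ch not in "hw":
            previous = code
    # Pad with zeros or truncate so the key is always four characters.
    return (key + "000")[:4]


print(soundex("chiken"), soundex("chicken"))  # C250 C250
```

Both spellings collapse to the same key, so a lookup on the key finds
'chicken' for 'chiken'.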

Fun fact: we later also used Alta Vista search and even had a Google appliance,
back when they made those, for aggregated site searching that tied into all
search results across their brand sites. Search would check misspellings,
partly with SOUNDEX(), and then aggregate ingredients, recipes, products and
content across their enterprise brand sites using the AV or Google boxes.

Another fun fact: the Kraftfoods sites were one of the first production uses
of Microsoft .NET. We worked with Microsoft on .NET 1.0 in 2001 to coincide
with the release in 2002. We switched them from a combination of Perl sites
and Java / ATG Dynamo sites, running on 10+ servers plus 20+ Oracle servers,
to .NET with 3-4 web/app servers and 3 Microsoft SQL Servers.

~~~
stevesimmons
Ah, ATG Dynamo... Memories of my startup days in 2001. Sun gave us a Solaris
server and pushed us hard to use ATG Dynamo. That lasted all of half a day
before I downloaded Zope, learnt Python and within a couple of days had a far
more customizable prototype site ready to demo.

As I'm now a full time Python developer, I guess I owe it to ATG Dynamo...

------
knadh
Metaphone [1] addresses a lot of issues Soundex has. While Soundex is aimed
specifically at names, Metaphone works for all English words.

PS: Inspired by Metaphone, I wrote MLphone [2] a phonetic lib for the
Malayalam (South India) language. The phonetic keys the algorithm produces are
Roman characters though.

[1]
[https://en.wikipedia.org/wiki/Metaphone](https://en.wikipedia.org/wiki/Metaphone)

[2] [https://nadh.in/code/mlphone/](https://nadh.in/code/mlphone/)

------
joezydeco
If you live in Illinois, Wisconsin, or Florida, the Soundex code is used to
create your driver's license number. You can derive almost anyone’s number if
you know their full name and birthdate:

[http://www.highprogrammer.com/alan/numbers/dl_us_shared.html](http://www.highprogrammer.com/alan/numbers/dl_us_shared.html)

------
inertiatic
I will echo the opinion of others in this thread and say that in my experience
fuzzy matching based on string distance metrics is a better approach in most
cases I can think of.

I do search related stuff and we use phonetic algorithms for names (in a
rather interesting way as well which I haven't seen employed elsewhere) and
will occasionally get reports or inquiries of weird unexpected matches, or
questions about small typos not producing any of the expected results.

I feel these approaches were maybe a better fit for a time when talking was
absolutely the main means of communication, but in an era where people
communicate more and more by typing things into their phones, any input is
frequently a) copied over instead of transcribed or b) first seen written and
then typed out by the user on a small touchscreen keyboard, with 1 to 2 typos
of letters close to the actual intended letter.

I wonder whether there is an approach that takes this key distance into
account (i.e. in a search for Nock, results containing Nick should rank higher
than Neck)?
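
One way to sketch the idea is a Levenshtein distance whose substitution cost
is lower for keys that sit next to each other. Everything here is an
assumption for illustration: the US QWERTY layout, the rough row stagger, the
0.5 cost for neighbouring keys, and the names `substitution_cost` and
`keyboard_distance`. It is not taken from any particular search engine.

```python
from math import hypot

# Assumed layout: US QWERTY letter rows with a rough horizontal stagger.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {
    ch: (col + offset, row)
    for row, (keys, offset) in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def substitution_cost(a: str, b: str) -> float:
    """0 for equal letters, 0.5 for neighbouring keys, 1 otherwise."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
        # Keys closer than 1.3 key widths are treated as neighbours.
        if hypot(x1 - x2, y1 - y2) < 1.3:
            return 0.5
    return 1.0


def keyboard_distance(s: str, t: str) -> float:
    """Levenshtein distance with key-adjacency-weighted substitutions."""
    s, t = s.lower(), t.lower()
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [float(i)]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,                          # delete a
                current[j - 1] + 1.0,                       # insert b
                previous[j - 1] + substitution_cost(a, b),  # substitute
            ))
        previous = current
    return previous[-1]


query = "Nock"
ranked = sorted(["Neck", "Nick"], key=lambda name: keyboard_distance(query, name))
print(ranked)  # ['Nick', 'Neck']
```

With these assumed costs, Nick is 0.5 away from Nock (o and i are adjacent
keys) and Neck is 1.0 away, so Nick ranks first.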

~~~
pcwalton
> I will echo the opinion of others in this thread and say that in my
> experience fuzzy matching based on string distance metrics is a better
> approach in most cases I can think of.

Most likely, but keep in mind that one of the design goals of Soundex was that
it be easy for a human to work out and that it be indexable. It was developed
for the US Census at the start of the 20th century, after all…

~~~
inertiatic
Yes! I don't doubt it was/is a useful tool for some purposes, just commenting
on using related techniques to tackle text search.

------
kastnerkyle
I've had good luck with Caverphone for a number of speech-specific tasks [0].
There is a Python implementation directly in the PDF. I also wrote a version
here [1]; no idea if it exactly matches the PDF version, but it worked for my
cases.

Followups to soundex such as metaphone are encumbered by license issues as far
as I know, but Caverphone is free and clear AFAIK.

[2] is an insanely great overview of many of these algorithms, be sure to
check it out if you are into this stuff.

[0]
[https://caversham.otago.ac.nz/files/working/ctp150804.pdf](https://caversham.otago.ac.nz/files/working/ctp150804.pdf)

[1]
[https://gist.github.com/kastnerkyle/a697d4e762fa8f53c70eea7b...](https://gist.github.com/kastnerkyle/a697d4e762fa8f53c70eea7bc712eead/)

[2] [http://ntz-develop.blogspot.ca/2011/03/phonetic-
algorithms.h...](http://ntz-develop.blogspot.ca/2011/03/phonetic-
algorithms.html)

------
spc476
I used both Soundex and Metaphone to handle URLs for a Bible website:
[http://literature.conman.org/bible/](http://literature.conman.org/bible/) You
could type a URL as:

    
    
        http://bible.conman.org/kj/genasys.1:1
    

and it would redirect to the proper page:

    
    
        http://bible.conman.org/kj/Genesis.1:1
    

You have to _really_ misspell something for it to not work properly.

------
rasz
If you are interested in phonetic algorithms, sorting, recognizing and
filtering, you will enjoy the Talking Banana Twitch ban story by Useless Duck
Company:
[https://www.youtube.com/watch?v=bJ5ppf0po3k](https://www.youtube.com/watch?v=bJ5ppf0po3k)

~~~
JepZ
Hint: The relevant part to this topic starts at minute 11.

------
JepZ
While I like the idea of Soundex, I always had problems with its fixed length.
In addition it is limited to the English language and for other languages you
might need different algorithms (e.g. Cologne phonetics [1] for German). As
others have mentioned, Metaphone [2] is another alternative which got some
traction in recent years, but I haven't tried it in a real world scenario yet.

For some use-cases n-gram [3] based string comparison might be an option too.
It is in no way phonetic (therefore universal for many languages), but when it
is just about finding similar words it often produces better results than the
original Soundex (mostly due to its length limitation).

[1]:
[https://en.wikipedia.org/wiki/Cologne_phonetics](https://en.wikipedia.org/wiki/Cologne_phonetics)

[2]:
[https://en.wikipedia.org/wiki/Metaphone](https://en.wikipedia.org/wiki/Metaphone)

[3]:
[https://en.wikipedia.org/wiki/N-gram](https://en.wikipedia.org/wiki/N-gram)
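
A minimal sketch of the n-gram comparison mentioned above, using character
bigrams and the Dice coefficient. The choice of Dice, the `_` padding
character (which assumes the words themselves contain no underscores), and the
names `ngrams` and `dice_similarity` are assumptions for illustration; other
n-gram measures exist.

```python
from collections import Counter


def ngrams(word: str, n: int = 2) -> Counter:
    """Count the character n-grams of a word, padded so edges count too."""
    pad = "_" * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram multisets, from 0.0 to 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0  # two empty inputs with n == 1
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((grams_a & grams_b).values())
    return 2 * shared / total


print(dice_similarity("chiken", "chicken"))
print(dice_similarity("chiken", "kitchen"))
```

Unlike a Soundex key, the score uses the whole word, so long words are not
truncated to four characters before comparison.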

------
kbutler
Years (decades) ago, I read about soundex, and found this little language that
came with a soundex module. That was my introduction to Python, which I've
used for my master's thesis and in personal and professional development.

But Python has since removed the soundex module.

~~~
Abishek_Muthian
Hi, interesting; do you know why?

~~~
kbutler
Found it - it was a C module, and was available until 2.1, when many little
used modules were removed.

1.6.1 declared it "obsolete":
[https://www.python.org/download/releases/1.6.1/](https://www.python.org/download/releases/1.6.1/)

    
    
      Obsolete Modules
      ...
      soundex. (Skip Montanaro has a version in Python but it won't be included in the Python release.)
    

Looks like it was finally removed in 2.1:
[https://www.python.org/download/releases/2.1.3/notes/](https://www.python.org/download/releases/2.1.3/notes/)
matches the NEWS file in
[https://www.python.org/ftp/python/2.1/](https://www.python.org/ftp/python/2.1/)

    
    
      - Removed the obsolete soundex module.
    

[https://pypi.org/project/Fuzzy/](https://pypi.org/project/Fuzzy/) has a
modern implementation if you want to play with it.

~~~
Abishek_Muthian
Hey, much appreciated! thanks for the Fuzzy mention as well :)

------
ACow_Adonis
In truth, I've never had much luck with the phonetic algorithms, and I've
implemented Caverphone 2, Double Metaphone, and NYSIIS [0].

Totally subjective, but in my domain I've had better results either with
cheaper string distance/similarity metrics (Hamming, Jaro/Winkler, etc.) or,
if you're looking for some kind of resource-saving fuzzy indexing or blocking,
with an application that uses or extracts n-grams. Your mileage may vary...

[0] [https://github.com/DJMelksham/SAS-Data-Linking-
Functions](https://github.com/DJMelksham/SAS-Data-Linking-Functions)
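
For reference, one of the string similarity metrics mentioned above can be
sketched in a few lines of Python. This is a textbook-style sketch of Jaro and
Jaro-Winkler similarity, not the SAS code from the linked repository; the
function names and the default scaling factor of 0.1 are assumptions chosen
for illustration.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are within this many positions.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


print(round(jaro("martha", "marhta"), 4))          # 0.9444
print(round(jaro_winkler("martha", "marhta"), 4))  # 0.9611
```

The prefix boost reflects the assumption that typos are less likely at the
start of a name than later in it.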

------
dmlittle
If you're interested in soundex, you should also check out metaphone[1]

[1]
[https://en.wikipedia.org/wiki/Metaphone](https://en.wikipedia.org/wiki/Metaphone)

------
dfdashh
A few years back I worked on record linkage projects that relied in part on
Soundex. My experience is that Soundex is at the fast end of the phonetic
algorithm speed spectrum, while (Double) Metaphone is at the slow end. In the
middle are modifications to Soundex or similar approaches like Soundex2,
Phonex, and NYSIIS.

For those interested I'd highly recommend the work of Peter Christen [1], who
does a ton of research in this space. If you want to see some code, check out
the implementations of several of these algorithms I wrote a while back [2].

[1]:
[http://users.cecs.anu.edu.au/~christen/](http://users.cecs.anu.edu.au/~christen/)

[2]:
[https://github.com/antzucaro/matchr](https://github.com/antzucaro/matchr)

------
da_chicken
I've still got some applications that use SQL Server's SOUNDEX() function for
fuzzy name matching. It's not perfect, but it works pretty well for most
names. I've used it in a student information system to look for duplicate
student entries (happens more often than you'd think).

------
vszakats
Such a function existed back in Clipper '87, dBase IV, FoxPro and their
descendants. Here's a Clipper-compatible implementation in C:
[https://github.com/vszakats/harbour-
core/blob/master/src/rtl...](https://github.com/vszakats/harbour-
core/blob/master/src/rtl/soundex.c)

(Disclaimer: source code author here.)

------
endriju
A pretty easy way to discover how Soundex works is to play with
[http://gridoc.com/fuzzy-matching/](http://gridoc.com/fuzzy-matching/), a
tool for fuzzy record matching that supports Soundex and Levenshtein Distance.

Disclaimer: I'm the author of the tool.

------
TeMPOraL
> _The algorithm mainly encodes consonants; a vowel will not be encoded unless
> it is the first letter._

Interesting. Isn't this similar to how Hebrew works (or at least the one used
in the Bible worked)? I wonder about the rationale (in either case).

~~~
laurieg
This is actually somewhat close to how spoken English works. We change many
vowel sounds to the neutral 'schwa' in real speech. Also, many accents use
very different vowel sounds (think New Zealand) but are still easy to
understand.

------
codemaniac
My implementation of Soundex in Python -
[https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cd...](https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cdf66327a)

------
cakes
I was surprised (years ago) when I learned about SOUNDEX() in Microsoft SQL
Server. I always wondered why SOUNDEX was in SQL Server (not that I thought it
shouldn't be, was just curious).

