
Soundex – a phonetic algorithm for indexing names by sound - jstrieb
https://en.wikipedia.org/wiki/Soundex
======
knadh
Metaphone is a much better alternative to Soundex for phonetic indexing of
English words. Phonetic indexing is an interesting concept, and inspired by
Metaphone, I was able to write algorithms for Malayalam[1] and Kannada[2], two
Dravidian (Indic) languages. They work well for searches and suggestions.

[1] [https://github.com/knadh/mlphone](https://github.com/knadh/mlphone)

[2] [https://github.com/knadh/knphone](https://github.com/knadh/knphone)

~~~
geektips
Thank you, I am a frequent user of your olam dictionary. Is there a JavaScript
port of mlphone available ?

~~~
knadh
Thanks :) There isn't one in JS, but it should be really easy to port.

------
simcop2387
Anyone interested in this should also look at the metaphone and double
metaphone algorithms. Soundex handles American English names decently well but
metaphone and double metaphone work better for more general cases

~~~
DocTomoe
Also, Kölner Phonetik if you work in a German-language context.

------
mci
It is important to distinguish two use cases of phonetic algorithms:

\- Recovering English spelling from English pronunciation. Soundex and
Metaphone are good at this.

\- Undoing the torments of Eastern-European sibilants in various
transliterations of surnames, e.g. Tchaikovsky, Tschaikowski, Tchaïkovski,
Ciajkovskij, Tsjaikovski, Tjajkovskij, Tsjajkovskij, Csajkovszkij, and
Czajkowski. Here, Double Metaphone or Daitch–Mokotoff Soundex do a better job.

[1]
[https://en.wikipedia.org/wiki/Metaphone](https://en.wikipedia.org/wiki/Metaphone)
[2]
[https://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Sounde...](https://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex)

~~~
mtts
The second category essentially concerns spelling variations. Years and years
ago I built a tools for searching old documents and to deal with all the
spelling variants these contain I implemented something called the Gloria Guts
algorithm, which ranks words based on how much their spelling differs. As I
recall it worked much better than sounded for our data set

------
sdrothrock
> as pronounced in English

I was curious about whether or not there was an IPA version (which would then
work for any language) -- one of the first results I found was Eudex, which
looks a little different (since it's a hash of the pronunciation) but more
versatile.

[https://github.com/ticki/eudex](https://github.com/ticki/eudex)

~~~
harperlee
But how do you map from written form to IPA phonemes? For instance french is
going to have a big problem there. Foie /fwa/ is a good example as has o, i
and e written, but you pronounce a and some sort of u...

~~~
anoncake
Why? All you need is the rules that govern French writing. They clearly exist,
else humans couldn't read French either. It can't be harder than English where
the rules amount to "look it up in the dictionary".

------
faizshah
After using this, metaphone and other interesting similarity metrics for data
cleaning in OpenRefine, I was looking for a python library that had these
sorts of similarity metrics.

I found this quite interesting (and comprehensive) project that includes
regular string distance (fuzzy search) methods, fingerprinting methods and a
huge number of phonetic metrics including 7 different soundex methods:
[https://github.com/chrislit/abydos](https://github.com/chrislit/abydos)

A few others worth mentioning:

Scala:
[https://github.com/rockymadden/stringmetric](https://github.com/rockymadden/stringmetric)

Java/Scala:
[https://github.com/vickumar1981/stringdistance](https://github.com/vickumar1981/stringdistance)

Postgres:
[https://github.com/eulerto/pg_similarity](https://github.com/eulerto/pg_similarity)

R:
[https://github.com/markvanderloo/stringdist](https://github.com/markvanderloo/stringdist)

Clojure: [https://github.com/Yomguithereal/clj-
fuzzy](https://github.com/Yomguithereal/clj-fuzzy)

------
olsgaarddk
During my graduate degree, I was looking into phonetic search, and a lot of
the papers I stumbled across were using an expanded version of soundex called
Phonix[1] as their basis of new algorithms. However, I had a lot of trouble
finding an implementation I could use, so I implemented it in Python.

After that I build a search engine programme that would look up swear words
that were phonetically similar to input terms, as I thought that my fellow
students might get a laugh looking up their own names. In the end, neither
Phonix, double metaphone nor Soundex really produced any funny results.

plugs:

\- Blogpost: [http://olsgaard.dk/phonixpy-phonetic-name-search-in-
python.h...](http://olsgaard.dk/phonixpy-phonetic-name-search-in-python.html)

\- Github repo:
[https://github.com/olsgaard/phonetic_search](https://github.com/olsgaard/phonetic_search)

[1] Gadd, T. N. “‘Fisching Fore Werds’: Phonetic Retrieval of Written Text in
Information Systems.” Program 22, no. 3 (1988): 222–37.

------
francislavoie
I learned about this when digging through the PHP documentation. They have the
function built-in
[https://www.php.net/manual/en/function.soundex.php](https://www.php.net/manual/en/function.soundex.php)

In a similar vein, Levenshtein distance is also built-in which is very useful
for fuzzy searching
[https://www.php.net/manual/en/function.levenshtein.php](https://www.php.net/manual/en/function.levenshtein.php)

~~~
ehnto
I once built a search engine for web agencies that you could search through by
the technologies they worked with. All I did to make it fast was index all the
data for a company into a few blobs and then fuzzy search the text blobs with
Lucene, which used Levenshtein distances under the hood. It was lightening
fast, stupid simple and all hosted on the same VM.

------
jasim
Clipper had SOUNDEX function, which was used to search in flat-file databases
for descriptive names. SEEK SOUNDEX("Smith") and SEEK SOUNDEX("Smythe") would
return the same result.
[https://www.itlnet.net/programming/program/Reference/c53g01c...](https://www.itlnet.net/programming/program/Reference/c53g01c/nga0031.html)

But Harbour, an open-source implementation of Clipper superseded it with
Metaphone -
[https://harbour.github.io/doc/hbnf.html#ft_metaph](https://harbour.github.io/doc/hbnf.html#ft_metaph)

------
Raphaellll
I once trained unsupervised character embeddings and used levenshtein distance
with embedding cosine as character replacement weights. And it worked better
as a similarity metric than soundex

~~~
ahnick
Sounds interesting. Can you expand on what you mean by "with embedding cosine
as character replacement weights"?

~~~
Raphaellll
Edit distance allows for insertion, deletion and substitution. Some allow
custom cost for substitution (e.g. characters closer together on a keyboard
have lower substitution cost than other; for typo normalization). What I
describe in my comment is to use the cosine similarity between learned
character embeddings as substitution costs.

------
andy_ppp
I’m always annoyed that spellcheck/correction on iOS is really flawed in that
is doesn’t try to take a guess at phonetic matching so you can’t just guess at
how something is spelled phonetically you often have to have the first X
letters correct.

------
codemaniac
Here is a python implementation of the same by me -
[https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cd...](https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cdf66327a)

~~~
spc476
And I have an implementation in Lua/LPEG: [https://github.com/spc476/LPeg-
Parsers/blob/master/soundex.l...](https://github.com/spc476/LPeg-
Parsers/blob/master/soundex.lua)

------
edraferi
I’ve found phonetic similarly algorithms like this to be very useful for data
cleaning work, especially when your data generating process involves
transliteration. That is, if your users know Language A but your data involves
Language B, your users will record terms from Language B with highly varied
spellings that nonetheless sound very similar when pronounced. Phonetic
similarity algorithms help you cluster these spellings together by sound so
you can map them to a canonical term from Language B.

------
neilv
When I implemented Soundex, an extension I did was to also return alternative
hashes, for skipping of common surname prefixes. I didn't end up using it in
production, so I can't say whether it would've helped, in practice.

[https://www.neilvandyke.org/racket/soundex/#%28def._%28%28li...](https://www.neilvandyke.org/racket/soundex/#%28def._%28%28lib._soundex%2Fmain..rkt%29._soundex-
nara%2Fprefixing%29%29)

------
guruparan18
Interestingly, `soundex` is bundled along with the standard functions in most
commercial software. In one of my first search function I wrote, I used
`soundex` to run against previous search words and suggest a known search word
as 'did you mean?' to catch the spelling errors. This is around 2008-2010, and
was an easy way out of what Google search was doing at that time. It was quite
useful.

------
hkchad
This is used by several states to assign driver license numbers. [1]

[1] [https://www.community-insurance.com/learn-how-to-read-
your-d...](https://www.community-insurance.com/learn-how-to-read-your-drivers-
license-number/)

~~~
jimueller
I wonder why Wisconsin went through the trouble vs some incrementing id. It
seems like it was understood 2 people could end up with the same number so the
last two digits were added. That would rule out the benefit of unique ids
without requiring a single central system to ensure uniqueness.

It does make it harder to fake an id, perhaps that was the motivation.

Edit: Some more interesting info for the states that implement this. Also, TIL
Illinois drivers can appear to have the exact same DL number.
[http://www.highprogrammer.com/alan/numbers/dl_us_shared.html](http://www.highprogrammer.com/alan/numbers/dl_us_shared.html)

------
fortran77
It seems that something like this is in play in the Hacker News search. The
first results will be alike, but the rest are using some soundex-like
algorithm.

------
f2000
See
[https://en.wikipedia.org/wiki/Metaphone](https://en.wikipedia.org/wiki/Metaphone)

------
Yhippa
Used this in my last project. If you're curious, a lot of databases support
Soundex searches natively.

------
hammock
Why is this only for "names" and not words or even phrases in general?

------
knolax
Based off of whose pronounciation?

~~~
danso
It's the first sentence of the article:

> _Soundex is a phonetic algorithm for indexing names by sound, as pronounced
> in English._

~~~
TylerE
Yes, but whose pronunciation? Barbara, for instance, can be pronounced with
two syllables or three.

~~~
aidenn0
It works by favoring false positives over false negatives. In the specific
case of Barbara, it doesn't matter how many syllables, because non-initial
vowels are ignored.

