

The crazy world of stripping diacritics - gus_massa
http://blogs.msdn.com/b/oldnewthing/archive/2014/11/24/10575362.aspx

======
weinzierl
The article mentions one rationale for stripping diacritics and I won't deny
there are others. That being said:

Stripping diacritics from text will annoy people whose language uses those
diacritics. For us the difference between an e, an è and an é is significant.
An ü is something entirely different from a u. Calling me Muller when my name
is Müller is like calling someone Jan whose name is Jon.

Just as the article says: "But then again, removing diacritics is already
linguistically nonsensical. Nonsensical operation is nonsensical."

~~~
mercurial
For some languages (French) it can make sense to transliterate when sorting,
while it would be a terrible mistake in others (Danish).

~~~
ddebernardy
Mm, it actually never makes sense to transliterate for sorting. Sorting should
be based on collation rules.

In French, sorting takes diacritics into account, and how it does so
additionally depends on whether you're doing French French or Canadian French:

French:

cote, côte, coté, côté

Canadian:

cote, coté, côte, côté

[http://userguide.icu-project.org/collation/concepts](http://userguide.icu-project.org/collation/concepts)
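
A rough way to see why the two orderings differ: both treat accents as a secondary tiebreaker after the base letters, but one compares the accents left to right and the other right to left. The Python below is only a toy sketch of that idea, not real collation. `forward_key` and `backward_key` are made-up names, the accent weights are just code points, and anything serious should go through ICU with the proper locale data.

```python
import unicodedata


def _levels(word):
    """Split a word into base letters (primary) and per-letter accents (secondary)."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and primary:
            # Attach the combining mark to the preceding base letter.
            secondary[-1] += (ord(ch),)
        else:
            primary.append(ch.lower())
            secondary.append(())
    return primary, secondary


def forward_key(word):
    # Accent differences are compared from the start of the word.
    primary, secondary = _levels(word)
    return (primary, secondary)


def backward_key(word):
    # Accent differences are compared from the end of the word.
    primary, secondary = _levels(word)
    return (primary, secondary[::-1])


words = ["côté", "cote", "coté", "côte"]
forward_sorted = sorted(words, key=forward_key)
backward_sorted = sorted(words, key=backward_key)
```

With these four words, `forward_sorted` comes out as cote, coté, côte, côté and `backward_sorted` as cote, côte, coté, côté, i.e. the two lists above.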

------
ars
If you needed to do this in PHP (or any other language with the ICU
transliterator):

    transliterator_transliterate('Any-Latin; Latin-ASCII', 'Input string');

It's not exactly the same thing - it will convert letters into ASCII
characters that sort of sound right, not just strip diacritics.

It's possible to simply strip diacritics too, probably something like:

    'NFD; [:Nonspacing Mark:] Remove; NFC'
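
For comparison, the decompose-then-drop-the-marks approach can be sketched in Python with nothing but the standard `unicodedata` module. `strip_diacritics` is a name invented for this example, and the sketch only removes nonspacing combining marks; it does not transliterate anything.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks; letters without a decomposition are left alone."""
    # NFD splits e.g. "ü" into "u" followed by U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every nonspacing mark (general category "Mn").
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left.
    return unicodedata.normalize("NFC", kept)
```

Note that `strip_diacritics("Müller")` gives `"Muller"`, but `strip_diacritics("Søren")` is unchanged, because "ø" is a letter in its own right with no decomposition. That is the practical difference from the Latin-ASCII transliteration.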

------
bkeroack
This is a pretty bad idea (and English-centric). Stripping diacritics can
fundamentally alter a word's meaning.

A small example (Portuguese):

    
    
      país  ("country")
      pais  ("fathers")

~~~
itsybitsycoder
The goal of the exercise is to use the stripped text to check for spam. That
doesn't mean that the stripped text is what ends up in the user's inbox. The
idea is to see if the text contains things like ＶᎥÄｇԻａ, not whether a given
word means "country" or "father". This would only be a problem if some set of
characters that looked like "viagra" was actually a valid non-spammy word in
some language.
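
A sketch of what such a check could look like in Python. Compatibility normalization (NFKD) takes care of the fullwidth letters and the Ä, but lookalikes from other scripts need a mapping table. The tiny `CONFUSABLES` dict here is hand-written for this one example and purely hypothetical; a real filter would presumably start from the Unicode confusables data instead. `fold_for_matching` is likewise just a made-up name.

```python
import unicodedata

# Hypothetical, hand-picked lookalike table for illustration only.
CONFUSABLES = {
    "Ꭵ": "i",  # Cherokee letter that resembles a Latin "i"
    "Ի": "r",  # Armenian letter that resembles a Latin "r"
}


def fold_for_matching(text):
    """Fold text to a plain form used only for matching, never for display."""
    # NFKD turns fullwidth letters into ASCII and splits off accents.
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accents
        out.append(CONFUSABLES.get(ch, ch))
    return "".join(out).casefold()


def looks_like(text, banned_word):
    # The original text is left untouched; only the folded copy is searched.
    return banned_word in fold_for_matching(text)
```

With this table, `looks_like("ＶᎥÄｇԻａ", "viagra")` is true, and the message itself is never modified.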

There was also an article here a few weeks back about Russian government
officials securing fat contracts for their friends in private industry by
intentionally replacing one or two letters in the common search terms of their
bid requests with Latin lookalike characters. This was impossible to detect
while reading the bid request, but also made it impossible for other
contractors to find the bid request through the website. As a result only
their buddy would submit a bid, and at much higher than market rate. Something
similar to this, but converting down to Cyrillic characters instead of ASCII,
could be used to check for hinky bid requests on upload.
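
A simpler variant of that upload check, sketched in Python: instead of converting anything, flag words that mix scripts. Python's standard library has no script property, so this guesses the script from the Unicode character name, which is a heuristic and an assumption of this sketch. The function names are invented for the example.

```python
import unicodedata

SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def scripts_in(word):
    """Return the set of scripts (guessed from character names) used in a word."""
    found = set()
    for ch in word:
        if not ch.isalpha():
            continue  # ignore digits and punctuation
        name_parts = unicodedata.name(ch, "").split()
        for script in SCRIPTS:
            if script in name_parts:
                found.add(script)
    return found


def mixed_script_words(text):
    """List the whitespace-separated words that use more than one script."""
    return [word for word in text.split() if len(scripts_in(word)) > 1]
```

A Cyrillic word with one Latin "o" swapped in gets flagged, while the all-Cyrillic spelling and ordinary Latin words pass.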

------
BerislavLopac
I'll just leave this here...
[https://pypi.python.org/pypi/Unidecode](https://pypi.python.org/pypi/Unidecode)

