
Projecting Unicode to ASCII - chmaynard
https://www.johndcook.com/blog/2019/01/09/projecting-unicode-to-ascii/
======
sametmax
If you limit yourself to the West Europe languages, you can just go with the
stdlib:

    
    
        >>> import unicodedata
        >>> print(unicodedata.normalize('NFKD', "éèêàùçÇ").encode('ascii', 'ignore').decode('ascii'))
        eeeaucC
    

But if you need to project (transliterate to ascii) Arabic, Russian or
Chinese, unidecode is close to black magic:

    
    
        >>> from unidecode import unidecode
        >>> unidecode("北亰")
        'Bei Jing '
    

Anyway, always remember that str.encode(), bytes.decode(), open() and many other
related callables have an "errors" parameter that lets you deal with
characters that can't be encoded or decoded:

    
    
        >>> print("Père Noël".encode("ascii", errors="ignore"))
        b'Pre Nol'
        >>> print("Père Noël".encode("ascii", errors="replace"))
        b'P?re No?l'
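
For reference, here is a small sketch of the other built-in error handlers, run on the same string. The handler names are the standard ones from the codecs module; the `results` dict is just a name chosen for this illustration:

```python
# "Père Noël", written with escapes so the string is unambiguously in NFC form.
text = "P\u00e8re No\u00ebl"

# Error handlers that str.encode() accepts out of the box.
handlers = ["ignore", "replace", "xmlcharrefreplace", "backslashreplace", "namereplace"]

# Encode the same text to ASCII once per handler and keep the results.
results = {name: text.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print(f"{name:18} {encoded!r}")
```

The last three handlers keep enough information to recover the original character, which "ignore" and "replace" do not.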
    

I'll conclude with the mandatory "use Python 3" (3.7 if you can, it has many
utf8 fixes:
[https://vstinner.github.io/posix-locale.html](https://vstinner.github.io/posix-locale.html)),
since you'll be in a world of pain if you deal with non-ascii in Python 2, and
EOL is next year :) Tick, tock...

~~~
kuschku
Unidecode may handle Chinese fine, but it definitely handles Western European
languages wrong.

DIN 5007 Var. 2 specifies that for the purposes of sorting, ö is replaced with
oe. This also applies to ä (ae), ü (ue) and ß (ss). This same replacement rule
is also used on passports and IDs.

Unidecode does not handle this correctly.

The unidecode author wrote about this:

> In German, there's the typographical convention that an umlaut (the double-
> dots on: ä ö ü) can be written as an "-e", like with "Schön" becoming
> "Schoen". But Unidecode doesn't do that-- I have Unidecode simply drop the
> umlaut accent and give back "Schon".

> (I chose this not because I'm a big meanie, but because generally changing
> "ü" to "ue" is disastrous for all text that's not in German. Finnish "Hyvää
> päivää" would turn into "Hyvaeae paeivaeae". And I discourage you from being
> yet another German who emails me, trying to impel me to consider a
> typographical nicety of German to be more important than all other
> languages.)
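
If German-style output is what you need, one option is to apply the DIN 5007 Var. 2 replacements yourself before any generic accent stripping. A minimal stdlib sketch, assuming the text is known to be German; `GERMAN_MAP` and `asciify_german` are names made up for this example and are not part of unidecode, and the capitalised replacements (Ä to Ae, etc.) are my assumption:

```python
import unicodedata

# DIN 5007 Var. 2 style replacements: ä -> ae, ö -> oe, ü -> ue, ß -> ss.
# The upper-case entries are an assumption added for this sketch.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def asciify_german(text):
    """Expand German umlauts and ß, then drop any remaining accents."""
    # Compose first so that "o" + combining diaeresis is seen as a single "ö".
    composed = unicodedata.normalize("NFC", text)
    expanded = composed.translate(GERMAN_MAP)
    # Generic fallback for everything else: decompose and drop non-ASCII.
    decomposed = unicodedata.normalize("NFKD", expanded)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(asciify_german("Sch\u00f6n"))    # Schoen
print(asciify_german("Stra\u00dfe"))   # Strasse
```

Run on the Finnish example from the quote, this produces exactly the "Hyvaeae paeivaeae" the author warns about, so it only makes sense when you know the language up front.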

~~~
vesinisa
What do you mean by a Western European language? Germany is usually considered
geographically part of Central Europe. Finnish (and Estonian) are of course
not Germanic nor even Indo-European languages. But OTOH Swedish, a Germanic
language, also uses umlauts and considers them separate letters in collation
(Finnish collation rules are actually adopted from Swedish). The most
'correct' mapping of Swedish "Lära sig höra" would be "Lara sig hora". Is
Swedish not a Western European language?

------
CapacitorSet
Did anyone else get the impression that this is a shallow post? I expected
some detail about, e.g., the issues of transliteration, but instead it presented
a couple of facts about idempotency in UX.

~~~
de_Selby
By John's usual standards it is yes, but it's still interesting IMO

------
pudo
I've got to recommend PyICU here. unidecode is good, but has some holes (I
noticed the Azeri letter schwa, for example). PyICU is a binding for IBM's
International Components for Unicode, and basically has a coding language for
unicode transforms. Here's an example that is equivalent to unidecode:

[https://github.com/pudo/normality/blob/master/normality/tran...](https://github.com/pudo/normality/blob/master/normality/transliteration.py#L31)

------
peterburkimsher
Romanising Chinese is not as easy as unidecode would make it seem! "銀行" is
pronounced "yin hang", but the second character "行" is "xing" when it's alone.
This problem is made worse by the lack of spaces when writing Chinese.
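
To make the problem concrete, here is a toy sketch of dictionary-based romanisation with greedy longest-match lookup. This is not how Pingtype or unidecode work; the `LEXICON` entries are a hypothetical three-entry table using only the readings mentioned above, and `romanise` is a name invented for the example:

```python
# Hypothetical toy lexicon; a real one would hold many thousands of entries.
LEXICON = {
    "銀行": "yin hang",  # multi-character word takes priority
    "銀": "yin",
    "行": "xing",        # reading when the character stands alone
}


def romanise(text, lexicon=LEXICON):
    """Romanise text by always taking the longest lexicon match at each position."""
    max_len = max(len(key) for key in lexicon)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in lexicon:
                pieces.append(lexicon[chunk])
                i += size
                break
        else:
            # Unknown character: pass it through unchanged.
            pieces.append(text[i])
            i += 1
    return " ".join(pieces)


print(romanise("銀行"))  # yin hang
print(romanise("行"))    # xing
```

A per-character lookup would have to pick one reading for "行" regardless of context. Longest-match handles this case, but it can still segment a sentence wrongly when words overlap, which is where the lack of spaces hurts.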

Pingtype tries to solve all these problems. If there's a need for it to be
ported to Python/etc then I'd be happy to do so!

[https://pingtype.github.io](https://pingtype.github.io)

------
jamiethompson
Otherwise known as transliteration

------
umlautae
We are actively working on Cyrillic-ASCII transliteration for glibc at this
very moment. Please check this patch for details [1]. Your help and
suggestions are welcome to make sure this will be a useful and consistent fix
when it lands. The bug [2] is from 2006 (sic!) and it is one reason why
transliteration may not work with iconv on some systems/locales.

[1] [https://sourceware.org/ml/libc-locales/2019-q1/msg00010.html](https://sourceware.org/ml/libc-locales/2019-q1/msg00010.html)
[2] [https://sourceware.org/bugzilla/show_bug.cgi?id=2872](https://sourceware.org/bugzilla/show_bug.cgi?id=2872)

------
mark-r
The formal definition of "project" here doesn't mesh well with the way it's
commonly used in English. You don't double apply a projection simply because
it doesn't make any sense to try. Think of a film projector or a Mercator map.
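
For what it's worth, the formal property at stake is idempotence: applying the map a second time changes nothing. A quick sketch using the stdlib approach from upthread (`to_ascii` is a name invented here, not something from the article):

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


once = to_ascii("Poincar\u00e9")  # "Poincaré"
twice = to_ascii(once)

print(once)           # Poincare
print(once == twice)  # True: the output is already ASCII, so nothing changes
```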

------
chx
Drupal 8 has a transliteration component (=independent of Drupal). It's ...
not easy. And that's still a pretty weak implementation. ICU / PHP extension
intl is a better one but in general, transliteration is just a bag of hurt.

------
sorisos
I have used `uni2ascii` [0] for this, but I guess it has fewer features.
[0] [https://linux.die.net/man/1/uni2ascii](https://linux.die.net/man/1/uni2ascii)

------
gumby
iconv is the standard utility for this (a standard unix/linux utility) and
includes the projection of characters that aren't in the target character set.

Not sure how well it handles CJKV chars though.

~~~
jwilk
It doesn't do anything useful with them:

    
    
      $ echo 北亰 | iconv -t ASCII//TRANSLIT
      ??

------
perlgeek
For Perl,
[https://metacpan.org/pod/Text::Unidecode](https://metacpan.org/pod/Text::Unidecode)
has never let me down so far.

~~~
otherflavors
The python module is essentially a port of that

------
samf
I use pastebot on the mac. It should be easy to add this as a "shell script
filter", but sandboxing prevents me from getting unidecode. Has anyone else
accomplished this?

------
cannedslime
What a useful library!

------
aaaaaaaaaab
I use this transform chain for asciifying filenames (ridiculous as it is, in
2019 we still can't sync unicode filenames between different OSs):

    
    
        uconv -x ':: Any-Latin; :: Latin-ASCII; [:^ASCII:] > \_'
    

Explanation:

    
    
        Any-Latin
    

Transliterates from non-latin scripts to latin script (e.g. "γραφὴν" -->
"graphḕn").

    
    
        Latin-ASCII
    

Tries to asciify characters as much as possible by discarding accents,
splitting ligatures, replacing Unicode quotes with regular quotes, © --> (C),
etc.

    
    
        [:^ASCII:] > \_
    

Replaces any remaining non-ASCII characters with an underscore.
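
If ICU isn't available, the last two steps can be roughly approximated with the Python stdlib. This is only a sketch under that assumption: it does not do the Any-Latin transliteration at all, it covers far fewer cases than Latin-ASCII (no "(C)" for ©, for instance), and `asciify_filename` is a name made up for the example:

```python
import re
import unicodedata


def asciify_filename(name):
    """Strip accents where possible, then replace leftover non-ASCII with '_'."""
    # Decompose so accents become separate combining characters.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII becomes an underscore.
    return re.sub(r"[^\x00-\x7f]", "_", stripped)


print(asciify_filename("P\u00e8re No\u00ebl.txt"))  # Pere Noel.txt
print(asciify_filename("北亰.txt"))                  # __.txt
```

Because it normalises first, NFC and NFD spellings of the same name come out identical.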

~~~
gumby
Why not transform them into utf-8, say with iconv?

~~~
aaaaaaaaaab
How would that help when e.g. copying files from a Linux server to macOS via
Samba? Mind you, the filenames can be in NFC, NFD or a mixture of the two.
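
A short stdlib sketch of why this bites: the same visible filename has two different UTF-8 byte sequences depending on the normalization form, so two systems can both be "using UTF-8" and still disagree. The filename is an arbitrary example:

```python
import unicodedata

name = "No\u00ebl.txt"  # "Noël.txt" with a precomposed ë

nfc = unicodedata.normalize("NFC", name)  # ë stays one code point
nfd = unicodedata.normalize("NFD", name)  # ë becomes e + combining diaeresis

print(nfc == nfd)           # False
print(len(nfc), len(nfd))   # 8 9
print(nfc.encode("utf-8"))  # b'No\xc3\xabl.txt'
print(nfd.encode("utf-8"))  # b'Noe\xcc\x88l.txt'

# Normalizing both sides to one form before comparing makes them match.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```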

~~~
gumby
Well you're talking about filenames, right? Not the content?

My understanding (which could be quite wrong) is that Windows prefers NFC and
macOS prefers NFD (as do I but I understand the NFC desire) but that the
filesystems themselves do not do normalization. In which case, modulo
delimiters, every filename in one is legit in the other.

My suggestion of utf8 was to make a byte-order and code point width invariant
form that should be completely reversible.

But if that doesn't work, never mind.

------
rurban
That really is a problem of the search engine. Poincaré should be normalized
and stemmed before being indexed and queried. (You don't say projected).
Wonder which engine failed to do that.

~~~
yorwba
TFA is by the author of the "search engine", describing how they learned about
having to do normalization.

~~~
rurban
I see, thanks. Now I know which search to avoid. This unicode business should
be the most trivial problem in building a search engine. What will he do with
East-Asian languages? Transliterate to English ASCII? Pretty sure he will skip
them. At least there's no stemming there.

