Projecting Unicode to ASCII (johndcook.com)
85 points by chmaynard 67 days ago | 41 comments

If you limit yourself to Western European languages, you can just go with the stdlib:

    >>> import unicodedata
    >>> print( unicodedata.normalize('NFKD', "éèêàùçÇ").encode('ascii','ignore'))
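For what it's worth, here is a minimal sketch of why that one-liner works and where it stops working. The helper name `strip_accents` is made up for illustration; only `unicodedata.normalize` and `str.encode` come from the stdlib.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper, not part of the stdlib.
    # NFKD splits "é" into "e" + U+0301 (combining acute accent) ...
    decomposed = unicodedata.normalize("NFKD", text)
    # ... and errors="ignore" then drops every code point ASCII cannot represent.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("éèêàùçÇ"))       # eeeaucC
print(strip_accents("Søren Straße"))  # Sren Strae
```

Letters that have no decomposition, such as ø and ß, simply vanish, so this is only safe for characters that are a base letter plus an accent.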
But if you need to project (transliterate to ascii) Arabic, Russian or Chinese, unidecode is close to black magic:

    >>> from unidecode import unidecode
    >>> unidecode("北亰")
    'Bei Jing '
Anyway, always remember that str.encode(), bytes.decode(), open() and many other related callables have an "errors" parameter that lets you decide what happens to characters that can't be encoded or decoded:

    >>> print("Père Noël".encode("ascii", errors="ignore"))
    b'Pre Nol'
    >>> print("Père Noël".encode("ascii", errors="replace"))
    b'P?re No?l'
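A couple of the other built-in handlers, as a quick sketch. These handler names are standard Python 3 values for `errors`; the variable `name` is just for illustration.

```python
name = "Père Noël"

# Keep the information, but as ASCII escape sequences.
print(name.encode("ascii", errors="backslashreplace"))   # b'P\\xe8re No\\xebl'
print(name.encode("ascii", errors="xmlcharrefreplace"))  # b'P&#232;re No&#235;l'

# The default is errors="strict", which raises instead of guessing.
try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc.reason)  # ordinal not in range(128)
```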
I'll conclude with the mandatory "use Python 3" (3.7 if you can; it has many UTF-8 fixes: https://vstinner.github.io/posix-locale.html), since you'll be in a world of pain if you deal with non-ASCII text in Python 2, and its EOL is next year :) Tick, tock...

Unidecode maintainer here. I really need to add some commentary to that "Bei Jing" example in the README.

Unidecode doesn't do language-specific transliteration and really works best for user-invisible things, like database identifiers or normalization.

CJK characters in particular are very problematic, since they must be transliterated differently depending on the locale. Over the years I have received many angry mails from people who were deeply offended by a transliteration error they saw in a URL or something.

Unihandecode is a fork of Unidecode that tries to address this:


One of those angry users here -- e.g. making flash card decks for Japanese on Memrise where they generate URLs based on some choice set of Chinese transliterations for no discernible reason.

Thank you for maintaining this package. I wish I'd run across it sooner.

> if you need to project (transliterate to ascii) Chinese unidecode is close to black magic

That might be a dangerous assumption. The project page explicitly states:

> Transliteration of languages like Chinese is a very complex issue and this library does not even attempt to address it. It draws the line at context-free character-by-character mapping.

I.e. it is not black magic; just a mapping of Unicode characters to static ASCII transliterations. The results will certainly be incorrect in some contexts.

Perhaps for some applications, e.g. hash keys, consistency is more important than correctness.
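To make "context-free character-by-character mapping" concrete, here is a toy sketch. The two-entry table and the `project` name are hypothetical and are not Unidecode's actual data or API; the point is only that each character is looked up in isolation.

```python
# Toy table for illustration only; a real library ships far larger tables.
TABLE = str.maketrans({"銀": "Yin ", "行": "Xing "})

def project(text):
    # Each character is looked up on its own; neighbours are never consulted.
    return text.translate(TABLE)

print(project("行"))    # Xing
print(project("銀行"))  # Yin Xing  (same reading for 行, whatever the context)
```

The output is stable, which is what you want for keys and identifiers, but it cannot pick a different reading for a character based on the word it appears in.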

Just fixing the French here: it’s Père Noël, not Pére Noël.

Damn, how embarrassing.

Fredrik Lundh describes a DIY patch on top of unicodedata here: http://effbot.org/zone/unicode-convert.htm It solves the ä->ae problem and its ilk for Western European languages.

Time has devoured my comment on that post, which extended the transliteration table to Eastern European languages and proposed using mnemonic names like ord(u'\N{Latin capital letter AE}') instead of 0xc6.
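Roughly what I mean, as a sketch: an override table applied before the generic decomposition. This is not the effbot code, just the same idea; the `OVERRIDES` table and `to_ascii` name are made up here, and the entries are only a small sample.

```python
import unicodedata

# Hypothetical override table, keyed by code point and written with \N{...}
# names instead of hex literals such as 0xc6.
OVERRIDES = {
    ord("\N{LATIN CAPITAL LETTER AE}"): "AE",
    ord("\N{LATIN SMALL LETTER AE}"): "ae",
    ord("\N{LATIN SMALL LETTER O WITH STROKE}"): "oe",
    ord("\N{LATIN SMALL LETTER SHARP S}"): "ss",
    ord("\N{LATIN SMALL LETTER A WITH DIAERESIS}"): "ae",
}

def to_ascii(text):
    # Apply the hand-written exceptions first, then the generic fallback.
    text = text.translate(OVERRIDES)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

assert ord("\N{LATIN CAPITAL LETTER AE}") == 0xC6
print(to_ascii("Ærøskøbing, Straße, Mädchen, café"))
# AEroeskoebing, Strasse, Maedchen, cafe
```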

ä -> ae is a German transliteration that would not be recognized by a non-German (e.g. a Dutchman).

While a similar transliteration (ø-> oe) would be understood by Danes, it can create misunderstandings: køn (pretty) != koen (the cow), søn (son) != soen (the sow), røde (red) != roede (rowed), tør (dry) != toer (a two) and rør (pipe) != roer (rower).

Context is the missing piece in all of those examples, though. All languages are context-driven, English in particular.

Unidecode may handle Chinese fine, but it definitely handles Western European languages wrong.

DIN 5007 Var. 2 specifies that for the purposes of sorting, ö is replaced with oe. This also applies to ä (ae), ü (ue) and ß (ss). This same replacement rule is also used on passports and IDs.

Unidecode does not handle this correctly.
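If you only need the sorting rule, it is easy to sketch yourself. This assumes nothing beyond the four replacements listed above; the upper-case entries and the `din_5007_2_key` name are my own additions for illustration, not text from the standard.

```python
# Expansions for DIN 5007 variant 2 as described above.
# The capitalised entries are an assumption added so that names sort sensibly.
DIN_5007_2 = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def din_5007_2_key(word):
    # Hypothetical helper: expand, then compare case-insensitively.
    return word.translate(DIN_5007_2).casefold()

names = ["Müller", "Mueller", "Muller", "Mütze"]
print(sorted(names, key=din_5007_2_key))
# ['Müller', 'Mueller', 'Mütze', 'Muller']
```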

The unidecode author wrote about this:

> In German, there's the typographical convention that an umlaut (the double-dots on: ä ö ü) can be written as an "-e", like with "Schön" becoming "Schoen". But Unidecode doesn't do that-- I have Unidecode simply drop the umlaut accent and give back "Schon".

> (I chose this not because I'm a big meanie, but because generally changing "ü" to "ue" is disastrous for all text that's not in German. Finnish "Hyvää päivää" would turn into "Hyvaeae paeivaeae". And I discourage you from being yet another German who emails me, trying to impel me to consider a typographical nicety of German to be more important than all other languages.)

What do you mean by a Western European language? Germany is usually considered geographically part of Central Europe. Finnish (and Estonian) are of course not Germanic nor even Indo-European languages. But OTOH Swedish, a Germanic language, also uses umlauts and considers them separate letters in collation (Finnish collation rules are actually adopted from Swedish). The most 'correct' mapping of Swedish "Lära sig höra" would be "Lara sig hora". Is Swedish not a Western European language?

Unidecode is a projection into the ASCII plane, not your favourite language specific preferred transliteration of certain characters. As such, calling it "definitely [...] wrong" is somewhat overblown, in particular since the author of the library explicitly addresses it.

I agree with this -- simply dropping the umlaut is adequately clear, weird as it may look to me, just as dropping a circumflex is preferred to adding an 's' after it in French. Adding an e doesn't produce a reversible result (there are, for example, ordinary German words written with ae, oe, and ss).
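A small sketch of the reversibility point. Both function names are invented for this example, and the inverse is deliberately naive.

```python
def expand(word):
    # The German convention: umlaut becomes vowel + "e", ß becomes "ss".
    return (word.replace("ü", "ue").replace("ö", "oe")
                .replace("ä", "ae").replace("ß", "ss"))

def naive_inverse(word):
    # Tries to undo expand(), but cannot tell an original "ue" from an expanded "ü".
    return word.replace("ue", "ü").replace("oe", "ö").replace("ae", "ä")

for word in ["Müller", "aktuell", "Poesie"]:
    print(word, "->", expand(word), "->", naive_inverse(expand(word)))
# Müller -> Mueller -> Müller
# aktuell -> aktuell -> aktüll
# Poesie -> Poesie -> Pösie
```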

Every such projection is lossy and can be expected to produce an ambiguous result that must be interpreted through context. Hence the strange but comprehensible ,,Godel'' and ,,Malmo''. Cook is not claiming that his code replaces the need for Unicode!

(There is a minor linguistic irony here: the umlaut originated as scribes' shorthand, in which an E marking a vowel inflection was written as a tiny E above the letter, almost like a ligature, and eventually became a pair of dots. But that character was then used differently in other languages, much as a loanword usually changes its meaning in the new language.)

Did anyone else get the impression that this is a shallow post? I expected some detail about e.g. the issues of transliteration, but instead it presented a couple of facts about idempotency in UX.

By John's usual standards it is, yes, but it's still interesting IMO.

I've got to recommend PyICU here. unidecode is good, but has some holes (I noticed the Azeri letter schwa, for example). PyICU is a binding for IBM's International Components for Unicode, and basically has a coding language for Unicode transforms. Here's an example that is equivalent to unidecode:


Romanising Chinese is not as easy as unidecode would make it seem! "銀行" is pronounced "yin hang", but the second character "行" is "xing" when it's alone. This problem is made worse by the lack of spaces when writing Chinese.

Pingtype tries to solve all these problems. If there's a need for it to be ported to Python/etc then I'd be happy to do so!
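To illustrate the 行 example, here is a toy sketch of one common approach: try the longest dictionary word first and fall back to single characters. I am not claiming this is how Pingtype works; the two tiny dictionaries and the `romanise` name are hypothetical.

```python
# Tiny hypothetical lexicon; a real system needs a full dictionary.
WORDS = {"銀行": "yin hang"}
CHARS = {"銀": "yin", "行": "xing"}

def romanise(text):
    out = []
    i = 0
    longest = max(len(word) for word in WORDS)
    while i < len(text):
        # Greedy longest match: prefer a known multi-character word.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in WORDS:
                out.append(WORDS[chunk])
                i += size
                break
        else:
            # No word matched here, so fall back to the single-character reading.
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return " ".join(out)

print(romanise("銀行"))  # yin hang
print(romanise("行"))    # xing
```

Greedy matching still gets things wrong when the segmentation itself is ambiguous, which is exactly the no-spaces problem.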


Otherwise known as transliteration

We are actively working on Cyrillic-to-ASCII transliteration for glibc at this very moment. Please check this patch for details [1]. Your help and suggestions are welcome, to make sure this will be a useful and consistent fix when it lands. The bug [2] is from 2006 (sic!) and it is a reason why transliteration may not work with iconv on some systems/locales.

[1] https://sourceware.org/ml/libc-locales/2019-q1/msg00010.html

[2] https://sourceware.org/bugzilla/show_bug.cgi?id=2872

The formal definition of "project" here doesn't mesh well with the way it's commonly used in English. You don't double apply a projection simply because it doesn't make any sense to try. Think of a film projector or a Mercator map.
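As I read the formal definition, a projection is just an idempotent map: applying it twice gives the same result as applying it once. A quick check, assuming the stdlib NFKD approach mentioned elsewhere in the thread stands in for the projection (`project` is an illustrative name):

```python
import unicodedata

def project(text):
    # Stand-in projection: decompose, then drop whatever ASCII cannot hold.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

samples = ["Poincaré", "Gödel", "Malmö", "plain ascii"]
for s in samples:
    once = project(s)
    # Idempotency: a second application changes nothing.
    assert project(once) == once
    print(s, "->", once)
```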

Drupal 8 has a transliteration component (=independent of Drupal). It's ... not easy. And that's still a pretty weak implementation. ICU / PHP extension intl is a better one but in general, transliteration is just a bag of hurt.

I have used `uni2ascii'[0] for this, but I guess it has fewer features. [0] https://linux.die.net/man/1/uni2ascii

iconv is the standard utility for this (a standard unix/linux utility) and includes the projection of characters that aren't in the target character set.

Not sure how well it handles CJKV chars though.

It doesn't do anything useful with them:

    $ echo 北亰 | iconv -t ASCII//TRANSLIT

For Perl, https://metacpan.org/pod/Text::Unidecode has never let me down so far.

The python module is essentially a port of that

I use pastebot on the mac. It should be easy to add this as a "shell script filter", but sandboxing prevents me from getting unidecode. Has anyone else accomplished this?

What a useful library!

I use this transform chain for asciifying filenames (ridiculous as it is, in 2019 we still can't sync unicode filenames between different OSs):

    uconv -x ':: Any-Latin; :: Latin-ASCII; [:^ASCII:] > \_'

    :: Any-Latin;
Transliterates from non-Latin scripts to Latin script (e.g. "γραφὴν" --> "graphḕn").

    :: Latin-ASCII;
Tries to asciify characters as much as possible by discarding accents, splitting ligatures, replacing Unicode quotes with regular quotes, © --> (C), etc.

    [:^ASCII:] > \_
Replaces any remaining non-ASCII characters with an underscore.
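If you don't have uconv handy, the last two steps can be roughly approximated with the Python stdlib. This is only a sketch: NFKD plus mark stripping is a much cruder stand-in for Latin-ASCII, there is no equivalent of Any-Latin here, and `asciify_filename` is a made-up name.

```python
import re
import unicodedata

def asciify_filename(name):
    # Rough stand-in for Latin-ASCII: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Same idea as "[:^ASCII:] > \_": anything still non-ASCII becomes "_".
    return re.sub(r"[^\x00-\x7f]", "_", stripped)

print(asciify_filename("Résumé – naïve.txt"))  # Resume _ naive.txt
print(asciify_filename("γραφὴν.txt"))          # ______.txt
```

Without the Any-Latin step, non-Latin scripts collapse to underscores, which is why the full ICU chain is worth having.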

Would you care to chime in with your use case in our patch discussion? I am trying to argue that translit is most useful for asciifying filenames, among other things. Another actual user would help make the case. https://sourceware.org/ml/libc-locales/2019-q1/msg00014.html

Why not transform them into utf-8, say with iconv?

How would that help when e.g. copying files from a Linux server to macOS via Samba? Mind you, the filenames can contain NFC, NFD, or a mixture of the two.
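A short sketch of why re-encoding doesn't help: both forms are already valid UTF-8, they are just different code point sequences. The `same_name` helper is hypothetical.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "café.txt")
nfd = unicodedata.normalize("NFD", "café.txt")

print(nfc == nfd)           # False
print(nfc.encode("utf-8"))  # b'caf\xc3\xa9.txt'
print(nfd.encode("utf-8"))  # b'cafe\xcc\x81.txt'

def same_name(a, b):
    # Compare names only after bringing both to one normalization form.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_name(nfc, nfd))  # True
```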

Well you're talking about filenames, right? Not the content?

My understanding (which could be quite wrong) is that Windows prefers NFC and macOS prefers NFD (as do I, but I understand the NFC desire) but that the filesystems themselves do not do normalization. In which case, modulo delimiters, every filename in one is legit in the other.

My suggestion of utf8 was to make a byte-order and code point width invariant form that should be completely reversible.

But if that doesn't work, never mind.

What does "::" mean?

It’s part of uconv’s transform rule syntax:

> Each transform rule consists of two colons followed by a transform name.


That really is a problem of the search engine. Poincaré should be normalized and stemmed before being indexed and queried. (You wouldn't say "projected".) I wonder which engine failed to do that.
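A minimal sketch of what I mean by normalizing on both sides, using only the stdlib. The `fold` function and the toy index are made up for illustration, and there is no stemming here.

```python
import unicodedata

def fold(term):
    # Decompose, drop combining marks, then case-fold.
    decomposed = unicodedata.normalize("NFKD", term)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.casefold()

# Apply the same folding when indexing ...
index = {}
for doc_id, title in enumerate(["Poincaré conjecture", "Gödel numbering"]):
    for word in title.split():
        index.setdefault(fold(word), set()).add(doc_id)

# ... and when querying, so every spelling lands on the same key.
print(index.get(fold("Poincare")))  # {0}
print(index.get(fold("POINCARÉ")))  # {0}
```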

TFA is by the author of the "search engine", describing how they learned about having to do normalization.

I see, thanks. Now I know which search to avoid. This Unicode business should be the most trivial problem in building a search engine. What will he do with East Asian languages? Transliterate to English ASCII? Pretty sure he will skip them. At least there's no stemming there.
