

Unicode In Python, Completely Demystified - coderdude
http://farmdev.com/talks/unicode/

======
ludwigvan
The main trick to remember is that to get UTF-8, you have to encode, not
decode. When I didn't know how Unicode worked that well, I mistakenly assumed
Unicode was coded, so I had to decode it. Internally, Unicode is of course
coded (that is, represented by some encoding), but you don't have to know
which one. As the slides say, just encode to UTF-8 when you print text out,
display it in a web browser, etc., and you will be fine.
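
A minimal Python 2 sketch of that rule (decode bytes on the way in, encode
unicode on the way out); the byte string here is just the UTF-8 encoding of
"über":

    raw = '\xc3\xbcber'         # UTF-8 bytes, e.g. read from a file or socket
    text = raw.decode('utf-8')  # -> u'\xfcber', a unicode object
    out = text.encode('utf-8')  # back to UTF-8 bytes for output
    assert out == raw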

One other Unicode library I can recommend is icu (PyICU), Python bindings to
IBM's ICU library. It solved some Turkish-specific problems I had. (The
Turkish alphabet has I, İ, ı, and i, which makes upper/lower case conversion
tricky. For example, mayıs becomes MAYIS when uppercased, while ENGLISH
becomes englısh when lowercased.)
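
A hedged sketch of that locale-aware case mapping (Python 3, assuming PyICU
is installed; its UnicodeString wraps ICU's toUpper/toLower, which accept a
locale):

    import icu

    tr = icu.Locale('tr_TR')
    print('ENGLISH'.lower())                              # english (default mapping)
    print(str(icu.UnicodeString('ENGLISH').toLower(tr)))  # englısh (dotless i)
    print(str(icu.UnicodeString('mayıs').toUpper(tr)))    # MAYIS (dotless I)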

Read <http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I>,
<http://www.joelonsoftware.com/articles/Unicode.html> and
<http://www.codinghorror.com/blog/2008/03/whats-wrong-with-turkey.html> for
more info.

Another issue, if you are doing web dev, is how to represent non-ASCII
characters in URLs. One choice is URL encoding; the other is slugifying the
URL, that is, choosing an ASCII equivalent for each non-ASCII character.
Django, for example, has a slugify function that helps with this. It
converts, for example, über to uber.
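
A quick sketch with Django's slugify (imported here via the template
defaultfilters module; characters with no ASCII decomposition are simply
dropped):

    from django.template.defaultfilters import slugify

    print(slugify(u'über'))         # uber (NFKD-decomposed, non-ASCII stripped)
    print(slugify(u'hello world'))  # hello-world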

~~~
zepolen
>It converts, for example, über to uber.

What does it convert 'ぬびばざべ' into?

~~~
admp
ICU's transliterator is actually fairly sophisticated (and configurable).
Take a look at the examples:

    
    
      キャンパス -> kyanpasu
      Αλφαβητικός Κατάλογος -> Alphabētikós Katálogos
      биологическом -> biologichyeskom
    

From: <http://userguide.icu-project.org/transforms/general>
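
A short PyICU sketch of the same transform; "Any-Latin" is a standard ICU
transform ID, and the output shown is from ICU's documented examples:

    import icu

    trans = icu.Transliterator.createInstance(
        'Any-Latin', icu.UTransDirection.FORWARD)
    print(trans.transliterate('キャンパス'))  # kyanpasu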

Edit: on rereading, the OP was talking about Django's transliteration, which
is much simpler.

~~~
ludwigvan
Yes, the Django slugifier turns ぬびばざべ into an empty string. Maybe that's
because my language is set to tr-tr; I am not sure. One possible downside
with ICU is that, afaik, you can't run it on Google App Engine.

------
beaumartinez
Dive Into Python 3 does a very good job of explaining what a character is,
what a byte is, the difference between them, and why it is important
(particularly for Python 3). <http://diveintopython3.org/strings.html>

Spolsky's article on character encoding is equally good.
<http://www.joelonsoftware.com/articles/Unicode.html>

I recommend both to every programmer.

------
est
The main problem with Python encoding is that cmd.exe on Windows does not
support certain ASCII characters above 127, so the "ordinal not in
range(128)" error is a pretty common Python encoding error. But Python
handles such characters fine internally. The real devil in the Python
encoding myth is `print`.

BTW, using the `mbcs` encoding on Windows is more compatible than ASCII in
most cases.
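
A hedged Python 2 sketch of the usual workaround: encode explicitly for
whatever encoding the console reports, instead of letting `print` fall back
to the ASCII codec (note the `mbcs` codec exists only on Windows):

    # -*- coding: utf-8 -*-
    import sys

    text = u'caf\xe9'                          # contains a character above 127
    encoding = sys.stdout.encoding or 'mbcs'   # cmd.exe is often cp437 or cp850
    print text.encode(encoding, 'replace')     # 'replace' avoids UnicodeEncodeError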

~~~
adamtj
There are no ASCII characters > 127. ASCII is a 7-bit encoding. That's why
you run into cp1252 all the time if you work with Western European languages,
such as English. cp1252 is an 8-bit superset of ASCII designed for Western
European languages. I believe it's true that the first 256 unicode code points
are actually cp1252, which implies that the first 128 code points correspond
with ASCII.

~~~
beaumartinez
Windows-1252 (cp1252) is similar to ISO-8859-1 (Latin-1), which _is_ a
subset of Unicode; both are supersets of ASCII. Windows-1252, however, maps a
few codepoints differently: the 0x80-0x9F range, which Latin-1 reserves for
C1 control codes, holds printable characters in Windows-1252.
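
A one-byte illustration of the difference (Python 3):

    # 0x80 is the euro sign in cp1252 but a C1 control code in latin-1
    print(b'\x80'.decode('cp1252'))         # €
    print(repr(b'\x80'.decode('latin-1')))  # '\x80'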

~~~
adamtj
That's right, thank you.

------
gorset
CPython doesn't actually use Unicode code-point strings. Instead of a
sequence of codepoints, you get either UCS-2 or UCS-4 strings, depending on
compile-time options.

For example, with a freshly installed python3.2 from MacPorts:

    
    
        >>> len(chr(119074))
        2
        >>> chr(119074)
        '𝄢'
        >>> print(chr(119074))
        𝄢
    

(Surrogates are so much fun).

Python 2.7.1 compiled with UCS4:

    
    
        >>> len(unichr(119074))
        1
        >>> unichr(119074)
        u'\U0001d122'
        >>> print unichr(119074)
        𝄢
    

Notice that the capital `\U` escape takes 8 hexadecimal digits instead of 4.
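
A quick way to tell the two builds apart (applies to CPython before PEP 393,
i.e. up to 3.2):

    import sys

    # 0xFFFF on a narrow (UCS-2) build, 0x10FFFF on a wide (UCS-4) build
    print(hex(sys.maxunicode))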

------
joeyh
So, that sounds fairly similar to "unicode in $arbitrary_language": Unicode
is supported internally; there is still complexity associated with getting
data decoded/encoded when bringing it in/out; new versions of the language
are working to reduce the complexity but aren't quite there yet; handling of
Unicode by libraries tends to be inconsistent and probably undocumented.

Someone should do a comparative analysis across languages. My guess is that
(in-browser) javascript is probably one of the few languages to not have
significant unicode problems, although there are surely some.

------
crccheck
If you have a hard time reading this, disable JavaScript.

~~~
yuvadam
If you have a hard time reading this, just press the left and right arrow
keys to move through the slides.

~~~
admp
Pretty hard when all you have is a touch screen. :-)

------
BoppreH
What baffles me is that we are trying to convince programmers to support
Unicode while many programs and websites won't even accept special symbols in
passwords.

I mean, how dumb does a system have to be to reject the @ character in a
password field? Or, worse yet, to accept only _numbers_, as my university
does?

I think we did something terribly wrong somewhere in the char set conventions.

------
julian37
Appears to be mostly good information, though I'm surprised he paints UTF-16
in such a positive light ("optimized for languages residing in the 2 byte
character range."). UTF-16 should be considered harmful:

<http://benlynn.blogspot.com/2011/02/utf-8-good-utf-16-bad.html>

<http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375>

