
How to Use UTF-8 with Python - joeyespo
http://www.evanjones.ca/python-utf8.html
======
grifaton
This was published in 2005. Unicode in Python has come a long way since then,
especially with respect to Python 3 [1], [2].

[1] <http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit>

[2] <http://docs.python.org/release/3.0.1/howto/unicode.html>

~~~
joeyespo
Python 2 is still the status quo though.

One of my favorite things about Python 3 is its new string handling. It is
fantastic and marginally more intuitive. Having an explicit (and more
general) immutable `bytes` class that no longer shares a base class with
`unicode`, and having Unicode instead of a potentially-encoded string as the
default, are all wonderful decisions that move Python forward by miles, in my
opinion.

Until Python 3 becomes the norm, we're still stuck with the confusing `str`
and `unicode` constructs.
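
To make the split concrete, here is a minimal Python 3 sketch (my own illustration, not taken from the linked article) showing that text and bytes are separate types that no longer mix implicitly. The variable names are made up for the example.

```python
# Python 3: str holds Unicode text, bytes holds raw 8-bit data.
text = "caf\u00e9"               # str: 4 code points (the last one is e-acute)
data = text.encode("utf-8")      # bytes: 5 bytes, e-acute takes two in UTF-8

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# The two types never compare equal and cannot be concatenated implicitly.
types_equal = (data == text)
try:
    text + data
except TypeError as exc:
    mixing_error = exc           # str + bytes raises TypeError

# Converting back to text is always an explicit decode.
round_tripped = data.decode("utf-8")
```

In Python 2 the same concatenation would have triggered a silent ASCII decode, which is where most of the confusion came from.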

------
js2
For decoding a byte string, I prefer `bytes.decode("utf8")`. Note that `decode`
takes an optional second argument, `errors`, that says what to do if an invalid
byte sequence is encountered - by default (`"strict"`) an exception is raised.

Also, I think <http://docs.python.org/howto/unicode.html> may be a better
reference than this article.
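
A minimal sketch of that second argument to `decode`, written for Python 3. The sample bytes and variable names are made up for illustration; the three handler names (`"strict"`, `"replace"`, `"ignore"`) are standard ones.

```python
# A byte string that is not valid UTF-8: 0xff can never appear in UTF-8.
raw = b"caf\xc3\xa9 \xff"

# Default handler ("strict"): invalid input raises UnicodeDecodeError.
try:
    raw.decode("utf8")
except UnicodeDecodeError as exc:
    strict_error = exc

# "replace" substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf8", "replace")

# "ignore" silently drops the bad byte.
ignored = raw.decode("utf8", "ignore")

print(repr(replaced))
print(repr(ignored))
```

`"strict"` is usually the right choice unless you know the input is dirty, since the other two lose data without telling you.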

------
oconnore

        $ python3
        Python 3.2.2 (default, Nov 21 2011, 05:01:42)
        [GCC 4.4.5] on linux2
        Type "help", "copyright", "credits" or "license" for more information.
        >>> print('hello world') # <-- Unicode!

------
sukhbir
To add to this, Kumar McMillan's talk, "Unicode In Python, Completely
Demystified" (<http://farmdev.com/talks/unicode/>) is really good.

------
jrockway
_First, you can place a UTF-8 byte-order marker at the beginning of your file,
if your editor supports it. Secondly, you can place the following special
comment in the first or second lines of your script:_

    # -*- coding: utf-8 -*-

Does Python actually parse comments, or is this just to get your editor to do
the right thing?

~~~
js2
_Python supports writing Unicode literals in any encoding, but you have to
declare the encoding being used. This is done by including a special comment
as either the first or second line of the source file:_

    # -*- coding: latin-1 -*-

_The syntax is inspired by Emacs’s notation for specifying variables local to
a file. Emacs supports many different variables, but Python only supports
‘coding’. The dash-splat-dash symbols indicate to Emacs that the comment is
special; they have no significance to Python but are a convention. Python
looks for coding: name or coding=name in the comment._

<http://docs.python.org/howto/unicode.html>
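
So yes, Python itself reads the comment. One way to see this is a small sketch using the standard library's `tokenize.detect_encoding` (Python 3), which looks for a BOM or a coding declaration in the first two lines of a source file. The sample source bytes below are made up for the example.

```python
import io
import tokenize

# Source bytes as they would sit on disk: the declaration says latin-1,
# and the string literal contains the single latin-1 byte 0xe9 (e-acute).
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

# detect_encoding takes a readline callable and reads at most two lines.
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)

# Decoding with the declared encoding recovers the intended text.
text = source.decode(encoding)
print(text)
```

The printed encoding is a normalized codec name, so it may not be spelled exactly as it appears in the comment.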

------
krosaen
Some useful helpers from Tornado's `escape.py`:

        def utf8(value):
            if isinstance(value, unicode):
                return value.encode("utf-8")
            assert isinstance(value, str)
            return value
    
        def _unicode(value):
            if isinstance(value, str):
                return value.decode("utf-8")
            assert isinstance(value, unicode)
            return value
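
The helpers above are Python 2 code (`unicode` does not exist in Python 3). A rough Python 3 counterpart might look like the sketch below. The names `to_bytes` and `to_text` are hypothetical and are not Tornado's API; the sketch also raises `TypeError` rather than using `assert`, so the check survives `python -O`.

```python
def to_bytes(value):
    """Return value as UTF-8 encoded bytes (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode("utf-8")
    if not isinstance(value, bytes):
        raise TypeError("expected str or bytes, got %r" % type(value))
    return value


def to_text(value):
    """Return value as str, decoding bytes as UTF-8 (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if not isinstance(value, str):
        raise TypeError("expected str or bytes, got %r" % type(value))
    return value


print(to_bytes("caf\u00e9"))
print(to_text(b"caf\xc3\xa9"))
```

Calling these at the edges of a program (on input and just before output) keeps everything in between as plain `str`.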

