
Python string formatting and UTF-8 problems workaround - bwitten
http://blog.endpoint.com/2015/07/python-string-formatting-and-utf-8.html
======
kalenx
Well... That, or you just use Python 3, which already supports UTF-8 as
default encoding for strings...

~~~
zokier
I'm pretty sure that is not true. I don't think that py3k strings even have a
default encoding per se.

~~~
maxerickson
Text mode file IO defaults to a platform specific encoding:

[https://docs.python.org/3/library/functions.html#open](https://docs.python.org/3/library/functions.html#open)

(which will probably be UTF-8 on recent systems that are not Windows)
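To check what your own platform reports, a quick Python 3 snippet (using only the stdlib `locale` module; the exact value printed varies by system):

```python
import locale

# open() with no explicit encoding uses the platform default, which is
# whatever locale.getpreferredencoding(False) reports -- often UTF-8 on
# recent Linux/macOS, historically cp1252 or similar on Windows.
default = locale.getpreferredencoding(False)
print(default)

# Passing encoding= explicitly sidesteps the platform dependence entirely:
# open("notes.txt", "w", encoding="utf-8")
```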

------
masklinn
"workaround" is the operative word here. Rather than fix the issue of their
program being unicode-unaware (despite getting unicode data from their XML API,
as they should), they just sprinkled .encode() calls until the issue
disappeared, then confabulated an explanation for it.

------
rspeer
Okay, someone with a blog just learned he had to use Unicode because not
everyone speaks English and not all text is ASCII.

Why is this an interesting link?

~~~
extc
It's not, and it's actually misleading: people who don't already know this
gain no concrete information, just an anecdote.

------
bmh_ca
For those interested in Python 2 string encoding, I've spent some time writing
about it, here:

[http://stackoverflow.com/a/6539952/19212](http://stackoverflow.com/a/6539952/19212)

------
adamtj
The fix isn't quite right. It may technically produce correct output _now_,
but it's sloppy. The sloppy code is brittle and dangerous and perfect food for
bugs, but that's a minor problem. After all, a single mistake or a small bit
of sloppy code can only cause a few bugs at most. The major problem is that
the sloppiness indicates a possible lack of understanding. A misunderstanding
can continue to produce bugs and brittle code indefinitely. Misunderstandings
are the devil!

The symptom is that the .encode() comes too early. The general principle is to
.decode() as early as possible and to .encode() as late as possible. The
results of .encode() should be as temporary as possible -- preferably never
even assigned to a variable.
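A minimal sketch of that principle (in Python 3 syntax, where the str/bytes distinction is explicit in the types):

```python
# Bytes as they might arrive from a file or socket -- "café" in UTF-8.
raw = b"caf\xc3\xa9"

text = raw.decode("utf-8")   # decode as early as possible
shouted = text.upper()       # every operation happens on the str

# Encode only at the output boundary, never keeping the result around:
print(shouted.encode("utf-8"))
```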

Seeing the encode in the wrong place leads me to suspect that the author is
confusing byte arrays and strings. These are two distinct things, but most
documentation makes that distinction clear as mud.

The key thing to realize is that strings are not bytes, and bytes are neither
characters nor strings. Think of strings as abstract data structures, like
hash tables or linked lists. Bytes are binary integers. On the surface, byte
arrays are integer arrays, not strings nor hash tables nor lists of objects.

Programs interface with the world via bytes. Files are bytes. The network is
bytes. Everything is bytes. Bytes are not strings. A byte is an 8-bit integer.
When you do I/O and get bytes from the world, you must deserialize them into
whatever abstract data structure they represent. Ignore the C language and
its misnamed "char" type. A string is an abstract data structure, as is a
hash table, or a list. In some sense even binary integers are abstract data
structures that need to be serialized. String serializations are called
"encodings". Binary integers are serialized by choosing a byte order (big or
little endian). There are various standard ways to serialize hash tables and
lists, like json, various XML formats, python's "pickle" and "shelve", etc.
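For illustration, the same idea in Python 3: an integer needs a byte order to become bytes, just as a dict needs a format like JSON (the stdlib `struct` and `json` modules here are just one way to do it):

```python
import json
import struct

# Serializing an integer means choosing a byte order.
n = 258  # 0x0102
assert struct.pack(">H", n) == b"\x01\x02"  # big endian
assert struct.pack("<H", n) == b"\x02\x01"  # little endian

# Serializing a dict: structure -> JSON text -> UTF-8 bytes.
payload = json.dumps({"count": n}).encode("utf-8")

# Deserializing reverses both steps: bytes -> text -> structure.
assert json.loads(payload.decode("utf-8")) == {"count": n}
```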

When you get bytes from the network or a file and those bytes are supposed to
represent a string, you must deserialize the bytes into a string object. This
is called _decoding_. Often you're using a web framework or other library that
does this for you. Python 3's file objects do it. If it's not done
automatically, then you must do it yourself. You or your framework should
decode bytes into a unicode string object as soon as possible. You should do
this everywhere that you do input, and then leave your strings as strings for
as long as possible. Do all of your operations on strings ("unicodes"), not
bytes. You parse strings, join strings, replace characters in strings, trim,
find lengths and match regexes on strings ("unicodes"). Doing any of those
operations on byte arrays is nonsensical and will lead to bugs. Only when you
have your final string completely ready to go should you worry about
serializing it for printing or to write to a file or the network. Only then,
at the last possible moment, should thoughts like utf-8 or ascii enter your
mind.
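One concrete way to see why those operations belong on strings rather than bytes (again Python 3 syntax): length and slicing on the encoded form count UTF-8 bytes, not characters:

```python
s = "naïve"
b = s.encode("utf-8")

# Five characters, but six bytes: "ï" takes two bytes in UTF-8.
print(len(s))   # 5
print(len(b))   # 6

# Slicing the bytes can split "ï" in half, producing invalid UTF-8;
# slicing the string always yields whole characters.
print(s[:3])    # naï
```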

As written, it's unclear whether "freqs" contains byte arrays or unicode
strings. Getting that mixed up can result in failing to find an item which
really is in the dict, or miscounting frequencies, or it can even cause more
UnicodeEncode/DecodeErrors. By decoding as early as possible and encoding as
late as possible, such sneaky bugs are much less likely to occur.

In Python 2, I would have fixed the problem like this:

    for e in results:
        simple_author = e['author'].split('(')[1][:-1].strip()
        if freqs.get(simple_author, 0) < 1:
            print ("%(date)s -- %(author)s -- %(title)s" % {
                'date':   parse(e['published']).strftime("%Y-%m-%d"),
                'author': simple_author,
                'title':  e['title'],
            }).encode('utf-8')

You might dislike my multi-line print and would prefer to .join() a list,
possibly with a temporary variable. Or, maybe you'd prefer the newer
.format(). Regardless, the important point is that the .encode() should happen
later than it does in the article.

