
Codec Confusion in Python - easonchan42
http://lucumr.pocoo.org/2012/8/11/codec-confusion/
======
loqi
That's the most unfortunate Python 3 change I've seen. I use byte codecs like
hex, zlib, and base64 quite a bit more than text codecs. In Python 2, a
programmer with forward-compatible habits can write

    
    
      from __future__ import unicode_literals
      from io import open
    

with the understanding that migration to Python 3 will remove that
boilerplate. But taking a similar approach for byte codecs requires knowing
the right module name (instead of the encoding name) and the names of the
corresponding encode and decode functions (instead of just encode and
decode). So we've got

    
    
      .encode('base64') -> import base64; base64.b64encode()
      .encode('zlib')   -> import zlib; zlib.compress()
      .encode('hex')    -> ?
    

and unlike the text boilerplate, it's a permanent uglification. I don't know
of an idiomatic replacement for the last one off the top of my head. Hopefully
it's something nicer and more symmetrical than

    
    
      ''.join(map('{:02x}'.format, foo)).encode('ascii')
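
Perhaps binascii fills that last slot, at the cost of yet another module
name to remember:

      import binascii

      binascii.hexlify(b'foo')       # b'666f6f', the .encode('hex') stand-in
      binascii.unhexlify(b'666f6f')  # b'foo'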

~~~
jrockway
Explicit is better than implicit. If you want to call base64.b64encode on a
piece of data, do that.

~~~
DrJosiah
It's less convenient, it's a different API for every type of transformation,
and the change has made code demonstrably worse.

Further, the use of encode/decode is explicit. The only thing that was
implicit was the automatic transformation of string -> unicode when people
mistakenly used unicode codecs on string objects, or the reverse for string
codecs on unicode objects. The proper answer to both of these is to just not
do automatic type conversion... which is what was done in Python 3.

So actually, had we left all of the codec machinery intact, those codec errors
described by Armin wouldn't ever occur again! Instead, you'd get a TypeError
caused by passing the wrong type of object to the underlying encoder/decoder.
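
For illustration, a sketch using the bytes-to-bytes codecs that later
Python 3 releases expose through codecs.encode and codecs.decode:

      import codecs

      print(codecs.encode(b'data', 'hex_codec'))  # b'64617461'

      try:
          codecs.encode('data', 'hex_codec')      # a str, not bytes
      except TypeError as e:
          print(e)                                # wrong type, caught up front

That's exactly the TypeError described, with no silent re-encoding.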

~~~
jrockway
The API is nice, but it is difficult to maintain. To get encoders/decoders
into the string class in the first place, you have to maintain a global
registry. (I suppose you could pass them all to the constructor of the string
object, but nobody's going to do that.) The global codec registry leads to
naming conflicts. If you import a module that globally adds a "foo" encoder,
then you import another module that globally adds a "foo" encoder, now what?
Both modules break because of their dependency on the global name "foo".
Because of the details of the codecs.register implementation, you can't even
catch the conflict at registration time and refuse to load the second module;
you simply have to wait until your program returns subtly incorrect results.
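
A minimal sketch of that failure mode (the "foo" codec is of course made
up): codec search functions are tried in registration order and the first
match wins, so the second registration is silently shadowed:

      import codecs

      def make_codec(name, transform):
          # Build a search function that answers only for `name`.
          def search(encoding):
              if encoding == name:
                  return codecs.CodecInfo(
                      name=name,
                      encode=lambda s, errors='strict': (transform(s), len(s)),
                      decode=lambda s, errors='strict': (transform(s), len(s)),
                  )
          return search

      codecs.register(make_codec('foo', str.upper))  # module A
      codecs.register(make_codec('foo', str.lower))  # module B, shadowed

      print(codecs.encode('Hi', 'foo'))              # HI, never hi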

Compare this to the scheme where you import codec modules explicitly and just
call their functions. Your imports are lexically-scoped, and if you happen to
need two encoders that use the same name, you can just alias one of them at
import time. This strategy can't introduce unexpected errors as your program
grows larger, because the side effects are constrained to one module. It
either works now and will always work, or doesn't work and fails quickly while
you are developing.
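
For instance (modulea and moduleb being hypothetical stand-ins for the two
clashing packages):

      from modulea import encode                 # keeps its name
      from moduleb import encode as encode_b     # aliased at import time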

Ultimately, people use Python because they want a bit of discipline in their
lightweight language. This isn't JavaScript or PHP, after all :)

~~~
DrJosiah
Codec registration has never been an issue. Let me repeat that with some
emphasis, because it's an important point. Codec registration has NEVER BEEN
AN ISSUE.

And global registries are not inherently a bad thing. If you were to say "I
don't want a json/pickle/messagepack decoder built into the codecs module by
default", I would agree - because it's not a string/unicide <-> string/unicode
transformation. But it wouldn't bother me for someone to add that support in
their stuff, because data.decode('json') is terribly convenient. Arguably
better than peppering your code with the following (or loading it in a shared
space, or injecting it into __builtins__, ...):

    
      try:
          import simplejson as json
      except ImportError:
          import json
    

But ultimately we are adults. If _you_ would prefer to import a cluster of
modules just to convert your strings to hex or compress your string with zlib,
you are free to do so. It's just unfortunate that due to _misunderstandings_
about the fundamental underlying problem (TypeErrors), functionality was
_removed_.

------
csense
I've glanced at internationalization APIs at various times over the years,
and I've never understood them.

You have encodings, Unicode, ASCII, UTF-8, ISO 9660, Latin-1, code pages,
UTF-16, byte order marks, gettext macros, po files, ... the terminology and
model of the problem domain are extremely complex and difficult to understand.

Every time I've dealt with internationalization it's been in the context of it
causing strange problems and issues.

For example, one time I downloaded some tarball (I forget what it was) that
had a few bytes of binary garbage at the beginning of every file. After some
research I found out that it's called the BOM and has something to do with
international text, and I ended up having to WRITE A SCRIPT WHICH GOES THROUGH
AND DELETES THE FIRST FEW BYTES OF EVERY FILE IN A TREE in order to use the
tarball's contents.

Another time, I downloaded some Java source which contained the author's name
in comments. The author was German and his name contains an "o" with two dots
over it. That was the only non-ASCII character in the files. Eclipse and
command-line javac WOULD NOT PROCESS THE FILE and I ended up removing his name
from all comments; after that it compiled without a hitch. This was the
official Oracle (then Sun) javac. A fricking SOURCE TO BYTECODE COMPILER
SHOULD NOT DEPEND ON YOUR SYSTEM'S NATIONALITY SETTINGS -- OR ANY LOCAL SYSTEM
SETTINGS! -- TO DO ITS JOB. But it does.

Whenever you debootstrap a new Debian / Ubuntu system, using apt-get causes
complaints about using the C locale until you do some magic incantation called
"generating locales." Exactly what has to be generated and why the generated
files can't either be included with binaries and other generated files, or
auto-generated during the installation of the distro, defies explanation.

Playing Japanese import games sometimes requires you to do strange things to
your Windows installation.

And of course internationalization issues are often cited as a factor holding
back many Web frameworks and other libraries from porting from Python 2 to
Python 3; and of course a lack of library support has been the major
showstopper for Python 3 for years now.

My advice to startups: Don't worry about non-English markets until your VC
funding and/or revenue is substantial enough to support at least one full-time
developer to work on the issue. A working technical understanding of
internationalization is going to be a huge sink of development resources and
intellectual bandwidth, which you probably can't afford while bootstrapping.

~~~
guns
> _You have encodings, Unicode, ASCII, UTF-8, ISO 9660, Latin-1, code pages,
> UTF-16, byte order marks, gettext macros, po files, ... the terminology and
> model of the problem domain are extremely complex and difficult to
> understand._

In the beginning, there was ASCII [1]. It was a simple encoding that mapped a
byte stream to standard American letters, numerals, and punctuation marks, as
well as some common non-printing control codes.

ASCII only used the lower 7 bits of the 8-bit byte, reserving the upper 128
positions for any non-American characters needed for national encodings.

And indeed, many dozens of national character encodings appeared that used
ASCII for their lower 128 positions and implemented their own character
tables in the upper 128. One very popular encoding was Latin-1 [2]. This
became the standard encoding in much Western software because it adequately
handled the most widely used Western languages.

One major problem with these 8-bit national encodings is that their upper 128
codes are mutually incompatible. Confusingly, almost all of them shared the
base 128 ASCII codes, so programmers and users began to equate "plain text"
and "sane encoding" with 7-bit ASCII, since one could effectively communicate
universally by restricting oneself to the characters in the printable ASCII
table.
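
A quick sketch of the mutual incompatibility:

      raw = b'\xe9'
      print(raw.decode('latin-1'))            # é
      print(raw.decode('cp1251'))             # й (Cyrillic short i)
      print(raw.decode('ascii', 'replace'))   # not even valid ASCII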

As it became clear that the proliferation of 8-bit encodings was untenable,
there emerged Unicode. Unicode is not an encoding, but a standard that
provides a table of universal code points, along with some recommendations
about how to combine and display certain code points. [3]
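
In Python 3 terms, a code point is just a number attached to an abstract
character, independent of any byte encoding:

      print(ord('é'))                               # 233, i.e. U+00E9
      print('\u00e9')                               # é
      print('\N{LATIN SMALL LETTER E WITH ACUTE}')  # é again, by name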

Unicode is implemented in the modern day by UTF-8, UTF-16, and UTF-32, which
are primarily distinguished, as you might guess, by the base size of the code
unit.

UTF-32 is a simple encoding that maps every Unicode code point to a single
32-bit code unit. This is trivial to parse, but potentially very wasteful, so
it is rarely used.

UTF-16 uses 16-bit code units, and is able to directly encode the most
commonly used portion of Unicode, the Basic Multilingual Plane. For code
points above U+FFFF, a surrogate pair is used to spread the code point across
two code units. This encoding is frequently used on Windows and in Java.

UTF-8 is a variable-length Unicode encoding like UTF-16, but it uses a
one-byte code unit and has a famously elegant design that leaves plain ASCII
text valid as-is, so it appeals strongly to miserly Unix hackers.
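
A sketch of the trade-offs, counting encoded bytes for a seven-character
string:

      s = 'héllo \N{SNOWMAN}'
      for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
          print(enc, len(s.encode(enc)))           # 10, 14, and 28 bytes

      # Above U+FFFF, UTF-16 needs two code units (a surrogate pair):
      print(len('\U0001d11e'.encode('utf-16-le')))  # 4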

> _For example, one time I downloaded some tarball (I forget what it was) that
> had a few bytes of binary garbage at the beginning of every file. After some
> research I found out that it's called the BOM and has something to do with
> international text, and I ended up having to WRITE A SCRIPT WHICH GOES
> THROUGH AND DELETES THE FIRST FEW BYTES OF EVERY FILE IN A TREE in order to
> use the tarball's contents._

The Byte Order Mark is a clunky solution to the fundamental problem of
divining the character encoding of an arbitrary byte stream. It's great if
all your tools transparently support it, but annoying if not. However, some
sort of convention or metadata is necessary to correctly decode your data.
Python, Ruby, and other scripting languages have begun to coalesce around the
magic encoding comment for source files (i.e. `# encoding: utf-8` as the
first or second line).
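
E.g., a sketch of the BOM-stripping chore from the grandparent (the path is
hypothetical):

      import codecs

      with open('some/file.txt', 'rb') as f:    # hypothetical path
          data = f.read()
      if data.startswith(codecs.BOM_UTF8):      # b'\xef\xbb\xbf'
          data = data[len(codecs.BOM_UTF8):]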

Most everybody falls back to ASCII if no encoding is specified and none can
be inferred from the stream itself. The better fallback is UTF-8, since
breakages like yours are then less likely to occur, which is why it is
encouraged as the default system encoding in most cases.

> _Another time, I downloaded some Java source which contained the author's
> name in comments. The author was German and his name contains an "o" with
> two dots over it. That was the only non-ASCII character in the files.
> Eclipse and command-line javac WOULD NOT PROCESS THE FILE and I ended up
> removing his name from all comments; after that it compiled without a hitch.
> This was the official Oracle (then Sun) javac. A fricking SOURCE TO BYTECODE
> COMPILER SHOULD NOT DEPEND ON YOUR SYSTEM'S NATIONALITY SETTINGS -- OR ANY
> LOCAL SYSTEM SETTINGS! -- TO DO ITS JOB. But it does._

These tools likely have ways of setting the encoding without inheriting from
the environment, but they do fall back on the environment as a simple
convention.

The trouble is that there is no longer any reason to assume that all text
_must_ be 7-bit ASCII. Unix and programming languages are evolving to handle
this new multilingual digital world. The only obstacle that really remains is
programmers, so I think it's fair to spend a little time learning the basics
of the subject.

[1]: There were other antediluvian encodings (like EBCDIC).

[2]: a.k.a. ISO-8859-1. Windows used a slightly modified version of this and
called it Windows-1252, in order to complicate matters.

[3]: The actual display of composite glyphs is left to the implementor. For
instance, both ready-made composite glyphs like é and a "non-spacing"
combining acute accent mark are provided.

~~~
masklinn
And you didn't even touch on fixed- and variable-width Asian character sets,
like Shift-JIS (variable, 1 or 2 bytes) or Big5 plus extensions (ETEN or
CP950, fixed 2 bytes).

------
gitarr
This just got much clearer in Python 3: all string literals are now Unicode
by default, making it much easier to write internationalized programs in
Python.

There is an "Explicit Unicode Literal" (u"string"), to make it easier for
library authors to run their libs in one codebase on Python 2 and Python 3.
(In Python 3 "string" is the same as u"string")
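
A sketch of what that buys a dual-codebase library:

      from __future__ import unicode_literals  # a no-op on Python 3

      s = u"naïve"         # u'' is legal again as of Python 3.3 (PEP 414)
      assert s == "naïve"  # identical to the plain literal on Python 3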

Codecs are in their own modules (base64, for example), where they belong.

The confusion seems to be with Python 2; Python 3 has fixed it, so move on.

