
There's a reason they're called "encodings" - diwank
http://stackoverflow.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode/370199#370199
======
chris_wot
I wrote a potted history of the precursors to Unicode.[1] Mainly because I was
dealing with Oracle and Unicode far, far too much.

Might be useful to someone...

1\. [http://www.randomtechnicalstuff.blogspot.com.au/2009/05/unic...](http://www.randomtechnicalstuff.blogspot.com.au/2009/05/unicode-and-oracle.html)

------
ot
This should be a mandatory read before anyone asks questions about Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know
About Unicode and Character Sets

<http://www.joelonsoftware.com/articles/Unicode.html>

~~~
adobriyan
This reminds me of kids buying abridged summaries of "War and Peace" and
other lengthy works because they don't want to, or don't have time to, read
the original.

I personally think the first and only mandatory thing one should read about
Unicode is the Standard itself.

Don't read "UTF-8 is cool!!!" blog posts.

Don't read RFC 3629.

Read the original.

[http://www.unicode.org/versions/Unicode6.1.0/UnicodeStandard...](http://www.unicode.org/versions/Unicode6.1.0/UnicodeStandard-6.1.pdf)

It is nicely written, and it has normative definitions (code points, glyphs,
et al.).

It describes the BOM (which Joel on Software apparently failed to understand).

It describes UTF-8 error handling, normalization, and many other necessarily
non-trivial topics.

------
compay
I made a little practical demo about using Unicode with Ruby for a
presentation a while back, perhaps of interest to some people:

<https://github.com/norman/enc/blob/master/equivalence.rb>
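
For anyone who prefers Python, the equivalence idea that demo's filename suggests can be sketched with the standard library's `unicodedata` module (this is my own analogue, assuming the demo covers canonical equivalence):

```python
import unicodedata

# "é" as one precomposed code point vs. "e" plus a combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# Canonically equivalent, yet unequal as raw code point sequences.
print(precomposed == decomposed)  # False

# Normalizing both to NFC makes the comparison behave as users expect.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```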

------
wcoenen
That answer is missing an explanation of how Python 2 goofed up and how
Python 3's strings fixed things. Without that explanation, showing how Python
2 strings must be "decoded" is just terribly confusing.

It doesn't really make sense to "decode" a proper string type. Ideally, the
language should never reveal how it represents strings internally in memory,
so you can think of strings as a sequence of abstract Unicode code points
with no specific encoding at all. Python 3 strings are like that.
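
A small sketch of that abstraction in Python 3 (my example): the string behaves as code points, and a concrete encoding only appears at the boundary, when you explicitly ask for bytes.

```python
s = "héllo"      # a str: a sequence of abstract code points
print(len(s))    # 5 — counted in code points, not bytes
print(s[1])      # 'é' — indexing is by code point as well

# Encodings only enter the picture when you leave the string abstraction.
print(len(s.encode("utf-8")))    # 6 — 'é' takes two bytes in UTF-8
print(len(s.encode("latin-1")))  # 5 — one byte per character here
```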

~~~
hcarvalhoalves
Python 3's string type is simply Python 2's unicode, and Python 3's bytes
type is Python 2's string. Python 3 didn't fix anything; it just cleared up
the semantics, because far too many people expect a type called "string" to
hold "text" - but often what they really mean is "unicode".
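
The cleared-up semantics show most plainly in what Python 3 refuses to do (a sketch of my own): Python 2 silently coerced between its str and unicode types, while Python 3 makes mixing text and bytes an error, forcing the decode to be explicit.

```python
text = "naïve"               # str: text (Python 2's unicode)
raw = text.encode("utf-8")   # bytes: raw data (Python 2's str)

try:
    text + raw               # implicit mixing is a TypeError in Python 3
except TypeError as e:
    print("mixing refused:", e)

print(text + raw.decode("utf-8"))  # explicit decoding is required
```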

