
Unicode in Python 3 - buttscicles
http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
======
wbond
Having written a bunch of Python 2 and porting it to 3 where I deal with
unknown encodings (FTP servers), I can't help but disagree with Armin on most
of his Python 3 posts.

The crux of his argument with this article is "unix is bytes, you are making
me deal with pain to treat it like Unicode." Python 2 just allowed you to take
crap in and spit crap out. Python 3 requires you to do something more
complicated when crap comes in. In my situation, I am regularly putting data
into a database (PostgreSQL with UTF-8 encoding) or working with Sublime Text
(on all three platforms). You try to pass crap along to those and they
explode. You HAVE to deal with crappy input.

In my experience, Python 2 explodes at run time when you get weird crappily-
encoded data. And only your end users see it, and it is a huge pain to
reproduce and handle. Python 3 forces you to write code that can handle the
decoding at the get go. By porting my Python 2 to 3, I uncovered a bunch of
places where I was just passing the buck on encoding issues. Python 3 forced
me to address the issues.
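
The "handle the decoding at the get go" pattern being described can be sketched like this (the function name and fallback chain here are my own illustration, not anyone's actual code):

```python
def decode_dirty(raw):
    """Decode bytes of unknown provenance at the boundary, explicitly."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: surrogateescape keeps the bad bytes round-trippable
    # instead of exploding in front of an end user.
    return raw.decode("utf-8", errors="surrogateescape")
```

The point is that the decision about crappy input is made once, at the edge, rather than surfacing as a UnicodeDecodeError deep inside the program.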

I'm sure there are bugs and annoyances along the way with Python 3. Oh well.
Dealing with text input in any language is a pain. Having worked with Python,
C, Ruby and PHP and dealing with properly handling "input" for things like
FTP, IMAP, SMTP, HTTP, etc, yeah, it sucks. Transliterating, converting
between encodings, wide chars, Windows APIs. Fun stuff. It isn't really Python
3 that is the problem, it is undefined input.

Unfortunately, it seems Armin happens to play in areas where people play fast
and loose (or are completely oblivious to encodings). There is probably more
pain generally there than dealing with transporting data from native UI
widgets to databases. Sorry dude.

Anyway, I never write Python 2 anymore because I hate having this randomly
explode for end-users and having to try and trace down the path of text
through thousands of lines of code. Python 3 makes it easy for me because I
can't just pass bytes along as if they were Unicode, I have to deal with
crappy input and ask the user what to do.

Python 2 is a dead end with all sorts of issues. The SSL support in Python 2
is a joke compared to 3. You can't re-use SSL contexts without installing the
cryptography package, which requires cffi, pycparser and a bunch of other
crap. Python 2 SSL verification didn't exist unless you rolled your own, or
used Requests. Except Requests didn't even support HTTPS proxies until less
than a year ago.

Good riddance Python 2.

~~~
the_mitsuhiko
> Python 3 requires you to do something more complicated when crap comes in.

Or in most cases: Python 3 falls flat on the floor with all kinds of errors
because you did not handle unicode in one of the many ways you need to
handle it.

On Python 2 you decoded and encoded. On Python 3 you have so many different
mental models you constantly need to juggle (is it unicode, is it latin1
transfer-encoded unicode, does it contain surrogates?), and then for each of
them you need to start thinking about where you are writing it to. Is it a
bytes-based stream? Then surrogate errors can be escaped and might result in
encoding garbage, same as in Python 2. Is it a text stream? Then that no
longer works: you either crash or write different garbage. If it's latin1
transfer-encoded, then most people don't even know that they have garbage. I
filed lots of bugs against that in WSGI libs.
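
The surrogate mental model being described can be made concrete (my own illustrative snippet): undecodable bytes survive a surrogateescape round-trip back to a bytes stream, but explode on a strict text encoding.

```python
raw = b"caf\xe9"  # latin-1 bytes; 0xe9 is invalid as UTF-8

# surrogateescape smuggles the bad byte through as the surrogate U+DCE9:
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "caf\udce9"

# Writing back to a bytes-based stream round-trips (garbage in, garbage out):
assert text.encode("utf-8", errors="surrogateescape") == raw

# Writing the same string to a strict text encoding crashes instead:
try:
    text.encode("utf-8")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass  # the "crash or write different garbage" case
```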

If you write error free Python 3 unicode code, then teach me. (Or show me your
repo and I show you all the bugs you now have)

~~~
zzzeek
> (Or show me your repo and I show you all the bugs you now have)

this would be great. Show me! I'd love to know:

[https://bitbucket.org/zzzeek/sqlalchemy/](https://bitbucket.org/zzzeek/sqlalchemy/)

[https://bitbucket.org/zzzeek/mako/](https://bitbucket.org/zzzeek/mako/)

[https://bitbucket.org/zzzeek/alembic/](https://bitbucket.org/zzzeek/alembic/)

I'm guessing you'd go for Mako first since it has the most unicode intense
stuff going on (and it uses lots of your code).

~~~
the_mitsuhiko
As an example, the mako cli. You can call this an error or not, but with the C
locale your cmdline will die with UnicodeErrors when you open a nonexistent
file with a unicode filename on Python 3, but not on Python 2, where it will
do the correct thing. It will also die with unicode errors in the same
situation when your template renders any unicode characters. Again, something
that probably works fine, and correctly, on Python 2.

Or if you put unicode characters into your README.rst, you could no longer
safely install mako. Again, Python 3 only.

These are just two things I found on github.

Another easy one: alembic's README now can no longer safely contain unicode.
It would break on Python 3, but works just fine on Python 2, because of the
code in list_templates.

~~~
zzzeek
the cmdline template runner at the moment isn't doing unicode in Py2K either;
it crashes there too.

------
twic
There was a related discussion on the Mercurial mailing list a while back. Not
about Python 2 vs 3, but about filename encoding.

Mercurial follows a policy of treating filenames as byte strings. Matt Mackall
is very clear about this. Because unix treats filenames as byte strings, this
makes Mercurial interoperate with other programs on a unix machine pretty
well: you can manage files of any encoding, you can embed filenames in file
contents (eg in build scripts) and be confident they will always be byte-for-
byte identical with the names managed by Mercurial, etc.

However, it also means Mercurial falls flat on its face when it's asked to
share files between machines using different encodings. Names which work fine
on one machine will, to human eyes, be garbled nonsense on the other.

This is a problem which does actually happen; there is a slow trickle of bug
reports about it. And because of the commitment to unix-style filenames, it
will probably never be fixed. List members did try to come up with some ideas
to fix it which preserved the unix semantics in normal cases, but they weren't
popular.

And before anyone gets lippy, i assume Git has the same problem.

Ultimately, i would say this comes down to a conflict between two
fundamentally different kinds of users of strings: machines and people.
Machines are best served by strings of bytes. People are best served by
strings of characters. Usually. And sadly, unix's lack of a known filesystem
encoding is too well-established for there to be much chance of building a
bridge.

~~~
andreasvc
What do you mean by "share files between machines"? Do you mean over a
protocol? In that case the protocol over the wire should be well-defined and
would avoid problems. If you mean sharing files over a USB stick, then it's
not so much an application problem as an OS issue.

I don't think the argument about machines wanting bytes is true. Machines will
accept anything as long as it is well-defined. I'm really curious why there
isn't yet some Linux or Posix standard that mandates utf-8. What's the problem
with just decreeing that version +1 of the standard now expects utf-8?

~~~
twic
_What do you mean by "share files between machines"?_

Commit files into a repository on one machine. Move it to another on a USB
stick, by FTP, with the DVCS's transport protocol, whatever. All of those
result in repositories containing byte-for-byte identical commits.

 _In that case the protocol over the wire should be well-defined and would
avoid problems._

Oh, all of these are well-defined. They're defined to produce filenames which
comprise the same sequence of bytes everywhere. That's the problem!

 _If you mean by sharing files over an USB-stick then it's not so much an
application problem as an OS issue._

Bear in mind that the problem is not what the OS does with the names of files
being moved around, it's with what the DVCS does with the names that are
embedded in the content of its data files.

~~~
andreasvc
Ah, thanks for clarifying.

------
overgard
I had to deal with this a lot at a job I used to have (not python
specifically, but just with unicode issues), and there's really just not a
right answer to how to do any of this. Any solution you pick is going to suck
for someone.

One thing he leaves out of the case for Python 2 being better: OK, for cat
you can treat everything as one long byte array. But what if, say, I need to
count how many characters are in that string? Or what if I need to write a
"reverse cat", which reverses the string? Python 2's model is entirely broken
there.
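
The two failures above can be shown concretely (illustrative snippet, not from the article):

```python
data = "tøst 日本語".encode("utf-8")  # what a Python 2 str would hold

# Byte length and character length disagree for multi-byte text:
assert len(data) == 15                  # 15 bytes...
assert len(data.decode("utf-8")) == 8   # ...but only 8 characters

# A byte-level "reverse cat" shreds multi-byte sequences into garbage:
try:
    data[::-1].decode("utf-8")
except UnicodeDecodeError:
    pass  # reversed bytes are no longer valid UTF-8

# Reversing the decoded string is the operation you actually wanted:
assert data.decode("utf-8")[::-1] == "語本日 tsøt"
```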

Armin suggests that printing broken characters is better than the application
exploding, and I agree... sometimes. On the other hand, try explaining to a
customer why the junk text they copy-pasted from Microsoft Word into an HTML
form has question marks in it when it shows up on your site.

The problem with the whole "treat everything as bytes" thing is that you'll
never have a system that quite works. You'll just have a system that _mostly_
works, and _mostly_ for languages closer to English. Going the rigorous route
is the hard way, but it will end up with systems that actually work right.

------
rdtsc
> There is a perfectly other language available called Python 2, it has the
> larger user base and that user base is barely at all migrating over. At the
> moment it's just very frustrating.

I come from a different perspective. I looked at the benefits of Python 3 and
looked at my existing code base and how it would be better if it were written
in Python 3, and apart from bragging rights and having a few built-in modules
(that I now get externally), it wouldn't actually be better.

To put it plainly, Python 3, for me, doesn't offer anything at the moment.
There is no carrot at the end. I have not seen any problems with Unicode yet.
Not saying they might not be lurking there, I just haven't seen them. And,
most important, Python 2 doesn't have any stick to beat me over the head with
to justify migrating away from it. It is just a really nice language: fast,
easy to work with, plenty of libraries.

From _my_ perspective Python 3 came at the wrong time and offered the
wrong thing. I think it should have happened a lot earlier, I think to justify
incompatibilities it should have offered a lot more, for example:

* Increased speed (a JIT of some sort)

* Some new built-in concurrency primitives or technologies (something greenlet or message passing based).

* Maybe a built-in web framework (flask) or something like requests or ipython.

It is even hard to come up with a list, just because Python 2 with its
library ecosystem is already pretty good.

~~~
tormeh
Well, the payoff is probably not that great, but how much effort is really
required to move to 3? Rewriting print statements and changing a couple import
statements? Anything else? There's no carrot and no stick, but you're only
being asked to stand up for a second so someone can switch your chair into
something more comfortable. You're not exactly rewriting it in Perl.

~~~
rdtsc
> but how much effort is really required to move to 3? Rewriting print
> statements and changing a couple import statements?

Most important -- risk. Risk that stuff will break. One of the biggest ones is
the change to .keys() and .values() to behave like iterkeys().
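
That change in a nutshell (my own illustrative snippet):

```python
d = {"a": 1, "b": 2}
ks = d.keys()

# In Python 3 this is a live view (like Python 2's iterkeys()), not a list:
assert not isinstance(ks, list)
d["c"] = 3
assert "c" in ks              # the view reflects later mutations

# Code that relied on getting a list (indexing, in-place sorting) now
# has to materialize the view explicitly:
assert sorted(ks) == ["a", "b", "c"]
```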

Also unicode vs byte strings.

Also time. And time = $ in most places.

So far the benefits are just not there. It is something like this:

benefit(switch) = code_improvements - time - risk + possible_future_benefits

(time has an opportunity cost folded into it, if I am porting to 3 I am not
working on other stuff).

So far benefit is either negative or just too small for me.

~~~
zzzeek
> Most important -- risk. Risk that stuff will break.

that's what test coverage is for. if you don't have coverage, then your code
is already broken.

~~~
rdtsc
Well, customers pay for it, use it and like it. That, in my book, qualifies it
as not broken. They can choose other software, but they pick this one.

Also, just because unit tests cover the code and pass doesn't mean the product
is not broken. Two working units of code added together in a system don't
guarantee that the system will do what it is supposed to. So yes, there is
risk.

The bigger problem is that there are no tangible benefits to Python 3. That
is its tragedy, the way I see it.

And time-wise, it is pretty sad, it might take me less than a few days to work
through it, but it is still not worth it.

------
ak217
Is sys.getfilesystemencoding() not a good way to get at filename encoding?

I think on the face of it I do like the Go approach of "everything is a byte
string in utf-8" a lot, but I haven't really worked with it so there's
probably some horrible pain there somewhere, too. In the meantime Python 3 is
a hell of a lot better than Python 2 to me because it doesn't force unicode
coercion with the insane ascii default down my throat (by the time most new
Python 2 coders realize what's going on, their app already requires serious
i18n rework). Also, I don't really know why making sure stuff works when
locale is set to C is important - I would simply treat such a situation as
broken.

In writing python 2/3 cross-compatible code, I've done the following things
when on Python 2 to stay sane:

- Decode sys.argv asap, using sys.stdin.encoding

- Wrap sys.stdin/out/err in text codecs from the io module
([https://github.com/kislyuk/eight/blob/master/eight/__init__....](https://github.com/kislyuk/eight/blob/master/eight/__init__.py#L78-L98)).
This approximates Python 3 stdio streams, but has slightly different buffering
semantics compared to Python 2 and messes around with raw_input, but it works
well. Also, my wrappers allow passing bytes on Python 2, since a lot of things
will try to do so.
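
The wrapping technique described boils down to putting io.TextIOWrapper around a raw byte stream; a minimal in-memory sketch (the linked code does this for the real stdio file descriptors):

```python
import io

raw = io.BytesIO()                      # stands in for a byte-level stdout
text = io.TextIOWrapper(raw, encoding="utf-8")

text.write("tøst 日本語\n")             # callers write str...
text.flush()
# ...and encoded bytes come out the bottom, in one well-defined place:
assert raw.getvalue() == "tøst 日本語\n".encode("utf-8")
```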

~~~
the_mitsuhiko
getfilesystemencoding() is unreliable on Linux, as Linux has no file system
encoding. It just returns the first match of LC_ALL, LC_CTYPE, LANG (not sure
in which order).
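
For what it's worth, whatever encoding getfilesystemencoding() reports, os.fsdecode/os.fsencode apply it with surrogateescape, so arbitrary byte filenames at least round-trip (illustrative sketch; the printed value depends on your locale):

```python
import os
import sys

print(sys.getfilesystemencoding())   # locale-derived on Linux, e.g. 'utf-8'

raw = b"t\xf8st"                     # a latin-1 filename, invalid as UTF-8
name = os.fsdecode(raw)              # decoded with errors='surrogateescape'
assert os.fsencode(name) == raw      # byte-for-byte round-trip either way
```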

~~~
pilif
I wouldn't call it unreliable then because whatever LC_CTYPE is set to is what
the user expects their file names to be interpreted as.

If the contents of LC_CTYPE is wrong for a particular file name, at least you
get consistency between your python program and everything else on the system.

------
inklesspen
If you want to work with bytes on stdin and stdout, Python 3 documents how to
do that, at the same place it documents the stdin and stdout streams.

[https://docs.python.org/3/library/sys.html#sys.stdin](https://docs.python.org/3/library/sys.html#sys.stdin)

All you have to do is use sys.stdin.buffer and sys.stdout.buffer; the caveat
is that if sys.stdin has been replaced with a StringIO instance, this won't
work. But in Armin's simple cat example, we can trivially make sure that won't
happen.
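
Concretely, the bytes-level cat looks something like this (a sketch; the dst parameter is my addition so the function can be tested without touching real stdout):

```python
import shutil
import sys

def bcat(paths, dst=None):
    """cat that copies raw bytes, bypassing any text encoding."""
    if dst is None:
        dst = sys.stdout.buffer   # the bytes layer beneath sys.stdout
    for path in paths:
        with open(path, "rb") as f:
            shutil.copyfileobj(f, dst)
```

Because both ends are byte streams, arbitrary binary (or wrongly-encoded filenames' contents) pass through untouched, just like the unix cat.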

I'd be a lot more willing to listen to this argument if it didn't overlook
basic stuff like this.

~~~
CatMtKing
I guess it's a little odd that Python 3 treats stdin and stdout by default as
unicode text streams. And sys.argv is a list of unicode strings, too, instead
of bytes.

~~~
pekk
I like Python 3's unicode handling but I agree that this seems strange. It is
because people expect to see "characters" from these interfaces after treating
them as ASCII-only for so long. If Python 3 had insisted on real purity with
bytes objects I think it would have died a long time ago. Which is sad.

------
mangecoeur
I get that Armin runs into pain points with Py3, but on the other hand I get
annoyed with the heavily English-centric criticisms - it's easy to think py2
was better when you're only ever dealing with ASCII text anyway.

Fact is, most of the world doesn't speak English and needs accents, symbols,
or completely different alphabets or characters to represent their language.
If POSIX has a problem with that then yes, it is wrong.

Even simple things like French or German accents can make the Py2 csv module
explode, while Py3 works like a dream. And anyone who thinks they can just
replace accented characters with ASCII equivalents needs to take some language
lessons - the result is as borked and nonsensical as if, in some parallel
universe, I had to replace every "e" with an "a" in order to load simple
English text.

~~~
the_mitsuhiko
My libraries all support unicode on Python 2. And in fact, they do it
better than on Python 3. File any unicode bugs you might encounter on Python 2
against me, please.

~~~
mangecoeur
This is true, and in fairness I've never had any problems with unicode using
any of your libraries, probably because you take a lot of care in explicitly
dealing with encoding.

But that's not always the case with other libraries, like the csv module. The
core unicode support in py3 means that a lot of libraries which were not
written with explicit unicode in mind Just Work with it in py3, and it's a
huge time saver.

------
lmm
If you're happy with Go's "everything is a unicode string" approach then you
should be happy to just treat everything as unicode. Don't handle the decode
errors - if someone sends some data to your stdin that's not in the correct
encoding, too bad.

Yes, python3 makes it hard to write programs that operate on strings as bytes.
This is a good thing, because the second you start to do anything more
complicated than read in a stream of bytes and dump it straight back out to
the shell (the trivial example used here), your code will break. Unix really
is wrong here, and the example requirement would seem absurd to anyone not
already indoctrinated into the unix approach: you want a program that will
join binary files onto each other, but also join strings onto each other, and
if one of those strings is in one encoding and one is in another then you want
to print a corrupt string, and if one of them is in an encoding that's
different from your terminal's then you want to display garbage? Huh? Is that
really the program you want to write?

~~~
chimeracoder
> If you're happy with Go's "everything is a unicode string" approach then you
> should be happy to just treat everything as unicode.

That's actually not really Go's approach. In Go, strings do not have encodings
attached to them.

 _Source files_ are defined to be UTF-8 (by the compiler), so string literals
are always unicode. That's not quite the same thing as saying that the
"string" type in Go is always Unicode (it's not). And when you're dealing with
a byte slice ([]byte), you cannot make any assumptions about the encoding.

It took a bit to wrap my head around this when I first read about it[0], but
now that I think about it, I think it's the right way to go[1].

[0] [http://blog.golang.org/strings](http://blog.golang.org/strings)

[1] And for what it's worth, Go and UTF-8 were designed by (some of) the same
people, so one would hope they'd get it right!

~~~
lmm
You're right, I was lazily responding to the article on its own terms rather
than engaging properly with go's string handling.

I think the approach of keeping strings encoded has promise but it would need
to be supported by a stronger type system than go's. When you're carrying
around two encoded byte arrays it's really important that you know what their
encodings are and don't try and e.g. concatenate them. Ruby can do this right
because it can give you a runtime type error, but that's not acceptable in a
compiled language. So you need to distinguish between byte arrays with
statically known encodings, byte arrays with dynamically known encodings and
byte arrays with unknown encodings. And you should really e.g. disallow
slicing a byte array that represents a string, so that you don't cut a
character in half.

I know that one of the UTF-8 guys worked on go and I'm sure go will work well
when everything's in UTF-8. But all languages work well when everything's in
UTF-8; if anything this makes me _more_ worried that go's authors won't give
proper support to those who have to work with strings in non-UTF8 encodings.
(By contrast one of the reasons Ruby's support is good is that the author is
Japanese and therefore pretty much has to work with strings in multiple non-
unicode encodings, because of han unification)

~~~
burntsushi
> So you need to distinguish between byte arrays with statically known
> encodings, byte arrays with dynamically known encodings and byte arrays with
> unknown encodings.

I don't buy this. You've just introduced some rather large complexity into the
types of byte strings just for the sake of handling non-UTF8 cases, which seem
to be getting less common than they used to be.

Rust is taking a similar approach to Go (except that their string type _must_
be valid UTF8). You can see a recent debate here:
[https://mail.mozilla.org/pipermail/rust-dev/2014-May/009725.html](https://mail.mozilla.org/pipermail/rust-dev/2014-May/009725.html)

~~~
lmm
If you don't care about handling non-UTF8 cases, you can use pretty much any
language (python3 included - the issues the OP is complaining about are when
you have filenames in a different encoding from your terminal or the like),
write the obvious thing, and it will work fine.

For many use cases that's good enough. But the cases where languages are
different, the cases where it gets interesting, are when that isn't enough.
(And it won't be enough if you want to sell your software in Japan, for
example).

~~~
pcwalton
There's actually a bit more to what Rust does: there is a well-known community
library called rust-encoding that adds new string types that support various
encodings. You can use this library if you need to support other encodings.
The standard library supports only UTF-8, but it's simple enough to abstract
over strings in multiple encodings if you need to (thanks to generics).

I like this approach: it allows simplicity in the common case, for software
that only needs to work in UTF-8, while allowing support for arbitrary other
encodings. "Easy things should be easy, and hard things should be possible."

------
cool-RR
Worth it if only for `copyfileobj`. As a seasoned Python expert, I was not
familiar with that function. From the docs:

 _shutil.copyfileobj(fsrc, fdst[, length]) Copy the contents of the file-like
object fsrc to the file-like object fdst. The integer length, if given, is the
buffer size. In particular, a negative length value means to copy the data
without looping over the source data in chunks; by default the data is read in
chunks to avoid uncontrolled memory consumption. Note that if the current file
position of the fsrc object is not 0, only the contents from the current file
position to the end of the file will be copied._
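
A quick sketch of the file-position caveat from that last sentence:

```python
import io
import shutil

src = io.BytesIO(b"0123456789")
src.seek(4)                      # current position is not 0
dst = io.BytesIO()
shutil.copyfileobj(src, dst)
assert dst.getvalue() == b"456789"   # only position 4 onward is copied
```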

------
andreasvc
I think the main problem here is an impedance mismatch caused by forcing
things to be Unicode. While the Python developers are technically correct (the
best kind of correct, they say) in claiming that LANG=C means ASCII, that's
not how everything else in UNIX has worked until now; most applications don't
crash because of encoding errors. And filenames are byte strings, so forcing
Unicode on them is a bad idea.

It would be great if everyone fixed their locale settings and all their
filename encodings but in the meantime this will cause even more friction for
Python 3 adoption.

------
andrewstuart
It's a great concern that some of Python's most respected developers such as
mitsuhiko and Zed Shaw are not on board with the current future direction of
Python. It would be a better world for all if somehow Python 4 could be
something that everyone is happy with - I want the mitsuhikos and Zed Shaws of
the world to be writing code that I can run as a Python 3 user, written in a
language that these top level developers feel enthused about.

Is there no way forward that everyone agrees on? Has anyone ever proposed a
solution?

------
shadowmint
> That I work with "boundary code" so obviously that's harder on Python 3 now
> (duh)

mhm. I tell people now and then that python 3 (and the python 3 developers)
are hostile to people embedding it and using it for low level tasks
specifically because of this unicode stuff, and they tend to tell me I should
just suck it up.

I suppose I'm morbidly glad I'm not the only one feeling the pain, but really,
it honestly feels like the python 3 line is just not making any effort towards
making this stuff easier and simpler. :/

~~~
ygra
Unicode, dealing with text, i18n are _never_ easy and simple. That being said,
there are lots of things that work on both Windows and Unix and use Unicode
internally, even for file names and paths (e.g. Qt and the already-mentioned
Java). Qt is even used by a popular-ish desktop environment. If that approach
were that unsuitable and utterly incompatible with the Unix approach on
encodings I wonder why it apparently does work.

~~~
tormeh
Isn't utf-8 just a string of bytes where each byte represents a sign? Why is
it hard?

~~~
sp332
No.
[http://en.wikipedia.org/wiki/UTF-8#Description](http://en.wikipedia.org/wiki/UTF-8#Description)
How could you encode millions of characters into just 256 values?

~~~
tormeh
Honestly, I thought that was what utf-16 was for. I thought the number was the
bitlength of a sign.

~~~
ygra
The number says how wide a _code unit_ is. However, it doesn't say anything
about how many code units are required to encode a single _code point_. UTF-8
needs 1–4, UTF-16 needs 1–2 and UTF-32 needs 1. They all are able to represent
all encoded characters of Unicode, it's just that the individual bytes are
different.
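
A concrete illustration of code units vs code points (my own snippet):

```python
s = "aø語"   # 3 code points: 1-byte, 2-byte and 3-byte in UTF-8

assert len(s.encode("utf-8")) == 6            # 6 one-byte code units
assert len(s.encode("utf-16-le")) // 2 == 3   # 3 two-byte code units
assert len(s.encode("utf-32-le")) // 4 == 3   # 3 four-byte code units

# Outside the BMP, UTF-16 needs two code units (a surrogate pair):
emoji = "\U0001F600"
assert len(emoji.encode("utf-16-le")) // 2 == 2
assert len(emoji.encode("utf-8")) == 4
```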

------
andrewstuart
I hear and understand and agree with the issues raised, the question is what
is the right way to fix this stuff? How can we get there?

How can we get the Python 2 stalwarts and the Python 3 folks to all sit in the
same figurative room and create a future that everyone is happy with?

It would be nice to see the ongoing grumbling about Python 3 replaced with a
tangible peace process.

Are the warring parties talking about solutions?

~~~
rectangletangle
It'd be nice; I don't want Python to go the way of Perl.

~~~
andrewstuart
The Python 2 "stayers" choices are:

1: stick with Python 2 forever

2: move their skills to another language

3: let go of Python 2 and move to Python 3 despite their concerns

Armin seems so disillusioned that I get the sense he'll either go for option 1
or 2, which is very concerning for people using Jinja2 and Flask and all his
other stuff (most of which he has converted to Python 3 albeit
unenthusiastically). He has said in one of his blog posts that although he has
ported his code to Python 3, he does not use it himself at work and doesn't
intend to. Having said that, his stance has softened significantly over the
most recent 12 months as evidenced by the full porting of Flask to Python 3.

Zed, I'm guessing, will initially go for option 1 and, given his previous
disposition to change technologies, might get sick of Python 2's deadness and
go for option 2. Zed's public proclamations appear to have invested him quite
heavily in not going with Python 3, so it's hard to see what would ever lead
him there.
Zed's "Learn Python the Hard Way" is the gateway through which new Python
programmers are learning and thus all those new developers are starting out as
Python 2 only people. If a way can be found to satisfy Zed that some future
version of Python is "a good thing", then he will bring his students/followers
with him.

But who knows.

It would be good if there was:

4: Zed and Armin and the other most vocal Python 2 advocates specify what they
want to see in Python 4, somehow it gets included, everyone happy.

Zed and Armin are by no means the only first-class Python developers - there
are tons of others - but they write really interesting stuff, they are
extremely outspoken in their criticism of Python 3, and they have a large and
loyal following who respect their opinions, so it would be nice to see them
happily participating in Python's future.

What can be done to bring the Python 2 stayers to the most recent releases of
Python? Who knows. It's not healthy for Python 3 to have such vocal critics so
something should be done.

Even as the 2 versus 3 war continues, Python 3 seems to be gaining real
momentum at least as measured by the number of libraries that are now
available for Python 3 - it seems that even though some people are sticking
with Python 2 there's a groundswell of support for Python 3. After all, for
the ordinary programmer trying to get a website built, what's the point in
learning the six-year-old version? All that leads to is the question of "ok,
at what point will I learn Python 3?". I'm a beginner and a very ordinary
programmer and I find Python 3 much easier to wrap my head around than Python
2 - I dread the times I have to dig into Python 2.

Python 3 will be fine. It has momentum, it will grow, eventually Python 2 will
be so far in the past that there will be no way to look at it except in the
same way we see OS/2, Amiga and DOS - long gone. It would be much better
though if everyone was happy.

Python 4 (maybe it should be 3.6) should be the version that ends the civil
war and gives the stayers what they want somehow.

------
e12e
I don't know... I get an error from the first script with python3:

    
    
        $ ls
        test  test3.py  test.py  tøst  日本語
        $ python2.7 test.py *
        hello hellø こにちは tøst 日本語
        import sys
          # (…)
        hello hellø こにちは tøst 日本語
        hello hellø こにちは tøst 日本語
       
        $ python3 test.py *
        Traceback (most recent call last):
          File "test.py", line 13, in <module>
            shutil.copyfileobj(f, sys.stdout)
          File "/usr/lib/python3.2/shutil.py", line 68, in copyfileobj
            fdst.write(buf)
        TypeError: must be str, not bytes
        
        #But I can make it work with:
        $ diff test.py test3.py 
        8c8
        <             f = open(filename, 'rb')
        ---
        >             f = open(filename, 'r')
        
        $ python3 test3.py *
        # same as above
    

Now, these two scripts are no longer the same: the python3 script outputs
text, the python2 script outputs bytes:

    
    
        $ python3 test3.py /bin/ls
        Traceback (most recent call last):
          File "test3.py", line 13, in <module>
            shutil.copyfileobj(f, sys.stdout)
          File "/usr/lib/python3.2/shutil.py", line 65, in copyfileobj
            buf = fsrc.read(length)
          File "/usr/lib/python3.2/codecs.py", line 300, in decode
            (result, consumed) = self._buffer_decode(data, self.errors, final)
        UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 24: invalid start byte
    

The other script works like cat -- and dumps all that binary crap to the
terminal.

So, yeah, I guess things are different -- not _entirely_ sure that the python3
way is broken, though? It's probably correct to say that it doesn't work well
with the "old" unix way in which text was ascii and binary was just bytes --
but consider:

    
    
        $ cat /bin/ls |wc
        403    2565  114032
        e12e@stripe:~/tmp/python/unicodetest $ du -b /bin/ls
        114032  /bin/ls
    

Do the "wordcount" and "linecount" from wc make any sense? For that matter,
consider:

    
    
        $ cat test
        hello hellø こにちは tøst 日本語
        e12e@stripe:~/tmp/python/unicodetest $ wc test
         1  5 42 test
    

(Here the word count does make sense, but just because it's an artificial
example, it wouldn't make sense for actual Japanese).

The character count is pretty certainly wrong, unless what you cared about was
the number of bytes that "du -b" reports...

~~~
lifeisstillgood
The japanese example is interesting - because wc really rather depends on the
language. So does regex. And quite a lot of other things that are useful in a
Latin-derived world kind of get harder in a right to left inflected written
language (if there is one, some Arabic comes to mind).

I think if anything will force us to rethink the underlying assumptions of
Unix, it's Unicode.

~~~
e12e
Please note that:

    
    
       $ echo "wc can't count æøæ either" |wc
          1       5      29
       $ echo "wc can't count aaa either" |wc
          1       5      26
    

[edit: Also, note that Japanese is written both left-to-right _and_ top-to-
bottom, right-to-left]

~~~
EdiX
wc counts bytes; to make it count characters, use -m in the GNU version.

~~~
lifeisstillgood
I think the point being made is that -m does not count characters, it counts
multi-byte sequences. Or at least tries to. The same Unicode code point in
utf-8 and utf-16 (and utf-32) could be very different strings of bytes. No way
to tell unless you know beforehand that you are dealing with utf-8 or 16.
Hence the BOM, but no one likes that.

It's hard. And possibly we have to abandon tools like wc when we leave the
Latin world.
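
The point about the same code points becoming different byte sequences is easy to demonstrate from Python; a quick sketch (note that the plain "utf-16" codec prepends a BOM, while the endian-specific variants don't):

```python
s = "åäö"  # three code points

print(len(s.encode("utf-8")))      # 6 bytes: two per character
print(len(s.encode("utf-16-le")))  # 6 bytes: two per character, no BOM
print(len(s.encode("utf-16")))     # 8 bytes: a 2-byte BOM plus the six above
print(len(s.encode("utf-32-le")))  # 12 bytes: four per character
```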

------
keyme
Strings should be byte strings. Not ASCII, not Unicode. Bytes.

Strings don't represent text unless I decide they do. For that, a UnicodeString
object should exist, and it should _not_ be the default.

In my latest project I've made myself use Python 3.4 over 2.7, for its new
great features. So many steps forward, except this one thing.

What a stupid decision these default Unicode strings are...

~~~
andrewstuart
Wouldn't people be complaining if "the unicode problem" hadn't been solved in
Python rather than leaving it an undefined mess? Now it is a solved problem
even if the solution is seen as a problem by some.

------
pekk
From the one person who has complained most about this topic -- making him an
expert on complaining about Python 3, but not necessarily as much of an expert
on how to cope.

------
skizm
Bit off topic, but can anyone recommend a good tutorial/book/whatever for
python 2 programmers looking to move to (or at least become familiar with)
python 3?

~~~
maxerickson
What's New in 3.0 has lots of information:

[https://docs.python.org/3/whatsnew/3.0.html](https://docs.python.org/3/whatsnew/3.0.html)

and maybe take a look at what standard modules have moved to different name or
namespace:

[https://docs.python.org/3/py-modindex.html](https://docs.python.org/3/py-modindex.html)

------
im3w1l
>For instance it will attempt decoding from utf-8 with replacing decoding
errors with question marks.

Please don't do this. Replacing with question marks is a lossy transformation.
If you use a lossless transformation, a knowledgeable user of your program
will be able to reverse the garbling, in their head or using a tool. Consider
Ã¥Ã¤Ã¶, the result of interpreting utf-8 åäö as latin1. You could find both the
cause and the solution by googling it.
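
In Python terms, the difference between the two approaches might look like this (the latin1 decode is lossless because latin1 maps every possible byte to a code point, so the mojibake can always be encoded back and re-decoded correctly):

```python
raw = "åäö".encode("utf-8")  # b'\xc3\xa5\xc3\xa4\xc3\xb6'

# Lossy: every undecodable byte becomes U+FFFD; the originals are gone.
lossy = raw.decode("ascii", errors="replace")

# Lossless: interpreting the utf-8 bytes as latin1 gives the familiar
# garbage, but it round-trips back to the original text.
garbled = raw.decode("latin-1")   # 'Ã¥Ã¤Ã¶'
restored = garbled.encode("latin-1").decode("utf-8")
assert restored == "åäö"
```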

------
Retr0spectrum
Did anyone else find the title font hard to read?

~~~
rectangletangle
Yeah, real narrow fonts seem to have pretty poor readability.

------
jrochkind1
I have to admit I can't follow this completely -- dealing with file system
file names that are not in ascii is a very confusing thing, and one I haven't
done before -- plus I am not very familiar with python.

But I have done a lot of dealing with char encoding issues -- in ruby.

In ruby 1.9+, I find ruby's char encoding handling to be quite good. Which
does not mean it's not incredibly challenging and confusing to deal with char
encoding issues. But it means I haven't been able to come up with any better
approach than ruby 1.9+'s, or anything I wish ruby 1.9+ did differently.

The mental model is simple (relatively, for the domain anyway) -- any strings
are tagged with an encoding. If your string contains illegal bytes for the
encoding it's tagged with, it's gonna raise if you try to concatenate it or do
much anything else with it. Concatenating strings of two different encodings
is probably going to raise too (some exceptions if they are both ascii
supersets and happen to contain only ascii-valid 7-bit chars). You can easily
check if a string contains any illegal bytes; change the tagged encoding to
any encoding you like (including the 'binary' null encoding); remove bad
bytes; or trans-code from one encoding to another.

It means that you have all the tools you need to deal with char encoding
issues, but you still need to think through some complicated and confusing
issues to deal with em. It is an inherently confusing domain (which is why
it's nice that more and more of the time you can just assume UTF8 everywhere
-- but yes, I've written plenty of code that can't assume that too, or that
has to deal with bad bytes in presumed UTF8)

(The biggest frustrations can be when using gems (libraries) that themselves
aren't dealing with char encoding correctly, and then you find yourself
debugging someone else's code and trying to convince them that their code is
incorrect while they put up a fight, because it's so damn confusing. There
are still plenty of encoding-related bugs. But I'm not sure that's ruby's
fault.)

You certainly can deal with everything as a byte stream (the 'binary' null
encoding) if you want to in ruby, as far as the language is concerned,
although I don't think you actually usually want to. (and some open source
gems might not play well with that approach either)

It would be interesting to see someone who understands both ruby and python
take the OP and analogize the problem case to ruby 1.9+ and see if it's any
different.

(One important thing ruby was missing prior to 2.1 is the new String#scrub
method. It was possible to write it yourself though, which I figured out
eventually. Another thing I still wish ruby had built-in to stdlib was more of
the Unicode algorithms (sort collation, case change, etc.), although there are
gems for most of em these days, thanks open source.)

