
As always, the one true link is this:

https://nedbatchelder.com/text/unipain.html

Read, understand, and be glad that the interpreter isn't trying to "help" any more!




I just want to chime in and say that Ned Batchelder is one of the greatest human beings alive.

Or, at least, to me he is. This man helps so many people with nothing in return. He's a regular on some Python IRC channels, and he has personally helped me so much. He makes difficult concepts easy to understand. I encourage everyone to watch his PyCon talks. Start with his talk on loops: https://www.youtube.com/watch?v=EnSu9hHGq5o


There are a few outstanding people who make me love the Python ecosystem, and Nedbat is definitely on that list.

FunkyBob (Curtis Maloney) is also a staple with Django, he's spent years and years helping people and asking for nothing in return.


I'll throw in Raymond Hettinger & Jack Diederich -- excellent peeps.


Agreed. I run into Ned every 3-4 years, he always remembers me, and he's super nice and helpful.


The problem isn't that there's no implicit conversion between bytes and unicode anymore. The problem is that almost no code beyond the lowest-level interfaces handles it correctly when you need to use bytes. In this case, filepaths (which are bytes in POSIX, and no, using LOCALE is not good enough) don't work. Other examples of things that are bytes:

- Command line arguments

- Environment variables

- stdin/stdout/stderr

- Files in general

- Many expected-to-be-human-readable fields of popular network protocols

In short, there are many situations where you want to treat a bytestring as a string, not as an array of integers.
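A minimal sketch of what carrying such bytes through a Python 3 str looks like, assuming the `surrogateescape` error handler (the mechanism `os.fsdecode`/`os.fsencode` rely on under POSIX). The file name is made up for illustration.

```python
# Hypothetical POSIX file name: valid Latin-1, but not valid UTF-8.
raw_name = b"caf\xe9.txt"

# surrogateescape smuggles each undecodable byte through as a lone surrogate.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# The original bytes can be recovered exactly.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")

# The smuggled str is not real text, so a strict encode refuses it.
try:
    text_name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

The round trip is lossless, but any library that strictly encodes or prints `text_name` can still fail, which is roughly the complaint here.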

If bytes in python 3 had acted like str in python 2 (except for the implicit conversions / comparisons with unicode strings), the situation would be a lot better. As it is, they feel like a second-class citizen designed to discourage use, and as a result are unsupported in most libraries that NEED to support them.

(edited for formatting)


Python 3 making 'bytes' an array of integers was a minor mistake, IMHO. I.e. b'abc'[0] should be b'a', not 97. That change made it harder to port code and also makes the bytes() object a bit more unhandy to use. Much too late to change that now.
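For reference, a short Python 3 sketch of the indexing behaviour in question (variable names are illustrative only):

```python
data = b"abc"

first_as_int = data[0]      # indexing yields an int, not a length-1 bytes
first_as_bytes = data[0:1]  # slicing is the way to get a length-1 bytes
items = list(data)          # iteration also yields ints
```

Code ported from Python 2 that did `data[0] == "a"` has to become `data[0:1] == b"a"` or `data[0] == ord("a")`.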

However, as someone who has done a mixture of low level (e.g. system tools), high level (e.g. web apps) and network protocol programming, the Python 3 bytes/str model works well. If you really want to treat an 8-bit byte string as a string, you can always decode as "latin1". In my modern Python 3 code, I don't find a good reason to ever do that anymore.
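A quick sketch of why the "latin1" trick works: Latin-1 maps every byte value 0-255 to the code point with the same number, so the round trip never fails and never loses data. The header line is an invented example.

```python
# Every possible byte value survives a latin-1 round trip.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
back = as_text.encode("latin-1")

# Hypothetical protocol header, handled with ordinary str methods.
header = b"Content-Type: text/html".decode("latin-1")
key, _, value = header.partition(": ")
```

The resulting str is only a byte container: non-ASCII characters in it are not necessarily the text the sender meant.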


> Read, understand, and be glad that the interpreter isn't trying to "help" any more!

As someone who thinks the Python 3 behaviour is largely the right behaviour (I'm unconvinced the solution used for filenames on POSIX is the right one, nor that assuming stdin/out isn't arbitrary bytes is the right choice), I still have a lot of issues with the Python 2 -> 3 migration. (Note I haven't read the article because it won't load here, nor on archive.is.)

As someone who has dealt with a fair number of codebases migrating over the past decade, I would like to have seen a clearer migration path. The route taken basically asked developers to go from:

    def foo(x):
      return x == b"a"
    print foo(u"a")

The fact this went from printing True (Python 2) to False (Python 3) without there ever being any way to know your codebase was doing this, unless you had tests for all such codepaths, meant it was hard to have confidence behaviour was maintained after porting (and I've worked with enough projects that have used Python for scripting without extensive testing of the scripts, often because they're largely doing I/O).

If Python 2.6/7 had a mode like -b (which warns when bytearray and unicode are compared) that warned when str/unicode are compared, that would already have been a big improvement for the migration path. As it is, people have written tools that do this (unicode-nazi), but then you quickly run into the fact that the Python 2 stdlib does this all the time, making it hard to just try and resolve such comparisons within a Python 2 codebase. (Note Python 3's -b does warn for bytes/str!)
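A small sketch of the Python 3 side of this, running a Python 3 spelling of the `foo` example in a child interpreter with no flag, with `-b`, and with `-bb`. The `run` helper and `SNIPPET` name are illustrative only.

```python
import subprocess
import sys

# Python 3 spelling of the foo() example (print is a function here).
SNIPPET = 'def foo(x):\n    return x == b"a"\nprint(foo(u"a"))\n'

def run(*flags):
    """Run SNIPPET in a fresh interpreter with the given flags."""
    return subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
    )

plain = run()        # silently prints False
warned = run("-b")   # prints False, plus a BytesWarning on stderr
strict = run("-bb")  # the BytesWarning becomes an exception
```

Running a test suite under `-bb` is one way to surface these comparisons after porting, though only on the code paths the tests actually reach.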

Now, at the same time as the behaviour of u"a" == b"a" changed, Python also changed the return type of (e.g.) os.listdir(). This means if you want to compare a different list loaded from elsewhere, you need to have that list in different types depending on whether you're running on Python 2 or Python 3. In a dynamically typed language, it's hard to make all these changes with confidence that you're actually fixing everywhere.
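To illustrate on Python 3: `os.listdir()` returns str for a str path and bytes for a bytes path, and the two results never compare equal. A minimal sketch using a throwaway temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so the listing is not empty.
    with open(os.path.join(tmp, "a.txt"), "w"):
        pass

    as_text = os.listdir(tmp)                # str path in, str names out
    as_bytes = os.listdir(os.fsencode(tmp))  # bytes path in, bytes names out

# Same directory, but the two listings are not equal.
same = as_text == as_bytes
```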


> If Python 2.6/7 had a mode like -b...

Agreed. This was a major mistake in the migration story for Py3. They bet on the static translation approach of 2to3, which is just inappropriate for a dynamic language like Python. Better to have doubled down on Python's dynamism by adding modes to the interpreter to suss out the code that wouldn't work on Py3.


Nah, even with tooling the sloppy-2-to-tighter-3 migration was always going to be a semi-manual job. The right thing to do was embrace that transition: do it early, once, and never touch it again. Get everyone developing their code solely in Python 3, and provide fully automated 3-to-2 conversion for those who still need to deploy on Python 2. (Ideally as part of the module packaging, so that all Python packages automatically support 2 and 3.)

Instead of which, we got a lot of “write-once, run-everywhere” nonsense, with everyone vying to bend their code out of shape in the most creatively unproductive ways possible. Absolutely ridiculous makework, and the Python community should’ve called itself on it. Unfortunately, the geeks love a challenge, far more than being told when they’re having a brain fart. Oh well, at least that whole shambles has finally just about run its course; here’s hoping its lessons are learned for Python 4. :)


If I had to port a code base from 2 to 3 today, I think the first thing I would do is add type annotations to the code base via typing/mypy and go from there. A fully typed Python 2 code base shouldn't be too difficult to port with the help of mypy and proper editor support.
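A rough sketch of the idea, written in Python 3 annotation syntax for brevity (PEP 484 also defines comment-style annotations for Python 2.7 code). It reuses the `foo` example from upthread; the variable names are illustrative.

```python
def foo(x: bytes) -> bool:
    return x == b"a"

ok = foo(b"a")

# At runtime this call just returns False without any error. A static
# checker such as mypy should report it as an incompatible argument
# type (str passed where bytes is expected), which is the point.
silently_wrong = foo(u"a")
```

This only helps where the annotations are accurate, so untyped or `Any`-typed call sites would still slip through.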


Someone in the past must have invented a wide encoding unlike the popular transition format. I wonder if using that for some use cases would have been less pain because it would remove the whack-a-mole symptom: nothing would work at all until you did all the work to consume/transcode as appropriate.

I suppose it's moot at this point.


The problem, which is always the problem, is that like 12 someones invented different wide encodings at many times in the past.

The result is the chaos of the present day.


I don't think today's chaos is related to other wide encodings (those are probably very rarely used). Today's chaos is like Batchelder describes, but I'm suggesting that some of that is due to the ambiguity of the encoding: is this data I'm consuming iso-8859-x or is it utf-8? It's this ambiguity that contributes to the whack-a-mole (and this is a big part of the chaos IMO).

That said, would anyone have been interested in a totally new encoding? For European languages which use mostly the same 26 latin characters with occasional diacritics and accents, UTF-8-with-incompatible-consumer degrades into occasional unreadable characters. But if your out-of-date browser or application gave you a "cannot decode this encoding" error, that might have caused a whole lot of pain during that transition. Not to mention that some of the same issues with OS/filesystem/language library interaction would probably remain.
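A small sketch of both failure modes, using "café" as an arbitrary example: UTF-8 bytes read as Latin-1 degrade into a couple of wrong characters, while Latin-1 bytes read as UTF-8 fail outright.

```python
text = "café"
utf8_bytes = text.encode("utf-8")      # the é becomes two bytes: C3 A9
latin1_bytes = text.encode("latin-1")  # the é becomes one byte: E9

# Consumer wrongly assumes Latin-1: mostly readable, one garbled character.
misread = utf8_bytes.decode("latin-1")

# Consumer wrongly assumes UTF-8: a hard error instead of garbage.
try:
    latin1_bytes.decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False
```

Nothing in the bytes themselves says which decoding was intended, which is the ambiguity described above.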



