
The simple reason is that there was no compelling feature to reward you for upgrading. You'd spend a tremendous amount of effort for dubious return and (until recently) a smaller ecosystem.

1. Unicode support was actually an anti-feature for most existing code. If you're writing a simple script you prefer 'garbage-in, garbage-out' unicode rather than scattering casts everywhere to watch it randomly explode when an invalid byte sneaks in. If you did have a big user-facing application that cared about unicode, then the conversion was incredibly painful for you because you were a real user of the old style.

2. Minor nice-to-haves like print-function, float division, and lazy ranges just hide landmines in the conversion while providing minimal benefit.
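To make those landmines concrete, here is a minimal sketch of how the three changes behave on Python 3 (the variable names are only for illustration). Code that relied on the Python 2 behavior of `/`, or on `range` returning a list, can keep running after a port and quietly compute something different.

```python
# Python 3 semantics for the three "nice-to-haves".

# 1. "/" is true division; "//" is the old integer (floor) division.
true_div = 7 / 2      # 3.5 on Python 3 (was 3 on Python 2)
floor_div = 7 // 2    # 3 on both

# 2. range() is a lazy sequence, not a list.
r = range(3)
is_list = isinstance(r, list)   # False on Python 3
as_list = list(r)               # materialize explicitly when a list is needed

# 3. print is a function, so it takes keyword arguments.
print("a", "b", sep="-")
```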

In the latest py3 versions we've finally gotten some sugar to tempt people over: asyncio, f-strings, dataclasses, and type annotations. Still not exactly compelling, but at least something to encourage the average Joe to put in all the effort.

> Unicode support was actually an anti-feature for most existing code. If you're writing a simple script you prefer 'garbage-in, garbage-out' unicode rather than scattering casts everywhere to watch it randomly explode when an invalid byte sneaks in. If you did have a big user-facing application that cared about unicode, then the conversion was incredibly painful for you because you were a real user of the old style.

Actually that's the behavior of python 2, it works fine, until you send invalid characters then it blows up.

In python 3 it always blows up when you mix bytes with text so you can catch the issue early on.
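A tiny sketch of what "always blows up" means here: on Python 3 the mix fails on type alone, even when the bytes happen to be plain ASCII, so the mistake shows up the first time the line runs. The variable names are made up for the example.

```python
# Mixing str and bytes fails regardless of the byte values involved.
try:
    "id: " + b"abc"          # pure ASCII bytes, still rejected
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

# The explicit version states which encoding is meant.
explicit = "id: " + b"abc".decode("ascii")
```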

> In the latest py3 versions we've finally gotten some sugar to tempt people over: asyncio, f-strings, dataclasses, and type annotations. Still not exactly compelling, but at least something to encourage the average Joe to put in all the effort.

That's because until 2015 all python 2.7 features were from python 3. Python 2.7 was basically python 3 without the incompatible changes. After they stopped backporting features in 2015, python 3 suddenly started looking more attractive.

> Actually that's the behavior of python 2, it works fine, until you send invalid characters then it blows up.

> In python 3 it always blows up when you mix bytes with text so you can catch the issue early on.

Sometimes you don't care about weird characters being printed as weird things. In python 2 it works fine: you receive garbage, you pass garbage. In python 3 it shuts down your application with a backtrace.

Dealing with this was one of my first Python experiences and it was very frustrating, because I realized that simply using #!/usr/bin/python2 would solve my problem but people wanted python3 just because it was fancier. So we played a lot of whack-a-mole to make it not explode regardless of the input. And the documentation was particularly horrible regarding that, not even the experienced pythoners knew how to deal with it properly.

Those issues are common when you have python 2 code that uses the unicode datatype and you are tasked with migrating it to python 3.

You run your python 2 code on python 3 and it fails, and most people at that point will put encode() or decode() wherever the failure shows up, when the correct fix would be to place encode/decode at the I/O boundary (writing to files, network, etc.; in python 3 even the file case is not needed if you open files in text mode).

Ironically a python 2 code that doesn't use unicode is easier to port.

When you program in python 3 from the start it's very rare to need to encode/decode strings. You only do that if you are working at the I/O level.

> And the documentation was particularly horrible regarding that, not even the experienced pythoners knew how to deal with it properly.

Because it's not really python specific knowledge. It's really about understanding what the unicode is, what bytes are, and when to use each.

The general practice is to keep everything you do as text, and do the conversion only when doing I/O. You should think of unicode/text as a representation of text, the way you think of a picture or a sound. Like images and audio, text can be encoded as bytes. Once it is bytes it can be transmitted over the network, written to a file, etc. When you read the data back, you need to decode it into text again.

This is what Python 3 is doing:

- by default every string is of type str, which is unicode
- bytes are meant for binary data
- you can open files in text or binary mode; if you open in text mode the encoding happens for you
- socket communication: here you do need to convert strings to bytes and back
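A minimal sketch of that "decode on the way in, encode on the way out" practice, assuming UTF-8 and using in-memory streams in place of a real socket or binary file. The helper names `read_text` and `write_text` are hypothetical, not part of any library.

```python
import io

def read_text(raw_stream, encoding="utf-8"):
    """Input boundary: bytes come in, text goes to the rest of the program."""
    return raw_stream.read().decode(encoding)

def write_text(raw_stream, text, encoding="utf-8"):
    """Output boundary: text is turned back into bytes only here."""
    raw_stream.write(text.encode(encoding))

# BytesIO stands in for a socket or a file opened in binary mode.
incoming = io.BytesIO("café\n".encode("utf-8"))
text = read_text(incoming)      # str from here on
shouted = text.upper()          # all processing happens on text

outgoing = io.BytesIO()
write_text(outgoing, shouted)
```

Everything between the two helpers only ever sees str, which is why encode() and decode() calls in the middle of a program are usually a sign that a boundary was missed somewhere else.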

Python 2 is a tire fire in this area:

- text is bytes
- text also can be unicode (so two ways to represent the same thing)
- binary data can also be text
- I/O accepts text/bytes, no conversion happening
- a lot (most? all?) stdlib is actually expecting string/bytes as input and output
- cherry on top is that python2 also implicitly converts between unicode and string so you can do crazy thing like my_string.encode().encode() or my_string.decode()

So now you get python 2 code where someone wanted to be correct (which is actually quite hard, mainly because of the implicit conversion), so the existing code has plenty of encode() and decode() calls because some functions expect str and some expect unicode.

At different functions you might then have bytes or unicode as a string.

Now you take such code and try to move it to python 3, which no longer has implicit conversion and will throw an error when it expected text and got bytes and vice versa. The str now is unicode, unicode type no longer exists and bytes is now not the same thing as str. So your code now blows up.

Most people see an error so they add encode() or decode(), often trying whichever one works (like what you were removing), when the proper fix would actually be removing encode() and decode() calls in other places of the code.

It's quite a difficult task when your code base is big, which is why Guido put a lot of effort into type annotations and mypy. One of their benefits is supposed to be helping with these issues.

The worst part about Unicode in Python 2 isn't even that everything defaults to bytes. It's that the language will "helpfully" implicitly convert bytes/str, using the default encoding that literally makes no sense in practically any context - it's not the locale encoding. It's ASCII!

Native English speakers are usually the ones blissfully unaware of it, because it just happens to cover all their usual inputs. But as soon as you have so much as an umlaut, surprise! And there are plenty of ways to end up with a Unicode string floating around even in Python 2 - JSON, for example. And then it ends up in some place like a+b, and you get an implicit conversion.
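Since Python 2 itself is hard to run these days, here is a rough Python 3 sketch that spells out what the implicit conversion amounted to for a+b with one byte string and one unicode string. The function name `py2_style_concat` is made up for illustration; the explicit `.decode("ascii")` stands in for what Python 2 did behind the scenes.

```python
# Python 3 code that makes Python 2's hidden step visible.
def py2_style_concat(byte_str, unicode_str):
    # Python 2 effectively decoded the byte string with the ASCII default.
    return byte_str.decode("ascii") + unicode_str

# Pure ASCII input: works, which is why the problem stays hidden for so long.
ascii_ok = py2_style_concat(b"cafe", u" au lait")

# One accented character in the bytes and the same line raises.
try:
    py2_style_concat("café".encode("utf-8"), u" au lait")
    umlaut_result = "no error"
except UnicodeDecodeError:
    umlaut_result = "UnicodeDecodeError"
```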

I've been struggling with this recently when trying to print stdout from subprocess.communicate with code that runs on both 2 and 3. Such a headache - got any recommended reading around this area?

I don't think this is exactly what you're asking but a good starting point:


With 2 vs 3 code it is easiest to write your code for python 3 and then in 2 import everything you can from the __future__ package, including unicode_literals. That's still not enough and you still might need to do extra work. In python 3 there's an encoding argument, which could do the encoding for you, but it doesn't look like it is available in python 2. So you probably shouldn't use it, and should treat all input/output as bytes (i.e. call encode() when sending data to stdin, and decode() on what you get back from stdout and stderr).
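Something along these lines might work on both 2 and 3. It is only a sketch: the helper name `run_and_decode` is hypothetical, UTF-8 is an assumption about what the child process emits, and the child command is just a stand-in that upper-cases its stdin.

```python
import subprocess
import sys

def run_and_decode(argv, input_text=None, encoding="utf-8"):
    """Run a command, talking bytes to the pipes and text to the caller."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Encode on the way in ...
    stdin_bytes = input_text.encode(encoding) if input_text is not None else None
    out, err = proc.communicate(stdin_bytes)
    # ... and decode on the way out; "replace" keeps odd bytes from raising.
    return out.decode(encoding, "replace"), err.decode(encoding, "replace")

# Stand-in child process: echoes stdin back in upper case.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
stdout_text, stderr_text = run_and_decode([sys.executable, "-c", child_code], u"hello")
```

The pipes only ever carry bytes, so the same code path runs on both interpreters and the encoding choice lives in one place.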

Perhaps that might be enough for your case, although many things are hard to get right in python 2 even when you know what you should do, because of the implicit conversion.

Edit: this also might be useful: https://unicodebook.readthedocs.io/good_practices.html

Also this could help: https://unicodebook.readthedocs.io/programming_languages.htm...

> In python 3 it always blows up when you mix bytes with text so you can catch the issue early on.

This is definitely the case. I've been wrestling with bytes and strings all the time during the port of a Django application to Python 3 for a customer. I can see myself encoding and decoding response bodies and JSON for the time being. For reasons I didn't investigate I don't have to do that with projects in Ruby and Elixir. It seems everything is a string there and yet they work.

I’ve worked in a variety of Django codebases, and the last time I had trouble with string encoding/decoding was with Python 2. Since moving to Python 3, I have rarely needed to manually encode or decode, and I genuinely can't remember the last time I did.

Perhaps there’s something about a port that requires encoding/decoding bytes/strings?

The encoding/decoding is heavy in codebases that have to run on Python 2 and Python 3 at the same time, and authors are worried about handling unicode correctly on python 2.

Ironically when your python 2 app doesn't care about unicode, the porting to python 3 is actually much easier.

You don't have to do these things in python 3 either. Your problem was that you had python 2 code that was already broken and you started adding encode/decode to fix it, typically making the problem worse.

If you write code in python 3 from the start you rarely need to use encode() and decode(). Typically what you always want is a text not bytes.

Exception to it might be places where you want to serialize like IO (network or files, although even files are converted on the fly unless you open file in a binary mode).

The problem is external APIs returning whatever they want no matter what they should return. The world is messy.

Example, I just had to write this

  return (
      urllib.request.urlopen(url, timeout=60)
      .read()
      .decode("utf-8", errors="backslashreplace")
  )

(roughly; you'll forgive me if the formatting is off) Then I use that string in a regexp, etc.

This is the only language where I have to explicitly deal with encodings at such low level. I don't feel like I want to use it for my pet projects.

Why is that bad? The result returned from a URL is always binary. In certain situations it could be text but it doesn't have to be. If the result was an image you would convert the data to an image, and the same goes for a sound file; you should think of text as just another distinct type.

Of course urllib could have method text() that would do such conversion, but then urllib is not requests. It never was user friendly.
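For what it's worth, a text() style helper is only a few lines. This is a sketch with a hypothetical name (`body_to_text`), and it works on raw bytes so it runs without a network; with urllib the charset could presumably come from `response.headers.get_content_charset()`.

```python
def body_to_text(body, charset=None):
    """Decode an HTTP response body; `charset` would come from Content-Type."""
    # Fall back to UTF-8 and escape undecodable bytes instead of raising.
    return body.decode(charset or "utf-8", errors="backslashreplace")

good = body_to_text("café".encode("utf-8"))
# Latin-1 bytes from a server that claims (or implies) UTF-8:
messy = body_to_text(b"caf\xe9")
```

The messy case comes back as text with the bad byte shown as an escape, so a later regexp still runs and the original byte value is still visible in the string.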

Edit: personally I use aiohttp, the interface is much nicer: https://aiohttp.readthedocs.io/en/stable/client_reference.ht... If I can't use asyncio then I would use requests.

> Actually that's the behavior of python 2, it works fine, until you send invalid characters then it blows up.

Not that I've seen.

Example of where Python 3 has rained shit on my parade: I wrote a program that backs up files for Linux. It works fine in python 2, but in python 3 you rapidly learn you must treat filenames as bytes otherwise your backup program blows up on valid Linux filenames. It's not just decoding errors, it's worse: because Unicode doesn't have a unique encoding for each string, the round trip (binary -> string -> binary) is not guaranteed to get you the same binary. If you make the mistake of using that route (which Python3 does by default) then one day Python3 will tell you it can't open a file you os.listdir()'d microseconds ago and can clearly see is still there.

Later, you get some sort of error when handling one of those filenames, so you sys.stderr.write('%s: this file has an error' % (filename,)). That worked in python2 just fine, but in python3 it generates crappy looking error messages even for good filenames. You can't try to decode the filename to a string because it might generate a coding error. This works: sys.stderr.buffer.write(b'%b: this file has an error' % (filename,)), but then you find you've inserted other strings into error messages and soon the only "sane" thing to do is to convert every string in your program to bytes. Other solutions like sys.stderr.write('%s: this file has an error' % (filename.decode(errors='ignore'),)) work, but they corrupt the filename the user sees, are verbose, and worst of all, if you forget one it isn't caught by unit tests but will still cause your program to blow up in rare instances.

I realise that for people who live in a land of clearly delineated text and binary, such as the django user posting here, these issues never arise and the clear delineation between text and bytes is a bonus. But people who use python2 as a better bash scripting language than bash don't live in that world. For them python2 was a better scripting language than bash, but it is being deprecated in favour of python3, which is actually more fragile than bash for their use case. (That's a pretty impressive "accomplishment".) Perhaps they will go back to Perl or something, because as it stands Python3 isn't a good replacement.

>For them python2 was a better scripting language than bash

This! IMO Python 2 has better usability for prototyping and thinking and doing things on the fly. Python 3 also often seems to have deprecated the functions I want to use in favor of those that are more cumbersome and take more keystrokes. More explicit sure, but less fluid.

Filenames need to be treated as binary because of bad designs decades ago. Rust handles this correctly imho, by having a separate type for such strings, OsStr.

Python has pathlib nowadays. But I'm not sure whether that stores them as raw bytes or Unicode internally - the API provides for either.

Rust had the luxury to learn from mistakes of others :)

When python was created the Unicode didn't even exist.

Anyway in python 3, many os functions accept both str and bytes, and behave differently depending on which one they get. For example os.walk, if you pass the path as a byte string, will output paths as bytes.
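A small sketch of that str-in/str-out, bytes-in/bytes-out behavior, using a throwaway temp directory and a made-up file name so it is safe to run anywhere:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file to list.
    with open(os.path.join(tmp, "report.txt"), "w"):
        pass

    names_as_text = os.listdir(tmp)                 # str path in, str names out
    names_as_bytes = os.listdir(os.fsencode(tmp))   # bytes path in, bytes names out

    # os.walk follows the same rule for everything it yields.
    walked = [files for _root, _dirs, files in os.walk(os.fsencode(tmp))]
```

A backup-style tool that never needs to display names can pass bytes paths throughout and skip decoding entirely; os.fsencode() and os.fsdecode() are the stdlib helpers for crossing between the two forms when it does.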

> Actually that's the behavior of python 2, it works fine, until you send invalid characters then it blows up.

Not always. As far as I can tell writing garbage bytes to various APIs works fine unless they explicitly try to handle encoding issues. First time I noticed encoding issues in my code was when writing an xml structure failed on windows, all because of an umlaut in an error message I couldn't care less about. The solution was to simply kill any non ascii character in the string, not a nice or clean solution but the issue wasn't worth more effort.

> In python 3 it always blows up when you mix bytes with text so you can catch the issue early on.

That is nice if your job involves dealing with unicode issues. My job doesn't, any time I have to deal with it despite that is time wasted.

So you don't have to deal with it until user data includes _any non-ascii character_ (including emoji, weird spaces copied from other stuff, or loan words like café)

"Dealing with unicode" is really just about dealing with it at the input/output boundaries (and even then libraries handle it most of the time). But without the clear delineation that Python 3 provides, when you _do_ hit some issue you probably insert a "fix" in the wrong space. Leading to the classic Py2 "I just call decode 1000 times on the same string because I've lost track"

> So you don't have to deal with it until user data includes _any non-ascii character_ (including emoji, weird spaces copied from other stuff, or loan words like café)

Interesting text follows company set naming schemes, which means all english and ascii. The rest could be random bytes for all I have to care about. Many formats like plain text or zip don't have a fixed encoding and I am not going to start guessing which one it is for every file i have to read, there is no way to do that correctly. Dealing with that mess is explicitly something I want to avoid.

What kind of text do you have to process at your job, that you never meet any unicode in it? Nowadays unicode is everywhere, especially with emojis. Even a simple IRC bot needs to handle that.

A lot of scientific/numeric work (up until quite recently, it's slowly, slowly changing) involves text processing of inputs and outputs of other programs, using Python as the glue language.

This is a lot of old code, and it's all ASCII, no matter what the locale of the system is. And even if the code was updated, all the messages would still be in some text == bytes encoding, because there's no "user data" involved, and the throughput desired is in many gigabytes of text processed per second.

So yeah, unicode is not "everywhere": it may be everywhere on the public internet, but there is a world beyond this.

I deal with file formats that, like plain text files and zip, do not specify an encoding and have different encodings depending on where they come from. I think the generic approach is to guess, which means trying encodings until one successfully converts unknown input to garbage unicode, resulting in output that is both wrong and different from the original input. Most of the time I can just treat the text contents as a byte array, with a few exceptions that are specified to use ascii compatible names.

So you can throw in your emoji and they might not correctly show up on the xml logging metadata I write, because I don't care. But they will end up in the processed file the same way they came in instead of <?> or some random Chinese or Japanese symbol that the guessing algorithm thought appropriate.

In that case you should be opening files in binary mode "b", then you will be operating in bytes.

Also, there's no guessing happening in this instance. The locale configured in your environment variables is used if you open files in text mode.
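Roughly like this, as a sketch with made-up payload bytes: in binary mode nothing is decoded, so whatever came in goes out unchanged, and you can ask Python which encoding text mode would have picked from the locale.

```python
import locale
import os
import tempfile

# Deliberately not valid UTF-8; in binary mode that does not matter.
payload = b"name=caf\xe9\x00\xff\n"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:      # "b": write raw bytes
        f.write(payload)
    with open(path, "rb") as f:      # "b": read raw bytes back
        round_tripped = f.read()
finally:
    os.remove(path)

# The encoding that open() in text mode would normally default to.
default_text_encoding = locale.getpreferredencoding(False)
```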

Doesn't always blow up. Notably b"key" and "key" are now distinct dictionary keys, and both can coexist in the same dict. Is the absence of an optional key a fatal error? No, the program runs, and just does the wrong thing, or fails to copy the right value to the next stage, or whatever. Fun to debug.
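A quick sketch of that failure mode, with a made-up headers dict: the lookup with the wrong key type raises nothing, it simply misses.

```python
headers = {"content-type": "text/plain"}

as_text = headers.get("content-type")      # found
as_bytes = headers.get(b"content-type")    # silent miss, returns None

# Both keys can live side by side in the same dict.
headers[b"content-type"] = b"application/json"
key_count = len(headers)
```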

To get b'key' and 'key' in a dictionary in python 3 you really need to try hard.

The only reasonable scenario I can think of is when you are porting python 2 code to python 3 and play with .decode() and .encode().

>Actually that's the behavior of python 2, it works fine, until you send invalid characters then it blows up.

We're talking about simple scripts, the solution is to not send in invalid characters.

even in very simple scripts you don't get invalid characters until you actually get them.

Solid take. I'd add that performance was worse for a number of releases, and there were significant warts and incompatibilities in versions before 3.4.

Personally, asyncio and type annotations are a big turnoff. I know this is a bit contrarian, but I've always favored the greenlet/gevent approach to doing cooperative multi-tasking. Asyncio (née twisted) had a large number of detractors, but now that the red/blue approach has been blessed, it seems like many are just swallowing their bile and using it.

Type annotations really chafe because they seem so unpythonic. I like using python for its dynamicity, and for the clean, simple code. Type annotations feel like an alien invader, and make code much more tedious to try and read. If I want static typing, I'll use a statically typed language.

Another problem with python’s type annotations is that false negatives are common in partially type annotated code bases: i.e. an annotation which is untrue, but for which there are no supporting calls/usages causing the type checker to reject it. This is pretty pathological in my experience: it means that annotations have the semantic status of comments (i.e. might be true, might not, who knows) while being given the syntactic status of “real code”.
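A small sketch of the kind of false negative I mean. The function names and the settings dict are invented for the example; as far as I know mypy with default settings accepts this, because the unannotated helper's return value is treated as Any, and Python itself never checks annotations at runtime.

```python
def load_setting(name):
    # Unannotated helper: a checker treats its return value as Any.
    settings = {"port": "8080"}
    return settings[name]

def get_port() -> int:
    # The annotation claims int, but nothing here forces a checker to disagree.
    return load_setting("port")

port = get_port()
actual_type = type(port).__name__   # what really came back at runtime
```

The annotation reads like a guarantee, yet the value is a str, and the mismatch only surfaces wherever the caller finally does arithmetic on it.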

I’m writing Elixir code currently and find the red/blue approach in JavaScript a pain. Never used asyncio beyond trying a few "hello world" and it was just baffling. In Rust async seems not terrible with the newer syntax, typing, and of course, huge speed improvement making it worthwhile. But in a dynamic VM? Just a pain. Julia’s approach with "tasklets" seems intriguing as well.

I and many others are totally with you when it comes to asyncio vs. gevent.

They really should have used the breaking nature of v3 to drop features that prevented good JIT implementations or speedups in cpython.

I am flabbergasted every time I see a software project eschew backwards-compatibility.

No one wants to spend energy re-programming to stay in place.

Especially APIs.

Yes python 3 was clearly a mistake. There could have been less hostile ways to make improvements in the language.

Probably the mistake was not dropping support much sooner.

Python 3 came out in 2008, so say no backported features after 2009 and no bug fixes after 2012. All announced in 2008 of course.

Given 4 years to migrate most would have made the jump sooner.

Dropping support is about as user-hostile as it gets.

Once again, how can you ask/require users to expend precious limited energy to re-program just to stay in place? It's totally obnoxious and completely unnecessary.

It's absolutely amazing to me that people can pay nothing for something and frame the party providing that software for free opting not to spend even more effort to provide bug fixes for an old version of their software as them effectively taxing them. This is especially true of python 2, which was supported from the release of python 3 in 2008 until 2020 and is further supported by red hat until 2024.

This is exactly backwards from reality. It's as if they were eating at someone's home and had turned a cup of coffee into a week long stay during which they rudely complained when the host asked them to please do something about their pile of dishes, trash, laundry, and leavings.

Nobody is after all taking away your version of python 2 or ability to use and maintain it. It takes active effort to keep fixing bugs in software that may be network facing. If you want to do that maintenance you can of course, but people, it seems, aren't going to be doing this for python 2 forever. If you disagree, either take up the reins or pool your funds to pay someone to do it.

The thing to do back in 2008 was to figure out when you wanted to switch and schedule a bit of time to learn python 3. Anyone who did this by oh 2009 or 2010 would have virtually no work to do now. Any work that has been created since based on something you were told 11 years ago was going away is most assuredly work that you have created for yourself and will be obliged to take up.

Anyone who did this in 2014 would have a decade of runway before they can no longer run their python 2 apps on rhel/centos. Anyone who switches TODAY 11 years late to the party can run python + redhat for another 4 years.

>completely unnecessary

It would be more work to do otherwise. Nobody wants to do that work. You don't and they don't.

Two of the worst responses ever:

1. it's free so it's OK to be user-hostile

2. if you don't like the direction, just fork/fix it

If you don't pay anything the project gets to decide to what extent serving your interests is a worthy goal.

You don't have to fork it to fix it personally. You may also consider putting your money where your mouth is and organizing an effort to fund the change you want to see in the world. If you succeed the world will have additional value it wouldn't without it and owe you kudos. Everyone likes options. If you fail you ought to move on you have no basis for complaint. I think this is informative.

Open Source is Not About You


I'm amazed at the lengths people go to justify user-hostility.

Do you regularly go to restaurants you can't afford and declare their desire not to make you a plate hostile?

From where do you derive the requirement to graciously work for free to serve your ends?

“ Nobody is after all taking away your version of python 2 or ability to use and maintain it” No, but they are taking away the right of anyone who does this to call the result “python”, and that is user-hostile.

Why do you believe you have a right to call such a work python? Trademark when not abused is the one form of intellectual property that is trivially defensible.

If anyone can call anything anything then how is it even possible for the consumer to make intelligent choices? Having it be called something else allows your consumers to make an informed choice about using it rather than allowing you to incorrectly trade on the official project's reputation. Of course YOU might merely want someone to competently maintain python 2.

Others might opt to do so badly and thus damage the actual python brand. Worse yet, others might opt to make changes to projects that serve their nefarious needs like folding in ads or data collection. Without a defining line between official and unofficial how do we prevent such?

Call it cobra and brand it python's cooler cousin if you like.

I know it's simple, but it wasn't until I learned about f-strings that I actually switched for good.

I thought the reason was because Py2 was still getting new features too for some time. I’ve only just started learning and using Python so it isn’t my world.

asyncio is actually really nice and with ThreadPoolExecutor / ProcessPoolExecutor it fit a lot of use cases I had hacked together things for in Python2. That alone was worth it to me.
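The pattern I mean looks roughly like this (Python 3.7+). It is a sketch: `blocking_fetch` is a stand-in for whatever blocking call you have, such as a legacy client library or file I/O.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_fetch(n):
    # Stand-in for a blocking call; sleeps briefly, then returns a result.
    time.sleep(0.05)
    return n * n

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Hand each blocking call to the pool and await them all together.
        futures = [loop.run_in_executor(pool, blocking_fetch, n) for n in range(4)]
        return await asyncio.gather(*futures)

results = asyncio.run(main())
```

Swapping in ProcessPoolExecutor is the same shape for CPU-bound work, as long as the function and its arguments can be pickled.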

i like the condescending bit at the end of your post. python 3 is for average joe’s.

Again with the 'Tremendous amount of effort' meme. I've done many ports and they were all trivial:

    - run 2to3
    - spend 2h max fixing any failing tests
    - cook off any remaining issues in a few days of beta testing like you'd do for any new release

Now, no doubt Python 2.7 is an excellent and solid release and will remain so for as long as anyone keeps the bitrot in check, but to keep using it because porting is 'hard' is patent bs.

It's not so much that it's "hard", but that it's time consuming when you have hundreds or even thousands of python scripts to port -- and since those scripts already work and you probably weren't going to have to touch them at all, you're not really gaining anything for all of that porting effort.

Maybe whoever wrote them should have stopped writing new ones by 2009, a decade ago.

Then you wouldn't have much to port.

If you'd been writing python a decade ago, you'd know why people couldn't transition immediately to Python 3 even if they wanted to. I no longer work for the company that has hundreds of Python scripts left to migrate, but I don't think all of the libraries needed (including some API libraries from vendors) were ported to python3 until a few years ago.

Behold the tremendous amount of effort for Mercurial:


They've been porting hg into Python 3 for the last 10 years and are only now nearing completion.

I've written a bit more about this in Lobsters:


Honest question, how can it possibly take 10 years to port hg to Python 3? If I am to believe the Wikipedia source for the first release of Mercurial[0], it would've been only 4 years old at the start of the 10 year porting process. How on earth does it take 10 years to port 4 year old software?

Even taking into account the fact that new features were still being added and not all focus was on porting, this doesn't really seem like a reasonable representation of what's going on; I have a suspicion that "10 years" of porting here does not entail nearly as much work as it seems.

[0] https://lkml.org/lkml/2005/4/20/45

Please follow the links to answer your questions. That should help.

Yes of course there will be exceptions. But the vast majority of Python code bases are not mercurial or dropbox or imgur. Just like the vast majority of software using companies are not google or facebook.

The average few hundred to few thousand loc app, which should be 98% of all production code-bases will almost certainly port with no issue.

Maybe now. When python3 came out, anything that touched the filesystem was a hideous mess to port. Let's say you have a simple script that takes a file name as an argument, reads the file, prints some message to stdout (which includes the name of the file), and creates a new output file whose name is based on the name of the first file.

In python2, that's trivial. Whatever system you're on would normally be configured so that filename bytes dumped to the terminal would be displayed correctly, so you could just treat the strings as bytes and it would be fine.

In python3, it was a nightmare. No, you could not just decode from/encode to UTF-8, even if that was what your system used! Python had its own idea of what the encoding of the terminal was, and if you used the wrong one, it wouldn't let you print. And if you tried to convert from UTF-8 to whatever it thought the terminal was using, it would also break, because not all characters were representable. And your script could just not tell Python to treat the terminal as UTF-8, either; you had to start digging into locale settings, and if you tried to fix those, then _everything else_ would break, and nobody had any idea what the right thing to do was, because you were using an obscure OS (the latest macOS at the time).
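One workaround for the "it wouldn't let you print" part is to wrap the output stream with an error handler that escapes what it cannot encode instead of raising. This is a sketch: `tolerant_writer` is a made-up helper, the BytesIO stands in for the terminal's byte stream, and "ascii" plays the role of whatever restrictive encoding Python decided the terminal had.

```python
import io

def tolerant_writer(raw, encoding):
    # Escape unencodable characters as \xNN, \uNNNN, ... rather than raising.
    return io.TextIOWrapper(
        raw, encoding=encoding, errors="backslashreplace", newline="\n"
    )

buf = io.BytesIO()                      # stand-in for sys.stdout.buffer
out = tolerant_writer(buf, "ascii")     # pretend the terminal is ASCII-only
out.write("processing café.txt\n")
out.flush()
written = buf.getvalue()
```

On Python 3.7+ the real stdout can be adjusted in place with sys.stdout.reconfigure(errors="backslashreplace"), which avoids building a wrapper by hand.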

I assume that it works better now.

You're assuming a lot.

What about codebases with python2 third party dependencies that don't work in python3? Now you have to port that entire library as well, or write it yourself while crossing your fingers that it is well documented and easy to work through.

What about codebases without decent test suites? I'd argue most production codebases don't have good test suites, or at least the most complex of code is usually poorly tested. You'll end up spending most of your time digging for regressions especially if your code creates large amounts of user interfaces.

What about code bases that were written by scientists, mathematicians, or other professionals who may not be as fluent in writing "good" code?

dependency rot is a more general problem. in my work we deal with lots of seldomly used applications, and I've found that reducing external dependencies tends to keep life happier

It happens... but maybe you are assuming too little.

There are almost no relevant third-party libs that have not been ported at this stage. If they have not, they have probably been abandoned and the client codebase has bigger issues. Same for uncovered code bases, and 'unprofessional' Python production code. That's hardly Python's fault.

No one wants to spend a ton of energy just to remain in place. I don't understand how software providers are so cavalier about eschewing backwards compatibility.

What's the largest codebase you've migrated?

Would you be willing to port my 796,113 line program for two hours of pay at $45.00/hour? Because if so it would be a bargain to hire you. Last time I tried to plan the conversion by looking over the codebase it took me two days of concerted effort to just come to the conclusion that it wasn't worth the effort.
