1. Unicode support was actually an anti-feature for most existing code. If you're writing a simple script you prefer 'garbage-in, garbage-out' unicode rather than scattering casts everywhere to watch it randomly explode when an invalid byte sneaks in. If you did have a big user-facing application that cared about unicode, then the conversion was incredibly painful for you because you were a real user of the old style.
2. Minor nice-to-haves like print-function, float division, and lazy ranges just hide landmines in the conversion while providing minimal benefit.
In the latest py3 versions we've finally gotten some sugar to tempt people over: asyncio, f-strings, dataclasses, and type annotations. Still not exactly compelling, but at least something to encourage the average Joe to put in all the effort.
Actually that's the behavior of python 2: it works fine until you send it invalid characters, and then it blows up.
In python 3 it always blows up when you mix bytes with text, so you can catch the issue early on.
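A minimal sketch of that early failure (plain Python 3, nothing beyond the standard behavior):

```python
# In Python 3, str and bytes never mix implicitly: combining them
# raises TypeError at the point of mixing, instead of silently
# producing mojibake somewhere downstream as Python 2 could.
try:
    "text" + b"bytes"
except TypeError as exc:
    print("caught:", exc)  # fails fast, right where the bug is
```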
> In the latest py3 versions we've finally gotten some sugar to tempt people over: asyncio, f-strings, dataclasses, and type annotations. Still not exactly compelling, but at least something to encourage the average Joe to put in all the effort.
That's because until 2015, all python 2.7 features were backports from python 3. Python 2.7 was basically python 3 without the incompatible changes. After they stopped backporting features in 2015, python 3 suddenly started looking more attractive.
> In python 3 it always blows up when you mix bytes with text, so you can catch the issue early on.
Sometimes you don't care about weird characters being printed as weird things. In python 2 it works fine: you receive garbage, you pass garbage along. In python 3 it shuts down your application with a traceback.
Dealing with this was one of my first Python experiences and it was very frustrating, because I realized that simply using #!/usr/bin/python2 would solve my problem, but people wanted python3 just because it was fancier. So we played a lot of whack-a-mole to make it not explode regardless of the input. And the documentation was particularly horrible regarding that; not even experienced pythoners knew how to deal with it properly.
You run your python 2 code on python 3 and it fails. At that point, most people will drop an encode() or decode() at the spot of the failure, when the correct fix would be to place encode/decode at the I/O boundary (files, network, etc; and in python 3 even files don't need it if you open them in text mode).
Ironically, python 2 code that doesn't use unicode is easier to port.
When you program in python 3 from the start it's very rare to need to encode/decode strings. You only do that when working at the I/O level.
> And the documentation was particularly horrible regarding that; not even experienced pythoners knew how to deal with it properly.
Because it's not really python-specific knowledge. It's really about understanding what unicode is, what bytes are, and when to use each.
The general practice is to keep everything as text, and do the conversion only when doing I/O. You should think of unicode/text as a representation of text, the way you think of a picture or a sound. Similarly to images and audio, text can be encoded as bytes. Once it is bytes, it can be transmitted over the network, written to a file, etc. When you read the data back, you decode it into text again.
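That encode-on-the-way-out, decode-on-the-way-in flow is just:

```python
message = "naïve café"          # text: an abstract sequence of characters

wire = message.encode("utf-8")  # bytes: ready for a socket or a file
assert isinstance(wire, bytes)

back = wire.decode("utf-8")     # decode at the boundary on the way in
assert back == message          # lossless round trip
```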
This is what Python 3 is doing:
- by default all strings are of type str, which is unicode
- bytes are meant for binary data
- you can open files in text or binary mode; if you open in text mode, the encoding/decoding happens for you
- socket communication: here you do need to convert strings to bytes and back
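A small sketch of the text-mode vs binary-mode point above, using a throwaway temp file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode: you hand Python str, it encodes for you.
# (encoding= is explicit here; without it the locale default is used.)
with open(path, "w", encoding="utf-8") as f:
    f.write("über")

# Text-mode read: decoding happens for you, you get str back.
with open(path, encoding="utf-8") as f:
    assert f.read() == "über"

# Binary mode: no conversion, you see the raw encoded bytes.
with open(path, "rb") as f:
    assert f.read() == "über".encode("utf-8")
```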
Python 2 is a tire fire in this area:
- text is bytes
- text also can be unicode (so two ways to represent the same thing)
- binary data can also be text
- I/O accepts text/bytes, no conversion happening
- a lot of (most? all?) the stdlib actually expects str/bytes as input and output
- the cherry on top is that python 2 also implicitly converts between unicode and str, so you can do crazy things like my_string.encode().encode() or my_string.decode()
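For contrast, Python 3 simply removed the nonsensical directions: str has no decode() and bytes has no encode(), so the double-encode trick can't even be expressed.

```python
# Python 3 keeps conversion one-directional:
# str -> encode -> bytes, bytes -> decode -> str, and nothing else.
assert hasattr("text", "encode") and not hasattr("text", "decode")
assert hasattr(b"raw", "decode") and not hasattr(b"raw", "encode")
```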
So now you get python 2 code where someone wanted to be correct (which is actually quite hard to do, mainly because of the implicit conversion), so the existing code is full of encode() and decode() calls, because some functions expect str and some expect unicode.
Different functions might then hand you either bytes or unicode as "a string".
Now you take such code and try to move it to python 3, which no longer has implicit conversion and throws an error whenever it expected text and got bytes, or vice versa. str is now unicode, the unicode type no longer exists, and bytes is no longer the same thing as str. So your code blows up.
Most people see an error and add encode() or decode(), often just trying whichever one works (like the calls you were removing), when the proper fix would actually be removing encode()s and decode()s in other places in the code.
It's quite a difficult task when your code base is big, which is why Guido put a lot of effort into type annotations and mypy. One of their intended benefits is helping with exactly these issues.
Native English speakers are usually the ones blissfully unaware of it, because ASCII just happens to cover all their usual inputs. But as soon as you have so much as an umlaut: surprise! And there are plenty of ways to end up with a unicode string floating around even in Python 2 (JSON, for example). Then it ends up in some place like a + b, and you get an implicit conversion.
For code that has to run on both 2 and 3, it's easiest to write for python 3 and then, on python 2, import everything you can from the __future__ module, including unicode_literals. Even that's not always enough and you may still need extra work. In python 3 there's an encoding argument that can do the conversion for you, which doesn't seem to be available in python 2, so you probably shouldn't use it; instead, treat all input/output as bytes (i.e. call encode() on data you send to stdin, and decode() on what you get back from stdout and stderr).
Perhaps that might be enough for your case, although many things are hard to get right in python 2 even when you know what you should do, because of the implicit conversion.
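Assuming the stdin/stdout advice above refers to driving a subprocess (the comment doesn't say explicitly), the bytes-at-the-boundary pattern looks roughly like this:

```python
import subprocess
import sys

# A child process that upper-cases whatever it reads from stdin.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]

# Everything crossing the pipe is bytes: encode on the way in,
# decode on the way out.
result = subprocess.run(child, input="hello".encode("utf-8"),
                        stdout=subprocess.PIPE)
text = result.stdout.decode("utf-8")
```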
Edit: this also might be useful: https://unicodebook.readthedocs.io/good_practices.html
Also this could help: https://unicodebook.readthedocs.io/programming_languages.htm...
This is definitely the case. I've been wrestling with bytes and strings all the time while porting a Django application to Python 3 for a customer. I can see myself encoding and decoding response bodies and JSON for the time being. For reasons I didn't investigate, I don't have to do that with projects in Ruby and Elixir; it seems everything is a string there, and yet they work.
Perhaps there’s something about a port that requires encoding/decoding bytes/strings?
Ironically, when your python 2 app doesn't care about unicode, porting to python 3 is actually much easier.
If you write code in python 3 from the start you rarely need to use encode() and decode(). Typically what you want is text, not bytes.
The exception might be places where you serialize, like I/O (network or files; although even files are converted on the fly unless you open them in binary mode).
Example: I just had to write this
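(The snippet didn't survive in this copy of the thread, but given the urllib complaint that follows, it was presumably something like the following; read_text, FakeResponse, and the utf-8 fallback are my guesses, not the original code. The fake response stands in for a urllib.request response so the sketch runs without touching the network.)

```python
from email.message import Message
from io import BytesIO


def read_text(resp):
    """Decode an HTTP response body by hand, the way urllib makes you."""
    charset = resp.headers.get_content_charset() or "utf-8"  # assumed fallback
    return resp.read().decode(charset)


class FakeResponse:
    """Stand-in with the two attributes read_text() uses: headers and read()."""

    def __init__(self, body, content_type):
        self.headers = Message()
        self.headers["Content-Type"] = content_type
        self._body = BytesIO(body)

    def read(self):
        return self._body.read()


resp = FakeResponse("café".encode("latin-1"), "text/html; charset=latin-1")
assert read_text(resp) == "café"
```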
This is the only language where I have to explicitly deal with encodings at such a low level. It doesn't make me want to use it for my pet projects.
Of course urllib could have a text() method that would do such conversion, but then urllib is not requests. It never was user-friendly.
Edit: personally I use aiohttp, the interface is much nicer: https://aiohttp.readthedocs.io/en/stable/client_reference.ht... if I can't use asyncio then would use requests.
Not that I've seen.
Example of where Python 3 has rained shit on my parade: I wrote a program that backs up files on Linux. It works fine in python 2, but in python 3 you rapidly learn you must treat filenames as bytes, otherwise your backup program blows up on valid Linux filenames. And it's not just decoding errors; it's worse. Because Unicode doesn't have a unique encoding for each string, the round trip (binary -> string -> binary) is not guaranteed to return the same binary. If you make the mistake of using that route (which python 3 takes by default), then one day python 3 will tell you it can't open a file you os.listdir()'d microseconds ago and can clearly see is still there.
Later, you get some sort of error when handling one of those filenames, so you sys.stderr.write('%s: this file has an error' % (filename,)). That worked just fine in python 2, but in python 3 it generates crappy-looking error messages even for good filenames. And you can't just decode the filename to a string, because that might raise a coding error.
This works: sys.stderr.buffer.write(b'%b: this file has an error' % (filename,)), but then you find you've inserted other strings into error messages, and soon the only "sane" thing to do is to convert every string in your program to bytes. Other solutions, like sys.stderr.write('%s: this file has an error' % (filename.decode(errors='ignore'),)), corrupt the filename the user sees, are verbose, and worst of all, if you forget one it isn't caught by unit tests but will still make your program blow up in rare instances.
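One middle road (not mentioned in the thread) is os.fsdecode/os.fsencode, which use the surrogateescape error handler so arbitrary filename bytes survive a round trip through str. A sketch, assuming a UTF-8 filesystem encoding:

```python
import os

# A valid Linux filename that is not valid UTF-8.
raw = b"report-\xff-final.txt"

name = os.fsdecode(raw)          # bytes -> str: invalid bytes become
                                 # lone surrogates instead of raising
assert os.fsencode(name) == raw  # ...and the round trip is lossless

# The surrogates display poorly, but %-formatting won't crash:
msg = "%s: this file has an error" % (name,)
```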
I realise that for people who live in a land of clearly delineated text and binary, such as the django user posting here, these issues never arise, and the clear delineation between text and bytes is a bonus. But people who use python 2 as a better bash scripting language than bash don't live in that world. For them python 2 was a better scripting language than bash, but it is being deprecated in favour of python 3, which is actually more fragile than bash for their use case. (That's a pretty impressive "accomplishment".) Perhaps they will go back to Perl or something, because as it stands python 3 isn't a good replacement.
This! IMO Python 2 has better usability for prototyping and for thinking and doing things on the fly. Python 3 also often seems to have deprecated the functions I want to use in favor of ones that are more cumbersome and take more keystrokes. More explicit, sure, but less fluid.
When python was created, Unicode didn't even exist.
Anyway, in python 3 many os functions accept both str and bytes, and behave differently depending on which you pass. For example, if you pass os.walk a path as bytes, it will yield paths as bytes.
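That str-in/str-out, bytes-in/bytes-out contract is easy to see with os.listdir:

```python
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, "a.txt"), "w").close()

# str path in -> str names out; bytes path in -> bytes names out.
assert all(isinstance(n, str) for n in os.listdir(d))
assert all(isinstance(n, bytes) for n in os.listdir(os.fsencode(d)))
```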
Not always. As far as I can tell, writing garbage bytes to various APIs works fine unless they explicitly try to handle encoding issues. The first time I noticed encoding issues in my code was when writing an XML structure failed on Windows, all because of an umlaut in an error message I couldn't care less about. The solution was to simply strip any non-ASCII character from the string; not a nice or clean solution, but the issue wasn't worth more effort.
That is nice if your job involves dealing with unicode issues. My job doesn't, any time I have to deal with it despite that is time wasted.
"Dealing with unicode" is really just dealing with it at the input/output boundaries (and even then, libraries handle it most of the time). But without the clear delineation that Python 3 provides, when you _do_ hit some issue you'll probably insert a "fix" in the wrong place, leading to the classic py2 "I just call decode() on the same string in 1000 places because I've lost track".
The interesting text follows company-set naming schemes, which means all English and ASCII. The rest could be random bytes for all I have to care.
Many formats, like plain text or zip, don't have a fixed encoding, and I am not going to start guessing which one it is for every file I have to read; there is no way to do that correctly. Dealing with that mess is exactly what I want to avoid.
This is a lot of old code, and it's all ASCII, no matter what the locale of the system is. And even if the code were updated, all the messages would still be in some text == bytes encoding, because there's no "user data" involved, and the desired throughput is many gigabytes of text processed per second.
So yeah, unicode is not "everywhere": it may be everywhere on the public internet, but there is a world beyond this.
So you can throw in your emoji, and they might not show up correctly in the XML logging metadata I write, because I don't care. But they will end up in the processed file the same way they came in, instead of as <?> or some random Chinese or Japanese symbol that a guessing algorithm thought appropriate.
Also, there's no guessing happening in this instance. The locale configured in your environment variables is used if you open files in text mode.
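Concretely, text-mode open() without an explicit encoding= falls back to the locale's preferred encoding, which comes from the LANG/LC_* environment variables rather than from the data:

```python
import locale

# The encoding open(path) will use when no encoding= is given.
enc = locale.getpreferredencoding(False)
print(enc)  # e.g. "UTF-8" on a typical modern Linux locale
```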
The only reasonable scenario I can think of is when you are porting python 2 code to python 3 and playing with .decode() and .encode().
We're talking about simple scripts, the solution is to not send in invalid characters.
Personally, asyncio and type annotations are a big turnoff. I know this is a bit contrarian, but I've always favored the greenlet/gevent approach to cooperative multitasking. Asyncio (née Twisted) had a large number of detractors, but now that the red/blue-function approach has been blessed, it seems like many are just swallowing their bile and using it.
Type annotations really chafe because they seem so unpythonic. I like python for its dynamism and for clean, simple code. Type annotations feel like an alien invader and make code much more tedious to read. If I want static typing, I'll use a statically typed language.
No one wants to spend energy re-programming to stay in place.
Python 3 came out in 2008, so say: no backported features after 2009, no bug fixes after 2012. All announced in 2008, of course.
Given 4 years to migrate most would have made the jump sooner.
Once again, how can you ask/require users to expend precious limited energy to re-program just to stay in place? It's totally obnoxious and completely unnecessary.
This is exactly backwards from reality. It's as if they were eating at someone's home and had turned a cup of coffee into a week-long stay, during which they rudely complained when the host asked them to please do something about their pile of dishes, trash, laundry, and leavings.
Nobody is, after all, taking away your copy of python 2 or your ability to use and maintain it. It takes active effort to keep fixing bugs in software that may be network-facing. If you want to do that maintenance you can, of course, but it seems people aren't going to do it for python 2 forever. If you disagree, either take up the reins or pool your funds to pay someone to do it.
The thing to do back in 2008 was to figure out when you wanted to switch and schedule a bit of time to learn python 3. Anyone who did this by, oh, 2009 or 2010 would have virtually no work to do now. Any work created since, based on something you were told 11 years ago was going away, is most assuredly work you have created for yourself and will be obliged to take up.
Anyone who did this in 2014 would have had a decade of runway before they could no longer run their python 2 apps on RHEL/CentOS. Anyone who switches TODAY, 11 years late to the party, can run python 2 on Red Hat for another 4 years.
It would be more work to do otherwise. Nobody wants to do that work. You don't and they don't.
1. it's free so it's OK to be user-hostile
2. if you don't like the direction, just fork/fix it
You don't have to fork it to fix it personally. You may also consider putting your money where your mouth is and organizing an effort to fund the change you want to see in the world. If you succeed, the world will have value it wouldn't otherwise, and will owe you kudos. Everyone likes options. If you fail, you ought to move on; you have no basis for complaint. I think this is informative.
Open Source is Not About You
From where do you derive the requirement to graciously work for free to serve your ends?
If anyone can call anything anything, then how is it even possible for consumers to make intelligent choices? Having it be called something else allows your consumers to make an informed choice about using it, rather than letting you incorrectly trade on the official project's reputation. Of course YOU might merely want someone to competently maintain python 2.
Others might opt to do so badly and thus damage the actual python brand. Worse yet, others might opt to make changes that serve their nefarious needs, like folding in ads or data collection. Without a defining line between official and unofficial, how do we prevent that?
Call it Cobra and brand it python's cooler cousin if you like.
- run 2to3
- spend 2h max fixing any failing tests
- iron out any remaining issues in a few days of beta testing, like you'd do for any new release
Then you wouldn't have much to port.
They've been porting hg to Python 3 for the last 10 years and are only now nearing completion.
I've written a bit more about this on Lobsters:
Even taking into account the fact that new features were still being added and not all focus was on porting, this doesn't really seem like a reasonable representation of what's going on; I have a suspicion that "10 years" of porting here does not entail nearly as much work as it seems.
The average few-hundred to few-thousand LOC app, which should be 98% of all production code bases, will almost certainly port with no issue.
In python 2, that's trivial. Whatever system you're on would normally be configured so that filename bytes dumped to the terminal displayed correctly, so you could just treat the strings as bytes and it would be fine.
In python 3, it was a nightmare. No, you could not just decode from/encode to UTF-8, even if that was what your system used! Python had its own idea of what the terminal's encoding was, and if you used the wrong one, it wouldn't let you print. If you tried to convert from UTF-8 to whatever it thought the terminal was using, that would also break, because not all characters were representable. And your script couldn't just tell Python to treat the terminal as UTF-8, either; you had to start digging into locale settings, and if you tried to fix those, then _everything else_ would break, and nobody had any idea what the right thing to do was, because you were using an obscure OS (the latest macOS at the time).
I assume that it works better now.
What about codebases with python 2 third-party dependencies that don't work in python 3? Now you have to port that entire library as well, or rewrite it yourself while crossing your fingers that it is well documented and easy to work through.
What about codebases without decent test suites? I'd argue most production codebases don't have good test suites, or at least the most complex code is usually poorly tested. You'll end up spending most of your time digging for regressions, especially if your code creates a lot of user interface.
What about code bases that were written by scientists, mathematicians, or other professionals who may not be as fluent in writing "good" code?
There are almost no relevant 3rd-party libs that haven't been ported at this stage. If they haven't, they've probably been abandoned and the client codebase has bigger issues. Same for uncovered code bases and "unprofessional" python production code. That's hardly Python's fault.