Iterators everywhere are incredibly annoying, especially with my development workflow, where I don't put a line of code into a file before I run it manually in the interpreter. When I run a map over a list, I just want to see the freaking results.
Default unicode strings are obscenely annoying to me. Almost all of my code deals with binary data, parsing complex data structures, etc. The only "human readable" strings in my code are logs. Why the hell should I worry about text encoding before sending a string into a TCP socket...
The fact that the combination of words "encode" and "UTF8" appear in my code, and str.encode('hex') is no longer available, is a very good representation of why I hate Python 3.
In Python 2, the rule of thumb was "If it makes sense, it's going to work". In Python 3, this isn't true. Not often, but often enough to annoy. This makes Python "feel" like Java to me.
And worst of all, Python 3 has so many excellent features, like sub-generators, a better performing GIL, etc. These sometimes force me into coding with Python 3. I hate it every freaking time.
I said to myself that with Python 3.4, due to great stuff like the new asyncio module, I'll have to make the switch. It's really sad that this is because I "have to" do it, and not because I "want to".
> Default unicode strings are obscenely annoying to me. Almost all of my code deals with binary data, parsing complex data structures, etc. The only "human readable" strings in my code are logs. Why the hell should I worry about text encoding before sending a string into a TCP socket...
If your Python 3 code is dealing with binary data, you would use byte strings and you would never have to call encode or touch UTF-8 before passing the byte string to a TCP socket.
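A minimal sketch of what that looks like (the address and payload are invented for illustration):

import socket

payload = b'\x01\x00\xff\xfe'                            # already bytes; no text, no encoding involved
sock = socket.create_connection(('192.0.2.1', 9000))     # placeholder host/port
sock.sendall(payload)                                    # sendall() accepts bytes directly
sock.close()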
What you're saying about Unicode scares me. If you haven't already, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) before writing any libraries that I might depend on.
> you would use byte strings and you would never have to call encode or touch UTF-8 before passing the byte string to a TCP socket.
I'll start by adding that it's also incredibly annoying to declare each string to be a byte string, if this wasn't clear from my original rant.
Additionally, your advice is broken. Take a look at this example, with the shelve module (from standard library).
import shelve

s = shelve.open('/tmp/a')
s[b'key'] = 1
Results in:
AttributeError: 'bytes' object has no attribute 'encode'
So in this case, my byte string can't be used as a key here, apparently. Of course, a string is expected, and this isn't really a string. My use case was trying to use a binary representation of a hash as the key here. What's more natural than that? I could easily do that in Python 2. Not so easy now.
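To be concrete, the kind of conversion this forces looks something like this (a sketch; the hash and value are made up):

import binascii, hashlib, shelve

digest = hashlib.sha256(b'some blob').digest()         # bytes: the natural key
s = shelve.open('/tmp/a')
s[binascii.hexlify(digest).decode('ascii')] = 1        # shelve wants str keys, so hex-encode first
s.close()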
I can find endless examples for this, so your advice about "just using byte strings" is invalid. Conversions are inevitable. And this annoys me.
> What you're saying about Unicode scares me.
Yeah, I know full well what you're scared of. If I'm designing everything from scratch, using Unicode properly is easy. This, however, is not the case when implementing existing protocols, or reading file formats that don't use Unicode. That's where things begin being annoying when your strings are no longer strings.
> Additionally, your advice is broken. Take a look at this example, with the shelve module (from standard library).
His advice was sound, and referred to your example of TCP stream data (which is binary). Your example regards the shelve library.
> So in this case, my byte string can't be used as a key here, apparently.
shelve requires strings as keys. This is documented, though not particularly clearly.
> so your advice about "just using byte strings" is invalid.
Let me attempt to rephrase his advice. Use bytestrings where your data is a string of bytes. Use strings where your data is human-readable text. Convert to and from bytestrings when serializing to something like network or storage.
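In code, that rule looks roughly like this (the encoding here is just for illustration):

name = 'Gådel'                      # human-readable text: a str
wire = name.encode('utf-8')         # str -> bytes at the network/storage boundary
name_again = wire.decode('utf-8')   # bytes -> str on the way back in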
> Conversions are inevitable.
Absolutely, because bytestrings and (text)strings are two different types.
> And this annoys me.
There is no real alternative though, because there is no way to automatically convert between the two. Python 2 made many assumptions, and these were often invalid and led to bugs. Python 3 does not; in places where it does not have the required information, you must provide it.
> when implementing existing protocols, or reading file formats that don't use Unicode.
I'm afraid it's still the same. A protocol that uses Unicode requires you to write something like "decode('utf-8')" (if UTF-8 is what it uses); one that does not requires "decode('whatever-it-uses-instead')". If it isn't clear what encoding the file format or protocol stores textual data in, then that's a bug in the file format or protocol, not in Python. Regardless, Python doesn't know (and can't know) what encoding the file or protocol uses.
If all your code is dealing with binary data, why the heck are you using strings for it? There's a bytes type there for a reason, which doesn't deal with encodings, and you won't accidentally try and treat a blob like a string.
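For example, parsing a fixed binary header works on bytes directly, with no encodings in sight (the format here is made up):

import struct

header = b'\x01\x00\x00\x00\x10\x27\x00\x00'        # hypothetical 8-byte header
version, length = struct.unpack('<II', header)      # two little-endian uint32s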
Keyme, you and I are among the few who have serious concerns with the design of Python 3. I started to embrace it in a big way 6 months ago, when most of the libraries I use became available for Python 3. I wish I could say the wait is over, we should all move to Python 3, and everything will be great. Instead I find no compelling advantage. Maybe there will be one when I start to use Unicode strings more. Instead I'm really annoyed by the default iterators and the binary string handling. I am afraid it is not a change for the good.
I come from the Java world, where people take a lot of care to implement things as streams. It was initially shocking to see Python read an entire file into memory and turn it into a list or other data structure with no regard for memory usage. Then I learned this works perfectly well when you have a small input; a few MB or so is a piece of cake for a modern computer. It takes all the hassle out of setting up streams in Java. You optimize when you need to. But for 90% of stuff, a materialized list works perfectly well.
Now Python becomes more like Java in this respect. I can't do exploratory programming easily without adding list(). Many times I run into problems when I am building a complex data structure like a list of lists, and end up getting a list of iterators. It takes the conciseness out of Python when I am forced to deal with iterators and to materialize the data.
The other big problem is the binary string. Binary string handling is one of the great features of Python. It is so much friendlier to manipulate binary data in Python compared to C or Java. In Python 3, it is pretty much broken. It would have been an easy transition if I only needed to add a 'b' prefix to specify a binary string literal. But in fact, the operations on binary strings are so different from regular strings that it is just broken.
In [38]: list('abc')
Out[38]: ['a', 'b', 'c']
In [37]: list(b'abc') # string become numbers??
Out[37]: [97, 98, 99]
In [43]: ''.join('abc')
Out[43]: 'abc'
In [44]: ''.join(b'abc') # broken, no easy way to join them back into string
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-44-fcdbf85649d1> in <module>()
----> 1 ''.join(b'abc')
TypeError: sequence item 0: expected str instance, int found
All the other commenters here that are explaining things like using a list() in order to print out an iterator are missing the point entirely.
The issue is "discomfort". Of course you can write code that makes everything work again. This isn't the issue. It's just not "comfortable". This is a major step backwards in a language that is used 50% of the time in an interactive shell (well, at least for some of us).
The converse problem is having to write iterator versions of map, filter, and other eagerly-evaluated builtins. You can't just write:
>>> t = iter(map(lambda x: x * x, xs))
Because the map() call is eagerly evaluated. It's easy enough to exhaust an iterator with the list constructor, and it leads to a consistent iteration API throughout the language.
If that makes your life hard then I feel sorry for you, son. I've got 99 problems but a list constructor ain't one.
The Python 3 bytes object is not intended to be the same as the Python 2 str object. They're completely separate concepts. Any comparison is moot.
Think of the bytes object as a dynamic char[] and you'll be less inclined to confusion and anger:
>>> list(b'abc')
[97, 98, 99]
That's not a list of numbers... that's a list of bytes!
> >>> list(b'abc')
> [97, 98, 99]
>That's not a list of numbers... that's a list of bytes!
No, it's a list of numbers:
>>> type(list(b'abc')[0])
<class 'int'>
I think the GP mis-typed his last example. First, he showed that ''.join('abc') takes a string, busts it up, then concatenates it back to a string. Then, with ''.join(b'abc'), he appears to want to bust up a byte string and concatenate it back to a text string. But I suspect he meant to type this:
>>> b''.join(b'abc')
That is, bust up a byte string and concatenate back to what you start with: a byte string. But that doesn't work, when you bust up a byte string you get a list of ints; and you cannot concatenate them back to a byte string (at least not elegantly).
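For what it's worth, you can get back to a bytes object from the ints, though it reads nothing like the str version (a sketch):

>>> bytes(list(b'abc'))                      # bytes() accepts an iterable of ints
b'abc'
>>> b''.join(bytes([c]) for c in b'abc')     # the join-style spelling is clumsier
b'abc'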
Well, yes. Python chooses the decimal representation by default. So? It could be hex or octal and still be a byte.
My example was merely meant to be illustrative and not an example of real code. The bytes object is simply not a str, so I don't understand where this frustration with it not being a str is coming from. If you use the unicode objects in Python 3 you get the same API as before. The difference is that now you can't rely on Python implicitly converting ASCII byte strings up to unicode objects, and you have to explicitly encode/decode from one to the other. It removes a lot of subtle encoding bugs.
Perhaps it's just that encoding is tricky for developers who never learned it in the first place when they started learning the string APIs in popular dynamic languages? I don't know. It makes a lot of sense to me since I've spent years fixing Python 2 code-bases that got unicode wrong.
You are not making a useful comment because you don't understand the use case. Python 2 is very useful for handling binary data. This complaint is not about Unicode. It is about binary file manipulation.
I'm thrilled about the Unicode support. If they had only added Unicode strings, left binary strings alone, and just required the additional literal prefix b, it would have been an easy transition. Instead the design was changed for no good reason and the code is broken too.
I have a hard time believing that the design was arbitrarily changed.
The request to add string APIs to the bytes object has been brought up before [0]. I think the reasoning is quite clear: byte-strings are not textual objects. If you are sending textual data to a byte-string API, build your string as a string and encode it to bytes.
For working with binary data there's a much cleaner API in Python 3 that is less prone to subtle encoding errors.
edit: I realize there is contention, but I'm of the mind that .format probably isn't the right method and that if there were one it'd need its own format control string syntax.
> The converse problem is having to write iterator versions of map, filter, and other eagerly-evaluated builtins
Well, in Python 2 you just use imap instead of map. That way you have both options, and you can be explicit rather than implicit.
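In Python 2 that reads something like this (a sketch; xs is just a stand-in list):

>>> from itertools import imap
>>> xs = [1, 2, 3]
>>> lazy = imap(lambda x: x * x, xs)     # iterator, like Python 3's map()
>>> eager = map(lambda x: x * x, xs)     # list, evaluated right away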
> That's not a list of numbers... that's a list of bytes!
The point being made here is not that some things are not possible in Python 3, but rather that things that are natural in Python 2 are ugly in 3. I believe you're proving the point here. The idea that b'a'[0] == 97 in such a fundamental way that I might get one when I expected the other may be fine in C, but I hold Python to a higher standard.
What you are looking for is imap(). In Python 2 there is an entire collection of iterator variants, so you can choose either the list or the iterator version.
The problem with Python 3 is that the list versions are removed. You are forced to use iterators all the time. Things become inconvenient and ugly as a result, and bugs are regularly introduced because I forget to apply list().
>>> "".join(map(lambda byte: chr(byte), b'abc'))
Compared to ''.join('abc'), this is what I call fuck'd. Luckily maxerickson suggested a better method.
I don't have enough experience with either version to debate the merits of the choice, but the way forward with python 3 is to think of bytes objects as more like special lists of ints, where if you want a slice (instead of a single element) you have to ask for it:
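Presumably something along these lines:

>>> data = b'abc'
>>> data[0]          # indexing gives an int
97
>>> data[0:1]        # slicing gives a length-1 bytes object
b'a'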
Which is obviously worse if you are doing it a bunch of times, but it is at least coherent with the view that bytes are just collections of ints (and there are situations where indexing operations returning an int is going to be more useful).
In Java, String and char are two separate types. In Python, there is no separate char type; it is simply a string of length 1. I do not have a great theory to show which design is better either. I can only say the Python design has worked great for me in the past (for both text and binary strings), and I suspect it is the more user-friendly design of the two.
So in Python 3 the design of the binary string has changed. Unlike the old str, a byte and a binary string of length 1 are not the same thing. Working code is broken, practices have to change, and often the result is more complicated code (like [0] becoming [0:1]). All this happens with no apparent benefit other than being more "coherent" in the eyes of some people. This is the frustration I see after using Python 3 for some time.
this post is a great example of why python 3 didn't go far enough. it's too close to python 2 to be still called python and too far from python 2 to be still called python 2.
personally, coming from a country that needs to deal with non-ascii, i love unicode by default and there are b'' strings if you need them. str.encode is a non-issue - you wasted more words on it than the two-line function enc() it takes to fix it.
i have a program in python 2 that uses this approach, because i have a lot of decoding from utf and encoding to a different charset to do. python 3 is absolutely the same for me.
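a minimal sketch of such a helper, assuming hex output is what's wanted (only the name enc comes from above; the body is a guess):

import binascii

def enc(data):
    # stand-in for python 2's str.encode('hex')
    return binascii.hexlify(data).decode('ascii')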
Well, I don't think I will be able to change your opinion if you feel so strongly about it. However, some pointers might perhaps make it less painful:
> Iterators everywhere are incredibly annoying, especially with my development workflow, where I don't put a line of code into a file before I run it manually in the interpreter. When I run a map over a list, I just want to see the freaking results.
That's what list() is for:
>>> list(map(lambda x: x * x, xs))
> Default unicode strings are obscenely annoying to me. Almost all of my code deals with binary data, parsing complex data structures, etc. The only "human readable" strings in my code are logs. Why the hell should I worry about text encoding before sending a string into a TCP socket...
> The fact that the combination of words "encode" and "UTF8" appear in my code, and str.encode('hex') is no longer available, is a very good representation of why I hate Python 3.
I'm afraid I don't understand your complaint. If you're parsing binary data then Python 3 is clearly superior to Python 2, where encoding a byte string that contains non-ASCII bytes triggers an implicit ASCII decode and blows up:
>>> "Hello, Gådel".encode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
Because they're not the same thing. Python 2 would implicitly "promote" a bytestring (the default literal) to a unicode object so long as it contained only ASCII bytes. Of course this gets really tiresome and leads to Python 2's "unicode dance." Armin seems to prefer it to the extra leg-work for correct unicode handling in Python 3 [0], but I think the trade-off is worth it and that the pain will fade when the wider world catches up.
> In Python 2, the rule of thumb was "If it makes sense, it's going to work". In Python 3, this isn't true. Not often, but often enough to annoy. This makes Python "feel" like Java to me.
Perhaps this is because you're used to Python 2 and anything else is going to challenge your perceptions of the way it should work?
I don't understand the Java comparison.
> And worst of all, Python 3 has so many excellent features, like sub-generators, a better performing GIL, etc. These sometimes force me into coding with Python 3. I hate it every freaking time.
> I said to myself that with Python 3.4, due to great stuff like the new asyncio module, I'll have to make the switch. It's really sad that this is because I "have to" do it, and not because I "want to".
Well you can always attempt to backport these things into your own modules and keep Python 2 alive.
I think going whole-hog and embracing Python 3 is probably easier in the long run. I'm not sure how long the world is going to tolerate ASCII as the default text encoding given that its prevalence has largely been an artifact of opportunity. Unicode will eventually supplant it I have no doubt. It's good to be ahead of the curve.
> Why the hell should I worry about text encoding before sending a string into a TCP socket...
A string represents a snippet of human-readable text and is not merely an array of bytes in a sane world. Thus it is fine & sane to have to encode a string before sticking it into a socket, as sockets are used to transfer bytes from point a to b, not text.
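So at the boundary you write something like this (the address is a placeholder):

import socket

sock = socket.create_connection(('192.0.2.1', 9000))    # placeholder address
sock.sendall('hello, world\n'.encode('utf-8'))           # text -> bytes, then send
sock.close()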
Not arguing that you're wrong, but Unix/Linux is not a sane world by your definition. Whether we like it or not (I do like it), this is the world many of us live in. Python3 adds a burden in this world where none existed in Python2. In exchange, there is good Unicode support, but not everyone uses that. I can't help but wonder if good Unicode support could have been added in a way that preserved Python2 convenience with Unix strings.
(Please note that I'm not making any statement as to what's appropriate to send down a TCP socket.)
ASCII by default is only an accident of history. It's going to be a slow, painful process but all human-readable text is going to be Unicode at some point. For historical reasons you'll still have to encode a vector of bytes full of character information to send it down the pipe but there's no reason why we shouldn't be explicit about it.
The pain [in Python 3] falls primarily on library authors and only at the extremities. If you author your libraries properly your users won't even notice the difference. And in the end, as more protocols and operating systems adopt better encodings for Unicode support, that pain will fade (I'm looking at you, surrogateescape).
It's better to be ahead of the curve on this transition so that users of the language and our libraries won't get stuck with it. Python 2 made users have to think (or forget) about Unicode and get it wrong every time... the sheer amount of work I've put into fixing codebases that mixed bytes and unicode objects without thinking about it made me a lot of money, but cost me a few years of my life, I'm sure.
I was careful to say "Unix strings", not "ASCII". A Unix string contains no nul byte, but that's about the only rule. It's certainly not necessarily human-readable.
I don't think a programming language can take the position that an OS needs to "adopt better encodings". Python must live in the environment that the OS actually provides. It's probably a vain hope that Unix strings will vanish in anything less than decades (if ever), given the ubiquity of Unix-like systems and their 40 years of history.
I understand that Python2 does not handle Unicode well. I point out that Python3 does not handle Unix strings well. It would be good to have both.
This is the first time I've encountered the idiom "Unix strings". I'll map it to "array of bytes" in my table of idioms.
> I don't think a programming language can take the position that an OS needs to "adopt better encodings".
I do think that programming languages should take a position on things, including but not limited to how data is represented and interpreted. A language is expected to provide some abstractions, and whether a string is an array of bytes or an array of characters is a consideration for the language designer, who will end up designing a language that takes one side or the other.
Python has taken the side of the language user: it enabled Unicode names, defaulted to Unicode strings, defaulted to classes being subclasses of the 'object' class... Unix has taken the side of the machine (which was the side to take at the time of Unix's inception).
> [...] probably a vain hope that Unix strings will vanish [...]
If only we wait for them to vanish, doing nothing to improve.
> Python must live in the environment that the OS actually provides.
Yes, Python must indeed live in the OS' environment. Regardless, one need not be a farmer just because they live among farmers, need they?
> This is the first time I encounter the idiom Unix strings
The usual idiom is C-strings, but I wanted to emphasize the OS, not the language C.
>> [...] probably a vain hope that Unix strings will vanish [...]
>If only we wait for them to vanish, doing nothing to improve.
The article is about the lack of Python3 adoption. In my case, Python3's poor handling of Unix/C strings is friction. It sounds like you believe that Unix/C strings can be made to go away in the near future. I do not believe this. (I'm not even certain that it's a good idea.)
I do not insist that C strings must die, I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present. I fully support strings to be Unicode-by-default in Python, as most people will put text in between double quotes, not a bunch of bytes represented by textual characters.
I do not expect C or Unix interpretations of strings to change, but I believe they must be considered low-level, and that the user of a higher-level language should have to explicitly ask for a piece of data to be interpreted in that fashion.
My first name is "Göktuğ". Honestly, which one of the following is rather desirable for me, do you think?
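Presumably the choice is between something like this:

>>> raw = 'Göktuğ'.encode('utf-8')
>>> print(raw.decode('utf-8'))       # decoded with the right encoding
Göktuğ
>>> print(raw.decode('cp1252'))      # decoded with a wrong legacy encoding
GÃ¶ktuÄŸ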
I'm not arguing against you. I just don't write any code that has to deal with people's names, so that's just not a problem that I face. I fully acknowledge that lack of Unicode is a big problem of Python2, but it's not my problem.
A Unix filename, on the other hand, might be any sort of C string. This sort of thing is all over Unix, not just filenames. (When I first ever installed Python3 at work back when 3.0 (3.1?) came out, one of the self tests failed when it tried to read an unkosher string in our /etc/passwd file.) When I code with Python2, or Perl, or C, or Emacs Lisp, I don't need to worry about these C strings. They just work.
My inquiry, somewhere up this thread, is whether or not it would be possible to solve both problems. (Perhaps by defaulting to utf-8 instead of ASCII. I don't know, I'm not a language designer.)
> I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present
OK, maybe I do see one small point to argue. A C string, such as one that might be used in Unix, is not necessarily text. But text, represented as utf-8, is a C string.
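Concretely (reusing the name from earlier in the thread):

>>> 'Göktuğ'.encode('utf-8')   # text, serialized to UTF-8: plain, NUL-free bytes
b'G\xc3\xb6ktu\xc4\x9f'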
It seems like there's something to leverage here, at least for those points at which Python3 interacts with the OS.