
Grokking Python 3’s str type
https://sircmpwn.github.io/2017/01/13/The-problem-with-Python-3.html
======
nneonneo
Python 3 has arguably one of the best built-in string implementations around.

In Python 2, "unicode" was a type whose codepoint width depended on the
interpreter build - 2 bytes on "narrow" builds and 4 bytes on "wide" builds.
Since most builds were "narrow", in practice non-BMP codepoints were a real
challenge to use.

Furthermore, most languages with a "wide" string type use a fixed 2 bytes per
code unit (Java, C#, C++, ...), which wastes space when you're dealing with a
lot of ASCII, and is a pain to work with if you have non-BMP codepoints, since
indexing by codepoint then requires scanning the string from start to end.

In Python 3, with PEP 393, strings now have a flexible internal representation
which can use 1, 2, or 4 bytes per codepoint depending on the largest
codepoint in the string. This saves both space and processing time (in most
common situations) since common string operations like indexing and slicing
are constant time. This representation also allows strings to scale from
ASCII to astral codepoints with ease.
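
You can see the flexible representation from the interpreter. This is a
minimal sketch; exact byte totals vary across CPython versions and platforms,
but the per-codepoint growth is what matters:

    
    
        import sys
        
        # PEP 393: storage per codepoint depends on the widest codepoint present.
        print(sys.getsizeof('aaaa'))           # all ASCII: 1 byte per codepoint
        print(sys.getsizeof('aaa\u0394'))      # a BMP codepoint: 2 bytes each
        print(sys.getsizeof('aaa\U0001F600'))  # an astral codepoint: 4 bytes each
    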

Python 3 is worth the switch. Correct, strict separation of 'bytes' (bucket of
octets) and 'str' (sequence of codepoints) is really the only way to preserve
sanity and interoperability with today's encoding-rife reality.

~~~
tannhaeuser
I'm not disagreeing, but the situation is a bit more complex. For example, ISO
10646 ("Unicode") has multi-code-point sequences that still denote a single
character (such as variation sequences, which are actually used in standard
HTML entities).
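
A combining sequence (a close cousin of the variation sequences mentioned
above) shows the effect in Python -- several code points, one character; a
quick sketch:

    
    
        >>> s = 'e\u0301'    # 'e' + COMBINING ACUTE ACCENT renders as one character
        >>> len(s)           # ...but it is two code points
        2
        >>> import unicodedata
        >>> len(unicodedata.normalize('NFC', s))   # here it composes to one code point
        1
    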

Moreover, there are languages/scripts challenging the notion that a particular
byte-unit corresponds to a single character.

Treating strings as byte sequences (with optional UTF-8
interpretation/checking, uppercase/lowercase conversion if the concept even
applies, trim functions based on all Unicode spaces not just those characters
in US-ASCII etc.) is entirely a defensible choice for a programming language.

~~~
perlgeek
> For example, ISO 10646 ("Unicode") has multi-code-point sequences that
> still denote a single character

I only know two programming languages that deal correctly with that: Swift and
Perl 6. If you know more, please tell me.

~~~
steveklabnik
Rust separates this out; the standard library gives you bytes and Unicode
Scalar Values. Grapheme cluster stuff is in a package on Cargo, maintained by
the Servo team.

~~~
bquinlan
You can iterate over Grapheme clusters using the standard library:
[https://doc.rust-
lang.org/1.3.0/std/str/struct.Graphemes.htm...](https://doc.rust-
lang.org/1.3.0/std/str/struct.Graphemes.html)

~~~
steveklabnik
Those are unstable docs for 1.3; they don't exist in today's Rust.

------
ceronman
The fact that Python 2 is so permissive about mixing bytes and strings is, I
think, the fundamental reason people believe Python 3 strings are broken.

In practice, Python 3 won't allow you to mix bytes and strings at all, and it
forces you to decode and encode properly. Python 2 will happily try to convert
to and from ASCII implicitly when needed. This might seem simpler at first,
because Python 2 requires less boilerplate, but later, when you have to deal
with a different encoding or use non-ASCII characters, you get really weird,
hard-to-debug problems that cause real pain. Python 3's strictness helps you
prevent exactly these kinds of problems.
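
A quick sketch of the contrast (error messages abbreviated):

    
    
        # Python 2: implicit ASCII coercion works right up until it doesn't
        >>> 'abc' + u'def'       # bytes + unicode -> unicode via implicit decode
        u'abcdef'
        >>> '\xc3\xa9' + u'!'    # non-ASCII bytes blow up at runtime
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...
        
        # Python 3: mixing is refused up front, at the point of the mistake
        >>> b'abc' + 'def'
        TypeError: can't concat str to bytes
    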

~~~
sheeshkebab
Taking away permissiveness from a programming language just leads to
developers not upgrading. No surprises there.

Yes, Python 3's stricter model is less error-prone (vs. Python 2's u'').

The more annoying thing for me personally is the "print" function rather than
the previous "print" statement... bugs the hell out of me.

~~~
joejev
I hear this complaint about print all the time, but I have never understood
it. Why is print so special that it needs dedicated syntax? Most languages
don't do this and people don't seem to mind. Also, I almost never physically
type print, but I do type a lot of function calls, so if anything I am more
accustomed to typing it as a call.

~~~
kevin_thibedeau
The print statement is a little more convenient in the REPL. Otherwise there
is little to complain about in having a more consistent, extensible print
without any magic. Every Py2 coder should be sticking "from __future__ import
print_function" into their code to train themselves for the transition.

~~~
klibertp
> The print statement is a little more convenient in the REPL.

IIRC, IPython has a feature where you can prepend a slash to any function (at
the beginning of a line) to have the parens inserted automatically, i.e.

    
    
        In [0]: /print "foo"
    

is executed as print("foo"). There - problem solved, and for all functions,
not just one special case.

Honestly, the crazy `print >>fileobject, "foo"` syntax is enough of a reason
by itself to remove the print statement from the language. It was unpythonic
and I'm surprised it lasted as long as it did.
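
For comparison, the Python 3 spelling of that redirection is just another
keyword argument:

    
    
        import sys
        
        # Python 2: print >>sys.stderr, "foo"
        # Python 3: the destination is an ordinary argument
        print("foo", file=sys.stderr)
    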

~~~
vram22
Yes, I don't like that last usage either. Non-orthogonal syntax is slightly
harder to learn / remember, and the more of it there is, the greater the
cognitive / memory load. Of course, no language can be perfectly orthogonal or
regular, though I guess Lisp comes close. (Not an expert on language design or
theory.)

------
ankitml
Last month, we had to add support for the Hindi language in our APIs within 2
weeks, as the Indian govt wanted to launch a learning program in both
languages. Being a young startup of less than 2 years, we could not afford to
lose this opportunity by asking for more time. Because all of our code was in
Python 3.5, there was nothing we had to do. Hindi support was magically
available and everything worked flawlessly.

Fingers crossed on the program launch... :)

~~~
vram22
What sort of APIs and learning program, if not confidential?

~~~
ankitml
It's a learning program for budding entrepreneurs. We are at www.upgrad.com.
There are APIs for questions, answers, feedback, a discussion forum, etc.

~~~
vram22
Sounds interesting, thanks.

------
CrLf
It may be true that one of the issues with Python 3 strings is people not
grokking them, thus littering their code with useless/redundant/wrong
`.decode()` and `.encode()` calls. But I say this is a problem with Python 3
_itself_ , since it obviously just replaced one set of problems with another.

I think the fundamental mistake of Python 3's approach to strings is assuming
that programmers mostly meant "text" when they used strings in the past. This
may be true in web circles, but in most other areas of application they
actually meant (opaque) bytes. (`from __future__ import unicode_literals`
makes more sense in this scenario, and I've been using it since forever.)

It's the changing of default behavior that confuses people!

Also, the standard library has a few places where the maintainers don't seem
to grok strings either. Take the "json" module, for example: why does
`json.dumps()` return a "str" while `json.loads()` doesn't accept "bytes" as
input? (Hint: JSON is, by definition, UTF-8 encoded, so "dumps" should return
"bytes" and "loads" should accept both.)

(I should mention that my native tongue uses characters outside of ASCII, so
it would seem that I should be asking for unicode everywhere.)

~~~
d0mine
JSON is a text format, not binary:
[http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf)

JSON text is a sequence of Unicode code points, not bytes. Python 3's str type
is ideal for representing JSON text.

On the internet "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32"
[https://tools.ietf.org/html/rfc7159](https://tools.ietf.org/html/rfc7159)

json.loads() accepts binary input as of Python 3.6, using the encoding
detection scheme from the obsolete RFC 4627, which relies on now-false
assumptions (that a JSON text represents either an array or an object).
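
A quick check on Python 3.6+ (outputs shown as comments):

    
    
        import json
        
        # Since 3.6, loads() accepts bytes and detects UTF-8/16/32 automatically:
        print(json.loads(b'{"caf\xc3\xa9": 1}'))   # {'café': 1}
        
        # dumps() still returns str, not bytes:
        print(type(json.dumps({'k': 1})))          # <class 'str'>
    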

------
Animats
In Python 3, a "bytes" type is too much like a string. It's supposed to be an
array of [0..255]. But

    
    
        >>> s = bytes([97,98,99]) # created an array of "bytes" from a list of ints.
        >>> type(s)
        <class 'bytes'>   # it's really a type "bytes"
        >>> s
        b'abc'            # but it prints as a string
        >>> s[1]          # each element, however, prints as an integer
        98
    

Python 3 thus isn't rigorous about "bytes" as an array of byte values. It's
become less rigorous; you can now use regular expressions on "bytes" types. If
Python 3 had taken a harder line, that "bytes" is just an array of bytes, the
distinction would be clearer.

Actually, Unicode has worked fine in Python since Python 2.6. You just had to
write "unicode" and "u'foo'" a lot. In an exercise of sheer obnoxiousness,
those were originally disallowed in Python 3, instead of making them null
operations.

Strings in Python 3 appear to be arrays of Unicode characters. This is a bit
tricky, because Python doesn't have a Unicode character type. Elements of a
string are also strings. There's no type like Go's "rune".
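
For example:

    
    
        >>> s = 'abc'
        >>> type(s[0])     # indexing yields a length-1 str, not a char/rune type
        <class 'str'>
        >>> s[0] == 'a'    # the element is itself a one-character string
        True
    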

Python has successfully hidden the internal structure of its strings.
Internally, the representation can use 1, 2, or 4 bytes per codepoint, plus an
optional cached UTF-8 copy. This means a lot of run-time machinery.

Go and Rust both have a string type that is both internally and visibly UTF-8.
They're subscriptable, but at the byte level. An element of a Go or Rust
string is not a string, a character, or a rune - it's just a byte out of the
middle of something. The same is true for slices of strings; they are not
necessarily valid UTF-8. This is a cause of trouble. You shouldn't be
subscripting through UTF-8 byte by byte. In practice, you have to be aware of
the UTF-8 representation in Go and Rust, or use a library which is. Here's my
grapheme-aware word wrap in Rust.[1] Too much touchy fooling around with byte-
level indices into arrays there.

Arguably, if you're going to use UTF-8 as a string representation, subscripts
should be of an opaque type, not integers. You should be able to move forward
or backward one grapheme at a time cheaply, and, if necessary, create an array
of slices representing all the graphemes in the string. Then programmers could
random-access strings without fear, and you wouldn't need multiple internal
representations.

[1] [https://github.com/John-Nagle/rust-rssclient/blob/master/src/wordwrap.rs](https://github.com/John-Nagle/rust-rssclient/blob/master/src/wordwrap.rs)

------
cauterized
Perhaps the problem with Python3's str type is that a string type shouldn't be
challenging to grok?

Also, for those of us who learned computer science back before UTF-16 was a
standard, a "string" has always meant an array of chars, and a char was a
byte. In some languages this is still the case.

In other languages, from Pascal to Java and beyond, a string has been a
distinct class or type, though generally those types began as a thin
abstraction around an array of bytes. Surprise, surprise, that's how Python 2
has always done it.

So Python3 changed which internals it uses for its default string type (pro
tip: a b'foo' object in Python2.7 or Python3 isn't "a bytes", it's "a
bytestring".)

I happen to think that this is a good choice in Python3. It's not 1996 any
more. People expect software to support accented characters and ridiculous
emoji. Default Unicode strings are easier to work with for most purposes that
involve accepting text from a user and returning text for a user. For most of
us it's worth the additional disk space and memory trade-offs even for things
like dict keys that don't benefit from Unicode.

People whose work involves processing binary sequences as strings for
convenience are now inconvenienced and understandably frustrated, but they're
not the majority of the language's users.

The author is correct that people who expect Python3 strings to be arrays of
bytes are mistaken. But the author is wrong to tell people that what they've
worked with as a "string" and considered a "string type" all their lives --
and which is STILL considered a "string" in many languages -- is not a string
at all. It's still a string. It's still a type of string. Even Python still
calls it a byteSTRING. It's just no longer the way Python internally
represents a sequence of characters surrounded by unadorned quote marks.

~~~
arundelo
_(pro tip: a b'foo' object in Python2.7 or Python3 isn't "a bytes", it's "a
bytestring".)_

Quotes from the Python 3 documentation:

 _Bytes literals are always prefixed with 'b' or 'B'; they produce an instance
of the bytes type instead of the str type._

 _class bytes([source[, encoding[, errors]]])_

 _Return a new “bytes” object, which is an immutable sequence of integers in
the range 0 <= x < 256. bytes is an immutable version of bytearray – it has
the same non-mutating methods and the same indexing and slicing behavior._

[https://docs.python.org/3/library/functions.html#bytes](https://docs.python.org/3/library/functions.html#bytes)

[https://docs.python.org/3/reference/lexical_analysis.html#st...](https://docs.python.org/3/reference/lexical_analysis.html#strings)

~~~
d0mine
"bytes" is the type for bytestrings on both Python 2 and 3.

------
bandrami
Gah. No.

Unicode code points (which is what a str is a sequence of) are not characters.
That is not a one-to-one mapping, nor a one-to-many mapping. That is a one-or-
many-or-none to one-or-many "mapping". And glyphs are a third category that we
aren't even getting into.

~~~
jstimpfle
Well, we are slowly approaching the "truth" ;-)

Frankly I think most people don't care about the complexities of Unicode.
Count me in. I treat it as a necessary evil. What I do with it is mostly
concerned with the characters (code points if you insist) from the ASCII range
that are in there (for example, splitting lines or words). I hope it's okay to
ignore code points vs glyphs etc. in this case?

~~~
jstimpfle
Btw. I'm fully aware that this is just the bytes vs unicode issue, taken to
the next level.

The difference, however, is that (a) most data doesn't contain combining code
points, while much data is non-ASCII Unicode, and (b) most software (with the
exception of Perl 6 and probably a few others) doesn't have convenient support
for glyph-level strings yet -- I wouldn't mind if it did.

------
Sami_Lehtinen
Latin-1 files work just fine. The examples in the post are just overly
complicated: open('test.txt','w',encoding='latin-1').write('No need to
separately encode or decode.')

~~~
Sir_Cmpwn
Yours is the correct way of doing it, but I chose to do it this way because I
felt that this example gives more insight into the concepts this article is
trying to explain.

------
brudgers
The way I grok the computer science, strings are sequences of zero or more
characters input into automata and unicode is a way of encoding text. Text is
a sequence of one or more glyphs input into humans. Thus, for me, the
implementation of String in Python 2 is about as sound from a computer science
perspective as an implementation can get. By which I mean that in the end all
data boils down to bits and clustering the bits as bytes is about as
reasonable as alternatives. On the other hand, Python 3 treats strings as text
and this necessitates all the overhead of multiple encodings and converting
glyphs from one language to another language according to the messy and
inconsistent and incomplete rules of human language. For me, it would have
been better if Python 3 (and most other languages) had a 'Text' type in
addition to the String type.

The problem is sloppy use of language in a domain where usually it is ok but
sometimes it isn't.

~~~
Groxx
Treating strings as "a sequence of bytes" is perfectly fine if you never, ever
interpret or manipulate the contents.

As soon as you want to e.g. limit to 160 characters and suffix with an
ellipsis, you run into problems of "what is a character", and you can't even
call it a single unicode codepoint. Is "Z̴̛̺͉͙͚̰̔̏ͧͅ" a single character, or
over 10? Can you even call it a single "glyph", since it's formed of many
units? Here's the unicode representation:

    
    
        Z\u0314\u030F\u0367\u0334\u031B\u033A\u0349\u0359\u035A\u0330\u0345
    

Which is a whopping 46 bytes in UTF-8. Or what about unicode flags and their
2-character representation:
[https://esham.io/2014/06/unicode-flags](https://esham.io/2014/06/unicode-flags)
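
A rough Python illustration of how the counts diverge, using the code-point
sequence above (a sketch; the glyph rendered in this comment may carry even
more marks):

    
    
        import unicodedata
        
        s = 'Z\u0314\u030f\u0367\u0334\u031b\u033a\u0349\u0359\u035a\u0330\u0345'
        print(len(s))                   # 12 code points
        print(len(s.encode('utf-8')))   # 23 bytes for this particular sequence
        # ...and only one base (non-combining) character in the whole pileup:
        print(sum(1 for c in s if not unicodedata.combining(c)))   # 1
    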

Strings as a concept as they currently stand are absolute nonsense. A "Text"
type might resolve the semantic problems (and I love the name, this is a great
idea), but strings _are_ text. Any other use is just abusing the container
because it's easier to type "v1.2.3" than to make a "Version(1,2,3)" structure
(especially when you have to communicate it across different programs /
languages).

~~~
brudgers
To me, because texts get encoded into strings (or strings encode texts) they
are not the same thing. For example, Base 64 encodes data into a string, but
the source is not necessarily a text. That's a separate issue from the way
'string' often gets used in the context of programming...a context in which
even the otherwise pedantic seem to lose the faith ('regular expressions' is
another one that is closely related).

Whether or not Z̴̛̺͉͙͚̰̔̏ͧͅ is a character (in terms of computer science) is a
matter of whether or not it is part of the input language which some machine
accepts. Which is to say it is no different than whether or not 'HashMap' is
part of the input language to a compiler (yes for Java, no for Python).

~~~
Groxx
Base64 is a text representation of a binary blob of data - it's just a
protocol that happens to limit itself to a less-likely-to-be-mangled-by-bad-
string-handling-code sequence of bytes.

Regexes are a great example - they're text that is parsed into a parsing-
engine that can be executed. The text part is just a human-interface protocol
over many possible implementations, and importantly, it has a standard.
Slightly-varying standards at times, but everything trends towards Perl's
version, plus/minus some features. And yes - implementations can choose which
variant(s) they support, because they control its interpretation.

Programming languages don't have a choice about if "Z̴̛̺͉͙͚̰̔̏ͧͅ" is a
character though, if they deal with human-language input and output. Humans
have already decided. When it's displayed, it either is or is not, often based
on the viewer's language (e.g. `str.lower()` is locale-sensitive, but many
programming languages ignore this and only deal with ASCII for English
speakers). If the program doesn't understand how it's dealing with this human
<-> computer protocol and mangles it, it's just as bad as something that
mangles other protocols like TCP/IP, except that humans are _occasionally_
more forgiving.

---

edit: I should probably tl;dr this.

There is a right way and a wrong way to manipulate human-language text. And
it's _extremely_ complicated to do correctly - people are difficult. Shoving
it under the rug and ignoring it entirely, as has been done by most people in
most languages, is 100% the wrong "solution".

Python 3 (or, even better, Swift) has taken a step in the right direction to
reduce accidents -- it's painful because it requires correcting long-standing,
horrifically wrong habits.

~~~
brudgers
I used the term 'regular expressions.' I did not use the term 'regexes'.
Regular expressions are clearly defined and have mathematical properties
including equivalence to (or the ability to unequivocally and fully describe)
finite automata. Conversely, regexes are not clearly defined mathematically.

There's nothing wrong with imprecision when precision is not called for and
particularly when the imprecision facilitates communication. Sometimes however
the abstractions leak and expecting the string type to embody the properties
of human text is one of those...at least in my opinion. Other people may have
different opinions.

~~~
Groxx
Ah, yeah, you're entirely right about the regexes. My mistake.

So I think we mostly agree. My question is then: what _are_ strings for, if
not text? They make a terrible enum, a weirdly-limited escape hatch for
ignoring type systems, and an immensely wasteful protocol.

------
paulsutter
It's true that programmers don't understand Python3 strings. But the language
is to blame for being opaque.

Go is right up front: Go strings use UTF-8 and Go source code is in UTF-8. You
can get the length of a string in bytes, or in runes. There's only one string
type. Since every programmer needs to understand UTF-8 anyway, you can
understand it immediately.

I tried to find how Python3 unicode strings work, but I could not find it
anywhere, until I saw nneonneo's comment here.

------
pvdebbe
Interesting. I haven't ever read about anyone thinking Python 3 is the one
with broken strings; it's always been Python 2.

~~~
calvinlh
Zed Shaw, author of _Learn Python the Hard Way_ , thinks Python 3 strings are
broken. He also needs to read this article.

[https://learnpythonthehardway.org/book/nopython3.html](https://learnpythonthehardway.org/book/nopython3.html)

------
Dzugaru
I'm doing ML in Python 2 and I can't see a single reason to move to Python 3.
Every article is about strings, but in fact I've only needed to work with
Unicode once (I mostly do computer vision tasks, but I once trained an RNN on
a book in my native language just for fun) -- and I don't remember ANY problem
with it in Python 2.

So why should I bother migrating? I moved to C# 7 instantly and never looked
back, for example, because it has tons of stuff. But the only things I hear
about Python 3 are "it's so much better!!!111" and "strings, you need it".

~~~
msl09
If you are not using strings, then very few things differ between Python 3
and 2, so the cost of switching to a better-maintained interpreter is even
smaller for you. The only real problem you might have is a library that lacks
Python 3 support. Also, since you are dealing with ML, I must warn you that
the developers have already committed to not supporting Python 2 in the
future; not sure about numpy, though.

~~~
throwawayish
Python 3 mostly has more stuff than Python 2 in the stdlib, plus some syntax
improvements, most of which are supported by Python 2.7 backports as well. So
yes, arguably there are entire classes of problems where one doesn't need to
care about the differences between Python 2 and 3. But there are also many
classes of problems where the differences matter very much.

------
aeturnum
I agree with the article that Python 3 has great character string handling,
but I'd suggest that the author does not understand why people like Python 2's
system.

Proper handling of unicode glyphs makes for a great demo and is clearly the
proper behavior, but does not represent a common use scenario.

It's fairly rare to be doing sub-string manipulation of user input or of
strings that will be displayed. It's much more common to be transporting data
around inside strings. Maybe you're moving some JSON, or some binary data, or
CSV content. In these examples, you're moving data from one point to another,
and the Python 2 approach is far simpler. What encoding is it? Is it unicode?
For most applications, it does not matter. You can put the data in a str, then
pass that str to a function.

In Python 3, this scenario gets more complicated. Is your data in bytes? Is it
in a str? If it's in a str, how is it encoded? The Python typing system does
not make finding these things out smooth. The details will determine which
functions you can pass the data to and what transformations you (may) need to
perform. Python 3 makes the programmer work harder to pass properly-tagged
string data to functions, which generally makes string handling code more
finicky and complex. The result is a string system that is _much_ more
predictable and understandable in the failure case, but is more verbose in
getting there.

It doesn't help that Python's poor type-management tools mean that your first
indication that you mishandled a string in Python 3 is the same as in Python 2
-- unexpected output to a buffer (often b'string' instead of terminal-killing
garbage).

~~~
Sir_Cmpwn
You shouldn't be moving things like that around in strings. You should be
moving them around as bytes, then decoding them when you need to manipulate
them. Your strategy is probably going to lead to subtly broken behaviors in
your program. The problem here is not how Python 3 handles strings, it's how
you're handling strings.

>It doesn't help that Python's poor type management tools mean that your first
indication you mis-handled a string in Python 3 is the same as Python 2 -
unexpected output to a buffer (often b'string' instead of terminal-killing
garbage).

I agree, Python's behavior around implicit conversion of things to str is
pretty bad. It would be better to throw a TypeError.

~~~
aeturnum
This is exactly what I mean.

I know I should move binary data around as bytes. I'm not saying I want to (or
intend to) move binary data around as strings.

I'm saying that, in Python 2, you get a str -- which happens to be binary --
and you don't have to worry about it. Even if it's unicode, it'll still work;
it just won't matter. In Python 3, it depends on what the library gives you.
Did it give you bytes? Maybe a str? You'll need to check and do the proper
conversion to bytes. The behavior that Python 3 forces on you is safer and
more correct, but it's work that you did not have to do in Python 2.

Sometimes, in Python 3, the library gives you bytes and you want bytes and
you're fine, but it's common that you are not.

------
gigatexal
Couldn't agree more with the article or the conclusion: "python 2 is dead.
Long live python 3"

------
jordiburgos
With these examples, I finally understood the difference between Python 2 and
Python 3.

------
BrandoElFollito
What does "to grok a string" mean? It is used across the article and comments,
yet is not immediately referenced in Google.

~~~
grimoald
[https://www.merriam-webster.com/dictionary/grok](https://www.merriam-
webster.com/dictionary/grok)

------
the8472
>>>open(b'test-\xd8\x00.txt', 'w').close()

Is that a C-string? Does it crash if one forgets the null byte?

~~~
masklinn
> Is that a C-string?

It's a Python bytestring.

> Does it crash if one forgets the null byte?

No. But in that case it _will_ crash when you use it as CPython (and most FS
APIs, and most FS) don't allow NUL in filenames.

~~~
pvg
It's not going to crash. You'll get an error.

------
cdevs
It says bytes is an array of bytes; does the author mean bits?

~~~
MattConfluence
It's a bit of a clunky sentence because of the name collision between the
concept of multiple bytes and the Python data type.

A byte is a sequence of bits, and "bytes" is the name of Python's
sequence-of-bytes type.

------
pwdisswordfish

        >>> 'おはようございます'[::-1]
    

Why do people keep using this as an example? A character is not necessarily a
single code point; reversing code points is not any more meaningful than
reversing bytes. And reversing strings isn't something frequently needed in
practice, either.
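
A sketch of how code-point reversal goes wrong the moment combining marks are
involved:

    
    
        s = 'cafe\u0301'    # "café" spelled with a combining acute accent
        print(s[::-1])      # '\u0301efac': the accent is now orphaned at the front
    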

Let me make this as clear as possible:

    
    
        a code point is not a character
    
        a code point is not a character
    
        a code point is not a character
    

I want the author to read that, over and over again, until it sinks in.

------
the_mitsuhiko
> The only problem with Python 3’s str is that you don’t grok it

Sadly this quote shows a fundamental lack of understanding of the problems
with the Python 3 string type.

~~~
jstimpfle
Downvoted for surprising lack of substance.

Note I do read many of your blog posts -- and thanks for writing them, there
is a lot of insight for me.

I also did read most or all of your blog posts about Python 3 unicode
handling. The thing is, while most or all of the facts presented there are
"true", many of the negative conclusions are just your opinion, stemming from
years and years of doing it "your way" (I would call it FUD, but I have strong
opinions too). Python 2's unpredictable implicit conversions are hardly "sane"
(that's one of the unjustified claims made there). Do you also call
JavaScript's "==" sane?

For me, Python 3's str has worked like a charm for years. I like the strict
separation of high-level and low-level affairs. I like how I can treat files
as text files (dealing only with Python 3 _str_ s) and don't have to think
about low-level affairs, or types, or conversions, which is a great boon for
scripting. I like how I can drop down to the bytes level when needed and know
exactly where I stand.

For some balance: I wouldn't expect conversion to Python 3 to always be easy.
But I would blame Python 2 for missing clean concepts, not Python 3.

Also, it's hardly convenient to code most of a Python 3 app at the byte level.
But it could be argued that Python should not be used for those things.

~~~
ThePhysicist
I agree that stricter coercion rules make sense in situations where the
coercion might be ambiguous; whether adding byte strings and Unicode strings
together is such a case is debatable, though (I think it is), as in many
cases the implicit conversion yields an acceptable outcome. Again, this is
more a question of design philosophy, but the thing with Python (2) is that
implicit coercion was the default behavior in many cases, so changing that is
painful.

Personally, I think we should double down on type annotations and stronger
(optional) typing for Python, because the lack of a good type system is by
far the biggest obstacle to building robust, large systems in Python.

------
Grue3
In Python 2.7.12:

    
    
        >>> s = u'おはようございます'
        >>> print s[::-1]
        すまいざごうよはお
    

The only problem with Python 2.7 strings is that the author of the article
doesn't grok them. Just use u'' instead of ''. There is also "from __future__
import unicode_literals".

~~~
haikuginger
In both of those cases you're no longer using a Python 2 string, in that it's
not a bytes-like object. For example, you can't write a Unicode object across
a socket -- it has to be converted to bytes first.

~~~
Grue3
unicode is a subtype of basestring. And it existed since the beginning of
Python 2, afaik. It's extremely unfair to compare Python 3 strings to Python 2
str, when Python 3 str is just a rename of Python 2 unicode and has basically
the same features. And the article doesn't even mention that Python 2 has all
the same features.

~~~
daenney
> And the article doesn't even mention that Python 2 has all the same
> features.

Because it's not about the fact that you can achieve the same thing. What the
article does is illustrate that a thing that returns 'str' when you call
'type' on it in Python 2 is not the same, nor can it be used in the same way,
as what Python 3 would call a 'str'.

~~~
Grue3
No. If that were the author's intention, he would've simply pointed out that
"unicode" was renamed to "str", while "str" was renamed to "bytes" and is no
longer considered a string. Instead, he wanted to demonstrate that Python 3
string handling is some sort of major breakthrough when it's largely a
cosmetic change.

~~~
jjawssd
It's a cosmetic change which happens to completely derail ill-informed Python
programmers, so yeah, it's a big deal.

------
rini17
So, for example, what is The Py3k Right Way to fetch a URL with text content
into a string? The example at
[https://docs.python.org/3/library/urllib.request.html#examples](https://docs.python.org/3/library/urllib.request.html#examples)
just says "it's complicated, we just know it's utf-8", and even that would
bomb out if there's a character spanning the 100-byte boundary.

To me it looks like, by insisting on lossy byte/string conversion of all I/O,
the language painted itself into a corner, with a funny "a bytes is not a
string" chanting sideshow.

~~~
masklinn
The "right way" _in every single language_ is to follow the encoding sniffing
algorithm:
[https://html.spec.whatwg.org/multipage/syntax.html#encoding-...](https://html.spec.whatwg.org/multipage/syntax.html#encoding-
sniffing-algorithm).

You may want to note the following:

1\. implementing the entire encoding-sniffing algorithm for a basic example is
a bit extreme

2\. the very first step of the encoding-sniffing algorithm is "if the use has
explicitly provided an encoding, use it", which is essentially what the
example does

> To me it looks like by insisting on lossy byte/string conversion of all I/O
> the language painted itself into a corner

That makes literally no sense. If you want to "fetch an URL with _text
content_ into a _string_ " — emphasis on _string_ , not _bag of bytes_ , if
you want a bag of bytes you can skip the whole decoding thing in Python it's
_not necessary_ — there's no other way than to decode it, which means you need
to assert or discover its encoding, for all you know the document could be in
Big5.
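
A minimal stdlib-only sketch of "assert or discover its encoding" (using
example.com as a stand-in; this only honors the charset the server declared,
falling back to UTF-8, whereas the full sniffing algorithm also checks BOMs
and <meta> tags):

    
    
        from urllib.request import urlopen
        
        with urlopen('https://example.com/') as resp:
            raw = resp.read()                                   # bytes off the wire
            charset = resp.headers.get_content_charset() or 'utf-8'
            text = raw.decode(charset)                          # now a str
    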

~~~
haikuginger
Seriously. It's easy to "fetch a URL with text content into a string" with
Python 2 right now _because Python 2 assumes that all bytes are ASCII_.

~~~
TheDong
No, it's not! It's easy to fetch a URL's content into a bag of bytes. That's
it.

If the content of the page happens to contain an emoji (which, hey, more and
more do) or even a friggin' unicode double-quote mark, then you're no longer
fetching it into a valid string, if you take "string" to mean a faithful
textual representation. You've got all the bytes there, but you can't
interpret them correctly.

This is a real problem that has caused real pain for me with various tools
written in Python 2.
------
tnat0r
I've got nothing against Python3. But recognize that Python3 got itself into
this mess by being incompatible with Python2. It's different language, albeit
superficially similar to Python2. Python3 should have been given another name
like "Bob" or something. Ditto for Perl6 vs Perl5 and Angular2 vs Angular1.

~~~
throwawayish
Everyone is running around talking about semver, but then tells you that
you're supposed to rename your project when you make breaking changes? Sure.

And it's not like there was no early announcement that it wouldn't be 100%
compatible. That was officially announced 10+ years ago.

~~~
tnat0r
It's not a matter of lead time in announcing it. If it's not backwards
compatible then it's a different language. I can still run 20 year old C code
in a modern C compiler.

~~~
teddyh
In languages, this is accepted practice. Was Perl 5 backward compatible with
Perl 4? Or C11 with C99? Or even C99 with C89, or K&R C?

If you use reasonable and forward-looking language constructs, your code will
run on – or at least be very easy to port to – newer major versions of the
language. This is also true for the transition from Python 2 to Python 3. The
problem is that _a lot_ of people used (and still use) Python 2 badly.

~~~
tnat0r
> Was Perl 5 backward compatible with Perl 4? Or C11 with C99? Or even C99
> with C89, or K&R C?

Yes.

~~~
throwawayish
Actually, no.

There are many differences between K&R/UNIX C and C89. For example, in K&R C,
string constants could be modified, and repetitions of the same constant would
be distinct strings. This is not the case in C89. Variadic functions are
different. Octal numbers were changed. Arithmetic works differently. And so
on.

I can't comment on Perl 4 vs 5 vs 6.

~~~
b2gills
The latest version of Perl 5 is almost entirely backwards compatible with
every earlier version of Perl 5. (There are some features which almost no-one
used, and that should never be used anyway, that were eventually removed.)
Perl 5 is backwards compatible with Perl 4

Perl 4 is just a renamed version of Perl 3, released to coincide with the
book "Programming Perl"

Perl 3 is backwards compatible with Perl 2

Perl 2 is backwards compatible with the original Perl

Many of the problems that new Perl programmers have with learning Perl stem
from the fact that every new feature had to be added in a way that didn't
break existing code. That is why, for example, you have to add "use strict"
and "use warnings" to every Perl source file, even though that should be the
default.

Perl 6 exists because we "wanted to break everything that needs breaking", so
it is very different than any previous version.

That is why both Perl 5 and Perl 6 will both continue to be supported
languages.

Imagine taking good ideas from every high level modern programming language,
bringing them all together, while making the features seem like they have
always belonged together. That is Perl 6.

I like to say that as Perl 4 is to Perl 5 is to Perl 6, C is to C++ is to
Haskell, C#, Smalltalk, BNF, Go, etc.

------
RantyDave
Oh I grok it just fine, it just sucks horribly.

I may be missing something "pythonic", but casting between byte strings and
ASCII strings -- a null operation -- is by _far_ my greatest cause of runtime
bugs.

To make it worse, the obvious cast is broken beyond all belief. What do we get
from str(b'hello world')? Literally "b'hello world'" -- a __repr__ of the
object. Of course, adding an encoding changes the actual functionality of the
cast: str(b'hello world', 'ascii') gives 'hello world'. So, so broken.
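
For reference, the behavior being complained about, in a quick interpreter
session:

    
    
        >>> str(b'hello world')             # no encoding: you get the repr
        "b'hello world'"
        >>> str(b'hello world', 'ascii')    # with an encoding: an actual decode
        'hello world'
        >>> b'hello world'.decode('ascii')  # the unambiguous spelling
        'hello world'
    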

~~~
Walkman
You don't grok it. Bytes are not ASCII strings. They are NUMBERS between 0 and
255. Strings are strings. DECODING bytes with ASCII encoding into TEXT is not
"casting", it's DECODING.

~~~
jstimpfle
I wouldn't say they are not ASCII strings, because they are if you squint.

Better say "bytes are not Unicode strings".

~~~
Walkman
No, they are NOT strings, they are not any kind of strings. They are bytes:

    
    
        >>> b'asd'[0]
        97
        >>> b'asd'[1]
        115
        >>> b'asd'[2]
        100
    

They happen to contain an ASCII DECODABLE string in this case, but to get the
text back, you NEED TO decode it first:

    
    
        >>> text = b'asd'.decode()
        >>> text[0]
        'a'
        >>> text[1]
        's'
        >>> text[2]
        'd'

~~~
jstimpfle
Technically true. I'd inferred a different context on this one: by "strings"
I meant the abstract concept, not Python's _str_ type.

