
The Python Unicode Mess - psibi
http://changelog.complete.org/archives/9938-the-python-unicode-mess
======
acdha
As far as I can tell this is a long-form “I used to be able to ignore encoding
issues and now it’s a ‘mess’ because the language is forcing me to be
correct”. Each of the examples cited was a source of latent bugs in code he
thought was working, only because the errors were being ignored.

Only his third bit of advice isn’t wrong, and treating it as something unusual
shows the problem: the only safe way to handle text has always been to decode
bytes as soon as you get them, work with Unicode, and encode only when you
send it back out. Anything else is extremely hard to get right, even if many
English-native programmers were used to being able to delay learning why for
long periods of time.
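The decode-early, encode-late pattern described here amounts to a sketch like this (the sample string and encoding are illustrative):

```python
# 1. Decode bytes to text the moment they arrive (socket, file, etc.).
raw = "Füße".encode("utf-8")      # stand-in for bytes received from outside
text = raw.decode("utf-8")

# 2. All processing happens on str. Note that byte-level tricks would
#    mangle this: upper() correctly maps the single character ß to SS.
upper = text.upper()

# 3. Encode back to bytes only at the output boundary.
out = upper.encode("utf-8")
print(upper)   # FÜSSE
```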

~~~
masklinn
> As far as I can tell this is a long-form “I used to be able to ignore
> encoding issues and now it’s a ‘mess’ because the language is forcing me to
> be correct”.

The problem with that view is that there are things for which you _can not_ be
correct, and there are no encoding issues because there _is no encoding_ (or
if there is one it does not map to proper unicode):

* UNIX files and paths have no encoding, they're just bags of bytes, with specific bytes (not codepoints, not characters, bytes) having specific meaning

* Windows file and path names are sequences of UTF-16 _code units_ but not actually UTF-16 (they can and do contain unpaired surrogates), as above with specific code units (again not codepoints or characters) having specific meaning

These are issues you will encounter on user systems; there is no "forcing you
to be correct". A non-unicode path is not incorrect: on many systems it just
is. OSX is one of the few systems where a non-unicode path is actually
incorrect, which means you will not encounter one as input and so have no
reason to handle this issue at all.
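In Python terms, the UNIX case looks like this on a typical Linux system (a sketch; the file name is an arbitrary invalid-UTF-8 example):

```python
import os
import tempfile

# A name containing the byte 0xFF, which is not valid UTF-8 -- perfectly
# legal on most UNIX filesystems, where a name is just bytes without
# '/' or NUL.
with tempfile.TemporaryDirectory() as d:
    bad = os.path.join(os.fsencode(d), b"report-\xff.txt")
    open(bad, "wb").close()

    # Pass bytes to the os APIs and you stay in the bytes domain:
    # no decoding is attempted, so nothing can blow up.
    names = os.listdir(os.fsencode(d))
    print(names)
```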

> Only his third bit of advice isn’t wrong and treating it as something
> unusual shows the problem: the only safe way to handle text

That's where you fail, and to an extent so does Python: some text-like things
_are not actually text_. Path names are famously one such case. You're trying
to hammer the square peg of path names into the round hole of unicode.

~~~
acdha
That’s just restating my point: Unix filenames are bytes (on most filesystems,
anyway). The fact that many people were able to conflate them with text
strings was a convenient fiction. Python no longer allows you to maintain that
pretense, but it’s easy to deal with: treat them as opaque blobs, attempt to
decode and handle errors, or perform manipulations on the bytes directly.

~~~
ubernostrum
One thing that amuses me given the number of complaints about the Python 3
string transition is how vastly _better_ Python 3 is for working with bytes.
The infrastructure available is light-years ahead of what Python 2 offered,
precisely because it gave up on trying to also make bags of bytes be the
default string type.

~~~
ak217
Thank you for saying that. Working with strings and bytes in Python 3 is
nothing short of a joy compared to the dodgy stuff Python 2 did. People who
complain about the change are delusional.

~~~
necovek
The only problem I have with Python3 strings/bytes handling is the fact that
there are standard library functions which accept bytestrings in Py2 (regular
"" strings), and Unicode strings in Py3 (again, regular "" strings in Py3).

This has led to developers attempting to conflate the two distinctly different
concepts and make APIs support both while behaving differently.

A simple solution is there in plain sight: just use exclusively b"" and u""
strings for any code you wish to work in both Py2 and Py3, and forget about
"". All and any libraries should be using those exclusively if they support
both. Python3-only code should be using b"" and "" instead.
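The convention reads like this (valid under both Python 2.7 and Python 3.3+, where the u prefix was re-allowed):

```python
# Explicit prefixes remove the version ambiguity of a bare "" literal:
data = b"raw bytes"        # bytes under both Python 2 and Python 3
text = u"proper text"      # unicode under Python 2, str under Python 3

# The bare literal is the trap: bytes in Py2 but text in Py3,
# so code using it silently changes meaning between versions.
ambiguous = "which one am I?"
```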

One could consider this a design oversight in Python 3: the fact that the
syntax is so similar elsewhere makes people want to run the same code in both,
yet a core type is basically incompatible.

~~~
joshuamorton
u"" is a syntax error in python3 (or at least it was for a while; apparently
it's not anymore, that said...). The correct cross-version solution is to do

    from __future__ import unicode_literals

which makes python2 string literals unicode unless declared bytes. Then ""
strings are _always_ unicode and b"" strings are always bytes, no matter the
language version.

~~~
ak217
> u"" is a syntax error in python3

This has not been the case since 2012. The last release of Python 3 for which
this was the case reached end of life in February 2016. Please stop
misinforming people.

~~~
Gorgor
While u"" is accepted in current Python 3, for some reason they ignored the
raw unicode string ur"", which is still a syntax error in Python 3. So,
unicode_literals is definitely preferable.

------
ptx
Text encoding in general is a mess, and Python 2 Unicode support was a mess,
but Python 3 makes it _much less_ of a mess.

I think the author has a mess on his hands because he's trying to do it the
Python 2 way – processing text without a known encoding, which is not really
possible, if you want the results to come out right.

To resolve the mess in Python 3, choose what you actually want to do:

1. Handle raw bytes without interpreting them as text – just use bytes in
this case, without decoding.

2. Handle text with a known encoding – find out the encoding out-of-band from
some piece of metadata, decode as early as possible, handle the text as
strings.

3. Handle Unix filenames or other byte sequences that are usually strings but
could contain arbitrary byte values that are invalid in the chosen encoding –
use the "surrogateescape" error handler; see PEP 383:
[https://www.python.org/dev/peps/pep-0383/](https://www.python.org/dev/peps/pep-0383/)

4. Handle text with unknown encoding – not possible; try to turn this case
into one of the other cases.
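Case 3 with surrogateescape looks like this: undecodable bytes are smuggled through as lone surrogates and restored exactly on encode.

```python
# A byte sequence that is mostly UTF-8 but contains a stray 0xFF.
raw = b"caf\xc3\xa9-\xff.txt"

# surrogateescape maps the bad byte to a lone surrogate (U+DCFF)
# instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", errors="surrogateescape")
print("\udcff" in name)   # True

# The round trip is lossless: encoding with the same handler
# restores the original bytes exactly.
assert name.encode("utf-8", errors="surrogateescape") == raw
```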

Also, watch Ned Batchelder's excellent talk, _Pragmatic Unicode, or, How do I
stop the pain?_ , from 2012: [https://pyvideo.org/pycon-us-2012/pragmatic-
unicode-or-how-d...](https://pyvideo.org/pycon-us-2012/pragmatic-unicode-or-
how-do-i-stop-the-pain.html)

~~~
zeroname
> To resolve the mess in Python 3, choose what you actually want to do...

The thing is that this is not actually going to happen. Programs are simply
broken _across the board_ , because few people can be bothered to deal with
all these peculiarities.

The difference is, in Python 2, output would be corrupted in some edge cases,
but generally it would "just work". In Python 3, the program falls flat on its
face even in cases that would've ended up working fine in Python 2.

I don't think there's a general answer on which behavior causes less real-
world problems total, but the idea that Python 3 makes less of a mess is not
something I can agree with.

~~~
Trombone12
This just reflects your experience of only speaking English; to most of the
world, their native language is _not_ an edge case.

~~~
zeroname
> This just reflects your experience of only speaking English; to most of the
> world, their native language is not an edge case.

Excuse me, I don't exclusively deal in 7-bit ASCII characters just because I
happen to speak English, which isn't the _only_ language I speak either.

------
garethrees
There is a particular use case which leads to frustration with Python 3, if
you don't know the latin1 trick.

The use case is when you have to deal with files that are encoded in some
unknown ASCII-compatible encoding. That is, you know that bytes with values
0–127 are compatible with ASCII, but you know nothing whatsoever about bytes
with values 128–255.

The use case arises when you have files produced by legacy software where you
don't know what the encoding is, but you want to process embedded ASCII-
compatible parts of the file as if they were text, but pass the other parts
(which you don't understand) through unchanged (for example, the files are
documents in some markup language, and you want to make automatic edits to the
markup but leave the rest of the text unchanged). Processing as text requires
you to decode it, but you can't decode as 'ascii' because there are high-bit-
set bytes too.

The trick is to decode as latin1 on input, process the ASCII-compatible text,
and encode as latin1 on output. The latin1 character set has a code point for
every byte value, and bytes with the high bit set will pass through unchanged.
So even if the file was actually utf-8 (say), it still works to decode and
encode it as latin1, and multi-byte characters will survive this process.

The latin1 trick deserves to be better known, perhaps even a mention in the
porting guide.
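A sketch of the trick, using a made-up markup edit (the file here happens to be UTF-8, but the code never needs to know that):

```python
# Bytes that are actually UTF-8 ("é" is \xc3\xa9), but we pretend not to know.
raw = "<b>café</b>".encode("utf-8")

# Decode as latin1: every byte maps to exactly one code point, so decoding
# cannot fail and high bytes pass through as (meaningless) characters.
text = raw.decode("latin1")

# Edit only the ASCII markup; leave everything else alone.
edited = text.replace("<b>", "<strong>").replace("</b>", "</strong>")

# Encode back as latin1: the original UTF-8 bytes for "é" survive intact.
out = edited.encode("latin1")
assert out == "<strong>café</strong>".encode("utf-8")
```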

~~~
teddyh
> _The latin1 character set has a code point for every byte value_

No it doesn’t. The whole range of 128-159 are undefined. However, the old MS-
DOS CP-437 encoding (which is incompatible with latin1/ISO-8859-1) _does_. So
your trick is valid, but _not_ with latin1.

~~~
teddyh
I can’t edit my post now, but it turns out I was _wrong_. The range of 128-159
_are_ defined in ISO-8859-1, as little-used “control characters”:

[https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_set](https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_set)

So, the trick described by garethrees does work with latin1, and I was
mistaken.

------
perlgeek
The real problem here is that

* UNIX file systems allow any byte sequence that doesn't contain / or \0 as file and directory names

* User interfaces have to render that as strings, so they must decode

* There is no metadata about what the file name encoding is

Many programs use the encoding from the current locale, which is mostly a good
assumption, but the way that locales scope (basically per process) has nothing
to do with how file names are scoped.

So, many programs make some assumptions. Some models are:

1) Assume file names are encoded in the current locale

2) Assume file names are encoded in UTF-8

3) Don't assume anything

The "correct" model would be 3), but it's not very useful. People want to be
able to sort and display file names, which generally isn't very useful with
binary data.

Which is why most programs, including python, use 1) or 2), and sometimes
offer some kind of kludge for when the assumption doesn't hold -- and
sometimes not.

IMHO a file system should store an encoding for the file names contained in
it, and validate on writes that the names are correct. But of course that
would be a huge POSIX incompatibility, and thus won't happen.

People just live with the current models, because they tend to be good enough.
Mostly.

~~~
goerz
Would it really be POSIX-incompatible? Does the standard mandate that a
filesystem can place no such restriction on top of "filenames are unencoded
bytes"? If not, then it's just that tools cannot blindly assume filenames are
decodable. Isn't MacOS guaranteeing UTF8 these days, while still being POSIX-
compliant?

~~~
perlgeek
I don't know if there is an explicit mandate for that in the standard, but
forbidding things that were previously allowed, both in code and
documentation, is not backwards compatible.

Just imagine a file system that wouldn't allow the character "e" in file
names.

Of course, the impact would not be as drastic, but it's still backwards
incompatible.

------
zorkw4rg
I'm not so sure other languages do this any better (nodejs, for instance,
doesn't support non-unicode filenames at all). Modern python does a pretty
good job of supporting unicode; calling it a "Mess" is just very much not
true. People always like to hate on python, but other languages supposedly
designed by actually capable people mess up other stuff all the time. Look at
how the great Haskell represents strings, for instance, and what a
clusterfuck[1] that is.

[1] [https://mmhaskell.com/blog/2017/5/15/untangling-haskells-
str...](https://mmhaskell.com/blog/2017/5/15/untangling-haskells-strings)

~~~
masklinn
Rust is probably one of the languages which does this crap best, and that's
thanks to static typing and deciding to not decide:

1. it has proper, validated _unicode_ strings (though the stdlib is not
grapheme-aware so manipulating these strings is not ideal)

2. it has proper _bytes_, entirely separate from strings

3. it has "the OS layer is a giant pile of shit" OsString, because file paths
might be random bags of bytes (UNIX) or random bags of 16-bit values
(Windows), and possibly some other hare-brained scheme on other platforms,
but I don't believe rust supports other osstrings currently

4. and it has the nul-terminated bag o'bytes CString

For the latter two, conversion to a "proper" language string is explicitly
known to be lossy, and the developer has to decide what to do in that case for
their application.

~~~
jcranmer
> 1. it has proper, validated unicode strings (though the stdlib is not
> grapheme-aware so manipulating these strings is not ideal)

Grapheme clusters are overrated in their importance for processing. The list
of times you want to iterate over grapheme clusters:

1. You want to figure out where to position the cursor when you hit left or
right.

2. You want to reverse a string. (When was the last time you wanted to do
that?)

The list of times when you want to iterate over Unicode codepoints:

1. When you're implementing collation, grapheme cluster searching, case
modification, normalization, line breaking, word breaking, or any other
Unicode algorithm.

2. When you're trying to break text into separate RFC 2047 encoded-words.

3. When you're trying to display the fonts for a Unicode string.

4. When you're trying to convert between charsets.

Cases where neither is appropriate:

1. When you want to break text to separate lines on the screen.

2. When you want to implement basic hashing/equality checks.

(I'm not sure where "cut the string down to 5 characters because we're out of
display room" falls in this list. I suspect the actual answer is "wrong
question, think about the problem differently".)

Grapheme clusters are relatively expensive to compute, and their utility is
very circumscribed. Iterating over Unicode codepoints is much more useful and
foundational, and yet still very cheap.
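The cost difference is easy to see in Python, whose str iterates code points natively; grapheme segmentation needs the Unicode table data that third-party libraries (e.g. the `grapheme` package, not used here) carry around:

```python
# "é" in decomposed form: 'e' followed by COMBINING ACUTE ACCENT.
s = "cafe\u0301"

# Iterating code points is built in and cheap:
print(len(s))      # 5 code points
print(list(s))     # ['c', 'a', 'f', 'e', '\u0301']

# But a reader perceives 4 characters: the combining mark belongs to the
# same grapheme cluster as the 'e'. Finding that boundary requires the
# Unicode segmentation rules, which the stdlib does not expose.
```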

~~~
zbentley
> Grapheme clusters are overrated in their importance for processing. The list
> of times you want to iterate over grapheme clusters:

> 1. You want to figure out where to position the cursor when you hit left or
> right.

> 2. You want to reverse a string. (When was the last time you wanted to do
> that?)

You missed the big one:

3. You want to determine the logical (and often visual) length of a string.

Sure, there are some languages where logical-length is less meaningful as a
concept, but there are many, many languages in which it's a useful concept,
and can only be easily derived by iterating grapheme clusters.

~~~
anonymfus
Visual length of a string is measured in pixels and millimetres, not
characters. In a font/graphics library, not in a text processing one.

~~~
zbentley
Sorry, visual length as in visual number of "character-equivalent for purposes
of word length" things. Those things are close to, but not exactly the same
as, grapheme clusters, so the latter can often be used as an imperfect (but
much more useful than unicode points or bytes) proxy for the former.

There's no perfect representation of number-of-character-equivalents that
doesn't require understanding of the language being handled (and it's
meaningless in some languages as I said), but there are _many_ written
languages in which knowing the length in those terms is both extremely useful
and extremely hard to do without grapheme cluster identification.

~~~
Groxx
> _character-equivalent for purposes of word length_

Serious question: why would you want to do this?

I know it's fashionable to limit usernames to X characters... but why? The
main reason I've seen has been to limit the _rendered length_ so there are
some mostly-reliable UI patterns that don't need to worry about overflows or
multiple lines. At least until someone names themselves:

Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ Ｗ

Which is 20 characters, no spaces, and will break loads of things.

(I'm intentionally ignoring "db column size" because that depends on your
encoding, so it's unrelated to graphemes)

~~~
ubernostrum
_Serious question: why would you want to do this?_

Have you never, in your entire life, encountered a string data type with a
length rule? All sorts of ID values (to take an obvious example) either have
fixed length, or a set of fixed lengths such that every valid value is one of
those lengths, and many are alphanumeric, meaning you cannot get round length
checks by trying to treat them as integers. Validating/understanding these
values also often requires identifying what code point, not what grapheme, is
at a specific index.

Plus there are things like parsing algorithms for standard formats. To take
another example: you know how people sometimes repost the Stack Overflow
question asking why "chucknorris" turns into a reddish color when used as a
CSS color value? HTML5 provides an algorithm for parsing a (string) color
declaration and turning it into a 24-bit RGB color value. That algorithm
requires, at times, checking the length _in code points_ of the string, and
identifying the values of code points at specific indices. A language which
forbids those operations cannot implement the HTML5 color parsing algorithm
(through string handling; you'd instead have to do something like turn the
string into a sequence of ints corresponding to the code points, and then
manually manage everything, and why do that to yourself?).
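Python permits both operations described above; a sketch of code-point-level validation (the 8-character alphanumeric ID format here is invented for illustration):

```python
def valid_id(s: str) -> bool:
    """Validate a hypothetical fixed-length alphanumeric ID by code point."""
    # Length in code points -- not bytes, not graphemes:
    if len(s) != 8:
        return False
    # Inspect the code point at each index directly.
    return all(("0" <= c <= "9") or ("A" <= c <= "Z") for c in s)

print(valid_id("AB12CD34"))   # True
print(valid_id("AB12CD3"))    # False: too short
print(valid_id("AB12CD3é"))   # False: 'é' is outside the allowed range
```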

~~~
Groxx
Yes. All instances I've seen have been due to byte-size restrictions (so it
depends on encoding) or to visual reasons (based on fundamentally flawed
assumptions). With exceptions for dubious science around word lengths between
languages / as a difficulty/intelligence proxy, or just having fun identifying
patterns (interesting, absolutely, but of questionable utility / bias).

But every example you've given has been about visuals, byte sizes, or code
points (which are unambiguously useful, yes). Nothing about _graphemes_.

------
aeturnum
My main criticism of Python 3's changes to strings is that it has become much
more specific about strings.

In Python 2, if you have a series of bytes -or- a "string", the language has
no opinion about the encoding. It just passes around the bytes. If that set of
bytes enters and exits Python without being changed, its format is of no
concern. Interactions do not force you to define an encoding. This is _not
correct_ , but it is _often functional._

Python 3, on the other hand, forces you to have an opinion about the encoding
if you ever treat bytes as a string, and the same goes for converting back to
bytes. For uncommon or unexpected encodings, the chance of this going wrong in
a casual, accidental way is much higher. Of course, the approach is more
correct, but it doesn't _feel_ more correct to the programmer.

~~~
ak217
> it doesn't feel more correct to the programmer.

I agree with the details of what you said, but the insidious thing about how
Python 2 organized strings and encodings is that most programmers were free to
ignore it and produce buggy software. Then, later, people who had to use that
software on non-ascii data would try to use it and it would blow up. This
would lead to a very painful cycle of shaking out bugs that the original
author may not even be motivated to fix.

The decision to force encodings to be explicit and strings/bytes to be
separate was a great design change. It literally made all our code more
valuable by removing hidden bugs from it.

------
nicolaslem
For anyone interested in learning why Python 3 works this way I highly
recommend the blog of Victor Stinner[0].

As for the article, this is nothing new. The problem is similar to the issues
raised by Armin Ronacher[1]. These problems are well known and Python
developers address them one at a time. Issues around these egde cases have
improved since the initial release of Python 3.0.

[0] [http://vstinner.github.io](http://vstinner.github.io)

[1] [http://lucumr.pocoo.org/2014/5/12/everything-about-
unicode/](http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/)

------
burntsushi
This article is kind of hard to evaluate, because the OP doesn't provide an
example program with an example input that fails. So it's hard to judge
whether the solution presented here is actually ideal. Instead, we're forced
to just take the OP's word for it, which is kind of uncomfortable.

I do somewhat agree with the general sentiment, although I find it difficult
to distinguish between the problems specifically related to its handling of
Unicode and the fact that the language is unityped, which makes a lot of
really subtle things very implicit.

~~~
yorwba
The OP links to StackOverflow, where failing inputs are mentioned in the
comments on the accepted answer. And the second-most upvoted answer explains
that _.decode('unicode_escape')_ only works for Latin-1 encoded text:
[https://stackoverflow.com/a/24519338](https://stackoverflow.com/a/24519338)

~~~
lozenge
The question being how to parse character escapes (backslash sequences) in
Python.

To be honest, you could write a custom character-by-character parser easily or
even use the regex module.

------
flohofwoe
IMHO the whole python3 string mess could have been prevented if they had
chosen UTF-8 as the only string encoding, instead of adding a strict string
type with a lot of under-the-hood magic. That way strings and byte streams
could remain the same underlying data, just as in python2. The main problem I
have with byte streams vs strings in python3 is that it adds strict type
checking at runtime which isn't checked at 'authoring time'. Some APIs even
make upfront type checking impossible, even if type hints were provided (e.g.
reading file content returns either a byte stream or a string, depending on
the value of a string parameter passed to the file open function).

Recommended reading: [http://utf8everywhere.org/](http://utf8everywhere.org/)
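The open() behaviour being objected to looks like this (a sketch with a throwaway temp file): the runtime type of read() hinges on the mode string, which no amount of upfront checking of the file itself can reveal.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("hello")

# Same file, same read() call -- the return type depends on the mode string:
with open(path, "r", encoding="utf-8") as f:
    print(type(f.read()))   # <class 'str'>   (text mode decodes)
with open(path, "rb") as f:
    print(type(f.read()))   # <class 'bytes'> (binary mode does not)
```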

~~~
marcosdumay
> IMHO the whole python3 string mess could have been prevented if they had
> chosen UTF-8 as the only string encoding instead of adding a strict string
> type with a lot of under-the-hood magic.

That is basically what Python2 does, and it is completely wrong.

~~~
flohofwoe
Can you give any reasons why this is completely wrong? The web seems to work
just fine with UTF-8. The advantage is that you can pass string data around as
generic byte streams without even knowing about the encoding. You'll only have
to care about the encoding at the end points.

~~~
toyg
You are joking, right? Have you ever seen non-English webpages? More often
than not, a multitude of ??? and Chinese characters pop up at some point or
another.

~~~
flohofwoe
I'm from Germany so I've seen a few non-English webpages. I can't remember
having seen any text rendering problems since the late 90's or so.

~~~
guitarbill
this is because browsers have very sophisticated algorithms to detect the
encoding, because this was such a frequent issue. (and yes, UTF-8
adoption/support has been growing, which also helps)

being German and working in a multi-national company, i can confirm it is
still very much an issue with software that doesn't handle this. Excel is one
of the worst offenders; document corruption is rife, especially when going
between Excel on Windows and Excel for Mac. this is because Excel doesn't use
UTF-8 as default for legacy reasons (I think), but also either doesn't have
encoding detection or has very bad encoding detection.

------
minitech
> And the environment? [it’s not even clear.]
> [https://stackoverflow.com/questions/44479826/how-do-you-
> set-...](https://stackoverflow.com/questions/44479826/how-do-you-set-a-
> string-of-bytes-from-an-environment-variable-in-python)

That question is about interpreting backslash escape sequences for bytes in an
environment variable. All this person wants is `os.environb` (and look, its
existence highlighted a Windows incompatibility, saving them from subtle bugs
like every other Python 3 improvement).
[https://docs.python.org/3/library/os.html#os.environb](https://docs.python.org/3/library/os.html#os.environb)
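A sketch of os.environb on a POSIX system (the variable name and value are made up):

```python
import os

# os.environ decodes variables to str via the filesystem encoding;
# os.environb (POSIX only) exposes the same mapping as raw bytes.
os.environb[b"DEMO_BLOB"] = b"\xde\xad\xbe\xef"

print(os.environb[b"DEMO_BLOB"])   # the bytes come back untouched

# The str view uses surrogateescape, so even undecodable bytes
# remain recoverable through os.fsencode:
assert os.fsencode(os.environ["DEMO_BLOB"]) == b"\xde\xad\xbe\xef"
```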

~~~
mixmastamyk
Thanks, never noticed environb. I’m still learning new things about Python 3
ten years later.

------
TimJYoung
Getting Unicode right, especially with various file systems and cross-platform
implementations is hard, for sure. But, I think this quote:

"And, whatever you do, don’t accidentally write if filetype == "file" — that
will silently always evaluate to False, because "file" tests different than
b"file". Not that I, uhm, wrote that and didn’t notice it at first…"

shows a behavior that, to me, is inexcusable. The encoding of a string should
never cause a comparison to fail when the two strings are equivalent _except
for the encoding_. For example, in Delphi/FreePascal, if you compare an
AnsiString or UTF-8-encoded string with a Unicode string that is equivalent,
you get the correct answer: they are equal.
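For reference, the behaviour the quote describes (Python 3 will at least flag it as a BytesWarning when run with the -b flag):

```python
filetype = b"file"            # arrived as bytes, e.g. from a low-level API

# str and bytes never compare equal in Python 3, regardless of content:
print(filetype == "file")     # False -- silently, no error raised
print(filetype == b"file")    # True

# The coercion has to be written explicitly:
print(filetype.decode("ascii") == "file")   # True
```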

~~~
mikezter1
> The encoding of a string should never cause a comparison to fail when the
> two strings are equivalent except for the encoding.

You'll have to admit that the encoding is a property of a string, just like
the content itself. As always, you as a programmer are bound to know both of
these properties to have predictable results. To compare two strings of
different encoding to one another, you'll have to find a common ground for
interpreting the data contained in the string.

If you don't want or need that, then all you have is a "string" of bytes.

~~~
TimJYoung
Sure, but you can have defined rules about what happens when you compare
values with disparate encodings, similarly to how you have to have rules about
how column expressions are compared in SQL with regard to their collations.
The way such things are done is typically to coerce the second value into the
encoding of the first value, and _then_ compare the two values. What the
Delphi compiler does is issue warnings when there might be data loss or other
issues with such coercions so that the developer knows that it might not be
safe and that they might want to be more explicit about how the comparison is
coded.

------
apk-d
Let's be honest, the real mess is with UNIX filenames. I dare you to come up
with a legitimate use case for allowing newlines and other control characters
in a file name.

~~~
dcbadacd
It's like a built-in unit test: devs have to _not_ mangle or assume anything
about filenames they get from the system. They still do, though; I've seen
multiple times how my nice umlauts get mangled or my spaces cause scripts to
fail.

~~~
PeterisP
A few years ago I tried naming my home directory with the unicode pile of poo
(💩) and a space in the name to test what of my code might break. However, it
broke too many of the third-party tools/scripts that I occasionally needed for
something, so I reverted within a few days.

Though it might be interesting to have an integration test box where the
username (and thus all the relevant paths) includes all kinds of oddities -
whitespace, emoji, right-to-left marker, etc.

------
Aardappel
Going off on a tangent a bit here, but I think there are 2 important related
issues:

* API design should fit the language. In a "high on correctness" language like Haskell or Rust, I'd expect APIs to force the programmer to deal with errors, and make them hard to ignore. In a dynamically typed language like Python, where many APIs are very relaxed / robust in terms of dealing with multiple data types (being able to treat numbers/strings/objects generically is part of the point of the language), being super strict about string encoding sounds extra painful compared to a statically typed language. I'd expect an API in this language to err on the side of "automatically doing a useful/predictable thing" when it encounters data that is only slightly incorrect, as opposed to raising errors, which makes for very brittle code. Most Python code is the opposite of brittle, in the sense that you can take more liberties with data types before it breaks than in statically typed languages. Note that I am not advocating incorrect APIs, or APIs that silently ignore errors, just that the design should fit the language philosophy as best as possible.

* Where in a program/service should bytes be converted to text? Clearly they always come in as bytes (network, files..), and when the user sees them rendered (as fonts), those bytes have been interpreted using a particular encoding. The question is where in the program this should happen. You can do it as early as possible, or as late as possible. Doing it as early as possible increases the code surface where you have to deal with conversions, and thus possible errors and code complexity, so that doesn't seem so great to me personally, but I understand there are downsides to most of your program dealing with a "bag of bytes" approach too.

~~~
dan-robertson
I don’t think Haskell is a very good example to promote for string handling.
Things are mostly strict and well behaved once they make it into the Haskell
program but before then they either need to satisfy the program’s assumptions
before being input or the program will be buggy/crash unless it is carefully
written such that it’s assumptions are right.

~~~
Aardappel
I didn't mention Haskell specifically for strings, but as a language that
tends to be very precise about corner cases. That may not even be the best
example, but I couldn't think of any better mainstream-ish language examples
:)

------
lincolnq
Indeed py3 decided to make unicode strings the default. This fixes all sorts
of thorny issues across many use cases. But it does indeed break filenames. I
haven't dealt with this issue myself, but the way python was supposed (?) to
have "solved" this is with surrogate escapes. There's a neat piece on the
tradeoffs of the approach here:
[https://thoughtstreams.io/ncoghlan_dev/missing-pieces-in-
pyt...](https://thoughtstreams.io/ncoghlan_dev/missing-pieces-in-
python-3-unicode/)

Maybe handling the surrogates better would allow you to use 'str' everywhere
instead of bytes?
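That is roughly what os.fsdecode/os.fsencode already provide: str everywhere, with surrogate escapes preserving any undecodable bytes (a sketch):

```python
import os

# A UNIX file name containing a byte that is invalid in UTF-8.
raw = b"notes-\xff.txt"

# fsdecode applies the filesystem encoding with surrogateescape,
# so every possible byte name becomes a (possibly odd-looking) str:
name = os.fsdecode(raw)
print(type(name).__name__)        # str

# fsencode reverses it exactly, so 'str everywhere' stays lossless:
assert os.fsencode(name) == raw
```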

------
gnud
> For a Python program to properly support all valid Unix filenames, it must
> use “bytes” instead of strings, which has all sorts of annoying
> implications.

While in python 2, you had to use unicode strings for all sorts of actual
text, which caused its own problems.

> What’s the chances that all Python programs do this correctly? Yeah. Not
> high, I bet.

Exactly.

------
zzzeek
Don't think of python Unicode as a "string". Think of it as "text". I don't
really understand the issues the author is having with things like sys.stdout
and such, because he did not provide complete examples. He should cite actual
examples and bug reports that he has posted for these things; I've had no such
issues. There are a lot of things we need to do to accommodate non-ascii
text, but they are all "right" as far as I've observed.

------
dan-robertson
Part of the issue is to do with bytes and strings being considered totally
different by python but confusingly similar to people.

The error from "file" != b"file" is particularly bad. It makes sense if you
realise that a == b means a,b have the same type and their values are equal.
But there is no way even a reasonably careful programmer could spot this
without super careful testing (and who’s to say they would remember to test
b"file" and not "file"). Other ways this could be solved are:

1. String == bytes is true iff converting the string to bytes gives equality
(but then == can become non-transitive)

2. String == bytes raises (and so does string == string if encodings are
different)

3. Type-specific equality operators like in Lisp. But these are ugly and
verbose, which would discourage their use, and so one would not think to use
bytesEqual instead of ==

4. A stricter/looser notion of equality that behaves as one of the above,
called e.g. ===, but this is also not great
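As a sketch of the current Python 3 behavior the options above would replace:

```python
# Current Python 3 behavior: str and bytes never compare equal.
# No error is raised; the comparison is just silently False.
assert ("file" == b"file") is False

# The programmer has to convert explicitly for the comparison to mean
# anything:
assert "file".encode("utf-8") == b"file"
```

Running CPython with the `-b` flag turns such str/bytes comparisons into a BytesWarning, which is roughly option 2 on an opt-in basis.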

~~~
xapata
> The error from "file" != b"file" is particularly bad. It makes sense if you
> realise that a == b means a,b have the same type and their values are equal.
> But there is no way even a reasonably careful programmer could spot this
> without super careful testing (and who’s to say they would remember to test
> b"file" and not "file").

I'm of the opposite opinion. I appreciate that b'a' != 'a'.

~~~
dan-robertson
I don’t think it’s a problem that they aren’t equal. This is reasonable. The
problem is that it is hard for one to foresee this error. The mental model of
bytes and strings is likely to be either their equal-looking literals or a
mental concept of “bytes and strings are basically the same except for some
exceptions.” One cannot reasonably trace every variable to figure out whether
it is bytes or a string. Having a == b be false when comparing strings to
bytes makes sense when a and b could be anything. However, when b is already
definitely a (byte) string, it is more useful to get an error when a has a
different type.

What is your opinion on numbers:

Should 1 == 1.0?

What about 1+0j?

Or 1/1 (the rational number, although I’m not sure this can be constructed)?

~~~
xapata
int and float being comparable is practical, though occasionally troublesome.
Complex usually doesn't compare well across types. You can use a Fraction type
for 1/1. I haven't formed an opinion about that, since I don't use them
often.
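For what it's worth, a quick sketch of how Python's numeric tower answers those questions (a Fraction can indeed be constructed):

```python
from fractions import Fraction

# Python's numeric tower makes all of these compare equal across types:
assert 1 == 1.0
assert 1 == 1 + 0j
assert 1 == Fraction(1, 1)

# Equal numbers also hash identically, so they collapse in sets/dicts:
assert len({1, 1.0, 1 + 0j, Fraction(1, 1)}) == 1
```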

------
0x006A
I love Unicode handling in Python 3; it's so much better to work with. Python 2
was a mess. Migrating old code requires looking at the old code, and the result
is only better code, never a mess.

------
kabacha
> Python's unicode is a "mess" because of this single edge case I've
> encountered

FTFY

~~~
est
more like

> Unicode is a "mess" because it can not unquote arbitrary backslash strings.

------
upofadown
The article is about a specific instance (filenames). In general, handling
Unicode as a bunch of indexable code points as per Py3 turned out to be not
that great. I guess the idea came from the era where people still thought that
strings could be in some sense fixed length. These days we better understand
that strings are inherently variable length. So there is no longer any reason
to not just leave everything encoded as UTF-8 and convert to other forms as
and if required. Strings are just a bunch of bytes again.

~~~
ubernostrum
The most correct way to expose Unicode to a programmer in a high-level
language is to make grapheme clusters the fundamental unit, as they correspond
to what people think of as "characters". Failing that, exposing strings as
sequences of code points is a second-best choice.

UTF-8 is a non-starter because it encourages people to go back to pretending
"byte == character" and writing code that will fall apart the instant someone
uses any code point > 007F. Or they'll pat themselves on the back for being
clever and knowing that "really" UTF-8 means "rune == code point ==
character", and also write code that blows up, just in a different set of
cases.

And yes, high-level languages should have string types rather than "here's
some bytes, you deal with it". Far too many real-world uses for textual data
require the ability to do things like length checks, indexing and so on, and
it doesn't matter how many times you insist that this is completely wrong and
should be forbidden to everyone everywhere; the use cases will still be there.
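As a sketch of why the code-point view can still surprise people, `len()` on a Python 3 str counts code points, not grapheme clusters:

```python
import unicodedata

# 'e' followed by a combining acute accent renders as one character...
s = "e\u0301"
assert len(s) == 2                # ...but it is two code points

# NFC normalization composes this particular pair into one code point:
assert unicodedata.normalize("NFC", s) == "\u00e9"

# Many clusters have no composed form; a flag emoji stays two code points:
assert len("\U0001F1FA\U0001F1F8") == 2
```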

~~~
roel_v
That's silly. How often have you had to work with grapheme clusters without
also using a text rendering engine? But the number of times you need to know
the number of bytes a string takes, even when using scripting languages, is
much higher. The only way to deal with this is to not make assumptions, and
not have a string.size() function, but specific accessors for 'size in bytes',
'number of code points' and (potentially, if the overhead is warranted) 'nr of
grapheme clusters'.

The 'fundamental' problem here is that the average programmer doesn't
understand 'strings' because it seems so easy but it's actually very hard
(well, not even hard, just big and tedious). Even more so now that many people
can have careers without really knowing about what a 'byte' is or how it
relates to their code.

~~~
minitech
> How often have you had to work with grapheme clusters without also using a
> text rendering engine?

All the time. Want to truncate user text? You need grapheme clusters. Reverse
text? Grapheme clusters. Get what the user thinks of as the length of text?
Grapheme clusters. Not saying it’s a good idea to make them any sort of
default because of that, though; you’re right that it should be explicit.

~~~
roel_v
Truncating text is almost always (in my experience) a UI thing, where you pass
a flag to some UI widget saying 'truncate this and that on overflow' and while
rendering, it can then truncate using grapheme clusters.

How often does one reverse text? And when do users care about text length?
Almost always (again, in my experience) in the context of rendering - when
deciding on line length or break points, so when you know and care about much
more than just 'the string' - but also font, size, things you only care about
in the context of displaying. Not something that should be part of the 'core'
interface of a programming language.

I mean I think we agree here; my point was that I too used to think that
grapheme clusters mattered, but when I started counting in my code, it turned
out they didn't. Sure, I can think of cases where it would theoretically
matter, but I'm talking about what you actually use, not what you think
you will use.

~~~
minitech
I’m biased towards websites, but truncating text server-side to provide an
excerpt is something I need to do pretty often. Providing a count of remaining
characters is maybe less common, but Twitter, Mastodon, etc. need to do it,
and people expect emoji (for example) to count as one.

Plus sometimes you’re the one building the text widget with the truncation
option.

~~~
moefh
Twitter's count of "characters" is code points after normalization[1].

I don't know who expects emoji to count as one character, but they'd be
surprised by Twitter's behavior: the female-teacher emoji [2] counts as 4
characters (woman, dark skin tone, zero width joiner, school).

[1] [https://developer.twitter.com/en/docs/basics/counting-
charac...](https://developer.twitter.com/en/docs/basics/counting-
characters.html) [2] [https://emojipedia.org/female-teacher-
type-6/](https://emojipedia.org/female-teacher-type-6/)
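A sketch of that counting rule in Python (normalize to NFC, then count code points):

```python
import unicodedata

# The female-teacher ZWJ sequence from [2]: woman, dark skin tone,
# zero width joiner, school.
teacher = "\U0001F469\U0001F3FF\u200D\U0001F3EB"

# NFC leaves the sequence unchanged, and len() counts code points,
# which matches Twitter's count of 4 "characters".
assert len(unicodedata.normalize("NFC", teacher)) == 4
```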

------
snicker7
There are lots of comments indicating that the programmer is doing things
wrong. But what is the right way to deal with encoding issues? Wait for code
to break in production?

Whatever "best practices" there are for dealing with unexpected text encoding
in Python, they do not seem to be widely known. I bet a large share of Python
programmers (myself included) have made the exact same errors the author did,
with little insight as to how to avoid them in the future.

------
tyingq
His examples are all stuff that isn't Unicode. The filename thing would
probably work using a latin1 encoding, since that leaves 8 bit bytes
undisturbed.
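That works because latin-1 maps every byte value 0x00-0xFF to the code point with the same number, so the decode/encode round trip is lossless; a quick sketch:

```python
# Every possible byte value survives a latin-1 round trip unchanged,
# because latin-1 maps byte N directly to code point N.
raw = bytes(range(256))
assert raw.decode("latin-1").encode("latin-1") == raw
```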

------
andrewstuart
That's not Python's fault - those are programmer errors.

Having said that, Python really has something to answer for with "encode"
versus "decode" - WTF? Which is which? Which direction am I converting? I
still have to look that up every single time I need to convert.

Why the heck are there not "thistothat" and "thattothis" functions in Python
that are explicit about what they do?

~~~
burntsushi
This is something I see folks trip on. I think encode/decode are fine names
actually. The problem is that Unicode strings have decode defined at all,
and similarly, that byte strings have encode defined. Byte strings should only
have a decode operation and Unicode strings should only have an encode
operation. Depending on your input, the wrong operations can actually succeed!

~~~
guitarbill
Not sure what you're talking about mate, on Python 3.6:

    
    
        >>> "hello world".decode("utf-8")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        AttributeError: 'str' object has no attribute 'decode'
        >>> b"hello world".encode("utf-8")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        AttributeError: 'bytes' object has no attribute 'encode'

~~~
burntsushi
That's good. I guess it's only in Python 2 then.

~~~
luckystarr
In Python 2 it works almost the same, except you only get an error when
encoding/decoding doesn't work out. So I see this as an improvement.

------
franga2000
If you're storing files with non-Unicode-compatible names, you should really
stop. Even if, on Unix, you _can_ technically use any kind of binary mess as a
name, that doesn't mean you _should_. And this applies to all kinds of data. All
current operating systems support (and default to) Unicode, so handling
anything else is a job for a compatibility layer, not your application.

If you write new code to be compatible with that one Windows ME machine set to
that one weird IBM encoding sitting in the back of the server room, you're
just expanding your technical debt. Instead, write good, modern code, then
write a bridge to translate to and from whatever garbage that one COBOL
program spits out. That way, when you finally replace it, you can just throw
away that compatibility layer and be left with a nice, modern program.

In EE terms, think of it like an opto-isolator. You _could_ use a voltage
divider and a zener diode, but that's just asking for trouble.

------
jlarocco
I can't believe there are still people whining about this in 2018.

Those problems with gpodder, pexpect, etc. aren't due to Python 3, they're due
to the software being broken. Without knowing the encoding, UNIX paths can't
be converted to strings. It's unfortunate, but that's the way it is, and it's
not Python's fault.

------
codedokode
The author has files with invalid names and complains that Python refuses to
accept them. Maybe he should fix the names first?

~~~
gpderetta
If all tools he had access to behaved in the same way, he wouldn't be able to
fix these "wrong" file names.

------
softblush
[https://web.archive.org/web/20181006121702/http://changelog....](https://web.archive.org/web/20181006121702/http://changelog.complete.org/archives/9938-the-
python-unicode-mess)

------
luckystarr
Author doesn't seem to care that there is a difference between Unicode the
standard and utf-8 the encoding. While the changes on the fringes to the
system are debatable, they are also in a way sensible. Internal to your
application everything should be encoding independent (unicode objects in Py2,
strings in Py3) while when talking to stuff outside your program (be it
network, local file content or filesystem names) it has to be encoded somehow.
The distinction between encoding independent storage and raw byte-streams
forces you to do just that!

Stop worrying and go with the flow. Just do it as it is supposed to be done
and you'll be happy.

------
mikezter1
The encoding is a property of the string, just like the content, just as with
any other object. If you want to compare strings with different encodings,
you'll have to convert at least one of them.

I was never forced into encoding hell again, after reading this excellent
post: [https://www.joelonsoftware.com/2003/10/08/the-absolute-
minim...](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-
every-software-developer-absolutely-positively-must-know-about-unicode-and-
character-sets-no-excuses/)

------
singularity2001
I invest some karma to point out how I'd love for str to just use UTF-8 by
default, and print as UTF-8 by default:

print(b'DONT b"EVERYTHING!"')

print(str(b'SAME!'))

print(str(b'I DONT WANT TO add ,"UTF-8" everywhere!','UTF-8'))

line="ום עולם"

output.write(line) # TypeError: a bytes-like object is required, not 'str'

fp.write(output.getvalue()) # TypeError: write() argument must be str, not
bytes

Please at least allow us to set a global option via
sys.setdefaultencoding('UTF8') as before to automatically encode/decode as
UTF-8 by default!
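For anyone hitting those TypeErrors: a sketch of the two idiomatic Python 3 fixes, encoding explicitly at the boundary or wrapping the byte stream once (the sample string is made up):

```python
import io

# A binary stream only accepts bytes, so encode at the boundary...
raw = io.BytesIO()
raw.write("héllo wörld".encode("utf-8"))

# ...or wrap it once in a text layer and write str from then on.
raw2 = io.BytesIO()
text = io.TextIOWrapper(raw2, encoding="utf-8")
text.write("héllo wörld")
text.flush()

# Both approaches produce identical UTF-8 bytes.
assert raw.getvalue() == raw2.getvalue()
```

sys.setdefaultencoding() was removed in Python 3, but the PYTHONIOENCODING environment variable still controls the encoding of stdin/stdout/stderr.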

------
madrox
Dealing with string encoding has always been the bane of my existence in
Python...going back over 10 years when I first started using it. I've never
had such wild issues with decoding/encoding in other languages...that may be
my privilege, though, since I was dealing with internal systems before Python,
and then I got into web scraping.

Regardless, string encoding/decoding in Python is _hard_ , and it doesn't feel
like it needs to be.

------
loeg
I agree Python3 is an awful mistake and that straight-up Unicode is not well
suited for storing arbitrary byte strings from old disk images. However,
Python 3.1+ encodes disk names as WTF-8 (aka utf-8b):
[https://www.python.org/dev/peps/pep-0383/](https://www.python.org/dev/peps/pep-0383/)
.
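Concretely, os.fsdecode/os.fsencode apply that PEP 383 handler, so arbitrary byte names round-trip through str (a sketch assuming a POSIX system; the name is made up):

```python
import os

# \xff is not valid UTF-8, yet the name survives the round trip:
raw = b"disk-image-\xff.bin"
name = os.fsdecode(raw)        # str, with the bad byte as a lone surrogate
assert os.fsencode(name) == raw
```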

------
gspetr
This post barely scratches the tip of the iceberg.

For a more comprehensive discussion of unicode issues and how to solve them in
Python, "Let’s talk about usernames" does this issue more justice than I could
write in a comment:
[https://news.ycombinator.com/item?id=16356397](https://news.ycombinator.com/item?id=16356397)

------
jessaustin
TFA is short and to the point. A few examples, a few links to other examples.
Py3's insistence on shoving Unicode into every API it could possibly fit is
often inconvenient for coders and for users. This thread has 100
comments, mostly disagreeing in the same fingers-in-ears-I-can't-hear-you
fashion. Whom are we struggling to convince, here?

------
vfclists
If Python programmers think they are the only ones with UTF problems, try the
Lazarus and Free Pascal development mailing lists. The debates have been going
on since forever, and I am sure issues will keep popping up every now and then.

Try Elixir. According to their docs they've had it right from the word go - I
think.

------
andrewstuart
Is the author saying that the Python programming language handles this badly,
and all other (relevant) programming languages do not?

Or is it that Python's attention to detail means that issues that would be
glossed over or hidden in other languages are brought to the fore and
require addressing?

------
SoulMan
I just came back from PyCon India 2018. This is exactly what the keynote was
about. (It was by the author of Flask.)

------
wParser
Filenames are a good example to show people why forcing an encoding onto all
strings simply doesn't work. The usual reaction from people is to ignore that
and they'll shout: "fix your filenames!"

Here is another example: substrings of Unicode strings. Just split a Unicode
string into chunks of 1024 bytes. Forcing an encoding here and allowing
automatic conversions will be a mess. People will shout: "You're splitting
your strings wrong!"

The first language I knew that fell for encoding-aware strings was Delphi -
people there called it "Frankenstrings" and meanwhile that language is pretty
dead.

As a professional who has to handle a lot of different scenarios (barcodes,
Edifact, Filenames, String-buffers, ...) - in the end you'll have to write all
code using byte-strings. Then you'll have to write a lot of GUI-Libraries to
be able to work with byte-strings... and in the end you'll be at the point
where the old Python was... (In fact you'll never reach that point because
just going elsewhere will be a lot easier)

~~~
joshuamorton
Substrings of Unicode strings are fine. Byte-level chunking of a Unicode
string requires encoding the string as bytes, then working with bytes, then
decoding the text.

Splitting a piece of Unicode text every 1024 bytes is like splitting an ascii
string every 37 bits. It doesn't make sense.
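A sketch of what goes wrong when the chunk boundary lands inside a multi-byte character:

```python
data = "naïve".encode("utf-8")   # 'ï' takes two bytes in UTF-8
chunk = data[:3]                  # boundary falls in the middle of 'ï'

try:
    chunk.decode("utf-8")
    assert False, "should not decode"
except UnicodeDecodeError:
    pass                          # a strict decode refuses the torn character

# A tolerant decode silently drops the dangling partial character:
assert chunk.decode("utf-8", errors="ignore") == "na"
```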

------
Walkman
It's just a not very well explained rant about some shitty libraries and a lot
of legacy code. If you want to read about REAL complaints, read Armin
Ronacher's thoughts about it instead: [http://lucumr.pocoo.org/2014/5/12/everything-about-
unicode/](http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/)

------
godman_8
This is mostly why PHP6 wasn't a thing.

------
prevedmedved
Looks like we need a Python 4. (/s)

~~~
kbumsik
Well, Python already has a plan for Python 4. Python 4 will be released
after Python 3.8. There are already discussions on Python 4.0 in the dev group.
It is just a new number after 3.8, so there won't be breaking issues like 2=>3.

~~~
1wd
I think that was just some idea that was discarded.

"Seems that we've reached the consensus: we release Python 3.10 after Python
3.9. We maybe release Python 4.0 at some point if there's a significant
backwards incompatible change." [https://mail.python.org/pipermail/python-
committers/2018-Sep...](https://mail.python.org/pipermail/python-
committers/2018-September/006159.html)

------
repolfx
That sounds more like a mess handling things that are _not_ Unicode.

~~~
masklinn
Yes, but the issue here would be that Python forced these "things which are
not unicode" into unicode.

------
IshKebab
This is just the cost of using a dynamic language with implicit error handling
(exceptions).

~~~
ubernostrum
You can garble filenames just as easily in statically-typed languages.
Consider, for example, Windows' infamous 16-bit units that aren't actually
well-formed UTF-16. I'm not aware of any widely-used programming language
whose type system will save you from that sort of thing ("here's some bytes,
figure out if they're a string and if so what encoding" is a historically very
difficult problem).

~~~
masklinn
> You can garble filenames just as easily in statically-typed languages.

If the language assumes filenames are regular language strings, which not all
do.

> Consider, for example, Windows' infamous 16-bit units that aren't actually
> well-formed UTF-16.

unix filenames are literally just bags of bytes with no known or specified
encoding.

~~~
ubernostrum
Unix filenames don't pretend to be something they aren't. Windows filenames
like to present a convincing façade of being UTF-16 right up until they
aren't.

~~~
masklinn
Windows filenames "like to present a convincing façade of being UTF-16" in the
exact same way unix filenames "like to present a convincing façade of being
UTF-8". Both are common assumptions, but neither is actually true, and all of that
is well-documented.

~~~
MrRadar
> unix filenames "like to present a convincing façade of being UTF-8"

Except they never have? Unix paths have always been bags of bytes, both before
Unicode and UTF-8 were invented and after. It's just convention that modern
Unix systems use UTF-8 as the text encoding for paths.
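Python 3 reflects that convention-only status: pass bytes into the filesystem APIs and you get the raw bytes back (a sketch assuming a POSIX system; the file name is made up):

```python
import os
import tempfile

d = tempfile.mkdtemp()
raw_name = b"bag-of-bytes-\xff"            # not valid UTF-8

# Create the file via a bytes path, then list via a bytes path: the
# name round-trips untouched, with no text encoding involved.
open(os.path.join(d.encode(), raw_name), "wb").close()
assert raw_name in os.listdir(d.encode())
```

Listing the same directory with a str path instead returns the name fsdecoded, i.e. with the bad byte as a surrogate escape on a UTF-8 locale.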

~~~
masklinn
> Except they never have?

And neither have Windows paths ever actually pretended to be UTF-16, that's my
point.

