

PyPy - Python 3k Status #5 update  - kracekumar
http://morepypy.blogspot.in/2012/07/py3k-status-update-5.html

======
saurik
Paths and file names on Unix fundamentally do not have an encoding: the
filesystem represents them as sequences of bytes, and it is entirely possible
to have a directory full of folders where there is no codec that is capable of
faithfully or even reasonably decoding all of the names contained (or even
that Unicode itself is capable of representing the semantically correct
decoding of the filenames if you knew what theoretical codec represented them
in the first place).
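
A minimal sketch of why a byte-string file name has no single "right" decoding. The name `raw_name` is a made-up example, not a file from this thread, and the choice of ISO-8859-1 and KOI8-R as the competing codecs is purely illustrative:

```python
# Hypothetical file name, as the raw bytes a Unix filesystem would store.
raw_name = b"hello\xe4"

# Both single-byte codecs accept 0xE4, but they map it to different characters.
as_latin1 = raw_name.decode("iso-8859-1")
as_koi8r = raw_name.decode("koi8-r")

# UTF-8 rejects it: 0xE4 opens a three-byte sequence that never completes.
try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_koi8r), utf8_ok)
```

Nothing in the bytes themselves says which of these readings was intended, which is the point being made above.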

It is therefore a fundamental mistake and a misinterpretation of the semantics
of filesystems by Python 3 to insist that file names are represented by
Unicode strings with a locale-sensitive encoding; in fact, I question whether
there are interesting security ramifications inherent in this mistake (such as
allowing me to change the locale in which a process is running and thereby
remap its import path; this coming up, of course, as this article is largely
about sys.path and Unicode).

~~~
thristian
Well, it's a misinterpretation of the semantics of traditional Unix
filesystems. Other filesystems used by other operating systems (such as
Windows' NTFS and OS X's HFS+) genuinely do store filenames as Unicode
strings, so on those platforms (or on Unix, if you've mounted one of those
filesystems) Python 3.x's approach is exactly correct.

That said, recent versions of Python 3 include a workaround for exactly the
problem you describe: when the interpreter gets a byte-sequence value from the
OS, filenames that decode to Unicode cleanly will be represented as proper
Unicode strings, while filenames that can't be decoded will have the raw bytes
represented as code-points in Unicode's Private Use Area. That way, even if
Python can't decode the contents of a string, you can still, say, get a
parameter from the command-line, pass it to open(), and be confident that
you'll actually get the file the user intended.

~~~
saurik
While I reference the issue of a mixed-codec directory in order to make clear
the flaw in the operating assumption, the actual problem I am interested in
here, and which I conclude my statements with a reference to, is handling
things like sys.path (hence this being a reply to the article).

To respond to your comments, however: actually, the behavior on, let's say a
Linux box, if you mount one of those aforementioned filesystems, will not be
"exactly correct". Python 3 will attempt to encode the Unicode string using
the current locale, pass it to the underlying Unix open() function, which will
then have no clue what to do as it hits the filesystem.

In fact, rather than just idly claiming this, I went ahead and set up exactly
this test setup on one of my servers. I created a python3 script in a known
specific source encoding (UTF-8) and asked it, in each of two different
locales, to make a file that included an accented character, while mounted on
an HFS+ disk image.

    
    
        hfs+# mount | grep hfs
        /.../hfs+.img on /.../hfs+ type hfsplus (rw,force)
        hfs+# cat test.py
        #!/usr/bin/python3
        # -*- coding: utf-8 -*-
        open("helloä", "w")
        hfs+# LANG=en_US.UTF-8 ./test.py
        hfs+# LANG=fr_FR.ISO-8859-1 ./test.py
        hfs+# LANG=en_US.UTF-8 ls -la
        -rw-r--r-- 1 root root  0 2012-07-11 00:57 hello?
        -rw-r--r-- 1 root root  0 2012-07-11 00:57 helloä
        -rwxr-xr-x 1 root root 64 2012-07-11 00:56 test.py*
        hfs+# 
    

As you can see, the behavior here is really poor for something that claims to
support Unicode. What we would expect is that, since I opened the file with a
Unicode name on a Unicode filesystem, I would actually get the specific
Unicode string that I had wanted.

Instead, because Python 3 is only pretending to understand the semantics of
filesystems with regard to character sets, and in fact has no way of taking
advantage of the Unicode support in HFS+, we ended up with the encoding of the
user's locale environment breaking our filenames.
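
A rough illustration of what presumably happened in the transcript above, assuming Python 3 encodes the name with the locale's codec before calling the C-level open(). The variable names are illustrative only:

```python
name = "hello\u00e4"  # "helloä", the name used in test.py above

# Bytes that would reach the kernel under each locale (assumption: the
# filesystem encoding simply follows the locale's character set).
utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("iso-8859-1")

print(utf8_bytes)    # b'hello\xc3\xa4'
print(latin1_bytes)  # b'hello\xe4'

# A UTF-8 terminal cannot decode the Latin-1 bytes, so it shows a placeholder.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(repr(shown))
```

The same Unicode string therefore becomes two different byte sequences, and only one of them is valid UTF-8, which would be consistent with the `hello?` entry in the listing.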

This would be akin to me doing JSON.encode() in a browser, and having a
Unicode JavaScript string get converted into different JSON (which is also
represented as a Unicode JavaScript string) depending on what language the
user's browser is configured to use: that is a miserable Unicode failure, and
not a success story.

FWIW, the exact same code _does_ work on Mac OS X (using the correct locale
name of fr_FR.ISO8859-1): both versions of the script get the exact same
filename. To be honest, I was somewhat surprised that they got that right ;P.
I sadly do not have a Windows computer handy, as I'd love to see how it
handles NTFS (which is technically UCS-2 and doesn't have the weird
canonicalization behavior that HFS+ sometimes does: I can easily imagine
broken corner cases with invalid UTF-16 surrogate pairs).
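
As an aside on the canonicalization remark, here is a small sketch of what decomposition does to a name like "testä". It assumes the behavior in question is HFS+ storing names in a decomposed form close to Unicode NFD; standard NFD from `unicodedata` is used here as a stand-in for Apple's variant:

```python
import unicodedata

composed = "test\u00e4"  # 'ä' as a single code point, U+00E4
decomposed = unicodedata.normalize("NFD", composed)  # 'a' followed by U+0308

print(len(composed), len(decomposed))  # 5 6
print(composed == decomposed)          # False

# Recomposing restores the original string.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)          # True
```

A name written in one form and read back in the other will compare unequal unless the program normalizes before comparing.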

Regardless, I really do not believe that you need to break the behavior on
Linux in order to make Mac OS X work correctly. Even if such a tradeoff were
required, I question whether it should be resolved with Linux on the losing
end. I am not going to say that the correct solution doesn't even use Unicode
strings: it just needs to more intelligently handle who is in charge of the
encodings and what they semantically mean than Python 3 is prepared to do.

As far as possible solutions go, I certainly do not believe that the private
Unicode codepoint solution is either sufficient (as it doesn't solve the de
novo filename creation problem) or even remotely reasonable (as if I then
attempt to communicate these filenames to other systems, or heaven forbid
want to store a filename with a private Unicode codepoint in it, I'm now
screwed).

(edit: Wow, BTW. They added this "solution" after I had already given up on
Python 3k, so I hadn't followed up to see what they did. It seems like they
ended up not using "private codepoints" as they were considering in 2008, but
are instead using surrogate-halves: they are encoding "malformed UTF-8
sequences as malformed UTF-16 sequences"[1]. Given that you are allowed to
store malformed UTF-16 on NTFS, I wonder how they expect that to work. Still
doesn't solve either of my complaints, though.)

1:
[http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043...](http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html)
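
For reference, the surrogate-half scheme described in the edit above is exposed in Python 3 as the `surrogateescape` error handler. The sketch below uses a made-up byte string and shows both the round trip and the interop problem raised earlier:

```python
raw = b"hello\xe4"  # hypothetical file name that is not valid UTF-8

# The undecodable byte 0xE4 is smuggled through as the lone surrogate U+DCE4.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'hello\udce4'

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# A strict encode, as another system would expect, refuses the string.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(roundtrip == raw, strict_ok)  # True False
```

The round trip works inside one Python process, but the escaped string is not valid Unicode text, so it cannot be handed to anything that encodes strictly.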

~~~
lmm
If you set LANG=fr_FR.ISO-8859-1 you are declaring that you want your
filenames encoded as ISO-8859-1; the behaviour sounds like exactly what I'd
expect. It's certainly not "broken on linux"; at worst it's "broken on linux
when not using a utf-8 locale", and frankly I'd expect most things to break
under that circumstance.

~~~
saurik
You seem to be missing the point of this exercise: HFS+ is a Unicode-aware
filesystem, so the idea that "you want your filenames encoded as ISO-8859-1"
is fundamentally invalid and unimplementable. You can't specify an encoding
for the files you are saving, as they are semantically "Unicode" and saved by
the filesystem as UTF-16 on disk as an implementation detail.

The reason that HFS+ came up is that, as a filesystem that stores filenames as
actual "strings" as opposed to "arrays of bytes", you actually get the correct
behavior now on OS X with Python 3 (which you also seem to have missed);
thristian contended that this behavior would also work correctly if I mounted
the HFS+ image on Linux, and it does not.

To explicitly demonstrate this on Mac OS X (where the filename will be sent as
Unicode through to the Unicode filesystem API and then saved correctly through
HFS+ to disk with the original name, no matter what encoding you happen to
have set as part of your locale, as your locale truly should be irrelevant in
this specific situation, just as in my JSON analogy):

    
    
        mac$ cat test.py
        #!/Library/Frameworks/Python.framework/Versions/3.2/bin/python3
        # -*- coding: utf-8 -*-
        import locale
        print(locale.getpreferredencoding())
        open("testä", "w")
        mac$ LANG=en_US.UTF-8 ./test.py
        UTF-8
        mac$ LANG=fr_FR.ISO8859-1 ./test.py 
        ISO8859-1
        mac$ ls -la
        total 8
        drwxr-xr-x    4 saurik  staff   136 Jul 11 00:36 ./
        drwxr-xr-x+ 278 saurik  staff  9452 Jul 11 00:28 ../
        -rwxr-xr-x    1 saurik  staff   159 Jul 10 18:07 test.py*
        -rw-r--r--    1 saurik  staff     0 Jul 11 00:36 testä
        mac$

~~~
lmm
>HFS+ is a Unicode-aware filesystem, so the idea that "you want your filenames
encoded as ISO-8859-1" is fundamentally invalid and unimplementable

Sure. But it's still the idea that linux would apply, and how linux APIs (that
expect a filename to be a stream of bytes) work. I really think this is a
linux/locale problem rather than a python problem.

~~~
saurik
I will yet again repeat: this excursion into HFS+ semantics on Linux was
prompted by thristian's insistence that Python would handle HFS+'s Unicode
behavior when mounted on Linux in the same correct way it does
on Mac OS X. This is, in fact, false. This then nullifies the argument that
this is a filesystem-specific issue.

You seem to be refusing to track this conversation's multiple thoughts: there
is the underlying argument "Python 3 is making unreasonable assumptions" with
a specific argument "these assumptions are reasonable on OS X" followed by an
aside "incidentally, this behavior actually is not related to operating
systems but is related to filesystems: as proof I cite HFS+ mounted on Linux"
with an error pointed out in the aside "no: in fact HFS+ on Linux has the same
behavior as any other filesystem on Linux".

I then separately respond to the point about "these semantics work on OS X"
(ceding, in fact, albeit explicitly remaining skeptical on Windows), saying
that the tradeoff of "works worse on Linux" (which I get to assume, as my
earlier arguments that this is the case were not actually challenged: that on
Linux the concept of encodings does not apply to filenames and causes problems
like locale-specific sys.path) seems like the wrong direction to lean (which
is an opinion, of course).

However, to make that claim, I need to defend against a new point that is
brought up: that thristian believes that an epicycle added to the algorithm
(the PUA "save the problem for later" mechanism) is sufficient to mitigate the
Linux problems. I claim that it is not, and I bring up a few reasons why (de
novo filenames, interop with non-Python systems, existing usages for PUA):
reasons which, incidentally, were also discussed as open problems on the
Python mailing list.

Finally, I also explained that the PUA solution isn't even being used anymore,
but was actually replaced by UTF-8b. As this solves one of my complaints
(existing usages for PUA) I then have to first admit that (although I
maintain my belief that invalid surrogate pairs are not invalid on Windows,
leading to a similar problem) and then, for clarity, mention that my other
arguments are not affected by UTF-8b.

~~~
lmm
So, in the interests of being perfectly clear: I am challenging your claim
that python 3's approach works worse on Linux; I assert that its semantics
under linux are correct (i.e. what a well behaved program running under linux-
the-system should do). Conceptually, a program should tell the operating
system to save a filename under a given name (unicode string); it is then the
operating system's responsibility to translate that to and from bytes on disk.

What you have observed, and demonstrated with your example, is the behaviour
of linux running with LANG=fr_FR.ISO-8859-1, which is to represent filenames
that contain characters not representable in ISO-8859-1 as ?s. Any well-
behaved linux program will exhibit the same behaviour, because it is not
program behaviour but OS behaviour. Programs that ignore LANG and do their own
filename encoding will appear better under your test, but such programs are
misbehaving; by declaring LANG=fr_FR.ISO-8859-1 the user has made an explicit
declaration that they wish for their filenames to be encoded as ISO-8859-1,
and should expect as much.

That filenames of files stored on HFS+ under linux still have linux semantics
despite the filesystem's semantics being different is an interesting accident
of history but really neither here nor there. The idea that you want your
filenames encoded as ISO-8859-1 may indeed be fundamentally invalid and
unimplementable on HFS+, but it remains the semantics of setting
LANG=fr_FR.ISO-8859-1 on linux, and as such it should be expected that linux
would attempt to follow this behaviour as closely as possible.

Really the whole excursion into filesystems is irrelevant. Python 3 behaves
correctly on operating systems which provide unicode filenames, i.e. "OSX" and
"Linux with a UTF8 locale", and as well as could be expected on operating
systems where filenames are only permitted to be strings in a particular
encoding i.e. "Linux with a non-UTF8 locale".

------
bcambel
When do you think Py3K will be mainstream? I'm afraid this might be a
failing effort. Is there any company out there using Py3K in production?

~~~
kisielk
Do we really need to rehash this tired argument over every single post that
references Python 3? You can look at any other Python 3 related post in the
last couple of years on HN and Reddit to find endless arguments about this
topic.

How about a comment thread that actually talks about the post contents for
once?

