
> Honestly the poster is correct as far as it goes: modern python (and modern string libraries more generally) are pretty bad at handling dirty input.

This is incorrect: Python is fine for working with messy data. The difference is that when you CONVERT that data you have to handle the possibility of errors rather than hoping for the best. If you’re working with filenames, you can pass bytes around all day long without actually decoding them; it’s only when you decode them that you’re forced to handle errors.
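A minimal sketch of that point, assuming a made-up filename (`raw_name` is hypothetical, chosen because the byte 0xff can never occur in valid UTF-8). Nothing fails while the name stays as bytes; the error only shows up at the explicit decode:

```python
# Hypothetical filename containing a byte that is never valid UTF-8.
raw_name = b"report-\xff.txt"

# Storing, comparing, and measuring bytes never decodes anything,
# so none of this can raise a Unicode error.
names = [raw_name, b"notes.txt"]
longest = max(names, key=len)

# Only the explicit conversion to text can fail, and it fails loudly.
try:
    text = raw_name.decode("utf-8")
except UnicodeDecodeError as exc:
    text = None
    error_position = exc.start  # byte offset of the offending byte
```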

> The difference is that when you CONVERT that data

Heh, "convert" it by concatenating it with a different directory path? Print it for the user? Stuff it into some kind of output format? Every one of those actions is going to toss an exception in Python[1], and there are no tools available for reliably doing this in a safe and lossless way that matches the obvious syntax you'd use with "good" strings.

Maybe that's "fine" to you, I guess.

[1] And in lots of other environments. The quibble I have with the linked article is, again, that this kind of thinking is pervasive in modern string handling. Python certainly isn't alone.

Changing paths and other filename manipulations are supposed to be done using os.path or pathlib. The discussion at https://docs.python.org/3/library/os.path.html starts with this problem and notes that the functions support all bytes or all Unicode but you have to be consistent. Don’t force a conversion to text and it works fine.
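As a rough illustration of the "be consistent" rule, here is a sketch using `posixpath` (the module that `os.path` resolves to on Unix, used directly here so the result is the same on any platform). The directory and filename are invented for the example:

```python
import posixpath

# Hypothetical Latin-1 encoded name; 0xe9 on its own is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# All-bytes arguments work without any decoding step.
joined = posixpath.join(b"/srv/data", raw_name)
base = posixpath.basename(joined)
root, ext = posixpath.splitext(base)

# Mixing str and bytes is rejected with a TypeError instead of guessing
# at an encoding.
try:
    posixpath.join("/srv/data", raw_name)
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True
```

The undecodable byte survives every step because nothing ever tries to interpret it as text.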

Similarly, once you’re talking about output other than passing it through unmodified to a format which can handle that, you are by definition converting it and need to handle what can go wrong. It’s easy to handle this in several ways, either by reporting an error or by using some sort of escaped representation with an indication that this doesn’t match the document encoding, but you no longer have the luxury of pretending that treating string as a synonym for bytes was ever safe.
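One possible shape for the "escaped representation with an indication" option, as a sketch only: `display_name` is a hypothetical helper, not a standard library function, and UTF-8 as the document encoding is an assumption. It relies on the built-in `backslashreplace` error handler, which has supported decoding since Python 3.5.

```python
from typing import Tuple


def display_name(raw: bytes, encoding: str = "utf-8") -> Tuple[str, bool]:
    """Return (text, lossless) for showing a raw filename to a user.

    Hypothetical helper: lossless is False when the bytes did not match
    the assumed encoding and escapes were substituted.
    """
    try:
        return raw.decode(encoding), True
    except UnicodeDecodeError:
        # Invalid bytes become visible \xNN escapes instead of an exception.
        return raw.decode(encoding, "backslashreplace"), False


clean = display_name(b"notes.txt")
dirty = display_name(b"caf\xe9.txt")
```

The caller can use the flag to report an error or to mark the name as not matching the document encoding.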

And the need to use special libraries to handle objects that have been strings since the dawn of Unix is precisely the kind of mess the poster is talking about. Yes, yes, everyone agrees that these problems "can" be solved in Python. The question at hand is whether or not Python (and modern utf8-centric string libraries) solves them WELL.

Pathlib[0] is kind of a step in the right direction. I think you could do something like subclass it to track each component (and component encoding) separately if that makes sense for your application.

I'm not a fan of Python 3's language fork in any way (I think it was completely unnecessary to fork the language to make the improvements they wanted to), but I'll admit things like Pathlib and UTF-8b are steps in the right direction for handling arbitrary Unix paths. (I work with a large Python 2 code base and a product that has to interact with a variety of path encodings, so this subject is... sensitive for me.)

[0]: https://www.python.org/dev/peps/pep-0428/
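For readers who haven't met UTF-8b: as far as I know it is what Python 3 exposes as the `surrogateescape` error handler (PEP 383), and on POSIX it is what `os.fsdecode` and `os.fsencode` use. A small sketch of the round trip, with a made-up filename and UTF-8 assumed as the filesystem encoding:

```python
# Hypothetical Latin-1 encoded name that is not valid UTF-8.
raw = b"caf\xe9.txt"

# Each undecodable byte is smuggled into the str as a lone surrogate
# (0xe9 becomes U+DCE9), so no information is lost.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", "surrogateescape")

# A strict encode still refuses the lone surrogate, so the "dirty" name
# cannot silently leak into output that claims to be real UTF-8.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_ok = False
else:
    strict_encode_ok = True
```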
