
This problem in Python 3 is not limited to OS file names; that’s just one way to get invalid Unicode data. Invalid data turns up all the time when working with real data. The Python 3 string design requires that all strings be valid Unicode, or Python will raise an error. This is a really unfortunate property that has bitten every single data scientist I know who uses Python 3. At some point, often hours or days into a long, expensive computation, one of their programs has suddenly encountered just a single invalid byte and crashed, costing them days of time and work. The only recourse for writing robust programs that can gracefully and correctly handle invalid data is not to use strings, which, frankly, makes the string type seem pretty useless.
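A minimal sketch of that failure mode, assuming UTF-8 input and Python's default strict decoding. The sample bytes and variable names are made up for illustration:

```python
# Otherwise valid UTF-8 text containing a single invalid byte (0xFF).
data = b"valid text \xff more valid text"

# Strict decoding (the default) raises when it reaches the invalid byte.
try:
    text = data.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    bad_offset = err.start  # byte offset of the offending byte

# Staying in bytes avoids the error and leaves the invalid byte untouched.
# Note that bytes methods such as upper() only affect ASCII letters.
shouted = data.upper()
```

In a long-running job the `decode` call may sit deep inside a file reader or parser, so the exception can surface far from the code that chose the encoding.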

The Python 3 string design also necessitates scanning and often transcoding every piece of string data that it encounters, both on the way in and again on the way out. That means that not only is the string type inappropriate for any data that might not be valid Unicode, it is also inappropriate for any data that might be large.
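As a rough illustration of the work involved, here is a small sketch assuming UTF-8 data on both ends. The sample text is invented, and the internal layout of `str` is a CPython implementation detail rather than something the language guarantees:

```python
# Build some UTF-8 input that contains non-ASCII characters.
raw = ("naïve café " * 1000).encode("utf-8")

# On the way in: decode() validates every byte and builds a new str object.
text = raw.decode("utf-8")

# On the way out: encode() converts the whole string back to UTF-8 bytes.
out = text.encode("utf-8")

# The str is indexed by code point, not by byte, so its length differs
# from the byte length of the data it was decoded from.
n_code_points = len(text)
n_bytes = len(raw)
```

Both passes touch all of the data, even if the program only wanted to copy it from one place to another.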

I’ve been meaning to write a blog post about how Julia handles strings, but haven’t yet gotten around to it. Among other benefits:

- You can process any data as strings and characters, whether it’s valid Unicode or not.

- If you read any data as strings or characters and write it back out, you get the exact same data back, no matter what it is, valid or not.

- Invalid characters are parsed according to the Unicode 10 spec.

- You only get an error if you actually ask for the code point of an invalid character, which is a fairly rare operation and must error since there is no correct answer.

- The standard library generally handles invalid Unicode gracefully.

- You can use strings for large data: there’s no need to look at, let alone transcode, string data. If you don’t need to access something, no work is required.



