
> That’s just restating my point: Unix filenames are bytes (on most filesystems, anyway). The fact that many people were able to conflate them with text strings was a convenient fiction.

Python tools for backups are my worst terror because of that - they kept destroying our clients' data because the clients dared (gasp!) to name files with characters from their own language, or did something unthinkable like creating a document titled "CV - <name with non-ASCII characters>.docx".

The fact that Python 3 at least tries to stop programmers from destroying data as soon as someone types in a foreign name (which happens even in the USA) is a good thing.




> Python tools for backups are my worst terror because of that

You can have badly written tools in any language. There are even functions to get file paths as bytes (e.g. [os.getcwdb](https://docs.python.org/3/library/os.html#os.getcwdb)); it's just that most people don't use them, because horribly broken filenames are rare-ish and the bytes APIs are less convenient.
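A quick sketch of what those bytes-level APIs look like in practice (the filename here is invented for illustration - Latin-1 bytes that are not valid UTF-8):

```python
import os

# A perfectly legal Unix filename that is not valid UTF-8
# (hypothetical example: Latin-1 encoded o-umlaut).
raw_name = b"CV - J\xf6rg.docx"

with open(raw_name, "wb") as f:
    f.write(b"some data")

# Passing bytes to path APIs yields bytes back: no decoding, no data loss.
assert raw_name in os.listdir(b".")
assert isinstance(os.getcwdb(), bytes)

os.remove(raw_name)
```

A backup tool written against these APIs never has to guess an encoding at all; it only decodes when it needs to display a name to a human.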

Do other languages get this right 100% of the time on all platforms? I don't think so; it's just that you've never noticed.

* C: has no concept of Unicode strings per se; it may or may not work depending on the implementation and how you choose to display the name (a CLI probably "works", a GUI probably doesn't)

* Rust: seems to assume UTF-8? (https://doc.rust-lang.org/std/ffi/index.html#conversions)

* Go: gets this right, but probably breaks on Windows? "string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text" (https://golang.org/pkg/builtin/#string)

In short, either I don't understand what point you're making, or it isn't unique to Python.


Not to bring up one of my favorite languages or anything, but I do think D got this completely right.

https://tour.dlang.org/tour/en/basics/alias-strings

> This means that a plain string is defined as an array of 8-bit Unicode code units. All array operations can be used on strings, but they will work on a code unit level, and not a character level. At the same time, standard library algorithms will interpret strings as sequences of code points, and there is also an option to treat them as sequence of graphemes by explicit usage of std.uni.byGrapheme.

And perhaps my favorite part (https://tour.dlang.org/tour/en/gems/unicode):

> According to the spec, it is an error to store non-Unicode data in the D string types; expect your program to fail in different ways if your string is encoded improperly.

I should note that what I really like about this approach is the total lack of ambiguity. There is no question about what belongs in a string, and if it's not UTF then you had better be using a byte or ubyte array or you are doing it wrong by definition.
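The code unit / code point / grapheme distinction the D tour describes shows up in any Unicode-aware language; here's a sketch in Python (picked only because it's the language the thread started with):

```python
import unicodedata

# One user-perceived character (a "grapheme"), built from two code points:
s = "e\u0301"  # 'e' followed by a combining acute accent

assert len(s) == 2                  # Python's len() counts code points
assert len(s.encode("utf-8")) == 3  # ...which here span 3 UTF-8 code units

# Normalization can collapse it into a single precomposed code point:
nfc = unicodedata.normalize("NFC", s)
assert nfc == "\u00e9" and len(nfc) == 1
```

D makes the same three-level split explicit in its types and library (code units in the array, code points in range algorithms, graphemes via std.uni.byGrapheme).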


So in D it's impossible to work with files if their filename is not Unicode?


Rather, it would be an error to grab a Unix filename, figure your job was done, and store it directly into a string. So you'd... handle things correctly. Somehow. I admit I've never had the bad luck of encountering a non-UTF-8 encoded filename under Linux, and can't claim with any confidence that my code would handle one gracefully. In any language, assuming you're using the standard library facilities it provides, things will hopefully mostly be taken care of behind the scenes anyway.
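For what it's worth, Python's standard library handles this case with the "surrogateescape" error handler: undecodable bytes are smuggled into the string as lone surrogate code points. A sketch (the filename bytes are invented, and the asserts assume a UTF-8 filesystem encoding, as on a typical Linux install):

```python
import os

# Bytes that are a legal Unix filename but not valid UTF-8 (Latin-1 here).
raw = b"r\xe9sum\xe9.txt"

# os.fsdecode applies the filesystem encoding with "surrogateescape":
# undecodable bytes become lone surrogates instead of raising an error
# or silently corrupting data.
name = os.fsdecode(raw)

# The round trip is lossless:
assert os.fsencode(name) == raw

# ...but the result isn't well-formed Unicode; strict UTF-8 encoding fails:
try:
    name.encode("utf-8")
    well_formed = True
except UnicodeEncodeError:
    well_formed = False
assert not well_formed
```

So the original bytes survive a decode/encode round trip, but anything that tries to treat the name as real text gets a loud error rather than garbage.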

What I like about the D approach isn't that declaring it an error actually solves anything directly (obviously it doesn't) but that it removes any ambiguity about how things are expected to work. If the encoding of strings isn't well defined, then when you're writing a library, which encodings will users expect you to accept? Or worse, which encodings does that minimally documented library function you're about to call accept?


OsString/OsStr are not utf-8.


Would you care to elaborate? I'm not claiming to know Rust, but the link I provided clearly says "[t]hese do inexpensive conversions from and to UTF-8 byte slices".


https://doc.rust-lang.org/std/ffi/struct.OsString.html has it pretty clearly: on unixes it’s a bag of bytes, on Windows it’s the modified UTF-16 they’ve got going on. There’s a trick called WTF-8 that bridges some gaps, though that’s considered an implementation detail: https://simonsapin.github.io/wtf-8/

They’re conversions because they’re not UTF-8 in the first place, that is, they’re not String/str. The conversions are as cheap as we can make them. That language is meant to talk about converting from OsString to String, not from the OS to OsString.


Why do people say "bag" instead of string/sequence/vector/array/list/etc.? Bags are multisets... they're by definition unordered. It's already a niche technical term so it's really weird to see it used differently in a technical context...


I think it feels really evocative. Like, a bag of dice or something. You can’t see what’s inside, you have no idea what’s on them. It reinforces the “no encoding” thing well.


I think it's more eloquently stated as "you shouldn't make assumptions about what's inside." Saying "you can't see what's inside" ignores the biggest cause of the conflation: userspace tools let you trivially interpret the bag of bytes as a text string for the purpose of naming it for other people.


Yeah, that might be better, good point.



