
In the Rust standard library, we have String, str, OsString, Path, and PathBuf, off the top of my head, with external crates implementing things like URLs, ropes, and others.

That said, I completely feel what this blog post is saying: strings are _hard_, especially if you're not just doing ASCII.




std provides String (utf8), OsString (bytes/wtf8), PathBuf (bytes/wtf8), CString (null-terminated).

Each has its own unsized view: str, OsStr, Path, and CStr.
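To make the owned/view pairing concrete, here's a rough sketch. The function name `view_lengths` is made up for illustration; the types and methods are from std.

```rust
use std::ffi::{CStr, CString, OsStr, OsString};
use std::path::{Path, PathBuf};

// Hypothetical helper: build each owned type, borrow its unsized view,
// and report the view's length in bytes.
fn view_lengths() -> (usize, usize, usize, usize) {
    let s: String = String::from("foo.txt");
    let os: OsString = OsString::from("foo.txt");
    let pb: PathBuf = PathBuf::from("foo.txt");
    let c: CString = CString::new("foo.txt").expect("no interior NUL");

    // Each owned type hands out a borrowed view of its contents.
    let s_view: &str = s.as_str();
    let os_view: &OsStr = os.as_os_str();
    let p_view: &Path = pb.as_path();
    let c_view: &CStr = c.as_c_str();

    (
        s_view.len(),
        os_view.len(),
        p_view.as_os_str().len(), // Path exposes its underlying OsStr
        c_view.to_bytes().len(),  // length without the trailing NUL
    )
}
```

For plain ASCII input all four views report the same byte length; the differences only show up once the contents stop being valid UTF-8 or need a terminator.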

We also provide AsciiExt for those times where you really truly believe you want to be working with a String as ASCII.
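A small sketch of what the ASCII-only operations do. As far as I know, in current Rust these are inherent methods on str and the AsciiExt trait itself is deprecated, so no import is needed here. The name `ascii_demo` is just for illustration.

```rust
// Hypothetical helper showing ASCII-only operations on a non-ASCII string.
fn ascii_demo() -> (String, bool, bool) {
    let s = "no\u{00EB}l"; // "noël" with a precomposed ë
    (
        s.to_ascii_uppercase(),                  // only a-z are mapped; ë is left alone
        s.is_ascii(),                            // false: ë is outside ASCII
        "HELLO".eq_ignore_ascii_case("hello"),   // ASCII-only case-insensitive compare
    )
}
```

The point is that the ASCII methods never touch anything outside the ASCII range, which is exactly why they're cheap and exactly why they're wrong for general text.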

That said, we generally try to make it as ergonomic as possible to pass a plain str where a Path/OsStr is expected. This is because utf8 is a subset of wtf8, so it's always fine to convert in that direction blindly (and it's really nice to just be like `File::open("foo.txt")` when hacking something together). This is why so many interfaces are riddled with something like `P: AsRef<Path>`. The differentiation largely exists for the other direction, IMO. Paths and OsStrs aren't guaranteed to be valid UTF8, and shouldn't be provided where a proper utf8 string is expected.
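Here's roughly what that generic bound buys you. `extension_of` is a hypothetical helper, not something from std; it accepts anything that can be viewed as a Path, and the conversion back to a String is the fallible direction.

```rust
use std::ffi::OsStr;
use std::path::Path;

// Hypothetical helper: takes &str, String, PathBuf, &OsStr, etc.
fn extension_of<P: AsRef<Path>>(p: P) -> Option<String> {
    p.as_ref()
        .extension()              // Option<&OsStr>
        .and_then(OsStr::to_str)  // None if the extension isn't valid UTF-8
        .map(str::to_owned)
}
```

So `extension_of("foo.txt")` and `extension_of(PathBuf::from("foo.txt"))` both work without the caller converting anything, while getting a str back out has to go through `to_str` and handle the `None` case.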

Path is just a convenience wrapper over OsStr that understands the platform's separator conventions and provides convenient utilities.


Don't forget we kinda sorta have byte characters and strings a la `b'x'` and `b"foo"`, which are really just simpler ways of expressing byte slices. Unfortunately they lack string-specific methods until we get specialization (fingers crossed).
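For anyone who hasn't seen the byte literals, a quick sketch (the function name `byte_literals` is made up):

```rust
// Hypothetical helper showing what the byte literals actually are.
fn byte_literals() -> (u8, usize, bool) {
    let x: u8 = b'x';           // a byte literal is just a u8
    let foo: &[u8; 3] = b"foo"; // a byte string literal is a reference to a byte array

    // Getting str methods back requires an explicit, fallible UTF-8 check.
    let starts_with_fo = std::str::from_utf8(foo)
        .map(|s| s.starts_with("fo"))
        .unwrap_or(false);

    (x, foo.len(), starts_with_fo)
}
```

Going through `std::str::from_utf8` is the workaround when you want a str method on bytes you know are text.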


why are they so hard?


There are two aspects of complexity with strings:

1) 40 years of crazy encodings and languages.

2) human languages are wildly diverse and basically any assumption you wish to apply is broken.

For 1, any system that wants to deal with the outside world needs to deal with: operating system encodings (arbitrary bytes on unix, malformed UCS2 on windows), C representation (null-terminated strings), systems that only work with ASCII, systems that only work with utf8, systems that work with arbitrary encodings/languages (HTML). This is arguably unnecessary complexity that exists because of short-sighted decisions in the past.
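A rough illustration of two of those boundaries in Rust terms, using only portable std calls. `boundary_checks` is a hypothetical name, and this only covers the C and UTF-8 cases, not the OS-specific ones.

```rust
use std::ffi::CString;

// Hypothetical helper: shows what happens at the C and UTF-8 boundaries.
fn boundary_checks() -> (bool, Vec<u8>, bool, String) {
    // C strings can't contain an interior NUL, so construction is fallible.
    let interior_nul_rejected = CString::new("a\0b").is_err();

    // A valid CString carries its terminator with it.
    let c = CString::new("hi").expect("no interior NUL");
    let with_nul = c.as_bytes_with_nul().to_vec();

    // Arbitrary bytes from the outside world are not necessarily UTF-8.
    let invalid_utf8_rejected = String::from_utf8(vec![b'h', 0xFF, b'i']).is_err();

    // The lossy conversion substitutes U+FFFD for the bad byte.
    let lossy = String::from_utf8_lossy(&[b'h', 0xFF, b'i']).into_owned();

    (interior_nul_rejected, with_nul, invalid_utf8_rejected, lossy)
}
```

Every one of these conversions is either fallible or lossy, which is the type system making you acknowledge the mismatch instead of papering over it.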

2 is the necessary complexity; the fact that languages are really complicated.

There are thousands of symbols in writing. Do you try to encode these symbols in a monolithic manner, or in a compositional way? For historical reasons, you can often do both! ë can be a single character, or e with an accent modifier. How do you handle string searching in such a model? Do you match `noel` with `noël`? What's the length of noël? 4 characters? 5 characters? bytes? graphemes? codepoints? Can you correctly reverse noël (do it wrong and you can get leön)?
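To put numbers on the noël questions, here's a sketch using only std. The helper names `measure` and `naive_reverse` are made up. std works in bytes and codepoints; as far as I know, grapheme segmentation and normalization live in external crates.

```rust
// Hypothetical helpers: length in bytes and in codepoints, plus a
// codepoint-wise reversal that knows nothing about combining marks.
fn measure(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn naive_reverse(s: &str) -> String {
    s.chars().rev().collect()
}

// The same visible word, encoded two ways.
const PRECOMPOSED: &str = "no\u{00EB}l"; // ë as a single codepoint
const DECOMPOSED: &str = "noe\u{0308}l"; // e followed by a combining diaeresis
```

With these, `measure(PRECOMPOSED)` is 5 bytes and 4 codepoints, `measure(DECOMPOSED)` is 6 bytes and 5 codepoints, and the two strings compare unequal even though they render the same. `naive_reverse(DECOMPOSED)` leaves the combining mark stranded after a different letter than the one it started on.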

Different letters which have similar/identical representations but different semantics/origins! Is Ε "capital e" or "capital ε"? How do you upper-case or lower-case these letters? Do you expect to_upper(to_lower(char)) to roundtrip (it won't)? Do you expect capitalization to be doable in-place (it's not)? Do you expect capitalization to be region-specific (it is)?
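A few of those case-mapping surprises, sketched with std's locale-independent Unicode mappings (so the region-specific part isn't shown). `case_demo` is a hypothetical name.

```rust
// Hypothetical helper collecting a few case-mapping results.
fn case_demo() -> (String, String, String, String) {
    // KELVIN SIGN (U+212A) lowercases to an ordinary 'k', which then
    // uppercases to an ordinary 'K': the roundtrip loses the original.
    let kelvin = '\u{212A}';
    let lowered: String = kelvin.to_lowercase().collect();
    let back_up: String = lowered.to_uppercase();

    // One codepoint in, two codepoints out: can't be done in place.
    let sharp_s_upper = "\u{00DF}".to_uppercase(); // ß

    // Greek capital epsilon looks like Latin E but lowercases differently.
    let greek_lower = '\u{0395}'.to_lowercase().to_string();

    (lowered, back_up, sharp_s_upper, greek_lower)
}
```

Note that `to_lowercase` and `to_uppercase` on a char return iterators rather than a char, precisely because one codepoint can map to several.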

Are any of these operations even coherent in a language like Japanese? Why are you trying to do them?

God help you if you want to display this text. Are you ready to handle right-to-left text? Are you assuming that your font is monospace (hey there terminal and text editors)? C̢̫a̘̺̯n ̘̜̦̹y̷̫̼̘̩o̶͉u̗̩̻̞ ̻ẹ͡v̴̤͎̹e̶̫̠̤̭̺̤̞n̛̞̹̣̩̲͉̮ ̜͖̪͔̖d̤e̘̯ͅa̺l̟̀ ͚̗̣w̭i̸͇̠̥̣̜̥t̸h̸̻̮̼̙̹ ̗̺̱̣̰̱̙z̟a̺͜l̠̦̖̟̰͍g҉̜͖͓̫ơ̩̹̰͕?̹̳̼̯̘̺̟


I think my favorite part of this article is how my Zalgo example doesn't render even remotely correctly in any browser I've tried it in.


Because we spent a lot of time optimizing every assumption for the ascii-only case, and letting go of that simplification is distressing, I guess.


Yes. And many languages had ASCII-only stuff for a long time, so "real" APIs seem much, much more complex. Of course, that complexity was always there, but us English speakers could mostly just ignore it...



