Though note that it's only suitable for internal usage (as in, you wouldn't expose an API that uses WTF-8), so it's not something that you need to actually think about when learning Rust. :P
1. a plain "buffer of unparsed octets" type (like Erlang's "binary" type), to hold onto your weird invalid UTF-16 (or whatever other weird invalid stuff you like);
and 2. a strict, validated-on-construction "sequence of codepoints" type.
Going from #1 to #2 wouldn't be a direct cast; it'd instead be a stream-parser function that could be hooked with a callback, which would be handed parsed-and-valid, and invalid-and-left-unparsed, "segments":
maybe_parsed_segment() = valid_string_of_codepoints() | binary()
The default callback would presumably respond to a binary() of size N by emitting a "�" string of length N. That's pretty much what we have today, but both sturdier and more extensible.
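A rough Rust sketch of that two-type design (the names `Segment`, `segments`, and `lossy` are mine, not from any real library): the stream parser hands back alternating valid and invalid segments, and the default handler turns an invalid run of N bytes into N replacement characters, as described above.

```rust
// Hypothetical sketch: segments are either validated codepoint sequences
// (type #2) or raw unparsed octets (type #1).
enum Segment<'a> {
    Valid(&'a str),
    Invalid(&'a [u8]),
}

fn segments(mut bytes: &[u8]) -> Vec<Segment<'_>> {
    let mut out = Vec::new();
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                out.push(Segment::Valid(s));
                break;
            }
            Err(e) => {
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                if !valid.is_empty() {
                    out.push(Segment::Valid(std::str::from_utf8(valid).unwrap()));
                }
                // error_len() is None when the input ends mid-sequence.
                let bad = e.error_len().unwrap_or(rest.len());
                out.push(Segment::Invalid(&rest[..bad]));
                bytes = &rest[bad..];
            }
        }
    }
    out
}

// The "default callback" described above: a binary() of size N becomes
// a string of N U+FFFD replacement characters.
fn lossy(bytes: &[u8]) -> String {
    segments(bytes)
        .into_iter()
        .map(|seg| match seg {
            Segment::Valid(s) => s.to_owned(),
            Segment::Invalid(b) => "\u{FFFD}".repeat(b.len()),
        })
        .collect()
}

fn main() {
    assert_eq!(lossy(b"ok\xFF\xFEok"), "ok\u{FFFD}\u{FFFD}ok");
    assert_eq!(lossy("héllo".as_bytes()), "héllo");
}
```

A real implementation would take a callback instead of hardcoding the replacement policy, but the segmentation is the interesting part.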
Ouch. So one thing that Rust is not going to succeed at is not having the chaos around string types that C++ has. Why is it that some languages can get away with just having 1 unicode string type with libraries to translate to other representations as needed, but not others?
1. Rust is not garbage collected. This means that a separation between strings-that-own-their-data and strings-that-reference-their-data is needed.
2. Rust cares about first-class FFI support; in particular, copying strings at FFI boundaries is right out. That is why Rust needs C strings to be a separate type.
3. Rust wants to do pathnames right in a cross-platform way. In particular, it should be impossible for other apps to create valid pathnames that cause Rust apps manipulating those paths to crash or malfunction. UTF-8 paths fail this criterion.
How does Rust handle strings encoded as UTF-16 or UTF-32? Does it need immediate conversion?
 - https://doc.rust-lang.org/std/ffi/struct.OsString.html
Very cool to show off to your geek programmer friends and impress them, but I really question the day-to-day usefulness when considering the added complexity.
I have never ever had bugs in my code from somehow having invalid Unicode code-points in my paths. 99% of the time, replacing '/' with '\' and handling drive letters gets you there. I realize certain classes of bugs could happen, but I'd rather take that risk, have a bug, and fix it, than burden the type system and libraries.
In a way, I'm a little sad that modern computer languages are going from C's extreme of "always trust the programmer" to the other of "never trust the programmer". Bugs are bad, but language usability, ease-of-use and simplicity are also worth fighting for.
Having said all that, I am really happy to see Rust being awesome in the general case and appreciate all the work you and the rest of the team are putting forth.
They don't seem analogous in the slightest. Grammars can be correct without being proven correct. Calculating factorials at runtime instead of at compile time doesn't change the result of the calculation. But treating all paths as though they were UTF-8 breaks programs when certain paths are fed to them.
> I have never ever had bugs in my code from somehow having invalid Unicode code-points in my paths. 99% of the time, replacing '/' with '\' and handling drive letters gets you there. I realize certain classes of bugs could happen, but I'd rather take that risk, have a bug, and fix it, than burden the type system and libraries.
So this philosophy is what brought us the fun of feeding paths that contain spaces to shell scripts. Except this would be worse, because you can write space-correct Bash scripts—but if paths in the language are UTF-8, and you have to work with a path that contains non-UTF-8 characters, there is literally nothing you can do, and your program will be broken no matter what†. I'm glad that Rust took the time to do things right and figured out how to make the right thing ergonomic instead of giving up, deciding that correct paths are impossible, and forcing everyone to write broken programs.
By the way, this is not a theoretical problem: VCS's have been bitten by this in the real world. https://www.redmine.org/issues/2664
† Note that Golang has a different solution to this problem: strings in Golang don't have to be UTF-8, and they're only UTF-8 by convention. This is a totally valid solution, but Rust didn't go with it because removing the invariant that strings are UTF-8 has downstream consequences that affect every API that takes strings.
I think that's precisely the point; at scale, and with today's modular software culture (like last week's left-pad news cycle demonstrates), 99% correct software can mean millions of installations of bad code, all vulnerable to attack, or even just 1%-type bugs.
One way to think about the last 10 years of language design is as an ongoing dialogue about how much or how little to trust engineer intuition about which tools are safe for which jobs. The pendulum is swinging towards the belief that, in general, engineers should be given much more safety-oriented tools.
Also, you seem to think typeful programming is just for those who either want to show off (“to your geek programmer friends”) or don't trust other programmers (“to the other [extreme] of "never trust the programmer"”). This is plainly untrue. Typeful programming is for those who embrace code as a communication medium (“my data structure has these complex invariants, which I can communicate either with three lines of code, or with two long English paragraphs in comments that might even get out of sync”) and want to include the type checker in the conversation (“this code is rather tricky, and I made lots of changes to it, could you check for me if I missed anything?”). Of course, the effectiveness of code as a communication medium is contingent on your ability to understand code.
Making things seem simpler by finding an abstraction over them is great!
Making things seem simpler by finding an abstraction that leaks some details so you sometimes need to handle them at a high level is... less great. But sometimes still a win.
Making things seem simpler but neither handling the complexity nor allowing your users to handle the complexity is technical debt at best.
Here are some examples of acceptably leaky abstractions:
(0) Memory is infinite, and memory allocation will always succeed. If it doesn't, aborting the entire process is the right course of action.
(1) Communication over a network is reliable and data will arrive in the same order in which it was sent.
And here are some examples of unacceptably leaky abstractions:
(0) A few basic data structures [in particular, growable arrays and hashtables] suffice to implement every algorithm you might ever want to implement.
(1) Strings are arrays of characters, and thus you can efficiently index them by character.
(2) Every [nonempty] string is a valid file name or URL.
(3) Every external resource can be manipulated using an API that presents it as a sequence of bytes.
(4) An event loop is the universal solution to all asynchronous I/O problems.
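For a concrete instance of leak (1): Rust refuses to paper over it by simply not offering character indexing. A minimal sketch ("naïve" chosen because 'ï' is two bytes in UTF-8):

```rust
fn main() {
    let s = "naïve";
    // `s[1]` does not compile: strings are UTF-8 bytes, not arrays of
    // characters, so indexing by character can't be O(1).
    assert_eq!(s.len(), 6);                  // length in bytes
    assert_eq!(s.chars().count(), 5);        // length in codepoints
    assert_eq!(s.chars().nth(2), Some('ï')); // explicit O(n) walk
    // Byte-range slicing exists, but panics off a char boundary:
    assert_eq!(&s[0..2], "na");
    // &s[0..3] would panic: byte 3 is inside the two-byte 'ï'.
}
```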
I think in general, building on top of leaky abstractions is a form of technical debt, and you have to be aware of how much you're accumulating and what you're getting in exchange. Also, complexities stack super-linearly. I'm tempted to say, additionally, that building shallowly atop a leaky abstraction is less problematic than pushing leaks further down in a tall stack... but on reflection I have no idea how I'd support that.
So, what I meant to say in my previous post is that, as far as I can tell, good leaky abstractions only arise when doing systems programming, or, more precisely, at the boundary between low-level system services (e.g. runtime systems for managed languages) and the applications that consume them. Abstraction leaks in high-level application code are likely to be bad.
You need to worry about running out of memory precisely because the abstraction is leaky. If you need to micromanage, it's perhaps not leaky, but it's also not much of an abstraction.
Exploring this example a little deeper, if our model is "we never run out of memory, but our program may stochastically die at any point in time", then that's not as leaky as "we never run out of memory", and our program may die for reasons outside our control so it's a good idea to be robust against those. But now let's say we need to send a request to a service, and the service disconnects when we've sent the request. With a stochastic model of failures, we clearly should just re-send. But if the problem was that our payload demanded more memory than could be provided, retrying repeatedly will just keep killing servers.
We probably want a model of "we never run out of memory, our program may die at any point in time, possibly correlated with inputs", which I would contend is actually not very leaky. We still spill over into performance, but most things spill over into performance...
I would say that leaky abstractions constitute debt relative to non-leaky equivalently powerful abstractions. They may still pay their way.
In practice, I don't worry. I'm fine with the leaky abstraction, because the benefit of fixing the leak (reducing the probability that OOM crashes my program from almost zero to exactly zero) isn't worth the cost (increased complexity). But see below.
> If you need to micromanage, it's perhaps not leaky, but it's also not much of an abstraction.
I'm not sure about this. There are levels of abstraction and degrees of micromanagement. For instance, `malloc` lets you micromanage the memory layout of your data structures, and `mmap` and `mprotect` also let you micromanage what you can do with each memory page, but they're still abstractions over even lower-level concerns, like the mapping between physical and virtual memory addresses.
> We probably want a model of "we never run out of memory, our program may die at any point in time, possibly correlated with inputs", which I would contend is actually not very leaky.
Depends on what kind of program you're writing. If your web crawler crashes, you just restart it. If your web server crashes, your users will be annoyed, but at least your database won't be compromised. (You're properly using transactions, right?) But if your DBMS crashes, your data might even be in an irrecoverably inconsistent state, unless you properly anticipated all (software) causes for it to crash.
In practice, either you're being sloppy, or what you've described is the relevant worrying: considering whether your program has appreciable odds of OOM and whether things would be particularly bad when that happens, and then deciding that additional worry is unnecessary. But this can need re-evaluating as circumstances change.
> There are levels of abstraction and degrees of micromanagement.
No disagreement there.
> Depends on what kind of program you're writing.
A bit, but mostly in terms of performance.
> But if your DBMS crashes, your data might even be in an irrecoverably inconsistent state, unless you properly anticipated all (software) causes for it to crash.
I disagree. If you have designed your program with an assumption that it might fail at any point regardless of reason, then you don't need to have "properly anticipated all (software) causes for it to crash", and further you should be robust against some portion of non-software causes.
I'm unhappy with this line of reasoning. For me, an abstraction leak is a failure to behave as advertised. Since memory allocation can always fail, an abstraction that advertises “allocations always succeed” is necessarily leaky. Even if the leak isn't important in practice, the leak exists.
> If you have designed your program with an assumption that it might fail at any point regardless of reason
I'm not sure this is even possible (if I take nothing for granted, how do I do anything?), but I'd like to be proven wrong.
To my mind, it must fail to behave as advertised in a way that is meaningful. Otherwise, the differences are precisely what is being abstracted away.
> I'm not sure this is even possible (if I take nothing for granted, how do I do anything?), but I'd like to be proven wrong.
I don't mean "might act in arbitrary ways", but specifically "might be unexpectedly terminated at any point".
Lucky for you. I will think about you next time I have to fix one of these 1% pathname bugs you do not seem to be concerned about, which tend to happen all the time when your program is used by many people with many files (i.e. the case every developer hopes for).
One of the projects I am working on lately is a data integrity and syncing platform with multiple PB of data in multiple data centers, and the dataset has been around for 10-12 years.
Some of the file names and path names I see have the most unnatural and unbelievable characters in them.
It can absolutely be quite a challenge. This point about Rust's string system sounds like it could give me some very precise control.
Do you use any computers with non-unicode filesystems? I hit such a bug every month or two. (Most recently a couple of days ago in Deluge, for a concrete example, which seems to not open .torrent files with non-unicode filenames). If you're actually using a computer with a non-unicode filesystem at all regularly then I'm honestly amazed you haven't seen such bugs.
> I realize certain classes of bugs could happen, but I'd rather take that risk, have a bug, and fix it, than burden the type system and libraries.
It's a question of costs and benefits, no? A particular class of bugs has severity x and happens at rate y. If the language is such that the type system and libraries can handle it at a cost of less than xy (in terms of complicating or slowing development) then it's worth handling it there; if not, not.
> In a way, I'm a little sad that modern computer languages are going from C's extreme of "always trust the programmer" to the other of "never trust the programmer". Bugs are bad, but language usability, ease-of-use and simplicity are also worth fighting for.
Language usability, ease-of-use and simplicity are absolutely valuable. A good language is one that allows you to write correct programs without sacrificing those things.
I find a language that will keep track of things for me is a actually huge help in writing correct code. E.g. if I'm writing Python 2 then I still have to worry about whether a particular object is a string or a byte sequence - worse, I have to do all the work of keeping track of it in my head, because the language will just silently corrupt data if I get it wrong. I end up writing comments on my functions that say what a particular value is, and carefully reading my code to verify I've made encode()/decode() calls in the right places. I don't want my language to "trust me", any more than I'd want to e.g. try to cut a piece of wood to the right length without measuring or marking it first.
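Rust makes exactly the distinction the parent wants from Python: bytes and text are different types, and the decode step is explicit and fallible. A minimal sketch (`shout` is a made-up example function):

```rust
// The compiler tracks text vs. bytes so you don't have to do it in your head.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    let bytes: Vec<u8> = vec![0x68, 0x69]; // raw bytes that happen to be "hi"
    // shout(&bytes);                      // compile error: expected &str
    let text = String::from_utf8(bytes).unwrap(); // explicit, fallible decode
    assert_eq!(shout(&text), "HI");
    // Invalid bytes can't silently become text:
    assert!(String::from_utf8(vec![0xFF]).is_err());
}
```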
(Many of the western codepages are either subsets of UTF-8, or not actually UTF-8 but can roundtrip through UTF-8 so you don't actually notice the difference as long as the program is just using the filename as an opaque "handle" and not actually manipulating it. In my case my filesystem is Shift-JIS for compatibility with some old Japanese programs, and that doesn't roundtrip cleanly through unicode)
The major driver behind Rust's design is memory safety - make it impossible (or very difficult) for certain classes of bugs to exist. You can write C software that is reliable without that safety, but it is very time consuming and never perfect.
This is a similar situation - give the programmer tools to write "obviously" correct code.
The String/str divide is the unavoidable consequence of not having garbage collection. A String is owned, a str is borrowed. This blog post goes into more detail.
This is nowhere near the chaos that C++ has.
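A minimal illustration of the owned/borrowed split (`takes_str` is a made-up function):

```rust
fn takes_str(s: &str) {
    assert!(!s.is_empty());
}

fn main() {
    let owned: String = String::from("hello"); // heap-allocated, owns its bytes
    let borrowed: &str = &owned;               // cheap view into `owned`
    assert_eq!(borrowed, "hello");
    let slice: &str = &owned[1..4];            // "ell": no copy, no allocation
    assert_eq!(slice, "ell");
    takes_str(&owned);    // &String coerces to &str (deref coercion)
    takes_str("literal"); // string literals are &'static str
}
```

Functions that only read string data conventionally take &str, so they accept both owned and borrowed strings for free.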
Which as someone who greatly appreciates the separation Rust offers, this makes me curious: does the string type either not enforce its contents being Unicode, or is it not capable of storing arbitrary OS paths? Because AFAICT, the two are mutually exclusive. The rub here is that OS paths — on POSIX operating systems¹ — are byte strings that we like to pretend are text; but since no particular encoding is enforced, you end up with effectively just byte strings. A String type that validates itself as valid Unicode is not capable of representing these (as arbitrary byte strings are not necessarily valid Unicode). But if we let the String type not validate, then I can pass non-Unicode to functions that take a "String" — something I'd rather not do.
There's also Python's byte smuggling, but I find that a bit weird.
¹While I call out POSIX, I'm told Windows pathnames are sequences of unsigned 16-bit values that most people expect to be UTF-16 — but that fact isn't enforced by the OS.
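Rust's answer is that the two types are indeed mutually exclusive, so it has both: String enforces UTF-8, while OsString/PathBuf hold arbitrary platform data. A sketch (Unix-only, via the `OsStringExt` extension trait; `os_from_bytes` is my wrapper name):

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt;

// On Unix, any byte sequence is a legal OsString.
fn os_from_bytes(b: Vec<u8>) -> OsString {
    OsString::from_vec(b)
}

fn main() {
    let os = os_from_bytes(vec![0x66, 0x6F, 0x6F, 0xFF]); // "foo" + invalid byte
    // Converting to String is fallible, which is exactly the point:
    assert!(os.clone().into_string().is_err());
    // Lossy conversion when you just need something printable:
    assert_eq!(os.to_string_lossy(), "foo\u{FFFD}");
}
```

So non-Unicode data never sneaks into a String; you have to decide explicitly what to do with it at the boundary.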
I don't know what happens if you do something like mount an NFS share with filenames that aren't valid Unicode. I would guess a fair amount of stuff will break.
Swift strings are Unicode, and Unicode is certainly capable of representing arbitrary byte streams, by just converting to and from any 8 bit fixed-width encoding (not UTF-8!). There are explicit conversion functions like -fileSystemRepresentation which handle the conversion between filesystem paths and Strings.
People differ on whether universal reference counting is a form of garbage collection with weak guarantees or a form of manual memory management with flexible semantics.
Which is pretty unfortunate, because the memory management field has always treated reference counting as a form of garbage collection, because it is. The idea that reference counting is somehow not GC is, as far as I can tell, a recent one (popularized, I think, by Cocoa and iOS?)
This is not just a semantic quibble, because there are successful garbage collection techniques that combine tracing with reference counting (e.g. ulterior reference counting) to achieve GC, so treating tracing as the only way to do GC makes no sense. In fact, David Bacon observed in 2004 that reference counting and tracing garbage collection are best viewed as just two extremes of a continuum, and many "optimizations" that you apply to one scheme or the other are really just moving the scheme you choose toward the opposite pole.
I remember seeing lots of "but Python doesn't have REAL GC" arguments years ago. But Cocoa's foray into GC, subsequent retreat back to manual ref counting and then ARC probably have affected this too, yeah.
What they miss is that RC is a form of GC, and that Apple only gave up on Objective-C's tracing GC because they never managed to make it stable.
Being able to link GC-enabled frameworks with non-GC-enabled ones, plus the fragility of the underlying C code with respect to memory management, meant Apple's developer forums were full of crash reports.
So Apple did a "Worse is Better" pivot and basically made the compiler automate the retain/release calls from Cocoa that were being written by hand.
Similarly, Swift adopts ARC because it needs to interoperate with Objective-C's runtime, and having a tracing GC would introduce extra performance issues, or force them to use an RC cache mechanism like the one .NET has for COM.
But of course, the majority without the background knowledge of how the decision came to be or interest in compiler design, just uses Apple's decision as some kind of proof against GC.
From what I can remember pauses were never an issue.
There used to be a quite long page on the Apple developer site listing all the caveats and corner cases to watch out for when enabling the new GC.
It was called "Garbage Collection Programming Guide", but all links to it have been disabled. You can still get it with a bit of Google-fu.
Also, I remember occasionally going through the forums, and quite a few developers were having integration issues with the new collector.
Standard C++ only has value-semantic C strings, and only supports fixed-width character encodings. It's not very rich, but I'd hardly call it 'chaos'. I'm happy to let my GUI library or DBMS handle collation and normalisation.
This is actually an area where the C++ standard library was lacking until recently. For example, you'd typically want a map with string keys to own them, but without severe contortions that meant readers of the map had to allocate a string key (expensive!) just to perform a lookup.
Actually, that's almost exactly what Rust is doing. Most of the string types you mentioned come from the FFI module, which is used to create bindings to other languages. They are a necessity for converting between the representations cleanly.
 Java also manages to conflate synchronization into the mix with `StringBuffer`.
 C# (thankfully) just has static functions to manipulate strings as paths.
Also I personally don't consider paths as strings, nor URLs – both are types for specific purposes with specific invariants, which are distinct from strings'.
That's not quite true -- on POSIX paths are strings of bytes. POSIX doesn't specify any particular encoding. AFAIR the only restrictions on paths are around the NUL (0) byte and '/' (47).
 I'm sure we'd all be happier if Unicode (encoded as UTF-8) were specified as the way to do paths, but POSIX was invented quite a bit before Unicode & UTF-8, so here we are.
Basically if your language/std-lib claims to be cross platform and considers paths to be definitely-utf8, it's busted. I think, maybe, plan9 is a case where this claim holds?
On Windows, and in many Windows APIs, the return value is a "sequence of UTF-16 code units" (which is essentially a fancy way of saying "a sequence of 16-bit words"). This goes back to Windows originally using UCS-2, where any sequence of 16-bit words is valid (because there was no concept of a surrogate, let alone a lone surrogate): its Unicode support predates the Unicode Consortium extending Unicode to 21 bits, and with it the existence of UTF-16 (1996).
> String is the heap-allocated sequence of characters,
Honestly, I think the main thing that trips people up here is the naming convention. If the pair were String and StringView rather than String and str, I don't think people would have nearly as much trouble. I guess the name was chosen for brevity, considering how pervasive &str is in Rust code.
That said, if anything, it would be StrBuf and str, because str is a primitive type, hence the lower case. It almost wasn't a primitive type, but there were drawbacks, so we kept it as one.
"unicode bytes" aren't a thing; bytes implies encoding (and subscripting yielding <=0xff), otherwise it's "codepoints" (and subscripting yielding an int somewhere on the unicode planes).
String and str are guaranteed to be valid UTF-8 encoded Unicode strings. If you’re wondering why UTF-8 is the standard encoding for Rust strings, check out the Rust FAQ’s answer to that question.
That's spot on. Please add this to the first part, too: "... buffer of UTF-8 encoded Unicode bytes", or even just "encoded Unicode string". It will be clear what is (and is not) meant.
Otherwise nice article! Even understandable for someone with no Rust experience.
As a kinda funny aside, I also wrote the linked-to FAQ answer. Took a number of drafts to get all the fiddly Unicode terminology right.
When I see 'sort' collocated with 'string' my mind immediately jumps to the verb meaning of 'sort', not the noun meaning of 'sort', so it was a bit confusing.
Also the type-theoretic meaning of "kind" is accurate here, no? Owned vs slice really are two different kinds - it's just that Rust can't represent kindedness in general (it has specialized support for tracking ownedness, but you can't reuse that system to track other properties).
I'd like to see some examples illustrating appropriate usage of Clone-on-Write with Rust strings - that's something I personally have struggled with while learning rust.
One area where it's used effectively in the stdlib is String::from_utf8_lossy: http://doc.rust-lang.org/collections/string/struct.String.ht...
If the argument is valid UTF-8, no allocation needs to be performed, and the returned Cow wraps a borrowed &str. But if it's not valid UTF-8, replacement characters need to be inserted, which means allocating, so the Cow wraps an owned String.
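Concretely, you can observe which Cow variant you got back. A small sketch of the `from_utf8_lossy` behavior described above:

```rust
use std::borrow::Cow;

fn main() {
    // Valid UTF-8: no allocation, we get a borrowed view of the input.
    match String::from_utf8_lossy(b"hello") {
        Cow::Borrowed(s) => assert_eq!(s, "hello"),
        Cow::Owned(_) => unreachable!("valid input should not allocate"),
    }
    // Invalid UTF-8: replacement characters force an owned String.
    match String::from_utf8_lossy(b"hel\xFFlo") {
        Cow::Owned(s) => assert_eq!(s, "hel\u{FFFD}lo"),
        Cow::Borrowed(_) => unreachable!("invalid input must allocate"),
    }
}
```

The same pattern works for your own APIs: return Cow<str> from any function that only sometimes needs to modify its input, and callers pay for an allocation only on the paths that require one.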