String types in Rust (andrewbrinker.github.io)
143 points by steveklabnik on Mar 28, 2016 | 85 comments



My favorite bit of learning with Rust strings was the WTF-8 format: https://simonsapin.github.io/wtf-8/

Implemented: https://github.com/rust-lang/rust/blob/master/src/libstd/sys...


WTF-8 is great, and lest anyone think that it's intended to be a joke it's worth mentioning that it's actually a very useful thing:

"WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired."

Though note that it's only suitable for internal usage (as in, you wouldn't expose an API that uses WTF-8), so it's not something that you need to actually think about when learning Rust. :P
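
To make the "unpaired surrogate" part concrete: surrogate code points (U+D800..U+DFFF) are valid UTF-16 code units, but they aren't Unicode scalar values, so they can never appear in a Rust char or in strict UTF-8. A tiny illustration using only std:

    fn main() {
        // A lone surrogate is not a scalar value, so it can't become a char...
        assert_eq!(std::char::from_u32(0xD800), None);
        // ...while ordinary code points are fine.
        assert_eq!(std::char::from_u32(0x0041), Some('A'));
    }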


I don't really get the point of it; I'd think that, properly, what you'd want is:

1. a plain "buffer of unparsed octets" type (like Erlang's "binary" type), to hold onto your weird invalid UTF-16 (or whatever other weird invalid stuff you like);

and 2. a strict, validated-on-construction "sequence of codepoints" type.

Going from #1 to #2 wouldn't be a direct cast; it'd instead be a stream-parser function that could be hooked with a callback, which would be handed parsed-and-valid, and invalid-and-left-unparsed, "segments":

   maybe_parsed_segment() = valid_string_of_codepoints() | binary()
...and would then respond with a valid segment, which could just be "" to discard the input segment.

The default callback would presumably respond to a binary() of size N by emitting a "�" string of length N. That's pretty much what we have today, but both sturdier and more extensible.
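
Here's a rough Rust sketch of that idea (the names are made up, and it leans on std::str::from_utf8's error reporting rather than an Erlang-style binary type): walk the byte buffer, hand valid and invalid segments to a callback, and let the callback decide what to emit.

    // Hypothetical helper, not a real std API.
    fn decode_with<F>(mut input: &[u8], mut on_segment: F) -> String
    where
        F: FnMut(Result<&str, &[u8]>) -> String,
    {
        let mut out = String::new();
        while !input.is_empty() {
            match std::str::from_utf8(input) {
                // The whole remainder is valid: hand it over and stop.
                Ok(valid) => {
                    out.push_str(&on_segment(Ok(valid)));
                    break;
                }
                Err(e) => {
                    // Hand over the valid prefix, if any.
                    let (valid, rest) = input.split_at(e.valid_up_to());
                    if !valid.is_empty() {
                        out.push_str(&on_segment(Ok(std::str::from_utf8(valid).unwrap())));
                    }
                    // Then the invalid run (error_len is None only at a truncated tail).
                    let bad_len = e.error_len().unwrap_or(rest.len());
                    out.push_str(&on_segment(Err(&rest[..bad_len])));
                    input = &rest[bad_len..];
                }
            }
        }
        out
    }

The default callback from the comment would then be something like: |seg| match seg { Ok(text) => text.to_string(), Err(bad) => "\u{FFFD}".repeat(bad.len()) }.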


In this post I’m going to explain the organization of Rust’s string types with String and str as examples, then get into the lesser-used string types—CString, CStr, OsString, OsStr, PathBuf, and Path

Ouch. So one thing Rust is not going to succeed at is avoiding the chaos around string types that C++ has. Why is it that some languages can get away with just having 1 unicode string type with libraries to translate to other representations as needed, but not others?


Three reasons why Rust cannot adopt Go's solution:

1. Rust is not garbage collected. This means that a separation between strings-that-own-their-data and strings-that-reference-their-data is needed.

2. Rust cares about first-class FFI support; in particular, copying strings at FFI boundaries is right out. That is why Rust needs C strings to be a separate type.

3. Rust wants to do pathnames right in a cross-platform way. In particular, it should be impossible for other apps to create valid pathnames that cause Rust apps manipulating those paths to crash or malfunction. UTF-8 paths fail this criterion.
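
A small illustration of point 3, as a hedged sketch assuming a Unix target: a path may contain bytes that aren't valid UTF-8, so Path/OsStr can't require UTF-8 the way String does.

    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt; // Unix-only extension trait
    use std::path::Path;

    fn main() {
        let raw = b"caf\xE9.txt"; // Latin-1 'é': not valid UTF-8
        let path = Path::new(OsStr::from_bytes(raw));
        // Converting to &str fails, but the path itself is still perfectly usable:
        assert!(path.to_str().is_none());
        println!("{:?}", path);
    }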


> Rust cares about first-class FFI support; in particular, copying strings at FFI boundaries is right out. That is why Rust needs C strings to be a separate type.

How does Rust handle strings encoded as UTF-16 or UTF-32? Does it need immediate conversion?


OsString [1] in Rust provides a native string that can be passed directly to OS-specific APIs. It is UTF-16 on Windows.

[1] - https://doc.rust-lang.org/std/ffi/struct.OsString.html
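
As a minimal sketch (assuming a Windows target), an OsStr can be turned into the sequence of 16-bit code units that Win32 APIs expect, without any revalidation:

    #[cfg(windows)]
    fn to_wide(s: &std::ffi::OsStr) -> Vec<u16> {
        use std::os::windows::ffi::OsStrExt;
        // encode_wide yields potentially ill-formed UTF-16; append a NUL for C APIs.
        s.encode_wide().chain(std::iter::once(0)).collect()
    }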


Point 3 reminds me of Haskell trying to prove certain grammars correct at compile-time or C++ calculating factorials using template meta-programming.

Very cool to show off to your geek programmer friends and impress them, but I really question the day-to-day usefulness when considering the added complexity.

I have never ever had bugs in my code from somehow having invalid Unicode code-points in my paths. 99% of the time, replacing '/' with '\' and handling drive letters gets you there. I realize certain classes of bugs could happen, but I'd rather take that risk, have a bug, and fix it, than burden the type system and libraries.

In a way, I'm a little sad that modern computer languages are going from C's extreme of "always trust the programmer" to the other of "never trust the programmer". Bugs are bad, but language usability, ease-of-use and simplicity are also worth fighting for.

Having said all that, I am really happy to see Rust being awesome in the general case and appreciate all the work you and the rest of the team are putting forth.


> Point 3 reminds me of Haskell trying to prove certain grammars correct at compile-time or C++ calculating factorials using template meta-programming.

They don't seem analogous in the slightest. Grammars can be correct without being proven correct. Calculating factorials at runtime instead of at compile time doesn't change the result of the calculation. But treating all paths as though they were UTF-8 breaks programs when certain paths are fed to them.

> I have never ever had bugs in my code from somehow having invalid Unicode code-points in my paths. 99% of the time, replacing '/' with '\' and handling drive letters gets you there. I realize certain classes of bugs could happen, but I'd rather take that risk, have a bug, and fix it, than burden the type system and libraries.

So this philosophy is what brought us the fun of feeding paths that contain spaces to shell scripts. Except this would be worse, because you can write space-correct Bash scripts—but if paths in the language are UTF-8, and you have to work with a path that contains non-UTF-8 characters, there is literally nothing you can do, and your program will be broken no matter what†. I'm glad that Rust took the time to do things right and figured out how to make the right thing ergonomic instead of giving up, deciding that correct paths are impossible, and forcing everyone to write broken programs.

By the way, this is not a theoretical problem: VCS's have been bitten by this in the real world. https://www.redmine.org/issues/2664

† Note that Golang has a different solution to this problem: strings in Golang don't have to be UTF-8, and they're only UTF-8 by convention. This is a totally valid solution, but Rust didn't go with it because removing the invariant that strings are UTF-8 has downstream consequences that affect every API that takes strings.


> 99% of the time, replacing '/' with '\' and handling drive letters gets you there.

I think that's precisely the point; at scale, and with today's modular software culture (like last week's left-pad news cycle demonstrates), 99% correct software can mean millions of installations of bad code, all vulnerable to attack, or even just 1%-type bugs.

One way to think about the last 10 years of language design is an ongoing dialogue on how much or little to trust engineer intuition about which tools are safe for which jobs. The pendulum is swinging towards the belief that in general engineers should be given much more safety-oriented tools.


It's not like we embrace complexity for its own sake. The harsh reality is that things like file paths, dates and times, phone numbers and even people's names are already intrinsically complex; and you can either acknowledge this complexity in your code, or ignore it at your own peril. If you want to blame anyone for the complexity, blame those who design file systems, or ISO-standardize date and time formats, or even humanity as a whole for coming up with all sorts of naming customs! A programmer (or any other given single person) has only so much control over the complexity of their environment.

Also, you seem to think typeful programming is just for those who either want to show off (“to your geek programmer friends”) or don't trust other programmers (“to the other [extreme] of "never trust the programmer"”). This is plainly untrue. Typeful programming is for those who embrace code as a communication medium (“my data structure has these complex invariants, which I can communicate either with three lines of code, or with two long English paragraphs in comments that might even get out of sync”) and want to include the type checker in the conversation (“this code is rather tricky, and I made lots of changes to it, could you check for me if I missed anything?”). Of course, the effectiveness of code as a communication medium is contingent on your ability to understand code.


Exactly. The complexity is inherent in the domain, and making it seem simpler by sweeping it under the carpet is not solving the problem, and it will result in hard to find bugs when working at scale.


There's abstraction, leaky abstraction, and brokenness.

Making things seem simpler by finding an abstraction over them is great!

Making things seem simpler by finding an abstraction that leaks some details so you sometimes need to handle them at a high level is... less great. But sometimes still a win.

Making things seem simpler but neither handling the complexity nor allowing your users to handle the complexity is technical debt at best.


Maybe this is just a sign of my lack of experience, but I don't think implementing leaky abstractions is ever justified outside of systems programming. (Implementing a runtime system for a managed language counts as “systems programming” in my book.) Here are some examples of justified leaky abstractions:

(0) Memory is infinite, and memory allocation will always succeed. If it doesn't, aborting the entire process is the right course of action.

(1) Communication over a network is reliable and data will arrive in the same order in which it was sent.

And here are some examples of unacceptably leaky abstractions:

(0) A few basic data structures [in particular, growable arrays and hashtables] suffice to implement every algorithm you might ever want to implement.

(1) Strings are arrays of characters, and thus you can efficiently index them by character.

(2) Every [nonempty] string is a valid file name or URL.

(3) Every external resource can be manipulated using an API that presents it as a sequence of bytes.

(4) An event loop is the universal solution to all asynchronous I/O problems.


I don't see how your examples line up with "systems programming" vs "not systems programming".

I think in general, building on top of leaky abstractions is a form of technical debt, and you have to be aware of how much you're accumulating and what you're getting in exchange. Also, complexities stack super-linearly. I'm tempted to say, additionally, that building shallowly atop a leaky abstraction is less problematic than pushing leaks further down in a tall stack... but on reflection I have no idea how I'd support that.


Building on top of leaky abstractions isn't necessarily a form of technical debt. Some leaky abstractions are actually better than their non-leaky counterparts. For example, not worrying about the possibility of running out of memory lets you concentrate on whatever your program is actually supposed to do, rather than micromanaging a very improbable situation.

So, what I meant to say in my previous post is that, as far as I can tell, good leaky abstractions only arise when doing systems programming, or, more precisely, at the boundary between low-level system services (e.g. runtime systems for managed languages) and the applications that consume them. Abstraction leaks in high-level application code are likely to be bad.


Hmm. I think that's a little confused.

You need to worry about running out of memory precisely because the abstraction is leaky. If you need to micromanage, it's perhaps not leaky, but it's also not much of an abstraction.

Exploring this example a little deeper, if our model is "we never run out of memory, but our program may stochastically die at any point in time", then that's not as leaky as "we never run out of memory", and our program may die for reasons outside our control so it's a good idea to be robust against those. But now let's say we need to send a request to a service, and the service disconnects when we've sent the request. With a stochastic model of failures, we clearly should just re-send. But if the problem was that our payload demanded more memory than could be provided, retrying repeatedly will just keep killing servers.

We probably want a model of "we never run out of memory, our program may die at any point in time, possibly correlated with inputs", which I would contend is actually not very leaky. We still spill over into performance, but most things spill over into performance...

I would say that leaky abstractions constitute debt relative to non-leaky equivalently powerful abstractions. They may still pay their way.


> You need to worry about running out of memory precisely because the abstraction is leaky.

In practice, I don't worry. I'm fine with the leaky abstraction, because the benefit of fixing the leak (reducing the probability that OOM crashes my program from almost zero to exactly zero) isn't worth the cost (increased complexity). But see below.

> If you need to micromanage, it's perhaps not leaky, but it's also not much of an abstraction.

I'm not sure about this. There are levels of abstraction and degrees of micromanagement. For instance, `malloc` lets you micromanage the memory layout of your data structures, and `mmap` and `mprotect` also let you micromanage what you can do with each memory page, but they're still abstractions over even lower-level concerns, like the mapping between physical and virtual memory addresses.

> We probably want a model of "we never run out of memory, our program may die at any point in time, possibly correlated with inputs", which I would contend is actually not very leaky.

Depends on what kind of program you're writing. If your web crawler crashes, you just restart it. If your web server crashes, your users will be annoyed, but at least your database won't be compromised. (You're properly using transactions, right?) But if your DBMS crashes, your data might even be in an irrecoverably inconsistent state, unless you properly anticipated all (software) causes for it to crash.


> In practice, I don't worry. I'm fine with the leaky abstraction, because the benefit of fixing the leak (reducing the probability that OOM crashes my program from almost zero to exactly zero) isn't worth the cost (increased complexity).

In practice, you're being sloppy or what you've described is the relevant worrying - considering whether your program has appreciable odds of OOM and the odds of things being particularly bad when that happens, and then deciding that additional worrying is unnecessary. But this can need re-evaluating as circumstances change.

> There are levels of abstraction and degrees of micromanagement.

No disagreement there.

> Depends on what kind of program you're writing.

A bit, but mostly in terms of performance.

> But if your DBMS crashes, your data might even be in an irrecoverably inconsistent state, unless you properly anticipated all (software) causes for it to crash.

I disagree. If you have designed your program with an assumption that it might fail at any point regardless of reason, then you don't need to have "properly anticipated all (software) causes for it to crash", and further you should be robust against some portion of non-software causes.


> In practice, you're being sloppy or what you've described is the relevant worrying

I'm unhappy with this line of reasoning. For me, an abstraction leak is a failure to behave as advertised. Since memory allocation can always fail, an abstraction that advertises “allocations always succeed” is necessarily leaky. Even if the leak isn't important in practice, the leak exists.

> If you have designed your program with an assumption that it might fail at any point regardless of reason

I'm not sure this is even possible (if I take nothing for granted, how do I do anything?), but I'd like to be proven wrong.


> I'm unhappy with this line of reasoning. For me, an abstraction leak is a failure to behave as advertised.

To my mind, it must fail to behave as advertised in a way that is meaningful. Otherwise, the differences are precisely what is being abstracted away.

> I'm not sure this is even possible (if I take nothing for granted, how do I do anything?), but I'd like to be proven wrong.

I don't mean "might act in arbitrary ways", but specifically "might unexpectedly be terminated at any point".


> I have never ever had bugs in my code from somehow having invalid Unicode code-points in my paths. 99% of the time, replacing '/' with '\' and handling drive letters gets you there. I realize certain classes of bugs could happen, but I'd rather take that risk, have a bug, and fix it, than burden the type system and libraries.

Lucky for you. I will think about you next time I have to fix one of these 1% pathname bugs you do not seem to be concerned about, which tend to happen all the time when your program is used by many people with many files (i.e. the case every developer hopes for).


Yea the last 1% is the most fun and difficult.

One of the projects I am working on lately is a data integrity and syncing platform with multiple PB of data in multiple data centers, and the dataset has been around for 10-12 years.

Some of the file names and path names I see have the most unnatural and unbelievable characters in them.

It can absolutely be quite a challenge. This point about Rust's string system sounds like it could give me some very precise control.


I definitely fall on the other side of the fence, philosophically. Perhaps I've just fixed too many of those "trust me" bugs that programmers are constantly adding into code I have to maintain. I'm especially over the thinking that sufficient vigilance is something we should rely on humans to achieve when the computer is way, way better at it.


> I have never ever had bugs in my code from somehow having invalid Unicode code-points in my paths.

Do you use any computers with non-unicode filesystems? I hit such a bug every month or two. (Most recently a couple of days ago in Deluge, for a concrete example, which seems to not open .torrent files with non-unicode filenames). If you're actually using a computer with a non-unicode filesystem at all regularly then I'm honestly amazed you haven't seen such bugs.

> I realize certain classes of bugs could happen, but I'd rather take that risk, have a bug, and fix it, than burden the type system and libraries.

It's a question of costs and benefits, no? A particular class of bugs has severity x and happens at rate y. If the language is such that the type system and libraries can handle it at a cost of less than xy (in terms of complicating or slowing development) then it's worth handling it there; if not, not.

> In a way, I'm a little sad that modern computer languages are going from C's extreme of "always trust the programmer" to the other of "never trust the programmer". Bugs are bad, but language usability, ease-of-use and simplicity are also worth fighting for.

Language usability, ease-of-use and simplicity are absolutely valuable. A good language is one that allows you to write correct programs without sacrificing those things.

I find a language that will keep track of things for me is actually a huge help in writing correct code. E.g. if I'm writing Python 2 then I still have to worry about whether a particular object is a string or a byte sequence - worse, I have to do all the work of keeping track of it in my head, because the language will just silently corrupt data if I get it wrong. I end up writing comments on my functions that say what a particular value is, and carefully reading my code to verify I've made encode()/decode() calls in the right places. I don't want my language to "trust me", any more than I'd want to e.g. try to cut a piece of wood to the right length without measuring or marking it first.


Out of curiosity, what would be some examples of computers with non-unicode filesystems? I'm scratching my head trying to come up with anything other than old, old legacy systems that used EBCDIC.


POSIX does not require any particular encoding of file names. File names are not even required to be text, just a sequence of bytes; the only forbidden bytes are '\0' and '/'.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_...


Sure - but concretely, what specific systems have non-unicode filesystems? I know a lot of linux distributions allow you to choose a different encoding if you want, but the defaults push you towards UTF-8.


Windows.

(Many of the western codepages are either subsets of UTF-8, or not actually UTF-8 but can roundtrip through UTF-8 so you don't actually notice the difference as long as the program is just using the filename as an opaque "handle" and not actually manipulating it. In my case my filesystem is Shift-JIS for compatibility with some old Japanese programs, and that doesn't roundtrip cleanly through unicode)


This is not "never trust the programmer", it's "do we hand the programmer a broken library?"


The problem is that C's case of "always trust the programmer" has proven over time not to be sufficient. This is particularly true for security bugs.

The major driver behind Rust's design is memory safety - make it impossible (or very difficult) for certain classes of bugs to exist. You can write C software that is reliable without that safety, but it is very time consuming and never perfect.

This is a similar situation - give the programmer tools to write "obviously" correct code.


Fixing a handful of programs that way is already more expensive than putting it in the language.


You shouldn't have to roll your own (likely incorrect) path validation code every time you handle a file. This is something just about every program does. There are standard ways of doing this, and those standards can be done in a standard library.


String handling in Rust is definitely not as accessible as some other languages, but none of the complexity is incidental. The String/str split is actually fundamental and pervasive, but internally all string operations are performed on these two types. Path represents a file system path, so it has a totally different API, and CString and OsString are for translating to other representations in external systems (libc and the OS, of course), not for use outside of that context.

The String/str divide is the unavoidable consequence of not having garbage collection. A String is owned, a str is borrowed. This blog post goes into more information.
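
A minimal sketch of that split, using only the two core types:

    // Functions borrow &str, which works for owned Strings and literals alike,
    // and produce an owned String only when they need to allocate.
    fn shout(s: &str) -> String {
        s.to_uppercase()
    }

    fn main() {
        let owned: String = String::from("hello"); // owns its heap buffer
        let literal: &str = "world";               // borrowed, baked into the binary
        println!("{} {}", shout(&owned), shout(literal)); // &String coerces to &str
    }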

This is nowhere near the chaos that C++ has.


You can avoid the split without garbage collection, e.g. Swift has a single String type which is a retain-counted copy-on-write type. But this does mean that working with strings in Swift can have hidden costs (e.g. copies where you might not expect them), plus the cost of atomic reference counting. So this solution isn't appropriate for a language like Rust.


> Swift has a single String type

As someone who greatly appreciates the separation Rust offers, this makes me curious: does the string type either not enforce its contents being Unicode, or is it not capable of storing arbitrary OS paths? Because AFAICT, the two are mutually exclusive. The rub here is that OS paths — on POSIX operating systems¹ — are byte strings that we like to pretend are text; but since no particular encoding is enforced, you end up with effectively just byte strings. A String type that validates itself as valid Unicode is not capable of representing these (as arbitrary byte strings are not necessarily valid Unicode). But if we let the String type not validate, then I can pass non-Unicode to functions that take a "String" — something I'd rather not do.

There's also Python's byte smuggling, but I find that a bit weird.

¹While I call out POSIX, I'm told Windows pathnames are sequences of unsigned 16-bit values that most people expect to be UTF-16 — but that fact isn't enforced by the OS.


The filesystem used on OS X and iOS is HFS+, which stores filenames as a normalized UTF-16. The POSIX C interface (open(2), etc) takes UTF-8. If you try to make a file with invalid UTF-8, the filesystem will silently munge it. For example, if your path has a trailing 0xc3 byte, that byte gets rewritten into the string "%c3". So you don't actually get arbitrary byte strings.

I don't know what happens if you do something like mount an NFS share with filenames that aren't valid Unicode. I would guess a fair amount of stuff will break.

Swift strings are Unicode, and Unicode is certainly capable of representing arbitrary byte streams, by just converting to and from any 8 bit fixed-width encoding (not UTF-8!). There are explicit conversion functions like -fileSystemRepresentation which handle the conversion between filesystem paths and Strings.


You're right, I meant "with Rust's memory management story," which excludes universal reference counting.

People differ on whether or not universal reference counting is a form of garbage collection with weak guarantees or a form of manual memory management with flexible semantics.


> People differ on whether or not universal reference counting is a form of garbage collection with weak guarantees or a form of manual memory management with flexible semantics.

Which is pretty unfortunate, because the memory management field has always treated reference counting as a form of garbage collection—because it is [1] [2]. The idea that reference counting is somehow not GC is as far as I can tell a recent one (popularized, I think, by Cocoa and iOS?)

This is not just a semantic quibble, because there are successful garbage collection techniques that combine tracing with reference counting (e.g. ulterior reference counting [3]) to achieve GC, so treating tracing as the only way to do GC makes no sense. In fact, David Bacon observed in 2004 that reference counting and tracing garbage collection are best viewed as just two extremes of a continuum [4], and many "optimizations" that you apply to one scheme or the other are really just moving the scheme you choose toward the opposite pole.

[1]: http://www.memorymanagement.org/glossary/g.html#term-garbage...

[2]: http://researcher.watson.ibm.com/researcher/files/us-bacon/B...

[3]: http://www.cs.utexas.edu/users/mckinley/papers/urc-oopsla-20...

[4]: https://www.cs.virginia.edu/~cs415/reading/bacon-garbage.pdf


> The idea that reference counting is somehow not GC is as far as I can tell a recent one (popularized, I think, by Cocoa and iOS?)

I remember seeing lots of "but Python doesn't have REAL GC" arguments years ago. But Cocoa's foray into GC, subsequent retreat back to manual ref counting and then ARC probably have affected this too, yeah.


In Apple's case it is even worse in terms of urban legends, because the anti-GC crowd thinks it is some kind of victory.

What they miss is that RC is one of the family of GC algorithms, and that Apple only gave up on Objective-C's GC because they never managed to make it stable.

The way one could link GC-enabled with non-GC-enabled frameworks, plus the fragility of the underlying C code for memory management, meant Apple's developer forums were full of crash reports.

So Apple did a "Worse is Better" pivot and basically made the compiler automate the retain/release calls from Cocoa that were being written by hand.

Similarly, Swift adopts ARC because it needs to interoperate with Objective-C's runtime, and having a GC would introduce extra performance issues or force them to use an RC cache mechanism like .NET has for COM.

But of course, the majority without the background knowledge of how the decision came to be or interest in compiler design, just uses Apple's decision as some kind of proof against GC.


I'm not really sure Apple gave up on tracing GC just because they couldn't make it stable. It also happened because they were concerned about pauses (especially when compared to Android at the time, which had a poor GC). That may or may not have been a valid concern; certainly the success of hybrid RC systems in the literature has shown that there is some merit to the idea that RC is useful to reduce pauses, though the situation is complex. The thing is, when viewed in the proper light Apple never introduced and abandoned GC. Rather, Objective-C always had GC, and the aborted "GC" attempt was actually an attempt to switch the GC algorithm.


I agree when discussing from the "RC is GC" CS point of view.

From what I can remember pauses were never an issue.

There used to be a quite long page on the Apple Developer web site listing all the caveats and corner cases to watch out for when enabling the new GC algorithm, if you will.

It is called "Garbage Collection Programming Guide", but all links to it have been disabled. You can still get it with a bit of Google-Fu.

Also, I remember occasionally going through the forums, and they had quite a few developers having integration issues with the new algorithm.


According to the majority of CS books and papers about garbage collection, RC is yet another set of garbage collection algorithms.


Reference counting is a form of garbage collection. Since the problem is readily solvable with any garbage collector (just make substrings point to the parent string), it's readily solvable with reference counting.


> This is nowhere near the chaos that C++ has.

Standard C++ only has value-semantic C strings, and only supports fixed-width character encodings. It's not very rich, but I'd hardly call it 'chaos'. I'm happy to let my GUI library or DBMS handle collation and normalisation.


As the author of Rust's regex crate (and various other stringy things), Rust's string handling is easily the best I've ever used. The usage of String/&str completely dwarfs usage of the other string types, which are a bit more specialized.


In my experience, it is absolutely vital for performance to have both "owning" and "slice"/"view" string types, as well as sensible conversions between them.

This is actually an area where the C++ standard library has been lacking until recently. For example, you'd typically want a map with string keys to own them, but without severe contortions that meant map readers had to allocate a string key (expensive!) to perform a lookup.
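
For contrast, a small Rust sketch of the owned-key/borrowed-lookup pattern being described: the map owns String keys, but get() accepts a plain &str (via Borrow), so lookups don't allocate.

    use std::collections::HashMap;

    fn main() {
        let mut counts: HashMap<String, u32> = HashMap::new();
        counts.insert("hello".to_string(), 1); // the map owns this key

        // Lookup with a &str: no temporary String needs to be built.
        let n = counts.get("hello").copied().unwrap_or(0);
        println!("{}", n);
    }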


> Why is it that some languages can get away with just having 1 unicode string type with libraries to translate to other representations as needed, but not others?

Actually, that's almost exactly what Rust is doing. Most of the string types you mentioned come from the FFI module, which is used to create bindings to other languages. They are a necessity for converting between the representations nicely.


To be fair, Java and C# -- two languages that you would probably consider as having one unicode string type -- also effectively have two string types between `String` and `StringBuilder`. [0] Java also has `java.nio.file.Path`. [1]

[0] Java also manages to conflate synchronization into the mix with `StringBuffer`.

[1] C# (thankfully) just has static functions to manipulate strings as paths.


If you want to add IO into Java's mix, you can also use StringWriter.

Also I personally don't consider paths as strings, nor URLs – both are types for specific purposes with specific invariants, which are distinct from strings'.


I suppose it depends on your definition of a string. If it's an ordered collection of glyphs, then paths and URLs are definitely strings with invariant restrictions. But, of course, that's hardly a useful definition when you need to take a URL as input, is it? So I agree that treating them as separate types with an external string representation is a very useful tool to have, even if I don't agree that it should always be applied! Having worked with the Path class in Java, I personally find it an impediment to understandable code, but I'd also be willing to accept that that particular implementation and my intuition simply don't mesh well.


> I suppose it depends on your definition of a string. If it's an ordered collection of glyphs, then paths and URLs are definitely strings with invariant restrictions.

That's not quite true -- on POSIX paths are strings of bytes. POSIX doesn't specify any particular encoding[1]. AFAIR the only restrictions on paths are around the NUL (0) byte and '/' (47).

[1] I'm sure we'd all be happier if Unicode (encoded as UTF-8) were specified as the way to do paths, but POSIX was invented quite a bit before Unicode & UTF-8, so here we are.


And on Windows it's "potentially ill-formed UTF-16": https://simonsapin.github.io/wtf-8/

Basically if your language/std-lib claims to be cross platform and considers paths to be definitely-utf8, it's busted. I think, maybe, plan9 is a case where this claim holds?


> And on Windows it's "potentially ill-formed UTF-16"

On Windows, and in many Windows APIs, the return value is a "sequence of UTF-16 code units" (which is essentially a fancy way of saying "a sequence of 16-bit words"). This goes back to Windows originally using UCS-2, where any sequence of 16-bit words is valid (because there was no concept of a surrogate, let alone a lone surrogate): its support predates the Unicode Consortium extending Unicode to 21 bits and, along with it, the creation of UTF-16 (1996).


Agreed: as long as the language provides quick to/from-string mechanisms, treat them as their own objects. What drives me batty is when I know something is just a string under the hood but getting my string into that object is a nightmare and inconsistent.


Honestly, having coded a lot of C# path code, I'd rather an object that properly encapsulates this, as long as there's easy APIs to get a string copy from it and vice versa - I mean, you're going to be throwing away oodles of string refs anyways since they're immutable, so one more for the path conversion wouldn't hurt and the path API would be a lot cleaner.


String is the heap-allocated sequence of characters, str is a view into a String. Strings in Rust are guaranteed to be valid UTF-8; file paths in Windows are not, so you cannot use String for those, so you need another type. Repeat for different "stringy" types that do not have the same semantics as Strings and you end up with a number of string types.


> String is the heap-allocated sequence of characters,
To be clear, String is a heap-allocated sequence of u8 which form valid unicode scalar values. "character" in Rust usually means "char", and String is _not_ a list of chars.
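
A quick illustration of that distinction:

    fn main() {
        let s = String::from("héllo");
        assert_eq!(s.len(), 6);           // length in bytes: 'é' is 2 bytes in UTF-8
        assert_eq!(s.chars().count(), 5); // length in Unicode scalar values (chars)
        // `s[0]` does not compile: a String can't be indexed by integer position.
    }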


Note that everything other than String and str is not used that often, and when it is, it's straightforward. You'll probably use them in cases like `File::open(Path::new(my_string))` with some error handling if necessary, and forget that the intermediate existed. In fact, many of these APIs use std::convert trait magic so that you don't need to manually convert to the intermediate type at all. You also see these string types used when they are read, stored, and reused later. I haven't seen many cases of string-type-specific gymnastics being done, except sometimes with paths (which you can treat as a list of characters as well as a list of segments).
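
A hedged sketch of that convert-trait magic ("Cargo.toml" here is just a stand-in filename): File::open takes any AsRef<Path>, so &str, String, &Path, and PathBuf all work without an explicit conversion step.

    use std::fs::File;
    use std::path::Path;

    fn main() -> std::io::Result<()> {
        let _a = File::open("Cargo.toml")?;            // &str, converted implicitly
        let _b = File::open(Path::new("Cargo.toml"))?; // &Path, same API
        Ok(())
    }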


Having recently read about Pascal string types (http://wiki.freepascal.org/Character_and_string_types), this feels very interesting.


I think low-level systems programming languages need that.


Yeah if nothing else the naming here could be better. If someone writes a blog post where the correspondence between String and ownership has to be repeated fifty times, it might have been better to name them OwnedString and StringView, or something like that.


Thanks for this, it was very helpful.

Honestly, I think the main thing that trips people up here is the naming convention. If it were String and StringView rather than str, I don't think people would have nearly as much trouble. I guess the name was chosen for brevity, considering how pervasive &str is in Rust code.


Yeah, it's sort of historic. I like to joke that renaming String to StrBuf is my only wish for Rust 2.0.

That said, if anything, it would be StrBuf and str, because str is a primitive type, hence the lower case. It almost wasn't a primitive type, but there were drawbacks, so we kept it as one.


I believe that StrBuf or something similar was the name for a short time after the ~str removal.


What were the drawbacks?



I would be against giving the more expensive type the shorter name. That said, I kind of agree with Steve that having a `Buf` suffix would help.


Hey, author here! Happy to answer any questions people have, and happy to hear any feedback.


String (the “owned” sort of string type) is a wrapper for a heap-allocated buffer of unicode bytes. str (the “slice” sort of string type) is a buffer of unicode bytes that may be on the stack, on the heap, or in the program memory itself.

"unicode bytes" aren't a thing; bytes implies encoding (and subscripting yielding <=0xff), otherwise it's "codepoints" (and subscripting yielding an int somewhere on the unicode planes).

Further down:

String and str are guaranteed to be valid UTF-8 encoded Unicode strings. If you’re wondering why UTF-8 is the standard encoding for Rust strings, check out the Rust FAQ’s answer to that question.

That's spot on. Please add this to the first part, too: "... buffer of UTF-8 encoded Unicode bytes", or even just "UTF-8 encoded Unicode string". It will be clear what is (and is not) meant.

Otherwise nice article! Even understandable for someone with no Rust experience.


Sure, no problem.

As a kinda funny aside, I also wrote the linked-to FAQ answer. Took a number of drafts to get all the fiddly Unicode terminology right.


Maybe consider a word other than 'sort'.

When I see 'sort' collocated with 'string' my mind immediately jumps to the verb meaning of 'sort', not the noun meaning of 'sort', so it was a bit confusing.


Yes I had that too. I would have used "variety", as in this "variety" of string. Or "kind" maybe.


Yeah, the word you look for isn't "sort", but "kind".


I shied away from the word "kind" simply because that has a specific meaning in type theory that doesn't fit here. Higher-kinded types are a much-requested feature for Rust, and I didn't want to confuse readers into thinking this has some connection to Rust's string types.


The word "sort" isn't free of alternate associations either, and "kind" seems less likely to confuse people, seeing as Rust doesn't have any concept called "kinds" right now and won't have HKT for years, if ever.


"sort" also has a specific meaning in type theory - it's a class of kinds.

Also the type-theoretic meaning of "kind" is accurate here, no? Owned vs slice really are two different kinds - it's just that Rust can't represent kindedness in general (it has specialized support for tracking ownedness, but you can't reuse that system to track other properties).


Nice writeup.

I'd like to see some examples illustrating appropriate usage of Clone-on-Write with Rust strings - that's something I personally have struggled with while learning Rust.


Cow in general is on my shortlist for "I need to improve the docs."

One area where it's used effectively in the stdlib is String::from_utf8_lossy: http://doc.rust-lang.org/collections/string/struct.String.ht...

If the argument is valid UTF-8, then no allocations need to be performed, and so the returned value will be a &str. But if it's not valid UTF-8, the replacement characters need to be inserted, which means allocation, which means String will be returned.
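
A small sketch of that behaviour, matching on the returned Cow:

    use std::borrow::Cow;

    fn main() {
        // Valid UTF-8 borrows; nothing is allocated.
        match String::from_utf8_lossy(b"hello") {
            Cow::Borrowed(s) => println!("borrowed: {}", s),
            Cow::Owned(s) => println!("owned: {}", s),
        }
        // An invalid byte forces an owned String with U+FFFD substituted.
        assert!(matches!(String::from_utf8_lossy(b"hel\xFFlo"), Cow::Owned(_)));
    }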


I blogged about that some time ago: https://llogiq.github.io/2015/07/09/cow.html


I've used it when I wanted to provide a value which was statically initialized with a literal in the source, but which could possibly mutate during program execution. Instead of doing `String::from` and performing the heap allocation regardless, the heap allocation is only performed if you actually mutate it.
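
A hedged sketch of that pattern: the value starts out as a borrowed literal and only touches the heap on the first real mutation.

    use std::borrow::Cow;

    fn main() {
        let mut greeting: Cow<'static, str> = Cow::Borrowed("hello");
        assert!(matches!(greeting, Cow::Borrowed(_))); // still just the literal

        greeting.to_mut().push_str(", world"); // first mutation copies to the heap
        assert_eq!(greeting, "hello, world");
        assert!(matches!(greeting, Cow::Owned(_)));
    }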


Stupid question: Why "str" instead of "Str"? (For homogeneity)


Oh you wouldn't believe the huge amount of bikeshedding that went into the naming of `String` and `str`. :) The end result might not be the nicest - to my eyes at least - but at least a decision was made!


Because primitive types are all lowercase. See: char, usize, i32, f64, etc...



