
String types in Rust - steveklabnik
http://andrewbrinker.github.io/blog/2016/03/27/string-types-in-rust/
======
nieksand
My favorite bit of learning with Rust strings was the WTF-8 format:
[https://simonsapin.github.io/wtf-8/](https://simonsapin.github.io/wtf-8/)

Implemented: [https://github.com/rust-lang/rust/blob/master/src/libstd/sys...](https://github.com/rust-lang/rust/blob/master/src/libstd/sys/common/wtf8.rs)

~~~
kibwen
WTF-8 is great, and lest anyone think that it's intended to be a joke it's
worth mentioning that it's actually a very useful thing:

_"WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that
encodes surrogate code points if they are not in a pair. It represents, in a
way compatible with UTF-8, text from systems such as JavaScript and Windows
that use UTF-16 internally but don’t enforce the well-formedness invariant
that surrogates must be paired."_

Though note that it's only suitable for internal usage (as in, you wouldn't
expose an API that uses WTF-8), so it's not something that you need to
actually think about when learning Rust. :P
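
To make the "unpaired surrogate" case concrete, here is a minimal sketch of what happens when UTF-16 input with a lone surrogate meets Rust's `String`. The helper names `decode_strict` and `decode_lossy` are made up for illustration; `String::from_utf16` and `String::from_utf16_lossy` are the standard library calls. Neither result can round-trip the original code units, which is roughly the gap WTF-8 fills internally.

```rust
/// Decodes UTF-16 code units strictly; returns None if any surrogate is unpaired.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Decodes UTF-16 code units, replacing each unpaired surrogate with U+FFFD.
/// The original code unit is lost, so this cannot be converted back.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

For example, the units `[0x0068, 0xD800, 0x0069]` ("h", a lone high surrogate, "i") are rejected by the strict version and come back with a replacement character in the middle from the lossy one.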

~~~
derefr
I don't really get the point of it; I'd think that, properly, what you'd want
is:

1. a plain "buffer of unparsed octets" type (like Erlang's "binary" type), to
hold onto your weird invalid UTF-16 (or whatever other weird invalid stuff you
like);

and 2. a strict, validated-on-construction "sequence of codepoints" type.

Going from #1 to #2 wouldn't be a direct cast; it'd instead be a stream-parser
function that could be hooked with a callback, which would be handed
parsed-and-valid, and invalid-and-left-unparsed, "segments":

    maybe_parsed_segment() = valid_string_of_codepoints() | binary()

...and would then respond with a valid segment, which could just be "" to
discard the input segment.

The default callback would presumably respond to a binary() of size N by
emitting a "�" string of length N. That's pretty much what we have today, but
both sturdier and more extensible.
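
For what it's worth, something close to this can be sketched in Rust today on top of `std::str::from_utf8` and the `valid_up_to` / `error_len` methods of `Utf8Error`. Everything named here (`Segment`, `parse_with`, `default_callback`) is hypothetical and only illustrates the idea; it is not an existing API.

```rust
use std::str;

/// One piece of the input: validated text, or raw bytes that failed to decode.
#[derive(Debug, PartialEq)]
enum Segment<'a> {
    Valid(&'a str),
    Invalid(&'a [u8]),
}

/// Splits `input` into valid and invalid segments and lets `on_segment`
/// decide what text each one contributes to the output.
fn parse_with<F>(mut input: &[u8], mut on_segment: F) -> String
where
    F: FnMut(Segment<'_>) -> String,
{
    let mut out = String::new();
    while !input.is_empty() {
        match str::from_utf8(input) {
            Ok(valid) => {
                // Everything that is left decodes cleanly.
                out.push_str(&on_segment(Segment::Valid(valid)));
                break;
            }
            Err(err) => {
                let (valid, rest) = input.split_at(err.valid_up_to());
                if !valid.is_empty() {
                    // Cannot fail: `valid_up_to` marks the end of the valid prefix.
                    let valid = str::from_utf8(valid).unwrap();
                    out.push_str(&on_segment(Segment::Valid(valid)));
                }
                // `error_len` is None when the input ends mid-sequence;
                // in that case treat the whole remainder as invalid.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                out.push_str(&on_segment(Segment::Invalid(bad)));
                input = tail;
            }
        }
    }
    out
}

/// Default policy: keep valid text, emit one U+FFFD per invalid byte.
fn default_callback(segment: Segment<'_>) -> String {
    match segment {
        Segment::Valid(text) => text.to_string(),
        Segment::Invalid(bytes) => "\u{FFFD}".repeat(bytes.len()),
    }
}
```

Calling `parse_with(b"ab\xFFcd", default_callback)` keeps `ab` and `cd` and substitutes the bad byte, while a callback that returns an empty string for `Segment::Invalid` discards it instead.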

------
stcredzero
_In this post I’m going to explain the organization of Rust’s string types
with String and str as examples, then get into the lesser-used string
types—CString, CStr, OsString, OsStr, PathBuf, and Path_

Ouch. So one thing Rust is not going to avoid is the chaos around string
types that C++ has. Why is it that some languages can get away with just
having one Unicode string type, with libraries to translate to other
representations as needed, but not others?

~~~
jdmichal
To be fair, Java and C# -- two languages that you would probably consider as
having one unicode string type -- also effectively have two string types
between `String` and `StringBuilder`. [0] Java also has `java.nio.file.Path`.
[1]

[0] Java also manages to conflate synchronization into the mix with
`StringBuffer`.

[1] C# (thankfully) just has static functions to manipulate strings as paths.

~~~
llogiq
If you want to add IO into Java's mix, you can also use StringWriter.

Also I personally don't consider paths as strings, nor URLs – both are types
for specific purposes with specific invariants, which are distinct from
strings'.

~~~
jdmichal
I suppose it depends on your definition of a string. If it's an ordered
collection of glyphs, then paths and URLs are definitely strings with
invariant restrictions. But, of course, that's hardly a useful definition when
you need to take a URL as input, is it? So I agree that treating them as
separate types with an external string representation is a very useful tool to
have, even if I don't agree that it should always be applied! Having worked
with the Path class in Java, I personally find it an impediment to
understandable code, but I'd also be willing to accept that that particular
implementation and my intuition simply don't mesh well.

~~~
lomnakkus
> I suppose it depends on your definition of a string. If it's an ordered
> collection of glyphs, then paths and URLs are definitely strings with
> invariant restrictions.

That's not quite true -- on POSIX, paths are strings of _bytes_. POSIX doesn't
specify any particular encoding[1]. AFAIR the only restrictions on paths are
around the NUL (0) byte and '/' (47).

[1] I'm sure we'd all be happier if Unicode (encoded as UTF-8) were specified
as _the_ way to do paths, but POSIX was invented quite a bit before Unicode &
UTF-8, so here we are.
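
A small sketch of how this shows up in Rust, assuming a Unix target (the `OsStrExt` extension trait used here only exists there). The helpers `path_from_bytes` and `describe` are invented for the example. Note that `OsStr` itself does no encoding validation, so a path can hold bytes that are not valid UTF-8, and `Path::to_str` has to return an `Option`.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait
use std::path::Path;

/// Views raw bytes as a path without checking their encoding.
fn path_from_bytes(bytes: &[u8]) -> &Path {
    Path::new(OsStr::from_bytes(bytes))
}

/// Returns the path as UTF-8 text if possible, otherwise a lossy rendering.
fn describe(path: &Path) -> String {
    match path.to_str() {
        Some(text) => format!("utf-8: {}", text),
        None => format!("not utf-8: {}", path.to_string_lossy()),
    }
}
```

With the Latin-1 bytes `b"/tmp/caf\xE9"`, `to_str()` gives `None`, while the original bytes stay available through `as_os_str().as_bytes()`.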

~~~
Gankro
And on Windows it's "potentially ill-formed UTF-16":
[https://simonsapin.github.io/wtf-8/](https://simonsapin.github.io/wtf-8/)

Basically if your language/std-lib claims to be cross platform and considers
paths to be definitely UTF-8, it's busted. I think, maybe, Plan 9 is a case
where this claim holds?

~~~
gsnedders
> And on Windows it's "potentially ill-formed UTF-16"

On Windows, and in many Windows APIs, the return value is a "sequence of
UTF-16 code units" (which is essentially a fancy way of saying "a sequence of
16-bit words"). This goes back to Windows originally using UCS-2, where any
sequence of 16-bit words is valid (because there wasn't any concept of a
surrogate, let alone a lone surrogate), given that its support predates the
Unicode Consortium extending Unicode to 21 bits and, along with it, UTF-16's
existence (1996).

------
pixel_fcker
Thanks for this, it was very helpful.

Honestly I think the main thing that trips people up here is the naming
convention. If it was String and StringView rather than str I don't think
people would have nearly as much trouble. I guess the name was chosen for
brevity considering how pervasive &str is in Rust code.
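
For anyone still building the mental model, a minimal sketch of the split: `String` owns a heap buffer, and `&str` is a borrowed view into string data wherever it lives. The function names `first_word` and `shout` are just illustrations.

```rust
/// Borrows a view of the first word; works on any string data, owned or not.
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

/// Builds an owned, growable string from a borrowed view.
fn shout(text: &str) -> String {
    let mut owned = String::from(text); // the heap allocation happens here
    owned.push('!');
    owned
}
```

`first_word` accepts a literal like `"hello world"` directly and a `String` via `&owned`, since `&String` derefs to `&str`; only `shout` allocates.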

~~~
steveklabnik
Yeah, it's sort of historic. I like to joke that renaming String to StrBuf is
my only wish for Rust 2.0.

That said, if anything, it would be StrBuf and str, because str is a primitive
type, hence the lower case. It almost wasn't a primitive type, but there were
drawbacks, so we kept it as one.

~~~
bjz_
What were the drawbacks?

~~~
steveklabnik
[https://github.com/rust-lang/rust/pull/19612](https://github.com/rust-lang/rust/pull/19612)

------
brinker
Hey, author here! Happy to answer any questions people have, and happy to hear
any feedback.

~~~
nothrabannosir
_String (the “owned” sort of string type) is a wrapper for a heap-allocated
buffer of unicode bytes. str (the “slice” sort of string type) is a buffer of
unicode bytes that may be on the stack, on the heap, or in the program memory
itself._

"unicode bytes" aren't a thing; bytes imply an encoding (and subscripting
yielding <=0xff), otherwise it's "codepoints" (and subscripting yielding an
int somewhere on the unicode planes).
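
To illustrate the distinction in Rust terms, a short sketch (the helper names `sizes` and `code_points` are made up): `len()` and `as_bytes()` work on the UTF-8 encoding, while `chars()` yields the decoded code points.

```rust
/// Returns (number of UTF-8 bytes, number of Unicode scalar values).
fn sizes(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

/// Returns the code point values of `text` as integers.
fn code_points(text: &str) -> Vec<u32> {
    text.chars().map(|c| c as u32).collect()
}
```

For the single character U+00E9 (e with acute accent), `as_bytes()` gives the two bytes `0xC3 0xA9`, and `code_points` gives the single value `0xE9`.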

Further down:

_String and str are guaranteed to be valid UTF-8 encoded Unicode strings. If
you’re wondering why UTF-8 is the standard encoding for Rust strings, check
out the Rust FAQ’s answer to that question._

That's spot on. Please add this to the first part, too: "... buffer of UTF-8
encoded unicode bytes", or even just "encoded unicode string." It will be
clear what is (and is not) meant.

Otherwise nice article! Even understandable for someone with no Rust
experience.

~~~
brinker
Sure, no problem.

As a kinda funny aside, I also wrote the linked-to FAQ answer. Took a number
of drafts to get all the fiddly Unicode terminology right.

------
dallbee
Nice writeup.

I'd like to see some examples illustrating appropriate usage of Clone-on-Write
with Rust strings - that's something I personally have struggled with while
learning Rust.

~~~
tatterdemalion
I've used it when I wanted to provide a value which was statically initialized
with a literal in the source, but which could possibly mutate during program
execution. Instead of doing `String::from` and performing the heap allocation
regardless, the heap allocation is only performed if you actually mutate it.
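
A rough sketch of that pattern, using `std::borrow::Cow`. The functions `default_greeting`, `personalize`, and `is_borrowed`, and the values in them, are invented for the example.

```rust
use std::borrow::Cow;

/// Starts from a static literal; creating it needs no heap allocation.
fn default_greeting() -> Cow<'static, str> {
    Cow::Borrowed("hello")
}

/// Appends a name only when one is given.
fn personalize(mut greeting: Cow<'static, str>, name: Option<&str>) -> Cow<'static, str> {
    if let Some(name) = name {
        // `to_mut` clones the borrowed literal into an owned String the
        // first time it is called, then hands back a mutable reference.
        let owned = greeting.to_mut();
        owned.push_str(", ");
        owned.push_str(name);
    }
    greeting
}

/// Reports whether the value is still the borrowed (unallocated) variant.
fn is_borrowed(value: &Cow<'_, str>) -> bool {
    matches!(value, Cow::Borrowed(_))
}
```

If `personalize` is called with `None`, the result is still `Cow::Borrowed` and nothing was allocated; with `Some(...)` it becomes `Cow::Owned`.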

------
twsted
Stupid question: Why "str" instead of "Str"? (For homogeneity)

~~~
bjz_
Oh you wouldn't believe the huge amount of bikeshedding that went into the
naming of `String` and `str`. :) The end result might not be the nicest - to
my eyes at least - but at least a decision was made!

