That’s where I expected to see the Short String Optimization mentioned.
For short strings, it stores the string's characters in the same memory where, for longer strings, it stores a pointer to the characters ('short' being implementation-dependent).
It also tweaks how both are stored so that it is possible (and fast) to determine which of the two is in that space. See https://shaharmike.com/cpp/std-string/
Edit: this was supposed to be a response to the other response to Steve’s comment. On my phone now, can’t easily edit.
However, Rust is very well-engineered, and I'd be interested in knowing what factors made its decision a win.
1. Conventions and prevailing idioms. In C++, copying strings is common, both unintentionally (via the implicit copy constructor) and intentionally (as a means of defensive programming). In Rust, copying strings is relatively uncommon: Rust has no copy constructors (copying must be done explicitly with the `.clone()` method), and the borrow checker obviates the need for defensive copies, so passing string slices is overwhelmingly preferred. (Recall also that string slices in Rust (`&str`) are 16 bytes, compared to 24 bytes for C++'s `std::string`.)
2. Alternative optimizations. For example, in Rust, initializing a `Vec` (and by extension a `String`) does not perform a heap allocation for zero-length values. In other words, you can call `String::new()` (or even `String::from("")`) in a hot loop without worrying about incurring allocator traffic. Any hypothesis that short strings are more common than long strings will likely also hypothesize that the most common short string length is zero, and this appears to be borne out in practice. (This optimization is really important for `Vec`; so important that, for Servo, it often makes `Vec` faster in practice than their own custom `SmallVec` type, which strictly exists only on the stack.) So this optimization goes a fair way towards satisfying the "I actually do have many short strings and I'm not just copying around the same string a lot" use case.
3. Weighing trade-offs. The pros of SSO are better locality and less memory usage, with the cons of larger code size and additional branches on various operations.
Putting it all together, this gives the Rust devs reasonable incentive to take the conservative path of not implementing SSO for the default string type. That's not to say that we'd be incapable of finding Rust code that would benefit from SSO, or that C++ chose its default incorrectly (different contexts allow different conclusions), or that the Rust devs will always have this stance (if the performance benefits of SSO were demonstrated on Rust code in the wild in such a definitive way as to justify changing the default behavior, then I don't doubt they could find a way to make it happen).
G++ used to have CoW strings, but I believe they dropped them because for common cases the atomic reference counting was more expensive than potential copying.
Inlining strings saves space and allocator traffic at the expense of a more complex string implementation and (depending on the platform/implementation) strings that are not word-aligned.
There are other flags you might want to put on an underlying text sequence:
- that the text is exclusively ASCII in UTF-8 strings. This means the character count and indexing can be optimized to constant-time operations
- that the text is exclusively Latin-1 in UTF-16 strings. This means you can use an alternate 8-bit encoding for space savings, as well as constant-time count/indexing.
- mark that the string represents a single grapheme cluster (i.e., one user-perceived character). This may allow you to consolidate Unicode text processing onto a single data structure.
- mark that the representation is a slice. In this case, you'd typically store an offset rather than a capacity. This again helps you layer higher-level types and text-processing libraries on top of a single data structure.
These features (and especially combinations of them) impact the hot path for text processing. If your language is already performing an allocation anyway (such as inside a copy constructor), the allocator savings are nil, so it becomes a space vs. code trade-off.
That said, as mentioned the other features do make this hurt less for many cases.
The discussion seemed to me to support the conclusion that Rust will not introduce SSO right now, but the conclusion the maintainers actually drew was that Rust will not introduce SSO ever. Of course, that could always be revisited!
(possibly because a) my observation that it needs an extra operation may be incorrect (I didn't count all the binary ops in both algorithms), and b) I have no idea whether that makes any significant difference on modern hardware)
If there is a loss of performance, it might be enough to offset the loss of the ability to do SSO on strings of length 23 (how common are those? I would guess fewer than one in every million strings has that length)
Also, severely nitpicking: C++03 doesn’t say anything about the time complexity of c_str (https://softwareengineering.stackexchange.com/a/124784), so in that version of the language, we can do a tiny bit better, and use the wasted bits in the last byte to store a 24th character of the string, if that character wouldn’t make us decode the last byte as a valid string length.
C++11 said c_str must be constant time, so that wouldn’t work there anymore.
How so? Any SSO scheme should allow you to determine whether a string is short or not in constant time. Copying a 24-character string into a 25-byte zero-terminated buffer takes constant time, so c_str could simply do the copy.
You could set aside a buffer to use, but you would potentially need one for every string in your program of which the compiler cannot prove that c_str won’t be called on it.
I think the only way to do that would be to use SSO and also allocate a memory buffer, and make it possible to find that buffer in O(1) from the 'SSO' version. That's doable, but I wouldn't call it an optimization anymore.
Also, doing that and keeping the operation constant time would mean you would have to spend an equivalent amount of time on non-short strings. That can be done, but not in a performant way.
I think you're conflating two different meanings of "constant-time". The one the C++ standard cares about is "the time this takes can be upper-bounded by some constant that doesn't depend on the string length or other stuff."
There is a different idea of constant time that's used in crypto when people worry about timing side channels: "the time this takes should be exactly the same, every time."
So, no, you need not "spend an equivalent amount of time" on long strings. As you pointed out, you DO have to make sure your memory allocator is good enough. But this is actually easier than you suggest: you don't need constant-time guarantees for arbitrary allocations; you only need them for allocations of EXACTLY 25 bytes. In practice, you might get this for free, since many allocators try to optimize small allocations.
But the other problem you alluded to sinks this proposal, even in C++98. Because, like you said, you have only two choices:
1. pre-allocate an emergency buffer, which defeats the point, or
2. allocate "just in time", which might FAIL if you run out of memory.
Since C++ does not allow c_str to ever fail, you can't rely on dynamic allocation to implement it. That's a bit of a pity, since in practice, if allocations that small are failing, you most likely have bigger problems anyway.
That's not constant time, that's linear/O(n) time.
C++ implementations seem to store the capacity in the string object instead of next to the referenced memory, as I described in the article. Do you know why?
It also means accessing the capacity field doesn't require touching another cache line, but I can't think of situations where that matters, as every time code needs the capacity, the CPU will soon do something with the pointed-to object anyway (if only to free it; or are there real-world allocators that don't look at memory 'near' to-be-freed objects?)
AFAIK, C++ strings are little more than sequences of char, and string equality compares those sequences. It won't notice when two strings happen to be equivalent under some Unicode normalization.
In fact it links to the same post, so perhaps it was added in response to your comment?