Hacker News new | past | comments | ask | show | jobs | submit login

”However, none of them are suitable for general strings where we might have many small ones”

That’s where I expected to see the Short String Optimization mentioned.

For short strings, it stores the string’s characters in the same memory where, for longer strings, it stores a pointer to the characters (with ‘short’ dependent on the implementation)

It also tweaks how either get stored to make it possible (and fast) to determine what is stored in that space. See https://shaharmike.com/cpp/std-string/

We decided against SSO for Rust, incidentally. I’m on my phone and it’s late so I won’t elaborate too much but it’s not always clear it’s a win, depending on language semantics.

I'm sure Steve can provide better sources than me, but the discussion here (and at the first link from the reddit page) look relevant: https://www.reddit.com/r/rust/comments/2slcs8/small_string_o...

Edit: this was supposed to be a response to the other response to Steve’s comment. On my phone now, can’t easily edit.

I imagine I’d be more concerned with performance and locality than memory savings if every string required a heap allocation and an additional dereferencing for access.

However, rust is very well-engineered, and I’d be interested in knowing what factors made its decision a win.

IIRC, there are a variety of factors:

1. Conventions and prevailing idioms. In C++ copying strings is common, both unintentionally due to the implicit copy constructor and intentionally as a means of defensive programming. In Rust, copying strings is relatively uncommon: Rust has no copy constructors (copying must be done explicitly with the `.clone()` method), and the borrow checker obviates the need for defensive copies and therefore passing string slices is overwhelmingly preferred (and recall that string slices in Rust (`&str`) are 16 bytes, in comparison to 24 bytes for C++'s `std::string`).

2. Alternative optimizations. For example, in Rust, initializing a `Vec` (and by extension `String`) does not perform a heap allocation for zero-length values. IOW, you can call `String::new()` (or even `String::from("")`) in a hot loop without worrying about incurring allocator traffic. Any hypothesis that short strings are more common than long strings will likely also hypothesize that the most common short string length is zero (and this appears to be borne out in practice; this optimization is really important for `Vec`, so important that, for Servo, it often makes `Vec` faster in practice than their own custom `SmallVec` type which strictly exists only on the stack), so this optimization goes a fair ways towards satisfying the "I actually do have many short strings and I'm not just copying around the same string a lot" use case.

3. Weighing trade-offs. The pros of SSO are better locality and less memory usage, with the cons of larger code size and additional branches on various operations.

Putting it all together, this gives the Rust devs reasonable incentive to take the conservative path of not implementing SSO for the default string type. That's not to say that we'd be incapable of finding Rust code that would benefit from SSO, or that C++ chose their default incorrectly (different contexts allow different conclusions), or that the Rust devs will always have this stance (if performance benefits of SSO were to be clearly demonstrated on Rust code in the wild in such a definitive way as to justify changing the default behavior, then I don't think they couldn't find a way to make it happen).

String implementations are low enough level that they get influenced quite a bit by memory management behavior, sometimes in surprising ways.

G++ used to have CoW strings, but I believe they dropped them because for common cases the atomic reference counting was more expensive than potential copying.

Inlining strings save space and allocator traffic at the expense of a more complex string implementation and (depending on the platform/implementation) strings that do not start word aligned.

There are other flags you might want to put on an underlying text sequence:

- that the text is exclusively ASCII in UTF-8 strings. This means the character count and indexing can be optimized to constant-time operations

- latin-1 in UTF-16 strings. This means you can do an alternate 8-bit encoding for space savings, as well as constant-time count/indexing.

- mark that the string represent a single grapheme cluster (aka a printable character). This may allow you to combine unicode text processing to a single data structure.

- mark that the representation is a slice. In this case, you'd typically store an offset rather than a capacity. This again helps you combine higher level types on top of a single data structure and text processing libraries.

These features (and especially combinations of them) impact the hot path for text processing. If your language is already doing custom allocation (such as using a copy constructor), the allocator savings are nil, so it becomes a space vs code trade-off.

FWIW the C++ standard actually prohibits CoW strings, and any C++ standard library that implements CoW strings is doing so in defiance of the standard. AIUI this prohibition was completely accidental, and I believe it's due to the rules around which string operations do and do not invalidate iterators.

GCC updated their implementation to become compliant, probably for these reasons.

Worth noting that rust is hit by the additional branches worse than C++ due to the lack of move ctors/move assignment operators. No way to safely store an interior pointer to the buffer as a result.

That said, as mentioned the other features do make this hurt less for many cases.

To add to the other great replies, I should have also mentioned that String is a standard library type like any other; just because we didn’t implement it for String doesn’t mean that a package can’t implement a string with SSO if it’s important for the use-case. That’s also a factor.

From what i recall of the thread, (1) there was one thing in the String API that looked like it wouldn't play well with SSO, and (2) nobody provided really solid data showing that SSO was a consistent win.

The discussion seemed to me to support the conclusion that Rust will not introduce SSO right now, but the conclusion the maintainers actually drew was that Rust will not introduce SSO ever. Of course, that could always be revisited!

It can’t be revisited for String because the method is there, in stable, now.

From what i remember, the method wasn't impossible to implement with SSO, it would just be painful. But, exactly as you have, people took away the conclusion that SSO was impossible.

Yeah, we still could, but it would involve breaking compatibilities with a bunch of other guarantees, so it’s not really feasible IMHO.

libstdc++ and libc++ are both suboptimal in using available bits for SSO, but SSO23 [0] provides a string type which is and provides an interface compatible with std::string.

[0]: https://github.com/elliotgoodrich/SSO-23

Optimal in using bits, but possibly at a loss of performance (it needs to subtracts two numbers to extract string length)

(possibly because a) my observation that It needs an extra operation may be incorrect (I didn’t count all the binary ops in both algorithms) and b) I have no idea whether that makes any significant difference on modern hardware)

If there is loss of performance, it might be enough to offset the loss of ability to do SSO on strings of length 23 (how common are those? I would guess less than one in every million strings has that length)

Also, severely nitpicking: C++03 doesn’t say anything about the time complexity of c_str (https://softwareengineering.stackexchange.com/a/124784), so in that version of the language, we can do a tiny bit better, and use the wasted bits in the last byte to store a 24th character of the string, if that character wouldn’t make us decode the last byte as a valid string length.

C++11 said c_str must be constant time, so that wouldn’t work there anymore.

> C++11 said c_str must be constant time, so that wouldn’t work there anymore.

How so? Any SSO scheme should allow you to determine whether a string is short or not in constant time. Copying a 24-character string into a 25-byte zero-terminated buffer takes constant time, so c_str could simply do the copy.

You need to find space to copy the data to, too. If you allocate that space, your allocator needs to be constant time. I don’t know of any even somewhat competitive allocator with that property (a bumping one that gives up when it runs out of addresses would work, but isn’t competitive)

You could set aside a buffer to use, but you would potentially need one for every string in your program of which the compiler cannot prove that c_str won’t be called on it.

I think the only way to do that would be to use SSO and allocate a memory buffer, and make it possible to find that buffer ∞ O(1) from the ‘SSO’ version. That’s doable, but I wouldn’t call it an optimization anymore.

Also, doing that and keeping the operation constant time would mean you would have to spend an equivalent amount of time on non-short strings. That can be done, but not in a performant way.

Greetings, fellow language lawyer! Pedantic you are, but this time you were not quite pedantic enough :)

I think you're conflating two different meanings of "constant-time". The one the C++ standard cares about is "the time this takes can be upper-bounded by some constant that doesn't depend on the string length or other stuff."

There is a different idea of constant time that's used in crypto when people worry about timing side channels: "the time this takes should be exactly the same, every time."

So, no you need not "spend an equivalent amount of time" on long strings. As you pointed out, you DO have to make sure your memory allocator is good enough. But actually this is easier than you admit: you don't need constant time guarantees for arbitrary allocations; you only need them for allocations of EXACTLY 25 bytes. In practice, you might get this for free, since many allocators try to optimize small allocations.

But the other problem you alluded to sinks this proposal, even in C++98. Because, like you said, you have only two choices:

1. pre-allocate emergency buffer, which defeats the point, or

2. allocate "just in time" which might FAIL if you run out of memory

Since C++ does not allow c_str to ever fail, you can't rely on dynamic allocation to implement it. Which is a bit of a pity since in practice, if allocations that small are failing you have bigger problems anyway most likely.

> Copying a 24-character string into a 25-byte zero-terminated buffer takes constant time

Thats not constant time, thats linear/O(N) time.

It is constant because O(24) = O(1).

Do you know if this is going anywhere? Doesn't look like it has been touched in a few years.

Thanks for the feedback. You are right. I added some text about that and mentioned you at the bottom.

C++ implementations seem to store the capacity in the string object instead of the next to the referenced memory as I described in the article. Do you know why?

I would guess compiler writers have data that indicates that the advantage “longer strings can be stored using SSO” beats the disadvantage “every short string is longer”.

It also means accessing the capacity field doesn’t involve accessing another cache line, but I can’t think of situations where that matters, as every time code needs to access capacity, the CPU will soon do something with the pointed to object (if only to free it. Or are there real-world allocators that don’t look at memory ‘near’ to-be-freed objects?)

Equality comparisons might short-circuit if two strings differ in length (they do in Java). I’d never thought about it before, but I’m guessing that only makes sense if you’re assume normalization already happened.

Good example.

AFAIK, C++ strings are little more than sequences of char, and string equals compares those sequences. It won’t notice when the two strings happen to be equivalent under some Unicode normalization.

The article does mention short string optimization: “One clever optimization is to store small strings directly instead of allocating and referencing a chunk.“

In fact it links to the same post, so perhaps it was added in response to your comment?

This is essentially the same optimization as storing small range integers as "fixnums" whereas the larger ones are heap- allocated "bignums".

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact