
Libc++'s Implementation of std::string - stuffypages
https://joellaity.com/2020/01/31/string.html
======
dylanmclark
Reminds me of a talk a few years ago about Facebook's internal implementation
of std::string. They do the same short string optimization, but Facebook
manages to outperform this implementation by storing 24 bytes (vs 23) in
"short string mode".

IIRC, Facebook achieves this by using the last byte as the flag byte. To
signify short string mode, this flag is set to 0. This allows it to also serve
as the null terminator. Tricky!

~~~
fyp
It's a tradeoff. In libc++'s version you still have string size stored in the
top 7 bits so you just need a bitshift to get size. It sounds like fb's
implementation would require looping until null terminator to get the size.
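
For what it's worth, a 24-byte buffer can serve both purposes and still report its size in O(1) if the last byte stores the remaining capacity rather than a plain flag. As far as I recall this is roughly the idea behind folly's fbstring, but the sketch below is only an illustration under that assumption, not its actual code. `SmallBuf` and its members are made-up names, and a real implementation would also need spare bits to mark long mode, which is omitted here.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical 24-byte inline buffer: bytes 0..22 hold characters and the
// last byte holds (kMaxSize - size). When the buffer is full, that byte is 0
// and doubles as the null terminator.
struct SmallBuf {
    static constexpr std::size_t kMaxSize = 23;
    char bytes[kMaxSize + 1];

    SmallBuf() { set("", 0); }

    // Copies len characters inline; returns false if they do not fit.
    bool set(const char* text, std::size_t len) {
        if (len > kMaxSize) return false;
        std::memcpy(bytes, text, len);
        bytes[len] = '\0';  // terminator; for len == 23 this is the last byte
        bytes[kMaxSize] = static_cast<char>(kMaxSize - len);  // remaining capacity
        return true;
    }

    // One load and one subtraction, no scan for the terminator.
    std::size_t size() const {
        return kMaxSize - static_cast<unsigned char>(bytes[kMaxSize]);
    }

    const char* c_str() const { return bytes; }
};
```

With this layout a 23-character string fills the buffer, the remaining-capacity byte becomes 0, and `c_str()` is still terminated.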

~~~
roel_v
To save those who, like me, were going to comment 'that would violate the
standard because std::string::size is required to be O(1) complexity' a Google
- the standard recommends but doesn't require that.

~~~
epistasis
Facebook's implementation would still be constant time for short strings,
because there's a constant which bounds the runtime.

Though I hear that the definition of big-O notation has shifted a bit in
Silicon Valley these days so maybe that answer would get me in trouble in an
interview.

~~~
ryani
It's true that big-O notation only concerns behavior with large N, but it's a
bit disingenuous to say that the loop executes a constant number of times --
by that argument, you could say that if you implemented size() by strlen()
it's O(1) because the string must be less than 2^64 bytes long on a 64-bit
machine.

So I can see why someone would claim that implementing size() via strlen()
"only" for small strings shouldn't be considered O(1), because strlen() is
O(n) and within that class of strings the runtime is increasing as the length
increases.

~~~
coolplants
I think it’s a bit more disingenuous to compare the magnitudes of 2^64 and 23
for the sake of argument, as if 2^64 isn’t practically asymptotic.

------
RcouF1uZ4gsC
Note that this implementation has undefined behavior according to the standard
because union members are accessed that haven't been written to. The code that
tests whether the string is long or short accesses members of the union
unconditionally.

This is fine because std::string is provided by the standard library and the
standard library is allowed to do stuff normal libraries are not allowed to
do. This technique does work in practice, but it is technically undefined
behavior.
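
A rough sketch of the pattern in question, using made-up names (`Rep`, `LongRep`, `ShortRep`) rather than libc++'s real ones, and assuming a little-endian target where the flag lives in the low bit of the first byte. Naming `r.s.size_and_flag` while `r.l` is the active member is the access the standard does not define. Code outside the standard library that wants the same layout can, I believe, stay within the rules by copying the byte out of the object representation instead:

```cpp
#include <cstddef>
#include <cstring>

struct LongRep {
    std::size_t cap;   // low bit set marks long mode (capacity is kept even)
    std::size_t size;
    char* data;
};

struct ShortRep {
    unsigned char size_and_flag;     // size << 1, so the low bit stays clear
    char data[sizeof(LongRep) - 1];
};

union Rep {
    LongRep l;
    ShortRep s;
};

// Inspects the first byte of the object representation via memcpy instead of
// reading a union member that may not be the active one.
// Assumption: little-endian, so the low byte of `cap` is the first byte.
inline bool is_long(const Rep& r) {
    unsigned char first;
    std::memcpy(&first, &r, 1);
    return (first & 1u) != 0;
}

inline Rep make_short(std::size_t size) {
    Rep r;
    r.s = ShortRep{};  // makes `s` the active member
    r.s.size_and_flag = static_cast<unsigned char>(size << 1);
    return r;
}

inline Rep make_long(char* data, std::size_t size, std::size_t even_cap) {
    Rep r;
    r.l = LongRep{even_cap | 1u, size, data};  // makes `l` the active member
    return r;
}
```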

~~~
Inityx
> standard library is allowed to do stuff normal libraries are not allowed to
> do.

How does this work, given that it's still written in C++? Is there special
casing in the compiler to define the behavior?

~~~
aidenn0
It will only ever be compiled with clang, so as long as clang doesn't
implement this behavior in a way that will cause it to be incorrect, that's
fine.

If it were to be special-cased, it would probably require an attribute or a
pragma or something. While it's not unheard of for compilers to automatically
detect they are compiling the standard library, it's fairly rare.

~~~
bregma
libc++ is compiled by compilers other than clang. It's a completely invalid
assumption that it will only ever be compiled by clang, because it has never
been compiled only by clang.

I say this as someone with a full time paid job supporting libc++ compiled by
another compiler for a commercial organization in a safety context.

~~~
aidenn0
Oh, that's good to know. If obscure C++ compiler X were to cause incorrect
behavior of this code, would it be easy to upstream the fix?

------
userbinator
_The short string mode uses the same 24 bytes to mean something completely
different._

...and the reason short strings are that length is precisely because a long
string needs that amount of space to store the pointer, capacity, and used
variables anyway. On a 32-bit system, the short string limit is lower by half.
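
As a sketch of that arithmetic (with `LongFields` as a hypothetical stand-in for the real long-mode members, and assuming `size_t` and pointers have the same width), the inline capacity falls straight out of `sizeof`:

```cpp
#include <cstddef>

// Hypothetical stand-in for the long-mode fields: capacity, size, pointer.
struct LongFields {
    std::size_t capacity;
    std::size_t size;
    char* data;
};

// One byte goes to the size/flag field and one to the null terminator,
// leaving the rest of the footprint for characters.
constexpr std::size_t kShortCapacity = sizeof(LongFields) - 2;

static_assert(sizeof(void*) != 8 || kShortCapacity == 22,
              "64-bit: 24 bytes total, 22 usable characters");
static_assert(sizeof(void*) != 4 || kShortCapacity == 10,
              "32-bit: 12 bytes total, 10 usable characters");
```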

Also, as optimised as this implementation is, I have yet to see a compiler
that's smart enough to do things like replace "dumb" uses of std::string with
essentially the equivalent of what a smart C programmer using pointers would
write (as the saying goes, "the fastest way to do something is to not do it at
all.") Ditto for the other data structures in the library. In other words,
optimising individual classes approaches a local minimum.

~~~
zelly
And the smart assembly programmer laughs at the C programmer. There are a
couple dozen string-oriented x86 instructions that I've never seen a C or C++
compiler produce. You could easily get a 2x speedup on strings by hand writing
clever x86 with SSE. In fact I'm surprised no one has made an STL with lots of
inline assembly.

~~~
erik_seaberg
Most of what I know predates SSE; does that have a long track record of being
fast? I know REP MOVSB was originally fast, and then CPU vendors decided it
was rarely used and did it in (slower) microcode, and then architectural
changes made it fast again sometimes depending on alignment.

~~~
userbinator
REP MOVS is still a microcode loop, but it will copy entire cachelines
(usually 64 bytes) at once if it can. The fact that it is a tiny instruction
(2 bytes) and runs in microcode means that it doesn't consume instruction
fetch bandwidth while it's running, and occupies only a tiny amount of the
instruction cache.

------
kccqzy
Here's something that once bit me. The libc++ implementation uses the short
string optimization, which means no heap allocation for short strings.
Unfortunately I didn't know this and naively thought that when you std::move a
string, the data() buffer would remain unchanged. That is incorrect for short
strings. You can trigger this by storing strings in any container that moves
its elements around, like std::vector or absl::flat_hash_map.
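
A minimal way to observe this, assuming a standard library that uses the short string optimization (`buffer_survives_move` is a made-up helper, and neither outcome is guaranteed by the standard):

```cpp
#include <string>
#include <utility>

// Returns true if data() points at the same buffer before and after a move.
// With SSO, a short string's characters live inside the string object itself,
// so the moved-to object necessarily has a different data() pointer.
inline bool buffer_survives_move(std::string s) {
    const char* before = s.data();
    std::string moved = std::move(s);
    return moved.data() == before;
}
```

On the implementations I'm aware of, `buffer_survives_move("hi")` is false while a string long enough to be heap-allocated typically keeps its buffer, because the move just hands over the pointer.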

~~~
zvrba
> when you std::move a string, the data() buffer would remain unchanged

You cannot assume anything about the state of a moved-from object. AFAIK, the
only valid operations are destruction and assigning something else to it.

~~~
quietbritishjim
> You cannot assume anything about the state of a moved-from object.

This is the right default assumption, but classes are allowed to have (and
document) more specific behaviour. For example, it is guaranteed that
std::unique_ptr and std::shared_ptr are empty (nullptr) after they have been
moved from, and std::vector is guaranteed to be empty after it is moved from
so long as the destination allocator is the same.

> AFAIK, the only valid operations are destruction and assigning something
> else to it.

Even for classes whose move constructors have no guarantees, there are often
other methods that don't have any preconditions, such as calling clear() or
resize(0). In fact it is allowed to call other operations and they should
behave consistently, it's just not guaranteed what exact value the object
should have (e.g. if size() > 0 then .at(0) should not throw and a second call
to it should return the same value as the first call to it).
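
A small sketch of both points, with made-up function names: a moved-from std::string can be brought back to a known state with clear(), and a moved-from std::unique_ptr is guaranteed to be null.

```cpp
#include <memory>
#include <string>
#include <utility>

inline std::string reuse_after_move() {
    std::string source = "some text that is about to be moved away";
    std::string sink = std::move(source);
    (void)sink;

    // source holds a valid but unspecified value here. clear() has no
    // preconditions, so it puts the object into a known (empty) state.
    source.clear();
    source += "fresh";
    return source;
}

inline bool unique_ptr_is_null_after_move() {
    auto p = std::make_unique<int>(42);
    auto q = std::move(p);
    // The standard guarantees p is null after being moved from.
    return p == nullptr && q != nullptr && *q == 42;
}
```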

~~~
zvrba
You're right. But I've given up "language lawyering" a long time ago: I have a
set of "rules of thumb" (like the one I just wrote) that exclude some
technically valid programs, (but so be it) and that let me write robust code
quickly without spending time on digging through the documentation for special
cases.

Even if I did know every single edge case in the language and library, the
developers next to me might not. Then they decide to emulate me (many learn by
example) and catastrophe ensues.

------
yongjik
I always wondered why these string implementations have full 8 bytes for
length. Many programs use millions of strings (or more), while very few would
ever use a >2GB string. It would make sense to use, say, "not-too-long string
optimization" where you only store 4-byte lengths for any string <2GB, and use
the remaining one bit to mean "the length is in the data block".

But I guess these people have run the benchmarks. ...Or maybe not. I have to
wonder.

~~~
usefulcat
It could be done as you say, in 16 bytes (assuming 64 bit pointers):

    
    
        struct not_too_long_string {
            char* data;
            uint32_t size;
            uint32_t capacity;
        };
    

Of course, if you do it that way then the longest string you can store with
the small string optimization is probably ~15 bytes instead of ~23. So
although you do save 1/3 on the size of each string, on average you're
probably still going to end up doing a greater number of dynamic allocations
because of the reduced small string capacity. Unless of course you know a
priori that a sufficiently large portion of your strings will be > 15 bytes
anyway, which of course the implementors of std::string almost certainly don't
know.

Edit: I failed to notice the part about the length being in the data block
(doh). I guess the disadvantage to putting the length there would be that an
extra indirection is required to get the length, a rather common operation.
And as others have pointed out, that only saves 4 bytes, which will be used
anyway for alignment.
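
For anyone who wants to check the numbers, here is the same struct with the sizes asserted. This assumes 8-byte pointers; `kInlineBytes` is a made-up name for the bytes left over if the struct's footprint were reused for short strings with one byte kept as a size/flag field.

```cpp
#include <cstddef>
#include <cstdint>

struct not_too_long_string {
    char* data;
    std::uint32_t size;
    std::uint32_t capacity;
};

// Bytes available inline after reserving one for a size/flag field. The null
// terminator, if stored, comes out of these as well.
constexpr std::size_t kInlineBytes = sizeof(not_too_long_string) - 1;

static_assert(sizeof(void*) != 8 || sizeof(not_too_long_string) == 16,
              "with 8-byte pointers the struct packs into 16 bytes");
```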

~~~
jcranmer
If you want to get really fancy, you can do it in 8 bytes. Pointers only use
48 bits on typical 64-bit platforms, so you can squeeze in a 16-bit size field.
If size overflows that, then you can use a cookie before the data string to
find the size.
Capacity could be stored in such a cookie, or junked entirely and you rely on
your memory allocator to get the size of the allocation (small-string
optimization obviously not even being considered in this model).
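
A minimal sketch of the packing step, assuming a 64-bit target where user-space pointers have their upper 16 bits clear (this does not hold everywhere, for example with pointer tagging or larger address spaces). `pack`, `unpack_ptr` and `unpack_size` are hypothetical helpers, and the overflow cookie is left out.

```cpp
#include <cstdint>

static_assert(sizeof(std::uintptr_t) == 8, "sketch assumes a 64-bit target");

constexpr std::uint64_t kPtrMask = (std::uint64_t{1} << 48) - 1;

// Packs a pointer and a 16-bit size into one 64-bit word.
// Assumption: the pointer's upper 16 bits are zero.
inline std::uint64_t pack(const char* p, std::uint16_t size) {
    const auto bits = reinterpret_cast<std::uintptr_t>(p);
    return (std::uint64_t{size} << 48) | (bits & kPtrMask);
}

inline const char* unpack_ptr(std::uint64_t word) {
    return reinterpret_cast<const char*>(
        static_cast<std::uintptr_t>(word & kPtrMask));
}

inline std::uint16_t unpack_size(std::uint64_t word) {
    return static_cast<std::uint16_t>(word >> 48);
}
```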

------
Thorrez
> I will assume that ... the char type is signed and 1 byte wide.

You don't really need to make an assumption about being 1 byte wide. That's
guaranteed by the standard.

[https://en.cppreference.com/w/cpp/language/sizeof](https://en.cppreference.com/w/cpp/language/sizeof)

~~~
stuffypages
Thanks! Fixed.

~~~
stuffypages
Interesting, the page you linked also says this.

>Depending on the computer architecture, a byte may consist of 8 or more bits,
the exact number being recorded in CHAR_BIT.

I'll add that as an assumption.
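
Both facts can be stated as compile-time checks. This is only a sketch, with `kBitsPerChar` as a made-up name:

```cpp
#include <climits>

// sizeof(char) is 1 by definition, so this can never fire.
static_assert(sizeof(char) == 1, "guaranteed by the standard");

// The number of bits in that one byte is a separate question.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

constexpr int kBitsPerChar = CHAR_BIT;
```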

~~~
thedance
Does it make a difference? It doesn't seem like your article relies on
CHAR_BIT being 8.

------
huhtenberg
In situations when there might be a lot of _empty_ strings, the 'capacity' and
'size' can be taken out of the fixed part and bundled with the data instead.

This will bring sizeof(string) to a size of a single pointer and will still
allow for short strings of 7 chars (in 64-bit builds).

Memory allocations are usually aligned by default, so the pointers will have
at least one lower bit cleared and available to be used as a mode flag.

If in doubt, allocating through aligned_alloc(2,...) will guarantee an unused
bit in a pointer.
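
A sketch of the low-bit trick, with made-up helper names. It relies on the allocation being at least 2-byte aligned, which malloc provides since its result is suitably aligned for any fundamental type.

```cpp
#include <cstdint>
#include <cstdlib>

constexpr std::uintptr_t kTagBit = 1;

// True if the pointer's lowest bit is clear and so free to carry a flag.
inline bool low_bit_is_free(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & kTagBit) == 0;
}

// Stores the pointer in an integer word with the mode flag set.
inline std::uintptr_t tag(void* p) {
    return reinterpret_cast<std::uintptr_t>(p) | kTagBit;
}

inline bool is_tagged(std::uintptr_t word) {
    return (word & kTagBit) != 0;
}

// Recovers the original pointer by clearing the flag bit.
inline void* untag(std::uintptr_t word) {
    return reinterpret_cast<void*>(word & ~kTagBit);
}
```

The tagged word must always go through `untag` before being dereferenced or freed.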

~~~
gpderetta
Yes, sometimes it is worth optimizing for size, especially if the string is
actually not used in any fast path and just needs to be there.

I'll mention though that std::string (well, basic_string) takes an allocator
parameter, so it could only enable this optimization for 'well known'
allocators that provide aligned buffers.

------
amelius
Isn't the __cap__ field available also inside the malloc block header, so
couldn't this field be optimized out of the std::string implementation?

~~~
thedance
No. You can’t make assumptions about how global operator new works. Allocators
like tcmalloc are allowed to override it. And string can be instantiated with
any stl allocator, so it might never call ::new anyway.

------
jnordwick
Does anybody know why vector isn't allowed short buffer optimization or if it
will be changed in the future?

~~~
gpderetta
You could say that while most strings are small, vectors come in all kinds of
sizes and the T size itself can be large, so the optimization is less of a win
for a general purpose container.

But it is mostly for historical reasons. I believe the original STL used the
SSO optimization [1], so there was never any assumption about the stability
of references to string elements, while there is a lot of code that assumes
that references to vector elements do not change.

[1] The SGI STL, directly derived from the original HP STL, had extensive
rationale on why it didn't implement COW; libstdc++, which I believe also
traces its roots from it, decided to instead do COW. The rest is history.

------
lallysingh
Typo in title. That's std::string.

~~~
stuffypages
Thanks! I don't think I can edit it now though :(

Does the edit button disappear after a certain amount of time?

~~~
thenewnewguy
I don't believe you can ever edit the title of a submission, but the mods can
(and typically will).

