Reminds me of a talk a few years ago about Facebook's internal implementation of std::string. They do the same short string optimization, but Facebook manages to outperform this implementation by storing 24 bytes (vs 23) in "short string mode".
IIRC, Facebook achieves this by using the last byte as the flag byte. To signify short string mode, this flag is set to 0. This allows it to also serve as the null terminator. Tricky!
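For illustration, a minimal sketch of the layout this describes (not Folly's actual code; the field names and flag placement are assumptions):

    #include <cstddef>

    // 24-byte object: either three heap-mode words, or 24 in-situ chars.
    union ShortLayout {
        struct {
            char*       data;
            std::size_t size;
            std::size_t capacity;  // a flag bit somewhere here marks "long" mode
        } heap;
        char in_situ[24];  // in "short" mode, in_situ[23] stores 23 - size
    };

    // 23 - (23 - size) == size, and a full 23-char string stores 0 in the
    // last byte, which is exactly the NUL terminator it needs.
    std::size_t short_size(const ShortLayout& s) {
        return 23 - static_cast<unsigned char>(s.in_situ[23]);
    }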
One important update is that the primary rationale for us, at the time, to change std::string's implementation to `fbstring` was to get SSO (small string optimization, for those unfamiliar with the initialism). We got huge perf wins for having SSO on std::string.
This was, at the time, a violation of the standard: by its wording, SSO was not compliant.
Once the standard was changed (as of C++11) so that std::strings could have SSO, we removed our patches on the standard library and now use "vanilla" std::string (from libstdc++). We still have `folly::fbstring` for people who want that implementation (which is smaller, and has more in-situ capacity, because of various cleverness detailed in that talk), but it's no longer our `std::string`.
I think it's to avoid a branch in `size()`. If you use this trick, you have to check the flag to see whether it's a small in-situ string or a "regular" heap-allocated string, because they calculate size differently. In the standard library strings, the size is always just stored in the same position in memory, so `size()` has essentially no overhead when inlined; it's just a memory load.
Branches on their own aren't particularly expensive (especially not an easily predictable one like this), but they do screw with several compiler optimizations (auto-vectorization, for one). Given that, it seems like a reasonable decision, and the only way you'd know which one is faster is just testing it.
I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes, just to get longer strings using the SSO. The trade-off in the stack space of the string seems like it would totally be worth it, and with 56-character "small strings", a huge number of strings could fit in there (virtually all names, for instance). You probably couldn't do it for std::string (it would wreak havoc with ABIs), but my hunch is that performance and memory usage might both benefit from the reduced memory allocations.
Yes, there is some concern that it would affect code size and/or make some optimizations impossible, but on the latter I have yet to see hard data that that is the case. We're talking about adding about 1-2 cycles of latency.
Note that the branch can be implemented in a way that compilers will lower into a CMOV, making the code size issue almost moot. I implemented that in fbstring back in the day.
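Roughly, the idea is to compute both candidate sizes and select with a ternary, which compilers readily lower to a conditional move (a sketch with made-up field names, not fbstring's real layout):

    #include <cstddef>
    #include <cstdint>

    struct Str {
        std::size_t  heap_size;  // valid in long mode
        std::uint8_t last_byte;  // 23 - size in short mode
        bool         is_long;
    };

    std::size_t size(const Str& s) {
        std::size_t long_sz  = s.heap_size;
        std::size_t short_sz = 23 - s.last_byte;
        // Both values are computed unconditionally; the select below
        // typically becomes a CMOV rather than a conditional jump.
        return s.is_long ? long_sz : short_sz;
    }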
> I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes, just to get longer strings using the SSO
And libstdc++ is actually 32 bytes :( It's easy to template on the inline capacity (look at llvm::SmallVector or folly::small_vector), but vocabulary types should have the smallest possible footprint. The vast majority of instances are either empty or very small (think keys in a map).
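Templating on the in-situ capacity might look something like this (a hypothetical string analogue of llvm::SmallVector, not a real API):

    #include <cstddef>

    template <std::size_t N>
    class InplaceString {
        char        buf_[N + 1];  // in-situ storage, NUL included
        std::size_t size_ = 0;
        // ... spills to the heap once size_ exceeds N, as the real
        // small-size-optimized containers do
    };

    InplaceString<64> name;  // 64 chars in situ; sizeof grows to match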
> I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes
The "optimal" implementation would actually have different allocation sizes when used mostly for processing and when used mostly for storage: for processing one would prefer bigger defaults, for storage, one needs as little as possible initial overhead.
That could be achieved by surfacing that distinction and introducing (at least) two kinds of strings. And even more can be gained by recognizing that whenever strings are involved in more complex structures, additional restrictions across the totality of all associated strings could also be exploited: allocating the strings separately is an overhead by itself.
If I remember correctly, Turbo Pascal strings had a default size of 256 bytes, and that's how much was on the stack for each, and that was fast even decades ago. It was not efficient for keeping a lot of them in structures if many could be smaller, but that was something the programmer had to handle separately; still, the typical processing of strings, working with a limited number of them, was nicely covered by that default.
The 2^n-1 sizes are cache-line friendly and don't need explanation, I think. 27 was the maximum length of a volume name in MFS, the original Macintosh file system (which was a strange mix of backwards (no directories) and forwards (255-character file names) thinking).
Str32 was used in AppleTalk (and probably a design error or bug; Str31 is a more ‘natural’ type)
You shouldn't need a branch though. The size would always be the last element of the buffer. It doubles as the null terminator if the size is exactly 23. The only additional operation is adding the static offset to the stored value to get the actual size.
It is probably just a small win and not everybody wanted to pay for the extra complexity (SSO is already fairly complex).
Re your suggestion on large SSO: years ago, when doing document clustering, using large fixed-size strings (64 chars if I remember correctly; longer strings were just truncated) was a huge win for me (even copying them around wasn't an issue). Not sure if it is appropriate for a general-purpose string though.
edit: never mind, I see what you mean about having to check the size.
Yeah, exactly :) when in SSO mode, it stores the size as a delta occupying a single byte, but when in heap-allocated mode, it stores the size as an 8-byte size_t, and you require a branch to check which case is the right one. You can see the branch here in the source [0], and godbolt confirms that for std::string it's just a memory load [1]. A shame Folly isn't available on Compiler Explorer; it would be interesting to compare.
Just a guess: the null byte is necessarily at the end of the 24 bytes rather than at the beginning. It might be in a different (typically 64-byte) cache line, maybe one you otherwise wouldn't need to read in from main RAM if you're handling a shorter string. That might make more of a performance difference than whether 23-byte strings (plus NUL byte) are short or long, depending on your application.
> Do you happen to know why the standard doesn’t use that trick?
Many of the early implementations of `std::string` were COW (copy-on-write), which is not entirely compatible with SSO strings. In practice, COW strings proved to be troublesome and slow, and the SSO strings won out. The standard was changed to effectively mandate SSO strings and forbid COW strings.
In the early days of std::string, multi-threading was much rarer, so reference counting was cheaper. As CPU counts grew, the relative cost of needing an atomic refcount on every string also grew.
The COW strings were also a constant source of surprising behavior. As with any COW object you have to make sure that the copy happens before any write. Simple enough, but the STL-style interface does not support this well. All you had to do was call operator[] or begin() on a non-const std::string and it would force a copy.
This meant that something that to the programmer clearly looked like a read-only operation like:
if (str[2] == 'x') { ...}
...could end up copying the string. Worse, it also meant that any previously taken references were now invalidated. This led to horrible bugs where you would hold a pointer at the previous copy of the string (which you no longer own a refcount on) and it would almost always work... but every so often another thread would come by and reuse the memory behind your back.
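A hypothetical toy COW string makes the mechanism concrete (illustration only; real implementations also need atomic refcounts and an assignment operator, omitted here):

    #include <cstddef>
    #include <cstring>

    struct CowString {
        struct Rep { int refs; char* chars; };
        Rep* rep;

        static char* dup(const char* s) {
            char* p = new char[std::strlen(s) + 1];
            std::strcpy(p, s);
            return p;
        }

        explicit CowString(const char* s) : rep(new Rep{1, dup(s)}) {}
        CowString(const CowString& o) : rep(o.rep) { ++rep->refs; }  // cheap "copy"
        ~CowString() {
            if (--rep->refs == 0) { delete[] rep->chars; delete rep; }
        }

        // Non-const operator[] can't tell a read from a write, so it must
        // pessimistically unshare -- copying the buffer and invalidating
        // any previously taken pointers into the old one.
        char& operator[](std::size_t i) {
            if (rep->refs > 1) {
                --rep->refs;
                rep = new Rep{1, dup(rep->chars)};
            }
            return rep->chars[i];
        }
    };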
So COW std::string was a long festering mess in C++. Their removal in C++11 was completely warranted.
It’s funny that every company gets to the point where they have to have their own std::string. I wonder if Facebook also traces their need back to the time when GNU std::string was copy-on-write and therefore completely useless. Google, I think, can trace their std::string to those dark days. Luckily c++11 effectively outlawed COW string implementations, so nobody need be subject to that horror show again.
The language is fundamentally too weak to express the kind of reference and borrowing relationships demanded by CoW.
Suppose I have a string and I take a reference to a character:
std::string a = f(); // shared
char& c = a.front();
Should the string be copied or not? One might argue no: taking a reference is different from "writing", so no copying. But one might also argue that the reference can be used to change the string later, and the string doesn't know when this would happen, so it has to pessimistically make a copy.
You get bugs if you go with the former, and bad performance if you go with the latter.
Considering that jq's libjq jv [1] C API is CoW and written in C, I'm a bit disinclined to believe that C++ is too weak to express a reasonable CoW API for strings. Instead, I suspect that std::string was too weak a design for that.
The language is too weak to express it safely. CoW on mutable strings in C and C++ by necessity implies safety (or drastically performance-degrading) tradeoffs. By contrast, `jv` types are immutable so they can presumably implement CoW without compromising memory safety or correctness.
But `jv` is written in C. It requires that the programmer adhere to a convention, and that is probably what you mean by "too weak" -- and I would agree with that if that is what you mean.
All APIs have conventions though, and jv's is quite simple. It's very difficult to get it wrong and not crash immediately, though it's easy enough to get it wrong and leak. Having (unfortunately) had to use C a lot, I must say that the jv API is among the nicest I've had the pleasure of using. Sure, it's not Rust, but neither is C++, and yet that's the topic of this thread. But I'd use a jv-like API in C++, and I think that would be peachy.
Unfortunately the STL has an overload for the non-const version, which is what causes this problem. It's hard to imagine the cases where you'd need a mutable reference to a single character. A char fits into the same register the reference would live in, so even as an immutable reference it doesn't make much sense.
That's not an answer. The STL has to have a non-const version that returns some kind of reference, because C++ strings are mutable, and even something like
s[3] = 'h';
involves forming a mutable reference and then assigning its pointee.
The way I think about it, to make CoW strings really usable in C++, you basically have to make strings immutable, like many newer languages like Python. That ship has sailed by the time standardization for C++ happened.
Atomic operations are needed on multi-core, and anything that leaks a non-const reference to raw memory, like casual read-only use of operator[], .begin(), or .end(), forces a copy on shared objects and prevents the object from being shared in the future.
In addition to what other people have said, the move semantics introduced in C++11 made a lot of the benefit of CoW moot: the point of CoW is to have a cheap copy operation, but most of those cheap copy operations can be replaced by even cheaper moves, often automatically by the compiler.
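For instance (a sketch):

    #include <string>
    #include <vector>

    std::string build() {
        std::string s(1000, 'x');
        return s;  // moved (or elided) out: no deep copy, no refcount needed
    }

    int main() {
        std::vector<std::string> v;
        v.push_back(build());  // another move; COW's cheap copy buys nothing here
    }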
A copy-on-write string is pretty similar to a shared_ptr to char. Like a shared_ptr, a shared string has to check whether it is the last referent when it goes out of scope. This introduces a point of contention on multi-core systems. A string like the one described in this article might not have to do anything in its destructor, whereas a shared string will need to atomically decrement a reference count. That’s a huge difference in cost.
The benefit of a shared string is you might save some space. That’s fine, but I think most programmers today will choose an explicitly shared type when they need it, and a fast string for most cases.
From what I've heard, COW means each string's data may be pointed to by multiple objects, potentially being used by different threads. So every string operation requires thread synchronization, which can be quite expensive.
It's a tradeoff. In libc++'s version you still have string size stored in the top 7 bits so you just need a bitshift to get size. It sounds like fb's implementation would require looping until null terminator to get the size.
If you use the LSB of the last byte for a flag, you're using the 56th bit of one of the {data,size,cap} words. I'd use the MSB in this case, since you can shift that off easier.
To save those who, like me, were going to comment 'that would violate the standard because std::string::size is required to be O(1) complexity' a Google - the standard recommends but doesn't require that.
Facebook's implementation would still be constant time for short strings, because there's a constant which bounds the runtime.
Though I hear that the definition of big-O notation has shifted a bit in Silicon Valley these days so maybe that answer would get me in trouble in an interview.
It's true that big-O notation only concerns behavior with large N, but it's a bit disingenuous to say that the loop executes a constant number of times -- by that argument, you could say that if you implemented size() by strlen() it's O(1) because the string must be less than 2^64 bytes long on a 64-bit machine.
So I can see why someone would claim that implementing size() via strlen() "only" for small strings shouldn't be considered O(1), because strlen() is O(n) and within that class of strings the runtime is increasing as the length increases.
That is clever. Using the last byte as the flag byte also lets one cast the std::string memory address to a null-terminated c-string without intermediate buffers (not that one should).
Umm.. in any implementation you can already "cast" a std::string to a null-terminated string without a buffer (cf std::string::data() and std::string::c_str()) -- C++11 essentially mandates that. c_str must return "a pointer p such that p + i == &operator[](i) for each i in [0,size()]."
The history of those demonstrates some of the bat-shit insanity of the evolution of C++. Originally data didn't have to be null-terminated, but then they finally realized that making c_str actually work without this invariant was, in general, nearly impossible. So since C++11 data and c_str are equivalent.
Edit: by impossible I really mean something usable with sane complexity requirements -- that wouldn't make people just immediately discard std::string for something better. This is the same C++ that brought you auto_ptr, and other garbage. C++11 corrected many things, this included.
It was possible with copy on write strings (e.g. libstdc++'s strings until C++11), but C++11 added complexity requirements that outlawed COW strings, because they're only really useful in very specific circumstances, at the cost of making the common case slow.
Imo this is probably in part because you can mix the two more easily? I imagine you could link against C++ 2003 code compiled with a compiler understanding the new ABI?
There was no need to plug in the NUL until somebody called c_str(). That possibility was lost in C++11. Now, if somebody plugs something else there (which is done, UB notwithstanding), c_str() produces the wrong answer.
NUL-terminated strings were one of the dumber things C++ inherited from C, and code that depends on NUL termination is not a thing to be proud of.
How do you "plugin" the NUL. If there's no space for it, then everytime you want a C string, you need to do an allocation. And why would you put something in it's place - as you said if you modify the c_str that was always UB -- so you pretty much lost all guarantees at that point -- that was just dumb.
Basically the NUL terminator amounts to 1 byte of wasted space -- which is almost always completely in the noise -- and if you're using std::string for tiny strings you're paying at least 24 bytes anyway. There are just very few cases where that one extra byte matters.
I think the overwhelming opinion is that this 1-byte trade-off is better than the overhead of allocation and copying when you want to pass that string around. NUL-terminated strings aren't important in C++ because of C (well, not directly), but because they are essentially the ABI of countless existing libraries and the major operating systems.
> code that depends on NUL termination is not a thing to be proud of.
Whatever. Code that depends on NUL termination is ubiquitous.
>How do you "plug in" the NUL? If there's no space for it, then every time you want a C string, you need to do an allocation.
You have not thought it through. There was always space for a NUL, but nobody needs it to have a NUL in it until somebody calls c_str(). Anyway, that was true until somebody jiggered the Standard, just before 1998, to require a NUL visible to s[s.size()].
>C Code that depends on NUL termination is ubiquitous.
> Anyway, that was true until somebody jiggered the Standard, just before 1998, to require a NUL visible to s[s.size()].
This is not technically true. You could plug in your hated NUL byte in the implementation of operator [].
> C Code that depends on NUL termination is ubiquitous.
No, that's not what I said.
It really doesn't matter what language the code underlying all the APIs that expect NUL-terminated strings is written in (a lot of it is C++ too, of course). Windows, MacOS, and all POSIX-ish systems have a large API surface that consumes NUL-terminated strings. NUL-terminated strings are ubiquitous in computing at this point. Sure, blame C from 40 years ago and burn a dmr effigy, I don't care, but that battle was long lost.
NUL-terminated strings may be terrible, but the C++ accommodation for them is not -- it's a well-thought-out trade-off. My understanding is that neither Go nor Rust makes this trade-off. Go's FFI and syscall overhead has always been something of a performance disaster anyway. C++ has always had a greater demand to "play nice" with legacy code and systems than either of those.
The overhead of just having the NUL-byte is almost always a non-factor. If it really is, then use a byte vector.
Buggering op[] would be the slowest possible way to get the NUL there. Some of us would have liked for string operations not to be as absolutely slow as can be achieved. Evidently that is a minority preference.
If you think anybody is complaining about the extra space for the NUL, what you are barking up is not even a tree.
Are you saying that there should not be an extra 0 byte at the end until the string fills up and c_str() is called? Inconsistent allocations seem like a recipe for difficult to debug performance problems.
Note that this implementation has undefined behavior according to the standard, because union members are accessed that haven't been written to. The code that tests whether the string is long or short accesses members of the union unconditionally.
This is fine because std::string is provided by the standard library and the standard library is allowed to do stuff normal libraries are not allowed to do. This technique does work in practice, but it is technically undefined behavior.
I'm not very familiar with the C++ standard, but in the C standard a load from a different union member than the last store isn't undefined behavior, it's unspecified behavior. That's a big difference. So long as the store doesn't generate a trap representation (and most architectures don't have such representations anymore) when reinterpreted as the new type, then you can rely on the behavior, it's just that you'll need to refer to compiler documentation to understand the values reliably produced.
Moreover, there are additional guarantees when reading through char types, as well as unsigned types, so depending on the precise code things may very well be totally well defined.
C doesn't disallow type punning. The gotchas mostly have to do with visibility to the compiler, otherwise the compiler may reorder loads and stores. The safest way to do type punning is through union members as the standard makes additional guarantees that restrict the types of optimizations a compiler can make.
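The two idioms side by side (shown in C++ syntax; the union read is the C-sanctioned form, while in C++ it is technically UB and memcpy is the portable fallback):

    #include <cstdint>
    #include <cstring>

    float bits_to_float_union(std::uint32_t u) {
        union { std::uint32_t u; float f; } pun;
        pun.u = u;
        return pun.f;  // explicitly allowed in C; technically UB in C++
    }

    float bits_to_float_memcpy(std::uint32_t u) {
        float f;
        std::memcpy(&f, &u, sizeof f);  // well-defined in both languages
        return f;
    }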
>> Note that this implementation has undefined behavior according to the standard because union members are accessed that haven't been written to.
Except when the union member you are reading is a 'common initial sequence' [1] of the union member that was written, to be precise ;-). But that's not the case here.
Undefined behavior in C++ and C means that it's up to the implementor to decide what happens in a situation. Usually it ends up being whatever is most convenient or most performant on the target architecture.
One consequence is that portable code can't depend on any particular behavior because it can be different between implementations. Every implementation will do something, but each one may do something completely different.
Another consequence is that "undefined behavior" isn't undefined in the context of a specific implementation because you can look at what it does and see how it's defined in that implementation. In this case, libc++ is essentially part of the implementation, so it's fair game for it to depend on implementation details of clang.
> Undefined behavior in C++ and C means that it's up to the implementor to decide what happens in a situation
> Every implementation will do something, but each one may do something completely different.
If I'm reading this correctly, you're saying that we can depend on each compiler providing a consistent way of handling each kind of undefined behaviour.
That's not correct. That describes implementation-defined behaviour, which is different. [0]
Compilers do not have to decide what behaviour should result from a particular kind of undefined behaviour, and then commit to ensuring that behaviour occurs consistently. That's the point of undefined behaviour: the compiler is permitted to assume the absence of undefined behaviour, and to optimise accordingly.
If you have undefined behaviour in your C++ code, you are not guaranteed to see consistent program behaviour. Your program is ill-formed. All bets are off, throughout the entire lifetime of your program. [1] (In C++, undefined behaviour can 'travel back in time', meaning that if your program invokes undefined behaviour, the behaviour across the entire lifetime of your program is made undefined.)
A compiler may choose to commit to a certain behaviour for a certain type of undefined behaviour (such as guaranteeing wrap-around behaviour for signed overflow), but it is not required to.
A compiler is required to define a consistent value for sizeof(int), because that's implementation-defined. [0]
> Another consequence is that "undefined behavior" isn't undefined in the context of a specific implementation because you can look at what it does and see how it's defined in that implementation.
This isn't right.
Unless the compiler's documentation tells you that you can rely upon its handling of the relevant undefined behaviour, the compiler is not required to provide consistent behaviour for any particular kind of undefined behaviour.
> In this case, libc++ is essentially part of the implementation
It's about maintenance. Things which are "undefined behavior" can actually have a defined behavior if the compiler provides it. By doing something undefined in the STL, the compiler authors are saying that the compiler will support it, and if the compiler changes, the authors will also change the library. If you rely on this behavior, there's no guarantee that the compiler won't change it in a future update, thereby breaking your code.
It will only ever be compiled with clang, so as long as clang doesn't implement this behavior in a way that will cause it to be incorrect, that's fine.
If it were to be special-cased, it would probably require an attribute or a pragma or something. While it's not unheard of for compilers to automatically detect they are compiling the standard library, it's fairly rare.
libc++ is compiled by compilers other than clang. It's a completely invalid assumption that it will only ever be compiled by clang, because it has never been compiled only by clang.
I say this as someone with a full time paid job supporting libc++ compiled by another compiler for a commercial organization in a safety context.
The short string mode uses the same 24 bytes to mean something completely different.
...and the reason short strings are that length is precisely because a long string needs that amount of space to store the pointer, capacity, and used variables anyway. On a 32-bit system, the short string limit is lower by half.
Also, as optimised as this implementation is, I have yet to see a compiler that's smart enough to do things like replace "dumb" uses of std::string with essentially the equivalent of what a smart C programmer using pointers would write (as the saying goes, "the fastest way to do something is to not do it at all"). Ditto for the other data structures in the library. In other words, optimising individual classes approaches a local minimum.
And the smart assembly programmer laughs at the C programmer. There are a couple dozen string-oriented x86 instructions that I've never seen a C or C++ compiler produce. You could easily get a 2x speedup on strings by hand writing clever x86 with SSE. In fact I'm surprised no one has made an STL with lots of inline assembly.
Most of what I know predates SSE; does that have a long track record of being fast? I know REP MOVSB was originally fast, and then CPU vendors decided it was rarely used and did it in (slower) microcode, and then architectural changes made it fast again sometimes depending on alignment.
REP MOVS is still a microcode loop, but it will copy entire cachelines (usually 64 bytes) at once if it can. The fact that it is a tiny instruction (2 bytes) and runs in microcode means that it doesn't consume instruction fetch bandwidth while it's running, and occupies only a tiny amount of the instruction cache.
you'd get worse results, because 1) most of those instructions are slower on modern hardware anyways, and 2) assembly is an optimization barrier on modern compilers, so even if you make one function 0.1% faster, it's not worth it if you've slowed down the rest of the program by 1%.
> Ditto for the other data structures in the library.
I agree with your observations, but I don't know if I agree with this part. For example, if you've already chosen to make a dynamic-resizing heap buffer, you're probably going to write something no better than std::vector. I know this because I've written the same thing many times working in .c files. It's handy to have that one standardized.
Of course, the lack of a good one of those will make the dynamic-resizing heap buffer a less common choice in C, so in that sense maybe you're right.
> For example, if you've already chosen to make a dynamic-resizing heap buffer, you're probably going to write something no better than std::vector.

> Of course, the lack of a good one of those will make the dynamic-resizing heap buffer a less common choice in C, so in that sense maybe you're right.
Precisely. If you need a dynamically resizeable buffer then using std::vector will be a very good idea, but perhaps you don't really need it; the lack of such "ready-to-use" data structures in C encourages more thought on whether it's necessary to do so, and as a result you might end up with a more efficient solution which doesn't.
Another example is using a std::map<char, something> - no compiler I know of will replace that with a 256-element array, despite the latter being much more efficient than the former when you use most of the values.
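A sketch of that replacement (hypothetical example values):

    #include <array>
    #include <map>
    #include <optional>

    std::map<char, int> counts_map;                  // node per entry, pointer chasing
    std::array<std::optional<int>, 256> counts_arr;  // one contiguous block

    void bump(unsigned char c) {
        auto& slot = counts_arr[c];
        slot = slot.value_or(0) + 1;  // same logic, no allocation, no rebalancing
    }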
While you probably will not write something better than the std::vector in a good C++ library such as libc++, it is quite possible to do better than a bad implementation of std::vector. Source: I have done better than the std::vector that was shipped with the compiler on an old Sun.
I have an old Sun with their old compiler powered off in my home; I could check it out.
What is the complaint though? Bad realloc strategy?
Come to think of it, the worst part of how vectors historically behaved was not the vector itself, but the requirement to use the copy constructor at reallocation time. C++11 and move semantics fixed that.
That was a decade ago and I don't remember. My main point was: Yes it is good to have a standard implementation that comes shipped with C++, but that does not mean you can not do better for your special case. And it might be possible with surprisingly little code.
It took a lot of time for most C++ compilers to get good at removing abstraction. It was so bad that Stepanov wrote the abstraction penalty benchmark to name and shame compilers.
In particular, vectors with class-type iterators (as opposed to raw pointers) paid a heavy penalty. Is it possible that your vector used raw pointers as iterators?
I am not op, but I do frequently feel like idiomatic C++ will create temporary copies of strings in areas where a C programmer would be less inclined to do so.
For example, removing a prefix the C way would be to add the size of the prefix to the pointer.
I guess C++ has string_view for that sort of case as of C++17.
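Both approaches, sketched:

    #include <cstring>
    #include <string_view>

    // C style: "remove" the prefix by advancing the pointer (no copy).
    const char* strip_prefix_c(const char* s, const char* prefix) {
        std::size_t n = std::strlen(prefix);
        return std::strncmp(s, prefix, n) == 0 ? s + n : s;
    }

    // C++17: string_view::remove_prefix does the same without owning memory.
    std::string_view strip_prefix_cpp(std::string_view s, std::string_view prefix) {
        if (s.substr(0, prefix.size()) == prefix)
            s.remove_prefix(prefix.size());
        return s;
    }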
> For example, removing a prefix the C way would be to add the size of the prefix to the pointer.
Surely to do this you'd still have to keep either a reference to the original pointer around or remember the offset so you could `free(str)` with the correct pointer?
Yes, although it's not always your responsibility to do that.
Firstly not all strings are on the heap, but ignoring that... If somebody passes you a pointer to a string and you aren't going to hold onto it for a while, you can just add to the pointer and it's no big deal. But yes, if you own the allocation, you need to keep the old pointer too.
I always wondered why these string implementations have full 8 bytes for length. Many programs use millions of strings (or more), while very few would ever use a >2GB string. It would make sense to use, say, "not-too-long string optimization" where you only store 4-byte lengths for any string <2GB, and use the remaining one bit to mean "the length is in the data block".
But I guess these people have run the benchmarks. ...Or maybe not. I have to wonder.
You would be trading 4 bytes for an extra branch on getting the size. Since the instance needs 8-byte alignment, those 4 bytes would be lost in padding or allocation overhead anyway.
The predictors work by caching which branches were taken, and assuming the same outcome is likely to happen next time the branch at that address is checked. That particular branch is checked all the time, and unless you actually have >2GB strings the result is always the same.
Only branches with dynamic runtime behavior are bad for performance, like in binary search when the same branch is taken/not taken randomly, depending on the data and the key being searched. Branches with stable runtime behavior are OK thanks to branch prediction. Examples of good branches: `while( i < 10000 )`, `if( string_length < INT_MAX )`
Of course, if you do it that way then the longest string you can store with the small string optimization is probably ~15 bytes instead of ~23. So although you do save 1/3 on the size of each string, on average you're probably still going to end up doing a greater number of dynamic allocations because of the reduced small string capacity. Unless of course you know a priori that a sufficiently large portion of your strings will be > 15 bytes anyway, which of course the implementors of std::string almost certainly don't know.
Edit: I failed to notice the part about the length being in the data block (doh). I guess the disadvantage to putting the length there would be that an extra indirection is required to get the length, a rather common operation. And as others have pointed out, that only saves 4 bytes, which will be used anyway for alignment..
If you want to get really fancy, you can do it in 8 bytes. Pointers are only 48 bits on 64-bit platforms, so you can squeeze in a 16-bit size field. If the size overflows that, then you can use a cookie before the string data to find the size. Capacity could be stored in such a cookie, or junked entirely, relying on your memory allocator to report the size of the allocation (small-string optimization obviously not even being considered in this model).
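A hypothetical sketch of that 8-byte packing (relies on x86-64-style 48-bit canonical user-space addresses; not portable, not a real library):

    #include <cstdint>

    struct PackedString {
        std::uint64_t bits;  // low 48 bits: pointer; high 16 bits: length tag

        char* data() const {
            return reinterpret_cast<char*>(bits & 0x0000FFFFFFFFFFFFull);
        }
        std::uint64_t size() const {
            std::uint16_t tag = static_cast<std::uint16_t>(bits >> 48);
            // In the full scheme, a saturated tag (0xFFFF) would mean "read
            // the real size from a cookie stored before the character data".
            return tag;
        }
    };
    static_assert(sizeof(PackedString) == 8, "one word wide");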
Eh, I was thinking about pretty much exactly what you said. "Length in the data block" would be when length doesn't fit in 32 bits. (I.e. >2GB or maybe >4GB.)
It would require an additional branch to test for huge strings, but it will be almost never executed, and I think modern CPUs are pretty good at optimizing out such branches...
I'm not sure that optimization is relevant now that your CPU's native architecture will be just as efficient with the 8-byte length as the 4-byte one.
They do run benchmarks, but of course it is anyone's guess how significant and representative they are. At the end of the day it is a long list of tradeoffs. You pick one point in the option space and hope to have gotten it right, because it will be enshrined in the ABI for a very long time.
Here's something that once bit me. The libc++ implementation uses short string optimization, which means no heap allocation for short strings. Unfortunately I didn't know this and naively thought that when you std::move a string, the data() buffer would remain unchanged. That turns out to be incorrect. You can trigger this by storing strings in any container that moves its elements around, like std::vector or absl::flat_hash_map.
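A small demonstration of the pitfall (the assertion holds on typical SSO implementations such as libc++ and libstdc++; only pointers are compared, never dereferenced):

    #include <cassert>
    #include <string>
    #include <utility>

    int main() {
        std::string s = "short";        // fits in the SSO buffer
        const char* before = s.data();  // points into s itself
        std::string t = std::move(s);   // t gets its own in-situ buffer
        // 'before' points at the moved-from object, not at t's contents.
        assert(t.data() != before);
    }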
> when you std::move a string, the data() buffer would remain unchanged
You cannot assume anything about the state of a moved-from object. AFAIK, the only valid operations are destruction and assigning something else to it.
> You cannot assume anything about the state of a moved-from object.
This is the right default assuption, but classes are allowed to have (and document) more specific behaviour. For example, it is guaranteed that std::unique_ptr and std::shared_ptr are empty (nullptr) after they have been moved from, and std::vector is guaranteed to be empty after it is moved from so long as the destination allocator is the same.
> AFAIK, the only valid operations are destruction and assigning something else to it.
Even for classes whose move constructors have no guarantees, there are often other methods that don't have any preconditions, such as calling clear() or resize(0). In fact it is allowed to call other operations, and they should behave consistently; it's just not guaranteed what exact value the object will have (e.g. if size() > 0, then .at(0) should not throw, and a second call to it should return the same value as the first).
You're right. But I've given up "language lawyering" a long time ago: I have a set of "rules of thumb" (like the one I just wrote) that exclude some technically valid programs, (but so be it) and that let me write robust code quickly without spending time on digging through the documentation for special cases.
Even if I did know every single edge case in the language and library, the developers next to me might not. Then they decide to emulate me (many learn by example) and catastrophe ensues.
Not entirely true. Moving from a std::vector is required not to invalidate iterators and references to its elements. This is why strings can be small-buffer optimized and vectors can't be.
This does not conflict with what I or the OP wrote; you're mixing up the vector and its contents. Moving from a vector can invalidate `data()`, yet references and iterators to its previous contents remain valid.
You're misinterpreting the parent. They mean that they thought data() of the target after the move would be the same as data() of the source before the move. This is guaranteed by vector, for example.
In situations when there might be a lot of _empty_ strings, the 'capacity' and 'size' can be taken out of the fixed part and bundled with the data instead.
This will bring sizeof(string) to a size of a single pointer and will still allow for short strings of 7 chars (in 64-bit builds).
Memory allocations are usually aligned by default, so the pointers will have at least one lower bit cleared and available to be used as a mode flag.
If in doubt, allocating through aligned_alloc(2,...) will guarantee an unused bit in a pointer.
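A rough sketch of that single-pointer layout (hypothetical, not a real library; the encoding of the in-situ length is one possible choice, and the union read is the same technically-UB punning discussed elsewhere in the thread):

    #include <cstddef>
    #include <cstdint>

    struct PtrString {
        union {
            char* heap;  // low bit 0 (2-byte aligned) -> [size][capacity][chars...]
            char  in_situ[sizeof(char*)];  // byte 0: (length << 1) | 1, then 7 chars
        };

        bool is_small() const { return in_situ[0] & 1; }

        std::size_t size() const {
            return is_small()
                ? static_cast<std::uint8_t>(in_situ[0]) >> 1
                : *reinterpret_cast<const std::size_t*>(heap);
        }
    };
    static_assert(sizeof(PtrString) == sizeof(char*), "one pointer wide");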
Yes, sometimes it is worth optimizing for size, especially if the string is not actually used in any fast path and just needs to be there.
I'll mention though that std::string (well, basic_string) takes an allocator parameter, so it could only enable this optimization for 'well known' allocators that provide aligned buffers.
You could say that while most string are small, vectors are all kind of sizes and the T size itself can be large, so the optimization is less of a win in for a general purpose container.
But it is mostly for historical reasons. I believe the original STL used SSO [1], so there was never any assumption about the stability of references to string elements, while there is a lot of code that assumes that references to vector elements do not change.
[1] The SGI STL, directly derived from the original HP STL, had extensive rationale on why it didn't implement COW; libstdc++, which I believe also traces its roots to it, decided to instead do COW. The rest is history.
No. You can’t make assumptions about how global operator new works. Allocators like tcmalloc are allowed to override it. And string can be instantiated with any stl allocator, so it might never call ::new anyway.