Std::string half of all allocations in the Chrome browser process (groups.google.com)
170 points by mlrtime on Dec 5, 2014 | 166 comments



The problem with std::string is that it's named wrong. It should be called std::string_buffer, because that is what it is. Its performance characteristics are closer to those of a std::vector than a std::array (the latter available since C++11).

Many projects cannot copy around std::vector<char> in good conscience. They really want a copy-on-write string, an immutable string, a rope, a reference-counted string, or an always-in-place string. Or some combination of the above depending on the circumstance.

The problem is that std::string is not a good type to use as a parameter for various reasons. In addition to its aggressive allocation behavior, it's also fairly inflexible. What are the alternatives?

1. boost::string_ref - available now, so use it

2. std::string_view - available starting in C++14 and works roughly like boost::string_ref

3. pass around pairs of iterators instead of single objects

3) is actually the most flexible, though it requires different kinds of overhead. The most obvious way would be to template all your string-accepting functions on two parameters: the type of your begin iterator and the type of your end iterator. But the benefit is that you can pass around any of the above to your heart's content, plus more, like elements in tries.
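
A minimal sketch of what option 3 can look like (the function name and usage are made up for illustration):

  #include <cstddef>
  #include <string>

  // Templated on both iterator types, so callers can pass std::string
  // iterators, raw pointers into a char array, trie iterators, and so on,
  // without any conversion or allocation.
  template <typename BeginIt, typename EndIt>
  std::size_t count_spaces(BeginIt begin, EndIt end) {
    std::size_t n = 0;
    for (auto it = begin; it != end; ++it)
      if (*it == ' ')
        ++n;
    return n;
  }

  // std::string s = "a b c";   count_spaces(s.begin(), s.end());
  // const char buf[] = "x y";  count_spaces(buf, buf + sizeof(buf) - 1);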

std::string still has an important place, but it should generally be used as a private member variable, not as something you require in your interface. Pretty much the same thing goes for char* unless you are implementing a C ABI (plus a size, please). Even then, you can immediately convert to/from a boost::string_ref and still have yourself a self-contained reference to a bounded character sequence.


> The problem is that std::string is not a good type to use as a parameter for various reasons. In addition to its aggressive allocation behavior, it's also fairly inflexible. What are the alternatives?

I think a more specific critique is that std::string is not a good type to use as a copy parameter (i.e. a non-reference, non-rvalue parameter). It's perfectly acceptable in scenarios where you're passing it as a const reference (i.e. not transferring ownership) or passing it as an rvalue where it can be moved in-place.
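
For illustration, hypothetical signatures showing the acceptable cases:

  #include <string>
  #include <utility>

  void borrow(const std::string& s) { /* read-only: no copy, no ownership change */ }

  struct Widget {
    std::string name_;
    // By-value plus move: callers with an rvalue pay one move, not a copy;
    // callers with an lvalue pay exactly one copy.
    explicit Widget(std::string name) : name_(std::move(name)) {}
  };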

Your critique (and suggestion to use string_ref) seems to be oriented more towards use cases where a mutated copy of a string (be it a substring or some other transformation) needs to be passed around... I don't think it's fair to say "it's not good as a parameter."


No, it's bigger than just copy parameters. std::string, even as a const reference, only works well in these cases:

1. your users already have a std::string

2. you want to support string literals

3. you'll be converting to string in the implementation anyway

What if I have a vector<const char>? What if I have a type called SanitizedString or OracleString? What about character arrays retrieved from C ABI calls?

And even if 3. is true, you are encoding implementation details in your interface. If you need to optimize your type later to use a trie, you'll have to do extra allocations and copies after all.

My point is that interfaces should require what they need: a character sequence if they need a character sequence and a character buffer if they need a character buffer. Hence my comment about std::string being more appropriately named std::string_buffer.

Even string_ref isn't perfect because it assumes your sequence is in contiguous memory. I'm hoping developments in the C++ language and standard library come up with a good answer to that problem, perhaps one that leverages the concepts proposals. But even a concepts-based solution will have drawbacks, such as requiring even more code to go in your header file.


std::string_view didn't make it into C++14.


Ah, I guess I'm behind in the news then. I thought it was in an experimental header. I guess I hope it's in C++17 then, or else we may need to root for a good ranges proposal that covers the same ground.


Or proper garbage collection so you don't need to worry about who owns the string and you don't have the big overhead of ref-counting.

Why on earth are they using C++ for a web browser anyway? That's about the worst possible choice of programming language for that problem domain.


Because they forked it from WebKit which is derived from KHTML, which has a history that stretches back to 1998. Better to ask which languages would have been a better fit for a web browser in the context of the late nineties that have had the staying power of C++ (I'm guessing you'll come up with a rather short list).

If you want to see an attempt to write a new browser engine in a language that is memory-safe by default, I suggest you check out Mozilla's Servo engine: https://github.com/servo/servo


This is awesome! Tons of respect for Mozilla for supporting this kind of research.


All 5 major web browsers are written in C++.

EDIT: Chrome, Firefox, Safari, IE, Opera


Sure that Safari is C++ and not Objective-C?


Well WebKit is C++, but the UI is probably Obj-C.


So which web browsers are not written in C++?


Almost all of Windows (including apps that predominantly do string and tree manipulation) is written in C++. Almost all of Linux userspace is written in C. These are still terrible, insecure, fragile languages which cause frustration and loss to programmers and users every day.


Shame you're downvoted, because you're pretty much right. Don't get me wrong, correct code can very well be written in C/C++. It's just that 95% of the programmers using those languages are not very skilled. The programs tend to be wrong and fragile. In C++, most of this fragility derives from its C roots.

C/C++ is also a poor match for today's CPUs. It's very slow compared to what the hardware is capable of. Compare for example with Intel ispc (https://ispc.github.io/). It gives some idea of how much performance we're currently missing.

When C was young, memory latency was typically 1 or 2 clock cycles. There was no pipelining, at most a simple state machine that would finish in a few clock cycles. A branch didn't cost much. A few cycles at most. Neither did a pointer reference. Random access was almost as fast as sequential access.

Today's CPUs have memory latency of 150-300 clock cycles. A modern CPU core can typically retire 1-4 instructions per clock cycle. A single instruction typically takes about 16 clock cycles from decode to retire. So CPUs often have to execute blind and just guess where the execution flow will go. Branches modify this flow. CPUs simply guess the flow: branch prediction. When they're wrong, they just have to invalidate the currently executing instructions and start again. Branches are something to avoid, especially unpredictable ones. Function pointers are branches. Even though they can be predicted, they often just fall out of the branch predictor cache.

We need something that minimizes the operations modern CPUs are bad at. C/C++ is very branchy and often uses slow function pointers (vtables, switch jump tables, etc).

The problem is, there's more variation among CPUs than ever before, even within the same instruction set architecture. For x86, not only are the costs of instructions wildly different, but the instruction set support is fragmented.

C/C++ is not the last word. Unsafe and way too slow compared to what current hardware is capable of. The problem is, the language that is safer and a good match just does not exist yet. Some safety can be sacrificed for greater speed, but that should be a situational choice by the programmer, not the only way or even the default.

C/C++ is what I do at my day job.


Neither of you is really answering the question. You still haven't proposed a language that is better.

When the half dozen browsers that represent about 99.9% of the browser market all use C++, perhaps the explanation is not "They're all wrong."


> When the half dozen browsers that represent about 99.9% of the browser market all use C++, perhaps the explanation is not "They're all wrong."

I didn't say they're wrong by any means! C++ is the horse and cart. Performance-wise, the programming language automobile doesn't exist yet. Of course one should and must use what is currently available!

The best language for writing a web browser does not exist yet. Or for writing other similarly complicated applications with performance requirements.

So, yes, C++ is probably the best language for writing a web browser right now. The cleanest dirty shirt. Certainly not the safest, but out of the high level languages it has the best performance. It has a large enough workforce, which is important if you're building a large project. C/C++ interfaces with current operating systems natively. And a lot of other considerations make it a sensible choice.

But C/C++ is rather far from what a language could be. It's very bug prone. Overcomplicated. Hard to maintain.

And it is slow. Not compared to other languages, but to CPU performance potential. I can write very fast software in C++ -- if I effectively fall back to assembler or at least compiler intrinsics. In other words, by being a human compiler.


Around 20 years ago we had quite a few systems programming languages to choose from.

The spread of UNIX into the industry killed the other languages, because they weren't the UNIX systems programming language and other system vendors weren't able to fight against UNIX-based workstations.

So like JavaScript in the browser, C eventually killed the alternatives in the workstation market.

C++ was able to gain industry acceptance because C++ compiler vendors were actually C compiler vendors bundling C++ in their products, since C++ came from AT&T as well.

Additionally, the computers started to be fast enough for mainstream VM based systems.

The business application developers moved along to VM based languages, while the system programmers focused on the languages that were supported by larger OS vendors, C and C++.

All the compiler vendors selling systems programming language compilers that didn't enjoy first class treatment on a mainstream OS either closed down or changed business.

So 20 years later, C and C++ became the only alternatives for systems programming, unless one wants to play with dead languages.

With Ada being used in projects that required it, or for software scenarios where human lives are at risk, like aviation, train control systems, medical devices and so on.

And now we are getting beaten every day by security exploits due to dangling pointers and out-of-bounds errors.


C++ bugs are sometimes - too often - faster to find even from disassembly than from source code. Seriously. Yes, it's a very slow approach. But you can see what's really going on, and the picture is shockingly different from the source code point of view. While disassembly is verbose, there's no syntactic sugar that can obscure true function. C++ features, like operator overloading and exceptions, locally hide actual functionality. I've seen another language that has a lot of different ways to do the same thing and enabled write-only development: Perl.

No, disassembly listing is not what I typically use for debugging. Just one of the tools. The point is, something is wrong when disassembly can be easier to understand than source code!

So depressing when you need to deal with it often. Our best tool is not very good.

Well, all this puts bread on my table, so I guess I shouldn't complain too much...


Well, the problem with C++ is that it's way too easy to take its features too far. I've seen commercial C++ libraries that overloaded operators to an absurd degree. I don't recall the name at the moment, but it was a database library.

Overloading operators is a great feature, but should be used as sparingly as possible, and only where it makes intuitive sense.

Templates offer even more rope to hang yourself, and even though I'm a late convert to using templates, I would never suggest the feature goes too far. It's kludge fuel, no doubt, but the language needs the feature, and it allows you to do things that are truly useful and elegant.

It never occurred to me that viewing the disassembly could be useful for debugging, and I would imagine it's not normally useful unless you are really doing some pretty sophisticated stuff with the language. It would be extremely educational to understand what the compiler actually does, however. But if inspecting the disassembly is instructive on how best to use the language, then I would suspect the language design, or at least the compiler, is doing something wrong.

I do think C++ requires a little too much consideration of how certain operations are implemented, e.g., this very discussion, but it's a price I'm willing to pay for a language that lets me do things any way I want to do them.

I program in Python in my spare time and for small scripting tasks, and can't imagine choosing C++ over Python for any personal project I've done in the past couple years (which tend to be small anyway), but I use C++ at work and am very happy to continue using it after 20 years on and off (mostly on).


>> It's just that 95% of the programmers using those languages are not very skilled.

High level languages just make it easier for crappy programmers to approximate a working solution.


This is true for C++ and doubly true for Java.

In 20 years, I don't think I've ever seen a commercial third-party C++ library that I thought was competently designed and implemented. I was about to offer the exception of mysql++, but then I remembered it's an open-source library so it wouldn't count as commercial. C++ seems to breed a lot of crappy but marginally useful code.

I haven't done much Java, and while I don't think it's a bad language, I've never seen a Java development environment that I would even wish on my enemies. Java seems to breed bloated, over-engineered but under-flexible designs that require the programmer to do way too much of the kind of stuff that we invented computers to do for us in the first place.


What languages are more appropriate for today's hardware? The majority of my experience in C (Linux) is on embedded hardware (and quite a few years ago). I've never run into wasted cycles before, but that setup is really different from a modern CPU on a desktop.


> Almost all of Linux userspace is written in C.

Depends on what you count. Qt can be seen as (part of the) Linux userspace. KDE applications are usually written in C++ (partially in QML+JavaScript), Gtk applications are written in C or Vala (a few in C#). If you don't count DE frameworks, basically the only thing left would be libc, which of course is C (but available in some form under every noteworthy OS).


What's fragile, insecure, and terrible about them? I've written C/C++ (among other languages) for years at work and it's quite rare that I see a bug that could be attributed to the language.


You rarely see uninitialized pointers or off-by-one array bounds errors? I wish that was true for my colleagues.


Late reply - sorry that I missed your response.

I've found that RAII classes like std::unique_ptr have really helped. Errors like an uninitialized pointer are often caused by poor ownership semantics in the program design, and I don't think you escape that in reference counted languages. You'll still have a messy class design without clear ownership, and there are types of resources other than memory to manage (file handles, locks, sockets).
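
A tiny, hypothetical illustration of the point:

  #include <memory>
  #include <utility>

  struct Connection { /* a socket, file handle, etc. */ };

  class Session {
    std::unique_ptr<Connection> conn_;  // Session is the sole owner
   public:
    explicit Session(std::unique_ptr<Connection> c) : conn_(std::move(c)) {}
    // No uninitialized pointer and no manual delete: ownership transfer is
    // explicit at the call site, and the destructor cleans up automatically.
  };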

I'm not saying that your colleagues are bad engineers, I've had many troubles with ownership design over the years and still do at times. I just don't think C++ is unique in this regard even if it does force you to think about it more regularly since memory interaction is so common.


what experience do you have with writing c++? what is the last large project you completed with it? i'm not a huge fan of c++, but it has been the foundation of an extremely large number of wildly successful and pervasive projects.


I have in the past written many large C++ projects. I now try to avoid it as much as possible, but the last time I programmed in C++ was just two weeks ago[1], and it reminded me how bad it was -- terrible error messages, and abysmal performance of the final program. At some point I will have to replace it because the program doesn't meet our performance requirements, and no one can tell me why[2]. (Edit: Apparently [2] has been answered now)

[1] http://git.annexia.org/?p=virt-bmap.git;a=tree

[2] https://stackoverflow.com/questions/27152834/how-to-improve-...


Of all the things you should expect from a non-C++ language, performance is highly unlikely to be on that list.

If you can't get something to be performant with C++, it's pretty unlikely you'll get it to be performant with most other languages.

(Unless you want to try it in Fortran)


I expect C programs to outperform C++ programs, simply by virtue of being more explicit (about everything, from allocation to levels of indirection).


Not always faster.

Try comparing qsort to std::sort...

Hint: std::sort is a lot faster. It can inline the comparison function. OK, it just shows how slow function pointers are. I'm assuming compilers still can't optimize the function pointer away in qsort.

Of course a specialized sorter in C would be faster, one that can inline the comparison function. C++ copy-pasting automation... err, I mean templates, are useful because they can often inline your function and data type into the algorithm without using function pointers, at the cost of generating the same or similar algorithm code multiple times, bloating the binary.
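
A rough sketch of the two call styles (not a benchmark; names are made up):

  #include <algorithm>
  #include <cstddef>
  #include <cstdlib>

  // qsort calls the comparator through a function pointer on every
  // comparison; the compiler generally can't inline across that call.
  int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
  }

  void sort_both_ways(int* data, std::size_t n) {
    std::qsort(data, n, sizeof(int), cmp_int);  // indirect call per comparison
    std::sort(data, data + n);                  // operator< is inlined into the
                                                // instantiated template code
  }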


Clang's error messages are much better than GCC's - it will even spell-correct your typos and suggest which symbol you actually wanted to type. Around 2010 Google switched to a hybrid Clang + GCC environment (the same source code would be piped to both compilers, the error messages from Clang would be piped back to the user while the object files from GCC were linked to form the final binary), and C++ developer happiness skyrocketed.


To be honest, I think gcc has pretty much caught up with clang with gcc-4.9 when it comes to error message quality.

I have a slight preference for the way clang chooses to highlight the entire AST parent node that it thinks is responsible for the error, but the approach g++ takes is perfectly clear.


It has been answered, and according to the answer, there's an easy 20x-30x speedup to be had. Maybe your problem is not C++ per se.

I wouldn't be surprised if std::unordered_set would yield further speedup. (It doesn't look like you need the order criterion). Other things worth trying are:

* Not using a set at all - it seems most sets only have 3-4 elements. You might as well use a vector.

* It doesn't look like your intervals overlap, and it looks like they're already ordered. interval_map may be unnecessary, and it might be worth keeping the data in a simple vector.

* You don't want to copy object_set, either. A const-ref is more than enough.

* And finally, if you want to veer into nit-pick territory, you want to use range-based loops. It might or might not yield better compiled code. It's definitely much more readable.

  // equal_range returns a pair of iterators; iterate between them:
  auto range = map->equal_range(window);
  for (auto iter = range.first; iter != range.second; ++iter)
    for (const auto& obj : iter->second)
      f(iter->first->start, iter->first->end, obj.c_str(), opaque);
I don't disagree with the abysmal error messages, or the thought that C++ is a beast to learn and control. But if there's one thing it does give you, it's decent performance, if used properly.

Let's put it that way: The browser people don't write C++ because they think it's such an awesome language :)


In #2, I'm curious why you didn't like sehe's answer?


C: Lynx, w3m, Mothra

Java: Lobo, gngr (https://gngr.info/)

Limbo/Inferno: Charon

Lisp: Closure, eww


Sharing mutable buffers? Nope, you would still need to switch to immutable strings or copy the buffers.


By the way, reference counting may not have to be as bad as it is. Looks like there's a lot of opportunity for improvement. See here: http://users.cecs.anu.edu.au/~steveb/downloads/pdf/rc-ismm-2...


How do you "share" objects that multiple pieces of code want to own? You need to copy if you are passing around a mutable string.


If you dig deeper and actually look at the source diffs, you will see that this is not about std::string being "bad" (it's not), but it's about problems with how they're using std::string. Most of the trouble was constantly converting back and forth between std::string and const char*, which needlessly produces temporary allocations. Simply moving to passing everything around as const references should help enormously with memory churn.
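
The churn pattern looks something like this (hypothetical functions, not the actual Chrome code):

  #include <string>

  void takes_string(const std::string& s) { /* ... */ }
  const char* current_url() { return "https://example.com"; }  // e.g. a C-style API

  void churn() {
    takes_string(current_url());  // constructs and destroys a temporary std::string
    takes_string("literal");      // same hidden allocation if it outgrows any
                                  // small-string buffer
  }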

EDIT: Spelling :)


I beat the subject of data representation completely to death in my article "Pointers, References and Values" at:

http://www.warplife.com/tips/code/c++/memory-management/para...

One of the worst performance problems in C++ is quite commonly the creation of invisible temporaries. Sometimes they are necessary but commonly they are not.

There are specific cases where it's better to pass a value rather than a reference. For example, if you have an integer in a register, and your parameter passing convention calls for passing parameters in registers, then passing that integer as a value, rather than a reference (const or otherwise), avoids the use of memory; it also avoids the use of the memory cache, which is likely to be a more serious problem.

In general, for a C++ program to make lots of allocations of just one class isn't such a bad thing. If your default implementation is slower than you like, you might be able to speed things up considerably, as well as save memory, reduce disk paging and so on by using a custom memory allocator just for that class.
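
A bare-bones sketch of the per-class allocator idea, using a free list to recycle blocks (names are made up; thread safety and error handling omitted):

  #include <cstddef>
  #include <new>

  class Node {
   public:
    static void* operator new(std::size_t size) {
      if (free_list_) {                        // recycle a previously freed block
        void* p = free_list_;
        free_list_ = *static_cast<void**>(p);  // next-pointer lives inside the block
        return p;
      }
      return ::operator new(size);             // otherwise hit the global heap
    }
    static void operator delete(void* p) noexcept {
      *static_cast<void**>(p) = free_list_;    // push the block onto the free list
      free_list_ = p;
    }
   private:
    char payload_[32];                         // must be at least sizeof(void*)
    static void* free_list_;
  };
  void* Node::free_list_ = nullptr;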


folly::fbstring, a drop-in replacement for std::string, is part of the folly library that we (Facebook) open sourced a while back. It allocates small strings in-line and larger strings on the heap and has optimizations for medium and large strings, too. It's proven quite effective for us, particularly when used with jemalloc, which it conspires with for more optimal memory management. We use it as std::string for our C++ applications and libraries, completely replacing std::string both for our own code and third-party code.

https://github.com/facebook/folly/blob/master/folly/docs/FBS...

In addition, it is worth noting folly::StringPiece (from folly/Range.h), which is generally a better interface for working with in-memory ranges of bytes. Hardly a new idea (it's inspired by similar libraries, such as in re2), but it permeates the APIs of many of our larger C++ systems, and folly itself, and often avoids passing std::string objects around at all.

Finally, there is also folly::fbvector, which offers similar improvements over std::vector.


Yeah, and if I want to use it I need to replace std::string all around my code.

I can't use it for other libraries, unless I replace and recompile them.

And tomorrow a new library comes I need to replace everything again.

Your effort is commendable, and I know squeezing out gains is important in the case of FB, but in the end it's just locking yourself into a library for something that should've been a second thought / built in.


1. Find / replace, I'm sure the API is simple enough and performance gains are worth the effort.

3. You don't need to replace anything, ever - only if you feel like your string performance is lacking, and that particular new library satisfies your needs (in terms of effort vs performance gains). I don't get why people believe that whenever something new comes out they have to switch over, and as a result are terrified of innovation.

4. I disagree with the "should've been built in" statement; the default std::string implementation is good enough for most (like any std:: thing). Besides, std::string was designed 25+ years ago, over a dozen CPU generations ago; the demands of today are much different.


I agree with most points, but

"std:string was designed 25+ years ago, over a dozen CPU generations ago; the demands of today are must different."

Sure, but you don't need to break the API for that

You can replace the allocator in a C program without changing malloc/free to something else, this might have been a design goal. This way you can go back and forth to it, run your tests again, compare performance again, etc


Not necessarily - you can find the system headers on your machine and then do 'cp folly/FBString.h /usr/include/c++/4.9.1/string', and then every time you do #include <string> it will pick up Folly's FBString. If a new version comes out, just drop it there. If you're a large corporation, just modify your build system to use your custom standard headers.

It sounds like this is what Facebook does - no Facebook code actually references FBString, it just uses std::string and Facebook's implementation of std::string is the Folly version. Google does something similar, where they replaced std::string with their own more efficient version, and everybody just uses it like a normal string.


So what are you saying exactly? Stick with std::string until the stdlib devs make it better?

I don't think find/replace and including this library until something better comes along is an insurmountable problem.


If only they all implemented something like java.lang.CharSequence...

"A CharSequence is a readable sequence of char values. This interface provides uniform, read-only access to many different kinds of char sequences. "


FWIW, the email thread does say that some "base::StringPiece" should be used more often.



Does folly "just build" on current Ubuntu / Debian boxes now? When it first came out I had enormous trouble getting it to build at all.


It hardly seems like introducing yet another type can solve this problem which is caused by temporary conversion back and forth between string and char*.


25000 (!!) allocations are made for every keystroke in the Omnibox.

The Omnibox is no doubt far more complex than simple text box since entering characters into it can invoke things like network connections (for search suggestions), but 25k allocs is still a bit on the excessive side.

Strings are an interesting case in that in general they are of indeterminate (and variable) length, which makes them somewhat difficult to accommodate in computer memory, which is finite and allocated in fixed-length pieces. Abstractions like std::string have been created to make it simpler and easier to perform operations like appending, resizing, copying, and concatenating, but I think this is part of the problem: by making these operations so easy and simple for the programmer, they're more inclined to overuse them instead of asking questions like "do I really need to create a copy just to modify one character? do I really need to append to this string? how long can it be?"

Essentially, the abstraction encourages ignorance of the real nature of the actual operations, leading to more inefficient code. It only helps the programmer to perform these tedious operations more easily, and doesn't help at all with the decision of whether such tedious operations should be needed at all, which I think is more important; the first question when designing shouldn't be "what abstractions should I use to do X?", but "do I really need to do X, or is there a simpler way that doesn't need to?" The most efficient way to do something is to not do it at all.

Contrast this with a language like C, in which string operations are (unless the programmer writes or uses a library) far more explicit, and the programmer can be more aware of what his/her code is actually doing. That's why I believe every programmer who has to deal with strings should at one point have been exposed to implementing a resizable string buffer and/or length-delimited string library, to see the real nature of the problem (including how to do length management correctly). Without this basic, low-level understanding of how to use memory, the advantages of all the other fancy string abstractions won't make much sense either.
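
For reference, the exercise described above is roughly this (a deliberately minimal sketch; error handling omitted):

  #include <cstddef>
  #include <cstdlib>
  #include <cstring>

  // A growable, length-delimited string buffer. Doubling the capacity keeps
  // appends amortized O(1); the explicit realloc/memcpy is the whole point
  // of the exercise.
  struct StrBuf {
    char* data = nullptr;
    std::size_t len = 0, cap = 0;

    void append(const char* s, std::size_t n) {
      if (len + n + 1 > cap) {                  // +1 for the terminating NUL
        std::size_t new_cap = cap ? cap * 2 : 16;
        while (new_cap < len + n + 1) new_cap *= 2;
        data = static_cast<char*>(std::realloc(data, new_cap));
        cap = new_cap;
      }
      std::memcpy(data + len, s, n);
      len += n;
      data[len] = '\0';
    }
    ~StrBuf() { std::free(data); }
  };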


> Contrast this with a language like C, in which string operations are (unless the programmer writes or uses a library) far more explicit

String operations are more explicit. Mostly.

But some things (like sharing between threads, type conversion, and memory ownership) are very implicit and unsafe. Some of the implicitness is so familiar to C programmers that they don't notice it:

  /* 1. This function returns an error code
   *    and not a size
   * 2. zero is success, nonzero is an error */
  int
  /* 3. this method belongs in mylib
   * 4. this function supports ASCII???
   * 5. this function allows memory to overlap? */
  mylib_copy_first_n(
      /*  6. *out_s is allocated
       *  7. *out_s is greater than n bytes in size
       *  8. out_s will hold a null terminated string */
      char const * out_s,
      /*  9. in_s is allocated
       * 10. *in_s holds at least n characters
       * 11. it's not a big deal if in_s has null characters */
      char const * in_s,
      /* 12. n is not negative
       * 13. if n is zero, something sane will happen */
      int n);
I wouldn't hold up C as an example of explicitness. Now, this isn't a great example of C code, but even in the best examples of C code, there is a lot of correctness by convention and documentation.

To be fair, I'm not aware of any languages and libraries that are fully explicit in type, behavior, and memory usage. Maybe Ada or one of the functional languages get it right. But there are a set of C++ tools that can make this sort of thing fast and correct. That's why it's disappointing that std::string isn't fast and is only mostly correct.


    mylib: DEFINITIONS =
    BEGIN
      copy_first_n: PROCEDURE [VAR out_s: STRING, in_s: STRING, n: CARDINAL ]
      RETURNS [error: BOOLEAN];
    END
Mesa at Xerox PARC, 1979: one of the influences on strongly typed systems programming languages.

But the 80's startups had to go build workstations based on UNIX.


Nitpick: I don't think the output buffer would be const.


I'm old enough that I actually programmed in Turbo Pascal 3 (around 1985), which of course produced quite fast code even for the processor speeds of the time (4 MHz processors were enough for everybody -- not really, but that's what we had!). That Turbo Pascal had strings of limited size, but they were able to use the stack, not the heap. I still don't understand why the library string in C++ can't use the stack instead of the heap, at least for small strings. Most of the allocations detected in the post are actually short-lived ones, and I'm also quite sure that most of the strings aren't too big, which means that using strings in the Turbo Pascal style (on the stack for local variables) would remove the need for most of the allocations.

I guess that will maybe come in 2020 in the C++ standard, if any of the people who need it read this and work hard. Yay for the march of progress.


C++'s std::string could store the first few characters inline, and only fall back to heap storage if the string is too big. I'd be surprised if some implementations don't do this already.

TurboPascal's strings had a number of quirks. Maximum length was limited to 255 characters, but different string variables could declare different lengths in their types. Writing string functions that could work with any string required using the OpenString type[1], which passed the string by reference instead of by value, with pseudo-const semantics (the parameter couldn't be passed as a var parameter elsewhere). Using the default String type meant allocating 256 bytes for every string variable. You'd want to pass it by reference (with 'var', and later using 'const') to avoid the extra copying. {$P+} in later versions made string-typed parameters openstring by default[2].

[1] http://putka.upm.si/langref/turboPascal/0109.html [2] http://putka.upm.si/langref/turboPascal/047E.html

(I implemented the migration from AnsiString to WideString in the Delphi compiler when I worked for Borland. I know far too much about Pascal and Delphi strings.)



It's called the short string optimization. It was used at one point, but I am not sure if it still is. C++11 move semantics may help a lot with string churning...


As others have mentioned, some libraries do use it. It has tradeoffs though. In a "traditional" implementation sizeof(std::string) == sizeof(char *) -- it keeps a pointer to the first byte of the text with the metadata (minimally: size and capacity) stored before it in memory. c_str() is just a "return p_;" and size() is something like "return reinterpret_cast<const size_t *>(p_)[-1];"

Now to add the short-string optimization you need to do one of two things:

1. Add a small buffer we can point "p_" to inside the std::string itself. However, strings are very common inside of structures, so this will bloat objects throughout your program (a sketch of this option follows below).

2. If you're clever you can use the lowest bit of the pointer to indicate that it's an interned string, which would let you store 6-byte strings inside the pointer itself on a 64-bit machine (remember you still need the '\0' for c_str()'s benefit). However, now c_str(), size(), etc. all need an extra branch instruction. These are normally inlined methods, so now you've bloated/slowed the code instead.
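
For concreteness, option 1 might look roughly like this (a hypothetical SsoString sketch; copy and move operations omitted for brevity):

  #include <cstddef>
  #include <cstring>

  class SsoString {
    char* p_;              // always points at the characters (inline or heap)
    std::size_t size_;
    char small_[16];       // the in-object buffer that bloats every instance

   public:
    explicit SsoString(const char* s) : size_(std::strlen(s)) {
      p_ = (size_ < sizeof(small_)) ? small_        // short: no allocation
                                    : new char[size_ + 1];
      std::memcpy(p_, s, size_ + 1);
    }
    ~SsoString() {
      if (p_ != small_) delete[] p_;
    }
    const char* c_str() const { return p_; }        // still branch-free
    std::size_t size() const { return size_; }
  };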

Personally I prefer a keep-it-simple string implementation for most things.


2. You can actually go up to 7 bytes as long as you're sure you get word-aligned pointers back from malloc (usually a good bet). Make the tag the last byte, indicate a short string by "(tag_byte & 0x07) != 7", and then store the length as "7 - tag", reserving tag 7 for pointers. If it's a 7-byte string, then the tag byte itself will be 0, serving as the null terminator. If it's < 6 bytes, the null terminator gets stored in one of the earlier bytes. If it's a pointer, the pointer will be word-aligned, which on a 64-bit machine means that you can zero out the low-order 3 bits (exactly what you need to store the tag) to extract the actual pointer. If it's a null string (the one case we couldn't represent, since we use 0x07 as the tag for pointers), store it with a null first byte, which also gives compatibility with C.

Whether these gymnastics are worth it is debatable. My intuition is that you lose more in bit-twiddling instructions on common operations than you gain by being able to store an extra byte in short strings, but I'd want to benchmark on real data before implementing.


Remember that on a little-endian machine (i.e. nearly everything now) the "last byte" of the pointer is actually the first byte of the string. The only way you can use the 8th byte of the string as your sentinel is if you can be sure that the allocator won't ever give you a pointer with MSB=0x00.

You are right that on big-endian machines you can smuggle a 7th byte into the pointer though by sharing the "tag" and the '\0' terminator. You don't really have to worry about the 0-byte case since in a traditional implementation there is a shared empty-string sentinel that the default constructor uses. So if you are mutating a short-string and the result is 0 bytes you can always just replace it with a pointer to the shared sentinel.

I agree with your intuition about the costs. I think your program would have to be pretty dominated with tiny strings for all of this optimization to help much. My guess is that it would microbenchmark well. However, all of those extra branches would add pressure to I-Cache and branch predictor history which would offset it in the real world.

folly's fbstring actually has 3 separate regimes (interned tiny strings, classic normal strings, threadsafe-COW large strings) so I guess they decided that the extra branches were worth it for them. I still prefer a simpler design where c_str()/size() don't require any branches though.


C++ libraries can optimize smaller strings and store them on the stack. It's commonly known as short string optimization. I'm not sure how many libraries actually use SSO, but I think the three major vendors use something similar.

https://stackoverflow.com/questions/10315041/meaning-of-acro...

https://stackoverflow.com/questions/21694302/what-are-the-me...


You can specify the allocator to use with basic_string, and stack based allocators are possible. I've certainly used them with STL containers. It's just not the default.
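
A sketch of that technique (names are hypothetical; alignment and the full allocator boilerplate are trimmed to the essentials):

  #include <cstddef>
  #include <new>
  #include <string>

  // A fixed-size arena; lives wherever you declare it, e.g. on the stack.
  template <std::size_t N>
  struct Arena {
    char buf[N];
    std::size_t used = 0;

    void* allocate(std::size_t n) {
      if (used + n <= N) { void* p = buf + used; used += n; return p; }
      return ::operator new(n);               // overflow falls back to the heap
    }
    void deallocate(void* p, std::size_t) {
      if (p < buf || p >= buf + N) ::operator delete(p);
      // arena bytes are reclaimed wholesale when the Arena goes out of scope
    }
  };

  template <class T, std::size_t N>
  struct ArenaAlloc {
    using value_type = T;
    template <class U> struct rebind { using other = ArenaAlloc<U, N>; };

    Arena<N>* arena;
    explicit ArenaAlloc(Arena<N>* a) : arena(a) {}
    template <class U> ArenaAlloc(const ArenaAlloc<U, N>& o) : arena(o.arena) {}

    T* allocate(std::size_t n) { return static_cast<T*>(arena->allocate(n * sizeof(T))); }
    void deallocate(T* p, std::size_t n) { arena->deallocate(p, n * sizeof(T)); }
  };
  template <class T, class U, std::size_t N>
  bool operator==(const ArenaAlloc<T, N>& a, const ArenaAlloc<U, N>& b) { return a.arena == b.arena; }
  template <class T, class U, std::size_t N>
  bool operator!=(const ArenaAlloc<T, N>& a, const ArenaAlloc<U, N>& b) { return !(a == b); }

  using StackString = std::basic_string<char, std::char_traits<char>, ArenaAlloc<char, 64>>;

  // Arena<64> arena;
  // StackString s("built from the arena", ArenaAlloc<char, 64>(&arena));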


The C++ string library cannot bind the string to the stack frame since it doesn't know how long it's going to live.

I suppose the compiler might realize that, however, and replace the dynamic allocation with a static one, but I'm not sure it's allowed to do that.


That is the kind of magical thinking C++ programmers sometimes exhibit that drives me insane.

Yes, compilers have become excellent at optimizing code, but they aren't magical. The generated code will bear some resemblance to the code we write. Eliding dynamic allocations is just not possible statically. Heck, we aren't even able to emit warnings for missing deallocations reliably and have to use run-time instrumentation for it.

Even in C++ or C, even with good compilers, we still have to go the extra mile if we want fast code.


That is the kind of magical thinking C++ programmers sometimes exhibit that drives me insane.

I dunno. I've run into too many programmers on the other end of the spectrum. They're reluctant to take advantage of the genuinely zero-cost abstractions that today's excellent optimizing compilers for C++ enable.

I can't find the reference now, but I thought I had seen some discussions at one of the ISO C++ meetings (from Google?) about potential standards changes to allow allocations to be elided when they're limited to some known scope.


> I dunno. I've run into too many programmers on the other end of the spectrum. They're reluctant to take advantage of the genuinely zero-cost abstractions that today's excellent optimizing compilers for C++ enable.

Performance-wise it might give a false impression of being zero-cost if you micro-benchmark or are very selective about what you count as a cost. Handy in terms of developer effort, sure.

Yes, it's often cheap or even 'zero-cost'. But you can't assume that to be always the case.

In an actual system it might be not so zero-cost when there are multiple concrete instances of generated code for the template. Each one requires actual generated machine code. Larger code size can have a significant performance cost. TLB and L1C cache misses are far from free. Neither is loading the larger executable for short-lived processes. An extra page fault, caused by larger memory requirements, that goes all the way to disk is also rather expensive.

The whole picture is simply way too complicated to make blanket statements like that.


I think the right approach is to get a decent high level understanding of how compilers work. Some surprising optimizations and failures to optimize become a lot clearer. For example, SSA makes it obvious why there's absolutely no penalty for using lots of named temporary register-sized variables.

Then learn your particular compilers' abilities by research and reading your assembly output. If you're depending on an optimization, always read the output. You might be pleasantly surprised as often as you're disappointed.

Understanding the rules of the language is important too. In C++, pointer aliasing and the worthlessness of 'const' are huge obstacles to some seemingly easy optimizations.


You probably mean N3664 [1]. The compiler is indeed allowed to get rid of allocations in new expressions it deems unnecessary.

You can verify that recent versions of clang allocate no memory in the function

    int f() { return *new int(123); }
[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n366...


Devirtualization for example.


Interestingly, with Rust, the borrow checker makes it statically safe to know when you can stack alloc.


You can also do it in a more limited set of cases in other languages using escape analysis. You can't provide it in all cases, but in a subset of cases you can show that the allocation never escapes a certain stack context, and therefore it's safe to stack-allocate. I believe the JVM does that in many cases.


Yeah, I wish .NET would do that. Just calling, say, String.Split is gonna involve at least two allocations. At high performance levels, a single allocation per piece of work can be noticeable. And fixing it is difficult. You can actually manually allocate managed objects on the stack, but that's pretty hacky, unsafe, and doesn't help fix functions you call.


> You can't provide it in all cases,

I'm curious how reliable it is. It seems simple, intuitively, but Java (the JVM?) seems to have a reputation of being sub-par in this respect. Maybe it has to do with knowing how the variables that are sent to other methods are used; if it only has local knowledge (the method that the object is created in), then I guess it has to be pessimistic with regards to all objects that are used as a parameter in method calls in that method.


Depends on which one; the escape analysis in Java 7 HotSpot is quite conservative. The one in Graal is very aggressive: https://wiki.openjdk.java.net/display/Graal/Graal+Partial+Es.... Not sure what other VMs do.


> The C++ string library cannot bind the string to the stack frame since it doesn't know how long it's going to live.

Surely some kind of string that lives on the stack can have its storage on the stack because the storage goes out of scope exactly when the object does. I think the only place you might run into trouble is if you try to `move` a stack-allocated string somewhere, because it'll invariably result in a copy.


Without any compiler magic involved I don't think it's possible to implement that in C++ in a way compatible with the current std::string interface.

The problem is not the allocation of the std::string object itself, it's the allocation of the underlying buffer containing the string. std::string has a static size, the buffer itself is dynamically sized.

You could replace the heap alloc with something like alloca (or dynamically sized arrays) in the constructor, but then the buffer would be tied to the stack frame of the constructor, not the calling scope (so the memory would become invalid as soon as the constructor returns).

Unless I'm missing something completely obvious, the only way to implement that would be to ask the caller to provide the buffer and let the string take ownership of it. It's definitely possible, but it's not how std::string works, and it's arguably more error-prone since, unlike languages like Rust, there's no full-blown lifetime checker in C++, so you'd have to make sure that the string object never outlives the supplied buffer.

Alternatively you might be able to supply a custom allocator to std::strings but here be dragons.


> std::string has a static size, the buffer itself is dynamically sized

Yeah, I guess if you're happy with the stack-allocated buffer being sized at compile-time (either switching or overflowing into heap-allocated storage if it gets too big) this all becomes very straightforward. If you want the stack storage to be runtime-sized (or resizeable) then I agree it's going to be a real pain in the neck. It doesn't fit into the C++ programming model cleanly at all. A pity.


If it's only used for small strings, that copy would be cheap. It reads a small amount of contiguous memory.

Ironically, Google seems to have done something along these lines. See http://stackoverflow.com/questions/354442/looking-for-c-stl-..., which links to https://code.google.com/p/chrome-browser/source/browse/trunk....

Many projects also have specialized string classes for the 'small string' case. See for example http://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallst....


> The C++ string library cannot bind the string to the stack frame since it doesn't know how long it's going to live.

Huh? Of course it could, unless I'm not understanding you correctly.

A class like:

  class smallstr {
    char stack_str[128];  // in-object storage for short strings
    char* heap_ptr;       // used only once the string outgrows stack_str
  };
can easily (with enough logic in the public methods) store small strings in `stack_str` and once it grows larger, allocate some space on the heap, make `heap_ptr` point to it, and copy it over.

"since it doesn't know how long it's going to live" doesn't make any sense to me. A caller would just say:

  {
    smallstr s("foo");
    // do stuff
  } // stack_str[] is wiped clean with the stack, and a destructor would be implemented to `delete` heap_ptr if it was used.
I'm not sure what you're even driving at.


Well, you have to take the context of our discussion into consideration; I meant that you couldn't dynamically elect to allocate your buffer on the stack instead of the heap at runtime in the constructor. Your solution just allocates a 128B buffer on the stack for all strings, which may or may not end up being used.

That makes your string object 128B larger in all situations. Since we were talking about std::string, I'm pretty sure many people wouldn't be happy if sizeof(std::string) suddenly took 128B (g++'s sizeof(std::string) is 8B on my system).

What I had in mind was stuff like alloca and other local scope allocation techniques that would get tied to the constructor stack frame not the caller's.


g++'s sizeof(std::string) is only 8 bytes because it's a single pointer such that the struct can be reinterpret_casted to a const char * and you get the same thing that c_str() gives you. It's reference-counted (across threads!), copy-on-write. I don't think people would mind a 24-byte std::string (on 64-bit systems), which would consist of a pointer, size, and capacity, and which would allow 22-byte strings to be stored inline.

On Windows, sizeof(std::string) is 24 or 32 depending on 32- or 64-bit. sizeof(vector<char>) is 12 and 24, respectively, of course. I believe the extra size of the std::string is for storing strings inline. I'd guess their implementation looks like struct { char *ptr; size_t size; union { size_t capacity; char buf[16]; } u; };, which sacrifices some size in favor of keeping the code paths simple.


I believe it would be possible to add the functionality I describe to the language if those involved wished it. Even if that would mean not calling it std::string, to avoid breaking existing programs.

I don't know of any current compiler that avoids the heap when the object's code actually calls malloc (which is typically what "new" eventually calls).


You could just use std::array<char, 255>


Would I then have all the string functionality that exists so nicely in std::string?


I've worked on a project that used all of these: std::string, QString, OString, char*. Each was required by a different library that we needed.

This is why a good string type should be in core language.


This is the problem with C++ (and in a certain part, with C)

"Everybody" has their own C++ string (MFC, QT, several libraries, heck, even GObject library has their strings)

Same with C

Now, you won't see anybody reimplementing strings in Java, C#, Python, JS, etc. Lessons learned: their strings work.

std::string is certainly a step forward and should be used for most projects today.


"Lessons learned, their strings work"

Except that they don't. Either they are 'complete' but massive and thus slow, or they start as 'array of byte' and then their designers spend 10 years implementing a more 'complete' string type that is still fast enough and end up as #1 anyway.

Of course the C++ way where there is no string type that everyone uses sucks too, it's just that strings are almost impossible to get 'right' because there is no real 'right' and so many special cases that aren't apparent at first sight.


I think this bears trumpeting: strings are hard! It's easy to gloss over their issues via garbage collection and pervasive heap allocation, but once you're in a domain where you care about stack allocation and avoiding copies you start running into difficult tradeoffs (above and beyond even the question of string encoding, which is a different beast altogether).

Speaking as a dynamic language programmer who's trying to break into systems programming, it took a long time for me to fully appreciate the difficulties around strings. And now that I do, I'm perpetually paranoid about how many allocations my Python programs are doing behind the scenes...


If you care about performance use std::performant_but_tricky_string. 99% of code doesn't care about string-related performance, but needs string anyway.


If you care about performance at all, strings are going to slow you down in unexpected and counter-intuitive ways.

Far better if you pass around text data as binary buffers (with metadata describing encoding, please), and only convert those to strings once they are ready to be consumed by the user (which is typically not where performance bottlenecks show up anyways)


If you care about performance a lot, it may well be that most of your critical paths are in numeric code, and strings are only used to read input and write output. So you should just use strings unless profiling shows problems there.


I wonder how cavalierly people treat strings when they feel like "primitive values" in the language - like Java having the (+) operator for concatenation.

I don't agree with the view that "costly" (it's all relative, but anyway) operations should look "costly" (i.e., be a relative eyesore). But I don't doubt that it can affect one's mindset.


Chrome is a web browser, not a real time operating system.


It boggles the mind how many different ways to represent strings there are in C++, and string handling in general is the major reason I'll never touch it (or C) with a 10-foot pole. I'm interested to hear, though, how the String handling in Java is broken. This is everything I have to know:

- String

- CharSequence

- char[] / Character[]

- StringBuilder

Done. Finito. Strings are immutable, the GC will clean up after me. Equality works, Unicode works. Only downside: more memory needed, but nobody except the embedded guys care anymore.


Except that:

- In CJK languages, Unicode codepoints can't always be represented in 2 bytes. When they can't, they take up 2 'char' objects; the first is a surrogate code point.

- In the presence of surrogates, charAt() and length() give wrong answers. Their indexes refer to the number of 'char' objects up through that point, not the number of Unicode codepoints. If there is a surrogate codepoint present anywhere before your index in the string, you will be off.

- To help get around this, the Java APIs added codePointAt and codePointBefore. These are still broken; the indexes are based off of chars, not codepoints.

- To get around this, we have codePointCount and offsetByCodePoint. Finally, these are semantically correct. However, they give up O(1) string indexing.

Did you know all of this? There's more, too, where calling CharSequence.subSequence causes a memory leak because you're pulling out a small portion (say, 10 bytes) of a large backing buffer (say, 1MB), and storing it in a persistent object prevents the GC from collecting the buffer, since a portion of its data is still live. This has caused real memory leaks in Google servers, enough that they had to educate the whole developer base about the pitfalls of it. But I figured I had enough to pick on with Java's broken Unicode handling to illustrate the grandparent's point: strings are hard. If you think you understand them, you're probably not aware of some of the trade-offs between performance, correctness, multilingual support, and developer APIs.


I was aware of the substring issue and the fact that CJK languages require 2 chars, but I might have used codePointAt() instead of offsetByCodePoint(), to be honest. So I agree unicode string indexing is not as easy as it should be, due to a stupid API - but if thats all, I'm not sure that passes for 'hard'. It is also not something inherent to strings in general, the language designers simply messed this up.

Btw., I think they fixed the memory leak in recent VMs by removing that optimization.


The point is that these are trade-offs. The reason for the stupid API is because by getting precise multilingual handling, you give up either O(1) string indexing or representing characters in less than 3 bytes/char. If you naively use offsetByCodePoint on megabyte-long strings, you may find your performance slows to a crawl.

The reason for the memory leak is that you can either have fast substring() or accidental memory leaks when substrings are stored in persistent data structures. Not both. If they removed the optimization, then suddenly str.substring(1) takes O(n) time and on a megabyte string allocates a full copy of the entire megabyte. In some other use-case, the full copy is far worse.

There are others, too. Idiomatic Java uses '+' for string concatenation, but this is O(n^2) time when done repeatedly. (Except that HotSpot optimizes out the repeated allocations when all concatenations appear in the source code, but it still can't do anything about loops.) To get around this performance pitfall, there is StringBuilder. Now Java programmers need to know 2 APIs. Well, more like 6, considering there's also CharSequence, ByteBuffer, StringBuffer, and char[].

C++ has always had the philosophy of "The programmer knows their requirements better than we do". This is why it is complex; because problems are complex, and it is designed to solve many problems. It's very rare for a language designer of a popular language to actually be stupid; it's very common for a language to get used outside of the domains that the language designer optimized for.


Even if you use 3 bytes per char you are not guaranteed O(1) string indexing, as you will still have combining character sequences :( So you need to normalize these first... or deal with them in searching :(

Strings in more modern languages are standardised in one form and widely used in all libraries. C and C++ are the exceptions to this.

For example, I hardly ever see CharSequence implementations in Java code. It is the most underused interface I know of, because for most people String is good enough. In idiomatic Java code the performance issue discussed here does not occur. Strings being final and passed by reference (or more correctly, the reference being copied by value), the same string does not get reallocated all over the place.

The C++ problem is that std::string does not suffice for the common case and we end up with QStrings that are difficult to convert to boost::string.

In Java land, when String does not meet the perf needs, we implement a custom CharSequence, but these are easy to convert to Strings if needed (easier than QString to Boost string).

So we end up with most Java programmers knowing just StringBuilder, String and char[]. ByteBuffer is really about bytes and easy to convert to a CharBuffer.

The Java language has some downsides in standardising on UTF-16 for its common String type (moving on from UCS-2). But even if it had chosen extended ASCII like Oberon, it would at least be consistent everywhere.

I believe that javac replaces + with StringBuilder calls. And in loops the multiple StringBuilder allocs are removed and replaced by appending to one StringBuilder.

The Java Strings are far from perfect in Unicode terms, but a whole lot better than std::string and its 16- and 32-bit friends.


You made some good points, but I'm not sure all of these are necessary trade-offs.

Let's say that the issue of code points not fitting into 16-bit chars requires a separate indexOf() API, to preserve O(1). They could have solved this one with better method naming, or at least improved documentation, so people are aware of the fact. The String#charAt() Javadoc is not really useful, unless you fully understand the implications of: "If the char value specified by the index is a surrogate, the surrogate value is returned." Also, if you are handling CJK strings, is there actually a need to split them by charAt()? The runtime could just tag such strings and use the fastest indexing method, falling back to the slow one for those.

Concatenation: true, and that the JIT can't optimize most loops by using StringBuilder is a little embarrassing. Abstractions are always leaky, but this one I think could be improved, so that more cases would be made faster by the runtime.

Of course language designers aren't usually stupid, but it's not like they all were created in some cozy Languages Workshop, where they could take their time to ripen, being guarded over by benevolent gardeners. See PHP, see JavaScript, and of course Java, too. We don't have to accept accidental complexity as a given; there are useful abstractions, and we should use them. Rust and C# are better than previous attempts, imho, because they got proper funding and are designed by people who know what they're doing. And the whole field of software is better for it!

If you are Google, you are in the unique position of having top talent and huge numbers of servers. In that case, the trade-off to use C++ is probably the right one: people can handle its complexity, and it pays off through increased performance, because the code runs on a million servers. But this simply isn't true for 99% of businesses, and using C or C++ is not the optimal choice for them.


Hehe, ignorance truly is bliss, I guess. Just ask yourself: what size is a char in Java? To unearth 90% of the problems with strings in any language, ask two things: first, what size is a char? And secondly, what is the length of a string? If you cannot talk about these things for an hour, you don't really understand how computers deal with strings.


Actually, no. 90% of the problems with strings in any language is somebody screwing up the character encoding, usually out of ignorance.

The fact that every object in Java has some overhead and probably needs some padding for alignment is utterly irrelevant. But since you asked:

- 8 bytes generic object overhead per String

- 4 bytes for the char[] ref

- 12 bytes for the char[] itself, if non-null, plus probably 4 bytes padding

- 12 bytes for length, offset and hash code int fields (3*4 bytes)

- and 2 bytes per character stored

So I guess 40 bytes for the empty string should be about it. Happy?

My customers' servers usually have 8GB of RAM; 16GB is becoming the norm. Nobody cares anymore. Maybe it's important in your field, I don't know. Not mine, though.

So did I ever need to know this piece of trivia? No.

Did I have to fix someone else's code which was relying on the platform default encoding? Lots of times.

PS: actually in C, you get the usual security nightmares on top of the encoding stuff.


> - and 2 bytes per character stored

No, 2 bytes per code unit. A single UTF-16 character requires one or two code units, so a single "character" (code point in Unicode terminology) is either 2 or 4 bytes in Java. Additionally, a single user-perceived character (grapheme) can require multiple code points.


What many C++ programmers forget is that most of those overheads exist in C++ as well; they are just less visible. String length? Check. Char array header? 8-16 bytes reserved by the allocator, plus the hidden element-count bookkeeping the C++ implementation needs. Array pointer? Another 4 or 8 bytes. I guess the only thing C++ saves is that the string object itself can be stack-allocated; that's just the 8 bytes of object header saved.
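
As a rough illustration (the numbers vary by compiler, standard library and ABI), a sketch you can run:

    #include <iostream>
    #include <string>

    int main() {
        // The string object itself: commonly 24-32 bytes on 64-bit targets
        // (pointer + size + capacity, possibly overlapped by an SSO buffer).
        std::cout << sizeof(std::string) << '\n';
        // Any heap-allocated buffer then adds the allocator's own bookkeeping
        // on top, typically another 8-16 bytes per allocation.
    }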


Very late reply, but to clarify: what I meant was not 'what is the size of a string' but 'what is sizeof(char)'. Meaning, if your char is a fixed size, it either can't represent all characters or is wasteful in 95% of all cases (8, 16 or 32 bits); if it's variable length, the formal complexity goes up for a number of often-used operations.


> My customers' servers usually have 8GB of RAM; 16GB is becoming the norm.

Do cache sizes not matter? Honest question.


That's not so easy to answer. It starts to matter a lot when you get into very high performance architectures (think disruptor, anything requiring lots of mechanical sympathy, highly contended memory access, etc.), but usually you're waiting for the database or the network anyway.

Supposing you meant memory bloat compared to C, increased developer productivity is almost always more important. Things like memory access patterns can be important when you are interested in optimizing hot loops, but not generally.


> Only downside: more memory needed, but nobody except the embedded guys care anymore.

As an Android user, I care. Android needs 2GB of memory to run the apps that iOS only needs 1GB for (running Android on a tablet with 1GB has been painful for me).


Mobile borders on embedded imho, and I agree that Java wasn't the smartest choice there. I'm on iOS, so I don't really know, but didn't Google push for a 512MB RAM minimum with Android 4.0, to better serve the emerging markets?


> Lessons learned, their strings work

It is worse than that: the other alternative systems programming languages that eventually lost to C and C++ also had their own string types.

Only C did not, and C++ followed along due to its compatibility story.


And here is where C shines compared to C++:

At least in C you know that strings are garbage, and you never pretend for a second that everything will be fine. C++ will claim "oh ho ho, we have a string type, don't worry!" and then people will get burned.


If you read the discussion, you would see that people are getting burned by converting all the time between char* and std::string, because Google isn't using the same string types across all APIs.

I'd rather have safe strings, even if every now and then they require some attention.


So in C# (on the .NET VM), when you allocate a string large enough to land on the Large Object Heap (over 85K), when does it get freed?

I wouldn't call that 'working'.


I guess the underlying problem is that C++ doesn't have a standardized ABI. So you get things like QStrings, etc.


It is a standardization problem.

When C++ was introduced, it had only the C standard library, so everyone wrote their own libraries.

C++ was already over 10 years old by the time C++98 became a thing, and compilers still needed to catch up.

You don't rewrite old code just because, so this cruft shows up everywhere.


It's not a matter of ABI. Even source-level compatibility would solve most of the problems. I am happy to recompile as long as I don't need to convert on every API boundary.


The real language problem is that classes are closed -- you can't add your own methods to std::string (other than free operators). As a C++ program grows it becomes more and more tempting to make your own private string class (by inheriting from std::string, encapsulating it, or reimplementing it) that interacts more naturally with your program's other types. I've done this myself. Of course this works great until you need to link to some other C++ library which has its own, different string type.

Honestly, if C++ just allowed classes to be reopened (even if extensions weren't granted access to private members), a lot of the pressure to roll your own string would be diminished.


> Honestly, if C++ just allowed classes to be reopened (even if extensions weren't granted access to private members), a lot of the pressure to roll your own string would be diminished.

But then you'd have the problem of different extensions to the same base class, used by different libraries in the same project, conflicting; so you'd also need a way to scope the extensions to prevent that.


No different than any other symbol name in the global namespace though.


Well, yes, but there's a reason that you generally avoid using the global namespace, and open classes basically make extensions to classes have the same problems that globals have.

(In Ruby, where open classes are a major and frequently leveraged feature, this became a serious issue; hence Ruby 2.x introduced refinements as an alternative that provides much of the utility of open classes with less opportunity for conflicts.)


What's wrong with free functions?


Nothing really, but it means your extensions aren't first class syntactically. A developer has to remember that they write s.size() but not s.count_utf8_codepoints().
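
For illustration, here's what such a free-function "extension" might look like (count_utf8_codepoints is an invented name, and it assumes the string holds valid UTF-8):

    #include <cstddef>
    #include <string>

    // Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
    std::size_t count_utf8_codepoints(const std::string& s) {
        std::size_t n = 0;
        for (unsigned char c : s)
            if ((c & 0xC0) != 0x80) ++n;
        return n;
    }

You call count_utf8_codepoints(s) rather than s.count_utf8_codepoints(), which is exactly the asymmetry being complained about.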

Bjarne Stroustrup actually has an interesting proposal to fix this by unifying free-function and method call syntax in C++17, although the proposal sounds pretty aggressive to me: http://isocpp.org/files/papers/N4174.pdf Would be an interesting change.


Reminds me of the "100 functions" Perlis-ism: http://stackoverflow.com/questions/6016271/why-is-it-better-...


During the 10 years I used C++, I never saw a project that used std::string as its standard string. The implementation of std::string is not standardized: it may or may not have a 'small string optimization', which may not be an optimization at all. Instead of specifying a mundane immutable built-in string like Java's, the C++ standards committee decided to add even more 'advanced' features to an already overloaded language.


Heh, as if Java's String were mundane, or even particularly well standardised. The implementation of String has varied across JVM versions, and Android has a different version from the JDK.

Java's String, depending on the string, its length, the JVM version and a bunch of other factors, may or may not:

- Be interned - so the result of "str1" == "str1" is compiler dependent (you must do "str1".equals("str1")).

- Maintain references to the original string when doing a substring - e.g. "Java is a language full of quirks......".substring(0,4) may or may not hold a reference to the entire string. This has huge performance tradeoffs - e.g. if you parse JSON by doing .substring() and aren't careful, you can end up holding onto the entire unparsed JSON, even if you only keep track of a single object.


Are you seriously suggesting that String incompatibilities are a big issue in Java?

String interning is specified in the JLS: http://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#...

As far as I can tell, the substring "feature" has been there since Java 1.0 and was fixed in Java 7. Hardly groundbreaking stuff.


The interface and behavior of std::string are standardized.

Library writers are free to implement it in any way that honors the standard. However, sometimes the requirements may lead to less than optimal implementations, I believe.

So as long as Chrome uses std::string for internal use only, there shouldn't be any problems. Using them across ABI boundaries, however, is another story.


One of the recurring things I've seen in programming literature over the last 20 years or more is performance hits when using strings. I've seen essays on this in at least four different programming languages.

You'd think it'd be simple, but it's not. A string is an allocated buffer of unknown final size, and since it represents meaning in a human language, and human languages take indeterminate length to convey any one concept, concatenation is extremely common.

This is actually one of those cases where I like C better, warts and all. Whenever you use a string, you should carefully think about what you're going to do with it, and if at all possible allocate all you need up front. Beats the heck out of a GC pause somewhere later when you weren't expecting it.
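
The same discipline carries over to std::string via reserve(); a minimal sketch, assuming you can compute the final size up front:

    #include <cstddef>
    #include <string>
    #include <vector>

    std::string join(const std::vector<std::string>& parts) {
        std::size_t total = 0;
        for (const auto& p : parts) total += p.size();
        std::string out;
        out.reserve(total);  // one allocation up front instead of repeated regrowth
        for (const auto& p : parts) out += p;
        return out;
    }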


> Beats the heck out of a GC pause somewhere later when you weren't expecting it.

How is that an issue with C++ `std::string`?


I sure as hell believe it. I'm on an older, slightly creaky MacBook Pro, and typing in Chrome's omnibox is frequently a nightmare if I'm running something like a VM. Keystrokes lag tremendously.


Same situation here... I have to clear my browser history every couple of months in order for it to perform adequately.


Yeah, same here. I have to clear my history every few weeks or so, else it lags badly.


It's always driven me nuts that the base string in the STL is a fully mutable container that must manage its own memory. I much prefer a string being either immutable, or an interface that you can implement however you see fit: a view over a raw buffer, a ref to a globally shared string instance that may or may not be ref counted, or, like the STL's, a version that manages its own memory.

Besides boost::string_ref, there is also a more generic flyweight implementation requiring only an equality concept for its template parameter:

http://www.boost.org/doc/libs/1_56_0/libs/flyweight/doc/tuto...
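
Usage is roughly like this (a sketch; equal values share one immutable stored instance, so copies are cheap handle copies rather than buffer copies):

    #include <boost/flyweight.hpp>
    #include <string>

    int main() {
        boost::flyweight<std::string> a("hello"), b("hello");
        // a and b refer to the same shared "hello" in the flyweight factory.
        const std::string& s = a.get();   // read-only access to the shared value
        (void)s; (void)b;
    }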


It all makes sense now. For some reason on my gaming PC which is pretty spec'd out in almost every way, Chrome will lag when typing into the Omnibox and requires me to close it completely and reopen it frequently. For a while I thought perhaps I had an issue with my CPU getting too hot, a bad plugin or faulty RAM, but this exact issue appears to be the cause of all of my problems. The only thing that seems to fix the issue temporarily is clearing out my browsing history every few weeks, otherwise the issue gets to the point where you can wait a couple of seconds for a word you have typed to appear.

Don't get me started on the performance of using Chrome inside of a VM, that is a whole other world of hurt right there.


On my Asus Transformer Book T100, which I use for all my personal stuff, I never have any issue with lag. In some rare cases it can lag, but it's temporary and then it's fast again.

If your history is pretty big and it's all stored on a slow HDD, I can see how your IO could be the bottleneck.


I am a front-end developer, so my history grows a lot every single day. After a few weeks, my history is an accumulation of documentation links, testing links, StackOverflow posts and more. I actually did some testing a little while ago and could see the bottleneck taking place: my CPU usage would spike to 100% whenever I typed something into the Omnibox with a few weeks' worth of history. I spotted this on a clean install of Chrome: I signed into my Google account but disabled all plugins, and it still occurred. Weirdly enough, not everyone seems to experience this issue.


If you feel comfortable sharing your history, filing a bug on crbug.com with this info might help out the Chromium devs. (Don't share it right away - ask to have the bug closed to public view before you share)


>this exact issue appears to be the cause of all of my problems

You can't know that from the bug report alone just because it says Omnibox in it. Did you measure your own mallocs pre/post patch?


I'm not sure I'd even do stuff like this in prototypes. My experience has been that using an arena/pool allocator can speed up a program with inner-loop allocations by about 7x. I think the average PC can do about 10,000,000 heap allocations per second, but as far as I know heap allocation also causes some degree of thread locking.

Don't many std::string implementations have small string optimizations? This is actually the first time I have ever heard of C++ strings being the bottleneck of an application (and it seems that is still up for debate here).
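
For reference, a quick way to check the (implementation-defined) inline capacity of your own standard library:

    #include <iostream>
    #include <string>

    int main() {
        std::string s;
        // With SSO, strings up to this capacity live inside the object and
        // never touch the heap. Common values are 15 or 22, but nothing is
        // guaranteed; GNU's libstdc++ of this era uses COW rather than SSO.
        std::cout << "inline capacity: " << s.capacity() << '\n';
    }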


If you're interested in this topic then you should read the whole discussion in the link. There is a lot of talk about what optimizations are in play.


In a certain 3D game engine we use, the devs decided to refactor a lot of old animation code, which used a hashed string type, to use their new all-in-one reference-counted String type instead.

Safe to say this turned out to be a disaster for performance, since every time an animation had to be looked up by name (which we ended up doing a lot), a string had to be allocated on the heap.

Some people sadly underestimate how bad objects which rely on heap allocation can be in performance-critical code.


There are many things in the standard C++ library which are named incorrectly (std::string), named awkwardly (std::unordered_map), or implemented inefficiently from the perspective of modern CPUs (the same std::unordered_map, which uses linked lists for its hash table buckets). See the great CppCon 2014 talk about these issues: https://www.youtube.com/watch?v=fHNmRkzxHWs.


One thing I've learned from PHP's internals is that using reference-counted strings and copying on write is a fantastic idea. You can save an awful lot of memory and allocations, and simplify your code.


Reference counted copy-on-write strings are little landmines just waiting to blow your leg off should you venture into multi-threaded territory.

If you use copies of such a string in multiple different threads, you may find that simply creating a new copy of the original string or any of its subsequent copies can cause an incorrect reference count which will trigger a double-free and a segfault at some later time, possibly long after all non-main threads have ended.

You can't even lock all accesses to a string with a simple mutex. You have to also lock all descendant and ancestor copies, including all temporary copies, as when passing by value to a function. A COW string is basically a pointer to an internal shared data structure. You need to lock accesses to the shared structure, not the pointers to it, and you can only do that internally.

The only way to reliably use such a string with multiple threads is to make the internal reference count and buffer manipulations thread-safe. It may save you some allocations and copies, but the thread safety will slow things down. How much, I don't know. I wouldn't be surprised if thread-safe copy-on-write is slower than copy-always.
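
To make the cost concrete, a minimal sketch of the shared block at the heart of a COW string (names are illustrative, not from any particular implementation):

    #include <atomic>

    struct SharedBuf {
        std::atomic<long> refs{1};   // a plain int here races and double-frees
        // ...character data follows in a real implementation
    };

    void add_ref(SharedBuf* b) {
        b->refs.fetch_add(1, std::memory_order_relaxed);  // paid on every copy
    }

    void release(SharedBuf* b) {
        if (b->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete b;                 // last owner frees the shared buffer
    }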


This is where Rust is worth mentioning: its Rc smart pointer will statically prevent multiple threads from accessing the inner value, while its Arc (atomic reference counting) smart pointer will let you share the value while using atomic operations to adjust the refcount, same as C++'s shared_ptr. Rust's move semantics by-default also mean fewer refcount adjustments overall, since you can transfer ownership instead.

There's also a neat new copy-on-write smart pointer in the stdlib, though I have no experience with it yet: http://doc.rust-lang.org/std/borrow/enum.Cow.html


Yes, COW mutable strings are problematic. It makes more sense to have immutable COW strings, so locking for access isn't a concern. Of course, C++ doesn't have proper immutable data, so a carefully designed interface would be needed.

That's not to say it's a perfect solution.

As far as incorrect reference counts go, that's a quality-of-implementation issue. In a truly multithreaded environment, you'd use locks or atomics to ensure thread safety.

In other words, if you have a nice wrapper around a shared_ptr<vector<char> const>, I don't think most of what you wrote above really applies anymore. But, again, it would be even better if C++ had a proper concept of immutability.
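
A minimal sketch of what such a wrapper could look like (ImmutableString is a made-up name; shared_ptr's refcount is already atomic):

    #include <memory>
    #include <string>
    #include <utility>

    // Copies share one const buffer and only bump an atomic refcount; since
    // nobody can mutate the shared data, no locking of the contents is needed.
    class ImmutableString {
        std::shared_ptr<const std::string> data_;
    public:
        explicit ImmutableString(std::string s)
            : data_(std::make_shared<const std::string>(std::move(s))) {}
        const std::string& str() const { return *data_; }
    };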


Actually, the C++ standard string was designed to be easy to implement as a copy-on-write object. And one of the motivations for the libc++ project ( http://libcxx.llvm.org/ ) is to stop using that design (first bullet point under "Why a new C++ Standard Library for C++11?") because it's no longer the obviously best choice.


Copy-on-write is not at all helpful in a language like C++ once concurrency is involved.


...how so?


If you have a single-threaded runtime then reference counts are cheap and easy. Once you need to support multithreading you need to use CPU atomic instructions to maintain the reference count safely. As the number of CPU cores increases the relative cost of this operation goes up.

COW strings were a very common optimization when 1-2 CPUs were standard; that's why they're in GNU's libstdc++, for example. On a modern server with 24 cores it's pretty dubious. It'd only make sense if you were copying (and then not mutating) a lot of your strings.


Why in the world are they even using c_str so often in the first place? That makes me worry more about security holes than performance...


Well, it is a good thing they wrote it in such a high-performance language then...


I literally started programming in C++ this week and I figured the over-use of std::string couldn't be a good thing hehe


It always depends on your needs.

Before you start avoiding std::string, note that the article said one of the problems was that they went from std::string to C strings and then allocated a new std::string. This often happens when you use both kinds of strings in your code. The takeaway is to stick either to C strings or to std::string passed by reference, dropping down to c_str() only when needed.
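
For instance, a sketch of the "one string type, const ref everywhere" style (log_line is a made-up function):

    #include <cstdio>
    #include <string>

    void log_line(const std::string& msg) {  // no copy when callers have a string
        std::fputs(msg.c_str(), stderr);     // borrow at the C boundary; don't
        std::fputc('\n', stderr);            // build a new std::string from it
    }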


Thanks, that was a much more helpful response :) I think the most confusing thing so far (coming from a PHP background) has been all the different types of strings!


No, that's BS. `std::string` should be used wherever it is applicable.

The whole issue this post about Chrome is discussing is poor usage of `std::string`, such as passing c_str() and then constructing another string from it, instead of passing by const ref.

Or building a set of `std::string` to simply check if a value exists.

That's just shit code, not an issue with `std::string`.
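
For the lookup case, a sketch of the difference (the keyword example is invented):

    #include <algorithm>
    #include <cstring>
    #include <iterator>
    #include <set>
    #include <string>

    // The criticized pattern: every call constructs a temporary
    // heap-allocated std::string just to probe the set.
    bool is_keyword_bad(const char* word) {
        static const std::set<std::string> kws = {"else", "if", "while"};
        return kws.count(word) != 0;
    }

    // Cheaper: compare against the literals directly, with no allocation.
    bool is_keyword(const char* word) {
        static const char* const kws[] = {"else", "if", "while"};  // kept sorted
        return std::binary_search(std::begin(kws), std::end(kws), word,
            [](const char* a, const char* b) { return std::strcmp(a, b) < 0; });
    }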


There's a right way to do it. It's not the obvious way. Berating people for coming to terms with that isn't helpful.

It's not their fault that C++ only really supports std::string out of the box. What are the alternatives?

1. const std::string & : what if I have a vector<char>?

2. const char * : what if it's not null terminated?

3. const char * and size_t : Better, but what if I have a deque<char>?

4. const char * start, const char * stop : Better because you can write algorithms around this, but it still doesn't help with deque<char>.

5. template on START_ITER and STOP_ITER : The best we have now if you need an extremely general solution (see the sketch after this list). But I hope writing your implementation in headers is fine.

6. home grown type : a very popular choice, but http://xkcd.com/927/

7. boost::string_ref : maybe the best choice, as it can be created from 1-4 (and most 6's), but still doesn't work with deque<char>.

...so give the rookie a break. But I'll support any comment in a code review about not accepting std::string by reference or by value in an interface.
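
A minimal sketch of what option 5 looks like in practice (count_spaces is an invented example):

    #include <cstddef>

    // Accepts any character range: std::string, vector<char>, deque<char>,
    // plain arrays... at the price of being a template living in a header.
    template <typename Iter>
    std::size_t count_spaces(Iter first, Iter last) {
        std::size_t n = 0;
        for (; first != last; ++first)
            if (*first == ' ') ++n;
        return n;
    }

    // count_spaces(s.begin(), s.end());  count_spaces(buf, buf + len);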


#5 (and many, many other reasons) is why C++ desperately needs a standardized range type. Let's pray it comes in C++17.


What would a range type offer over a pair of iterators?


    cout << takeFirstN(sort(myVector), 10) << '\n';

...try that with iterators.
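
For contrast, the iterator-pair spelling of roughly the same pipeline today (takeFirstN above is hypothetical; a range library would let the steps compose):

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main() {
        std::vector<int> myVector{5, 3, 9, 1, 7};
        // No composition: sort mutates in place, then we plumb iterators
        // and clamp the element count by hand.
        std::sort(myVector.begin(), myVector.end());
        std::size_t n = std::min<std::size_t>(10, myVector.size());
        std::copy(myVector.begin(), myVector.begin() + n,
                  std::ostream_iterator<int>(std::cout, " "));
        std::cout << '\n';
    }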


Right, I guess I misunderstood the article. As I said, new to C++.


std::string solves more problems than it creates. Use it.


Definitely, and where `std::string`'s interface is not enough, use third-party libs which work on std::strings, like the Boost string algorithms, std::regex, or utfcpp.


Well, you could just use char* and produce security exploits everywhere.



