Is C++ fast? (zeuxcg.org)
239 points by AlexeyBrin 29 days ago | 172 comments



So if you get rid of bounds checks, reimplement your own feature-limited vector and hash table, and change the algorithm, moving from a general-purpose exact sort to an inexact, domain-specific one, you can improve your run time from 572ms to 320ms.

What does this have to do with C++?


I interpret the title to mean, "Is standard usage of C++ language features and the STL competitive with C?". The author is basically showing how you improve performance by selectively tearing out more and more features C++ hands you out of the box. This is characteristic of writing C++ in a more C-like style. In that sense I think the author is correct.

However, I'd also say this is something already well known by most C and, particularly, C++ developers. Yes, you can get better performance by reimplementing much of what C++ gives you, and that is typical of writing C code instead of C++. But there's a fair bit of productivity trade-off in doing so.


This is one of the defining principles of C++, after all: You pay only for what you use. Conversely, if you do opt in to using more advanced features, it may come at a cost, but it's all optional.

It's been one of the things that have most strongly shaped the evolution of the language and the library, for both good and bad, so absolutely, it is very much well known.


'You pay only for what you use'? Sorry, have you read the article? He is not using any C++20, C++17, C++14... features, just adding a compiler flag. And he's paying with compilation time. https://twitter.com/zeuxcg/status/1085781851568914432

You might say that compilation time is nothing... But this is C++; compilation time has a huge impact on big codebases.


He started out with, among other things, std::vector. That has tradeoffs, in that the standard specifies requirements that he may or may not need.

My point stands. The performance trade-offs you're paying for with various C++ features are extremely well understood, and better documented for C++ than most languages.

That doesn't mean there aren't surprises here and there, and obviously there are compiler differences, but the point was that the language is designed so that these tradeoffs are known and documented, and they come as a surprise mostly to people who don't know the language very well.

It is not a surprise that you can improve on things like std::vector by rolling your own that does exactly what you want; for starters, if you know your requirements, chances are you can speed up initialization, because you don't need to abide by standard semantics. For example, if you know you're going to assign a fixed number of elements, you don't need to track size and capacity, and you can often avoid the need for bounds checking.
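
A minimal sketch of that kind of stripped-down container, assuming trivially-destructible element types (my illustration, not code from the article):

    // Hypothetical fixed-size buffer: the caller promises to fill exactly
    // `count` elements, so there is no size/capacity bookkeeping, no
    // zero-initialization of the elements, and no bounds checking on access.
    #include <cstddef>

    template <typename T>
    struct FixedBuffer {
        explicit FixedBuffer(std::size_t count) : data(new T[count]) {}  // default-init, not value-init
        ~FixedBuffer() { delete[] data; }

        T& operator[](std::size_t i) { return data[i]; }  // no bounds check

        FixedBuffer(const FixedBuffer&) = delete;
        FixedBuffer& operator=(const FixedBuffer&) = delete;

        T* data;
    };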

As for compilation time, that too is very much a matter of paying for what you're using. You can choose to disable debug info, you can choose to not include headers that are bound to result in lots of time spent on parsing templates.

That he's not using many advanced features is irrelevant; his starting point very much made trade-offs by using functionality that comes with known costs.


For me this principle is also what often causes performance issues, because for many advanced features this leads to convoluted language semantics and sub-optimal implementations.


I'm pretty sure I could make a faster 'printf' but which only supports '%d'. Does that mean 'C is slow'?


No, not at all! At least not in an absolute sense. But what I mean is that writing in C naturally tends towards writing more special-purpose, bespoke functions because you can't lean on the existing, general-purpose functions of C++ (nor language features).

I agree with you about the title, but I also think that if you fix the title the author is making a good point. It's not groundbreaking, but the move "down" from C++ to C does run parallel with the move from more general purpose to more special purpose, in a lot of ways. C dominates embedded after all.


I'd say that printf, with its own DSL, is actually pretty slow. It's meant for debugging or "slow" I/O, if you need to write a lot of data it should be avoided.

A few years ago for debugging purposes I wanted to hexdump a large amount of data. Turns out that the standard hexdump tool was pretty slow at it, writing a simple C utility that would format the dump "by hand" instead of using printf was about an order of magnitude faster. You have fully static code instead of the interpreted format string, you can unroll all the formatting, you don't have to rewrite the characters that don't change in every line (spaces, delimiters etc...). It's lightning fast but of course not very flexible.
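
The idea looks roughly like this (a reconstruction of the approach, not the original utility):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Keep a line template whose spaces and newline never change, and only
    // overwrite the hex digits in place. No format-string interpretation.
    static const char HEX[] = "0123456789abcdef";

    void format_line(const std::uint8_t* bytes, char* line /* 49 chars: 16 * "xx " + '\n' */) {
        for (int i = 0; i < 16; ++i) {
            line[i * 3]     = HEX[bytes[i] >> 4];
            line[i * 3 + 1] = HEX[bytes[i] & 0xF];
            // line[i * 3 + 2] is the separator space, already in the template
        }
    }

    int main() {
        char line[49];
        std::memset(line, ' ', sizeof(line));
        line[48] = '\n';

        std::uint8_t data[16] = {0xDE, 0xAD, 0xBE, 0xEF};  // rest zero-initialized
        format_line(data, line);
        std::fwrite(line, 1, sizeof(line), stdout);
    }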


> I'd say that printf, with its own DSL, is actually pretty slow.

Yes, of course it is. That's why I took it as an example. Since I work on real-time systems I usually have to use plain 'write' instead. But that is because 'printf' is an incredibly complex function, not because 'C is slow'.


At this point I feel like it's almost a philosophical argument but:

- printf is a standard function in C's stdlib.

- Because of the nature of the function (using a special-purpose format string and varargs instead of some kind of generic system) the function can't really be aggressively optimized and inlined without compiler magic. This limitation is caused by C's own limitations when it comes to generic programming and type introspection: printf needs to be told explicitly through a "side-channel" (the format string) what types to expect because C lacks the infrastructure to do that by itself. It also means that you can't extend the function for custom types and that you can't type-check the parameters without compiler magic.

- Even though some compilers do implement special magic around printf to perform more aggressive optimizations (like replacing it with "puts" if there's no formatting going on, for instance), it's still generally trivial to beat its performance in non-trivial cases with handwritten code.

Given all of the above I would say that makes this particular corner of C rather slow indeed. Or at least slower than it might be in a hypothetical language that made it easier to optimize string formatting.


You're entirely right, but I feel like you're trying really hard to misunderstand deng's original comment. It's an apt analogy, partly because of what you wrote. This article's title is as ridiculous as an article titled "Is C fast?" with a story about a faster_printf that only supports %d.


It's been a long time since I've done C++. I wonder if

   int x;
   cout << x;
ought to be faster than printf("%d", x), by virtue of knowing the types ahead of time?


Pedantic: No. If you write

  #include <stdio.h>
  int x;
  printf("%d", x);
the compiler knows the types ahead of time, too, so that cannot explain any speed difference (and yes, C compilers do try to avoid the generic printf in the library. See http://www.ciselant.de/projects/gcc_printf/gcc_printf.html for examples)

(The #include is essential. With it, the compiler can know what printf does and, in theory, optimize it into an int-to-string conversion and a call to puts. Without it, it cannot, because it doesn't know what the printf function does. In this case, it is easier for the compiler to generate fast code when it doesn't have the source available than when it would see source code for a function called printf.)

Utterly pedantic: no. As given, both are undefined behavior (x is never initialized), and it is trivial for a compiler to discover that.


> and yes, C compilers do try to avoid the generic printf in the library

That is true for certain format strings (containing only %s or %c conversions), but for %d GCC and Clang don't seem to want to call a print_int function, possibly because there is none: https://gcc.godbolt.org/z/rltFse


Yes, but only if you first call

  ios::sync_with_stdio(false);
https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdi...


If it weren't for locales, probably yes ;P


You could check out Nanolog [1] for inspiration.

[1] https://github.com/PlatformLab/NanoLog


That's why you measure first, and only go to these extreme measures if you find that this is the place that really costs you performance.

Or if performance is basically your only goal (HFT) then you just need to spend this time on a lot of parts of your code.

But then in 99% of code it's a waste of time.


The C++ standard library is pretty competitive with C, given that C has no standard data structures and algorithms library! It's not even an apples-and-oranges comparison.


With proper use of constexpr, templates and types, you can make C++ at least as fast as C code without giving up on readability.


At least as. Often faster because of easier inlining.
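
The classic illustration of that inlining point is std::sort versus qsort: the C comparator is an opaque function-pointer call on every comparison, while the C++ comparator's type is part of the template instantiation. A sketch (not a benchmark):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // qsort calls its comparator through a function pointer on every comparison...
    int cmp_int(const void* a, const void* b) {
        int x = *static_cast<const int*>(a);
        int y = *static_cast<const int*>(b);
        return (x > y) - (x < y);
    }

    int main() {
        std::vector<int> v = {5, 3, 1, 4, 2};

        // ...while std::sort is instantiated with the comparator's type,
        // so the lambda body is typically inlined into the sorting loop.
        std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
        std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
    }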


> The author is basically showing how you improve performance by selectively tearing out more and more features C++ hands you out of the box.

In that case the author is showing what has been widely known by C++ developers for decades. In fact, "don't use the STL because it's slow" is a mantra which, along with the old "replace the STL's default allocators with your customized ones", is repeated ad nauseam in gamedev circles.


if you don't use any of the slow language features it's fast


If you use special-purpose tools rather than robust ones, you can get better performance. But nobody wants to ship a language without robust general-purpose tools.


Even the slow language features are fast enough for most use cases, though. I think the actual comparison here isn't between 'slow' and 'fast' but 'probably fast enough' and 'as fast as possible.'


In fairness to the article, it is putting some flesh on the bones of an area that I have seen cause surprise.

It's a variant of the observation that, although in C++ you aren't paying a direct abstraction cost, you are often paying a generalisation cost. This cost is obscured by the abstraction and may not be warranted in your application. In other words, an indirect abstraction cost.


Ya after changing the hash table and sort algorithms, the rest of the improvements are fairly minor (other than MSVC’s debug performance).

This is more like “is C++’s STL fast for me”.


For me the proper title of the article would be "std::unordered_set is slow and MSVC debug runtime is even slower", but I guess that wouldn't generate as many views as pretty much everyone knows it.


Came here to say almost exactly this. C++ is fast. All the things mentioned in the article are very specific replacements of C++ standard library implementations. C++ is a zero-cost abstraction language; all these things could have been replaced with idiomatic C++. The fact that they were implemented in C is more a matter of the author's language preference. Reimplementing some things might be easier in C because C is a simpler (and often more familiar) language. Idiomatic C++ may require knowledge of pretty arcane features of an already complicated language, like placement new and judicious use of references and move operators.


I said this elsewhere, but I would legitimately like to see a program like what the author wrote, written in idiomatic C++, that performs as well as what the author got to. Maybe in a comparison style like the author's, to show the slower C++, the faster "C with classes", and the comparable-speed "idiomatic C++" or something. I feel like I see people saying that idiomatic C++ is just as fast as something, while other people say that idiomatic C++ is slower and should be replaced. I'm going to assume you can for sure write C++ using the STL to be just as fast, and would like to learn what needs to be done to get to that point so I can incorporate those techniques into my own C++ code where I might need that extra bit of performance. (I'm playing around with writing games in C++, so performance is more of a concern than for an IO-heavy web server. Or I'm considering learning Qt for making apps, etc.)


Author posted a C version of their code[1]. As a couple of examples of how you can go back to idiomatic C++ starting from here (see the sketch after the link below):

1. Replace all pointers by passing references to templatized vectors. At a high optimization level they will be compiled to the same pointer arithmetic as C, but will give you type safety at the source level.

2. Replace structs with classes with short inlined accessors. At a high optimization level they will be compiled to the same memory-offset access.

3. _If_ there is any value in making the domain-specific sort generic, make it such.

[1] https://gist.github.com/zeux/bf847986e0474cf48f61bb5749da38e...
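
A rough sketch of what points 1 and 2 might look like (my illustration, not the author's code; the Vertex and sum_x names are made up):

    #include <cstddef>
    #include <vector>

    // Point 2: a struct with short inlined accessors; at -O2 the accessor
    // compiles to the same offset load as touching the field directly.
    struct Vertex {
        float px, py, pz;
        float x() const { return px; }
        float y() const { return py; }
        float z() const { return pz; }
    };

    // Point 1: pass a reference to a templatized vector instead of a raw
    // pointer + count; the generated loop is the same pointer arithmetic,
    // but the signature now carries type information.
    template <typename T>
    float sum_x(const std::vector<T>& vertices) {
        float s = 0.0f;
        for (std::size_t i = 0; i < vertices.size(); ++i)
            s += vertices[i].x();
        return s;
    }

    int main() {
        std::vector<Vertex> verts = {{1, 2, 3}, {4, 5, 6}};
        return static_cast<int>(sum_x(verts));
    }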


All the things that were replaced were abstractions, which had costs. I think that there's a valid point that a lot of C++ programmers use these abstractions liberally, while sweeping the very real costs under the rug. There are also benefits to using them, of course.


I would say C++ is more like a zero runtime-cost abstraction language (in release builds).

You could achieve good release build runtime-speeds in modern idiomatic C++, but you have to trade off your compile times and Debug-build runtime-speeds for that.


> I would say C++ is more like a zero runtime-cost abstraction language (in release builds).

Well, other than the "zero-cost" features that increase code size. For example careless use of heavy template classes can generate a lot of code.

In other words, in C++ it's very easy to generate a lot of code from seemingly innocent constructs.

Instruction cache misses, page faults (and, to a lesser extent, TLB misses) are not zero cost.


To your point, cost (or strength) should always consider size as well as time.

It would be nice to have a tool that could compute the size impact of a given line.


> What does this have to do with C++?

Given that most new C++ features in the "modern day" are implemented as std::whatever, in the "everything is a library" way, it's extremely relevant.


Just because I'm using a C++17 compiler doesn't mean I'm happy with the standard library.

I still have to google every time I need to sleep the current thread; and don't get me started on iostream!

However, I'm happy to be able to return a buffer without copying it (move semantics), to precompute frequency tables at compile time (constexpr), to be able to pass around functions without having to create a dedicated class (lambdas), not to have to worry about having an extra conversion (auto) ... And soon, I will be happy to write serialization code without relying on dirty tricks (introspection).
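
For reference, the sleep incantation in question (standard since C++11; the chrono literals since C++14):

    #include <chrono>
    #include <thread>

    int main() {
        using namespace std::chrono_literals;   // C++14 duration literals
        std::this_thread::sleep_for(100ms);     // sleep the current thread for 100 ms
    }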


Well, I can imagine there are some languages where the interpreted nature of the language keeps you from being able to build higher-performance highly-specialized data structures and algorithms that outperform the native implementations (say, JS). But even then, it still doesn't justify the title.


It sure shows that the sentence "you only pay for the features you use" is wrong at least.


What? It demonstrates the opposite. He stopped using general purpose containers and algorithms, and replaced them with implementations more tailored to his use case. He didn't keep paying for std::sort or std::unordered_set once he stopped using them. By contrast in most managed languages with a GC, I can't opt out of the GC even if I want to take responsibility for memory management and object lifetime.


> The issue is that math.h includes cmath in gcc/clang which pulls in a lot of C++ machinery, and in C++17 in libstdc++ adds a slew of new special functions, that are rarely useful but will make compilation slower anyway.

> This is definitely an area where libstdc++ and libc++ could improve in the future - I don’t think it’s reasonable to force users of C headers to pay for the C++ baggage.


You're being obtuse on purpose. The sentence you claim is wrong has only ever been used to describe performance at run time.


Most managed language runtimes I know allow you to opt out of a GC, or to tune it to your workload.


That's different from "I do not want a disabled GC baked into my program." It is relevant, especially to embedded coding with its ornery size restrictions.


In embedded environments the GC is fundamentally different from one on a large-RAM machine. There, a simple Boehm GC is enough and still faster, safer and smaller than refcounting. It can be made incremental.

On big machines a copying collector with a nursery is much faster and safer if you can deal with the pauses.

Memory safety is still an issue I heard.


when he stopped using them he also stopped paying for them


C++ comes with the OOP mindset: OOP is about polymorphism, inheritance and encapsulation. Now what if you are abstracting over the wrong library for a given task? Will you be able to question your basic assumptions, or is that an assumption that is not allowed to be questioned (due to the encapsulation principle)?


I don't think C++ comes with the OOP mindset. At least, not recently. Most people will tell you it's "multi-paradigm." And I would say C++ certainly gives you enough flexibility to dig underneath the abstractions, as TFA shows.


I am not so sure about the multi-paradigm thing: in C++ everything is a class. Also, if you try hard enough you may implement any paradigm in any language.


You mean except booleans, integers, floats, pointers, free functions, namespaces, templates and lambdas?


Also a class in C++ is by default exactly the same as a struct (except for default access control which is a minor thing). An OOP style class would be any class type (including struct) with virtual functions which you would have to opt in.


Yes, you can do a C++ program with functions only, plus integers and arrays of integers; technically that's C++. Very apt observation! I still guess that the main entities in your C++ program will be classes (not POD).

And I still maintain that most people treat class libraries (such as the STL) as given and do not inspect how they are implemented and do not consider what constraints / alternatives exist. One example: std::deque is implemented as an array of fixed-size blocks, and you can't quite tune the size of those inner blocks easily by means of template or instance parameters - no one cares. Also many people happily use the shared_ptr class and don't even know that it takes very expensive atomic operations to do the reference counting; I would have expected a template parameter that could determine whether shared_ptr is meant for multithreaded use or not. Most people don't care and happily use shared_ptr. Also there are no intrusive linked lists in the STL (well, there are in Boost).

I can go on and on about the STL (and other widely used libraries if you care) and I still suspect that they get away with these "issues" in a widely used library because people are used not to question library implementations and treat them as frigging black boxes.


I agree with your sentiment but the lambdas do get treated as anonymous classes overloading operator ().


Not really true. Lambdas can be converted to a class overloading operator() (such as std::function), but they are not inherently that. You can use them as just function pointers, too. Example:

    #include <cstdio>

    void call_func(void(*func)()) {
        func();
    }

    int main(int argc, char* argv[]) {
        // Lambda can be converted to a function pointer
        call_func([]() {
            printf("hello as a function pointer\n");
        });
    }


> but they are not inherently that.

I am fairly certain I have had error messages and disassemblies that suggest otherwise. This may however be an implementation detail.

> You can use them as just function pointers, too.

Yes, I know this. It only works when there are no captures. And AFAIK once there are captures you start getting object code that looks a lot like you put those captures into an anonymous class. (When not inlined.)


You can get that sort of behaviour from any class by implementing the conversion operator. I assume that the lambda implements a conversion operator to `void(*)()` so it is implicitly convertible to a function pointer.


That's Java, isn't it? I write lots of C++ without classes.


Even though Java is still very "class-centric", in the strictest sense Java doesn't make "everything a class". There have always been primitives for things like int, long, boolean, float, double, etc. And now that we have MethodHandles and what-not, you can (sort of) start to think of functions as first class entities in Java. And there are Lambdas as well which effectively let you build a function without it needing a class to contain it.


I was being a little flippant and unfair to Java because I was trying to find where the claim "everything is a class" had appeared from.


The "STL" part (data structures and algorithms) part of the C++ standard library uses no OO design at all. The iostreams part does, but it's hardly visible to the user. (No, simply having syntax that allows you to write `data.function()` instead of `function(data)` doesn't make anything object-oriented.)


There is exactly 0 polymorphism and very limited inheritance in the code referenced by the article.

And the STL really isn't based on an OOP mindset.


There is exactly zero _subtype_ polymorphism. The STL uses templates as a means to achieve parametric polymorphism (i.e, generics).


Yeah, and zero encapsulation also. Keep up the good work!


I haven't (yet) read the article, but let me provide an anecdote to answer the question posed by the post title "Is C++ fast?":

I write in C++. I had a coworker who wrote in Ruby. He wanted to learn C++ because C++ is "fast". So he wrote two identical programs: one in C++ and one in Ruby. Then, he ran both programs though `time` to determine how long each took.

His C++ program took longer.

Why was that?

He didn't write his C++ program using any efficient algorithms whatsoever while his Ruby program was well-designed. He came to me to try to tell me that C++ wasn't as fast as Ruby.

So I rewrote his C++ program. My C++ program then ran almost instantly while his Ruby program hadn't even begun reading or writing; the C++ program finished while the Ruby program was still initializing the Ruby runtime libraries.

Is C++ fast? Yes, it can be. But it's only fast if the developer understands how to make it fast.

Don't use C++ because it's "fast". Use C++ because you understand what its intended purpose is: to be close to the hardware.


After reading the article, I think my anecdote is still relevant, and that the article reinforces it. As @the_duke surmised: If you don't understand your tools, you can use them so inefficiently that other, supposedly worse tools are much faster.

The article doesn't mention what versions of each compiler were used. I'd also challenge the author to use Compiler Explorer to discover compiled-to-assembly-level differences between their C and C++ code.

https://gcc.godbolt.org/


I really like your comment, thanks.

I suppose the question that many people would like to ask, but it's hard to test is:

"Can we rank the speed of languages, given each are used by a domain expert to take full advantage of the features of that language"

or some such.

i.e. if a masterful C programmer and a masterful C++ programmer and a masterful Java Programmer and a masterful Ruby programmer all write something, which is faster.

I'm sure the answer is "it depends", because that's always the answer.


For that particular question: They would all be the same speed because the optimization requirement would lead each one to write a compiler in their language of choice that emits a perfectly tuned assembly binary.

What gets overlooked in language performance is that you pick a language for what are, in essence, placeholder constructs. The performance matters only in that you need it to be fast enough "off the shelf" that you don't have to start writing a compiler. But you generally aren't prevented from using that approach to boost your existing code, and high performance systems often converge on that strategy in some degree, because it lets you keep maintainable source code and add toggles for debugging and optimization that you won't get from treating the application and compiler code as separate entities.


[flagged]


Maybe you should have spent a bit more time thinking about the above comment.

While using Ruby as an example, the point is not related to higher level languages.

If you don't understand your tools, you can use them so inefficiently that other, supposedly worse tools are much faster.


The last thing HN needs is discussions based around a comment from someone who didn't read the article. I don't care about your contextless philosophy.


> Let’s try to replace the only STL component we’re still using, std::vector, with a really simple dynamic array - we don’t need resize or push_back in our code, all arrays are initialized with the right size.

Given those requirements, a custom allocator would speed things up. A custom allocator could allocate even on the stack (small buffer optimisation) if possible or use a memory pool.

Embracing C++ features or reading the docs often gives you more than going back to C or reinventing the wheel.

However I think this demonstrates one of the many lurking problems of C++: It does not enforce the right way of doing things nor makes it easy. Instead it makes the right solution awkward or cumbersome to write.
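
For what it's worth, C++17's polymorphic allocators can express the "stack buffer or memory pool" idea without writing an allocator by hand. A sketch of the idiom (not what the author's library does):

    #include <array>
    #include <cstddef>
    #include <memory_resource>
    #include <vector>

    int main() {
        // A stack buffer backing a monotonic pool: allocations are
        // bump-pointer cheap and everything is released at once when the
        // resource goes away.
        std::array<std::byte, 4096> stack_buffer;
        std::pmr::monotonic_buffer_resource pool(stack_buffer.data(), stack_buffer.size());

        std::pmr::vector<int> values(&pool);
        values.reserve(512);             // served from the stack buffer, no heap call
        for (int i = 0; i < 512; ++i)
            values.push_back(i);
    }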


> It does not enforce the right way of doing things nor makes it easy.

but what the author is doing here, while being "the right thing" for his own project with specific performance requirements, is absolutely not the right thing to do for the average project that has a few std::vectors of 15 widgets, a dozen strings and six and a half pixmaps, or does the occasional web request.


Why not? How do you know?


As already said, it depends on the type of project. Sometimes performance is key, sometimes implementation time is. It is the old tradeoff between engineering effort spent and the value gained.


I know by shipping projects where the -O0 -fsanitize=address performance was enough for the client.


That average project would be written in JavaScript, not C++.


I feel like I've seen people say more than once that you can do things like this to improve the performance of C++ to what you see from this post with the "C with classes" style and reimplementing various bits of the stdlib. However, I don't believe I've ever seen somewhere that demonstrates this capability, and like you said, the "right solution" is not enforced, or is awkward, or hard to discover. What I would legitimately like to see is something that takes something like what the author was doing, but also uses these features of C++ that seem to be talked about, but harder to find demonstrations of. Especially considering this is a relatively small amount of realistic code, I would love it if someone were to reimplement it using a custom allocator or memory pool or whatever other C++ features are available to get the performance to where it is in what the author wrote (or better?) to act as a kind of tutorial for how to use these things. This is something I'm particularly interested in because I've been learning some C++ again and playing around with writing some stuff with SDL, and would love to be able to incorporate the nice features of C++ into what I'm writing.


It would be nice if he used a newer version of GCC and a newer base system in general, rather than the 2+ year old version of Ubuntu. I would also like to see % change as a column in those tables.

Also I'd like to see other optimization levels, since at least GCC adds a lot more potentially slow (to compile) optimizations at -O3. Maybe -ffast-math. Static linking too could conceivably make a difference if he calls into external libraries frequently.

And where does perf say the time is going? Maybe it's all lost in branches or cache misses so a different algorithm that branches less or uses denser data could help.

How about vectorization? I've now looked at the code and it's all vectors, arrays and matrices so surely this is an application for AVX, maybe even a GPU.

Anyway I think what's clear is this doesn't have much to do with C vs C++.


The most surprising part of this to me was the speedup from replacing std::vector.

They write, "We don’t need resize or push_back in our code, all arrays are initialized with the right size." But if you're not doing any resizing or anything, and you're reserving the right amount, std::vector basically is a thin wrapper over a dynamically-allocated raw array -- which is what they tout as their replacement.

I guess I'm surprised that the overhead from default initialization was so large.

With vector, it seems there are always tradeoffs: resize and you have some initialization overhead, but only reserve and you will have a bounds-check-maybe-realloc every time you need to push_back. Of course, since you reserved, the realloc never happens, but you still have to check each time.

I wonder if something like generate_n or copy_n into a back_inserter can avoid the bounds checks?
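
A sketch of the two options (my example, not from the article). Note that generate_n through a back_inserter avoids the redundant initialization, but every insertion still goes through push_back, so the size-vs-capacity check stays in the code path; after the reserve it is simply always false and well predicted:

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <vector>

    int main() {
        const std::size_t n = 1000;

        // Option A: resize() value-initializes everything up front, then the
        // assignments through operator[] need no per-element checks.
        std::vector<int> a;
        a.resize(n);
        for (std::size_t i = 0; i < n; ++i)
            a[i] = static_cast<int>(i) * 2;

        // Option B: reserve() allocates without initializing, and generate_n
        // fills through a back_inserter. Each insertion still calls
        // push_back(), so the capacity check remains in the code path.
        std::vector<int> b;
        b.reserve(n);
        int next = 0;
        std::generate_n(std::back_inserter(b), n, [&] { return 2 * next++; });
    }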


> std::vector basically is a thin wrapper over a dynamically-allocated raw array

Not exactly. A dynamically-allocated array doesn't necessarily save its own size (the system allocator can do optimizations), but std::vector does (you could call size() on it). So std::vector has to save this size somewhere, which takes space and time.


For me, the overhead of keeping track of size and capacity is still pretty thin. It's not like it's a linked list or something.

The overhead is thin but non-zero.


That overhead can be thin or heavy, depending on the operation.

Doing "push_back" requires checking if "size < capacity", so this operation has a lot of overhead even for std::vector instances that never reallocate storage.


Here it would be great to see if the optimizer (which was only on O2 for unfathomable reasons) understands the reserve() beforehand to the point that it can eliminate those checks. You'd have to look at assembly for that.


I think Compiler Explorer shows that it usually doesn't, unfortunately.


If you know the vector's size in advance, then you would just use operator[] + resize and not push_back + reserve. You would use push_back and reserve if you have a pretty good idea about the size but can't depend on it.


That has the overhead of initializing the entries (which may or may not be necessary).

Apparently you can prevent this by using custom allocators, which might be a fine solution but makes the code look less idiomatic.


Branch prediction should eliminate that.


Not completely. I can still imagine a few overheads:

- code size is bigger, taking up space in the cache and memory;

- instructions need to be decoded (whether this affects performance depends on surrounding code);

- this branch will take a slot in the branch predictor state (same here).


The virtues of the STL are that your code is, by default, typesafe and Not Slow. The article shows a few techniques for how to go from Not Slow to fast, mostly by fiddling with the data structures, and these jibe with my experience. In a nontrivial program, you will have your hot code and by creating custom data structures that make different trade offs than the STL (typically eschewing features like initialization, resize, etc), you can get some gain. But that effort isn’t worth it for the rest of your program, and yet the rest of your program is Not Slow.

The algorithm discussed also involves matrix math. It’s not object oriented, it’s not data processing. So, there’s just naturally not that much difference between C and C++ in this domain.


> I made a meshopt_Allocator class that could allocate large blocks of typed data and remember the allocated pointer; at the end of the scope all allocated blocks would be deleted.

I only write occasional C, and no C++ since college, but I think this Arena pattern is drastically underused. I'd love to see it in web frameworks, where every allocation inside a request comes from the same place, and at the end you just give it all back. It seems like you could even offer that in Ruby. Does Rocket use anything like this? It seems like the kind of thing Rust folks would take advantage of.

Postgres memory allocation follows an Arena pattern, where you use palloc to grab memory, but you rarely have to free it, because it is all freed when the "context" completes. There are nested contexts for things like the connection, the transaction, the sql statement, a function call, etc., so as long as you're allocating from the right context, you can practically ignore cleaning up after yourself. (This is from the perspective of a frequent extension author but someone who has only barely contributed to the core project, so perhaps a more expert contributor can correct me and/or elaborate.)
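
A toy version of the pattern, just to show its shape (not Postgres' palloc and not the author's meshopt_Allocator):

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Toy arena: everything allocated through it is released in one go when
    // the arena goes out of scope; nothing is freed individually.
    class Arena {
    public:
        void* alloc(std::size_t size) {
            void* p = std::malloc(size);
            blocks_.push_back(p);
            return p;
        }
        ~Arena() {
            for (void* p : blocks_)
                std::free(p);
        }
    private:
        std::vector<void*> blocks_;
    };

    void handle_request() {
        Arena arena;  // one arena per request
        int* scratch = static_cast<int*>(arena.alloc(1024 * sizeof(int)));
        scratch[0] = 42;
        // ... all of the request's allocations vanish together when `arena`
        // is destroyed at the end of this scope.
    }

    int main() { handle_request(); }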

EDIT: Btw, I really enjoyed the article; thank you for sharing your experiences with us!


A nicely tuned generational GC should end up with first and second generations effectively as arena allocators that get recycled very cheaply. When the time comes around to collect the second generation, it should have almost no live data, as it will only be filled with whatever was in flight across all the times the first generation was collected.

When I understood this, it was an aha! moment for GC languages in stateless server contexts. I could see that for a properly written server - ideally keeping any long-lived mutable memory like caches off-heap - GC overhead could be negligible.


I'm very doubtful that's really true for databases. A lot of memory freed this way is not short lived. E.g. the contexts for running [parts of] an analytics query can both be large and quite long lived. And it's crucial that the memory is available again the instant the last query [part] finished, otherwise it's quite realistic to run out of memory. And adding latency by running through the many GB of otherwise used memory in hypothetical older generations isn't acceptable either.


I wrote stateless servers, if you missed that bit.

Anything requiring large semi-long-lived intermediate buffers would need a different approach, I agree. Similar to caches; you don't want a mid-life crisis in a generational GC.


I recently introduced similar patterns in a few things I work on. The speedup I got over having to manage tons of small allocations surprised me a little.


Main criticisms from my side (that have not been discussed to death already):

- Why O2? Why not O3? You're tuning your code for maximum performance and then leave the optimizer halfway? That'll of course disadvantage code with more abstractions.

- Compile times are improved by dropping STL headers - that's why precompiled headers exist. Having that factored into the comparison would indeed be interesting, sadly this was not done.

- MSVC is slow in debug mode because it's doing every debug check it can. If that's not what you want, you can just change the iterator debug level. I was hoping the author would figure as much after finding the gcc/clang switches (and complaining that they were off by default - very ironic) with note 6 but no dice.

- Regarding the choice of title: Almost all performance improvements were of algorithmic/data structure nature. That's completely natural for the level of performance tuning here. And I wouldn't really care if those specialized implementations are in C++ or C, do what you want. Just don't implement them in C and then draw conclusions from comparing them against the C++ STL under the flag of "Is C++ fast?". I think a better title for the article would've been "Are defaults fast?".


Better question: Can C++ be fast? Yes. But depending on the skill -- or lack thereof -- of the developer it can be dreadfully slow. Besides, speed isn't the best metric for judging a language. In the late 90s there was a joke among Windows desktop developers: Which is faster, C++ or VB6? The answer was "VB6, by six months" and there was a ton of truth to that statement (and by the time the C++ version would have been done the parts of the VB desktop app that had performance issues could have been identified and the logic replaced with C++ COM Objects).

The language is just a tool; USE the tool, don't be a tool.


Not a big surprise that hand implemented data structures with domain specific optimizations are faster than generic standard library implementations.


C++ is relatively fast but MSVC’s STL library’s debug performance is abysmal. It’s been like this since at least MSVC 2008 too. I’m not sure why Microsoft doesn’t think it’s a priority to improve.


It's because it's trivial to improve. The reason the MSVC STL is so slow in debug mode is that it's really debug; they pull out all the stops to bounds-check every dereference, keep a generational count to detect iterators that could potentially have been invalidated by an operation that moves elements or reallocates the container (and do so whether or not the potential resize actually happened), etc. This adds a ton of code to the iterator paths, and (due to that code size) loses most inlining opportunities.

If you want it to be as fast as other implementations, ask it to do less of this. They provide _ITERATOR_DEBUG_LEVEL (https://docs.microsoft.com/en-us/cpp/standard-library/iterat...) for exactly that purpose. I think their choice of default (Debug means you want it to catch as many bugs as it can, and Release means it should be fast) is appropriate.

If they hadn't provided the knob to adjust this I'd agree they had a problem for some use-cases, but if you want the faster-but-less-thorough mode you just have to ask for it.
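
Concretely, the knob is a preprocessor macro that, as far as I know, has to be set consistently across every translation unit and library in the binary, typically via the compiler command line:

    // Keep the unoptimized, debuggable build but drop the checked iterators:
    //   cl /Zi /Od /MDd /D_ITERATOR_DEBUG_LEVEL=0 main.cpp
    // or, equivalently, define it before any standard header in every TU:
    #define _ITERATOR_DEBUG_LEVEL 0
    #include <vector>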


As you explained, the debugging mode for the STL goes well beyond normal debugging mode (i.e. no optimizations, all debug symbols). It explicitly changes the complexity guarantees of the STL to catch issues with iterator invalidation and other nasty bugs. As such it adds a considerable runtime cost to any use of the STL.

In a way it is comparable to the overhead of the sanitizers available in other compilers.

IIRC MS chose to enable it by default in debug mode as part of their effort of fortifying unsafe C and C++ code against security bugs, but of course it has a large penalty.

libstdc++ has a similar debug mode, but it is disabled by default, as, in addition to be more expensive, it also breaks the ABI.


> C++ is relatively fast but MSVC’s STL library’s debug performance is abysmal. It’s been like this since at least MSVC 2008 too. I’m not sure why Microsoft doesn’t think it’s a priority to improve.

If you want fast performance, you use a Release build. The purpose of a Debug build is that you can detect the source of as many bugs in your program as possible. So for Release build, it is a priority to improve performance (and in my opinion Microsoft did a good job). For the Debug build, it is a priority to improve the detectability of bugs in your program. Here also in my opinion Microsoft did a good job.


That sounds reasonable, until you have to debug an issue that only manifests itself when working with a huge dataset, in which case the whole thing becomes pretty much unusable. As the article says, debug mode is expected to be slower, but the difference between MSVC and the other 2 is just crazy.


It's crazy because the knob for "how much debugging do I want to do" (aka iterator debug level) has been left untouched. Do you want the most rigorous checks (the default)? Don't do anything. Just want to avoid "variable optimized out" while debugging? Turn iterator debug level to "none".


Indeed, standard library is often slow. Sometimes by orders of magnitude.

Here’s an example where a fast linked list implementation (not mine, `CAtlList` from ATL) approaches or outperforms std::vector: https://github.com/Const-me/CollectionMicrobench

Fortunately, in C++ we don’t pay for features we don’t use. If you want fast code, no need to go back to C, you only need better C++ containers.


> Release performance also slightly increased across the board - this is because for many of our arrays, the default initialization performed by std::vector’s constructor is redundant as we’re going to fill the array anyway. With std::vector, you can either resize a large array and then compute the items (which requires default-initializing every item redundantly), or reserve and push_back repeatedly (which requires a bit more code for adding each item, and this overhead can also add up). With a custom container it’s easy to have an option to skip initialization - in fact, in our replacement that’s the only option since it’s easy to memset the array manually if necessary.

I have a hard time getting the rationale here. Is the author implying that using memset is inherently better than initializing an std::vector with the constructor that takes a count and a default value, or resetting it with the same variant of .assign()? Skimming through the code that's linked in the article, it looks like there's no use case where either the former or the latter wouldn't fit.


I read it as: the author almost never uses the uninitialized data, and when they do need a zeroed array it's simple to do, so why pay the cost to initialize the elements when you're about to set them anyway.


But you can easily avoid that cost on std::vector, too. std::vector doesn't default-initialize unless you ask it to via resize(). reserve() doesn't initialize, though, it just allocates. Take this toy example:

    #include <vector>
    #include <cstdio>

    struct MyFoo {
        MyFoo() { printf("default ctor\n"); }
        MyFoo(int i) { printf("Initialized with %d\n", i); }
        MyFoo(const MyFoo& foo) {}
    };

    int main() {
        std::vector<MyFoo> myvec;
        myvec.reserve(100);
        printf("resize(2)\n");
        myvec.resize(2);
        printf("emplace_back()\n");
        myvec.emplace_back();
        printf("emplace_back(1)\n");
        myvec.emplace_back(1);
        printf("push_back(MyFoo{1})\n");
        myvec.push_back(MyFoo{2});
        return 0;
    }
It will print:

    resize(2)
    default ctor
    default ctor
    emplace_back()
    default ctor
    emplace_back(1)
    Initialized with 1
    push_back(MyFoo{2})
    Initialized with 2
Only the resize() call did a default initialization "silently", nothing else did.


The tradeoff is that emplace_back and push_back need to check the capacity (even though they won't reallocate because of reserve). Whereas with resize, assignments into the initialized area need no checks at all.


There is also the approach of just copying existing objects, to save yourself the cost of running the constructor, in order to shorten your startup time.


With great power comes great responsibility.

C++ is no different. If you want it to be blazing fast, you need to tell it exactly what you want. What data structure do you want, where do you want it, where should it ask for more memory if needed?

If you don't do that, there's some sane defaults but the compiler doesn't know whether you are going to relocate the structure or how you plan to access it.


Interesting, but not necessarily surprising. It's pretty well known that the STL favors generality and interface consistency over speed.

In any C++ conference talk about performance the speaker's almost guaranteed to mention that their project uses their own optimized library instead of the STL.


The most interesting part of this article to me was introducing me to Jonathan Blow and his Jai language. I've skimmed over some of his videos, and they look quite interesting. The compile time execution of any code supported by the language looks especially interesting.


I'm confused by the premise:

> dropping C++11 requirement allowed me to make sure anybody can compile the library on any platform, removing std::vector substantially improved performance of unoptimized builds

What platform doesn't have C++11 support? And why is the performance of an unoptimized build interesting? -O2 is basically a standard compiler flag...?


C++ gets you into trouble really fast ;-)


What a terrible article.

It starts with the comparison of a general-purpose hash table with specifically-tailored one.

Then it throws in a switch from a comparison-based sort to a radix-based one.

Then, the stunning discovery that std::vector value-initialises its elements (fixable with 3 lines of a custom allocator, if you don't want that behaviour).

In the middle, it preaches about portability while at the same time advocating the use of __builtin functions vs standard functions, and type-punning floats to int and back (undefined behaviour in C++).

And all of this for a 572ms -> 320ms performance improvement?


Could you please nudge the criticism a bit towards information and away from theatrical dismissal? It's not wrong to react that way to the article, but when we post here we want to focus on what we can learn.


In the grand tradition of HN, I haven't read the article (but in my defense it is blocked by work's firewall). Still, if he claims that unordered_map has issues, he is not alone. It is well known that the standard requirements preclude a more efficient implementation.

And the issue with vector value-initializing its elements is also well known, and programmers have been asking for a solution (more user-friendly than a custom allocator) for a long time.


I don't understand the problems. Want a better hash map? Use whatever makes your day here: https://tessil.github.io/2016/08/29/benchmark-hopscotch-map....; they are all header-only, just like the standard library's.

Want a vector that does not initialize its elements ? Ask and thou shalt receive: https://github.com/aguinet/pector

The whole point of C++ is to make it easy for people to write generic libraries beyond the standard base provided by the stdlib. So why won't you use those?


Oh, I'm in violent agreement. The single biggest contribution of the STL is its set of concepts more than the actual implementations. If you do not like the specific tradeoffs for a container or algo implementation, you can swap it for a different implementation often without code changes. I.e. you do not need to remove abstraction, you just swap it for a different one.

Still it is fair to critique the specific tradeoff done by the standard library.


It’s the separate chaining effectively required by the standard. I had an application years ago which was 10x as fast after dropping in khash over unordered_map.


IIRC that's because of the iterator and address invalidation requirements that effectively require a node-based implementation. But if your code does not rely on these guarantees then, as you suggest, you can swap it with a different implementation with the same interface.

You do not have to pay for what you do not use.


I wonder what the reaction would have been if he/she had written a blog post explaining how moving from list to numpy.array for storing numerical data in Python resulted in massive performance benefits. You gotta admit the situation is kinda similar.

What is unfriendly about custom allocators? Note that in this specific case you don't need to write any allocation function; you just have to re-implement the construct() function with something that does not value-initialise if no construction arguments are provided.
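
Something like this well-known formulation (a sketch assuming C++11 allocator_traits; only construct() is overridden, allocation itself is untouched):

    #include <memory>
    #include <utility>
    #include <vector>

    // Allocator that default-initializes instead of value-initializing, so
    // resize() on a vector of trivial types leaves the new elements
    // uninitialized rather than zeroing them.
    template <typename T, typename Base = std::allocator<T>>
    struct default_init_allocator : Base {
        template <typename U>
        struct rebind {
            using other = default_init_allocator<
                U, typename std::allocator_traits<Base>::template rebind_alloc<U>>;
        };

        using Base::Base;

        template <typename U>
        void construct(U* ptr) {
            ::new (static_cast<void*>(ptr)) U;   // default-init, no zeroing
        }
        template <typename U, typename... Args>
        void construct(U* ptr, Args&&... args) {
            ::new (static_cast<void*>(ptr)) U(std::forward<Args>(args)...);
        }
    };

    int main() {
        std::vector<int, default_init_allocator<int>> v;
        v.resize(1000);   // no memset over the 1000 ints
        v[0] = 1;
    }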


Changing the allocator changes the type of the vector, which has ABI/API implications, especially if the vector type is not under your control.

Also, IMHO tying allocation to initialization in the allocator concept is not one of C++'s brightest moments.

There have been continuous talks about adding an unsafe_push_back and an unsafe resize for these and other reasons.


I have written custom containers with malloc/realloc/placement new. It’s a hassle, but once it works, you have a very efficient and memory-safe implementation.


> You gotta admit the situation is kinda similar.

I disagree.

1. Everybody knows Python is a scripting language, and trying to do heavy computation without handing it off to a library is going to be slow.

2. A vector is the de facto data structure for this in C++. In Python, numpy is effectively the de facto approach for math.


Sorry but I think this is still an apt analogy.

1. "Everybody" knows that unordered sets/maps in C++ must de-facto use chaining due to the requirements imposed by the standard.

2. numpy.array is not part of the Python standard library. It is an extra component you still need to install separately. As someone who deals regularly with novice Python users (often coming from Matlab & the likes), one of the very first points of pain/confusion is the fact that I need to teach/convince them not to use the facilities of the core language to represent "vectors".


If you're rapidly removing and adding, a simple fix for unordered_map is to add your own stack allocator. That way the memory you touch for its lists is nearly always hot.

The same technique vastly improves map and set too.


Maybe the title of the article is a bit unlucky; if it were named "How I made my library faster and more compatible", would that change your opinion?

And yes, it's only a 252 ms improvement, but that's a whopping 44% time saved. If you need to do lots and lots of mesh optimizing, that might be very well worth it.


Yeah, but moving towards C is not where those 44% came from.


> advocating the use of __builtin functions vs standard functions

I wouldn't say it "advocates" for the use of builtins. The author demonstrates a problem they had (including math.h made their compile times 2x slower in C++17 mode), shows their solution, and advocates changing libc++ so that C headers don't pull in C++ stuff.

In any case, the snippet you object to uses the builtins if they are available, or falls back to math.h. It's a hack to avoid math.h when it's slow to compile, but it still seems pretty portable?
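
The shape of the hack is roughly this (the mo_ names are placeholders of mine, not the library's actual macros):

    // Use compiler builtins when available so that <math.h> (and, through it,
    // <cmath> with its C++17 special functions) never has to be parsed.
    #if defined(__GNUC__) || defined(__clang__)
    #define mo_sqrtf(x) __builtin_sqrtf(x)
    #define mo_fabsf(x) __builtin_fabsf(x)
    #else
    #include <math.h>
    #define mo_sqrtf(x) sqrtf(x)
    #define mo_fabsf(x) fabsf(x)
    #endif

    float vector_length(float x, float y, float z) {
        return mo_sqrtf(x * x + y * y + z * z);
    }

    int main() { return static_cast<int>(vector_length(3.0f, 4.0f, 0.0f)); }  // 5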


I would interpret the article as one with the following abstract:

C++ has a lot of generic programming solutions like std::vector and algorithms, [and we think of these things as being fast even though they are generic because of templates and “zero cost abstractions”] but we show that we still pay for their being generic (because they must support operations which make everything else expensive, or are doing more work than we need done), and so [for situations like this] we prefer the trade-off of implementing specialised functions that may be similar to other functions elsewhere to gain runtime and compile-time performance improvements. We demonstrate this thesis with a case study of a mesh simplification algorithm.

I think this is a reasonable topic to write about, and I think the article does an ok job of it. I think it is silly to get hung up on specific cases (e.g. you need to do some magic custom allocator thing to make a C++ thing about as cheap as a C thing that achieves the purposes you have, or that you might want to replace an exact sort with an approximate one (note that in this case, using a full radix sort would have you pay 30ms instead of 10, so 572 -> 340ms which is still a large improvement, and one that will become larger with larger inputs)). I think one should instead treat it as a case study which shows some general ideas like “you often do pay for what you don’t use in C++/STL” or “often one can use a more specialised algorithm with much better performance than a general one, so when one is writing performance-sensitive code, having generic algorithms to hand may not be useful”


> And all of this for a 572ms -> 320ms performance improvement?

The library is supposed to be part of a graphics pipeline. It will be called many times a second, and it will be part of what the eventual framerate is.


A 44% performance improvement does not sound good to you?


Of course it sounds good. But you could get basically the same improvement while staying fully within C++ idioms. The correct title would be "Is my library fast without performance tuning?".


What a terrible article.

I thought it was ok. Asking "Is C++ fast?" is a little like asking if internal combustion engines are fast. The actual answer is, "It depends." The content is ok for this site, which also has people who are just starting out with optimization or who might not be that familiar with C++.

"Is [lang] fast?" is a typical clickbait tactic which works by nibbling at a programmer's pride. Here's the thing. Either one is a bit naive, the pride is hurt and the clickbait works, because one isn't seasoned enough to know that optimization is a complex subject within a complex world of tradeoffs or you're so invested in making things fast (possibly so good at it) that you can't resist taking it apart.

And all of this for a 572ms -> 320ms performance improvement?

78%? "Not tea bag!" as AvE would say.


I never did figure out why the author insisted on reporting the runtime statistics for the debug build.


From the post:

>One number that we haven’t moved substantially yet is the time it takes to run this code in debug using MSVC. While it’s natural to expect unoptimized builds to be slower than optimized, they have to be fast enough. Sometimes you want to debug your problem on a non-trivial input dataset. Sometimes you want to run the debug build with full checks through your tests to make sure they don’t trigger any bugs that could disappear in release. Sometimes you are trying to debug a different part of the program, but you still need to run the rest of it.

The library, by the author, is https://github.com/zeux/meshoptimizer. I presume it will be used in graphics pipelines. Debug builds working at a speed that allow graphics to be tolerable by humans sounds like a good goal to have.


"Value-semantics is the way to optimal performance!", said no one, ever.


There was an article on HN where the author stated that they didn't like to write their own algorithms "like in the '90s".

In this context, Arseny's article should be titled: "How to be old-school and tailor algorithms & data structures to your needs". I have this haunting feeling that people nowadays are afraid of "reinventing the wheel" even when it would pay off. A recurring argument is that the standard library or Boost or something available on GitHub is better than a custom solution. But as we see, there are cases where saying "nope, I'll do a better job myself" makes sense.


"C with structs" never heard of this saying. What does it mean exactly?


The first version of C++ ever, designed by Bjarne Stroustrup at AT&T as a thin layer on top of C, was called "C with classes". It was implemented as some kind of source code preprocessor that took ".cc" code and output actual ".c" code, probably not even fully parsing the input code.

It is already a somewhat funny first joke saying that the subset of modern C++ that's a nice language is just the C part and the "class" paradigm ("C with classes").

It is another second somewhat funny joke to say that actually the nice part is only an even more restrictive subset, "C with structs". Yes, structs and classes are virtually equivalent in C++, but let's assume they mean structs as in "classes without member functions". Which were already present in C. Which is why it is funny.

Jokes are never funny when explained :/ Source: I'm a somewhat experienced stand-up comedy practitioner, apart from being a very experienced software developer.

I wholeheartedly recommend reading Stroustrup's "Design and evolution of C++", gives you so much valuable and interesting background.


"C with classes" was never any sort of "source code preprocessor", except in the sense that every compiler is.

From day 1 it was a full compiler that happened to generate C at the back end. At the time this was novel, and was easier for people to understand as if it were a preprocessor like RATFOR. Nowadays, generating C is an extremely familiar approach to bootstrapping a new language, and nobody pretends your compiler is "just a preprocessor". (Also, nowadays, you might better generate LLVM. But that option is new.)

So, saying "C with Classes was a preprocessor" amounts to simple slander.


It is even funnier that all major C compilers, and in some cases libc, are nowadays implemented in C++.


No it is not (funnier).


From the point of view of a former C++ advocate on USENET, it surely is.


I guess it's meant as a reduction of the "C with classes" C++ style: Reduce C++ feature usage even more towards C, but without actually switching to a C compiler.


I presume it is a glib, humorous attempt to say you have structs in C, so you don't need C++.


It's just a snarky way of saying plain C.


I read it to imply that C is fine without classes.


"removing algorithm includes sped up compilation."

Why is he optimizing for compile times when his compile times are only 1/2 second to start out with? Interesting article, though.


1/2 second for this single file, not for the whole project. The bit you quoted about removing includes was about the changes throughout the project.


There are some valid points on optimization here, but it's always incredibly tricky to draw conclusions from a series of microbenchmarks.


I don't see why anyone considers compilation time.

Your code will be compiled only once for your customers, yet run millions of times. Clearly runtime is millions of times more important than compilation time.

If compilation is taking too long, start compiling in the cloud.


The headline is clickbait. A slightly better one would be "Is idiomatic C++ fast?"


There isn't anything you can make the processor do in C that you cannot also make it do in C++ (if I am not mistaken!) Given that, C will only appear faster if you are not using C++ to its full potential. This might be because you don't know what you're doing or because the C++ language makes it unreasonably awkward. Most articles I have read like this are the former, unfortunately.


No, C will be faster because you have more control, in the same way that Assembly will always be faster than C.


C++ allows you to write code that is just as expressive as C. It is no longer true that every valid piece of C is also valid C++, but the deviations are purely minor syntactical differences. I know no C construct for which a C++ counterpart does not exist. Please enlighten me if you have an example.


restrict


Replace it with __restrict and you're fine.
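
For context, a tiny illustration of the keyword being discussed (restrict is C99; __restrict is a common compiler extension, not standard C++):

    // C99 spells it `restrict`; standard C++ has no equivalent keyword, but
    // GCC, Clang and MSVC accept `__restrict` with the same meaning: the two
    // pointers are promised not to alias, which lets the compiler vectorize
    // and reorder loads/stores more aggressively.
    void scale(float* __restrict dst, const float* __restrict src, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = 2.0f * src[i];
    }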


not standard c++


Designated initializers.

(yet)


that's just syntactic sugar


VLA.


Okay, this is true. Using alloca directly is not a good alternative.


I wonder if his MSVC debug performance woes would have been solved by linking against the release CRT.


I don’t think you can, it won’t build.

There’s a supported way to speed up debug builds by a huge factor, at the price of fewer runtime checks.

https://docs.microsoft.com/en-us/cpp/standard-library/iterat...


Sure it does, I use it every day!


You can't link debug builds with the release CRT. It won't link. Here's an example of missing symbols: __imp__invalid_parameter __imp__CrtDbgReport __imp__CrtDbgReportW

You can work around this by undefining DEBUG and _DEBUG, defining NDEBUG, etc., to make the headers think it's a release build, but then it won't be a debug build anymore.


If your goal is to be able to step through it in the debugger and follow what is going on, then it is definitely still a debug build.

If you want sanitization instead of stepping through the debugger, use that.

The people who are best at debugging avoid debuggers entirely, and use logging statements inserted in release-built code, and little proofs. It is hard to persuade people who don't think that way, but it is the only route to mastery.


I'm very good at debugging, and when I have the choice I use all the tools I can get. Not only do I use debuggers, I extend them to do what I need for a particular project, e.g. autoexp.dat/natvis/custom plugins when coding C++ in VC.

At the same time I also use logging a lot. These two things aren't always mutually exclusive.

Only sometimes they are. Some platforms don't have good debuggers. For other stuff it's the opposite, e.g. you can't log from code running on GPU (very hard to accomplish, borderline impossible) but there're specialized GPU debuggers.

I don't think people who use just logging, or just debuggers, are the best.


They are bloating C++ so much that it'll end up as slow as Python or any other high level language with none of the advantages of being a high level language.

Just the fact that we're asking ourselves "is C++ fast" should be frightening, for a language that has always rivalled C closely.


yes



