Is C++ fast? (zeuxcg.org)
239 points by AlexeyBrin 29 days ago | 172 comments

So if you get rid of bounds check, reimplement your own feature-limited vector and hash table and change the algorithm moving from a general purpose exact sort to an inexact, domain-specific one, you can improve your run time from 572ms to 320ms.

What does this have to do with C++?

I interpret the title to mean, "Is standard usage of C++ language features and the STL competitive with C?". The author is basically showing how you improve performance by selectively tearing out more and more features C++ hands you out of the box. This is characteristic of writing C++ in a more C-like style. In that sense I think the author is correct.

However I'd also say this is something already well known by most C and, particularly, C++ developers. Yes, you can get better performance by reimplementing much of what C++ gives you, and that is typical of writing C code instead of C++. But there's a fair bit of productivity trade off in doing so.

This is one of the defining principles of C++, after all: you pay only for what you use. Conversely, if you do opt in to using more advanced features, it may come at a cost, but it's all optional.

It's been one of the things that have most strongly shaped the evolution of the language and the library, for both good and bad, so absolutely, it is very much well known.

'You pay only for what you use': sorry, have you read the article? He isn't using any C++20, C++17, or C++14 features, just adding a compiler flag. And he's paying with compilation time. https://twitter.com/zeuxcg/status/1085781851568914432

You might say that compilation time is nothing. But this is C++; compilation time has a huge impact on big codebases.

He started out with among others, std::vector. That has tradeoffs, in that the standard specifies requirements that he may or may not need.

My point stands. The performance trade-offs you're paying for with various C++ features are extremely well understood, and better documented for C++ than most languages.

That doesn't mean there aren't surprises here and there, and obviously there are compiler differences, but the point is that the language is designed so that these tradeoffs are known and documented, and for the most part they come as a surprise only to people who don't know the language very well.

It is not a surprise that you can improve on things like std::vector by rolling your own that does exactly what you want; for starters, if you know your requirements, chances are you can speed up initialization, because you don't need to abide by standard semantics. For example, if you know you're going to assign a fixed number of elements, you don't need to track size and capacity, and you can often avoid the need for bounds checking.
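As a sketch of what such a stripped-down replacement could look like (invented here for illustration, not code from the article): a fixed-size array that tracks nothing and checks nothing, because the caller promised the size up front.

```cpp
#include <cstddef>

// A stripped-down stand-in for std::vector when the size is fixed up front
// (illustrative only): no size/capacity bookkeeping, no growth, no bounds checks.
template <typename T>
class FixedArray {
    T* data_;
public:
    // new T[n] default-initializes: for int/float the memory is NOT zeroed,
    // which is exactly the initialization cost being avoided here.
    explicit FixedArray(std::size_t n) : data_(new T[n]) {}
    ~FixedArray() { delete[] data_; }
    FixedArray(const FixedArray&) = delete;
    FixedArray& operator=(const FixedArray&) = delete;
    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }
};
```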

As for compilation time, that too is very much a matter of paying for what you're using. You can choose to disable debug info, you can choose to not include headers that are bound to result in lots of time spent on parsing templates.

That he's not using many advanced features is irrelevant; his starting point very much made trade-offs by using functionality that comes with known costs.

For me this principle is also what often causes performance issues, because for many advanced features this leads to convoluted language semantics and sub-optimal implementations.

I'm pretty sure I could make a faster 'printf' that only supports '%d'. Does that mean 'C is slow'?

No, not at all! At least not in an absolute sense. But what I mean is that writing in C naturally tends towards writing more special-purpose, bespoke functions because you can't lean on the existing, general-purpose functions of C++ (nor language features).

I agree with you about the title, but I also think that if you fix the title the author is making a good point. It's not groundbreaking, but the move "down" from C++ to C does run parallel with the move from more general purpose to more special purpose, in a lot of ways. C dominates embedded after all.

I'd say that printf, with its own DSL, is actually pretty slow. It's meant for debugging or "slow" I/O, if you need to write a lot of data it should be avoided.

A few years ago for debugging purposes I wanted to hexdump a large amount of data. Turns out that the standard hexdump tool was pretty slow at it, writing a simple C utility that would format the dump "by hand" instead of using printf was about an order of magnitude faster. You have fully static code instead of the interpreted format string, you can unroll all the formatting, you don't have to rewrite the characters that don't change in every line (spaces, delimiters etc...). It's lightning fast but of course not very flexible.
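That "format by hand" idea can be sketched roughly like this (an illustration in the same spirit, not the commenter's actual utility): each byte becomes two table lookups, and the delimiter is written unconditionally instead of being re-decided by a format-string interpreter.

```cpp
#include <cstddef>
#include <cstdint>

// Hand-rolled hex formatting in the spirit described above (illustrative):
// fully static code, no format-string interpretation, formatting unrollable.
inline void hex_dump_line(const std::uint8_t* bytes, std::size_t n, char* out) {
    static const char digits[] = "0123456789abcdef";
    for (std::size_t i = 0; i < n; ++i) {
        out[3 * i]     = digits[bytes[i] >> 4];   // high nibble
        out[3 * i + 1] = digits[bytes[i] & 0x0f]; // low nibble
        out[3 * i + 2] = ' ';                     // fixed delimiter
    }
    out[3 * n] = '\0';
}
```

Compared with calling printf("%02x ", b) per byte, there is no per-byte interpretation of a format language at all, which is where the order-of-magnitude difference can come from.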

> I'd say that printf, with its own DSL, is actually pretty slow.

Yes, of course it is. That's why I took it as an example. Since I work on real-time systems I usually have to use plain 'write' instead. But that is because 'printf' is an incredibly complex function, not because 'C is slow'.

At this point I feel like it's almost a philosophical argument but:

- printf is a standard function in C's stdlib.

- Because of the nature of the function (using a special-purpose format string and varargs instead of some kind of generic system), the function can't really be aggressively optimized and inlined without compiler magic. This limitation is caused by C's own limitations when it comes to generic programming and type introspection: printf needs to be told explicitly through a "side-channel" (the format string) what types to expect, because C lacks the infrastructure to do that by itself. It also means that you can't extend the function for custom types, and that you can't type-check the parameters without compiler magic.

- Even though some compilers do implement special magic around printf to perform more aggressive optimizations (like replacing it with "puts" if there's no formatting going on, for instance), it's still generally trivial to beat its performance in non-trivial cases with handwritten code.

Given all of the above I would say that makes this particular corner of C rather slow indeed. Or at least slower than it might be compared to a hypothetical language that made it easier to optimize string formatting.

You're entirely right, but I feel like you're trying really hard to misunderstand deng's original comment. It's an apt analogy, partly because of what you wrote. This article's title is as ridiculous as an article titled "Is C fast?" with a story about a faster_printf that only supports %d.

It's been a long time since I've done C++. I wonder if

   int x;
   cout << x;
ought to be faster than printf("%d", x), by virtue of knowing the types ahead of time?

Pedantic: No. If you write

  #include <stdio.h>
  int x;
  printf("%d", x);
the compiler knows the types ahead of time, too, so that cannot explain any speed difference (and yes, C compilers do try to avoid the generic printf in the library. See http://www.ciselant.de/projects/gcc_printf/gcc_printf.html for examples)

(The #include is essential. With it, the compiler can know what printf does and, in theory, optimize it to an int-to-string conversion and a puts call. Without it, it cannot, because it doesn't know what the printf function does. In this case, it is easier for the compiler to generate fast code when it doesn't have the source available than when it would see source code for a function called printf.)

Utterly pedantic: no. As given, both are undefined behavior, and it is trivial for a compiler to discover that.

> and yes, C compilers do try to avoid the generic printf in the library

That is true for certain format strings (containing only %s or %c conversions), but for %d GCC and Clang don't seem to want to call a print_int function, possibly because there is none: https://gcc.godbolt.org/z/rltFse

Yes, but only if you first call


If it weren't for locales, probably yes ;P

You could check out Nanolog [1] for inspiration.

[1] https://github.com/PlatformLab/NanoLog

That's why you measure first, and only go to these extreme measures if you find that this is the place that really costs you performance.

Or if performance is basically your only goal (HFT) then you just need to spend this time on a lot of parts of your code.

But then in 99% of code it's a waste of time.

The C++ standard library is pretty competitive with C given that C has no standard data structure and algorithms library! It's not even an apples and oranges comparison.

With proper use of constexpr, templates and types, you can make C++ at least as fast as C code without giving up on readability.

At least as. Often faster because of easier inlining.

> The author is basically showing how you improve performance by selectively tearing out more and more features C++ hands you out of the box.

In that case the author is showing what has been widely known by C++ developers for decades. In fact, "don't use the STL because it's slow" is a mantra which, along with the old "replace the STL's default allocators with your customized ones", is repeated ad nauseam in gamedev circles.

if you don't use any of the slow language features it's fast

If you use special-purpose tools rather than robust ones, you can get better performance. But nobody wants to ship a language without robust general-purpose tools.

Even the slow language features are fast enough for most use cases, though. I think the actual comparison here isn't between 'slow' and 'fast' but 'probably fast enough' and 'as fast as possible.'

In fairness to the article, it is putting some flesh on the bones of an area that I have seen cause surprise.

It's a variant of the observation that, although in C++ you aren't paying a direct abstraction cost, you are often paying a generalisation cost. This cost is obscured by the abstraction and may not be warranted in your application. In other words, an indirect abstraction cost.

Ya after changing the hash table and sort algorithms, the rest of the improvements are fairly minor (other than MSVC’s debug performance).

This is more like “is C++’s STL fast for me”.

For me the proper title of the article would be "std::unordered_set is slow and MSVC debug runtime is even slower", but I guess that wouldn't generate as many views as pretty much everyone knows it.

Came here to say almost exactly this. C++ is fast. All the things mentioned in the article are very specific replacements of C++ standard library implementations. C++ is a zero-cost abstraction language; all these things could have been replaced with idiomatic C++. The fact that they were implemented in C is more the author's language preference. Reimplementing some things might be easier in C because C is a simpler (and often more familiar) language. Idiomatic C++ may require knowledge of pretty arcane features of an already complicated language, like placement new and judicious use of references and move operators.

I said this elsewhere, but I would legitimately like to see a program like what the author wrote, written in idiomatic C++ that performs as well as what the author got to. Maybe in a comparison style like the author to show the slower C++, the faster "C with classes", and the comparable speed "idiomatic C++" or something. I feel like I see people saying that idiomatic C++ is just as fast as something, while other people say that idiomatic C++ is slower and should be replaced. I'm going to assume you can for sure write C++ using the STL to be just as fast, and would like to learn what needs to be done to get to that point so I can incorporate those techniques into my own C++ code where I might need that extra bit of performance. (I'm playing around with writing games in C++, so performance is more of a concern than an IO-heavy web server. Or I'm considering learning QT for making apps, etc.)

Author posted a C version of their code [1]. A couple of examples of how you could go back to idiomatic C++ starting from there:

1. Replace all pointers by passing references to templatized vectors. At a high optimization level they will be compiled to the same pointer arithmetic as C, but will give you type safety at the source level.

2. Replace structs with classes with short inlined accessors. At a high optimization level they will be compiled to the same memory-offset access.

3. If there is any value in making the domain-specific sort generic, make it such.

[1] https://gist.github.com/zeux/bf847986e0474cf48f61bb5749da38e...
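Point 2 can be made concrete with a sketch (names invented here): a plain C struct and a class with short inlined accessors have identical layout, and at any reasonable optimization level the accessors compile down to the same member-offset loads as direct field access.

```cpp
// Illustration only: same layout, same generated code, but the class version
// gets encapsulation and can later enforce invariants in its accessors.
struct Vec3C { float x, y, z; };

class Vec3Cpp {
    float x_, y_, z_;
public:
    Vec3Cpp(float x, float y, float z) : x_(x), y_(y), z_(z) {}
    float x() const { return x_; }  // trivially inlined member-offset load
    float y() const { return y_; }
    float z() const { return z_; }
};

static_assert(sizeof(Vec3C) == sizeof(Vec3Cpp),
              "the accessors add no per-object overhead");
```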

All the things that were replaced were abstractions, which had costs. I think that there's a valid point that a lot of C++ programmers use these abstractions liberally, while sweeping the very real costs under the rug. There are also benefits to using them, of course.

I would say C++ is more like a zero runtime-cost abstraction language (in release builds).

You could achieve good release build runtime-speeds in modern idiomatic C++, but you have to trade off your compile times and Debug-build runtime-speeds for that.

> I would say C++ is more like a zero runtime-cost abstraction language (in release builds).

Well, other than the "zero-cost" features that increase code size. For example careless use of heavy template classes can generate a lot of code.

In other words, in C++ it's very easy to generate a lot of code from seemingly innocent constructs.

Instruction cache misses, page faults (and to lesser amount TLB misses) are not zero cost.

To your point, cost (or strength) should always consider size as well as time.

It would be nice to have a tool that could compute the size impact of a given line.

> What does this have to do with C++?

Given that most new C++ features in the "modern day" are implemented as std::whatever, in the "everything is a library" way, it's extremely relevant.

Just because I'm using a C++17 compiler doesn't mean I'm happy with the standard library.

I still have to google every time I need to sleep the current thread ; and don't get me started on iostream!

However, I'm happy to be able to return a buffer without copying it (move semantics), to precompute frequency tables at compile time (constexpr), to be able to pass around functions without having to create a dedicated class (lambdas), not to have to worry about having an extra conversion (auto) ... And soon, I will be happy to write serialization code without relying on dirty tricks (introspection).
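For what it's worth, the thread-sleep incantation that's so hard to remember is short once found; this is the standard C++11 way, no platform headers needed:

```cpp
#include <chrono>
#include <thread>

// Sleep the current thread for a given number of milliseconds
// (the thing the comment above says it always has to google).
void pause_ms(int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}
```

With `using namespace std::chrono_literals;` it can be written inline as `std::this_thread::sleep_for(100ms);`.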

Well, I can imagine there are some languages where the interpreted nature of the language keeps you from being able to build higher-performance highly-specialized data structures and algorithms that outperform the native implementations (say, JS). But even then, it still doesn't justify the title.

It sure shows that the sentence "you only pay for the features you use" is wrong at least.

What? It demonstrates the opposite. He stopped using general purpose containers and algorithms, and replaced them with implementations more tailored to his use case. He didn't keep paying for std::sort or std::unordered_set once he stopped using them. By contrast in most managed languages with a GC, I can't opt out of the GC even if I want to take responsibility for memory management and object lifetime.

> The issue is that math.h includes cmath in gcc/clang which pulls in a lot of C++ machinery, and in C++17 in libstdc++ adds a slew of new special functions, that are rarely useful but will make compilation slower anyway.

> This is definitely an area where libstdc++ and libc++ could improve in the future - I don’t think it’s reasonable to force users of C headers to pay for the C++ baggage.

You're being obtuse on purpose. The sentence you claim is wrong has only ever been used to describe performance at run time.

Most managed language runtimes I know allow you to opt out of a GC, or to tune it to your workload.

That's different from "I do not want a disabled GC baked into my program." It is relevant, especially to embedded coding with its ornery size restrictions.

In embedded environments the GC is fundamentally different from the one on a large-RAM machine. There a simple Boehm GC is enough, and still faster, safer and smaller than refcounting. It can be made incremental.

On big machines a copying collector with a nursery is much faster and safer if you can deal with the pauses.

Memory safety is still an issue I heard.

when he stopped using them he also stopped paying for them

C++ comes with the OOP mindset: OOP is about polymorphism, inheritance and encapsulation. Now what if you are abstracting over the wrong library for a given task? Will you be able to question your basic assumptions, or is that an assumption that is not allowed to be questioned (due to the encapsulation principle)?

I don't think C++ comes with the OOP mindset. At least, not recently. Most people will tell you it's "multi-paradigm." And I would say C++ certainly gives you enough flexibility to dig underneath the abstractions, as TFA shows.

I am not so sure about the multi-paradigm thing: in C++ everything is a class. Also, if you try hard enough you can implement any paradigm in any language.

You mean except booleans, integers, floats, pointers, free functions, namespaces, templates and lambdas?

Also a class in C++ is by default exactly the same as a struct (except for default access control which is a minor thing). An OOP style class would be any class type (including struct) with virtual functions which you would have to opt in.

Yes, you can write a C++ program with functions only, plus integers and arrays of integers; technically that's C++. Very apt observation! I still guess that the main entities in your C++ program will be classes (not PODs).

And I still maintain that most people treat class libraries (such as the STL) as given and do not inspect how they are implemented, and do not consider what constraints or alternatives exist. One example: std::deque is a chunked dynamic array, and you can't easily tune the size of the inner arrays by means of template or instance parameters; no one cares. Also, many people happily use the shared_ptr class and don't even know that it uses very expensive atomic operations to do the reference counting; I would have expected a template parameter that could determine whether shared_ptr is meant for multithreaded use or not. Most people don't care and happily use shared_ptr. Also, there are no intrusive linked lists in the STL (well, there are in Boost).

I can go on and on about the STL (and other widely used libraries if you care) and I still suspect that they get away with these "issues" in a widely used library because people are used not to question library implementations and treat them as frigging black boxes.

I agree with your sentiment but the lambdas do get treated as anonymous classes overloading operator ().

Not really true. Lambdas can be converted to a class overloading operator() (such as std::function), but they are not inherently that. You can use them as just function pointers, too. Example:

    #include <cstdio>

    void call_func(void (*func)()) {
        func();
    }

    int main() {
        // A capture-less lambda converts to a plain function pointer
        call_func([]() {
            printf("hello as a function pointer\n");
        });
    }
> but they are not inherently that.

I am fairly certain I have had error messages and disassemblies that suggest otherwise. This may however be an implementation detail.

> You can use them as just function pointers, too.

Yes, I know this. It only works when there are no captures. And AFAIK once there are captures you start getting object code that looks a lot like you put those captures into an anonymous class. (When not inlined.)

You can get that sort of behaviour from any class by implementing the conversion operator. I assume the lambda implements a conversion operator to `void(*)()`, so it is implicitly convertible to a function pointer.

That's Java, isn't it? I write lots of C++ without classes.

Even though Java is still very "class-centric", in the strictest sense Java doesn't make "everything a class". There have always been primitives for things like int, long, boolean, float, double, etc. And now that we have MethodHandles and what-not, you can (sort of) start to think of functions as first class entities in Java. And there are Lambdas as well which effectively let you build a function without it needing a class to contain it.

I was being a little flippant and unfair to Java because I was trying to find where the claim "everything is a class" had appeared from.

The "STL" (data structures and algorithms) part of the C++ standard library uses no OO design at all. The iostreams part does, but it's hardly visible to the user. (No, simply having syntax that allows you to write `data.function()` instead of `function(data)` doesn't make anything object-oriented.)

There is exactly 0 polymorphism and very limited inheritance in the code referenced by the article.

And STL really isn't based on an OOP mindeset.

There is exactly zero _subtype_ polymorphism. The STL uses templates as a means to achieve parametric polymorphism (i.e, generics).

Yeah, and zero encapsulation also. Keep up the good work!

I haven't (yet) read the article, but let me provide an anecdote to answer the question posed by the post title "Is C++ fast?":

I write in C++. I had a coworker who wrote in Ruby. He wanted to learn C++ because C++ is "fast". So he wrote two identical programs: one in C++ and one in Ruby. Then, he ran both programs through `time` to determine how long each took.

His C++ program took longer.

Why was that?

He didn't write his C++ program using any efficient algorithms whatsoever while his Ruby program was well-designed. He came to me to try to tell me that C++ wasn't as fast as Ruby.

So I rewrote his C++ program. My C++ program then ran almost instantly while his Ruby program hadn't even begun reading or writing; the C++ program finished while the Ruby program was still initializing the Ruby runtime libraries.

Is C++ fast? Yes, it can be. But it's only fast if the developer understands how to make it fast.

Don't use C++ because it's "fast". Use C++ because you understand what its intended purpose is: to be close to the hardware.

After reading the article, I think my anecdote is still relevant, and that the article reinforces it. As @the_duke surmised: If you don't understand your tools, you can use them so inefficiently that other, supposedly worse tools are much faster.

The article doesn't mention which versions of each compiler were used. I'd also challenge the author to use Compiler Explorer to discover assembly-level differences between their compiled C and C++ code.


I really like your comment, thanks.

I suppose the question that many people would like to ask, but it's hard to test is:

"Can we rank the speed of languages, given each are used by a domain expert to take full advantage of the features of that language"

or some such.

i.e. if a masterful C programmer and a masterful C++ programmer and a masterful Java Programmer and a masterful Ruby programmer all write something, which is faster.

I'm sure the answer is "it depends", because that's always the answer.

For that particular question: They would all be the same speed because the optimization requirement would lead each one to write a compiler in their language of choice that emits a perfectly tuned assembly binary.

What gets overlooked in language performance is that you pick a language for what are, in essence, placeholder constructs. The performance matters only in that you need it to be fast enough "off the shelf" that you don't have to start writing a compiler. But you generally aren't prevented from using that approach to boost your existing code, and high performance systems often converge on that strategy in some degree, because it lets you keep maintainable source code and add toggles for debugging and optimization that you won't get from treating the application and compiler code as separate entities.


Maybe you should have spent a bit more time thinking about the above comment.

While using Ruby as an example, the point is not related to higher level languages.

If you don't understand your tools, you can use them so inefficiently that other, supposedly worse tools are much faster.

The last thing HN needs is discussions based around a comment from someone who didn't read the article. I don't care about your contextless philosophy.

> Let’s try to replace the only STL component we’re still using, std::vector, with a really simple dynamic array - we don’t need resize or push_back in our code, all arrays are initialized with the right size.

Given those requirements, a custom allocator would speed things up. A custom allocator could allocate even on the stack (small buffer optimisation) if possible or use a memory pool.

Embracing C++ features or reading the docs often gives you more than going back to C or reinventing the wheel.

However, I think this demonstrates one of the many lurking problems of C++: it does not enforce the right way of doing things, nor does it make it easy. Instead it makes the right solution awkward or cumbersome to write.

> It does not enforce the right way of doing things nor makes it easy.

but what the author is doing here, while being "the right thing" for his own project with specific performance requirements, is absolutely not the right thing to do for the average project that has a few std::vector of 15 widgets, a dozen strings and six and a half pixmaps, or does the occasional web request.

Why not? How do you know?

As already said, it depends on the type of project. Sometimes performance is key, sometimes implementation time is. It is the old tradeoff between engineering effort spent and the value gained.

I know by shipping projects where the -O0 -fsanitize=address performance was enough for the client.

That average project would be written in JavaScript, not C++.

I feel like I've seen people say more than once that you can do things like this to improve the performance of C++ to what you see from this post with the "C with classes" style and reimplementing various bits of the stdlib. However, I don't believe I've ever seen somewhere that demonstrates this capability, and like you said, the "right solution" is not enforced, or is awkward, or hard to discover. What I would legitimately like to see, is something that takes something like what the author was doing, but also uses these features of C++ that seem to be talked about, but harder to find demonstrations of. Especially considering this is a relatively small amount of realistic code, I would love it if someone were to reimplement it using a custom allocator or memory pool or whatever other C++ features are available to get the performance to where it is in what the author wrote (or better?) to act as a kind of tutorial for how to use these things. This is something I'm particularly interested in because I've been learning some C++ again and playing around with writing some stuff with SDL, and would love to be able to incorporate the nice features of C++ into what I'm writing.

It would be nice if he used a newer version of GCC and a newer base system in general, rather than the 2+ year old version of Ubuntu. I would also like to see % change as a column in those tables.

Also I'd like to see other optimization levels, since at least GCC adds a lot more potentially slow (to compile) optimizations at -O3. Maybe -ffast-math. Static linking too could conceivably make a difference if he calls into external libraries frequently.

And where does perf say the time is going? Maybe it's all lost in branches or cache misses so a different algorithm that branches less or uses denser data could help.

How about vectorization? I've now looked at the code and it's all vectors, arrays and matrices so surely this is an application for AVX, maybe even a GPU.

Anyway I think what's clear is this doesn't have much to do with C vs C++.

The most surprising part of this to me was the speedup from replacing std::vector.

They write, "We don’t need resize or push_back in our code, all arrays are initialized with the right size." But if you're not doing any resizing or anything, and you're reserving the right amount, std::vector basically is a thin wrapper over a dynamically-allocated raw array -- which is what they tout as their replacement.

I guess I'm surprised that the overhead from default initialization was so large.

With vector, it seems there are always tradeoffs: resize and you have some initialization overhead, but only reserve and you will have a bounds-check-maybe-realloc every time you need to push_back. Of course, since you reserved, the realloc never happens, but you still have to check each time.
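That tradeoff can be made concrete with a small sketch (hypothetical element count, both functions produce the same result):

```cpp
#include <cstddef>
#include <vector>

// The two ways to pre-size a vector, illustrating the tradeoff above.
std::vector<int> via_reserve(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);                // capacity set, size stays 0, nothing initialized
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(int(i));     // every call still checks size < capacity
    return v;
}

std::vector<int> via_resize(std::size_t n) {
    std::vector<int> v(n);       // pays to zero-initialize n ints up front...
    for (std::size_t i = 0; i < n; ++i)
        v[i] = int(i);           // ...but the writes themselves are check-free
    return v;
}
```

Which one wins depends on whether the optimizer can hoist the capacity check out of the loop, or vectorize away the zeroing; neither is guaranteed.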

I wonder if something like generate_n or copy_n into a back_inserter can avoid the bounds checks?

> std::vector basically is a thin wrapper over a dynamically-allocated raw array

Not exactly. A dynamically-allocated array doesn't necessarily save its own size (the system allocator can do optimizations), but std::vector does (you could call size() on it). So std::vector has to save this size somewhere, which takes space and time.

For me, the overhead of keeping track of size and capacity is still pretty thin. It's not like it's a linked list or something.

The overhead is thin but non-zero.

That overhead can be thin or heavy, depending on the operation.

Doing "push_back" requires checking if "size < capacity", so this operation has a lot of overhead even for std::vector instances that never reallocate storage.

Here it would be great to see if the optimizer (which was only on O2 for unfathomable reasons) understands the reserve() beforehand to the point that it can eliminate those checks. You'd have to look at assembly for that.

I think Compiler Explorer shows that it usually doesn't, unfortunately.

if you know the vector's size in advance then you would use just the operator[] + resize and not push_back + reserve. you would use push_back and reserve if you have a pretty good idea about the size but you can't depend on it.

That has the overhead of initializing the entries (which may or may not be necessary).

Apparently you can prevent this by using custom allocators, which might be a fine solution but makes the code look less idiomatic.
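The custom-allocator trick mentioned above can be sketched like this (a known idiom, not code from the article): an allocator adaptor that turns value-initialization into default-initialization, so resize() stops zeroing trivially-constructible element types.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Allocator adaptor: make vector::resize default-initialize elements
// (i.e., leave int/float memory uninitialized) instead of zeroing them.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
public:
    using A::A;
    default_init_allocator() = default;

    template <typename U>
    struct rebind {
        using other = default_init_allocator<
            U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
    };

    // No-argument construct: placement-new without parentheses is default-init.
    template <typename U>
    void construct(U* p) {
        ::new (static_cast<void*>(p)) U;
    }

    // Everything else forwards to the underlying allocator as usual.
    template <typename U, typename... Args>
    void construct(U* p, Args&&... args) {
        std::allocator_traits<A>::construct(
            static_cast<A&>(*this), p, std::forward<Args>(args)...);
    }
};
```

Usage is just `std::vector<int, default_init_allocator<int>>`, which is exactly the "less idiomatic look" being complained about: the allocator leaks into the type.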

Branch prediction should eliminate that.

Not completely. I can still imagine a few overheads:

- code size is bigger, taking up space in the cache and memory;

- instructions need to be decoded (whether this affects performance depends on surrounding code);

- this branch will take a slot in the branch predictor state (same here).

The virtues of the STL are that your code is, by default, typesafe and Not Slow. The article shows a few techniques for how to go from Not Slow to fast, mostly by fiddling with the data structures, and these jibe with my experience. In a nontrivial program, you will have your hot code and by creating custom data structures that make different trade offs than the STL (typically eschewing features like initialization, resize, etc), you can get some gain. But that effort isn’t worth it for the rest of your program, and yet the rest of your program is Not Slow.

The algorithm discussed also involves matrix math. It’s not object oriented, it’s not data processing. So, there’s just naturally not that much difference between C and C++ in this domain.

> I made a meshopt_Allocator class that could allocate large blocks of typed data and remember the allocated pointer; at the end of the scope all allocated blocks would be deleted.

I only write occasional C, and no C++ since college, but I think this Arena pattern is drastically underused. I'd love to see it in web frameworks, where every allocation inside a request comes from the same place, and at the end you just give it all back. It seems like you could even offer that in Ruby. Does Rocket use anything like this? It seems like the kind of thing Rust folks would take advantage of.
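A toy version of that pattern, in the spirit of the quoted meshopt_Allocator but not its actual code: allocations just bump a pointer inside large blocks, and the destructor gives everything back at once at the end of the scope.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// A toy arena/region allocator (illustrative sketch): alloc() bumps a pointer,
// and all blocks are freed together when the arena goes out of scope.
class Arena {
    std::vector<char*> blocks_;
    char* cur_ = nullptr;
    std::size_t left_ = 0;
public:
    void* alloc(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);   // keep 16-byte alignment
        if (n > left_) {                   // start a new block; remainder is wasted
            std::size_t block = n > 65536 ? n : 65536;
            cur_ = static_cast<char*>(::operator new(block));
            blocks_.push_back(cur_);
            left_ = block;
        }
        void* p = cur_;
        cur_ += n;
        left_ -= n;
        return p;
    }
    ~Arena() {                             // "at the end of the scope, give it all back"
        for (char* b : blocks_) ::operator delete(b);
    }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;
    Arena() = default;
};
```

Note this only hands out raw memory; for non-trivially-destructible types you would still have to run destructors yourself, which is why arenas pair so naturally with plain-data request state.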

Postgres memory allocation follows an Arena pattern, where you use palloc to grab memory, but you rarely have to free it, because it is all freed when the "context" completes. There are nested contexts for things like the connection, the transaction, the sql statement, a function call, etc., so as long as you're allocating from the right context, you can practically ignore cleaning up after yourself. (This is from the perspective of a frequent extension author but someone who has only barely contributed to the core project, so perhaps a more expert contributor can correct me and/or elaborate.)

EDIT: Btw, I really enjoyed the article; thank you for sharing your experiences with us!

A nicely tuned generational GC should end up with first and second generations effectively as arena allocators that get recycled very cheaply. When the time comes around to collect the second generation, it should have almost no live data, as it will only be filled with whatever was in flight across all the times the first generation was collected.

When I understood this, it was an aha! moment for GC languages in stateless server contexts. I could see that for a properly written server - ideally keeping any long-lived mutable memory like caches off-heap - GC overhead could be negligible.

I'm very doubtful that's really true for databases. A lot of memory freed this way is not short lived. E.g. the contexts for running [parts of] an analytics query can both be large and quite long lived. And it's crucial that the memory is available again the instant the last query [part] finished, otherwise it's quite realistic to run out of memory. And adding latency by running through the many GB of otherwise used memory in hypothetical older generations isn't acceptable either.

I wrote stateless servers, if you missed that bit.

Anything requiring large semi-long-lived intermediate buffers would need a different approach, I agree. Similar to caches; you don't want a mid-life crisis in a generational GC.

I recently introduced similar patterns in a few things I work on. The speedup I got over having to manage tons of small allocations surprised me a little.

Main criticisms from my side (that have not been discussed to death already):

- Why O2? Why not O3? You're tuning your code for maximum performance and then leave the optimizer halfway? That'll of course disadvantage code with more abstractions.

- Compile times are improved by dropping STL headers - that's why precompiled headers exist. Having that factored into the comparison would indeed be interesting, sadly this was not done.

- MSVC is slow in debug mode because it's doing every debug check it can. If that's not what you want, you can just change the iterator debug level. I was hoping the author would figure as much after finding the gcc/clang switches (and complaining that they were off by default - very ironic) with note 6 but no dice.

- Regarding the choice of title: Almost all performance improvements were of algorithmic/data structure nature. That's completely natural for the level of performance tuning here. And I wouldn't really care if those specialized implementations are in C++ or C, do what you want. Just don't implement them in C and then draw conclusions from comparing them against the C++ STL under the flag of "Is C++ fast?". I think a better title for the article would've been "Are defaults fast?".

Better question: Can C++ be fast? Yes. But depending on the skill -- or lack thereof -- of the developer it can be dreadfully slow. Besides, speed isn't the best metric for judging a language. In the late 90s there was a joke among Windows desktop developers: Which is faster, C++ or VB6? The answer was "VB6, by six months" and there was a ton of truth to that statement (and by the time the C++ version would have been done the parts of the VB desktop app that had performance issues could have been identified and the logic replaced with C++ COM Objects).

The language is just a tool; USE the tool, don't be a tool.

Not a big surprise that hand implemented data structures with domain specific optimizations are faster than generic standard library implementations.

C++ is relatively fast but MSVC’s STL library’s debug performance is abysmal. It’s been like this since at least MSVC 2008 too. I’m not sure why Microsoft doesn’t think it’s a priority to improve.

It's because it's trivial to improve. The reason the MSVC STL is so slow in debug mode, is because it's really debug; they pull out all the stops to bounds-check every dereference, keep a generational count to detect iterators that could potentially have been invalidated by an operation that moved elements or reallocates the container (and do so whether or not the potential resize actually happened), etc. This adds a ton of code to the iterator paths, and (due to that code size) loses most inlining opportunities.

If you want it to be as fast as other implementations, ask it to do less of this. They provide [_ITERATOR_DEBUG_LEVEL](https://docs.microsoft.com/en-us/cpp/standard-library/iterat...) for exactly that purpose. I think their choice of default (Debug means you want it to catch as many bugs as it can, and Release means it should be fast) is appropriate.

If they hadn't provided the knob to adjust this I'd agree they had a problem for some use-cases, but if you want the faster-but-less-thorough mode you just have to ask for it.
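For example, a debug build that keeps symbols but skips the heavyweight iterator checks only needs the macro set before any standard header (or /D_ITERATOR_DEBUG_LEVEL=0 on the compiler command line) — a sketch:

```cpp
// MSVC-specific knob; must be defined consistently across all
// translation units, before any standard library header is included.
// 0 = no checks, 1 = basic bounds checks, 2 = full iterator debugging
// (the default in debug builds).
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>
```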

As you explained, the debugging mode for the STL is well beyond normal debugging mode (i.e. no optimizations, all debug symbols). It explicitly changes the complexity guarantees of the STL to catch issues with iterator invalidation and other nasty bugs. As such it adds a considerable runtime cost to any use of the STL.

In a way it is comparable to the overhead of the sanitizers available in other compilers.

IIRC MS chose to enable it by default in debug mode as part of their effort of fortifying unsafe C and C++ code against security bugs, but of course it has a large penalty.

libstdc++ has a similar debug mode, but it is disabled by default, as, in addition to be more expensive, it also breaks the ABI.

> C++ is relatively fast but MSVC’s STL library’s debug performance is abysmal. It’s been like this since at least MSVC 2008 too. I’m not sure why Microsoft doesn’t think it’s a priority to improve.

If you want fast performance, you use a Release build. The purpose of a Debug build is that you can detect the source of as many bugs in your program as possible. So for Release build, it is a priority to improve performance (and in my opinion Microsoft did a good job). For the Debug build, it is a priority to improve the detectability of bugs in your program. Here also in my opinion Microsoft did a good job.

That sounds reasonable, until you have to debug an issue that only manifests itself when working with a huge dataset, in which case the whole thing becomes pretty much unusable. As the article says, debug mode is expected to be slower, but the difference between MSVC and the other 2 is just crazy.

It's crazy because the knob for "how much debugging do I want to do" (aka iterator debug level) has been left untouched. Do you want the most rigorous checks (the default)? Don't do anything. Just want to avoid "variable optimized out" while debugging? Turn iterator debug level to "none".

Indeed, the standard library is often slow. Sometimes by orders of magnitude.

Here’s an example where a fast linked list implementation (not mine, `CAtlList` from ATL) approaches or outperforms std::vector: https://github.com/Const-me/CollectionMicrobench

Fortunately, in C++ we don’t pay for features we don’t use. If you want fast code, no need to go back to C, you only need better C++ containers.

> Release performance also slightly increased across the board - this is because for many of our arrays, the default initialization performed by std::vector’s constructor is redundant as we’re going to fill the array anyway. With std::vector, you can either resize a large array and then compute the items (which requires default-initializing every item redundantly), or reserve and push_back repeatedly (which requires a bit more code for adding each item, and this overhead can also add up). With a custom container it’s easy to have an option to skip initialization - in fact, in our replacement that’s the only option since it’s easy to memset the array manually if necessary.

I have a hard time getting the rationale here. Is the author implying that using memset is inherently better than initializing an std::vector with the constructor that takes a count and a default value, or resetting it with the same variant of .assign()? Skimming through the code that's linked in the article, it looks like there's no use case where either the former or the latter wouldn't fit.

I read it as the author almost never uses the uninitialized data and when they do need a zeroed array it's simple to do so why pay the cost to initialize the elements when you're about to set them anyway.

But you can easily avoid that cost on std::vector, too. std::vector doesn't default-initialize unless you ask it to via resize(). reserve() doesn't initialize, though, it just allocates. Take this toy example:

    #include <vector>
    #include <cstdio>

    struct MyFoo {
        MyFoo() { printf("default ctor\n"); }
        MyFoo(int i) { printf("Initialized with %d\n", i); }
        MyFoo(const MyFoo& foo) {}
    };

    int main() {
        std::vector<MyFoo> myvec;
        myvec.reserve(100);   // allocates only, no initialization
        myvec.resize(3);      // default-initializes three elements
        myvec.emplace_back(1);
        myvec.emplace_back(2);
        return 0;
    }
It will print:

    default ctor
    default ctor
    default ctor
    Initialized with 1
    Initialized with 2
Only the resize() call did a default initialization "silently", nothing else did.

The tradeoff is that emplace_back and push_back need to check the capacity (even though they won't reallocate because of reserve). Whereas with resize, assignments into the initialized area need no checks at all.

There is also the approach of just copying existing objects, to save yourself the cost of running the constructor and shorten your startup time.

With great power comes great responsibility.

C++ is no different. If you want it to be blazing fast, you need to tell it exactly what you want. What data structure do you want, where do you want it, where should it ask for more memory if needed?

If you don't do that, there's some sane defaults but the compiler doesn't know whether you are going to relocate the structure or how you plan to access it.

Interesting, but not necessarily surprising. It's pretty well known that the STL favors generality and interface consistency over speed.

In any C++ conference talk about performance the speaker's almost guaranteed to mention that their project uses their own optimized library instead of the STL.

The most interesting part of this article to me was introducing me to Jonathan Blow and his Jai language. I've skimmed over some of his videos, and they look quite interesting. The compile time execution of any code supported by the language looks especially interesting.

I'm confused by the premise:

> dropping C++11 requirement allowed me to make sure anybody can compile the library on any platform, removing std::vector substantially improved performance of unoptimized builds

What platform doesn't have C++11 support? And why is the performance of an unoptimized build interesting? -O2 is basically a standard compiler flag...?

C++ gets you into trouble really fast ;-)

What a terrible article.

It starts with the comparison of a general-purpose hash table with specifically-tailored one.

Then it throws in a switch from a comparison-based sort to a radix-based one.

Then, the stunning discovery that std::vector value-initialises its elements (fixable with 3 lines of a custom allocator, if you don't want that behaviour).

In the middle, it preaches about portability while at the same time advocating the use of __builtin functions vs standard functions, and type-punning floats to int and back (undefined behaviour in C++).

And all of this for a 572ms -> 320ms performance improvement?

Could you please nudge the criticism a bit towards information and away from theatrical dismissal? It's not wrong to react that way to the article, but when we post here we want to focus on what we can learn.

In the grand tradition of HN, I haven't read the article (but in my defense it is blocked by work's firewall). Still, if he claims that unordered_map has issues he is not alone. It is well known that the standard requirements preclude a more efficient implementation.

And the issue with vector value-initializing its elements is also well known, and programmers have been asking for a solution (more user friendly than a custom allocator) for a long time.

I don't understand the problems. Want a better hash map ? Use whatever makes your day here : https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.... ; they are all header-only just like the standard library's.

Want a vector that does not initialize its elements ? Ask and thou shalt receive: https://github.com/aguinet/pector

The whole point of C++ is to make it easy for people to write generic libraries beyond the standard base provided by the stdlib. So why won't you use those?

Oh, I'm in violent agreement. The single biggest contribution of the STL is its set of concepts more than the actual implementations. If you do not like the specific tradeoffs for a container or algo implementation, you can swap it for a different implementation often without code changes. I.e. you do not need to remove abstraction, you just swap it for a different one.

Still it is fair to critique the specific tradeoff done by the standard library.

It’s the linear chaining effectively required by the standard. I had an application years ago which was 10x as fast after dropping in khash over unordered_map.

IIRC that's because of the iterator and address invalidation requirements that effectively require an node based implementation. But if your code does not rely on these guarantees then, as you suggest, you can swap it with a different implementation with the same interface.

You do not have to pay for what you do not use.

I wonder what the reaction would have been if he/she had written a blog post explaining how moving from list to numpy.array for storing numerical data in Python resulted in massive performance benefits. You gotta admit the situation is kinda similar.

What is unfriendly about custom allocators? Note that in this specific case you don't need to write any allocation function, you just have to re-implement the construct() function with something that does not value-initialise if no construction arguments are provided.

Changing the allocator changes the type of the vector, which has ABI/API implications, especially if the vector type is not under your control.

Also, IMHO tying allocation with initialization in the allocator concept is not one of C++ brightest moments.

There have been continuous talks about adding an unsafe_push_back and an unsafe resize for these and other reasons.

I have written custom containers with malloc/realloc/placement new. It’s a hassle, but once it works, you have a very efficient and memory-safe implementation.

> You gotta admit the situation is kinda similar.

I disagree.

1. Everybody knows Python is a scripting language, and trying to do heavy computation without handing it off to a library is going to be slow.

2. A vector is the de facto data structure for this in c++. In Python, numpy is effectively the de facto approach for math.

Sorry but I think this is still an apt analogy.

1. "Everybody" knows that unordered sets/maps in C++ must de-facto use chaining due to the requirements imposed by the standard.

2. numpy.array is not part of the Python standard library. It is an extra component you still need to install separately. As someone who deals regularly with novice Python users (often coming from Matlab & the likes), one of the very first points of pain/confusion is the fact that I need to teach/convince them not to use the facilities of the core language to represent "vectors".

If you're rapidly removing and adding, a simple fix for unordered_map is to add your own stack allocator. That way the memory it touches for its lists is nearly always hot.

The same technique vastly improves map and set too.

Maybe the title of the article is a bit unlucky. If it were named "how I made my library faster and more compatible", would that change your opinion?

And yes, it's only a 252 ms improvement, but that's a whopping 44% time saved. If you need to do lots and lots of mesh optimizing, that might be very well worth it.

Yeah, but moving towards C is not where those 44% came from.

> advocating the use of __builtin functions vs standard functions

I wouldn't say it "advocates" for the use of builtins. The author demonstrates a problem they had (including math.h made their compile times 2x slower in c++17 mode) and shows their solution, and advocates changing libc++ so that c headers don't pull in c++ stuff.

In any case, the snippet you object to uses the builtins if they are available, or falls back to math.h. It's a hack to avoid math.h when it's slow to compile, but still seems pretty portable?

I would interpret the article as one with the following abstract:

C++ has a lot of generic programming solutions like std::vector and algorithms, [and we think of these things as being fast even though they are generic because of templates and "zero cost abstractions"] but we show that we still pay for their being generic (because they must support operations which make everything else expensive, or are doing more work than we need done), and so [for situations like this] we prefer the trade-off of implementing specialised functions that may be similar to other functions elsewhere to gain runtime and compile-time performance improvements. We demonstrate this thesis with a case study of a mesh simplification algorithm.

I think this is a reasonable topic to write about, and I think the article does an ok job of it. I think it is silly to get hung up on specific cases (e.g. you need to do some magic custom allocator thing to make a c++ thing about as cheap as a c thing that achieves the purposes you have, or that you might want to replace an exact sort with an approximate one (note that in this case, using a full radix sort would have you pay 30ms instead of 10, so 572 -> 340ms which is still a large improvement, and one that will become larger with larger inputs)). I think one should instead treat it as a case study which shows some general ideas like "you often do pay for what you don't use in c++/stl" or "often one can use a more specialised algorithm with much better performance than a general one, so when one is writing performance sensitive code, having generic algorithms to hand may not be useful".

> And all of this for a 572ms -> 320ms performance improvement?

The library is supposed to be part of a graphics pipeline. It will be called many times a second, and it will be part of what the eventual framerate is.

A 44% performance improvement does not sound good to you?

Of course it sounds good. But you could get basically the same improvement while staying fully within C++ idioms. The correct title would be "Is my library fast without performance tuning?".

What a terrible article.

I thought it was ok. Asking "Is C++ fast?" is a little like asking if internal combustion engines are fast. The actual answer is, "It depends." The content is ok for this site, which also has people who are just starting out with optimization or who might not be that familiar with C++.

"Is [lang] fast?" is a typical clickbait tactic which works by nibbling at a programmer's pride. Here's the thing. Either one is a bit naive, the pride is hurt and the clickbait works, because one isn't seasoned enough to know that optimization is a complex subject within a complex world of tradeoffs or you're so invested in making things fast (possibly so good at it) that you can't resist taking it apart.

And all of this for a 572ms -> 320ms performance improvement?

78%? "Not tea bag!" as AvE would say.

I never did figure why the author insisted on reporting the runtime statistics for the debug build.

From the post:

>One number that we haven’t moved substantially yet is the time it takes to run this code in debug using MSVC. While it’s natural to expect unoptimized builds to be slower than optimized, they have to be fast enough. Sometimes you want to debug your problem on a non-trivial input dataset. Sometimes you want to run the debug build with full checks through your tests to make sure they don’t trigger any bugs that could disappear in release. Sometimes you are trying to debug a different part of the program, but you still need to run the rest of it.

The library, by the author, is https://github.com/zeux/meshoptimizer. I presume it will be used in graphics pipelines. Debug builds working at a speed that allow graphics to be tolerable by humans sounds like a good goal to have.

"Value-semantics is the way to optimal performance!", said no one, ever.

There was an article on HN, where the author stated that they didn't like to write own algorithms "like in '90".

In this context, Arseny's article should be titled: "How to be old-school and tailor algorithms & data structures to your needs". I have this haunting feeling that people nowadays are afraid of "reinventing the wheel" even when it would pay off. A recurring argument is that the standard library or boost or something available on github is better than a custom solution. But as we see, there are cases where saying "nope, I'll do a better job myself" makes sense.

"C with structs" never heard of this saying. What does it mean exactly?

The first version of C++ ever, designed by Bjarne Stroustrup at AT&T as a thin layer on top of C, was called "C with classes". It was implemented as some kind of source code preprocessor that took ".cc" code and output actual ".c" code, probably not even fully parsing the input code.

It is already a somewhat funny first joke saying that the subset of modern C++ that's a nice language is just the C part and the "class" paradigm ("C with classes").

It is another second somewhat funny joke to say that actually the nice part is only an even more restrictive subset, "C with structs". Yes, structs and classes are virtually equivalent in C++, but let's assume they mean structs as in "classes without member functions". Which were already present in C. Which is why it is funny.

Jokes are never funny when explained :/ Source: I'm a somewhat experienced stand-up comedy practitioner, apart from a very experienced software developer.

I wholeheartedly recommend reading Stroustrup's "Design and evolution of C++", gives you so much valuable and interesting background.

"C with classes" was never any sort of "source code preprocessor", except in the sense that every compiler is.

From day 1 it was a full compiler that happened to generate C at the back end. At the time this was novel, and was easier for people to understand as if it were a preprocessor like RATFOR. Nowadays, generating C is an extremely familiar approach to bootstrapping a new language, and nobody pretends your compiler is "just a preprocessor". (Also, nowadays, you might better generate LLVM. But that option is new.)

So, saying "C with Classes was a preprocessor" amounts to simple slander.

It is even funnier that all major C compilers, and in some cases libc, are nowadays implemented in C++.

No it is not (funnier).

From a former C++ advocate on USENET point of view, it surely is.

I guess it's meant as a reduction of the "C with classes" C++ style: Reduce C++ feature usage even more towards C, but without actually switching to a C compiler.

I presume it is a glib humorous attempt to say you have structs in c so don't need c++

It's just a snarky way of saying plain C.

I read it to imply that C is fine without classes.

"removing algorithm includes sped up compilation."

Why is he optimizing for compile times when his compile times are only 1/2 second to start out with? Interesting article though.

1/2 second for this single file, not for the whole project. The bit you quoted about removing includes was about the changes throughout the project.

There are some valid points on optimization here, but it's always incredibly tricky to draw conclusions from a series of microbenchmarks.

I don't see why anyone considers compilation time.

Your code will be compiled only once for your customers, yet run millions of times. Clearly runtime is millions of times more important than compilation time.

If compilation is taking too long, start compiling in the cloud.

The headline is a click bait. Slightly better one would be "is idiomatic C++ fast"

There isn't anything you can make the processor do in C that you cannot also make it do in C++ (if I am not mistaken!) Given that, C will only appear faster if you are not using C++ to its full potential. This might be because you don't know what you're doing or because the C++ language makes it unreasonably awkward. Most articles I have read like this are the former, unfortunately.

No, C will be faster because you have more control, in the same way that Assembly will always be faster than C.

C++ allows you to write code that is equally expressive as C. It is no longer true that every valid piece of C is also valid C++, but the deviations are purely minor syntactical differences. I know no C construct for which a C++ counterpart does not exist. Please enlighten me if you have an example.

Replace it with __restrict and you're fine.

not standard c++

Designated initializers.


that's just syntactic sugar


Okay, this is true. Using alloca directly is not a good alternative.

I wonder if his MSVC debug performance woes would have been solved by linking against the release CRT.

I don’t think you can, it won’t build.

There’s a supported way to speed up debug builds by a huge factor, for the price of less runtime checks.


Sure it does, I use it every day!

You can't link debug builds with release CRT. Won't link. Here's an example of missing symbols: __imp__invalid_parameter __imp__CrtDbgReport __imp__CrtDbgReportW

You can workaround by undefining DEBUG, _DEBUG, defining NDEBUG, etc., to make headers think it's a release build, but then it won't be a debug build anymore.

If your goal is to be able to step through it in the debugger and follow what is going on, then it is definitely still a debug build.

If you want sanitization instead of stepping through the debugger, use that.

The people who are best at debugging avoid debuggers entirely, and use logging statements inserted in release-built code, and little proofs. It is hard to persuade people who don't think that way, but it is the only route to mastery.

I'm very good at debugging and when I have choice I use all the tools I can get. Not only I use debuggers, I extend them to do what I need for particular project, e.g. autoexp.dat/natvis/custom plugins when coding C++ in VC.

At the same time I also use logging a lot. These two things aren't always mutually exclusive.

Only sometimes they are. Some platforms don't have good debuggers. For other stuff it's the opposite, e.g. you can't log from code running on GPU (very hard to accomplish, borderline impossible) but there're specialized GPU debuggers.

I don't think people who use just logging, or just debuggers, are the best.

They are bloating C++ so much that it'll end up as slow as Python or any other high level language with none of the advantages of being a high level language.

Just the fact that we're asking ourselves "is C++ fast" should be frightening, for a language that has always rivalled C closely.

