Why is stack allocation so slow? Fundamentally, it should compile down to a single machine instruction (moving the stack pointer), correct? Sure, the first time it may cause a page fault so the OS can grow your stack, but that should only happen once. What else is happening that I'm missing?
AFAICT, the big difference in the linked code is that the stack allocation code uses the constructor whereas the pool allocator uses a special function called reset(). Is this just a bogus benchmark that is measuring the speed difference between the constructor and reset()?
It won't even modify the stack pointer during the loop.
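To illustrate (a rough sketch; BigObject here is just a stand-in, not necessarily the article's exact type): the slot for a local in a loop is normally carved out of the stack frame once, in the function prologue, so the only per-iteration work is the constructor.

#include <cstring>

struct BigObject {                 // stand-in; the real benchmark's layout may differ
    char data[4096];
    BigObject() { std::memset(data, 0, sizeof data); }   // zero-fill on construction
};

void run(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        BigObject obj;   // obj's slot is reserved once in the frame; each iteration
                         // only reruns the constructor, no stack-pointer adjustment
        (void)obj;
    }
}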
Here's what happens: in between loop iterations, for the correct code (stack & heap), the BigObject has its destructor called, then the constructor is called again, zero-filling the array.
For the pool, it calls "reset", which does nothing. What should happen instead is an in-place destructor call followed by an in-place constructor call. Otherwise you're not really doing the same work as the other allocations.
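A minimal sketch of what a fair recycle step could look like in place of reset() (T standing in for whatever the pool holds):

#include <new>   // placement new

template <typename T>
void recycle(T* obj) {
    obj->~T();       // run the destructor on the existing storage
    new (obj) T();   // run the constructor again in place (re-does the zero fill)
}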
Object pools are better than heap allocations, but they are never faster than stack allocation (which is virtually free).
TL;DR: The author is cheating, the benchmark is plain wrong.
Note also that the real cost of the pool allocation is up front in the "Pool construction" field. At 368ms in this benchmark, it takes even longer than all of the heap allocations!
With pools, you pay an up-front cost for the faster allocation later. So a better way to arrange the benchmark would be to show frequent creation and deletion of objects out of an already-constructed pool.
To be honest, I'd consider the entire benchmark void.
It really depends on how heap allocation is implemented, so it's hard to compare. With a memory pool, you can allocate a huge block of memory from the OS at once (a small initial cost) and lazily initialize the values, which gives you: 1. cache locality, and 2. a lot less fragmentation.
I'd rank it as:
1. Stack allocation (essentially free)
2. Memory pool (a small hit every once in a while)
3. Heap allocation (significant overhead compared to the other ones)
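To make the memory-pool idea concrete, here's a rough sketch (my own simplification, not the article's implementation): one big allocation up front, contiguous slots handed out afterwards, which is where the cache locality and reduced fragmentation come from.

#include <cstddef>
#include <new>
#include <vector>

// Simplifications: no growth, no exhaustion check in take(), and alignment
// beyond what plain operator new guarantees is ignored.
template <typename T>
class SimplePool {
public:
    explicit SimplePool(std::size_t capacity)
        : block_(static_cast<unsigned char*>(::operator new(capacity * sizeof(T)))) {
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i)
            free_.push_back(block_ + i * sizeof(T));   // contiguous slots -> cache friendly
    }
    ~SimplePool() { ::operator delete(block_); }
    SimplePool(const SimplePool&) = delete;
    SimplePool& operator=(const SimplePool&) = delete;

    void* take()             { void* p = free_.back(); free_.pop_back(); return p; }
    void  give_back(void* p) { free_.push_back(static_cast<unsigned char*>(p)); }

private:
    unsigned char* block_;                 // the single up-front allocation
    std::vector<unsigned char*> free_;     // simple LIFO free list
};

Objects would be placement-new'd into the pointer returned by take() and explicitly destroyed before give_back(), as discussed elsewhere in the thread.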
By explicitly calling the destructor, you could do away with the requirement that classes must implement a "reset()" member function (which, aside from being an extra step for users, does kind of make RAII in their classes somewhat confusing):
void release(OBJ* obj) {
    // ...
    obj->OBJ::~OBJ(); // OBJ was the template type
    // ...
}
acquire() could be improved by using placement new and forwarding the constructor arguments via a variadic template:
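Something along these lines (sketch only; free_slot() is a placeholder for however the pool actually hands out raw storage):

#include <new>
#include <utility>

template <typename OBJ>
class Pool {
public:
    template <typename... Args>
    OBJ* acquire(Args&&... args) {
        void* slot = free_slot();                            // placeholder for the pool's
                                                             // own free-slot bookkeeping
        return new (slot) OBJ(std::forward<Args>(args)...);  // construct in place,
                                                             // forwarding the arguments
    }
    // ...
private:
    void* free_slot();   // placeholder declaration
};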
Finally, you can avoid the potential up-front construction cost (and remove the need for a default constructor in user classes) by making an inner dummy struct that matches the size:
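Roughly like this (again a sketch; I've added alignas so the dummy slot matches alignment as well as size):

#include <vector>

template <typename OBJ>
class Pool {
    // Raw storage with OBJ's size and alignment. Nothing is constructed until
    // acquire() placement-news into a slot, so OBJ needs no default constructor
    // and building the pool only zero-fills bytes instead of running constructors.
    struct alignas(OBJ) Slot {
        unsigned char bytes[sizeof(OBJ)];
    };

    std::vector<Slot> slots_;   // sized once at pool construction
    // ...
};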
Did similar profiling with Borland/Turbo C++ a while back (yes, I am that young).
When I needed to allocate 1-100 million objects, the alloc()/delete() API times increased exponentially. The BC++ profiler tracked that down to free-list traversal.
I implemented my own pool in C++, and the new/delete time dropped to constant time. Later on, I implemented a version on Linux (both in kernel and user space, plus SMP) on a faster CPU (1.5 GHz) and got the API time down to sub-microsecond, measured with rdtsc.
It makes a huge difference for any code that needs to allocate and delete millions of objects as fast as possible.
Why is the author testing with optimizations off? If the worry is that the optimizer will delete the code you're trying to time (a valid worry, since it actually does for me) then you should come up with better test code that the optimizer won't remove, not just turn it off. The performance of unoptimized code can be vastly different in all sorts of ways, and it's not usually very informative.
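For example (a GCC/Clang-specific sketch; benchmark libraries offer equivalents such as Google Benchmark's DoNotOptimize), an empty asm escape forces the compiler to assume the result is observed, so the timed work survives optimization:

#include <chrono>
#include <cstdio>

// GCC/Clang-specific escape hatch: tells the optimizer that *p may be read or
// written by code it can't see, so the work producing it must be kept.
static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

void bench(int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        char buf[4096] = {};   // stand-in for the object being constructed
        escape(buf);           // keeps the loop body alive at -O2
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%lld ns\n", static_cast<long long>(ns));
}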
This code is not just suboptimal; it's outright inefficient.
First, it does a linear search - O(n) time - because it keeps adding null pointers for elements that are removed from each list. This makes literally no sense. Just removing the null pointer insertion would already improve the performance a lot.
Second, the multiset is also a poor choice for the available objects list, given its performance characteristics. A structure like a deque, with O(1) insertion/removal on both ends, would be much more suited for the job.
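A sketch of what that free list could look like with a deque, with no null-pointer padding (names are mine, not the article's):

#include <deque>

template <typename OBJ>
class FreeList {
public:
    bool empty() const { return free_.empty(); }

    OBJ* take() {                // O(1): pop from one end
        OBJ* obj = free_.front();
        free_.pop_front();
        return obj;
    }

    void put_back(OBJ* obj) {    // O(1): push on the other end; no null placeholders
        free_.push_back(obj);
    }

private:
    std::deque<OBJ*> free_;      // holds only pointers to free slots
};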
I'd suggest giving them a read; they offer some good insights into simple yet very powerful techniques for improving memory allocation with little code.
TL;DR: the author compares object allocation times of heap, stack, and pool methods, in C++. Heap is slowest, stack is much faster, pool is insanely fast. A pool implementation is shown.
You are almost certainly not the only one, but I can't say I agree with you. In an age where we're bombarded by information and it is a struggle (if even possible) to stay current on technology trends, TL;DRs give a summary that lets you make a more informed decision about whether an article is worth your time - titles alone just aren't usually sufficient to determine that.
As I have said, this is a problem with specific TL;DRs, not with the general concept of them. (It is still a relevant concern, but it isn't sufficient reason to disregard them completely.)
Also, I generally use TL;DRs to decide WHAT to read; I don't use them as my source of information. So I would have disregarded the TL;DR's incorrect conclusion and read the article for myself (which I did).
The problem is that the TL;DR is often not just shorter than the article but also misrepresents it, or is sometimes just plain wrong. Following it leads you to a misinformed decision, not an informed one.
TL;DR isn't a new concept. Academic papers have had an 'abstract' for decades (centuries?). They're of enormous value in that community for determining whether a paper is of interest to you.