A big reason the small object optimization exists in libstdc++ containers is that the system malloc() is not fast enough.
We're not talking about some other optimization (small object / locality) here: his issue was caused by the libstdc++ alloc pools, which would not need to exist in the first place if the system malloc were better. So libstdc++ ends up reinventing the wheel, poorly.
As the author mentioned, when he disabled the pooling behavior with GLIBCPP_FORCE_NEW, he ended up burning more CPU in the system (glibc) malloc(). Once he added jemalloc on top of GLIBCPP_FORCE_NEW, runtime performance pretty much evened out with the previous (pooled) behavior.
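If you want to see the effect yourself, something like the following is a rough sketch of a small-allocation benchmark, not the author's actual workload. The function names `churn_small_nodes` and `time_churn_ms` are made up for illustration; the only assumption is that each `std::list` node turns into one small allocator request.

```cpp
#include <chrono>
#include <cstddef>
#include <list>

// Build and tear down a std::list of n ints. Every node is a separate small
// allocation, so this leans on whatever allocator sits underneath.
// Returns the sum of the stored values so the work has an observable result.
long long churn_small_nodes(std::size_t n) {
    std::list<int> nodes;
    for (std::size_t i = 0; i < n; ++i) {
        nodes.push_back(static_cast<int>(i % 7));
    }
    long long sum = 0;
    for (int v : nodes) {
        sum += v;
    }
    return sum;  // nodes are freed one by one when the list goes out of scope
}

// Wall-clock time in milliseconds for `rounds` repetitions of the churn.
double time_churn_ms(std::size_t n, int rounds) {
    auto start = std::chrono::steady_clock::now();
    long long sink = 0;
    for (int r = 0; r < rounds; ++r) {
        sink += churn_small_nodes(n);
    }
    auto stop = std::chrono::steady_clock::now();
    volatile long long keep = sink;  // keep the result alive for the optimizer
    (void)keep;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Call `time_churn_ms` from a small `main`, print the result, and run the same binary three ways: plain, with GLIBCPP_FORCE_NEW set in the environment, and with GLIBCPP_FORCE_NEW plus a jemalloc shared library injected via `LD_PRELOAD` (the library path depends on your system). Whether the environment variable does anything depends on the libstdc++ version you link against, so treat the numbers as indicative only.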
The conclusion towards the end of the article:
> The right answer to "malloc is slow" is to make it faster.