std::fill(p, p + n, 0) : 1.71732 GB/s
std::fill(p, p + n, '\0') : 30.3459 GB/s
memset(p,0,n) : 30.419 GB/s
There is likely some template specialization in place: if it's called on a range of 'char', it gets translated into a call to libc's memset(). That's why the results for the last two are nearly identical.
Alternatively, there is no template specialization going on, and the compiler instead detects that the loop is equivalent to memset() and converts it accordingly. LLVM/Clang does this through something called the LibCall optimization pass, IIRC.
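The library-side dispatch does exist, at least in libstdc++. A minimal sketch of the kind of byte-type dispatch involved (names and structure are illustrative, not the actual libstdc++ internals):

    #include <cstring>
    #include <type_traits>

    // Illustrative fill: forward to memset when both the element type and the
    // fill value are single-byte integer types, otherwise run the generic loop.
    template <typename T, typename V>
    void fill_bytes(T* first, T* last, const V& value) {
        if constexpr (sizeof(T) == 1 && std::is_integral_v<T> &&
                      sizeof(V) == 1 && std::is_integral_v<V>) {
            std::memset(first, static_cast<unsigned char>(value), last - first);
        } else {
            for (; first != last; ++first)   // generic element-by-element path
                *first = value;
        }
    }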
Pushing C++ devs to make premature C-style optimizations is counter-productive... and in this case the microbenchmarking itself is wrong.
Simply switching to "-O3 -march=native" in the makefile to enable vectorization, together with a proper use of '\0', inverts the result and makes std::fill() consistently faster than memset:
page count: 262144, volume:1024 MB
std::fill(p, p + n, '\0') : 25.9191 GB/s
memset(p,0,n) : 25.3283 GB/s
std::fill(p, p + n, '\0') : 25.6167 GB/s
memset(p,0,n) : 25.1646 GB/s
std::fill(p, p + n, '\0') : 25.8355 GB/s
memset(p,0,n) : 25.6276 GB/s
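For reference, the kind of harness behind numbers like these is roughly the following (a sketch, not the post's actual benchmark; the buffer size, repetition count, and GB/s arithmetic are arbitrary choices):

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstring>
    #include <iostream>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t(1) << 30;   // 1 GiB buffer
        std::vector<char> buf(n, 1);
        char* p = buf.data();
        volatile char sink = 0;

        auto bench = [&](auto&& f, const char* name) {
            auto t0 = std::chrono::steady_clock::now();
            for (int i = 0; i < 10; ++i) {
                f();
                sink = sink + p[i];   // observe the buffer so the writes can't be trivially elided
            }
            auto t1 = std::chrono::steady_clock::now();
            double secs = std::chrono::duration<double>(t1 - t0).count();
            std::cout << name << ": " << (10.0 * n / secs) / 1e9 << " GB/s\n";
        };

        bench([&] { std::fill(p, p + n, '\0'); }, "std::fill(p, p + n, '\\0')");
        bench([&] { std::memset(p, 0, n); }, "memset(p,0,n)");
    }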
It would have been much more productive to explain why std::fill with 0 is slower than with '\0' (integer vs. char).
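Concretely, the difference is just the type of the third argument. In libstdc++ (at least in some versions), the memset fast path is only selected when the fill value has the same byte type as the elements, so an int literal 0 falls back to the generic loop:

    #include <algorithm>
    #include <cstddef>

    void zero_with_int(char* p, std::size_t n) {
        // Fill value is an int (0): the byte-specialized fast path is not
        // selected, so the generic element-by-element loop runs unless the
        // optimizer later recognizes it as a memset idiom.
        std::fill(p, p + n, 0);
    }

    void zero_with_char(char* p, std::size_t n) {
        // Fill value is a char ('\0'): value and element types agree, so the
        // library can forward this directly to memset.
        std::fill(p, p + n, '\0');
    }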
If you analyze the ASM generated by the benchmark he proposed, you see something very different from what the godbolt "simple" compilation of a micro-isolated function gives you:
- Most calls to "fill" are transparently replaced by a call to "memset" by the compiler at high optimization levels, at least in a normal glibc/GCC Linux configuration.
- The call can even be removed entirely sometimes, and this is also a problem ( https://www.usenix.org/system/files/conference/usenixsecurit... ); see the sketch after this list.
- This does not seem to occur on godbolt for ...reasons. https://godbolt.org/z/RGEXH8
- Enabling vectorization plus certain compiler flags can, in some cases, replace the memset call with a bunch of SIMD code, depending on your compiler, compiler options, and/or CPU.
- The Intel compiler is (as usual) much more aggressive and replaces everything with its home-made memset implementation: https://godbolt.org/z/7RAYGc
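On the removal point, a minimal illustration of the dead-store problem and the usual workaround (explicit_bzero is a glibc/BSD extension, not standard C++; C11's optional memset_s gives a similar guarantee):

    #include <cstring>
    #include <string.h>   // explicit_bzero (glibc >= 2.25 / BSD)

    void handle_secret() {
        char password[64];
        // ... use password ...

        // The compiler may legally delete this call: the buffer is dead after
        // the function returns, so the zeroing store is unobservable.
        std::memset(password, 0, sizeof(password));
    }

    void handle_secret_carefully() {
        char password[64];
        // ... use password ...

        // explicit_bzero is specified not to be optimized away.
        explicit_bzero(password, sizeof(password));
    }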
Lemire actually links to BeOnRope's post, which fully explains the reason. He probably felt that repeating the explanation didn't add anything.
The operation needs to be blocking, otherwise you could of course do it without the CPU at all (DMA). But you still have to go through the memory controller and actually write every word of memory.
In fact, all-ones is not hard either; there have been experiments giving RAM the ability to perform simple page-level computations, with promising results.
I also wonder, for typical workloads, what percent of CPU time is spent zeroing pages.
The amount of time wasted zeroing out memory pages in a typical OS is quite significant. Also take into account that such an operation will trash perfectly good cache space for no good reason.
In the end memory design is limited by the laws of physics.
Edit: nvm they probably do this out of band. But still, things like this.
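On the cache-trashing point: when zeroing from the CPU, non-temporal stores let you write a big region without displacing useful cache lines. A sketch with SSE2 intrinsics (x86-only; assumes p is 16-byte aligned and n is a multiple of 16):

    #include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_setzero_si128
    #include <cstddef>

    // Zero n bytes at p with streaming (non-temporal) stores, so the data
    // goes to memory without displacing useful lines from the cache.
    void zero_nontemporal(void* p, std::size_t n) {
        const __m128i zero = _mm_setzero_si128();
        __m128i* dst = static_cast<__m128i*>(p);
        for (std::size_t i = 0; i < n / 16; ++i)
            _mm_stream_si128(dst + i, zero);
        _mm_sfence();   // order the streaming stores with later accesses
    }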
Simply running the spring-framework tests is enough to demonstrate the impact (with the JVM given the relevant arguments, of course).