...by calling memset(). The referenced article [1] is much clearer and more detailed. The bit in that article about lowering std::fill to memset, but only for character types, is interesting.
In my opinion the most interesting part, which doesn't even get addressed in the article, is this:
```
std::fill(p, p + n, 0) : 1.71732 GB/s
std::fill(p, p + n, '\0') : 30.3459 GB/s
memset(p,0,n) : 30.419 GB/s
```
Ok, memset is fast, alright. But why the difference between filling with 0 vs. the null char/byte? Is this because 0 is actually larger than a single byte?
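For anyone who wants to poke at this themselves, a minimal throughput harness in the same spirit might look like this (my sketch, not the article's benchmark; the buffer size, rep count, and the dead-store guard at the end are arbitrary choices):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <vector>

// Time a zeroing routine and report throughput in GB/s.
template <typename F>
double throughput_gbps(F&& zero, std::size_t bytes, int reps = 10) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) zero();
    std::chrono::duration<double> secs = std::chrono::steady_clock::now() - t0;
    return static_cast<double>(bytes) * reps / secs.count() / 1e9;
}

int main() {
    std::size_t n = std::size_t{1} << 30;  // 1 GiB buffer
    std::vector<char> buf(n, 1);
    char* p = buf.data();

    std::cout << "fill 0    : " << throughput_gbps([&] { std::fill(p, p + n, 0); }, n) << " GB/s\n";
    std::cout << "fill '\\0' : " << throughput_gbps([&] { std::fill(p, p + n, '\0'); }, n) << " GB/s\n";
    std::cout << "memset    : " << throughput_gbps([&] { std::memset(p, 0, n); }, n) << " GB/s\n";

    // Read something back so the stores cannot be optimized away entirely.
    return p[0];
}
```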
It’s likely because std::fill() is a generic function that can fill any range with any kind of value (even class instances).
There is likely some template specialization in place so that, when it’s called on a range of ‘char’, it gets translated into a call to libc’s memset(). That’s why the results for the last two are nearly identical.
Alternatively, there is no template specialization going on, and instead the compiler detects that the loop is equivalent to memset() and converts it accordingly. LLVM/Clang does this through something called the LibCall optimization pass, IIRC.
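For intuition, here is a minimal sketch of the kind of byte-type dispatch a standard library could do (illustrative only, not the actual libstdc++/libc++ code, which differs in detail):

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical fill that lowers to memset for one-byte integral types;
// real standard libraries do a similar dispatch internally.
template <typename T>
void fill_sketch(T* first, T* last, T value) {
    if constexpr (sizeof(T) == 1 && std::is_integral_v<T>) {
        // Byte-sized elements: hand the whole range to memset.
        std::memset(first, static_cast<unsigned char>(value),
                    static_cast<std::size_t>(last - first));
    } else {
        // Generic path: assign element by element.
        for (; first != last; ++first) {
            *first = value;
        }
    }
}
```

Note that the real std::fill deduces the fill value as a separate template parameter from the element type, which is exactly how passing an int 0 rather than a char '\0' can sidestep a byte-type check like this one.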
I would actually consider this a bug in the standard library implementation. Would love to see a comparison between different implementations (libc++, libstdc++, MSVC).
Pushing C++ devs to make premature C-style optimizations is counter-productive... And in this case it is even based on wrong microbenchmarking.
A simple switch to "-O3 -march=native" in the makefile to enable vectorization, plus a proper use of '\0', inverts the result and makes std::fill() consistently faster than memset:
```
page count: 262144, volume:1024 MB
std::fill(p, p + n, '\0') : 25.9191 GB/s
memset(p,0,n) : 25.3283 GB/s
std::fill(p, p + n, '\0') : 25.6167 GB/s
memset(p,0,n) : 25.1646 GB/s
std::fill(p, p + n, '\0') : 25.8355 GB/s
memset(p,0,n) : 25.6276 GB/s
```
It would have been much more productive to describe why std::fill with 0 is slower than with '\0' (integer vs char).
This is also a misinterpretation of the micro-benchmark, and it illustrates pretty well that benchmarking this kind of thing is difficult.
If you analyze the ASM generated by the benchmark he proposed, you see something very different from what the "simple" Godbolt compilation of a micro-isolated function gives you (a minimal pair to compare is sketched after this list):
- Most calls to "fill" are transparently replaced by a "memset" call directly by the compiler at high optimization levels in a normal glibc/GCC Linux configuration.
- Enabling vectorization plus some compiler flags can, in some cases, get the memset call replaced by a bunch of SIMD code, depending on your compiler, compiler options, and/or CPU.
- The Intel compiler is (as usual) much more aggressive and replaces everything with its home-made memset implementation: https://godbolt.org/z/7RAYGc
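To see the point about micro-isolated functions concretely, here is a minimal pair for Godbolt (an illustrative sketch of mine, not the article's code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Compare the generated assembly at -O0 and at -O3 (optionally with
// -march=native): at high optimization levels the compiler typically
// lowers both functions to the same memset call, or to inlined SIMD
// stores.
void via_fill(char* p, std::size_t n)   { std::fill(p, p + n, '\0'); }
void via_memset(char* p, std::size_t n) { std::memset(p, 0, n); }
```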
> It is a very bad blog post.
[...]
> It would have been much more productive to describe why std::fill with 0 is slower than with '\0' (integer vs char)
Lemire actually links to BeOnRope's post, which fully explains the reason. He probably felt that repeating the explanation wouldn't add anything.
Well, Lemire's blog post was actually ranked high on HN, while BeOnRope's post didn't show up until, I assume, dang manually fixed the ranking. At the very least, Lemire blogging about it increased visibility (it certainly put it on my radar).
Programmers "typically" use std::fill?? Where are those programmers? I've been a professional C++ programmer for a while now and I've never seen it used over standard memset, even on different platforms with different architectures and multitude of compilers.
Seems like something that could be done in HW? IDK much about HW architecture, but I'm surprised there's no way for the CPU to tell the memory to zero out some range in a single command.
Well, on e.g. x86 you have the "rep stos" instruction, which pretty much amounts to memset in HW.
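For the curious, here is a minimal sketch of what that looks like from C++ using GCC/Clang extended inline asm (x86-64 only, and purely illustrative; libc's memset typically already picks the best strategy, including rep stosb, at runtime):

```cpp
#include <cstddef>

// Zero a buffer with "rep stosb": stores AL into [RDI], RCX times.
// The System V ABI guarantees the direction flag is clear on entry.
void zero_rep_stosb(void* dst, std::size_t n) {
    asm volatile("rep stosb"
                 : "+D"(dst), "+c"(n)  // RDI = destination, RCX = count
                 : "a"(0)              // AL  = byte value to store
                 : "memory");
}
```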
The operation needs to be blocking; otherwise you could of course do it without the CPU at all (DMA). But you still have to go through the memory controller and actually write each word of memory.
Thing is, you still need to zero out any cache line that might be caching that range, which would conflict with the CPU accessing the cache. Might as well just let the CPU do the zeroing.
The amount of time wasted zeroing out memory pages in a typical OS is quite significant. Also take into account that such an operation will trash perfectly good cache space for no good reason.
Why can't memory modules be smarter? Or wouldn't it make a difference? I.e., is the CPU-to-memory bus just as fast as anything done exclusively inside the memory module itself could be?
Ah, I never knew that was how it worked. Thanks! Still, it seems like there's a way it could be redesigned such that zeroing large blocks is done completely in the module and is faster, and provide an extra line for it. With the naive (and probably wrong) implementation I'm thinking, you'd get log N zeroing but bigger constant multiplier on lookup, so that would be the wrong compromise in just about every situation. Maybe you could generify it so that you could do various operations beyond just zeroing and that might be interesting? Though I'm sure this is a plenty well explored topic already, and has come up with what we have now.
The reason is fully explained in BeOnRope's article. It is due to a missed optimization in the standard library. The optimizer still fixes the issue at -O3.
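In other words (a minimal sketch; the exact dispatch details vary by standard library version), the fill value is deduced as its own template parameter, so an int 0 can miss the library's byte fast path and you are left relying on the optimizer to recognize the loop as a memset:

```cpp
#include <algorithm>
#include <cstddef>

// The fill value deduces as int: the library's byte-type fast path
// keys on the value type, so this can stay a byte-at-a-time loop
// (with an int -> char conversion) until the optimizer rescues it.
void fill_with_int(char* p, std::size_t n)  { std::fill(p, p + n, 0); }

// The fill value is char, matching the element type, so the
// memset specialization applies even before the optimizer runs.
void fill_with_char(char* p, std::size_t n) { std::fill(p, p + n, '\0'); }
```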