
Filling large arrays with zeroes quickly in C++ - ibobev
https://lemire.me/blog/2020/01/20/filling-large-arrays-with-zeroes-quickly-in-c/
======
gpderetta
The BeOnRope article linked in the post is the better read in this case:
[https://travisdowns.github.io/blog/2020/01/20/zero.html](https://travisdowns.github.io/blog/2020/01/20/zero.html)

~~~
dang
Currently discussed at
[https://news.ycombinator.com/item?id=22104576](https://news.ycombinator.com/item?id=22104576).

------
wyldfire
...by calling memset(). The referenced article [1] is much clearer and more
detailed. The bit in that article about std::fill lowering to memset, but only
for character types, is interesting.

[1]
[https://travisdowns.github.io/blog/2020/01/20/zero.html](https://travisdowns.github.io/blog/2020/01/20/zero.html)

------
numlock86
In my opinion the most interesting part, which the article does not even
address, is this:

    std::fill(p, p + n, 0)      : 1.71732 GB/s
    std::fill(p, p + n, '\0')   : 30.3459 GB/s
    memset(p,0,n)               : 30.419 GB/s

Ok, memset is fast, alright. But why the difference between simply setting 0
vs a null char/byte? Is this because 0 is actually larger than a single byte?

~~~
EdSchouten
It’s likely because std::fill() is a generic function that can be used to fill
any list with any kind of value (even class instances).

There is likely some template specialization in place so that, when it is
called on a range of 'char', it gets translated into a call to libc's
memset(). That's why the results for the last two are nearly identical.

Alternatively, there is no template specialization going on at all, and the
compiler itself detects that the loop is equivalent to memset() and converts
it accordingly. LLVM/Clang does this through something called the LibCall
optimization pass, IIRC.
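
A hedged sketch of the kind of byte-type dispatch described above (the name
and structure here are illustrative only, not libstdc++'s or libc++'s actual
code, which goes through internal helpers):

```cpp
#include <cstring>
#include <type_traits>

// Illustrative sketch of the fill-to-memset dispatch; real standard
// libraries implement it differently, but the shape is the same.
template <typename T>
void fill_sketch(T* first, T* last, T value) {
    if constexpr (sizeof(T) == 1 && std::is_trivially_copyable_v<T>) {
        // One-byte trivially copyable element: a single memset covers
        // the whole range.
        unsigned char byte;
        std::memcpy(&byte, &value, 1);
        std::memset(first, byte, static_cast<std::size_t>(last - first));
    } else {
        // Generic path: element-by-element assignment.
        for (; first != last; ++first) {
            *first = value;
        }
    }
}
```

Note that std::fill deduces the value type from its third argument, so
std::fill(p, p + n, 0) on a char buffer still takes the generic path: the
value is an int, not a byte type.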

------
adev_
It is a very _bad_ blog post.

Pushing C++ devs toward premature C-style optimization is counter-
productive... And in this case the microbenchmarking is even wrong.

Simply switching to "-O3 -march=native" in the makefile to enable
vectorization, together with the proper use of '\0', inverts the result and
makes std::fill() faster than memset overall.

```
page count: 262144, volume: 1024 MB
std::fill(p, p + n, '\0') : 25.9191 GB/s
memset(p,0,n)             : 25.3283 GB/s
std::fill(p, p + n, '\0') : 25.6167 GB/s
memset(p,0,n)             : 25.1646 GB/s
std::fill(p, p + n, '\0') : 25.8355 GB/s
memset(p,0,n)             : 25.6276 GB/s
```

It would have been much more productive to describe _why_ std::fill with 0 is
slower than with '\0' (int vs char).

~~~
adrianN
Clang produces exactly the same code for both:
[https://godbolt.org/z/C7UTqN](https://godbolt.org/z/C7UTqN), and at -O3 gcc
produces almost the same code for both (though not at -O2).

~~~
adev_
This is also a misinterpretation of the microbenchmark, and it illustrates
pretty well that benchmarking this kind of thing is difficult.

If you analyze the asm generated by the benchmark he proposed, you see
something very different from what the "simple" godbolt compilation of a
micro-isolated function gives you.

- Most calls to "fill" are transparently replaced by a "memset" call
_directly_ by the compiler at high optimization levels, at least in a normal
glibc/GCC Linux configuration.

- It can sometimes even be removed entirely, which is itself a problem (
[https://www.usenix.org/system/files/conference/usenixsecurit...](https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-yang.pdf)
)

- This does not seem to occur on godbolt, for ...reasons.
[https://godbolt.org/z/RGEXH8](https://godbolt.org/z/RGEXH8)

- Enabling vectorization plus certain compiler flags _can_ in some cases get
the memset call replaced by a bunch of SIMD code, depending on your compiler /
compiler options and/or CPU.

- The Intel compiler is (as usual) much more aggressive and replaces
everything with its own home-made memset implementation:
[https://godbolt.org/z/7RAYGc](https://godbolt.org/z/7RAYGc)
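
The dead-store point deserves a concrete illustration: a memset on a buffer
that is never read again can legally be elided, which is exactly the
secure-wiping problem the linked USENIX paper describes. One common (though
not the only) workaround is to store through a volatile pointer; a minimal
sketch, not production-grade wiping code:

```cpp
#include <cstddef>

// Stores through a volatile lvalue are observable behavior, so the
// compiler cannot drop them the way it can drop a plain memset whose
// buffer is never read afterwards.
void secure_zero(void* p, std::size_t n) {
    volatile unsigned char* vp = static_cast<volatile unsigned char*>(p);
    while (n--) {
        *vp++ = 0;
    }
}
```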

------
gambiting
Programmers "typically" use std::fill?? Where are those programmers? I've been
a professional C++ programmer for a while now and I've never seen it used over
standard memset, even across different platforms with different architectures
and a multitude of compilers.

------
daxfohl
Seems like something that could be done in HW? IDK much about HW architecture,
but I'm surprised there's no way a CPU can tell the memory to zero out some
range in a single command.

~~~
kristoffer
Well, on e.g. x86 you have the "rep stos" instruction, which pretty much
amounts to memset in HW.

The operation needs to be blocking; otherwise you could of course do it
without the CPU at all (DMA). But you still have to go through the memory
controller and actually write each word of memory.
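
On GCC/Clang targeting x86-64, that instruction can be driven directly from
inline assembly; a sketch (with a memset fallback for other targets, since
this is x86-specific):

```cpp
#include <cstring>

// Zero n bytes at dst with `rep stosb` on x86-64 (GCC/Clang inline
// asm); on other targets, fall back to a plain memset.
void zero_rep_stos(void* dst, std::size_t n) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // rep stosb stores AL into [RDI], RCX times, advancing RDI;
    // the "+D"/"+c" constraints pin dst and n to RDI and RCX.
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)
                     : "a"(0)
                     : "memory");
#else
    std::memset(dst, 0, n);
#endif
}
```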

~~~
DmitryOlshansky
Adding the ability to zero out an arbitrary range directly to a DRAM chip is a
very simple addition HW-wise.

In fact, all-ones is not hard either; there have been experiments giving RAM
the ability to perform simple page-level computations, with promising results.

~~~
gpderetta
Thing is, you still need to zero out any cache line that might be caching
those lines, which would conflict with the CPU accessing the cache. Might as
well just let the CPU do the zeroing.

~~~
DmitryOlshansky
Invalidating cache lines is not that hard, it happens all the time. Refilling
them from DRAM takes time.

------
hoseja
I bet it's some standardese about pointer aliasing preventing memset-like
performance.

~~~
gpderetta
The reason is fully explained in BeOnRope's article. It is due to a missed
optimization in the standard library. The optimizer still fixes the issue at
-O3.

------
beci
I can't imagine any situation where this could be the bottleneck.

~~~
daxfohl
A cloud host launching a new 30GB VM securely?

Edit: nvm they probably do this out of band. But still, things like this.

