Filling large arrays with zeroes quickly in C++

gpderetta · on Jan 21, 2020

BeOnRope article linked in the post is the better article in this case: https://travisdowns.github.io/blog/2020/01/20/zero.html

dang · on Jan 21, 2020

Currently discussed at https://news.ycombinator.com/item?id=22104576.

wyldfire · on Jan 21, 2020

...by calling memset(). The referenced article [1] is much clearer and more detailed. The bit in that article about lowering std::fill to memset but only with character types is interesting, e.g.

[1] https://travisdowns.github.io/blog/2020/01/20/zero.html

numlock86 · on Jan 21, 2020

In my opinion the most interesting part - that does not even get addressed in the article - is this:

    std::fill(p, p + n, 0)      : 1.71732 GB/s
    std::fill(p, p + n, '\0')   : 30.3459 GB/s
    memset(p,0,n)               : 30.419 GB/s

Ok, memset ist fast, alright. But why the difference between simply setting 0 vs null char/byte? Is this because 0 is actually larger than a single byte?

EdSchouten · on Jan 21, 2020

It’s likely because std::fill() is a generic function that can be used to fill any list with any kind of value (even class instances).

There likely is some template specialization in place, that if it’s called with a ‘char’ list, that it gets translated into a call into libc’s memset(). That’s why the results for the last two are close to identical.

Alternatively, there is no template specialization going on, but it’s the case that the compiler detects that this is identical to memset() and converts it accordingly. LLVM/Clang does this through something called the LibCall optimization pass, IIRC.

wyldfire · on Jan 21, 2020

The linked article goes into more detail. One calls memset and the other doesn't.

simfoo · on Jan 21, 2020

I would actually consider this a bug of the std library implementation. Would love to see a comparison between different implementations (libc++, libstdc++, MSVC)

simfoo · on Jan 21, 2020

In fact all three major compilers manage to use memset in all cases, see here: https://godbolt.org/z/hVJfPE

adev_ · on Jan 21, 2020

It is a very bad blog post.

Pushing C++ dev to make premature C-style type of optimization is counter-productive... And in this case it is even wrong microbenchmarking.

A simple switch to "-O3 -march=native" flag switch in the makefile to enable vectorization and a proper use of '\0' inverse the result make std::fill() globally faster than memset.

```

page count: 262144, volume:1024 MB

std::fill(p, p + n, '\0') : 25.9191 GB/s

memset(p,0,n) : 25.3283 GB/s

std::fill(p, p + n, '\0') : 25.6167 GB/s

memset(p,0,n) : 25.1646 GB/s

std::fill(p, p + n, '\0') : 25.8355 GB/s

memset(p,0,n) : 25.6276 GB/s

```

It would have been much more productive to describe why std::fill with '0' is slower than with '\0' ( integer vs char )

adrianN · on Jan 21, 2020

Clang produces exactly the same code for both: https://godbolt.org/z/C7UTqN and at -O3 gcc produces almost the same code for both (not at O2 though)

adev_ · on Jan 21, 2020

This is also mis-interpretation of micro-benchmark and illustrate pretty well that benchmarking this kind of thing is difficult.

If you analyze the ASM generated by the benchmark he proposed, you see something very different of what the godbolt "simple" compilation of micro-isolated function give you.

- Most call to "fill" are generally transparently replace by a "memset" call directly by the compiler in high level of optimization when done in a normal glibc/GCC linux configuration.

- It can even being removed entirely sometimes and this is also a problem ( https://www.usenix.org/system/files/conference/usenixsecurit... )

- This does not seem to occur on godbolt for ...reasons. https://godbolt.org/z/RGEXH8

- Enable vectorization + some compiler flags can in some case make the memset call replaced by a bunch of SIMD code depending of your compiler / compiler options and/or cpu.

- The intel compiler is (like usual) much more aggressive and replace everything by their home-made memset implementation https://godbolt.org/z/7RAYGc

gpderetta · on Jan 21, 2020

> It is a very bad blog post. [...] > It would have been much more productive to describe why std::fill with '0' is slower than with '\0' ( integer vs char )

Lemire actually links to BeOnRope post that fully explains the reason. He probably felt that repeating the explanation didn't add anything.

dspig · on Jan 21, 2020

I wonder what he did think he was adding?

gpderetta · on Jan 21, 2020

Well, Lemire blog was actually ranked high on HN, while BeOnRope post didn't show up until, I assume, dang manually fixed the ranking. At the very least, Lemire blogging it increased visibility (it certainly put it on my radar).

gambiting · on Jan 21, 2020

Programmers "typically" use std::fill?? Where are those programmers? I've been a professional C++ programmer for a while now and I've never seen it used over standard memset, even on different platforms with different architectures and multitude of compilers.

daxfohl · on Jan 21, 2020

Seems like something that could be done in HW? IDK much about HW architecture, but I'm surprised there's no way a CPU can tell the memory to zero out some range in a single command.

kristoffer · on Jan 21, 2020

Well, on e.g. x86 you have "rep stos" instruction which pretty much amounts to memset in HW.

The operation needs to be blocking otherwise you could of course do it without CPU at all (DMA). But you still have to go through the memory controller and actually write each word of memory.

DmitryOlshansky · on Jan 21, 2020

Adding an ability to zero-out arbitrary range directly to dram chip is very simple addition HW-wise.

In fact all ones is not hard either, there were some experiments to give RAM ability to perform simple page-level computations with promising results.

gpderetta · on Jan 21, 2020

Thing is, you still need to zero out any cache line that might be caching those lines, which would conflict with the CPU accessing the cache. Might as well just let the cpu doing the zeroing.

DmitryOlshansky · on Jan 30, 2020

Invalidating cache lines is not that hard, it happens all the time. Refilling them from DRAM takes time.

wildmusings · on Jan 21, 2020

Do you have a link where I can read more?

I also wonder, for typical workloads, what percent of CPU time is spent zeroing pages.

DmitryOlshansky · on Jan 30, 2020

https://parallel.princeton.edu/papers/micro19-gao.pdf

The amount of time wasted zeroing out memory pages in a typical OS is quite significant, also take into account that such an operation will trash perfectly good cache space for no good reason.

daxfohl · on Jan 21, 2020

Why can't memory modules be smarter? Or wouldn't it make a difference? i.e. is the cpu-to-memory bus just as fast as any action exclusively inside the memory module itself could be?

kristoffer · on Jan 21, 2020

You can only address one row/column at a time in a normal DDR memory. Start by reading this if you want to learn more: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

In the end memory design is limited by the laws of physics.

daxfohl · on Jan 21, 2020

Ah, I never knew that was how it worked. Thanks! Still, it seems like there's a way it could be redesigned such that zeroing large blocks is done completely in the module and is faster, and provide an extra line for it. With the naive (and probably wrong) implementation I'm thinking, you'd get log N zeroing but bigger constant multiplier on lookup, so that would be the wrong compromise in just about every situation. Maybe you could generify it so that you could do various operations beyond just zeroing and that might be interesting? Though I'm sure this is a plenty well explored topic already, and has come up with what we have now.

hoseja · on Jan 21, 2020

I bet it's some standardese about pointer aliasing preventing memset-like performance.

gpderetta · on Jan 21, 2020

The reason is fully explained in BeOnRope article. It is due to a missed optimization in the standard library. The optimizer still fixes the issue at O3.

beci · on Jan 21, 2020

I can't imagine any situation, when this could be the bottleneck.

daxfohl · on Jan 21, 2020

A cloud host launching a new 30GB VM securely?

Edit: nvm they probably do this out of band. But still, things like this.

pvtmert · on Jan 21, 2020

when thousands of megabytes allocated then deallocated for multiple users, this indeed cause impact on your application.

simply running spring-framework tests are enough to demonstrate the impact (of course jvm with related arguments)