Hacker News new | past | comments | ask | show | jobs | submit login
Filling large arrays with zeroes quickly in C++ (lemire.me)
33 points by ibobev on Jan 21, 2020 | hide | past | favorite | 30 comments

BeOnRope article linked in the post is the better article in this case: https://travisdowns.github.io/blog/2020/01/20/zero.html

...by calling memset(). The referenced article [1] is much clearer and more detailed. The bit in that article about lowering std::fill to memset but only with character types is interesting, e.g.

[1] https://travisdowns.github.io/blog/2020/01/20/zero.html

In my opinion the most interesting part - that does not even get addressed in the article - is this:

    std::fill(p, p + n, 0)      : 1.71732 GB/s
    std::fill(p, p + n, '\0')   : 30.3459 GB/s
    memset(p,0,n)               : 30.419 GB/s
Ok, memset ist fast, alright. But why the difference between simply setting 0 vs null char/byte? Is this because 0 is actually larger than a single byte?

It’s likely because std::fill() is a generic function that can be used to fill any list with any kind of value (even class instances).

There likely is some template specialization in place, that if it’s called with a ‘char’ list, that it gets translated into a call into libc’s memset(). That’s why the results for the last two are close to identical.

Alternatively, there is no template specialization going on, but it’s the case that the compiler detects that this is identical to memset() and converts it accordingly. LLVM/Clang does this through something called the LibCall optimization pass, IIRC.

The linked article goes into more detail. One calls memset and the other doesn't.

I would actually consider this a bug of the std library implementation. Would love to see a comparison between different implementations (libc++, libstdc++, MSVC)

In fact all three major compilers manage to use memset in all cases, see here: https://godbolt.org/z/hVJfPE

It is a very bad blog post.

Pushing C++ dev to make premature C-style type of optimization is counter-productive... And in this case it is even wrong microbenchmarking.

A simple switch to "-O3 -march=native" flag switch in the makefile to enable vectorization and a proper use of '\0' inverse the result make std::fill() globally faster than memset.


page count: 262144, volume:1024 MB

std::fill(p, p + n, '\0') : 25.9191 GB/s

memset(p,0,n) : 25.3283 GB/s

std::fill(p, p + n, '\0') : 25.6167 GB/s

memset(p,0,n) : 25.1646 GB/s

std::fill(p, p + n, '\0') : 25.8355 GB/s

memset(p,0,n) : 25.6276 GB/s


It would have been much more productive to describe why std::fill with '0' is slower than with '\0' ( integer vs char )

Clang produces exactly the same code for both: https://godbolt.org/z/C7UTqN and at -O3 gcc produces almost the same code for both (not at O2 though)

This is also mis-interpretation of micro-benchmark and illustrate pretty well that benchmarking this kind of thing is difficult.

If you analyze the ASM generated by the benchmark he proposed, you see something very different of what the godbolt "simple" compilation of micro-isolated function give you.

- Most call to "fill" are generally transparently replace by a "memset" call directly by the compiler in high level of optimization when done in a normal glibc/GCC linux configuration.

- It can even being removed entirely sometimes and this is also a problem ( https://www.usenix.org/system/files/conference/usenixsecurit... )

- This does not seem to occur on godbolt for ...reasons. https://godbolt.org/z/RGEXH8

- Enable vectorization + some compiler flags can in some case make the memset call replaced by a bunch of SIMD code depending of your compiler / compiler options and/or cpu.

- The intel compiler is (like usual) much more aggressive and replace everything by their home-made memset implementation https://godbolt.org/z/7RAYGc

> It is a very bad blog post. [...] > It would have been much more productive to describe why std::fill with '0' is slower than with '\0' ( integer vs char )

Lemire actually links to BeOnRope post that fully explains the reason. He probably felt that repeating the explanation didn't add anything.

I wonder what he did think he was adding?

Well, Lemire blog was actually ranked high on HN, while BeOnRope post didn't show up until, I assume, dang manually fixed the ranking. At the very least, Lemire blogging it increased visibility (it certainly put it on my radar).

Programmers "typically" use std::fill?? Where are those programmers? I've been a professional C++ programmer for a while now and I've never seen it used over standard memset, even on different platforms with different architectures and multitude of compilers.

Seems like something that could be done in HW? IDK much about HW architecture, but I'm surprised there's no way a CPU can tell the memory to zero out some range in a single command.

Well, on e.g. x86 you have "rep stos" instruction which pretty much amounts to memset in HW.

The operation needs to be blocking otherwise you could of course do it without CPU at all (DMA). But you still have to go through the memory controller and actually write each word of memory.

Adding an ability to zero-out arbitrary range directly to dram chip is very simple addition HW-wise.

In fact all ones is not hard either, there were some experiments to give RAM ability to perform simple page-level computations with promising results.

Thing is, you still need to zero out any cache line that might be caching those lines, which would conflict with the CPU accessing the cache. Might as well just let the cpu doing the zeroing.

Invalidating cache lines is not that hard, it happens all the time. Refilling them from DRAM takes time.

Do you have a link where I can read more?

I also wonder, for typical workloads, what percent of CPU time is spent zeroing pages.


The amount of time wasted zeroing out memory pages in a typical OS is quite significant, also take into account that such an operation will trash perfectly good cache space for no good reason.

Why can't memory modules be smarter? Or wouldn't it make a difference? i.e. is the cpu-to-memory bus just as fast as any action exclusively inside the memory module itself could be?

You can only address one row/column at a time in a normal DDR memory. Start by reading this if you want to learn more: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

In the end memory design is limited by the laws of physics.

Ah, I never knew that was how it worked. Thanks! Still, it seems like there's a way it could be redesigned such that zeroing large blocks is done completely in the module and is faster, and provide an extra line for it. With the naive (and probably wrong) implementation I'm thinking, you'd get log N zeroing but bigger constant multiplier on lookup, so that would be the wrong compromise in just about every situation. Maybe you could generify it so that you could do various operations beyond just zeroing and that might be interesting? Though I'm sure this is a plenty well explored topic already, and has come up with what we have now.

I bet it's some standardese about pointer aliasing preventing memset-like performance.

The reason is fully explained in BeOnRope article. It is due to a missed optimization in the standard library. The optimizer still fixes the issue at O3.

I can't imagine any situation, when this could be the bottleneck.

A cloud host launching a new 30GB VM securely?

Edit: nvm they probably do this out of band. But still, things like this.

when thousands of megabytes allocated then deallocated for multiple users, this indeed cause impact on your application.

simply running spring-framework tests are enough to demonstrate the impact (of course jvm with related arguments)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact