The winning entry left me in the dust - rather than zeroing the memory, the winning program's array was initialized with a template of the first few primes. So there can be solutions even faster than the fastest possible zeroing.
I guess it's just that I'm surprised there's no hardware support for the memory system to handle this by itself on command, without taking up bus bandwidth or CPU cycles. Kind of like old-fashioned REP STOS, but handled off-chip, as it were.
[Added:] Concerning various instructions for clearing whole cache lines in one go, you still end up with lots of dirty cache lines that have to be sent to L1, L2, ..., RAM (not to mention the stuff that was previously in those cache lines), so there’s still lots of bus bandwidth being consumed.
When you allocate memory that needs to be initialized to zeroes, the OS doesn't actually allocate physical memory and zero it up front. Nor does it simply leave the pages unmapped and then, on each page fault, allocate and zero one.
Here's what actually happens. At boot, the OS allocates a page, zeroes it, and marks it read-only. When pages that need to be zero are requested later, it assigns an address and maps the zero page there. This one zero page can be mapped into hundreds or thousands of virtual pages. If you read from such a page, it happily tells you it contains zeroes, so you still haven't been given a new physical page at that address, despite the fact that you're actively using it. Only when you try to write does a page fault happen (because you're writing to a read-only page). At that point the OS allocates a new page of physical RAM, zeroes it, and gives you a real physical page to use.
It's clever hacks and abstractions all the way down.
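You can actually watch this happen from userspace by counting minor page faults. A rough sketch (Linux-specific; exact counts depend on the kernel and on things like transparent huge pages):

    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <cstddef>
    #include <cstdio>

    // Returns the process's minor (no-I/O) page fault count so far.
    static long minor_faults() {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main() {
        const std::size_t len = std::size_t(1) << 30;   // 1 GiB
        char* p = static_cast<char*>(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        if (p == MAP_FAILED) return 1;
        long base = minor_faults();                      // mmap alone faults almost nothing in

        long sum = 0;
        for (std::size_t i = 0; i < len; i += 4096) sum += p[i];   // reads map the shared zero page
        long after_reads = minor_faults();

        for (std::size_t i = 0; i < len; i += 4096) p[i] = 1;      // writes break CoW, get real pages
        long after_writes = minor_faults();

        std::printf("sum=%ld faults: base=%ld after reads=%ld after writes=%ld\n",
                    sum, base, after_reads, after_writes);
    }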
Most ISAs provide a "zero this cache line" instruction: https://en.wikipedia.org/wiki/Cache_control_instruction#Data...
I remember using `dcbz` (Data Cache Block Set to Zero) on PowerPC to quickly zero-out big chunks of memory.
Not much, since on any modern system (that supports basic memory management in hardware), the OS has a single, special page full of zeroes, configured to trap on write, that it maps in every time a new page is requested. Only when a write occurs (and is trapped) does the OS substitute an actual page.
It's also why on some of these systems calloc is faster than malloc + memset: calloc may be special-cased not to zero anything on the userland side if it allocates a brand-new set of pages, while memset will always trigger the CoW (and write useless zeroes as well).
Granted, the OS still needs to write zeroes a lot to implement the actual CoW operation, but it occurs less often than you would think.
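A rough way to see the difference on a given system (a sketch only: whether calloc takes the fast path depends on the allocator and on the allocation being served by brand-new pages from the OS):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main() {
        const std::size_t n = std::size_t(1) << 30;      // 1 GiB
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        char* a = static_cast<char*>(calloc(n, 1));      // may just map CoW zero pages
        auto t1 = clock::now();

        char* b = static_cast<char*>(malloc(n));
        memset(b, 0, n);                                 // forces every page to be written
        auto t2 = clock::now();

        std::printf("calloc: %.0f ms, malloc+memset: %.0f ms (a[0]=%d b[0]=%d)\n",
                    std::chrono::duration<double, std::milli>(t1 - t0).count(),
                    std::chrono::duration<double, std::milli>(t2 - t1).count(),
                    a[0], b[0]);                         // touch the results so nothing is optimized out
        free(a);
        free(b);
    }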
Unless you are zeroing a really big chunk of memory, you expect that it might already be in the cache and you almost always want it in the cache afterwards, so pushing it down to main memory is often counter-productive: you want it close to the core.
Zeroing memory does take up a non-negligible amount of time in some workflows, especially those that make sparse use of the memory (e.g., they don't end up writing all of it in the end: think a large-radix histogram where many counters stay zero, or allocations that are larger than needed and don't get fully used). That said, zeroing can be pretty fast, as this post shows: around 30 bytes per cycle, which on a 4 GHz CPU is 120 GB/s. AVX-512 doubles that to ~60 bytes/cycle.
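(A rough way to get a number for your own machine, using a cache-resident buffer so you measure the core's zeroing rate rather than DRAM bandwidth; the empty asm is a GCC/Clang barrier to keep the stores from being optimized away:)

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        const std::size_t n = 256 * 1024;                // roughly L2-sized
        const int reps = 100000;
        std::vector<char> buf(n);

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i) {
            std::memset(buf.data(), 0, n);
            asm volatile("" :: "r"(buf.data()) : "memory");   // compiler barrier
        }
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%.1f GB/s\n", double(n) * reps / secs / 1e9);
    }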
At first, I thought you might just want to get fresh pages from the kernel (which are always zeroed), but this answer convinced me that might not actually be faster because of the overhead from syscalls and fiddling with virtual memory https://stackoverflow.com/questions/49896578/fastest-way-to-... . And Linux doesn't idle-zero or pre-zero pages (though I believe there's a switch to enable pre-zeroing for purposes of security hardening), so you're probably gonna end up with the OS doing a synchronous zeroing anyway.
 Synchronous from when you actually first write to each page. My understanding is that when your process gets new pages, they're all mapped to a special zero-page and set to copy-on-write. So there is still some efficiency here in theory: you don't have a long wait for the entire range to be zeroed all at once and you never have to zero pages that you don't modify.
For something like a process that starts up, does its job as quickly as possible and then stops, it is less obvious how to do this: you could certainly start another thread to do the zeroing, but if you are willing to add threading you are probably better off just trying to parallelize the underlying work, rather than doing the zeroing asynchronously.
The main downsides to this approach:
1) It is cache unfriendly: if you zero the pages synchronously immediately prior to use, you bring them into cache during the zeroing and then they are hot when you use them on the same thread. For idle zeroing this link is lost so you are likely to bring them into cache twice: once during the zeroing and once during the subsequent use.
2) On NUMA systems, doing the idle zeroing on some thread/CPU distinct from where you will use the memory is an anti-pattern: with the usual "first touch" allocation, the memory will be allocated on the node doing the zeroing, potentially different from the node doing the using, which could be an arbitrarily large penalty over time (as you keep using the remote memory for an arbitrarily long time).
Note that (1) is a common confounding factor when measuring the cost of zeroing. E.g., if you have something like:
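(sketching it, with placeholder names, sizes, and timings:)

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <numeric>

    // (Illustrative sketch only: the buffer size, do_work, and the time units
    //  mentioned below are placeholders.)
    static long do_work(const char* p, std::size_t n) {
        return std::accumulate(p, p + n, 0L);            // stands in for the real computation
    }

    int main() {
        const std::size_t n = std::size_t(1) << 28;      // 256 MiB
        char* buf = static_cast<char*>(malloc(n));
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        memset(buf, 0, n);           // "the cost of zeroing"
        auto t1 = clock::now();
        long r = do_work(buf, n);    // "the cost of the real work"
        auto t2 = clock::now();

        // Naively you charge (t1 - t0) to zeroing and (t2 - t1) to the work, but the
        // zeroing also paid for the page faults and pulled the buffer toward the core,
        // so removing it doesn't buy back anything close to (t1 - t0).
        std::printf("r=%ld zero=%f s work=%f s\n", r,
                    std::chrono::duration<double>(t1 - t0).count(),
                    std::chrono::duration<double>(t2 - t1).count());
        free(buf);
    }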
In some cases do_work alone (with the zeroing removed) can even be slower than the sum of zeroing and do_work (e.g., 1,500 time units), because interleaving the work with the cache misses can be less efficient in an MLP (memory-level parallelism) sense: the work clogs up the OoO buffers, so fewer memory requests fit in the window.
Of course you could have short running servers or long running full-CPU processes too.
Long/short probably are the
But anything is possible. Maybe some big customer used this pattern and now Java detects it and does something fancy like what I suggested. But really, this is an unusual scenario for an application.
Efficient zeroing is something JVMs put some thought into. Well, with a notable exception in HotSpot itself: we were able to substantially improve its performance by using memset() to perform the zeroing.
 (PDF) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.225...
The issue here is a quirk of overloading rules: after template instantiations two overloads of __fill_a are available, one with a deduced 'int' type for the fill value type and another with a 'char' type. Both are perfectly valid instantiations, but the 'int' valued one is preferred by the partial ordering rules as it does not require a conversion.
I think this might easily be solved by making the type of the value to be filled in a separate template parameter even for the pointer variant.
Also, in addition to pointers, the memset optimization should really be applied for all contiguous iterators (for example std::vector::iterator).
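For reference, the difference shows up with something as small as this (assuming libstdc++ and a low optimization level like the article's -O1 builds):

    #include <algorithm>
    #include <cstddef>

    void fill_int_literal(char* p, std::size_t n) {
        std::fill(p, p + n, 0);      // fill value deduced as int: takes the generic element-wise loop
    }

    void fill_char_literal(char* p, std::size_t n) {
        std::fill(p, p + n, '\0');   // fill value is char: dispatches to the memset specialization
    }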
Just use -O3.
edit: minor fixes and rewording.
Eh right, that's something I somewhat misunderstood. I just thought the memset overload didn't match at all because the pointer and value-to-fill types are different, but of course it would match, with a conversion: but the other overload wins as you point out.
> I think this might easily be solved by making the type of the value to be filled in a separate template parameter even for the pointer variant.
Yeah, although you'd need a couple variants of that still with different enable_if conditions to filter the safe cases for this optimization ("complex" value-to-fill types can't use this, for some unclear definition of "complex").
> Also, in addition to pointers, the memset optimization should really be applied for all contiguous iterators (for example std::vector::iterator).
This already works, see fill3 at .
I didn't check, but assumed decaying certain iterators to pointers was the function of the `std::__niter_base(iter)` calls in std::fill.
As the article says, you have to do at least some checking on the type of the source value, because it could be a user-defined type with an overloaded implicit conversion operator (operator char()) that does something non-trivial. (I'm not completely sure that my comment contradicts what you've just said, but that's because I don't quite see what you're saying.)
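For instance, a contrived type of the sort being described:

    #include <cstdio>

    // Converting this to char is observable, so an implementation can't blindly
    // replace a fill with memset (which would perform the conversion once, or not
    // at all, instead of once per element as the generic assignment loop does).
    struct Sneaky {
        operator char() const {
            std::puts("converted");   // non-trivial side effect
            return 0;
        }
    };

    // std::fill(buf, buf + n, Sneaky{}) must not become a plain memset.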
It isn't even correct:
1) I find no evidence wmemset is ever used (e.g., for wchar_t).
2) It doesn't work for "char types", as in the example in this post: the memory is of char type, but when a literal zero (an int) is passed, the optimization breaks.
Actually I was thinking about bzero().
Seeing memset() made me smile :-) given that the manpage on Linux marks bzero() as deprecated and says to use memset() in new programs.
Seeing some copied-and-pasted autoconf setup, for a program that isn't even portable from Linux, valiantly checking to see whether you have a memset() function will make you despair once more.
Or watching autoconf insist that on Debian Linux there is no memset() library function.
Come 1989, this stuff will be standardized, you know. (-:
The q variants (as opposed to b) are a bit of a grey area: Intel has this ERMSB thing that promises fast performance for rep movsb and rep stosb specifically, but not for the w, d or q variants. However, I think all Intel hardware that has implemented it has implemented the d and q variants just as fast. It would be good to verify, though...
TL;DR: benchmark for your specific hardware if it really matters.
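For concreteness, rep stosb zeroing looks roughly like this with GCC/Clang extended asm on x86-64; whether it beats a vector loop on your chip is exactly the thing to benchmark:

    #include <cstddef>

    // Store n zero bytes at dst with a single rep stosb.
    // RDI = destination, RCX = count (both clobbered), AL = the byte value.
    static void zero_rep_stosb(void* dst, std::size_t n) {
        asm volatile("rep stosb"
                     : "+D"(dst), "+c"(n)
                     : "a"(0)
                     : "memory");
    }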
Try at -O1.
If it's still fast at -O1, I guess the libc++ implementation is different (AFAIK clang uses libstdc++ on Linux but libc++ on OSX).
OTOH, if it is asking the OS for fresh memory to satisfy the request, or is satisfying the request from an earlier allocation returned by the OS (which hasn't yet been fully used), it knows the memory is zero and can just return it directly. In that scenario calloc() is as fast as malloc(), yet it provides the additional guarantee of zeroed memory.
Unfortunately, calloc() and C++ don't really play nice together, and C++ has no equivalent concept to calloc() in its allocator model (the same complaint applies to realloc(), which would otherwise be very useful).
For Linux, I'd want to grant MAP_UNINITIALIZED access to unprivileged users (keep reading), but have the kernel actually zero the pages it gives to these unprivileged MAP_UNINITIALIZED users unless it knows that a page doesn't contain privileged information. For example, if a process munmap()s a particular region and then immediately mmap()s a different one, and no other process has ever mapped that page, there's no harm in allowing the process to see its own previous page contents.
Of course, the kernel way has some advantages:
- You are making the memory available to the rest of the system if needed.
- It could give you other types of pages w/o privileged info, e.g., page cache pages from files you have permission to read, or something like that.
Furthermore, this doesn't help you with repeatedly zeroing existing memory, which is probably the most important case: are you going to mmap fresh pages every time you want zeroes in there, rather than just zeroing the memory you already have in userspace?
The kernel does have some tricks available --- it can't clean pages any faster than userspace can, but it can keep a cache of clean pages, and keep that cache filled by cleaning free pages during idle time, including before your program was launched. I'm sure the work the kernel was able to do before your program launched will look great in microbenchmarks, if used right, but if you need to do it repeatedly, munmap/mmap syscall processing time is going to be a lot bigger than rep stosb, unless you have a giant block to zero out and the block is nicely aligned.
Another comment in this general thread suggests that Linux doesn't do idle zeroing, but it's not the only kernel; FreeBSD has idle zeroing .
At boot, the kernel allocates a physical page of RAM, zeroes it, and marks it read only.
Some time later, a user process requests a zeroed page. The kernel finds a new address in virtual memory, and tells the mmu to point that address to the zeroed page.
The user process reads from the page, and gets zeroes back, because that's what's in the zero page. And there's nothing wrong with reading a read only page.
Then the user tries to write into the zero page, and a page fault happens, because the page is read only. The kernel allocates new physical RAM, zeroes it, and updates the mmu mapping.
This doesn't appear to do any less work, but gigabytes of virtual memory can fit in 4kB of cache, because it's all the same physical page. So it's way faster. One of my favorite benchmarks of all time is creating an identity matrix by malloc'ing it, memset'ing it to zero, and then assigning ones to the diagonal, versus using calloc (which per its specification returns zeroed memory) and just assigning ones to the diagonal. Then comparing the run time of multiplying each identity matrix by some other matrix. The calloc identity matrix is multiple orders of magnitude faster.
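A sketch of that setup (the size and names here are placeholders; the multiply is whatever routine you're benchmarking):

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    const std::size_t N = 16384;                         // ~2 GiB of doubles

    double* identity_memset() {
        double* m = static_cast<double*>(malloc(N * N * sizeof(double)));
        memset(m, 0, N * N * sizeof(double));            // dirties a private copy of every page
        for (std::size_t i = 0; i < N; ++i) m[i * N + i] = 1.0;
        return m;
    }

    double* identity_calloc() {
        double* m = static_cast<double*>(calloc(N * N, sizeof(double)));
        for (std::size_t i = 0; i < N; ++i) m[i * N + i] = 1.0;   // only the diagonal's pages get real RAM
        return m;                                         // the rest still alias the one shared zero page
    }

When you multiply by the calloc version, most of the reads hit that single shared physical page, which is why it stays cache-resident.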
You can of course also implement this trick in userspace, or at least without relying on the zero-page CoW thing on fault, e.g., by mapping /dev/zero.
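Something along these lines (error handling omitted; on Linux a plain MAP_ANONYMOUS mapping gets you the same behavior):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    // A private mapping of /dev/zero reads back as zeroes and is copied on write,
    // just like an anonymous mapping.
    void* map_zeroes(std::size_t len) {
        int fd = open("/dev/zero", O_RDWR);
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        close(fd);                     // the mapping remains valid after closing the fd
        return p;
    }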
Sometimes the zero page thing comes out worse: if you first read from a zero page and then ultimately write to it, for every page, you take 2x the number of faults.