Way back in my 6502 days I entered a competition for the fastest sieve program. My program had self-modifying code running on the zero page. To reset the 8K of memory, the program employed 24K of memory for the store instructions.
The winning entry left me in the dust - rather than zeroing the memory, the winning program’s array was initialized with a template of the first few primes. There can be solutions that are even faster than the fastest possible zeroing.
I’ve always wondered how much CPU time and memory bandwidth is taken up by the OS zeroing out pages before handing them out, as well as programs and libraries clearing chunks of memory.
I guess it’s enough that I’m surprised there’s no hardware support for the memory system to handle it by itself on command, without taking up bus bandwidth or CPU cycles. Kind of like old-fashioned REP STOS, but handled off-chip, as it were.
[Added:] Concerning various instructions for clearing whole cache lines in one go, you still end up with lots of dirty cache lines that have to be sent to L1, L2, ..., RAM (not to mention the stuff that was previously in those cache lines), so there’s still lots of bus bandwidth being consumed.
Zeroing memory is probably already really fast compared to other things the system is doing. Something that happened to me a couple years ago: I was writing a benchmark that needed a lot of memory. I just used a vector<char> for that. Of course, the vector zeroes the memory, so I thought hey, I don't actually need it to be zeroed, so why not just use unique_ptr<char[]>? Well it turns out (and most people here probably know what's coming next) that when you reserve memory, you don't actually get any physical pages. They get mapped in as you access them. And since apparently the access pattern was worse (or something to that effect), the benchmark ended up running way slower without the zeroing than with it - and it stayed slower even when I put the zeroing into the timed part of the benchmark, where it hadn't been previously. It turns out that, compared to faulting in the memory, the zeroing was essentially inconsequential. So I'm not too optimistic that there are a lot of gains to be had here.
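A minimal sketch of the two approaches (size hypothetical): the vector value-initializes its elements, touching every page up front, while the unique_ptr version leaves the memory untouched until first access:

    #include <cstddef>
    #include <memory>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t{1} << 30;  // 1 GiB, hypothetical

        // Zeroes every byte, which also faults in every page now.
        std::vector<char> zeroed(n);

        // Leaves the bytes indeterminate; pages are faulted in lazily,
        // spreading that cost over the benchmark's first accesses.
        std::unique_ptr<char[]> raw(new char[n]);

        return zeroed[0];  // keep both allocations live
    }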
When you allocate memory that needs to be initialized to zeroes, the OS doesn't actually allocate memory and zero it right away. Nor does it allocate memory without mapping the pages and then, when a page fault occurs, allocate and zero them.
Here's what actually happens. Upon boot up, the OS allocates a page, zeroes it, and marks it read-only. When pages are requested in the future that need to be zero, it assigns an address and maps the zero page to it. This one zero page could be mapped to hundreds or thousands of virtual pages. If you try to read from the page, it happily tells you it contains zeroes, so you still haven't allocated a new physical page for that address, despite the fact that you're actively using it. Only when you try to write to it does a page fault happen (because you're trying to write to a read-only page). Then the OS allocates a new page of physical RAM, zeroes it, and gives you a real physical page to use.
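A small sketch of the effect on Linux (size hypothetical): the first pass over a fresh anonymous mapping pays a write fault per page, while the second pass is just stores:

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>

    int main() {
        const std::size_t len = 256u << 20;  // 256 MiB, hypothetical
        char* p = static_cast<char*>(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        if (p == MAP_FAILED) return 1;

        auto pass = [&](const char* name) {
            auto t0 = std::chrono::steady_clock::now();
            std::memset(p, 0, len);  // the first call faults in every page
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %.1f ms\n", name,
                        std::chrono::duration<double, std::milli>(t1 - t0).count());
        };
        pass("first  memset (faults + stores)");
        pass("second memset (stores only)");
        munmap(p, len);
    }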
You're mixing up the zeroing here: virtual memory is backed by physical memory, and it's the physical pages that get zeroed. In fact, Linux recently gained zeroing on free or on allocation (the init_on_free/init_on_alloc options) as a security feature.
> I’ve always wondered how much CPU time and memory bandwidth is taken up by the OS zeroing out pages before handing them out
Not much, since on any modern system (that supports basic memory management in hardware), the OS has a single, special page full of zeroes, configured to trap on write, that it maps every time a new page is requested. Only when a write occurs (and is trapped) does the OS substitute an actual page.
It's also why, on some of these systems, calloc is faster than malloc + memset: calloc may be special-cased not to zero anything on the userland side if it allocates a brand new set of pages, while the memset will always trigger the CoW (and write useless zeroes as well).
Granted, the OS still needs to write zeroes a lot to implement the actual CoW operation, but it occurs less often than you would think.
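A rough sketch of that comparison (size hypothetical; on glibc a large calloc is typically served by fresh mmap'd pages, so it can skip the userland zeroing entirely):

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Hypothetical helper: wall-clock milliseconds for f().
    template <typename F>
    static double ms(F f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main() {
        const std::size_t len = 512u << 20;  // 512 MiB, hypothetical

        // malloc + memset: the memset writes every byte, taking a CoW
        // fault on every page along the way.
        char* a = nullptr;
        std::printf("malloc+memset: %.1f ms\n", ms([&] {
            a = static_cast<char*>(std::malloc(len));
            std::memset(a, 0, len);
        }));

        // calloc: brand new pages are known to be zero already, so no
        // userland writes need to happen at all.
        char* b = nullptr;
        std::printf("calloc:        %.1f ms\n", ms([&] {
            b = static_cast<char*>(std::calloc(len, 1));
        }));

        std::free(a);
        std::free(b);
    }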
> I guess it’s enough that I’m surprised that there’s no hardware support for the memory system to support a way to handle it by itself on command, without taking up bus bandwidth or CPU cycles. Kind of like old-fashioned REP STOS but handled off-chip, as it were.
Not as far out as the memory system, but Arm provides 'DC ZVA' (https://developer.arm.com/docs/ddi0595/b/aarch64-system-inst...) which clears a full cache line to zero at once, which is more efficient than a pile of explicit stores. (You'll see from the description that it correctly handles the case of a watchpoint being set on the memory to be cleared, or an attempt to clear non-writeable memory -- getting those cases right would I think be a lot harder to do outside the CPU.)
One issue is that the memory controller is in the CPU itself (what goes on the wire are the low-level line control signals). Also, for small ranges, the memory being zeroed would be in cache anyway, and anything touching the cache would interfere with the CPU, so it is hard to do better than rep stos.
Unless you are zeroing a really big chunk of memory, you expect that it might already be in the cache and you almost always want it in the cache after, so pushing it down to the memory is often counter-productive in this sense: you want it close to the core.
Zeroing memory does take up a non-negligible amount of time in some workloads, especially those that make sparse use of the memory (e.g., don't end up writing all the memory in the end - think a large-radix histogram where many counters stay zero, or allocations that are larger than needed and don't end up getting fully used, etc.). That said, zeroing can be pretty fast, as this post shows: around 30 bytes per cycle, which on a 4 GHz CPU is 120 GB/s. AVX-512 doubles that to ~60 bytes/cycle.
I'm sure there's some DMA going on there. The OS could simply keep a reserved zeroed out page and tell the DMA hardware to copy it wherever when it needs to zero out a page. I'd be rather surprised if tricks like that don't already happen. I'd be doubly surprised if modern DMA hardware doesn't support fancier tricks, if writing a repeated byte pattern is supported you wouldn't even need to reserve a whole page of zeroes, a single byte would do.
If you're routinely zeroing this much memory and the performance matters, you might benefit from idle-zeroing it. That is, when you need to zero the massive block, just switch to a different block that has already been zeroed or partly-zeroed in the background. Whatever hasn't already been zeroed, finish synchronously. The background thread doing the zeroing would be scheduled with the lowest priority, so that it only runs when the system otherwise has nothing to do.
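A minimal sketch of the idea (all names hypothetical): keep a spare block that a background thread zeroes during idle time, swap it in when a fresh zeroed block is needed, and fall back to synchronous zeroing if it isn't ready yet:

    #include <atomic>
    #include <cstring>
    #include <thread>
    #include <utility>
    #include <vector>

    int main() {
        const std::size_t block = std::size_t{1} << 28;  // 256 MiB, hypothetical
        std::vector<char> active(block), spare(block);   // both start zeroed
        std::atomic<bool> spare_ready{true};

        // ... do real work that dirties `active` ...

        // Need a fresh zeroed block: take the spare if it's ready,
        // otherwise finish the zeroing synchronously.
        if (spare_ready.exchange(false)) {
            std::swap(active, spare);
        } else {
            std::memset(active.data(), 0, active.size());
        }

        // Hand the now-dirty spare to a background zeroer. A real system
        // would use one long-lived thread at the lowest scheduling
        // priority rather than spawning and joining here.
        std::thread bg([&] {
            std::memset(spare.data(), 0, spare.size());
            spare_ready.store(true);
        });
        bg.join();
    }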
At first, I thought you might just want to get fresh pages from the kernel (which are always zeroed), but this answer convinced me that might not actually be faster because of the overhead from syscalls and fiddling with virtual memory https://stackoverflow.com/questions/49896578/fastest-way-to-... . And Linux doesn't idle-zero or pre-zero pages (though I believe there's a switch to enable pre-zeroing for purposes of security hardening), so you're probably gonna end up with the OS doing a synchronous[1] zeroing anyway.
[1] Synchronous from when you actually first write to each page. My understanding is that when your process gets new pages, they're all mapped to a special zero-page and set to copy-on-write. So there is still some efficiency here in theory: you don't have a long wait for the entire range to be zeroed all at once and you never have to zero pages that you don't modify.
It's an interesting idea, mostly applicable to long-running processes (e.g., a server) where the concept of "idle time" applies.
For something like a process that starts up, does its job as quickly as possible and then stops, it is less obvious how to do it: you could certainly start another thread to do this, but if you are willing to add threading you are probably better off just trying to parallelize the underlying work, rather than doing the zeroing asynchronously.
The main downsides to this approach:
1) It is cache-unfriendly: if you zero the pages synchronously immediately prior to use, you bring them into cache during the zeroing, and then they are hot when you use them on the same thread. For idle zeroing this link is lost, so you are likely to bring them into cache twice: once during the zeroing and once during the subsequent use.
2) On NUMA systems, doing the idle zeroing on some thread/CPU distinct from where you will use the memory is an anti-pattern: with the usual "first touch" allocation, the memory will be allocated on the node doing the zeroing, potentially different from the node doing the using - which could be an arbitrarily large penalty over time (as you keep using the remote memory for an arbitrarily long time).
Note that (1) is a common confounding factor when measuring the cost of zeroing. E.g., if you have something like:
    zero(mem)
    do_work(mem)
You might measure the zeroing taking 500 (time units) and the work taking 800, a total of 1300. Let's say you find a clever way to totally avoid the zeroing (e.g., maybe do_work didn't actually need zeroed memory): you expect only the 800 time units of work to remain - but you actually measure 1200. The zeroing was doing more than just zeroing: it was bringing all the memory into cache and suffering all the cache and TLB misses, etc. When you remove it, that work just moves over to the subsequent phase.
In some cases do_work alone can even be slower than the sum of zeroing and do_work (e.g., 1,500 time units) - because the interleaving of the work with cache misses can be less efficient in an MLP (memory-level parallelism) sense: the work clogs up the OoO buffers so fewer memory requests fit in the window.
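A sketch of the measurement confound (names and sizes hypothetical): whichever phase touches the memory first pays the page faults and cold misses, so timing the phases in isolation misattributes that cost:

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Hypothetical helper: wall-clock milliseconds for one phase.
    template <typename F>
    static double phase_ms(const char* name, F f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("%s: %8.1f ms\n", name, ms);
        return ms;
    }

    int main() {
        const std::size_t n = 256u << 20;  // 256 MiB, hypothetical
        char* mem = static_cast<char*>(std::malloc(n));
        if (!mem) return 1;

        unsigned long sum = 0;
        // With the zeroing phase present, it soaks up the faults and cache
        // misses; remove it and that cost reappears inside "work".
        phase_ms("zero", [&] { std::memset(mem, 0, n); });
        phase_ms("work", [&] {
            for (std::size_t i = 0; i < n; i += 64) sum += mem[i];  // stand-in workload
        });

        std::free(mem);
        return static_cast<int>(sum & 1);  // keep the loop from being optimized out
    }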
I don't think it's about long-running vs. short-running. That might be a heuristic, but what really matters is whether there is idle CPU time to be had between the moments when you need large zeroed chunks of memory - or, even if not, whether you have some other performance characteristic you want (e.g. stable, consistent throughput) that would benefit from parallelizing the zeroing instead of having to periodically block.
Yes, it was just a terse way of putting it: by long running I really mean something that runs more or less indefinitely, usually serving "external" requests, like a server or daemon vs "short running" where I mean something that is triggered to do a specific job: think a compiler or compressor or whatever.
Of course you could have short running servers or long running full-CPU processes too.
I doubt it. Needing to zero ultra large chunks of memory often enough that you care about performance to this extent is a very niche scenario. If I had to guess, Java probably just zeroes the memory upon garbage collecting it or before handing it back out; fresh pages from the OS are already zeroed.
But anything is possible. Maybe some big customer used this pattern and now Java detects it and does something fancy like what I suggested. But really, this is an unusual scenario for an application.
There's a brilliantly named paper on this: "Why Nothing Matters: The Impact of Zeroing".
Efficient zeroing is something JVMs put some thought into - well, with a notable exception in HotSpot itself: we were able to substantially improve its performance by using memset() to perform the zeroing.
Because if you're needing to repeatedly free giant amounts of RAM, the OS probably doesn't have many zeroed pages and will need to zero them itself. Linux, at least, does not idle-zero, so if you want to take advantage of otherwise idle CPU time to do the zeroing, then you need to take matters into your own hands.
This goes to show that sometimes it really pays to read the docs: the doc comment of "fill" says "For char types filling contiguous areas of memory, this becomes an inline call to @c memset or @c wmemset"...
Arguably it's still a standard library bug (or better, a missed optimization): what triggers the optimization should be the value type of the range, not the type of the value to be filled in.
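For illustration, a minimal repro of the pitfall (whether each call lowers to memset depends on your standard library version, per the article):

    #include <algorithm>

    void slow_zero(char* p, char* end) {
        // The literal 0 is an int, so in affected libstdc++ versions this
        // instantiates the generic element-by-element fill loop.
        std::fill(p, end, 0);
    }

    void fast_zero(char* p, char* end) {
        // A char-typed fill value matches the memset fast path.
        std::fill(p, end, static_cast<char>(0));
    }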
The issue here is a quirk of overloading rules: after template instantiations two overloads of __fill_a are available, one with a deduced 'int' type for the fill value type and another with a 'char' type. Both are perfectly valid instantiations, but the 'int' valued one is preferred by the partial ordering rules as it does not require a conversion.
I think this might easily be solved by making the type of the value to be filled in a separate template parameter even for the pointer variant.
Also, in addition to pointers, the memset optimization should really be applied for all contiguous iterators (for example std::vector::iterator).
> The issue here is a quirk of overloading rules: after template instantiations two overloads of __fill_a are available, one with a deduced 'int' type for the fill value type and another with a 'char' type. Both are perfectly valid instantiations, but the 'int' valued one is preferred by the partial ordering rules as it does not require a conversion.
Eh right, that's something I somewhat misunderstood. I just thought the memset overload didn't match at all because the pointer and value-to-fill types are different, but of course it would match, with a conversion: but the other overload wins as you point out.
> I think this might easily be solved by making the type of the value to be filled in a separate template parameter even for the pointer variant.
Yeah, although you'd need a couple variants of that still with different enable_if conditions to filter the safe cases for this optimization ("complex" value-to-fill types can't use this, for some unclear definition of "complex").
> Also, in addition to pointers, the memset optimization should really be applied for all contiguous iterators (for example std::vector::iterator).
This already works, see fill3 at [1].
I didn't check, but assumed decaying certain iterators to pointers was the function of the `std::__niter_base(iter)` calls in std::fill.
> What triggers the optimization should be the value type of the range, not the type of the value to be filled in.
As the article says, you have to do at least some checking on the type of the source value, because it could be a user-defined type with an overloaded implicit conversion operator (operator char()) that does something non-trivial. (I'm not completely sure that my comment contradicts what you've just said, but that's because I don't quite see what you're saying.)
Sure, but the enable_if<is_scalar<> > already takes care of it; I'm not suggesting taking the check away (although it might be possible to relax it to checking whether a type is trivially constructible and copyable). Now that I think of it, you probably need to check that both the iterator value type and the value itself are trivially copyable and constructible, and that the value is trivially convertible to the iterator value type itself. No wonder the std library maintainers went for the easier way.
> The bzero() function is deprecated (marked as LEGACY in POSIX.1-2001); use memset(3) in new programs. POSIX.1-2008 removes the specification of bzero().
Seeing some copied-and-pasted autoconf setup, for a program that isn't even portable from Linux, valiantly checking to see whether you have a memset() function will make you despair once more.
Or watching autoconf insist that on Debian Linux there is no memset() library function.
Interestingly the std::array::fill member function is identical in the case of int or char, I suppose because there's only one overload of fill and it has to take the element type. No idea if the generated stosq is as fast as built-in memset: https://godbolt.org/z/4iYGup
Recent glibc seems to use 'rep stosb' for largish regions of memset. At least for the numbers I give in this post, the ~30 bytes/cycle is actually coming from rep stosb inside memset.
The q variants (as opposed to b) are a bit of a grey area: Intel has this ERMSB thing [1] that promises fast performance for rep movsb and rep stosb specifically, but not for the w, d or q variants. However, I think all Intel hardware that has implemented it has implemented the d and q variants just as fast. It would be good to verify it though...
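For reference, a minimal rep stosb zeroing routine using GCC/Clang inline asm on x86-64 (a sketch for experimentation, not a memset replacement):

    #include <cstddef>

    // rep stosb stores AL to [RDI], RCX times; "+D" and "+c" pin the
    // operands to those registers and mark them as modified.
    static void rep_stosb_zero(void* dst, std::size_t len) {
        asm volatile("rep stosb"
                     : "+D"(dst), "+c"(len)
                     : "a"(0)
                     : "memory");
    }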
IIRC it's very CPU dependent. I remember running into a similar issue with memcpy vs. memmove where the latter was actually faster on some CPUs because it used rep movsb instead of AVX/SSE, and some Intel CPUs could make that really go fast.
TL;DR: benchmark for your specific hardware if it really matters.
Agreed, but it is especially insidious here because for integer literals like 0 or 1, etc - it is common to simply assign or pass those directly to other integer types, and 99% of the time, C++ does the right thing. Here it still does the right thing in that the code is correct, but you walk off a performance cliff.
Another day, another reason I am glad I only know some C++ idioms. If I had a byte-oriented block of data, it would never occur to me to use std::fill()... because it would never occur to me to use std::fill() for anything at all! :)
Where do you think those zeros come from? The kernel demand faults that mmap, and on each fault, has to find a page containing zero. Sometimes it has to make one.
Yes, if you know you want zeroed memory, calloc() is best, because a smart userspace allocator will have two paths: if it is able to re-use memory already allocated and freed from userspace, it will explicitly zero it (probably with a memset), since it is not (guaranteed to be) zero.
OTOH, if it is asking the OS for fresh memory to satisfy the request, or is satisfying the request from an earlier allocation returned by the OS (which hasn't yet been fully used), it knows the memory is zero and can just return it directly. In that scenario calloc() is as fast as malloc(), yet it provides the additional guarantee of zeroed memory.
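A sketch of those two paths (heavily simplified, hypothetical allocator internals; real allocators track per-chunk whether memory came fresh from the OS):

    #include <cstring>
    #include <sys/mman.h>

    // Hypothetical one-slot "free list" standing in for allocator state.
    static void* free_slot = nullptr;
    static std::size_t free_slot_size = 0;

    static void* my_calloc(std::size_t n) {
        if (free_slot && free_slot_size >= n) {
            // Recycled userspace memory: contents are stale, so it must
            // be zeroed explicitly.
            void* p = free_slot;
            free_slot = nullptr;
            std::memset(p, 0, n);
            return p;
        }
        // Fresh anonymous pages from the OS are guaranteed zero, so this
        // path can skip the memset and costs the same as malloc.
        return mmap(nullptr, n, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    int main() {
        char* p = static_cast<char*>(my_calloc(4096));
        return p == MAP_FAILED ? 1 : p[0];  // p[0] is zero
    }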
Unfortunately, calloc() and C++ don't really play nice together, and C++ has no equivalent concept to calloc() in its allocator model (the same complaint applies to realloc(), which would otherwise be very useful).
Sure. I'm just saying that the kernel doesn't have some kind of magic capability that lets it zero memory faster than userspace can. It does have a small free-list of already-zero pages, granted, but a really big allocation will deplete that list, and that list needs to be replenished somehow.
For Linux, I want to grant MAP_UNINITIALIZED access to unprivileged users (keep reading), but actually zero the pages it gives to these unprivileged MAP_UNINITIALIZED users unless the kernel knows that a page doesn't contain privileged information. For example, if a process munmap()s a particular region and then immediately mmap()s a different one, and no other process has ever mapped that page, there's no harm in allowing the process to see its own previous page content.
Heh, I've had similar thoughts about "safe MAP_UNINITIALIZED", although the demand for this feature is probably pretty low since you can do much the same thing in userspace (just don't return that memory in the first place).
Of course, the kernel way has some advantages:
- You are making the memory available to the rest of the system if needed.
- It could give you other types of pages w/o privileged info, e.g., page cache pages from files you have permission to read, or something like that.
The kernel doesn't really have any tricks that userspace doesn't have to zero arrays quickly. In fact, it is somewhat hamstrung when zeroing since use of SIMD instructions in the kernel is discouraged and generally avoided. It usually ends up using `rep stosb` which is nice and fast on modern Intel (up to 32 bytes/cycle for AVX-supporting boxes), so that's not currently a problem, but it was slower in the past.
Furthermore, this doesn't help you for repeatedly zeroing existing memory, which is probably the most important: are you going to mmap fresh pages every time you want zeros in there rather than just zeroing in userspace the memory you have?
> The kernel doesn't really have any tricks that userspace doesn't have to zero arrays quickly.
The kernel does have some tricks available - it can't clean pages any faster than userspace, but it can keep a cache of clean pages, and keep the cache filled by cleaning free pages during idle, including before your program was launched. I'm sure that work the kernel was able to do before your program launched will look great in microbenchmarks, if used right, but if you need to do it repeatedly, munmap/mmap syscall processing time is going to be a lot bigger than rep stosb, unless you have a giant block to zero out and the block is nicely aligned.
Another comment in this general thread suggests that Linux doesn't do idle zeroing, but it's not the only kernel; FreeBSD has idle zeroing [1].
This isn't actually true. The kernel has two giant tricks up its sleeve when it comes to serving zero pages: it gets to decide what happens on a page fault, and it gets to decide what to tell the mmu.
At boot, the kernel allocates a physical page of RAM, zeroes it, and marks it read only.
Some time later, a user process requests a zeroed page. The kernel finds a new address in virtual memory, and tells the mmu to point that address to the zeroed page.
The user process reads from the page, and gets zeroes back, because that's what's in the zero page. And there's nothing wrong with reading a read only page.
Then the user tries to write into the zero page, and a page fault happens, because the page is read only. The kernel allocates new physical RAM, zeroes it, and updates the mmu mapping.
This doesn't appear to do any less work, but gigabytes of virtual memory can fit in 4 kB of cache, because it's all the same page - so it's way faster. One of my favorite benchmarks of all time is creating an identity matrix by using malloc, memset(0)-ing it, and assigning ones to the diagonal, versus using calloc (which per its specification zeroes the memory) and assigning ones to the diagonal, then comparing the run time of multiplying each identity matrix by some other matrix. The calloc identity matrix is multiple orders of magnitude faster.
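A sketch of that benchmark (dimension hypothetical; after construction, the calloc matrix is still almost entirely backed by the single shared zero page, so reading it during the multiply stays cache-friendly):

    #include <cstdlib>
    #include <cstring>

    const std::size_t n = 8192;  // hypothetical dimension

    // malloc + memset: the memset writes every page, so the whole n*n
    // matrix gets distinct physical frames.
    double* eye_memset() {
        double* m = static_cast<double*>(std::malloc(n * n * sizeof(double)));
        std::memset(m, 0, n * n * sizeof(double));
        for (std::size_t i = 0; i < n; ++i) m[i * n + i] = 1.0;
        return m;
    }

    // calloc: only the pages touched by the diagonal writes get real
    // frames; the rest stay CoW-mapped to the one zero page.
    double* eye_calloc() {
        double* m = static_cast<double*>(std::calloc(n * n, sizeof(double)));
        for (std::size_t i = 0; i < n; ++i) m[i * n + i] = 1.0;
        return m;
    }

    int main() {
        double* a = eye_memset();
        double* b = eye_calloc();
        // ... multiply each by some other matrix and compare timings ...
        std::free(a);
        std::free(b);
    }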
Sure, I know how the zero page works - so for "sparse" scenarios like your diagonal matrix examples, it works great.
You can of course also implement this trick in userspace, or at least without relying on the zero-page CoW thing on fault, e.g., by mapping /dev/zero.
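For example, a sketch of the /dev/zero approach (MAP_PRIVATE means writes don't need write permission on the fd; they just CoW in private pages):

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        const std::size_t len = std::size_t{1} << 24;  // 16 MiB, hypothetical
        int fd = open("/dev/zero", O_RDONLY);
        if (fd < 0) return 1;

        // Reads through this mapping see zeroes without any explicit
        // zeroing; each written page takes a CoW fault and gets a
        // private physical frame.
        char* p = static_cast<char*>(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE, fd, 0));
        close(fd);  // the mapping remains valid after the fd is closed
        if (p == MAP_FAILED) return 1;

        char before = p[0];  // 0, served from the zero page
        p[0] = 1;            // CoW fault: this page gets real backing
        munmap(p, len);
        return before;
    }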
Sometimes the zero page thing comes out worse: if you first read from a zero page and then ultimately write to it, for every page, you take 2x the number of faults.
When I benchmarked that a few years ago, I found the overhead of a syscall made mmap much slower (assuming the C or C++ memory allocator had already received some memory from the OS it could dole out).
My point is that nobody cares about the time to zero out small blocks of memory, but when allocating large blocks you'll typically have to request the memory anyway...in which case don't re-zero it.