
The Hunt for the Fastest Zero - nikbackm
https://travisdowns.github.io/blog/2020/01/20/zero.html
======
heisenbit
Way back in my 6502 days I entered a competition for the fastest sieve
program. My program had self-modifying code running on the zero page. To reset
the 8K of memory, the program employed 24K of memory for the store instructions.

The winning entry left me in the dust: rather than zeroing the memory, the
winning program's array was initialized with a template of the first few
primes. There can be solutions that are even faster than the fastest possible
zeroing.

------
drfuchs
I’ve always wondered how much CPU time and memory bandwidth is taken up by the
OS zeroing out pages before handing them out, as well as programs and
libraries clearing chunks of memory.

I guess it’s enough that I’m surprised that there’s no hardware support for
the memory system to support a way to handle it by itself on command, without
taking up bus bandwidth or CPU cycles. Kind of like old-fashioned REP STOS but
handled off-chip, as it were.

[Added:] Concerning various instructions for clearing whole cache lines in one
go, you still end up with lots of dirty cache lines that have to be sent to
L1, L2, ..., RAM (not to mention the stuff that was previously in those cache
lines), so there’s still lots of bus bandwidth being consumed.

~~~
pingyong
Zeroing memory is probably already really fast compared to other things the
system is doing. Something that happened to me a couple of years ago: I was
writing a benchmark that needed a lot of memory. I just used a vector<char>
for that. Of course, the vector zeroes the memory, so I thought: hey, I don't
actually need it to be zeroed, so why not just use unique_ptr<char[]>? Well,
it turns out (and most people here probably know what's coming next) that when
you reserve memory, you don't actually get any physical pages; they get
faulted in as you access them. And since apparently the access pattern was
worse (or something to that effect), the benchmark ended up running way slower
than it did without explicit zeroing, _even if I put the zeroing into the
benchmark_, which it hadn't been previously. It turns out that compared to
reserving memory, zeroing was essentially inconsequential. So I'm not too
optimistic that there are a lot of gains to be had here.
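
For what it's worth, the two approaches look something like this (a minimal
sketch; the size is illustrative): the vector constructor zero-initializes
every byte, committing physical pages up front, while a plain new char[n]
leaves the memory untouched until first access.

    #include <memory>
    #include <vector>

    int main() {
        const std::size_t n = 1ull << 30;  // 1 GiB, illustrative

        // Zero-initializes every byte: each page is written, so
        // physical memory is committed up front.
        std::vector<char> zeroed(n);

        // Default-initialized: no bytes are written here, so the kernel
        // only commits physical pages as they are first touched.
        std::unique_ptr<char[]> lazy(new char[n]);
    }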

~~~
nwallin
That's not actually what's happening.

When you allocate memory that needs to be initialized to zeroes, it doesn't
actually allocate memory and zero it. Nor does it allocate memory without
mapping the pages, and when a page fault occurs, allocate and zero it.

Here's what actually happens. Upon boot, the OS allocates a page, zeroes it,
and marks it read-only. When pages that need to be zero are requested in the
future, it assigns an address and maps the zero page to it. This one zero page
can be mapped at hundreds or thousands of addresses. If you try to read from
the page, it happily tells you it contains zeroes. So you still haven't
allocated a new physical page for that address, despite the fact that you're
actively using it. Only when you try to write to it does a page fault happen
(because you're trying to write to a read-only page). Then the OS allocates a
new page of physical RAM, zeroes it, and gives you a real physical page to use.

It's clever hacks and abstractions all the way down.
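
A Linux-flavored demonstration of the mechanism (a sketch; the 1 GiB size is
illustrative and error handling is omitted): reading fresh anonymous pages
returns zeroes without committing physical memory, while writing triggers the
copy-on-write faults.

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        const std::size_t len = 1ull << 30;  // 1 GiB of virtual address space
        char* p = static_cast<char*>(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        long sum = 0;
        for (std::size_t i = 0; i < len; i += 4096)
            sum += p[i];        // read faults: every page aliases the one zero page
        std::printf("sum = %ld\n", sum);  // prints 0; resident memory should stay small
        for (std::size_t i = 0; i < len; i += 4096)
            p[i] = 1;           // write faults: real physical pages get allocated now
        return 0;
    }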

~~~
nwmcsween
You're mixing up zeroing here: virtual memory is backed by physical memory,
and that is what gets zeroed. In fact, Linux recently gained zeroing on free
or allocation (init_on_free/init_on_alloc) as a security feature.

------
marcoperaza
If you're routinely zeroing this much memory and the performance matters, you
might benefit from idle-zeroing it. That is, when you need to zero the massive
block, just switch to a different block that has already been zeroed or
partly-zeroed in the background. Whatever hasn't already been zeroed, finish
synchronously. The background thread doing the zeroing would be scheduled with
the lowest priority, so that it only runs when the system otherwise has
nothing to do.
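
A rough sketch of that double-buffer scheme (all names here are hypothetical,
and the platform-specific priority calls are omitted): one buffer is live
while a background thread clears the spare, and a swap finishes any remainder
synchronously.

    #include <algorithm>
    #include <atomic>
    #include <cstring>
    #include <thread>
    #include <utility>
    #include <vector>

    // Hypothetical idle-zeroer (handles a single swap, for brevity).
    struct IdleZeroer {
        std::vector<char> active, spare;
        std::atomic<std::size_t> zeroed{0};  // bytes of `spare` already cleared
        std::atomic<bool> stop{false};
        std::thread bg;

        explicit IdleZeroer(std::size_t n) : active(n), spare(n) {
            // In a real system this thread would run at the lowest
            // scheduling priority (e.g. SCHED_IDLE on Linux).
            bg = std::thread([this] {
                const std::size_t chunk = 1 << 20;  // zero in 1 MiB chunks
                for (std::size_t off = 0; off < spare.size() && !stop; off += chunk) {
                    std::memset(spare.data() + off, 0,
                                std::min(chunk, spare.size() - off));
                    zeroed.store(off + chunk, std::memory_order_release);
                }
            });
        }

        // Swap in a pre-zeroed buffer, finishing any remainder synchronously.
        std::vector<char>& fresh() {
            stop = true;
            bg.join();
            std::size_t done = zeroed.load(std::memory_order_acquire);
            if (done < spare.size())
                std::memset(spare.data() + done, 0, spare.size() - done);
            std::swap(active, spare);
            return active;
        }

        ~IdleZeroer() { if (bg.joinable()) { stop = true; bg.join(); } }
    };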

At first, I thought you might just want to get fresh pages from the kernel
(which are always zeroed), but this answer convinced me that might not
actually be faster because of the overhead from syscalls and fiddling with
virtual memory [https://stackoverflow.com/questions/49896578/fastest-way-
to-...](https://stackoverflow.com/questions/49896578/fastest-way-to-zero-
pages-in-linux) . And Linux doesn't idle-zero or pre-zero pages (though I
believe there's a switch to enable pre-zeroing for purposes of security
hardening), so you're probably gonna end up with the OS doing a synchronous[1]
zeroing anyway.

[1] Synchronous from when you actually first write to each page. My
understanding is that when your process gets new pages, they're all mapped to
a special zero-page and set to copy-on-write. So there is still some
efficiency here in theory: you don't have a long wait for the entire range to
be zeroed all at once and you never have to zero pages that you don't modify.

~~~
fyp
Is that how languages that always give you zeroed arrays (e.g., Java) do it?

~~~
marcoperaza
I doubt it. Needing to zero ultra large chunks of memory often enough that you
care about performance to this extent is a very niche scenario. If I had to
guess, Java probably just zeroes the memory upon garbage collecting it or
before handing it back out; fresh pages from the OS are already zeroed.

But anything is possible. Maybe some big customer used this pattern and now
Java detects it and does something fancy like what I suggested. But really,
this is an unusual scenario for an application.

------
rob74
This goes to show that sometimes it really pays to read the docs: the doc
comment of "fill" says "For char types filling contiguous areas of memory,
this becomes an inline call to @c memset or @c wmemset"...

~~~
gpderetta
Arguably it's still a standard library bug (or rather, a missed optimization).
What triggers the optimization should be the value type of the range, not the
type of the value to be filled in.

The issue here is a quirk of overloading rules: after template argument
deduction, two overloads of __fill_a are viable, one with a deduced 'int' type
for the fill value and another with a 'char' type. Both are perfectly valid
instantiations, but the 'int'-valued one is preferred by the partial ordering
rules, as it does not require a conversion.

I think this might easily be solved by making the type of the value to be
filled in a separate template parameter even for the pointer variant.
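
Something like this hypothetical variant could do it (a sketch of the idea,
not the actual libstdc++ code): deducing the element type and the fill value's
type separately lets std::fill(p, p + n, 0) over char still select the memset
path.

    #include <cstring>
    #include <type_traits>

    // Tp (the element type) and Up (the fill value's type) are deduced
    // independently, so passing an int literal no longer disqualifies
    // the byte-wise specialization.
    template <typename Tp, typename Up>
    typename std::enable_if<sizeof(Tp) == 1 && std::is_integral<Tp>::value &&
                            std::is_convertible<Up, Tp>::value>::type
    fill_bytes(Tp* first, Tp* last, const Up& value) {
        const Tp tmp = value;  // perform the conversion once, up front
        if (first != last)
            std::memset(first, static_cast<unsigned char>(tmp),
                        static_cast<std::size_t>(last - first));
    }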

Also, in addition to pointers, the memset optimization should really be
applied for all contiguous iterators (for example std::vector::iterator).

(Or just use -O3, at which point GCC's idiom recognition turns the loop into a
memset anyway.)

edit: minor fixes and rewording.

~~~
quietbritishjim
> What triggers the optimization should be the value type of the range, not
> the type of the value to be filled in.

As the article says, you have to do at least some checking on the type of the
source value, because it could be a user-defined type with an overloaded
implicit conversion operator (operator char()) that does something non-
trivial. (I'm not completely sure that my comment contradicts what you've just
said, but that's because I don't quite see what you're saying.)
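
For example (an illustrative type, not from the article), a conversion
operator with observable side effects shows why the check is needed:

    #include <algorithm>
    #include <cstdio>

    struct Chatty {
        operator char() const {
            std::puts("converted!");  // observable side effect
            return 0;
        }
    };

    int main() {
        char buf[4];
        // Must print "converted!" once per element assignment, so a
        // single memset would change observable behavior.
        std::fill(buf, buf + 4, Chatty{});
    }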

~~~
gpderetta
Sure, but the enable_if<is_scalar<> > already takes care of it; I'm not
suggesting taking the check away (although it might be possible to relax it to
checking whether a type is trivially constructible and copyable). Now that I
think of it, you probably need to check that both the iterator value type and
the value itself are trivially copyable and constructible, and that the value
is trivially convertible to the iterator value type itself. No wonder the std
library maintainers went for the easier way.

------
wruza

      jmp memset
    

Classic plot twist.

------
eb0la
> If this were C, we would probably reach for memset

Actually I was thinking about bzero().

Seeing memset() made me smile :-)

~~~
dgellow
Would you mind explaining what you mean by that? I don't have much C
experience, and don't understand what makes you smile.

~~~
eb0la
It brings back memories of my old Turbo C 2.0 years
([https://en.wikipedia.org/wiki/Borland_Turbo_C](https://en.wikipedia.org/wiki/Borland_Turbo_C))
;-)

------
BeeOnRope
Author here, happy for any feedback or questions.

~~~
mkbosmans
Your links in the article to the HN discussion point to a different
submission, not this one.

~~~
BeeOnRope
Thanks, it got submitted twice it seems: the links are to the original
submission. I'll update them to this one.

------
thestoicattack
Interestingly, the std::array::fill member function generates identical code
for int and char, I suppose because there's only one overload of fill and it
has to take the element type. No idea if the generated stosq is as fast as
built-in memset: [https://godbolt.org/z/4iYGup](https://godbolt.org/z/4iYGup)
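
For illustration (a sketch; the size is arbitrary), the member function
sidesteps the trap because its parameter already has the element type:

    #include <array>

    int main() {
        std::array<char, 4096> buf;
        // The only overload is fill(const char&), so the int literal 0
        // is converted to char before the fill ever runs; there is no
        // int-typed instantiation to fall into.
        buf.fill(0);
    }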

~~~
BeeOnRope
Recent glibc seems to use 'rep stosb' for largish regions of memset. At least
for the numbers I give in this post, the ~30 bytes/cycle is actually coming
from rep stosb inside memset.

The q variants (as opposed to b) are a bit of a grey area: Intel has this
ERMSB thing [1] that promises fast performance for rep movsb and rep stosb
specifically, but not for the w, d or q variants. However, I think all Intel
hardware that has implemented it has made the d and q variants just as fast.
It would be good to verify it, though...
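
For reference, the instruction in question is easy to invoke directly; a
GCC/Clang inline-asm sketch for x86-64 (the function name is made up):

    #include <cstddef>

    // rep stosb stores AL into [RDI], RCX times; the constraints tell
    // the compiler those registers are modified and memory is clobbered.
    void zero_rep_stosb(void* dst, std::size_t len) {
        asm volatile("rep stosb"
                     : "+D"(dst), "+c"(len)
                     : "a"(0)
                     : "memory");
    }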

---

[1]
[https://stackoverflow.com/q/43343231](https://stackoverflow.com/q/43343231)

------
sradman
Daniel Lemire compares std::fill in C++ with memset in C, in agreement with
Travis Downs: [https://lemire.me/blog/2020/01/20/filling-large-arrays-
with-...](https://lemire.me/blog/2020/01/20/filling-large-arrays-with-zeroes-
quickly-in-c/)

------
PeterHacker123
Thumbs up, but at least on macOS both are equally fast :-(

~~~
BeeOnRope
Probably you are using clang, right? As mentioned, clang does the _idiom
recognition_ even at -O2.

Try at -O1.

If it's still fast at -O1, I guess the libc++ implementation is different
(AFAIK clang uses libstdc++ on Linux but libc++ on OSX).

------
pharrington
Types are meaningful. At least to me, the lesson here is to be mindful of your
types when invoking idioms you've learned.

~~~
BeeOnRope
Agreed, but it is especially insidious here because for integer literals like
0 or 1 it is common to simply assign or pass them directly to other integer
types, and 99% of the time C++ does the right thing. Here it still does the
right thing in that the code is correct, but you walk off a performance cliff.
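
The cliff in code form (per the article: on GCC at -O2 the int-valued call
compiles to an element-wise loop, while the char-valued call becomes a memset;
the buffer size is illustrative):

    #include <algorithm>

    char buf[1 << 20];

    void slow() { std::fill(buf, buf + sizeof(buf), 0); }    // 0 is an int: element loop
    void fast() { std::fill(buf, buf + sizeof(buf), '\0'); } // char value: memset path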

------
PaulDavisThe1st
another day, another reason I am glad I only know _some_ C++ idioms. If I had
a byte-oriented block of data, it would never occur to me to use std::fill()
... because it would never occur to me to use std::fill() for anything at all!
:)

------
gumby
As an anonymous `mmap()` returns zeroed pages, it's likely the fastest
mechanism for large arrays.

~~~
BeeOnRope
The kernel doesn't really have any tricks that userspace doesn't have to zero
arrays quickly. In fact, it is somewhat hamstrung when zeroing since use of
SIMD instructions in the kernel is discouraged and generally avoided. It
usually ends up using `rep stosb` which is nice and fast on modern Intel (up
to 32 bytes/cycle for AVX-supporting boxes), so that's not currently a
problem, but it was slower in the past.

Furthermore, this doesn't help you for repeatedly zeroing existing memory,
which is probably the most important case: are you going to mmap fresh pages
every time you want zeros there, rather than just zeroing the memory you have
in userspace?

~~~
nwallin
This isn't actually true. The kernel has two giant tricks up its sleeve when
it comes to serving zero pages: the kernel gets to decide what happens when a
page fault happens, and it gets to decide what to tell the mmu.

At boot, the kernel allocates a physical page of RAM, zeroes it, and marks it
read only.

Some time later, a user process requests a zeroed page. The kernel finds a new
address in virtual memory, and tells the mmu to point that address to the
zeroed page.

The user process reads from the page, and gets zeroes back, because that's
what's in the zero page. And there's nothing wrong with reading a read only
page.

Then the user tries to write into the zero page, and a page fault happens,
because the page is read only. The kernel allocates new physical RAM, zeroes
it, and updates the mmu mapping.

This doesn't appear to do any less work, but gigabytes of virtual memory can
fit in 4 kB of cache, because it's all the same page. So it's way faster. One
of my favorite benchmarks of all time is creating an identity matrix by using
malloc, memsetting it to zero, and assigning ones to the diagonal, vs. using
calloc (which per its specification zeroes the memory) and assigning ones to
the diagonal, then comparing the run time of multiplying the two identity
matrices by some other matrix. The calloc identity matrix is multiple orders
of magnitude faster.
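
A sketch of that benchmark's setup phase (function names made up; error
handling omitted): the malloc+memset version dirties every page, while the
calloc version only touches the diagonal's pages and leaves the rest aliased
to the shared zero page.

    #include <cstdlib>
    #include <cstring>

    double* identity_memset(std::size_t n) {
        double* m = static_cast<double*>(std::malloc(n * n * sizeof(double)));
        std::memset(m, 0, n * n * sizeof(double));  // writes every page: all RAM committed
        for (std::size_t i = 0; i < n; ++i) m[i * n + i] = 1.0;
        return m;
    }

    double* identity_calloc(std::size_t n) {
        double* m = static_cast<double*>(std::calloc(n * n, sizeof(double)));
        // Only the diagonal's pages are written; the untouched remainder
        // keeps aliasing one shared zero page, which stays cache-resident.
        for (std::size_t i = 0; i < n; ++i) m[i * n + i] = 1.0;
        return m;
    }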

~~~
BeeOnRope
Sure, I know how the zero page works - so for "sparse" scenarios like your
diagonal matrix examples, it works great.

You can of course also implement this trick in userspace, or at least without
relying on the zero CoW thing on fault, e.g., by mapping /dev/zero.

Sometimes the zero page thing comes out worse: if you first read from a zero
page and then ultimately write to it, for every page, you take 2x the number
of faults.
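
The /dev/zero variant mentioned above would look something like this (a POSIX
sketch; the size is illustrative and error handling is omitted):

    #include <fcntl.h>
    #include <sys/mman.h>

    int main() {
        int fd = open("/dev/zero", O_RDWR);
        // A private mapping of /dev/zero behaves like demand-zero memory:
        // reads see zeroes, and the first write to each page faults in a
        // private copy, much like the anonymous zero-page mechanism.
        char* p = static_cast<char*>(mmap(nullptr, 1 << 20,
                                          PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE, fd, 0));
        p[0] = 1;  // copy-on-write fault allocates a real page
        return 0;
    }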

------
underdeserver
Another leaky abstraction.

------
alecco
Discussion on cpp subreddit
[https://www.reddit.com/r/cpp/comments/erialk/the_hunt_for_th...](https://www.reddit.com/r/cpp/comments/erialk/the_hunt_for_the_fastest_zero/)

