
A queue of page faults (2014) - vmorgulis
http://zeuxcg.org/2014/12/21/page-fault-queue/
======
MichaelGG
On page zeroing, why do the kernel and CPU need to do this like any regular
write? Shouldn't there be some fast-path instruction+RAM support to "zero this
range"? The kernel can guarantee the page/lines are not in use elsewhere,
right? So such an instruction could just wipe everything out. Or is writing
0s already detected and optimized?

Same question for memcpy and the like: shouldn't that be a CPU instruction
instead of special code that tries to detect which CPU it's running on and
emit optimized instructions?

Is my model of these costs that far off?

~~~
amluto
Unless you have a system with some kind of per-page tag in memory that
indicates "should be zero", the CPU still has to write all the zeros to RAM.
This is limited by memory bandwidth.

On newer Intel CPUs, REP STOSB is highly optimized, but you still have to do
the writes.
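
A minimal sketch of what that looks like, assuming x86-64 and GCC/Clang-style
inline asm (the constraints are the standard RDI/RCX/RAX bindings REP STOSB
expects):

    #include <stddef.h>

    /* Zero len bytes at buf with REP STOSB. On CPUs with enhanced
     * REP MOVSB/STOSB ("ERMSB"), the microcode streams whole cache
     * lines, but every byte still has to reach memory. */
    static void zero_rep_stosb(void *buf, size_t len)
    {
        __asm__ volatile ("rep stosb"
                          : "+D"(buf), "+c"(len)  /* RDI = dest, RCX = count */
                          : "a"(0)                /* AL = fill byte */
                          : "memory");
    }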

~~~
Someone
Because DRAM is (relatively) so slow, avoiding the _reads_ is a big gain. If
you zero a large range of memory the naive way, the first write to each cache
line _reads_ a cache line of data that likely will be overwritten by zeroes.

The problem is: how does the CPU know it doesn't have to do that read? REP
STOSB may skip it, but that isn't trivial to implement, as neither the start
nor the end of the range to be processed needs to lie on a cache line
boundary.

PowerPC has the dcbz and dcbzl instructions which zero a cache line without
reading the to-be-zeroed data (dcbzl was invented because real-world code
assumed dcbz always zeroed 32 bytes. See
[http://lists.apple.com/archives/darwin-drivers/2005/Apr/msg0...](http://lists.apple.com/archives/darwin-drivers/2005/Apr/msg00142.html))
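
A sketch of how that is typically wrapped for PowerPC targets (assuming
GCC/Clang inline asm; the pointer must be cache-line aligned):

    /* Zero one cache line without the read-for-ownership a normal
     * store would trigger. With rA = 0, the effective address is just
     * the pointer in %0. The line is 32 bytes on pre-G5 parts; dcbzl
     * zeroes the full 128-byte line on the 970. */
    static inline void dcbz_line(void *p)
    {
        __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
    }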

~~~
neopallium
The processor will try to do write-combining to fill entire cache lines, so
it doesn't need to do a memory read. GCC already has a number of intrinsics
for streaming data directly to RAM (which helps avoid evicting more important
data from the cache).

The cache can be bypassed when doing a lot of writes. See section "Bypassing
the cache" of [0].

That article says that memset() already uses cache-bypassing instructions for
large blocks.
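
A sketch of that cache-bypassing approach using SSE2 non-temporal stores
(_mm_stream_si128); buf is assumed 16-byte aligned and len a multiple of 16,
a real version would handle the ragged edges:

    #include <emmintrin.h>
    #include <stddef.h>

    /* Zero a large buffer with non-temporal stores. The writes go
     * through the write-combining buffers straight to RAM instead of
     * pulling each line into the cache first. */
    static void zero_streaming(void *buf, size_t len)
    {
        __m128i zero = _mm_setzero_si128();
        char *p = buf;
        for (size_t i = 0; i < len; i += 16)
            _mm_stream_si128((__m128i *)(p + i), zero);
        _mm_sfence();  /* order the streamed stores before later accesses */
    }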

0. [https://lwn.net/Articles/255364/](https://lwn.net/Articles/255364/)

------
twoodfin
I'm surprised this article doesn't mention one typical solution to the problem
of massive numbers of expensive page faults for large allocations: Pages
bigger than 4K, which are supported on most modern operating systems. On Linux
they go by the name "huge pages":

[https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt)
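
A minimal sketch of asking for one explicitly (assuming a 2 MB huge page size
and that pages have been reserved, e.g. via /proc/sys/vm/nr_hugepages;
transparent huge pages via madvise(MADV_HUGEPAGE) are the no-setup
alternative):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;  /* one 2 MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");  /* none reserved? */
            return 1;
        }
        ((char *)p)[0] = 1;  /* one fault maps 2 MB instead of 4 KB */
        munmap(p, len);
        return 0;
    }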

