When a program needs memory, the operating system maps a physical page of memory (4KB in most cases) into the program's virtual address space.
That page might have been in use by a different program only moments before, so the new program could go poking around in it to find interesting stuff. Like SSH keys or passwords or whatever. So the operating system has a responsibility to tidy up pre-loved pages before giving them out.
Once upon a time, filling 4KB with zeroes was a costly operation. It took a bunch of CPU cycles, and worse, would have trashed memory caches closer to the CPU.
So operating systems tend to have queues of discarded pages, background threads to zero them (when nothing more important is happening), and queues of zeroed pages to hand out when a program wants more memory.
This change is DragonFly BSD saying, "Fuck it, we'll do it live". It turns out that with the speed of modern CPUs and the way memory caches work today, they reckon it's faster to just zero the memory synchronously, when the program requests it. No more queues, no more background work, no more weird cache behaviour.
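For flavour, here is a toy user-space sketch of the two strategies. It is not DragonFly's actual code; the pool, the flags and the function names are all made up, and a real kernel tracks this in its VM page queues. The old way keeps a stock of pre-zeroed pages topped up by idle-time work; the new way just clears the page when it is handed out.

```c
/* Toy sketch: "pre-zeroed queue" vs "zero on demand". Not kernel code;
 * everything here is invented for illustration. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    256

struct page {
    unsigned char mem[PAGE_SIZE];
    bool free;
    bool zeroed;
};

static struct page pool[NPAGES];

/* Old scheme: something like this runs from a background thread when the
 * CPU is otherwise idle, scrubbing discarded pages ahead of time. */
static void idle_zero_pages(void)
{
    for (int i = 0; i < NPAGES; i++) {
        if (pool[i].free && !pool[i].zeroed) {
            memset(pool[i].mem, 0, PAGE_SIZE);
            pool[i].zeroed = true;
        }
    }
}

/* Old scheme allocation: prefer a page the idle work already cleared,
 * fall back to clearing one on the spot. */
static void *alloc_page_prezeroed(void)
{
    for (int i = 0; i < NPAGES; i++) {
        if (pool[i].free && pool[i].zeroed) {
            pool[i].free = false;
            return pool[i].mem;
        }
    }
    for (int i = 0; i < NPAGES; i++) {
        if (pool[i].free) {
            memset(pool[i].mem, 0, PAGE_SIZE);
            pool[i].free = false;
            return pool[i].mem;
        }
    }
    return NULL;
}

/* New scheme (this commit): no queues, no idle-time work; just clear the
 * page at the moment it is handed out. */
static void *alloc_page_on_demand(void)
{
    for (int i = 0; i < NPAGES; i++) {
        if (pool[i].free) {
            memset(pool[i].mem, 0, PAGE_SIZE);
            pool[i].free = false;
            return pool[i].mem;
        }
    }
    return NULL;
}

int main(void)
{
    for (int i = 0; i < NPAGES; i++)
        pool[i].free = true;

    idle_zero_pages();                  /* old world: zero in the background */
    void *a = alloc_page_prezeroed();   /* handed out without touching memory */
    void *b = alloc_page_on_demand();   /* new world: one memset at allocation time */
    (void)a; (void)b;
    return 0;
}
```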
> Once upon a time, filling 4KB with zeroes was a costly operation. It took a bunch of CPU cycles, and worse, would have trashed memory caches closer to the CPU.
Even an Intel 486 can do that in 50 microseconds. Without cache. Probably in a few microseconds assuming L1 cache. So I wonder if the assumption has ever been true.
> So I wonder if the assumption has ever been true.
This fear went back to the PDP-11 in Unix and even predates Unix. It was true even when disks were really, really slow, as CPUs had small or no caches and lacked multiple functional units, much less hardware multithreading. A big PDP-10 mainframe might still have been only 0.7 VAX MIPS; the first one I programmed had only its DTL registers as "high speed" memory, and core was literally core memory.
You can get a feel for what a big deal this was in the fact that it felt radical and expensive for Stroustrup to decide that all new objects would be zeroed. Or you can see in the original posting that there was special memory handling support for special devices -- finally eliminated by this patch!
> You can get a feel for what a big deal this was in the fact that it felt radical and expensive for Stroustrup to decide that all new objects would be zeroed. Or you can see in the original posting that there was special memory handling support for special devices -- finally eliminated by this patch!
That's a pretty different issue. Zeroing freshly allocated memory in usermode causes all of its pages to become committed. The actual zeroing won't take much time, but getting all those pages physically mapped is another matter: under memory pressure, that can take an arbitrarily long time.
I think your concern is valid, and it's something a lot of people aren't aware of, but it probably doesn't apply in most of the cases people deal with. Unless you're dealing with entities that take up a page or more, chances are the rest of the page is being used by something else. If you write to anything in the page, the entire page has to be committed, and if the page is being reused by the allocator it is probably already committed. So for small objects you likely couldn't have avoided committing that memory anyway, and it's probably not worth worrying about.
Now, if you're instead allocating large objects, or allocating a large array all at once, then in theory the allocator will get some fresh memory for that large entity (assuming no existing free chunk can fit it). It should be smart enough to know that the OS will hand back zeroed memory and avoid zeroing it a second time, so those pages don't have to be committed right away.
All that said, if such a thing is actually a worry then you should probably skip the default allocator and move to an interface like `mmap()` anyway, which can provide pages of memory that you can be sure are untouched and read as zeros.
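To make that last point concrete, here is a minimal sketch (Linux/BSD userspace; assumes `MAP_ANONYMOUS` is available): pages from a fresh anonymous mapping are guaranteed to read as zeroes, so user code never has to clear them itself, and the kernel can defer the work until each page is actually touched.

```c
/* Minimal illustration of the mmap() point: anonymous mappings are
 * guaranteed to read as zeroes, so there is nothing for user code to clear. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;   /* 1 MiB of fresh anonymous memory */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Every byte is already zero: the kernel did (or will do) the work. */
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            printf("non-zero byte at %zu\n", i);
            return 1;
        }
    }
    printf("all %zu bytes read as zero\n", len);
    munmap(p, len);
    return 0;
}
```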
That's right; my point was that the time to zero memory was considered significant overhead regardless of where it occurred.
> I think your concern is valid, and it's something a lot of people aren't aware of, but it probably doesn't apply in most of the cases people deal with. Unless you're dealing with entities that take up a page or more, chances are the rest of the page is being used by something else.
vardump's objection addresses VM pressure, to which you respond. But zeroing objects on allocation is likely to have negative implications on the cache, slightly mitigated only in the case where you immediately initialize nonzero values. RAII can in theory alleviate some of that but I am unaware of any compiler that does this.
> All that said, if such a thing [large objects] is actually a worry then you should probably skip the default allocator and move to an interface like `mmap()` anyway, which can provide pages of memory that you can be sure are untouched and read as zeros.
<dillon> [...] I did a ton of testing. all the pre-zeroing stuff was reducing performance instead of improving it
<@dillon> not sure where the crossing point was, but probably somewhere around core-2-duo
I wonder if the acceptance of complexity in page zeroing was more about mapping and zeroing multiple pages… which makes me curious about DragonflyBSD's approach to that, as well as zeroing superpages.
It was a big deal for WinNT performance back in the 486 days. I don't remember the numbers, but page zeroing often chewed up a significant amount of CPU time.
DragonFlyBSD is slowly but steadily becoming more attractive as a Linux/FreeBSD alternative. 4.6 has good support for various newer Intel GPUs, HAMMER is stable, and the SMP and networking performance is stellar.
I have a 2.8TB backup archive that I have repeatedly tried to dedup with ZFS, and the realtime, while-you-are-writing architecture of ZFS dedup kills performance so badly that I have never actually managed to get through the full dataset. This past week I made a 4.6-rc2 based DragonFly VM, attached a 3TB disk with RDM, formatted it with HAMMER, started an rsync of the backed-up data and ran hammer dedup every 30 minutes. Now that the whole thing is done and a final dedup has run, I am at 1.9TB used, with no noticeable speed drops, and I got around 900GB back!
Yeah, I did try it on btrfs IIRC, but it's far too complicated (as opposed to hammer dedup, where everything is included and it just works) and doesn't cater to my use case too well: I have multiple machines backing files up to an SMB share with a lot of potential for duplicate data, and I also need to be able to access those backed-up files over SMB. Just making a HAMMER mount point, running dedup on it via a cron job at night and exporting it via Samba does everything I need.
Yeah, that's much nicer for your use case than bup.
(I should point out that Windows apparently has similar after-the-fact dedup capabilities on NTFS in Server 2012/R2 and up [1], though I suspect you'll find DFBSD much easier to run with low overhead in a VM.)
- Pre-zeroing a page only takes 80ns on a modern cpu. vm_fault overhead
in general is ~at least 1 microsecond.
- Multiple synth and build tests show that active idle-time zeroing of
pages actually reduces performance somewhat and incidental allocations
of already-zeroed pages (from page-table tear-downs) do not affect
performance in any meaningful way.
Not surprising. I've seen so much code with comments explaining how expensive copying or zeroing memory is (but never actually measuring it) as the justification for "zero-copy" techniques. It ends up doing so much work, spending whole microseconds, to "save" a cost that turns out to be 100 ns or less.
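If you want to put your own number on it, a crude userspace measurement takes a few lines. This times memset() over a hot-in-cache 4KB buffer, which is not the same thing as the kernel's fault path, but it shows the order of magnitude being argued about:

```c
/* Rough microbenchmark: how long does it take to zero one 4KB page?
 * Hot-cache userspace numbers, so treat it as an order-of-magnitude check. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define PAGE_SIZE 4096
#define ITERS     1000000

int main(void)
{
    static unsigned char page[PAGE_SIZE];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        memset(page, 0, PAGE_SIZE);
        /* stop the compiler from optimizing the memset away */
        __asm__ volatile("" : : "r"(page) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per 4KB memset\n", ns / ITERS);
    return 0;
}
```

On recent hardware you should see something in the tens-of-nanoseconds range, which is the same ballpark as the ~80ns quoted in the commit message.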
Fear of zeroing is related to fear of copying. One of the big arguments against message-passing systems is that there's extra copying. But today, copying is usually cheap, especially if the data was just created and is in cache. On the other hand, futzing with the MMU to avoid copying usually results in lots of cache flushes and is a lose unless you're remapping a huge memory area.
This is a total reversal from the situation back when, for example, Mach was designed. Or, for that matter, Linux. It makes message passing microkernels faster than they were back then.
It helps to do copying right. If a user process does something that causes the operating system to copy user-created data, the copy should take place on the same CPU, where the cache is current. These sorts of issues are why message passing and CPU dispatching have to be integrated to get good performance from a microkernel.
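As a rough sketch of that last point (the shape and names here are mine, and the affinity calls are Linux-specific): have the thread doing the copy run on the CPU where the message was just written, so the copy reads from a warm cache rather than pulling the data across cores.

```c
/* Sketch: do the message copy on the producer's CPU, where the data is
 * still warm in cache. Illustrative only, not any particular kernel's API. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 4096

static char message[MSG_SIZE];   /* just written by the producer: hot in cache */
static char delivery[MSG_SIZE];

static void *consumer(void *arg)
{
    int producer_cpu = *(int *)arg;

    /* Pin ourselves to the producer's CPU before copying. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(producer_cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    memcpy(delivery, message, MSG_SIZE);   /* the "expensive" copy */
    return NULL;
}

int main(void)
{
    memset(message, 0x5a, MSG_SIZE);       /* producer writes the message */
    int cpu = sched_getcpu();              /* remember where it was written */

    pthread_t t;
    pthread_create(&t, NULL, consumer, &cpu);
    pthread_join(t, NULL);
    printf("copied %d bytes on CPU %d\n", MSG_SIZE, cpu);
    return 0;
}
```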
I wonder if this has issues with the recent research showing that cooling memory allows reading it after a restart: swap the memory into another computer to dodge the fault-time zeroing and read away.
As you can see, a write fault on a not-present page in an anonymous VMA calls alloc_zeroed_user_highpage_movable(), which calls __alloc_zeroed_user_highpage(). The x86 architecture provides an architecture-override for this as a macro which calls alloc_page_vma() with the GFP_ZERO flag added. This calls down through alloc_pages_vma() and __alloc_pages_nodemask() which eventually gets a page through get_page_from_freelist() (potentially via __alloc_pages_slowpath()).
get_page_from_freelist() uses prep_new_page() to act on the GFP_ZERO flag ( https://lxr.missinglinkelectronics.com/#linux+v4.6/mm/page_a... ). Unless the (non-default) PAGE_POISONING_ZERO configuration option is set, this clears the page directly with clear_highpage(). The PAGE_POISONING_ZERO option zeroes the page at the time it is freed (so not in the background) - it's intended as a sanitization/hardening option, not a performance one.
A read fault on a not-present page in an anonymous VMA maps in the singleton "Zero PTE" which is a read-only PTE pointing at a page of zeroes. A subsequent write fault on this page will end up in wp_page_copy(), which recognises the Zero PTE and calls into alloc_zeroed_user_highpage_movable() instead of copying the page, as before.
It would be possible for another architecture to provide a version of __alloc_zeroed_user_highpage() which did use background-zeroing, if it was more efficient on that architecture - I haven't checked to see if any do, though.
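You can watch that behaviour from userspace on Linux with a sketch like the one below (sizes arbitrary; RSS read from /proc/self/statm): reading untouched anonymous memory maps the shared zero page and barely moves the resident set, while writing forces real zeroed pages to be allocated.

```c
/* Observe zero-page read faults vs. allocating write faults on Linux. */
#include <stdio.h>
#include <sys/mman.h>

static long rss_pages(void)
{
    long rss = -1;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f) {
        if (fscanf(f, "%*ld %ld", &rss) != 1)
            rss = -1;
        fclose(f);
    }
    return rss;
}

int main(void)
{
    size_t len = 64UL * 1024 * 1024;    /* 64 MiB anonymous mapping */
    volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("after mmap:   %ld pages resident\n", rss_pages());

    unsigned long sum = 0;
    for (size_t i = 0; i < len; i += 4096)
        sum += p[i];                    /* read faults: shared zero page */
    printf("after reads:  %ld pages resident (sum=%lu)\n", rss_pages(), sum);

    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;                       /* write faults: real zeroed pages */
    printf("after writes: %ld pages resident\n", rss_pages());

    munmap((void *)p, len);
    return 0;
}
```

The resident count should stay roughly flat after the read pass and jump by around 16k pages after the write pass.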
Windows does more of it, in the sense the heap manager likes to give freed blocks back to the OS, whereas UNIX programs traditionally never gave memory back until program exit.
I had a massive production outage a few years back involving this behaviour on Windows; symptom was 100% CPU usage in-kernel on exactly two threads and no other processes doing anything much. Still recall that as one of the most interesting bugs to track down.
Hours and hours of narrowing the symptom down to a program that would recreate it, plus disassembling the Windows heap manager and tracing through it in the debugger.
Remove the PG_ZERO flag and remove all page-zeroing optimizations,
entirely. After doing a substantial amount of testing, these
optimizations, which existed all the way back to CSRG BSD, no longer
provide any benefit on a modern system.
The joys of linking to the original source. Slashdot does have one interesting advantage over HN in that putting a large explanation, often with multiple links, is expected when submitting something. I probably could have found a blog or article with a fuller explanation, but then you get into accusations of submitting "blog spam". I figured this is HN and someone like jdub would come along and write a great explanation if the topic was worth consideration.
If I understand this change correctly it removes background zeroing of freed pages.
What effect will this have on security? Grsecurity adds a feature which allows the Linux kernel to sanitize freed pages[1]. I realize that the idle-time zeroing isn't the same as the immediate sanitization that grsecurity offers, but I'm curious whether this change affects security in that respect.