

Hidden Costs of Memory Allocation - ghusbands
http://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/

======
gargantian
I'm actually surprised virtualized page-zeroing isn't a memory controller
feature at this point. With all of the things modern CPUs do for you, it seems
crazy that everyone is running CPU threads that waste time and voltage to pump
a zero page queue.

~~~
stephencanon
It's a _little_ bit silly, but zeroing a page is an extremely cheap operation
(far cheaper than just about anything you are reasonably going to do with the
page once it's zeroed -- on the order of a hundred cycles is pretty typical
these days). That said, yes, it is a cost, and it's not crazy to want to
address it with HW.

FWIW, both powerpc and arm64 have a "just zero the damn cacheline"
instruction. That's not quite what you want, but it is quite useful.

~~~
brucedawson
A hundred cycles to zero a page? If 4 KB can be written in a hundred cycles
then, assuming a 3 GHz processor, that's 30 million pages per second or ~120
GB/s. That's pretty fast memory. On x86/x64 processors, which lack a zero-the-
cacheline instruction, the memory will also end up being read, so you need 240
GB/s to clear pages that quickly. This ignores the cost of TLB misses, which
require further memory accesses.

~~~
stephencanon
The zeros don't need to get pushed to memory immediately. They go to cache,
where they will typically be overwritten with your real data long before they
are pushed out to memory. That push of your real data would have needed to
happen anyway, so there is (usually) minimal extra cost associated with the
zeroing.

There are, of course, pathological cases where you touch one byte on a new
page, and then don't write anything else before the whole thing gets pushed
out, but they are relatively rare in performance-critical contexts.

------
tehwalrus
I know that in C it is relatively easy to define your own memory management
functions (I have written my own, complete with very primitive garbage
collector, in an attempt to track down memory leaks before I had heard about
Valgrind).

I believe Python essentially does this under the hood by allocating buffers
and managing its own memory requests on them (allocating more when it needs
more, obviously).

I wonder if it is possible to do this kind of thing in C++? It occurs to me
that it might be hard to write platform-independent code which mimics the
behaviour of the `new` keyword on a preallocated buffer. A quick Google
reveals some syntax[1] that I'd not seen before which allows one to do this,
although it looks like your call would look something like:

    MyType* instance = new (my_malloc(sizeof(MyType))) MyType();

which is hardly concise. Perhaps a macro called my_new would allow you to
text-transform your way there without as many parens or repeats of MyType.

In any case, asking for all your RAM at the beginning, and then re-blanking
and re-using it, is certainly much faster for certain problem types (3D games
with physics emulation, some scientific problems.)

[1] [http://stackoverflow.com/questions/8301043/create-objects-in...](http://stackoverflow.com/questions/8301043/create-objects-in-pre-allocated-memory)

 _disclaimer:_ This post is about reinventing too many wheels, and should be
taken with a pinch of salt! Moving malloc calls out of loops (and replacing
them with hand-written zeroing functions) is probably the most useful tip in
practice.

~~~
pjc50
In C++ you're supposed to do this with std::allocator and alternatives. This
allows you to change the allocator used by vector<>, map<>, etc. Or you use
the placement new syntax you found together with operator overloading (yes,
you can overload operator new).

You're really _not_ supposed to do it with macros.

~~~
scott_s
To be clear, there are four options you mentioned:

1\. Changing the allocator used by the standard library by passing in a
special allocator to instantiations of std::vector, std::map, etc.

2\. Allocating raw memory elsewhere, and instantiating an object with that
memory using the placement syntax for new.

3\. Overriding operator new on just a particular class. This does not change
the global new; it just means that objects of this class will be allocated
with this version of new.

4\. Overriding the global definition of new. All objects whose classes do not
define their own operator new will be allocated with your own, global new.

More details here:
[http://en.cppreference.com/w/cpp/memory/new/operator_new](http://en.cppreference.com/w/cpp/memory/new/operator_new)

------
bluetomcat
Does Windows really keep a pool of zero pages? With COW, a single zero page is
sufficient -- any copying would be done when a write is attempted over the
page.

~~~
ars
COW isn't free. You still have to spend the time doing the copy.

Better to zero memory ahead of time.

~~~
fulafel
A Windows-style ahead-of-time zeroing thread will typically pay the cache
misses and memory bandwidth for each page twice. (That assumes your code
subsequently puts its own data on the pages, rather than just ordering up
zeroed pages to sit on.)

~~~
kevingadd
Are you certain it's not bypassing cache on the writes? It's been relatively
straightforward to do this on x86 for ages now, especially if you're using SSE
to fill many bytes at once. I would be shocked if the page zeroing thread did
cached reads or writes.

~~~
brucedawson
The page-zeroing code may be able to minimize its effect on the cache, but it
will then necessarily consume memory bandwidth -- 4 KB of bandwidth per page
zeroed as it writes each page. So, it still affects overall performance.

And it guarantees that when a process goes to use the pages they will not be
in the cache. Ah, tradeoffs.

~~~
ars
It does it when the system is otherwise idle.

> And it guarantees that when a process goes to use the pages they will not be
> in the cache.

There isn't much point in caching a page full of zeros. Let the cache fill
when there is actual data.

