
How fast can you allocate a large block of memory in C++? - ibobev
https://lemire.me/blog/2020/01/14/how-fast-can-you-allocate-a-large-block-of-memory-in-c/
======
paulsutter
Any C++ programmer could write code that “allocates” 500MB in a few
instructions if the pages already exist.

If the pages don’t already exist (as indicated by the given timings), this is
a test of the OS and has little to do with the language.

It’s a poorly posed question. C++ runs on many environments.

------
ncmncm
If you need to allocate lots of memory, you should be using mmap(). If you
need it faulted in, you use MAP_POPULATE. Try to use MAP_HUGETLB to economize
on TLB cache entries; you may need to set "vm.nr_hugepages=16384" (or
something) in /etc/sysctl.conf (or someplace) to reserve them.

If you are allocating lots of memory and don't know about hugepages, you badly
need to learn.

If your program has threads, unmapping memory is generally to be avoided.

~~~
usefulcat
> If your program has threads, unmapping memory is generally to be avoided.

Could you expand on this? I recently encountered a situation where allocating
and freeing many large allocations using mmap() seemed to eventually cause
problems with thread creation, but I had assumed that it was probably because
the virtual address space had become too fragmented, which of course would not
solely be a result of unmapping. Or maybe that fragmentation is what you're
referring to, and I'm just reading too much into that sentence.

~~~
ncmncm
It just creates long stalls: unmapping triggers a TLB shootdown, flushing the
TLBs of every core that the process's threads run on, and the flushed entries
must then be reloaded from the page tables in RAM.

------
Const-me
The OP should have added “Linux” somewhere in the title.

Here on Windows, the OS doesn’t overcommit and the first new[] call actually
allocates these pages (possibly in a page file).

Also, if the requested size is large enough (>512kb on Win32, >1MB on Win64),
the OS guarantees zero initialization even though C++ doesn't:
[https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc](https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc)

~~~
wbkang
Even on Windows, the pagefile space is reserved, but it doesn't have to write
to disk, right? It can zero pages on first access.

------
ggggtez
Correct me if I'm wrong, but won't the speed depend on the compiler, the
operating system, the hardware...?

There is no particular reason you can't allocate 500GB in 1 cycle. Just get
rid of all memory management.

It seems silly to answer a question about C++ in operations per _second_,
besides.

------
saagarjha
> If you actually want to measure the memory allocation in C++, then you need
> to ask the system to give you s bytes of allocated and initialized memory.
> You can achieve the desired result in C++ by adding parentheses after the
> call to the new operator

Is this an actual thing? You can force reification of overcommitted memory
just by calling operator() on it?

~~~
inetknght
The example given:

    char *buf = new char[s]();

Put into Compiler Explorer: https://gcc.godbolt.org/z/QAs9gz

However, doing this is a very strong code smell. At the very least, using
`new` and assigning to a raw pointer is a sign that the C++ developer is
managing memory manually and is likely to hit a lot of problems, including
memory leaks or segmentation faults. Also, many would forget that this calls
`operator new[]()`, not `operator new()`, and might confuse it with
placement new `operator new(...)` or `operator new[](...)`. And the developer
might also forget that `new` can throw an exception... [0]

The developer should instead be using, at a minimum, a
`std::unique_ptr<char[]>` [1]. Or, IMO, a `std::vector<char>` which reminds
the developer not only of the pointer but also of the count of bytes which
have been allocated (.capacity()) and also of the valid range which has been
initialized (.size()).

IMO, if the developer wanted a pointer to a byte array, then it's a lot easier
to use `malloc()` than to try to remember all the different ways you can get
screwed by `operator new`:

    std::unique_ptr<char, std::function<void(char*)>> m{
        (char*)std::malloc(count), std::free
    };

[0]:
[https://en.cppreference.com/w/cpp/language/new](https://en.cppreference.com/w/cpp/language/new)

[1]:
[https://en.cppreference.com/w/cpp/memory/unique_ptr](https://en.cppreference.com/w/cpp/memory/unique_ptr)

~~~
dmitrygr
> using `new` and assigning to a raw pointer is a sign that the C++ developer
> is managing memory manually and is likely to hit a lot of problems including
> memory leaks

There is no need to badmouth all C++ developers. Plenty of us can keep our
memory straight just fine. Just because you don't like pointers doesn't mean
the rest of us can't use them perfectly safely.

~~~
asdfasgasdgasdg
I think the evidence is pretty clear that programmers can't, in general, use
bare pointers properly. The very best write use-after-frees, double frees,
stack smashers, and all manner of other memory-related bugs. We can see the
evidence of this in the CVEs, the syzkaller bugs, etc. If you haven't been
bitten by a serious instance of one of these, then you just haven't written
very much C++.

IMO the debate about whether programmers can safely handle bare pointers is
over. They can't. The only question is whether smart pointers help enough to
make the extra line noise worth it.

~~~
choeger
And the guys that implement smart pointers are what, exactly? Uberprogrammers?
Is there a certificate to get into that club? What about compiler authors,
that deal with, gasp, optimizations of these bare pointers? Kernel developers?
Embedded system engineers?

It is OK to consciously stay above a certain level of abstraction, but
stating that one _cannot_ go below it safely is just entrenching mediocrity.

~~~
lmm
> And the guys that implement smart pointers are what, exactly?
> Uberprogrammers? Is there a certificate to get into that club? What about
> compiler authors, that deal with, gasp, optimizations of these bare
> pointers? Kernel developers? Embedded system engineers?

One cannot do any of those things safely by hand. Compilers, kernels, and
embedded systems do indeed do these things; they also have bugs.

~~~
choeger
You are contradicting yourself. If one cannot do these things safely by hand,
then no one can implement the automatic abstractions either.

~~~
rcxdude
You cannot safely control an internal combustion engine's valve timings by
hand, but it is clearly possible to implement this automatically (and fairly
simply). Some things are much easier for machines to do than humans, even
though the machines are designed and built by humans.

------
adrianmonk
Very minor nitpick, but IMHO, this article creates a little unnecessary
confusion by insisting upon a particular meaning of the word "allocate"
without being very clear it's doing that.

That is, it uses "allocate" to mean "make ready for immediate use with no
further (lazy) processing". Another perfectly reasonable definition would be
"guarantee to be possible to use".

The distinction would arise on, for example, a system which doesn't
overcommit memory but also doesn't fault pages in upon allocation. On such a
system, allocation might give you a rock-solid guarantee that you can write to
(and read from) that memory, but it wouldn't give you any guarantee about how
fast or slow the initial access would be.

Personally, I prefer the second definition, but that's not really the point.
The point is to be clear and avoid confusion.

------
fargle
This is a completely ridiculous article. The whole concept is flawed.

New/delete like this are for smallish general purpose allocation, for example
of objects, where we want to keep the code at a fairly high level in C++ and
not think about low-level concepts like "bytes" or allocation mechanics. Or
performance.

Conversely, an app written in C++ that needs huge swaths of memory for
manipulating raw bytes and needs high performance would not likely use new
char[] or calloc/malloc directly at all. It would interface with the OS
directly via mmap, or indirectly via some domain-relevant library, for
example OpenCV.

We might even still exclusively use the C++ language to write the low-level
portions of the code, but if you need to interface with the OS in specific
controlled ways or write performance oriented code, you are not going to do it
using only high-level C++ operations. You are going to call exactly what you
need to interface with the system directly. If you want mmap(), you'd just
call mmap() from C++. If you want sbrk(), you'd call sbrk(). If you don't like
the system new/delete and malloc/free, you could use something like dlmalloc
and even remap new/delete to it, or to some custom slab allocator.

Secondly, as pointed out by others, the author isn't benchmarking C++. He's
benchmarking glibc malloc and the Linux mmap. We'd expect a C program using
malloc/calloc to have exactly the same timings.

Thirdly, there is fallacious reasoning about initialization. If your app needs
to allocate (for some ??? reason) 32 GB of memory, you would NOT automatically
zero it first, unless you actually needed it to be zero, or you wanted your
app to waste a bunch of time. Unnecessarily zeroing huge arrays is not
required for good security. We're mostly benchmarking memset here, not even
malloc/mmap. So it's completely apples/oranges to compare an mmap benchmark
with a zero-fill benchmark. I almost expect a follow-up article pointing out
that accessing x[i] is much faster when x is a raw array of integers than when
x is a std::map of strings.

Now read the footnotes. The author misunderstands what "idiomatic C++"
means. RAII is the best answer I can come up with for idiomatic: C++ doesn't
have a built-in guard concept, so we creatively misuse constructors and
destructors to get a similar result.

Footnote 2 proves the author knows bupkis about how C++ or any part of the
system actually works. He's semi-admitting as much.

I'd hope for much better from a CS professor. Stick to benchmarking this:

      for (i = 0; i < 1e9; ++i);

------
scoutt
> the system may then decide

This makes me wonder whether the whole question is really about C++ or about
the OS implementation of memory allocation. It could have little to do with
C++.

On a bare-metal system, malloc/sbrk (and thus _new_) can be something as
simple as:

      uint32_t retptr = heap;
      heap += size;
      return retptr;

------
gumby
Anonymous mmap() zeroes memory, so why is new wasting cycles doing so too? I
doubt any modern C++ standard library is using brk/sbrk in this day and age.

~~~
AstralStorm
The thing is, a C++ standard library implementation cannot assume a specific
platform or libc implementation. It's too portable.

Yes, you could add some platform- and type-specific speedups, but it ends up
a big mess.

In C++20 you get uninitialized allocation functions which do not have to
initialize.

~~~
gumby
By definition the libraries present a portable interface to underlying system
resources and are full of platform-specific code (I was a Cygnus founder, so
I spent plenty of time deep in these issues at a time when there was more
diversity). All POSIX systems I know (e.g. Linux, BSD, Solaris, etc.) return
pages of zeros from anonymous mmap. I'd consider it a bug if they
post-processed in this case.

------
halayli
Most std libs rely on a malloc library underneath, so I'm not sure what's
being tested here, especially since the standard allocator is being used.

------
blackrock
Anyone want to take a stab at how malloc actually works? What’s under the
hood? What data structures and algorithms does it use?

~~~
jabl
In a comment the blog author says he's using Ubuntu and glibc malloc. And
glibc malloc has a threshold (128 kB or thereabouts) where it switches to
using mmap(). So in practice this benchmark is a test of how the Linux kernel
implements the mmap() syscall, and how the compiler implements the zeroing
loop.

~~~
Const-me
> And glibc malloc has a threshold (128 kB or thereabouts) where it switches
> to using mmap()

Very similar on Windows. In modern CRT, malloc is a thin wrapper over
HeapAlloc WinAPI. That one has a threshold (512kb or 1MB depending on 32- or
64-bit process) where it switches to VirtualAlloc.

------
signa11
dup:
[https://news.ycombinator.com/item?id=22053831](https://news.ycombinator.com/item?id=22053831)

------
chapium
What is the point in initializing memory here?

~~~
saagarjha
It ensures that they actually exist, because otherwise the OS will just
overcommit them.

------
NoZZz
On a barebones system, 0 cpu cycles.

