How to Allocate Memory (2016) (sdf1.org)
104 points by tosh 5 days ago | 44 comments





I don't like this article. It comes across as being written with the intent of elevating the author's status as a 'low-level wizard' while conveying as little as possible to the ignorant reader. From the scattered aspects of memory management it covers to its use of unexplained magic numbers in its code examples, it reads more like an unsolicited secret handshake than a cogent, empathetic attempt to elevate others in the practice.

Disclaimer: I had a hard time following this article. I'm also (ostensibly anyway) not the target audience. I was at one point the primary person responsible for the memory management in the JIT compiler[1] used by OpenJ9, though I haven't done that for a couple of years now.

[1] That is, the memory used by the JIT compiler itself, not its interactions with the garbage collector.


I am the target audience. I get exactly what he’s going for but found a different gripe.

I really dislike his coding style! It’s written to be “short” rather than clear. I see what the code does, but not necessarily his intent.

  void*page_alloc(unsigned long long bytes) {
    void *x;
    for(;;) {
      x = mmap(0, bytes, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      if(x != MAP_FAILED) return x;
      out_of_memory();
    }
  }

For embedded I stopped using “unsigned long long” in favor of uint64_t. He spends all those extra characters but can’t put a space in the void pointer function definition?


Apologies! I usually typedef long long J;typedef unsigned long long U; and that makes it a bit shorter, but I wanted each example to compile, and not to get bogged down on typedefs.

If you use these kinds of typedefs, most people reading your code will kind of hate you... either that or you're targeting the International Obfuscated C Code Contest [1].

[1] - https://www.ioccc.org/


Perhaps this might help explain their reasoning: https://news.ycombinator.com/item?id=22010590

I didn’t downvote you, but your link contained this... and I considered it.

  ZI http(I d,I f,C*p,I r,I*sd){I i,o=0,b=0,s=0,w=0,g=0,e=0,c=0,cf=1,m=0;K q=0,a=ktn(11,0),v=ktn(0,0);
  for(i=b=0;i<r;++i){switch(p[i]){
  case'\n':if(!e)e=i;if(!q){q=kpn(p+w,g-w);if('\r'==(cf=p[i-1]))cf=p[i-2];}else if(!c){w=!!strchr("kK1",cf);if(!o)sc(d,0,sd);else if(w){if('g'==o)cwrite(f,OK("200","keep-alive")BLANK);else cwrite(f,OK("204","keep-alive")END204);}else{if('g'==o)cwrite(f,OK("200","close")BLANK);else cwrite(f,OK("204","close")END204);close(f);}a=k(o?-d:d,"dash",knk(2,q,xD(a,v),0),0,0);b=i+1;if(!o){if(!a){poop(f);R 0;}if(10!=a->t){r0(a);poop(f);R 0;}writer(f,kC(a),a->n);r0(a);if(!w)close(f);}if(b==r)R 1;q=0;o=0;a=ktn(11,0),v=ktn(0,0);}else{if((c-m)==10&& !strncasecmp(p+m,"connection",c-m))cf=p[s];js(&a,sn(p+m,c-m));jk(&v,kpn(p+s,e-s));}w=e=g=s=c=0;m=i+1;break;
  case' ':case'\t':case'\r':if(w&&!g)g=i;if(s==(e=i))++s;break;
  case':':if(!c)s=1+(c=i);break;
  case '/':if(!w)w=i+1;case '?':case '&':if(r>=(i+4)&&p[i+1]=='f'&&p[i+2]=='=')o=p[i+3];default: e=0;break;}}if(a)r0(a),r0(v);if(q)r0(q);R b;}
Thanks, I learned about something I absolutely hate.

Likely targeting c89, which doesn't have the fixed-size ints (nor size_t, as the sibling suggests).

My recollection was that size_t and ptrdiff_t were present in c89. size_t was the type yielded by sizeof. This draft appears to corroborate my recollection:

http://port70.net/~nsz/c/c89/c89-draft.html#3.3.3.4


In this case it also makes most sense to use size_t, right?

Depends.

If you use size_t all the time, you may forget or have someone else forget that another system isn’t the same.

If you always use uint8_t or int64_t etc. there is no ambiguity. If I intended this to be a 32-bit var, it will be regardless of who is compiling it.


But this is an allocation function, where the amount of data to be allocated is specified in bytes. That must surely use size_t. Imagine the case where 128 bit systems become popular and you want to use this function to allocate 2^70 bytes of memory. Using uint64_t here would be invalid in that case.
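To make the contrast concrete, here are the two prototypes being argued over (a sketch of my own; the function names are made up). size_t is the type that sizeof yields, so it tracks whatever the platform can represent as an object size, while a fixed-width type pins the range regardless of platform:

  #include <stddef.h>   /* size_t */
  #include <stdint.h>   /* uint64_t */

  /* Scales with the platform: the natural type for a byte count,
     and the same convention malloc(3) uses. */
  void *page_alloc_sz(size_t bytes);

  /* Fixed width: unambiguous across compilers, but caps the request
     at 64 bits even on a hypothetical wider machine. */
  void *page_alloc_64(uint64_t bytes);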

If you’re going to allocate like that you could use 8s and get any multiple you like, no?

I’ll give you that there are times when, for performance, you want the native system type. But I would still prefer to use uint32_t and make it explicit that this was optimized for a 32-bit system.


> It comes across as being written with the intent of highlighting the elevation the author's status as a 'low-level wizard', while conveying as little as possible to the ignorant reader.

Technical article for a technical audience, namely POSIX C programmers. There's no secret handshake or subtext to elevate the author.


You can be a POSIX C programmer and have no idea what sbrk is. (In fact, it was deprecated very early and removed in POSIX 2001.)

> unexplained magic numbers

Which "magic number" do you think was left unexplained? I thought I had got them all.


Here's one that I would consider unclear unless you knew that allocations were 16-byte aligned:

> gcc has a very efficient way to obtain the bucket from the number of bytes, 60 - __builtin_clzll(byte_size);
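In case it helps other readers, here is what the quoted expression evaluates to on its own (a standalone sketch, my own; I'm not restating the article's bucket layout): 63 - __builtin_clzll(n) is floor(log2(n)), so 60 - __builtin_clzll(n) is floor(log2(n)) - 3, stepping up by one at each power of two starting from 8.

  #include <stdio.h>

  int main(void) {
      /* Note: __builtin_clzll(0) is undefined, so callers must ensure n > 0. */
      unsigned long long sizes[] = {8, 15, 16, 31, 32, 4096};
      for (int i = 0; i < 6; i++)
          printf("%4llu bytes -> %d\n", sizes[i],
                 60 - __builtin_clzll(sizes[i]));
      /* Prints 0, 0, 1, 1, 2, 9: the index increments at 8, 16, 32, ... */
      return 0;
  }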


OK. I’ll add a point explaining why 60 is chosen when I land.

Using sbrk(2) as an application stack seems like a bad idea. A lot of times that is where malloc gets its heap addresses from.

The article even says this but not strongly enough:

> It is worth mentioning that the C library malloc might be using and caching the results of sbrk, so it may be unwise to mix this trick with your C library malloc, but it is very useful in new programs that you have no intention of using the C library malloc with.

No it is not a good idea to write "new programs" to apply hacky uses of sbrk(2) and conflict with malloc. Any use of sbrk(2) outside of a malloc implementation is pretty suspect as code smell. And you will be hard pressed to find any program or library that doesn't use malloc, short of embedded use cases.


Not only malloc, but the entire libc, as no functions in it explicitly guarantee that they do not call malloc and/or mess with “your” brk range. You can guess it by what a function does and whether ENOMEM is among its possible return statuses, but that is a stupid idea. sbrk is basically a remnant API and is now either an implementation detail of libc memory management or a fool’s crutch [1].

[1] https://opensource.apple.com/source/Libc/Libc-1272.250.1/emu...


Yes. Notably stdio typically uses the heap. So I hope this author isn't planning to use printf for New Programs.

I don't, at least not when I care about performance.

stdio is a big unknown with highly variable performance across platforms and libc-versions.


You might not.

Your dependencies do.

And maybe you do in small amounts.

You think you are gaining something by misusing sbrk(2), but you are leaving a timebomb in your code. By your writing it seems like you are smart enough to know this.

From a performance perspective, it won't matter much whether you put a stack on a high-level heap, mmap its pages, or use this silly sbrk(2) thing. You know this.


Sure, yeah but it’s not like using printf is a problem for error messages or logging. Using write() for that kind of stuff is more hassle than it’s worth.

Oh it so is!

I can’t count the number of times I’ve gotten 20-50x speedups just by turning off logging. Assuming printf is free (or even cheap) is the sort of thing that’ll get you into trouble. It might not always matter, but when it does, if your application is parsing strings at runtime at high volume, it’s going to hurt!

I rarely log messages anyway, preferring to set state that I can then print out on command (e.g. siginfo).
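For readers unfamiliar with that pattern, a minimal sketch (mine; SIGUSR1 stands in for SIGINFO, which isn't available everywhere): the hot path only bumps a counter, and formatting happens only when the signal asks for it.

  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  static volatile sig_atomic_t want_stats;
  static unsigned long messages_seen;          /* hot-path state, no formatting cost */

  static void on_stats_signal(int sig) { (void)sig; want_stats = 1; }

  int main(void) {
      signal(SIGUSR1, on_stats_signal);        /* SIGINFO on the BSDs/macOS */
      for (;;) {
          ++messages_seen;                     /* cheap update on the hot path */
          if (want_stats) {                    /* format only when asked */
              want_stats = 0;
              fprintf(stderr, "messages: %lu\n", messages_seen);
          }
          usleep(1000);                        /* stand-in for real work */
      }
  }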


So did you just turn logging off, or did you in fact write a better custom implementation? And how much faster did it go? Did you actually measure that? What kind of system was that on?

My first thought is, probably there was an fsync() or similar after logging each message. Also, printf() and friends have to lock the streams, so there might be some contention somewhere in case you're multi-threaded. That's not a problem that you can solve without some custom routing of logging messages.

Using printf, you can easily write dozens of megabytes of output with the most complicated formatting on contemporary systems (I'm just pulling a number out of thin air here). More than any system should want to log.


People who deal in megabytes have different problems than people who deal in petabytes.

Usually I just turn it off. At least at first. Note I’m talking diagnostics here; if I need the log messages, obviously I have to write a custom streaming system even if the system printf is fast enough, because the API complicates recovery.

Printf can also vary in performance by 10msec or more based on the internal state. That’s not good enough since my entire application has a hard limit of around 30msec. I can’t even do one printf — even every N messages (for any N) because I’ll never catch up.


> People who deal in megabytes have different problems than people who deal in petabytes.

I just hope that this is not meant ad-hominem. At the very least it's a bad reply, unless you are dealing with petabytes per second. (Btw one of my current projects is a text editor that can handle gigabytes in memory; local operations take 5-50 microseconds. So I have reason to think I'm not entirely clueless).

> I rarely log messages anyway, preferring to set state that I can then printout on command (e.g. siginfo)

That sounds much more reasonable to me given the volumes that you cite, and why shouldn't we use printf() for that?

Why would printf "vary in performance by 10msec" or more, in ways that another application wouldn't? For how much data is that? How many printf() calls?

Anyway, I shouldn't have gotten into the weeds here. The blanket statement that printf() is slow, and that an obscure API like sbrk() is therefore a better choice, is nonsensical for a guide that seemingly gives general advice about memory allocation.


> you can easily write dozens of megabytes

I meant to add per second.


It seems like sbrk is a bad idea because it's a legacy API that manipulates global program state in a very error prone manner. Is there ever any reason to use it instead of mmap?

More generally, I was very confused by the entire stack section. Why sbrk instead of mmap? Why should C provide special functions for creating or using a stack? Isn't a stack just a region of memory and a pointer into it? Isn't their high performance entirely due to data access patterns that facilitate cache locality while minimizing instruction count and pointer dereferences?

I may well have some misconceptions, but if so this article did nothing to address them.

Edit: Aha! From the mmap man page:

> MAP_STACK (since Linux 2.6.27)

> Allocate the mapping at an address suitable for a process or thread stack.

> This flag is currently a no-op on Linux. However, by employing this flag, applications can ensure that they transparently obtain support if the flag is implemented in the future. ... some architectures may (later) require special treatment for stack allocations. ...
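To make that concrete, a minimal sketch (my own, not from the article) of carving out a stack region with mmap; MAP_STACK is passed for intent even though it is currently a no-op on Linux, and the 1 MiB size is arbitrary:

  #include <sys/mman.h>
  #include <stdio.h>

  #ifndef MAP_STACK
  #define MAP_STACK 0          /* the flag is advisory and absent on some platforms */
  #endif

  int main(void) {
      size_t len = 1 << 20;                    /* 1 MiB region, arbitrary */
      void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
      if (base == MAP_FAILED) { perror("mmap"); return 1; }

      /* A "stack" is then just this region plus a pointer bumped within it;
         conventionally it grows down from the high end. */
      char *sp = (char *)base + len;
      sp -= 64;                                /* "push" a 64-byte frame */
      *sp = 42;                                /* touching the page commits it */

      munmap(base, len);
      return 0;
  }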


Don't forget MAP_GROWSDOWN

> MAP_GROWSDOWN

> This flag is used for stacks. It indicates to the kernel virtual memory system that the mapping should extend downward in memory. The return address is one page lower than the memory area that is actually created in the process's virtual address space. Touching an address in the "guard" page below the mapping will cause the mapping to grow by a page. This growth can be repeated until the mapping grows to within a page of the high end of the next lower mapping, at which point touching the "guard" page will result in a SIGSEGV signal.


The strange thing [to me] is that, in a discussion of stack allocation patterns, sbrk is a deeply confusing red herring.

'Little-s' stacks require bulk-allocated pages that are then carved up into object allocations as required, and de-allocated in last-in-first-out ordering. No alloca or sbrk required.
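A minimal sketch of that idea (mine, with arbitrary choices for capacity and alignment): one bulk allocation, a bump pointer for pushes, and LIFO release by resetting to a saved mark.

  #include <stdlib.h>

  typedef struct { unsigned char *base, *top, *end; } Stack;

  static int stack_init(Stack *s, size_t cap) {
      s->base = s->top = malloc(cap);          /* one bulk allocation */
      s->end = s->base ? s->base + cap : NULL;
      return s->base ? 0 : -1;
  }

  static void *stack_push(Stack *s, size_t n) {
      n = (n + 15) & ~(size_t)15;              /* keep 16-byte alignment */
      if ((size_t)(s->end - s->top) < n) return NULL;
      void *p = s->top;
      s->top += n;
      return p;
  }

  /* LIFO release: remember a mark, then pop everything since it at once. */
  static unsigned char *stack_mark(Stack *s)            { return s->top; }
  static void stack_release(Stack *s, unsigned char *m) { s->top = m; }

  int main(void) {
      Stack s;
      if (stack_init(&s, 1 << 16)) return 1;
      unsigned char *mark = stack_mark(&s);
      void *a = stack_push(&s, 100);           /* two short-lived objects */
      void *b = stack_push(&s, 200);
      (void)a; (void)b;
      stack_release(&s, mark);                 /* frees both, LIFO, in O(1) */
      free(s.base);
      return 0;
  }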


> One major limitation of malloc (and even the best implementations like jemalloc and dlmalloc) is that they try to use a single allocator for each data structure.

I’m not sure what this is supposed to mean, but it’s good to note that malloc will switch between different paths based on the allocation size and what memory it has available to service the request.


malloc() has incomplete information. It doesn't know that this allocation of 0 bytes is soon going to balloon to a few GB in a future realloc(), so if it starts in brk, it'll mmap later and memcpy the data over. The good news is that the caller knows this.

Saying "I want a sizeof(Cons) object" versus "I want a buffer that's going to get big and used for a long time" versus "I want a buffer that'll get big but go away right after" means you get to avoid these unnecessary operations. That's important because they really add up when dealing with lots of data. That's how a complicated and intelligent malloc(), which has to rely on heuristics to decide which of the "different paths" to follow, can lose big to a simpler strategy.
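As one concrete reading of "the caller knows this" (a sketch of mine, not the article's allocator): a buffer the caller knows will balloon can be mapped at its maximum size up front; on 64-bit systems with demand paging, untouched pages cost only address space, and growth never needs a copy.

  #include <sys/mman.h>
  #include <stddef.h>

  /* Reserve max_bytes of anonymous memory; pages are only backed by real
     memory once written, so over-reserving is cheap on 64-bit systems. */
  static void *reserve_big_buffer(size_t max_bytes) {
      void *p = mmap(NULL, max_bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return p == MAP_FAILED ? NULL : p;
  }

  int main(void) {
      /* "I want a buffer that's going to get big": reserve 4 GiB of address
         space once; no realloc, no memcpy as it fills. */
      char *buf = reserve_big_buffer((size_t)4 << 30);
      if (!buf) return 1;
      buf[0] = 'x';                            /* commits only the first page */
      munmap(buf, (size_t)4 << 30);
      return 0;
  }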


Why not just mmap at the start then?

Well then you’re not using malloc(), which is kind of my point.

Or are you asking something else?


Can anyone think of a use case for creating a second stack, without a thread, to allocate memory? The approach I have in mind: 1) Create a new stack with POSIX makecontext() w/ an initializer thread. 2) The initializer thread makes function calls to allocate objects. Functions are called in sequence and never return. The thread eventually reaches the end of its initialization sequence and stalls. 3) Use swapcontext() to return to the original thread from #1. 4) The original thread may reference all objects allocated on the stack of the other context.

This seems like a widget that could be used to accomplish something interesting, but I'm not sure what that might be.
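A minimal sketch of roughly that widget, under my reading of the steps above (names and sizes are mine): the initializer function "allocates" an object as a local on the second stack, then swaps back without returning, so its frame stays live.

  #include <stdio.h>
  #include <stdlib.h>
  #include <ucontext.h>

  static ucontext_t main_ctx, init_ctx;
  static int *shared_obj;                      /* will point into the second stack */

  static void init_fn(void) {
      int obj = 42;                            /* "allocated" on the second stack */
      shared_obj = &obj;
      /* Swap back without returning, so this frame (and obj) stays live. */
      swapcontext(&init_ctx, &main_ctx);
  }

  int main(void) {
      size_t stack_size = 64 * 1024;           /* arbitrary size for the new stack */
      char *stack = malloc(stack_size);
      if (!stack) return 1;

      getcontext(&init_ctx);
      init_ctx.uc_stack.ss_sp = stack;
      init_ctx.uc_stack.ss_size = stack_size;
      init_ctx.uc_link = &main_ctx;            /* where to go if init_fn ever returns */
      makecontext(&init_ctx, init_fn, 0);

      swapcontext(&main_ctx, &init_ctx);       /* run init_fn until it swaps back */
      printf("object on the other stack: %d\n", *shared_obj);

      free(stack);
      return 0;
  }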


A Forth interpreter is probably the simplest and most common example of a two-stack system someone might be familiar with.

Sometimes I'd like to push an event onto a queue from an interrupt (e.g. a signal handler) and the buffer size can balloon some; sbrk() might be used because it's faster than malloc() and I know the consumer is fast enough (even though it's currently busy).

Sometimes I know the memory usage is very temporary, but I don't want to contort my program to use alloca() and so again: using brk can be convenient.

There are other things that are stack-like that we don't often think of as stacks.

One interesting place might be an undo-stack: An editor typically has an undo and a redo button and will never be freeing this state anyway, so rather than malloc/realloc my state (and have extra copies going on), and without trashing my heap with a long linked list, I might prefer the performance of sbrk(). You could possibly use alloca() as well, but this would make redo very difficult.

Another might be in a consensus algorithm: If implementing raft you wanted to queue messages until you had a chance for your election, your "stack" might take several seconds before you can start processing it. Keeping this on the "other" stack would make it difficult to do any other processing (where's the event loop?), and using malloc/realloc() will introduce copies. Reserving sbrk() for this stage might make everything much more straightforward.


> One interesting place might be an undo-stack: [..] so rather than malloc/realloc my state (and have extra copies going on), and without trashing my heap with a long linked list, I might prefer the performance of sbrk().

You can implement a stack-like thing that avoids copying, and even gives stable addresses, by linking fixed-size chunks (say 128 elements) in a list. A stack "pointer" is then a pointer to a chunk + offset.
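Something like this, perhaps (a sketch following the description above; the element type and chunk size are arbitrary):

  #include <stdlib.h>

  #define CHUNK_CAP 128

  typedef struct Chunk { struct Chunk *prev; int items[CHUNK_CAP]; } Chunk;
  typedef struct { Chunk *top; int used; } ChunkStack;   /* "pointer" = chunk + offset */

  static int cs_push(ChunkStack *s, int v) {
      if (!s->top || s->used == CHUNK_CAP) {              /* start a new chunk */
          Chunk *c = malloc(sizeof *c);
          if (!c) return -1;
          c->prev = s->top;
          s->top = c;
          s->used = 0;
      }
      s->top->items[s->used++] = v;                       /* existing items never move */
      return 0;
  }

  static int cs_pop(ChunkStack *s, int *out) {
      if (!s->top) return -1;
      *out = s->top->items[--s->used];
      if (s->used == 0) {                                 /* an undo stack might keep it */
          Chunk *old = s->top;
          s->top = old->prev;
          s->used = s->top ? CHUNK_CAP : 0;
          free(old);
      }
      return 0;
  }

  int main(void) {
      ChunkStack undo = {0};
      for (int i = 0; i < 1000; i++) cs_push(&undo, i);   /* spans several chunks */
      int last;
      while (cs_pop(&undo, &last) == 0) { /* replay in reverse */ }
      return 0;
  }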

That said, editor undo/redo items are not performance critical. And you will be hard pressed to find a reason why you need to optimize memory consumption, assuming a memory overhead of 16 bytes per allocation and that a human makes < 1M edits.

Preferring an obscure API like sbrk() instead is not good general advice. It's not portable and it precludes the use of basically any library including libc.


> And you will be hard pressed to find a reason why you need to optimize memory consumption.

You might want to minimize fragmentation and number of underutilized heap pages.

Even if your allocator minimizes fragmentation by allocating similar-size objects from the same area, allocating a large number of objects risks some other code (possibly running in a different thread) allocating long-lived objects at the same time.

This can cause pages to exist just to hold a single small object, preventing optimal use of memory.

In some algorithms, 90-99%+ of runtime can be in allocation and freeing small objects!

On top of that, accessing a sequential stack (array) is significantly faster. CPUs are built for that type of access pattern.


I had to read a little between the lines to understand your point, which I think is this: Even if you have an object type that requires only lowest-performance memory management (say, editor undo/redo items), you should separate these allocation concerns from all the other allocation concerns in the program, because you might risk contention. It's important to group allocations by lifetime.

To which I'll say, shouldn't we rather optimize and separate the high-bandwidth stuff, so we can keep using malloc-per-object for the unimportant stuff?

The rest of your comment does not apply to a simple undo/redo system:

> In some algorithms, 90-99%+ of runtime can be in allocation and freeing small objects!

> On top of that, accessing a sequential stack (array) is significantly faster. CPUs are built for that type of access patterns.


Not the person you’re replying to, but:

> shouldn't we rather optimize and separate the high-bandwidth stuff, so we can keep using malloc-per-object for the unimportant stuff?

If it helps, sure, but one problem that gets nasty at high bandwidth is variable latency: malloc might be too clever and vary on the order of milliseconds, and in hard-real-time environments you can’t amortise the cost over multiple messages, since missing one means you never catch back up (without e.g. throwing data away).


I've only had cursory exposure to it, but my understanding is that TLSF malloc [1] is designed to directly address those concerns. Is it known to be deficient such that its claims are invalid in, or worthless for, real-world use?

[1] http://www.gii.upv.es/tlsf/files/ecrts04_tlsf.pdf


It's useful all the time in serialization and marshalling systems. You end up with very nice code if you can keep a separate stack for each of e.g. your json parser and its paired object unmarshaller.

The nicest outcomes are with coroutines, though, and that begins to be another matter.


Abstractions are leaky - much more often than one would like to think. On modern 64-bit systems, for example, with the apparently "limitless" virtual memory one has to pay close attention not only to the allocation patterns but also to the access patterns: random reads and writes from/to a huge array may lead to a significant drop in performance.



