
How to Allocate Memory (2016) - tosh
https://geocar.sdf1.org/alloc.html
======
ShroudedNight
I don't like this article. It comes across as being written with the intent of
highlighting the elevation of the author's status as a 'low-level wizard', while
conveying as little as possible to the ignorant reader. From the scattered
aspects of memory management it covers to its use of unexplained magic numbers
in its code examples, it comes across more as an unsolicited secret handshake
than a cogent, empathetic attempt to elevate others in the practice.

Disclaimer: I had a hard time following this article. I'm also (ostensibly
anyway) not the target audience. I was at one point the primary person
responsible for the memory management in the JIT compiler[1] used by OpenJ9,
though I haven't done that for a couple of years now.

[1] That is, the memory used by the JIT compiler itself, not its interactions
with the garbage collector.

~~~
SlowRobotAhead
I am the target audience. I get exactly what he’s going for but found a
different gripe.

I really dislike his coding style! It’s written to be “short” in favor of
clear. I see what the code does, but not necessarily his intent.

    
    
      void*page_alloc(unsigned long long bytes) {
        void *x;
        for(;;) {
          x = mmap(0, bytes, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
          if(x != MAP_FAILED) return x;
          out_of_memory();
        }
      }

For embedded I stopped using “unsigned long long” in favor of uint64_t. He
spends all those extra characters but can’t put a space in the void pointer
function definition?

~~~
geocar
Apologies! I usually typedef long long J;typedef unsigned long long U; and
that makes it a bit shorter, but I wanted each example to compile, and not to
get bogged down on typedefs.

~~~
einpoklum
If you use this kind of typedef, most people reading your code will kind of
hate you... either that or you're targeting the International Obfuscated C Code
Contest [1].

[1] - [https://www.ioccc.org/](https://www.ioccc.org/)

~~~
saagarjha
Perhaps this might help explain their reasoning:
[https://news.ycombinator.com/item?id=22010590](https://news.ycombinator.com/item?id=22010590)

~~~
SlowRobotAhead
I didn’t downvote you, but your link contained this... and I considered it.

    
    
      ZI http(I d,I f,C*p,I r,I*sd){I i,o=0,b=0,s=0,w=0,g=0,e=0,c=0,cf=1,m=0;K q=0,a=ktn(11,0),v=ktn(0,0);
      for(i=b=0;i<r;++i){switch(p[i]){
      case'\n':if(!e)e=i;if(!q){q=kpn(p+w,g-w);if('\r'==(cf=p[i-1]))cf=p[i-2];}else if(!c){w=!!strchr("kK1",cf);if(!o)sc(d,0,sd);else if(w){if('g'==o)cwrite(f,OK("200","keep-alive")BLANK);else cwrite(f,OK("204","keep-alive")END204);}else{if('g'==o)cwrite(f,OK("200","close")BLANK);else cwrite(f,OK("204","close")END204);close(f);}a=k(o?-d:d,"dash",knk(2,q,xD(a,v),0),0,0);b=i+1;if(!o){if(!a){poop(f);R 0;}if(10!=a->t){r0(a);poop(f);R 0;}writer(f,kC(a),a->n);r0(a);if(!w)close(f);}if(b==r)R 1;q=0;o=0;a=ktn(11,0),v=ktn(0,0);}else{if((c-m)==10&& !strncasecmp(p+m,"connection",c-m))cf=p[s];js(&a,sn(p+m,c-m));jk(&v,kpn(p+s,e-s));}w=e=g=s=c=0;m=i+1;break;
      case' ':case'\t':case'\r':if(w&&!g)g=i;if(s==(e=i))++s;break;
      case':':if(!c)s=1+(c=i);break;
      case '/':if(!w)w=i+1;case '?':case '&':if(r>=(i+4)&&p[i+1]=='f'&&p[i+2]=='=')o=p[i+3];default: e=0;break;}}if(a)r0(a),r0(v);if(q)r0(q);R b;}
    

Thanks, I learned about something I absolutely hate.

------
asveikau
Using sbrk(2) as an application stack seems like a bad idea. A lot of times
that is where malloc gets its heap addresses from.

The article even says this but not strongly enough:

> It is worth mentioning that the C library malloc might be using and caching
> the results of sbrk it may be unwise to mix this trick with your C library
> malloc, but it is very useful in new programs that you have no intention of
> using the C library malloc with.

No, it is not a good idea to write "new programs" around hacky uses of sbrk(2)
that conflict with malloc. Any use of sbrk(2) outside of a malloc
implementation is a pretty suspect code smell. And you will be hard pressed
to find any program or library that doesn't use malloc, short of embedded use
cases.
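
For concreteness, the kind of sbrk(2) bump allocator under discussion might look like this minimal sketch (`bump_alloc` is an invented name; Linux/glibc assumed), with the hazard spelled out in the comments:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Minimal bump allocator on the program break, the kind of thing the
 * article proposes. The warning above applies: glibc's malloc (and
 * anything that calls it, stdio included) may also move the break,
 * so mixing the two heaps in one program is asking for corruption. */
static void *bump_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;        /* keep allocations 16-byte aligned */
    void *p = sbrk((intptr_t)n);       /* returns the old break, or (void *)-1 */
    return p == (void *)-1 ? NULL : p;
}
```

It is also not portable: some platforms only emulate brk/sbrk nowadays.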

~~~
wruza
Not only malloc, but the entire libc, as no function in it explicitly
guarantees that it does not call malloc and/or mess with “your” brk range. You
can guess by what a function does and whether ENOMEM is among its possible
return statuses, but relying on that is a stupid idea. sbrk is basically a
remnant API and is now either an implementation detail of libc memory
management or a fool’s crutch [1].

[1]
[https://opensource.apple.com/source/Libc/Libc-1272.250.1/emu...](https://opensource.apple.com/source/Libc/Libc-1272.250.1/emulated/brk.c.auto.html)

~~~
asveikau
Yes. Notably stdio typically uses the heap. So I hope this author isn't
planning to use printf for New Programs.

~~~
geocar
I don't, at least not when I care about performance.

stdio is a big unknown with highly variable performance across platforms and
libc-versions.

~~~
smabie
Sure, yeah but it’s not like using printf is a problem for error messages or
logging. Using write() for that kind of stuff is more hassle than it’s worth.

~~~
geocar
Oh it so is!

I can’t count the number of times I’ve gotten 20-50x speedups just by turning
off logging. Assuming printf is free (or even cheap) is the sort of thing
that’ll get you into trouble. It might not always matter, but when it does,
and your application is parsing strings at high volume at runtime, it’s going
to hurt!

I rarely log messages anyway, preferring to set state that I can then print
out on command (e.g. siginfo).
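
A minimal sketch of that pattern (names are invented; SIGUSR1 stands in for BSD's SIGINFO, and POSIX/Linux is assumed):

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Hot path only bumps a counter: no formatting, no I/O. */
static volatile sig_atomic_t messages_seen;

static void handle_message(void) {
    ++messages_seen;
}

/* On-demand dump from a signal handler. write(2) is async-signal-safe;
 * snprintf strictly is not, so a hardened version would format the
 * integer by hand. */
static void dump_state(int sig) {
    (void)sig;
    char buf[64];
    int n = snprintf(buf, sizeof buf, "messages_seen=%d\n", (int)messages_seen);
    if (n > 0)
        write(STDERR_FILENO, buf, (size_t)n);
}
```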

~~~
jstimpfle
So did you just turn logging off, or did you in fact write a better custom
implementation? And how much faster did it go? Did you actually measure that?
What kind of system was that on?

My first thought is, probably there was an fsync() or similar after logging
each message. Also, printf() and friends have to lock the streams, so there
might be some contention somewhere in case you're multi-threaded. That's not a
problem that you can solve without some custom routing of logging messages.

Using printf, you can easily write dozens of megabytes of output through the
most complicated formatting code on contemporary systems (I'm just pulling a
number out of thin air here). More than any system should want to log.

~~~
geocar
People who deal in megabytes have different problems than people who deal in
petabytes.

Usually I just turn it off. At least at first. Note I’m talking diagnostics
here; if I _need_ the log messages, obviously I have to write a custom
streaming system even if the system printf is fast enough, because the API
complicates recovery.

Printf can also vary in performance by 10msec or more based on the internal
state. That’s not good enough since my entire application has a hard limit of
around 30msec. I can’t even do one printf — even every N messages (for any N)
because I’ll never catch up.

~~~
jstimpfle
> People who deal in megabytes have different problems than people who deal in
> petabytes.

I just hope that this is not meant ad hominem. At the very least it's a bad
reply, unless you are dealing with petabytes _per second_. (Btw one of my
current projects is a text editor that can handle gigabytes in memory; local
operations take 5-50 microseconds. So I have reason to think I'm not entirely
clueless.)

> I rarely log messages anyway, preferring to set state that I can then
> printout on command (e.g. siginfo)

That sounds much more reasonable to me given the volumes that you cite, and
why shouldn't we use printf() for that?

Why would printf "vary in performance by 10msec" or more, in ways that another
application wouldn't? For how much data is that? How many printf() calls?

Anyway, I shouldn't have gotten into the woods here. The blanket statement
that printf() is slow and that an obscure API like sbrk() is therefore a
better choice is nonsensical for a guide that seemingly gives general advice
for memory allocation.

------
saagarjha
> One major limitation of malloc (and even the best implementations like
> jemalloc and dlmalloc) is that they try to use a single allocator for each
> data structure.

I’m not sure what this is supposed to mean, but it’s good to note that malloc
will switch between different paths based on the allocation size and what
memory it has available to service the request.

~~~
geocar
malloc() has incomplete information. It doesn't know that this allocation of 0
bytes is soon going to balloon to a few GB in a future realloc(), so if it
starts in brk, it'll mmap later and memcpy the data over. The good news is
that the caller knows this.

Saying "I want a sizeof(Cons) object" versus "I want a buffer that's going to
get big and be used for a long time" versus "I want a buffer that'll get big
but go away right after" means you get to avoid these unnecessary operations.
That's important because they really add up when dealing with lots of data.
That's how replacing a complicated and intelligent malloc(), which has to rely
on heuristics to decide which of the "different paths" to follow, can lose big
to a simpler strategy.

~~~
saagarjha
Why not just mmap at the start then?

~~~
geocar
Well then you’re not using malloc(), which is kind of my point.

Or are you asking something else?

------
zerotonine
Can anyone think of a use case for creating a second stack, without a thread,
to allocate memory? The approach I have in mind: 1) Create a new stack with
POSIX makecontext() and an initializer context. 2) The initializer makes
function calls to allocate objects. Functions are called in sequence and never
return; the context eventually reaches the end of its initialization sequence
and stalls. 3) Use swapcontext() to return to the original context from #1.
4) The original context may reference all objects allocated on the stack of
the other context.

This seems like a widget that could be used to accomplish something
interesting, but I'm not sure what that might be.

~~~
geocar
A Forth interpreter is probably the simplest and most common example of a
two-stack system someone might be familiar with.

Sometimes I'd like to push an event onto a queue from an interrupt (e.g. a
signal handler) and the buffer size can balloon some; sbrk() might be used
because it's faster than malloc() and I know the consumer is fast enough (even
though it's currently busy).

Sometimes I know the memory usage is very temporary, but I don't want to
contort my program to use alloca() and so again: using brk can be convenient.

There are other things that are stack-like that we don't often think of as
stacks.

One interesting place might be an undo-stack: An editor typically has an undo
and redo-button and will never be freeing this state anyway, so rather than
malloc/realloc my state (and have extra copies going on), and without trashing
my heap with a long linked list, I might prefer the performance of sbrk(). You
could possibly use alloca() as well, but this would make redo very difficult.

Another might be in a consensus algorithm: if implementing Raft you wanted to
queue messages until you had a chance for your election, your "stack" might
take several seconds before you can start processing it. Keeping this on the
"other" stack would make it difficult to do any other processing (where's the
event loop?), and using malloc/realloc() will introduce copies. Reserving
sbrk() for this stage might make everything much more straightforward.

~~~
jstimpfle
> One interesting place might be an undo-stack: [..] so rather than
> malloc/realloc my state (and have extra copies going on), and without
> trashing my heap with a long linked list, I might prefer the performance of
> sbrk().

You can implement a stack-like thing that avoids copying, and even gives
stable addresses, by linking fixed-size chunks (say 128 elements) in a list. A
stack "pointer" is then a pointer to a chunk + offset.
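
A minimal sketch of such a chunked stack (names invented; `int` elements for brevity):

```c
#include <assert.h>
#include <stdlib.h>

enum { CHUNK = 128 };                  /* fixed elements per chunk */

typedef struct Chunk {
    struct Chunk *prev;                /* link to the previous (older) chunk */
    int items[CHUNK];
} Chunk;

typedef struct {
    Chunk *top;                        /* newest chunk */
    int used;                          /* elements used in the newest chunk */
} Stack;

/* Push a value and return its address. The address stays valid across
 * later pushes because chunks are linked, never reallocated or moved. */
static int *stack_push(Stack *s, int v) {
    if (!s->top || s->used == CHUNK) {
        Chunk *c = malloc(sizeof *c);
        if (!c) return NULL;
        c->prev = s->top;
        s->top = c;
        s->used = 0;
    }
    int *slot = &s->top->items[s->used++];
    *slot = v;
    return slot;
}
```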

That said, editor undo/redo items are not performance critical, and you will
be hard pressed to find a reason to optimize their memory consumption,
assuming an overhead of 16 bytes per allocation and that a human would make
< 1M edits.

Preferring an obscure API like sbrk() instead is not good general advice. It's
not portable and it precludes the use of basically any library including libc.

~~~
vardump
> And you will be hard pressed to find a reason why you need to optimize
> memory consumption.

You might want to minimize fragmentation and number of underutilized heap
pages.

Even if your allocator minimizes fragmentation by allocating similar-sized
objects from the same area, allocating a large number of objects risks some
other code (possibly running in a different thread) allocating long-lived
objects at the same time.

This can cause pages to exist just to hold a single small object, preventing
optimal use of memory.

In some algorithms, 90-99%+ of runtime can be in allocation and freeing small
objects!

On top of that, accessing a sequential stack (an array) is significantly
faster; CPUs are built for that type of access pattern.

~~~
jstimpfle
I had to read a little between the lines to understand your point, which I
think is this: Even if you have an object type that requires only lowest-
performance memory management (say, editor undo/redo items), you should
separate these allocation concerns from all the other allocation concerns in
the program, because you might risk contention. It's important to group
allocations by lifetime.

To which I'll say, shouldn't we rather optimize and separate the high-
bandwidth stuff, so we can keep using malloc-per-object for the unimportant
stuff?

The rest of your comment does not apply to a simple undo/redo system:

> In some algorithms, 90-99%+ of runtime can be in allocation and freeing
> small objects!

> On top of that, accessing a sequential stack (array) is significantly
> faster. CPUs are built for that type of access patterns.

~~~
geocar
Not the person you’re replying to, but:

> shouldn't we rather optimize and separate the high-bandwidth stuff, so we
> can keep using malloc-per-object for the unimportant stuff?

If it helps, sure, but one problem that shows up nastily at high bandwidth is
variable latency: malloc might be too clever and vary on the order of msec,
and in hard-real-time environments you can’t amortise the cost over multiple
messages, since missing one means you never catch back up (without e.g.
throwing data away).

~~~
ShroudedNight
I've had only cursory exposure to it, but my understanding is that TLSF
malloc[1] is designed to directly address those concerns. Is it known to be
deficient such that its claims are invalid in, or worthless for, real-world
use?

[1]
[http://www.gii.upv.es/tlsf/files/ecrts04_tlsf.pdf](http://www.gii.upv.es/tlsf/files/ecrts04_tlsf.pdf)

------
Koshkin
Abstractions are leaky - much more often than one would like to think. On
modern 64-bit systems, for example, with the apparently "limitless" virtual
memory one has to pay close attention not only to the allocation patterns but
also to the access patterns: random reads and writes from/to a huge array may
lead to a significant drop in performance.
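
To illustrate, both loops below compute the same sum over the same elements; on a large array the permuted one defeats the hardware prefetcher and takes far more cache misses, despite identical arithmetic (a sketch; sizes and names are arbitrary):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1u << 20)                   /* 1M elements, well past typical L2 */

/* Walks the array in order: the prefetcher hides most of the misses. */
uint64_t sum_sequential(const uint32_t *a) {
    uint64_t s = 0;
    for (size_t i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Same elements, visited in a shuffled order: on a big array nearly
 * every access is a cache miss, and throughput drops accordingly. */
uint64_t sum_permuted(const uint32_t *a, const uint32_t *perm) {
    uint64_t s = 0;
    for (size_t i = 0; i < N; i++)
        s += a[perm[i]];
    return s;
}
```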

