

Split Stacks in GCC - swannodette
http://gcc.gnu.org/wiki/SplitStacks

======
barrkel
If you twist your perspective a little, you can change how you view the stack:
instead of looking at it as a literal, linear blob of data, consider it the
'data' or 'scope' pointer of a closure (frequently implemented as a pair of
data and code pointers). A key element of this closure is the continuation
closure, which is implicitly passed in to closures when they get called; the
CPU's 'RET'-style opcode is the normal way of invoking it.

If one carefully adds stack splitting and duplication capabilities to an
ordinary procedural language, it amounts to a conversion to continuation-
passing style. Not only does that cut down on some of the drawbacks of massive
multithreading on 32-bit (address space exhaustion, though not the expense of
context switches), but it also permits interesting implementation strategies
(e.g. making call/cc work in C or languages implemented in terms of C, which
can in turn be applied to implementation of stateless servers).

------
swannodette
Stack splitting is also central to efficiently implementing parallel logic
programming languages. This was the missing insight required for efficient or-
parallelism that the 5th generation computing project was hoping for.

Nice summary and much more about that here:
[http://bizrules.info/conference/ORF2008DFW/GopalGupta_Buildi...](http://bizrules.info/conference/ORF2008DFW/GopalGupta_BuildingUltimateRuleEngine_ORF2008.pdf)

~~~
silentbicycle
Wow, that IS a nice summary and bibliography. Thanks for posting it!

There are some other good presentations on C/LP in the same directory, too.

------
pmjordan
Curious idea, clearly borrowing from managed runtimes with garbage collected
stack frames.

Random idea (feasible only where address space is plentiful): why not let the
operating system's virtual memory subsystem worry about allocating stack
space? Reserve a large stack with mmap(), let page faults take care of
actually allocating the memory, and periodically munmap() and re-mmap() the
upper part of the stack.

~~~
barrkel
This is exactly how Windows allocates the stack. The address space is
reserved, but only the first page is committed; the page after that is called
the guard page. Touching the guard page raises an exception, and the
first-chance handler for that exception commits the page, sets up a new guard
page at the next position, and resumes execution.

Compilers targeting Windows need to generate stack touching prologues for
procedures that allocate more than 4K of local variable data, or else they
risk code first touching the page beyond the guard page, which will cause a
much more unpleasant access violation.

~~~
pmjordan
Ah, I vaguely remember encountering that years ago now that you mention it. I
assume it doesn't actually ever purge pages that go unused for a while unless
the thread exits?

------
malkia
I've only quickly skimmed it, but while reading I kept wondering how he's
going to handle alloca() (and later I read about dynamic arrays).

So the solution is to allocate them dynamically and free them at the end of
the routine. That's all good, but what if you longjmp() while an alloca()
block is still allocated? That would leak... unless unwinding is done and an
RAII-like mechanism for alloca is added. (Or maybe wrap alloca in C++ RAII,
or maybe that's what the compiler would be doing.)

------
inoop
Here's a related paper from 2003 called 'Capriccio: Scalable Threads for
Internet Services'. It uses call graph analysis to figure out where to place
checkpoints that allocate and deallocate stack frame chunks.

[http://ece.ut.ac.ir/Classpages/S86/ECE404/Papers/capriccio.p...](http://ece.ut.ac.ir/Classpages/S86/ECE404/Papers/capriccio.pdf)

~~~
faragon
The only scalable approach for "Internet Services" I know of is non-blocking
I/O with just one thread per physical CPU thread (i.e. avoiding thread
context switching).

Edit: After RTFM: the purpose of the paper is to simulate threads handling
blocking I/O on top of epoll and/or select. Great paper.

------
Peaker
> It becomes possible to run millions of threads (either full NPTL threads or
> co-routines) in a 32-bit address space

If the minimal "reserved stack space" per thread is 4K, then a million
threads would consume about 4GB of address space. That's not quite feasible
in a 32-bit address space. Would most threads run with even less than 4K of
reserved stack space?

~~~
jwecker
Using a 4k minimum was just one of the possible implementation strategies. It
looks like there are several ways to get it much, much lower.

------
lokedhs
On a 64-bit architecture this is completely pointless, as address space is
not a problem. Memory is not a problem either, since all modern operating
systems allocate stack memory as needed.

So, my question is why spend time on implementing this?

~~~
Peaker
> Memory is also not a problem since all modern operating system allocates
> stack memory as needed.

A modern OS can only allocate memory as needed at page-level granularity.

For many millions of threads this is not good enough, and will waste GBs of
memory completely unnecessarily. Maybe in a decade wasting GBs of memory will
be negligible (but then we might want trillions of threads!). And page sizes
seem to be growing too, so the OS granularity is too coarse for
micro-threads.

------
faragon
What about side effects such as heap memory fragmentation?

------
ilikejam
While we're at it, could someone fix Java so that I don't have to guess the
f*!$ing heap size all the time?

