

RethinkDB internals: handling stack overflow on custom stacks - coffeemug
http://www.rethinkdb.com/blog/2010/12/handling-stack-overflow-on-custom-stacks/

======
cperciva
I usually enjoy reading RethinkDB blog posts -- it's unusual for a company
with such smart people to write about _technical_ issues they come across --
but I got as far as the first sentence before being disappointed this time.

    
    
      On my computer, the callstack of a new process is around 10MB.
    

I can't imagine what sort of thinking produced this sentence.

First, this limit isn't set by his _computer_. His computer might affect the
limit; but the value is also likely to be affected by his choice of operating
systems, his login.conf settings, whether he is root, and possibly environment
variables.

Second, processes don't have stacks. _Threads_ have stacks. And often the
first thread gets more room than later threads.

Third, the stack starts out small -- like, a few hundred bytes or so,
depending on the amount of bloat in his platform's crt1.o (or equivalent). It
can grow, of course; but the size of something is not equivalent to its
_maximum permitted_ size -- you would never say "a newly created file is 2^64
bytes" just because that's the maximum possible size.

I'd suggest replacing the first sentence with

    
    
      On my computer / OS / login environment, the default maximum permitted size of
      the primary thread's stack in a new process is around 10 MB.
    

(Random side note: You guys have heard of userland threads, right? You seem to
be doing an awful lot of wheel-reinventing here -- not necessarily a bad
thing, of course, but I didn't find your "threads are too expensive" line
particularly convincing.)

~~~
coffeemug
Hi Colin! I don't necessarily disagree with you, but I feel like your
arguments address semantics/linguistics and not the actual issue.

 _First, this limit isn't set by his computer._

Obviously the size of the stack depends on a number of variables (including
the hardware architecture, the kernel, how the kernel was compiled, various
settings, etc.), but on Dan's configuration it's roughly 10MB, as verified by
the program at the end of the blog post. Here is a program that confirms it by
computing the number in a different way:

    
    
      #include <pthread.h>
      #include <stdio.h>
    
      int main(int argc, char *argv[])
      {
         size_t stacksize;
         pthread_attr_t attr;
         pthread_attr_init(&attr);
         pthread_attr_getstacksize(&attr, &stacksize);
         printf("Default stack size = %zu\n", stacksize);
         pthread_attr_destroy(&attr);
         return 0;
      }
    

On our configuration (which is roughly what most customers are expected to
run) the output is 8388608, or 8MB. According to the manual,
_pthread_attr_getstacksize_ returns _the minimum size (in bytes) that will be
allocated for threads created using the thread attributes object attr_.
Naturally the size can be changed manually (in the code or with ulimit), but
that is the default for newly created threads. The point wasn't that you
couldn't change this value; it's that for our purposes (lots of green
threads, shallow stacks) a much smaller default stack size is more sensible.

 _Second, processes don't have stacks. Threads have stacks._

Certainly, but this is a question of terminology. On Linux, threads are
special-case processes (with address-space sharing and a few other changes).
For example, when calling _strace_ you have to pass -f to see syscalls from
all the threads instead of just the main one.

 _Third, the stack starts out small -- like, a few hundred bytes or so._

Again, not according to the manual (see above).

 _You guys have heard of userland threads, right?_

When we get the chance we'll post benchmarks between GNU Pth, NPTL, and our
coroutine implementation. I can assure you that NPTL is much slower than what
we're doing. GNU Pth is a real contender, but there are some issues associated
with using it (mainly the fact that we need to be able to move coroutines
between native kernel threads, and it's not clear how well GNU Pth can handle
that). The coroutine implementation won't be making it into the first release
-- ultimately we'll test all three alternatives and pick the best one, but it
will take some time before that happens.

~~~
cperciva
_I don't necessarily disagree with you, but I feel like your arguments address
semantics/linguistics and not the actual issue._

It might be that, yes. But as a philosophy professor of my acquaintance likes
to remark, "clear thinking requires clear language" (the irony of this phrase
coming from a philosopher's mouth seems to be lost on him). Using unclear
language both misleads the audience and makes it unclear whether the speaker
understands what he's talking about.

(I find the same is true of code -- the clarity of some code seems very highly
correlated with how well its author understands the problem space in
question.)

Between us we're using three different definitions of "stack size" -- the size
of the actual stack, the amount of RAM allocated, and the amount of VM space
allocated; needless to say, I think the first definition makes the most sense.
;-)

 _When we get the chance we'll post benchmarks between GNU Pth, NPTL, and our
coroutine implementation._

I recommend trying other operating systems too. Linux is famously bad at
threading, and only survives in benchmarks because most applications which run
on Linux are equally bad in how they use threads. What you really want (and
seem to be implementing) is an M:N threading system, which is considerably
more performant than either 1:1 or M:1.

~~~
coffeemug
_But as a philosophy professor of my acquaintance likes to remark, "clear
thinking requires clear language"_

I think you need to balance that with taking some shortcuts, or you'll never
get to the essence (or at least your readers won't).

------
silvestrov
Use 64-bit and all problems with 10 MB stacks go away. 64 KB is surprisingly
small when you call a library function or if the OS uses the userspace stack
for OS calls.

~~~
littledanehren
This is something that will be good to look into in the future. It may be
that, in certain system configurations, that's fine. But if there are 10,000
coroutines each with 10 MB stacks, and memory overcommitting is turned off or
configured to be relatively conservative (as some system administrators like
to do), you'll need to reserve roughly 100 GB of physical memory + swap just
for coroutine stacks. That may or may not be a problem. In the system I've
implemented, the stack size is a parameter that can be varied when the
database is launched, so this is something we can easily benchmark in the
future.

