
I was curious about the actual state of the "modern Linux kernel" pcwalton mentioned, so I tried running a test program that creates a million threads on a VM - x86-64 with 8GB of RAM, Linux 4.0. For comparison, Go apparently uses about 4KB per goroutine, so it should be possible to create somewhat under 2 million goroutines. To be fair, I allocated the stacks manually in one large allocation; otherwise it dies quite quickly from running out of VM mappings. I set the stack size to the pthread minimum of 16KB (actually, I tried cheating and making it smaller, but it crashed, so I gave up - not a good idea anyway). The threads waited for a signal from the main thread, sent once thread creation was done, before exiting; to avoid the overhead associated with pthread condition variables, I just used futex directly:

    while (!ready)
        assert(!syscall(SYS_futex, &ready, FUTEX_WAIT, 0, NULL, NULL, 0));

The program caused the kernel to hit OOM (rather ungracefully!) somewhere around 270,000 threads. To see how long it took while ensuring all the threads actually ran, I reduced the thread count to 200,000, had it join all the threads at the end, and timed this whole process: after the first run it took about 4 seconds. (The first run was considerably slower, but that isn't a big deal for a server, which is the most important use case for having such a large number of goroutines/threads.) Therefore, the C version uses about 20 microseconds and 32 KB of memory per thread.

For completeness, I also tested a similar Go program on Go 1.4 (the version available from Debian on the VM); it actually got up to 3,150,000 before OOM, and took 9 seconds to do 2 million - 4.5 microseconds and 2.7KB per thread.

In other words, Linux is about an order of magnitude slower at managing a large number of threads. That looks pretty bad, but on the other hand, it's not that much in absolute terms! I'm pretty sure most server programs don't need more than 250,000 simultaneous connections (or can afford to spend more than 8GB of RAM on them) and don't mind spending an extra 20 microseconds to initiate each connection, so if operating systems other than Linux aren't a concern, they could be written with a thread per connection without too much trouble. It's not going to give you the absolute maximum performance (meaning it's not appropriate for a sizable class of programs - then again, I suspect Go isn't either), but it's not terrible either.

I'd like to see it improve. I wouldn't be surprised if there is (still) some low hanging fruit; do kernel developers actually care about this use case?

(And yes, I know this doesn't really test performance of the scheduler during sustained operation. That's its own can of worms.)




> It's not going to give you the absolute maximum performance (meaning it's not appropriate for a decent class of program - then again, I suspect Go isn't either), but it's not terrible either.

Yeah, this matches our results when we did similar tests. Green threads are definitely faster if you're just spawning and shutting threads down, but if you're actually doing I/O work on those threads, the relative overhead quickly becomes insignificant.

It's not the fastest way to do I/O, but Go's approach isn't either. The fastest way to do I/O is to forego the thread management syscalls and the stack, like nginx does.


>It's not the fastest way to do I/O, but Go's approach isn't either

Though Go's is faster.

>forego the thread management syscalls and the stack, like nginx does.

The problem is nginx uses state machines, which aren't a general solution. Stackless coroutines allow you to write functions that the compiler (or a clever library) transforms into state machines.


> To be fair, I allocated the stacks manually in one large allocation; otherwise it dies quite quickly running out of VM mappings.

Okay, so the test you did doesn't actually reflect the use case in practice. Can I expect to reach 200,000 threads if the threads are not all created at exactly the same moment? What if (God forbid) they're doing memory allocation? And if it does work out, will everything be handled efficiently?


Hope comex replies to your question. Typical green thread usage is spawn-em-as-you-need-em, so if in order to spawn lots of 1:1 threads I need to do it all up front, that could be very limiting or complicating.


Yeah, I just made a mistake - you can increase the maximum number of mappings via /proc/sys/vm/max_map_count; I tried doing that and switching back to normal stack allocation (while still specifying the minimum size of 16KB with pthread_attr_setstacksize), and it didn't change the number of threads I was able to create.

...in fact, neither did removing the setstacksize call and having default 8MB stacks. I guess this makes sense: of course the extra VM space reserved for the stacks doesn't require actual RAM to back it; there is some page table overhead, but I guess it's not enough to make a significant difference at this number of allocations. Of course, on 32-bit architectures this would quickly exhaust the address space.

If increasing max_map_count hadn't worked, it would still be possible to allocate stacks on the fly - you would just get a batch of them in one mmap() call, and therefore in one VM mapping, and dole them out in userland. However, in that case guard pages wouldn't separate different threads' stacks, so you would have to generate code that manually checks the stack pointer to avoid security issues from stack overflows, rather than relying on crashing into the guard page. Rust actually already does this, mostly unnecessarily; I'm not sure what Go is doing these days, but I think it does too. Anyway, given the above result, I suspect this won't be an actual issue, at least until the number of threads goes up by another order of magnitude or so.


Thanks for your reply. Wonder how other platforms fare.


That doesn't seem that unrealistic — you could allocate your stacks using slab allocation, for example. I wonder why the kernel allocator doesn't do a better job, though.


The key number for server applications is latency — it doesn't matter if you can spawn 200,000 threads if it takes you four seconds to serve a request. And if a thread can't serve a request before its time slice is up, it gets sent to the back of the line and its cache entries get evicted by other threads.

Coroutines have three awesome advantages here:

Really fast switches (it's basically a function call — no need to trap into the kernel)

Scheduling in response to IO events (a coroutine can yield when making a "blocking" IO call and resume when it completes). You can write asynchronous code that looks synchronous!

Better cache behavior (hugely important on modern processors). You can allocate all your coroutine structures together and don't have to suffer all the cache evictions of a context switch every time a coroutine yields.

It would be interesting to see this benchmark repeated with a latency requirement.


> Scheduling in response to IO events (a coroutine can yield when making a "blocking" IO call and resume when it completes). You can write asynchronous code that looks synchronous!

You can also write "actual" synchronous code which does the same thing in the kernel. :P And there are several CPU schedulers to choose from to fine-tune when things get woken up.

I agree it would be interesting to test latency; actually, the numbers would be much more useful than my four seconds. After all, if thread creation throughput or latency becomes an issue, you can keep a thread pool around, while still retaining the advantages of one connection per thread.


>You can also write "actual" synchronous code which does the same thing in the kernel. :P And there are several CPU schedulers to choose from to fine-tune when things get woken up.

You can. But no matter how good your kernel scheduler is, that trip through it is going to cost you. Why? Because it's going to wreck your cache and TLB entries. I think the TLB is a big part of the story, given it's small, specific to the process, and particularly expensive to miss.

http://www.cs.cmu.edu/~chensm/Big_Data_reading_group/papers/...

Anyway, I'm sure you know this already. :) Out of curiosity, did you get a chance to take CS169 before you left Brown?

And I think the kernel already does pool/recycle stacks, but I guess not 200,000 of them given your results. An explicit thread pool would certainly work too.


Kernel stacks are allocated by Linux in alloc_thread_info_node; user stacks by glibc in allocate_stack, which does have some sort of cache. Overall there's a fair bit of work going on that could be skipped...

But yeah, certainly userland context switches are faster; it's just that your comment seemed to postulate "scheduling in response to IO events" as a benefit of coroutines over threads, as if the kernel had no way to distinguish between threads waiting for IO and threads needing to be woken up. I presumably misinterpreted it.

Interesting paper. It seems like it could be a useful step toward my ideal imagined environment where the distinction between coroutines and threads would be meaningless - because scheduling would be a job shared between userland and the kernel, so userland could do fast thread switches, batch system calls, etc., while the kernel would still give them PIDs and let them take signals or be ptraced, and native tools (debuggers) would treat them as threads.

Though the paper is from 2010; do you know if anyone has tried to implement anything along similar lines in production?

And no, I didn't take that course.


Oh OK, I definitely could have written that better.

From what I can tell, M:N threading has been thoroughly abandoned by Linux, but it might be worth revisiting given the horrendous I/O scaling of Linux kernel threads. I would imagine scheduling user threads cooperatively simplifies things a lot.

Google has a very interesting talk about cooperative native thread scheduling, but they haven't upstreamed their code:

https://www.youtube.com/watch?v=KXuZi9aeGTw

And here's what Linux devs are actually doing (it's underwhelming - something like 50% of what the hardware is capable of):

https://lwn.net/Articles/629155/

In the meantime, people who actually need the performance (HPC and even some enterprise servers) have to bypass the kernel entirely. It looks like that's going to be possible even in virtualized environments, thanks to hardware support:

http://people.inf.ethz.ch/troscoe/pubs/peter-arrakis-osdi14....


It would be interesting to run the same test using musl, which allows a smaller minimum stack size.



