How long does it take to make a context switch? (2010) (tsunanet.net)
87 points by majke on March 22, 2017 | 72 comments



Pretty sure that he's just rediscovering what I did in lmbench back in the early 1990's. All written up and open source (and got best paper in the 1995 Usenix):

http://mcvoy.com/lm/bitmover/lmbench/lmbench-usenix.pdf

Edit: which, now that I think about it, was over 20 years ago. Reminds me of something a friend (Ron Minnich) said long ago: people rediscover stuff in computer science over and over, nobody wants to read old papers. Or something like that. He said it better; I think it was something like "want to look smart? Just take any 5 year old paper and redo whatever it did, our attention span seems to be very short."


Your link is about 1990's hardware, the article is about 2010 hardware.

The article was revisiting the same topic exactly because the hardware changed (the author said so multiple times), and cheap context switches were a big marketing topic at the time.

It would even be interesting to get some newer study of that. You can bet the relative costs of a context switch changed a lot since 2010, as did the waiting latency.


Did you read the posted article? It spends a lot of time trying to figure out how to measure context switches. Which I carefully wrote up around 1995 and still works to this day. That's independent of the hardware and was sort of my point. This is a solved problem.

And if you look at Table 10, the time for a context switch is remarkably similar to what it was 20 years ago (which means the OS is fatter; not surprising, it scales up on SMPs now).


I got the impression you were talking about the numbers.

In that I agree. You've got basically the same procedure written there for anybody to learn. And a nice tool that runs it.


The numbers are just a snapshot into history but what I measured and how it was measured is far more interesting. For example your comment "you've got basically the same procedure" isn't quite right. If you read the paper, much effort was spent explaining to people that the baseline context switch is just a floor; what you will see in real life will be bigger, since in real life a context switch is more than just a register file save/restore.

Very few benchmark writers get that point but that's sort of the main point. Not getting it means you are doomed to be misled.


Replying to my own reply, I reread what they were doing and yeah, you are right, they do have basically the same procedure. My bad. Good on them for getting it (though I'll grumble that this was a solved problem, you can apt-get install it).

But you were right, looks the same. Sorry about that.


No problems. Overall, I'm glad you shared your paper, lmbench seems very nice.


Cool, glad you aren't pissed. And this whole thread has made me want to do an lmbench follow-on paper, just get a new set of results for 2017 for *BSD, Linux, MacOS and maybe Solaris (does anyone care about anything other than Linux, MacOS, Windows these days?).


You're like the auto mechanic I talked to that was bemoaning this generation and how automotive tech has basically "stayed the same" since the 50s... "MacPherson struts, you ever heard of those boy!? Yeah - invented in the 50s!" Nothing wrong with a fresh take on an old topic.


Author somewhat addresses that in a comment:

@Adrian Cockcroft: I didn't use lmbench because I was wondering how to write such a benchmark in the first place, and you always learn more by building things yourself than by using something already existing. My little benchmarks aren't supposed to be alternatives to lmbench, I simply provided the source code so others could unambiguously see how I did the benchmarks and reproduce them.


Everything old is new again it seems.

I guess this is a variant of CADT, based on ignorance of the past rather than a chase of shinies.


People have always obsessed over "news", since humans are information-hungry by nature. But the modern tech crowd seems to have forgotten that there are two kinds of news: the kind that is objectively new (i.e. recent data), and the kind that is subjectively new (i.e. I didn't know that before). I think the situation is a bit similar to the distinction between objectively established frequencies and subjective Bayesian priors.


Full results from mcvoy.com, i7 @ 3.5GHz:

                     L M B E N C H  2 . 0   S U M M A R Y
                     ------------------------------------
    
    
    Basic system parameters
    ----------------------------------------------------
    Host                 OS Description              Mhz
    --------- ------------- ----------------------- ----
    slovax.mc Linux 4.4.0-5   x86_64-glibc223-linux 3497
    
    Processor, Processes - times in microseconds - smaller is better
    ----------------------------------------------------------------
    Host                 OS  Mhz null null      open selct sig  sig  fork exec sh  
                                 call  I/O stat clos TCP   inst hndl proc proc proc
    --------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
    slovax.mc Linux 4.4.0-5 3497 0.09 0.12 0.28 1.76 2.170 0.12 0.58 86.4 249. 644.
    
    Context switching - times in microseconds - smaller is better
    -------------------------------------------------------------
    Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                            ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
    --------- ------------- ----- ------ ------ ------ ------ ------- -------
    slovax.mc Linux 4.4.0-5 7.630 8.5200   10.5   10.9   15.5    10.8    19.0
    
    *Local* Communication latencies in microseconds - smaller is better
    -------------------------------------------------------------------
    Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                            ctxsw       UNIX         UDP         TCP conn
    --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
    slovax.mc Linux 4.4.0-5 7.630  13.6 14.7  20.8  19.4  25.8  24.5 25.8
    
    File & VM system latencies in microseconds - smaller is better
    --------------------------------------------------------------
    Host                 OS   0K File      10K File      Mmap    Prot    Page   
                            Create Delete Create Delete  Latency Fault   Fault 
    --------- ------------- ------ ------ ------ ------  ------- -----   ----- 
    slovax.mc Linux 4.4.0-5 2.7920 2.0480 8.2600 4.0840    17.6K 0.158 5.00000
    
    *Local* Communication bandwidths in MB/s - bigger is better
    -----------------------------------------------------------
    Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                                 UNIX      reread reread (libc) (hand) read write
    --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
    slovax.mc Linux 4.4.0-5 1928 11.K 5674 43009M  11.0K  11.2K 7686.9 14.K 10.3K
    
    Memory latencies in nanoseconds - smaller is better
        (WARNING - may not be correct, check graphs)
    ---------------------------------------------------
    Host                 OS   Mhz  L1 $   L2 $    Main mem    Guesses
    --------- -------------  ---- ----- ------    --------    -------
    slovax.mc Linux 4.4.0-5  3497 0.889 2.6670   57.4


You should probably share the fixes that made it compile (let alone work) for you. I'd like them, anyways.


Kind of like hacker news posts.


One reason people are reluctant to trust old sources (regardless of how well received they were at the time) is that software, operating systems and hardware all move at lightning pace. I've not read your paper, but I'm willing to bet enough has changed to warrant a new writeup talking about modern systems.


I very much doubt it. If you read the paper, it took each of the things it could measure (memory, disk, network, etc) and measured bandwidth and latency. Those are pretty basic measurements and I suspect that they will still be interesting 100 years from now and measured in the same way.

I'm running the benchmark right now on a modern system just to see what it says. Yep, still works fine:

    "size=0k ovr=0.90
    2 7.63
    4 6.51
    8 7.78
    16 7.07
    24 7.78
    32 9.02
    64 10.35
    96 10.49
That says that a hot potato (I send a byte to you, you send it back to me, the time is the round trip for one pair) context switch is ~7 usecs. Do it in a ring with 96 processes and it goes up to ~10 usecs.

I think almost 100% of that benchmark is still valid (I say almost because there must be something that has bit rotted in 20+ years but I don't know of anything). It's worth a read.

Edit: formatting.
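
For anyone who wants to reproduce that kind of number without installing anything, here is a minimal sketch of the same hot-potato idea (an illustration of the technique, not lmbench's actual lat_ctx): two processes bounce a byte over a pair of pipes, and the per-iteration round-trip time is what the 2p/0K numbers above refer to. It only measures the floor; real workloads add cache refills on top.

    /* Minimal sketch of the hot-potato measurement described above: two
     * processes bounce one byte over a pair of pipes; total time divided
     * by the number of round trips approximates the 2p/0K floor.
     * (Illustration only, not lmbench's lat_ctx.) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    
    int main(void) {
        int ab[2], ba[2];                 /* parent->child and child->parent pipes */
        char byte = 'x';
        long i, iters = 100000;
    
        if (pipe(ab) || pipe(ba)) { perror("pipe"); exit(1); }
    
        if (fork() == 0) {                /* child: echo the byte back */
            close(ab[1]); close(ba[0]);   /* close unused ends so EOF works */
            while (read(ab[0], &byte, 1) == 1)
                (void)write(ba[1], &byte, 1);
            _exit(0);
        }
        close(ab[0]); close(ba[1]);       /* parent's unused ends */
    
        struct timeval start, end;
        gettimeofday(&start, NULL);
        for (i = 0; i < iters; i++) {
            (void)write(ab[1], &byte, 1); /* hand the potato over...          */
            (void)read(ba[0], &byte, 1);  /* ...and block until it comes back */
        }
        gettimeofday(&end, NULL);
    
        double usecs = (end.tv_sec - start.tv_sec) * 1e6 +
                       (end.tv_usec - start.tv_usec);
        printf("%.2f usec per round trip (one pair)\n", usecs / iters);
    
        close(ab[1]);                     /* EOF so the child exits */
        wait(NULL);
        return 0;
    }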


> operating systems and hardware all move at lightning pace

Not really; the last big leap was the ability to unlock the GPU for stream processing, and that's it.

It's quite a bit faster today than ten (or even one) year ago, but at its core it's still a von Neumann machine and not much new has been discovered since. Mostly we're repackaging under new names things that were discovered in the 70s, like the thick vs. thin client debate, which has played out more or less the same way over and over as the bottleneck switched from bandwidth to latency to client performance; the last iteration of which was Ember and Angular precompilation.


Some of the stuff goes back even further than that.

The thread on the Intel Optane stuff the other day had an interesting conversation about how you would program for such systems and someone said we already did, mainframes 50 years ago.

It's nice to see old ideas in a fresh context.


> I've not read your paper, but I'm willing to bet [...]

Offloading all burden of proof onto the partner in a conversation strikes me as... odd, especially if said partner already provided a paper on the very same topic that you did not even read.


On the order of a millisecond. Which seems to me way too slow in this day and age. The context switch has been a hard wad that chokes algorithms in unexpected ways that are hard to diagnose.

Could this be fixed with a different approach to CPU design? Most threads block on a semaphore-thingy or a timer or wait for I/O completion. If the CPU made those an opcode instead of a wad of kernel code, the switch could be reduced to a clock cycle (ns?)

How could this work? Maybe a global semaphore register where you Wait on a mask. Each bit could be mapped to one of those wait-conditions. Another opcode would Set the bit.

To scale, the CPU would have to have not 2 or 4 hyperthreads but maybe 1000. So every thread in the system could be 'hot-staged' in a Wait, ready to execute, or executing.

There would still be cache and tlb issues, but they would be at the same level as any library call. They wouldn't include yards of kernel code messing with stacks and register loads and so on.


Hey, same Joe from Sun? If so, howdy.

On my i7-5930K I'm seeing 7-10 usecs, quite a bit better than a millisecond. Linus cares about this stuff, I'd be very surprised if he didn't measure it regularly.


Probably thinking of my brother, Rich?

usecs are a thousand times better than a ms of course. And a ns is a thousand times better than that!


> If the CPU made those an opcode instead of a wad of kernel code, the switch could be reduced to a clock cycle (ns?)

Not for any real world workload. It's worth reading section 6.6 of this: http://mcvoy.com/lm/bitmover/lmbench/lmbench-usenix.pdf

but I'll try and restate it here. So what's a context switch? Lots of people think it is putting one process to sleep and waking up another process. Which is sort of correct. Let's look at what that means. "Sleeping" a process means I have to take whatever hardware context there is and store it away (int/float registers are the basics). "Waking" a process is the opposite, go find the stored state, restore it, and return (sort of like longjmp()) to the awoken process.

But that is the minimal context switch, that's what you would have if you had a yield_to(pid_t pid) system call and you had a tight loop where pid A yields to pid B which yields to pid A, etc. Do that a million times, measure the total time, divide by a million.

That's not real world at all. In the real world your context is more than your registers, it is also your cache footprint. Pid A wasn't just yielding, it was doing some work. When it comes back, it has to either get lucky (with the right kind of cache that can hold more than one process's cache lines; look up physically vs virtually indexed caches, Hennessy & Patterson cover this nicely) or it has to reload state from L3 or main memory.

That's what the lmbench paper goes into a lot of detail about. People write benchmarks and don't think about the fact that there is more to life than benchmarks. lmbench's context switch benchmarks are done as a function ctx(int N_processes, int working_set). Most people would benchmark ctx(2, 0) but that's nonsense unless you care about the silly yield/yield numbers. I dunno about you, but when I'm debugging a perf problem, in ~30 years of doing that I have yet to see a perf problem around yield/yield (well, sort of. Oracle's distributed lock manager was as close as I've seen, but even it had a larger context than zero bytes more than the register store/restore).

The interesting questions are:

- are your caches set up for multiple processes so you still have state when you switch back?
- how many different processes can you have in the cache[s] at once?
- how big is your OS path to sleep/wakeup? (that uses up space in the instruction cache)

The lmbench paper talks about all this stuff, complete with crappy graphs written in troff/pic by yours truly that let you visualize when things will fit and when things will thrash.
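
To make the ctx(N_processes, working_set) point concrete, this is roughly the delta you would add to a ping-pong sketch like the one earlier in the thread; the names and sizes here are illustrative, not lmbench's actual code:

    /* Illustrative delta for ctx(N_processes, working_set): between token
     * passes, each process walks a private buffer. Its cache lines get
     * evicted by the other side's work, so they must be refetched after
     * every switch -- that refill is the "real" context switch cost.
     * (Sketch only, not lmbench's lat_ctx implementation.) */
    #include <stddef.h>
    
    static volatile char sink;                 /* defeat dead-code elimination */
    
    static void touch_working_set(char *buf, size_t size) {
        size_t i;
        for (i = 0; i < size; i += 64)         /* stride by one cache line */
            sink += buf[i];
    }
    
    /* In each process's ping-pong loop:
     *
     *     read(in_fd, &byte, 1);
     *     touch_working_set(buf, working_set);   // 0, 16K, 64K, ...
     *     write(out_fd, &byte, 1);
     *
     * With working_set = 0 you measure the floor; as it grows, the reported
     * "context switch" time grows with it, which is the point above. */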


Ok it'll never be 1 clock cycle. But it sure can be better than milliseconds!


Where are you getting milliseconds? Their results are right in line with what I'd expect, on the order of ~10-50 usecs.


Ok I only read halfway through the document.

Even knowing some measurements are 10-50usec, that's still 10000X slower than a cpu instruction. So still room for improvement.


You do realize the relative order of things, right? While a CPU instruction is around 1/4 of a ns (assuming a single-cycle instruction, which is not always the case), a memory load is ~60ns.

In order to really see if there is room for improvement one would need to take a look at each cache miss and see if that could be removed. On my machine, a simple context switch is 7.63us and a memory load is 57.4ns, which works out to about 133 cache misses (7,630 / 57.4 ≈ 133).

Could that be better? Sure, in theory. Could it be 10000x better? Not a chance. Even if you made a zero-cost context switch, you would still have cache misses getting there and coming back. Even if the context switch, the OS overhead, all of that were zero cost, if the two processes doing work have ~70 cache misses each then you are right back to where we are now.

It's 25 years later - what's changed? CPUs are 20x faster (.25ns clock instead of 5ns clock) but main memory is only ~3x faster. To the extent that things don't fit in the cache you would expect a 3x improvement in context switch time. Instead we have it almost the same speed (old was 6usec, current is 7usec). But the OS hasn't been held constant, it went from basically single threaded to scaling up to lots of CPUs. That scaling is not free, you are taking and releasing locks in your context switch code.

Personally, I doubt there is a lot of room for improvement. Some, sure. But I'd be astounded if you could figure out how to get a factor of two in Linux without losing something else.


Nice analysis.

As a last stand, I'd reiterate that different multi-thread algorithms could be practical if the minimum switch time were much, much smaller. Thread wakeup times on the order of hundreds or thousands of instructions, where the cache had not been substantially changed. E.g. ping-pong threads that passed buffers back and forth for staged processing. They would even benefit from the caching done by the other thread in the algorithm that had processed the same data.

All impossible now, with the large thread-switch stone around our necks.


But this is sort of like saying things would be so much better if we didn't have this annoying speed of light constraint. Yeah, if memory was all L3 instead of main memory, things would be faster, but it isn't.

Don't get me wrong, I'm all for primitives being light weight, that was the whole motivation behind writing lmbench. You and I agree on the goal.

Where we part ways is what is possible. When context switches are being done in the time it takes to do ~140 cache misses I don't see how it is possible to get to "much, much smaller". Smaller, sure, much smaller? How? Which cache misses are you going to not do?

What I'm looking for is actual insight into the problem. If you've looked at the code and you know the hardware and you know each cache miss on a first name basis and you are saying "right here, these 70 misses are not needed, we can get 2x better if we lose these", that's something that I and everyone would want to know. We already agree that less is more, the point is how do you get to less? Is it even possible?

Lemme run some numbers for i686 25 years ago. Back then, we were at about 34 cache misses/context switch. Now we are at about 140. So it would seem like there is some room there. But back then we didn't have the lock traffic we have today. And who knows what else they have shoved into the kernel. Let's ignore the lock traffic and extra bloat and say we could get back to 34 misses. That would get us to 2usecs rather than 7usecs. Which is better, but does it meet your definition of "much, much smaller"?

That's where I'm coming from, I'm sizing the problem. I have no idea if they can shave it down to 34 misses, I very much doubt it. But I think we can agree that given the fact that the kernel is fine grained threaded they aren't going to get to better numbers than what they had when it wasn't threaded, right? That's a lower bound in practice, agreed? So if the cache miss fairy showed up and gave us our best case scenario we're about 3x faster. Which is nice but not life changing. And I don't believe for a second that there is 3x worth of easy to remove cache misses/bloat in that code path, Linus would have to be asleep at the wheel to allow that.


Zero cache misses would be possible. If threads handed off data in a fine-grained fashion, with the cache essentially unchanged between switches... just like procedure call and return. In fact, maybe instead of procedure call and return. Impossible with the current model. Possible if thread switching was lightweight.

The idea is, threads are expensive (cache misses) now because they happen infrequently so the cache gets stale. But if they were as common as call-return then they'd behave at least as well. Except for this lead weight (today) of kernel switch/register reload/state reload.


Cue the comment from the 10+ year old vaporware Mill architecture's team... :)


> If the CPU made those an opcode instead of a wad of kernel code, the switch could be reduced to a clock cycle (ns?)

Uh, how do you think the kernel does it? Of course there's a CPU instruction for it - at least in modern ISAs like AArch64, but I can only assume it must have been tacked onto x86 too somehow or another.

One reason that it's implemented in the kernel is to interact well with the scheduler: it's all well and good for a user space process to spinlock (with some kind of enter-low-power-state instruction), but it's going to be spinning until the timer comes around and kicks it back to the kernel anyways, unless the other side of the lock just happens to be scheduled on a different core simultaneously. Even worse, if a thread is holding a spinlock and it gets unscheduled, now everyone else who wants to acquire that lock will waste their timeslices spinning away for nothing. The only way to avoid this would be for threads to be able to tell the OS "don't unschedule me" (much like the kernel masks interrupts during critical sections), but that would allow any such process not only to DOS the rest of the system but to hang it entirely. That's not to say it can't be done properly, but the easy and perhaps even correct way to do it is OS-managed semaphores.
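
For reference, the Linux building block for this kind of OS-managed waiting is the futex: spin briefly in user space, then ask the kernel to sleep until the other side wakes you, so an unscheduled lock holder doesn't cause everyone else to burn timeslices. A minimal sketch (illustrative only; races and error handling glossed over):

    /* Minimal sketch of a futex-based wait/wake pair, the primitive that
     * OS-managed semaphores on Linux are built from. Spin a little in user
     * space, then sleep in the kernel; the waker sets the flag and wakes
     * one sleeper. (Illustration only; a real lock needs more care.) */
    #include <stdint.h>
    #include <unistd.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <stdatomic.h>
    
    static long futex(uint32_t *uaddr, int op, uint32_t val) {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }
    
    static _Atomic uint32_t flag = 0;            /* 0 = not ready, 1 = ready */
    
    void wait_for_flag(void) {
        int i;
        for (i = 0; i < 100; i++)                /* brief user-space spin */
            if (atomic_load(&flag) == 1)
                return;
        while (atomic_load(&flag) != 1)          /* then sleep in the kernel */
            futex((uint32_t *)&flag, FUTEX_WAIT, 0);  /* blocks only if flag==0 */
    }
    
    void set_flag(void) {
        atomic_store(&flag, 1);
        futex((uint32_t *)&flag, FUTEX_WAKE, 1); /* wake one waiter */
    }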

> not 2 or 4 hyperthreads but maybe 1000. So every thread in the system could be 'hot-staged'

And of course this is what hyperthreads are, but where exactly do you plan on putting 1MB of architectural state while you're not using it? You're lucky if you have an L2 that big, but let's think... moving architectural state to the L2, that sounds a lot like... yes, correct! OS threads!

If we want to rescue your suggestion back into the realm of feasibility, what you're saying is we should implement the OS (with features like scheduler-aware semaphores and thousands of ready threads) in hardware to make things like moving 1MB of architectural state to and from an L2 cheaper (and that we should have larger L2s so it's not incredibly wasteful to use 1MB on architectural state of non-executing threads). There's a reason we implement OSes in software, though, even with all the associated costs.


...and yet OSs do thread switches on the order of usec or even msec. So I repeat, there's much room for improvement.


Um, you need to think more deeply about this. Yes, you can switch a "thread" (says the guy who wrote a user level thread package) very, very quickly but that's meaningless. Unless your application is one where all it does is context switch and does no other work.

"Context" is more than just the registers. You could make an instruction that did the context switch in zero time and a real context switch will take much more than zero time. Because there is more context than just the registers.


Yes all that is very, very clear. But adding 10,000ns to the switch time is going to torpedo any chance of a quick turnaround on a queue wait or IO completion. That's a huge timewad compared to a procedural solution that doesn't use multiple threads or locked queues. Thus the initial comment that using threads adds unpredictable issues that are hard to diagnose. And the suggestion that removing this delay will definitely reduce latency of event handling at all levels.

For example, returning to a thread 10,000ns quicker can help with cache utility - there's that much less cache washout occurring 'while you were away'. Tightening this up could allow, for the first time in any architecture, truly transparent multi-thread implementations of ordinary tasks without penalty. Imagine each thread executes a handful of instructions and then signals the other thread to continue. No cache issues (it's only been nanoseconds).

Sure, 'nobody does that'. But that's because, threads are so godawful expensive now. We'd be in a whole different headspace if that changed.


You seem to be living in an imaginary world where main memory is much, much closer to the CPU than it is in real life. 10,000ns is 174 cache misses on my machine. That's not a lot of room when you are popping something off a queue or finishing an IO.


The presumption of large cache misses is where our intuitions diverge. Algorithms where threads operated on the same data, for instance, would have few cache misses, or even none at all.

BTW, let's take it down a notch and back off on the personal aspersions, OK?


Sorry if it sounds personal, it's not intended to be so. I get weird when I see people not seeing all of the picture. To me, the picture includes main memory, I've spent way too much time with a logic analyzer watching stuff go across the bus (actually did a lot of this while tuning some hashing code for performance).

Memory misses are a fact of life. If you look at my comments elsewhere, I noted that

    CPU: 20x faster
    Main memory: 3x faster
    Context switch: about the same
If memory wasn't the main part of the equation, shouldn't the context switches be faster? Yes, as I pointed out, the OS is now fine grained multi threaded but that doesn't explain a 20x difference.

As for the comment that some stuff is going to have zero cache misses: sure, in theory. Can you show any real world examples of that? I would think that anything that takes a lock on more than one cpu/cache is going to have lock traffic that is going to cause a cache-miss-like stall. Has cache coherency changed such that locks are "free"? That would be (interesting) news to me.


The question is, would multiple threads have any essential difference from another embodiment, e.g. a procedural one? If not, then the (only) gain is being able to recast the problem as threads, for whatever that's worth. If thinking about an algorithm as synchronization-of-data is helpful, then the low-or-zero-latency-added thread model is helpful.


The author states:

>"In practice context switching is expensive because it screws up the CPU caches (L1, L2, L3 if you have one, and the TLB – don't forget the TLB!)."

How does a context switch "screw up" the L1 through L3 caches? Yes, when there is a context switch the TLB gets flushed and the kernel then needs to "walk" the page tables, which is expensive, and yes I can see the L1 needing to flush some lines, but is that really screwing up the L1, L2 and L3 caches, the latter two being fairly large these days?


A context switch changes the virtual-to-physical mapping. So, _if_ a cache uses virtual addresses for determining whether data is in the cache a context switch has to flush the cache.

It looks like (1) x86 is virtually indexed and physically tagged (i.e. you need a virtual address to find the cache line the data might be in, and the physical address to check whether that line actually holds the data. See https://en.wikipedia.org/wiki/CPU_cache#Address_translation for details).

(1) http://www.realworldtech.com/sandy-bridge/7/ but that may have changed, they might use different strategies for the various cache levels, or they may by now use some other tricks. Address translation is very hairy in multi-core systems.
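
A back-of-envelope way to see why such an L1 doesn't need flushing: if the index and offset bits fit inside the page offset, the "virtual" index is identical to the physical one. The geometry below is an assumption (32 KiB, 8-way, 64-byte lines, 4 KiB pages, typical of a recent x86 L1D), not something measured:

    /* Back-of-envelope check of the VIPT argument above. The cache geometry
     * is an assumption (32 KiB, 8-way, 64-byte lines, 4 KiB pages). */
    #include <stdio.h>
    
    int main(void) {
        int cache_bytes = 32 * 1024;
        int ways        = 8;
        int line_bytes  = 64;
    
        int sets        = cache_bytes / (ways * line_bytes);  /* 64 sets */
        int offset_bits = 6;   /* log2(64-byte line) */
        int index_bits  = 6;   /* log2(64 sets)      */
        int page_bits   = 12;  /* log2(4 KiB page)   */
    
        /* If index+offset bits fit inside the page offset, the set index is
         * taken from address bits that are identical in the virtual and
         * physical address: the lookup can start in parallel with the TLB
         * and nothing needs flushing on an address-space switch; only the
         * tag compare needs the translated (physical) address. */
        printf("sets=%d, index+offset=%d bits, page offset=%d bits -> %s\n",
               sets, index_bits + offset_bits, page_bits,
               (index_bits + offset_bits <= page_bits)
                   ? "virtual index == physical index"
                   : "aliasing possible, needs flushing or extra tricks");
        return 0;
    }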


>" So, _if_ a cache uses virtual addresses for determining whether data ..."

I was curious about your "if" caveat. I believe nearly all modern CPUs use the virtual address, as the cache doesn't know about the physical addresses, no?


The CPU does know the physical address, because the MMU is part of the CPU; so you can have a physically indexed cache if you want. The problem is just that you can't do the cache lookup in parallel with the MMU virtual-to-physical lookup, so it's slower than if you used a virtually-indexed cache.


Sure, I didn't articulate my comment very well. I meant to say that the L1-3 caches operate on virtual addresses because they don't know the physical address, but the MMU does, since all memory accesses must transit through the MMU.


The reason for the "_if_" is that I know too little about current hardware to know whether that's the norm, the exception, or extinct, so I tried to be careful.

Also, I'm using CPU for (CPU+MMU) because that's the case on all modern CPUs that have an MMU, as far as I know.


It's understood that different threads, especially in different processes, will quickly rewrite the caches to suit their own needs. A thread that is returned to, especially after any additional delay, will find its own data essentially gone.

It's always going to create better cache utility if a thread can run to completion before a switch.


The biggest cache thrashing on a thread switch is possibly the lines containing the thread stack, which are very unlikely to be shared between threads. Also, threads are likely to be working on distinct data.

BTW, the TLB doesn't need to be flushed on a thread switch, only on a process switch, and even then more modern systems can tag TLB entries by process id, reducing the need for flushing.


I should have specified I was talking about a process context switch not a thread.

>"The biggest cache trashing on a thread switch are possibly the lines containing the thread stack which are very unlikely to be shared between threads. Also thread are likely to be working on distinct data."

Sure, I don't think they are ever shared, are they?

>"BTW, TLB doesn't need to be flushed on a thread switch, only on a process switch, and even then more modern systems can tag TLB entries by process id,"

Indeed, and this often gets left out of the discussion; it's not mentioned in the blog post, certainly. I believe on x86-64 this is called ASID and on ARM chips it's TPID.


The ARM term is ASID (address space identifier). There's also a VMID so different VMs under a hypervisor can share the TLB.


Interesting, I didn't think about VMs sharing the TLB that makes sense.


Context switch in this article = within a computer and processor, not about psychology


Humans are several orders of magnitude slower.


Well, this is a computer technology news aggregator, isn't it? Not a psychology one.


To be fair, we do frequently talk about 'context switching' in a metaphorical sense of flitting between programming and conversations/meetings, or among different projects.

If anything, I think I agree with GP that it's more unusual to see the term used in the 'real' (not metaphorical) sense on the front page.


The author also seems to be loose with specifying "context switches", where there is a difference between a context switch between two threads, which share everything except a stack, and a context switch between two processes that have separate address spaces. Did anyone else find that odd?


Context: Linux.


Could you expand your response? I presume you are saying that since in Linux processes and threads are "the same" there is no need to distinguish whether the process being switched in shares the same address space. Does this mean that there is a missed optimization on Linux to preserve some virtually-addressed caches and tables when switching to thread/process that shares the same address space? Or that the terminology is correct, but that Linux already makes the possible optimization?


> I presume you are saying that since in Linux processes and threads are "the same" there is no need to distinguish whether the process being switched in shares the same address space.

That's right. Threads share everything but their stack.

> Does this mean that there is a missed optimization on Linux to preserve some virtually-addressed caches and tables when switching to thread/process that shares the same address space?

I'm not entirely sure I understand your question. What scenario do you have in mind?


It's just a smart-ass comment with no real value.

Yes, the Linux scheduler doesn't differentiate between threads and processes (see NPTL vs LWP). However, I was pointing out that although the kernel just schedules "tasks" instead of threads and processes, it is far more efficient to schedule threads than processes on some processors - for example, pre-ARMv6 didn't use address space tags on its TLB, so a context switch to a different process required the whole TLB to be flushed, which means the kernel will need to walk the page table to get the translations for the physical pages for the newly scheduled process.


> Applications that create too many threads that are constantly fighting for CPU time (such as Apache's HTTPd or many Java applications) can waste considerable amounts of CPU cycles just to switch back and forth between different threads.

Completely wrong. Context switches happen at a set interval in the kernel. Creating more threads won't make context switches happen more frequently; each thread will just get a proportionally smaller share of CPU time.

Conceptually, the CPU scheduler can be thought of as a LIFO queue. An interrupt fires at a fixed interval of time and switches to the first ready thread (or process) on the queue.

What's more important is the fact that this queue is exactly equivalent to how 'asynchronous' operations are implemented in the kernel.

The only difference between creating threads and using async primitives is the thread prioritization algorithm. With threads you're using whatever scheduling heuristics are baked into the kernel, while with async you can roll your own. (And pay the penalty of doing this in usermode.)


> Creating more threads won't make context switches happen more frequently

This is generally wrong in practice, but for a slightly subtle reason.

1. You can force a context (or thread) switch whenever you like by performing a blocking operation: calling read() for example, or by waiting on a mutex (as this person was doing). There's no more work to be done until the blocking operation completes, so the kernel finds another process (or thread) to run on the now-spare hardware thread.

2. Applications which create a lot of threads or processes tend to use them as IO workers.

3. The IO worker pattern is to block on read(), do some processing, maybe write(), and then go back to blocking (see the sketch after this list). This means that most of the time the worker thread won't be using its complete timeslice.

4. If lots of IO worker threads are continually doing this, you will get a lot of thread switching (or context switching for processes).

5. Thread switching is fast in Linux, but it's slower than a simple syscall, which is what you'd be doing with an async solution.
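
A caricature of that worker loop (the sketch referenced in point 3), just to make the switch points visible; every read() that finds no data ready parks the thread and hands the CPU to another one, well before the timeslice is up:

    /* Caricature of the IO-worker pattern from point 3: each blocking read()
     * with no data ready parks this thread and forces a switch to some other
     * runnable thread, long before the timeslice expires.
     * (Sketch only; a real server adds error handling, framing, etc.) */
    #include <unistd.h>
    #include <sys/types.h>
    
    void io_worker(int client_fd) {
        char buf[4096];
        ssize_t n;
    
        for (;;) {
            n = read(client_fd, buf, sizeof buf);  /* likely blocks -> switch */
            if (n <= 0)
                break;                             /* EOF or error: done */
    
            /* ... some (usually short) processing of buf[0..n) ... */
    
            if (write(client_fd, buf, n) < 0)      /* may block again if the */
                break;                             /* socket buffer is full  */
        }
        close(client_fd);
    }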


Re: your point 1 -- waiting on a mutex happens regardless of whether you're using blocking or non-blocking IO.

If your program only ever makes IO system calls and nothing else, then your point might make sense, but this isn't a realistic real-world assumption.

Re: your point 3 -- not using the complete timeslice is not a realistic assumption. If you're creating threads then you're presumably doing some little bit of CPU-intensive work and not just copying bytes from one file descriptor to another.

You're making terribly silly assumptions about what kind of work servers actually do, and ignoring the real benefit of async solutions.

The real benefit is being able to control how threads switch by yourself instead of using the kernel's builtin 'black box' scheduling algorithms. The problem with the 'black box' is that the kernel might decide to penalize your threads for inscrutable reasons of 'fairness' and then you suddenly get inexplicable latency spikes.

Of course rolling your own scheduling is an engineering boondoggle and most people just opt for a very primitive round-robin solution. (Which, incidentally, is what you want anyways if you want good latency.)

In which case you might as well create a bunch of threads and schedule them as 'real-time' (SCHED_RR in Linux) and get the same result.

(Seriously, try it -- benchmark an async server vs a SCHED_RR sync server and see for yourself.)
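
For anyone who wants to try that comparison, the knob being referred to is the POSIX scheduling policy; something like the sketch below puts the calling thread under round-robin real-time scheduling (the helper name is just illustrative). This needs CAP_SYS_NICE or root, and an RT thread that never blocks can starve the machine, so treat it as an experiment:

    /* Sketch: put the calling thread under SCHED_RR, as suggested above.
     * Needs CAP_SYS_NICE (or root); an RT thread that never blocks can
     * starve everything else, so this is an experiment, not a default. */
    #include <stdio.h>
    #include <string.h>
    #include <pthread.h>
    #include <sched.h>
    
    int make_me_round_robin(void) {
        struct sched_param sp;
        memset(&sp, 0, sizeof sp);
        sp.sched_priority = 1;                    /* lowest real-time priority */
    
        int err = pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
        if (err != 0) {
            fprintf(stderr, "SCHED_RR failed: %s\n", strerror(err));
            return -1;
        }
        return 0;
    }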


You are incorrect.

If you have many threads each doing small amounts of work before blocking, the kernel will happily switch in another runnable thread when the current thread blocks. This is common in workloads that service network IO using a 1:1 thread:client model.

It can absolutely be more efficient to retire work for multiple clients from a single thread. Your last paragraph ignores this. It also forgets that in addition to prioritization there is multi-core load balancing to deal with. That can leave runnable threads stranded for whatever the LB interval is (which is independent of the overall scheduler quantum), while a concurrent service queue would allow steady service of all clients from all workers (that said, I actually prefer the siloed model in most cases).


No, you're wrong. Or, you're correct insofar as you have a bunch of threads doing only CPU-bound work (and even in that case having fewer threads can be useful due to cache effects).

But if you have threads that block, then there will be more context switches. Consider something like apache httpd: Worker thread/process A writes to a socket, then the socket buffer fills up before the data has been sent over the network, the process blocks and goes to sleep (before it has used its time quantum and is preempted). CPU scheduler switches to apache worker thread B. B writes to its socket, socket buffer fills, B goes to sleep causing yet another context switch to apache worker C, etc. etc. See? Lots of context switches. Compare this to something like nginx: Worker process A writes to socket A1, socket A1 buffer fills up (errno EAGAIN, but no context switch), A switches to writing to socket A2, etc.

Or for that matter consider a multithreaded process that manipulates some shared state. Thread A takes a lock X, and then the scheduler interrupt fires and the CPU scheduler decides to put A to sleep and switch to B. Well, B works for a little while, then tries to acquire lock X which is locked, and thus goes to sleep. CPU scheduler wakes up thread C, which works for a little while, then tries to acquire lock X. Oops, X is locked, so C goes to sleep and the scheduler wakes up thread D. Etc. etc. Again, lots of context switches.
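
The nginx-style half of that comparison looks roughly like this sketch: the socket is non-blocking, so a full socket buffer shows up as EAGAIN and the worker moves on to its other connections instead of being put to sleep (illustrative only; a real event loop would register for writability with epoll/kqueue and retry later):

    /* Sketch of the non-blocking pattern described above: a full socket
     * buffer returns EAGAIN instead of putting the process to sleep, so one
     * worker keeps servicing its other connections without a context switch.
     * (Illustrative; a real server would wait for EPOLLOUT and retry.) */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    
    /* Returns bytes written, 0 if the socket is full right now (come back
     * later), or -1 on a real error. The fd must already be non-blocking,
     * e.g. fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK). */
    ssize_t try_send(int fd, const char *buf, size_t len) {
        ssize_t n = write(fd, buf, len);
        if (n >= 0)
            return n;                      /* some or all of it went out */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;                      /* buffer full: no sleep, move on */
        return -1;                         /* genuine error */
    }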


This isn't really correct. In many (most?) multithreaded applications a thread will regularly yield, forcing a context switch, long before its kernel time slice expires. In modern systems it is not difficult to have a case where the time spent doing work is less than the time spent yielding to another thread i.e. almost all of the CPU time is spent context switching.

This is in significant part a side effect of CPUs becoming much faster. You need to increase the amount of code that is run between potential thread yield points to offset the overhead of a yield or it can generate context switch storms (i.e. almost all time is spent context switching) and throughput will suffer.

Modern server architectures take this to its logical conclusion and try to eliminate virtually all context switching, hence the rising prevalence of "process-per-core" software designs.


Does the CPU interrupt threads running happily on cores even when there are no other threads which want to run or which have affinity that would allow them to run on that core?

But creating more threads will cause each to run slower as their caches are ruined by each context switch, won't it?


> Does the CPU interrupt threads running happily on cores even when there are no other threads which want to run or which have affinity that would allow them to run on that core?

Yes. Even if you are careful to ever run only one process (so: no monitoring, no logging, no VM's, no 'middleware', etc.) and limit the number of threads to strictly equal the number of processors, you still have background kernel threads that force your process to context switch.


Though you can instruct the kernel not to run anything (not even interrupt handlers) on specific cores, except for manually pinned processes.


At a minimum you still need timer interrupts and TLB-shootdown IPIs hitting every core, just to have a working SMP system.


The CPU does need to handle IPIs of course, but I'm not sure about timers.


Yes, it does interrupt them, but if the kernel decides to keep the same thread there's no context switch. (In other words, there's a penalty, but it is much smaller.)

Your conclusion is correct. By creating more CPU hungry threads than you have CPUs you will reduce the total throughput of the computer.



