
How long does it take to make a context switch? (2010) - majke
http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
======
luckydude
Pretty sure that he's just rediscovering what I did in lmbench back in the
early 1990s. All written up and open source (and it won best paper at the 1995
USENIX):

[http://mcvoy.com/lm/bitmover/lmbench/lmbench-usenix.pdf](http://mcvoy.com/lm/bitmover/lmbench/lmbench-usenix.pdf)

Edit: which, now that I think about it, was over 20 years ago. Reminds me of
something a friend (Ron Minnich) said long ago: people rediscover stuff in
computer science over and over, nobody wants to read old papers. Or something
like that. He said it better, I think it was something like "want to look
smart? Just take any 5 year old paper and redo whatever it did, our attention
span seems to be very short."

~~~
chillydawg
One reason people are reluctant to trust old sources (regardless of how well
received they were at the time) is that software, operating systems and
hardware all move at lightning pace. I've not read your paper, but I'm willing
to bet enough has changed to warrant a new writeup talking about modern
systems.

~~~
LoSboccacc
> operating systems and hardware all move at lightning pace

not really, the last big leap was unlocking the GPU for stream processing, and
that's it.

it's quite a bit faster today than it was ten years ago, but at its core it's
still a von Neumann machine, and not much has been discovered since. mostly
we're repackaging under new names things that were discovered in the '70s,
like the thick vs thin client debate, which has played out more or less the
same way over and over as the bottleneck switched from bandwidth to latency to
client performance, the last iteration of which was Ember and Angular
precompilation.

~~~
noir_lord
Some of the stuff goes back even further than that.

The thread on the Intel Optane stuff the other day had an interesting
conversation about how you would program for such systems, and someone pointed
out that we already did: mainframes, 50 years ago.

It's nice to see old ideas in a fresh context.

------
JoeAltmaier
On the order of a millisecond. Which seems to me way too slow in this day and
age. The context switch has been a hard wad that chokes algorithms in
unexpected ways that are hard to diagnose.

Could this be fixed with a different approach to CPU design? Most threads
block on a semaphore-thingy or a timer, or wait for IO completion. If the CPU
made those an opcode instead of a wad of kernel code, the switch could be
reduced to a clock cycle (a nanosecond?)

How could this work? Maybe a global semaphore register where you Wait on a
mask. Each bit could be mapped to one of those wait-conditions. Another opcode
would Set the bit.

To scale, the CPU would have to have not 2 or 4 hyperthreads but maybe 1000.
So every thread in the system could be 'hot-staged' in a Wait, ready to
execute, or executing.

There would still be cache and TLB issues, but they would be at the same level
as any library call. They wouldn't involve yards of kernel code messing with
stacks and register loads and so on.
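
Here's a rough software model of what I mean by those two opcodes. The names
wait_mask/set_bits are made up, and it's modelled with atomics and a spin; in
the actual proposal the hardware thread would simply stay parked instead of
spinning:

    /* Software model of the proposed global semaphore register.
     * wait_mask() and set_bits() stand in for the HYPOTHETICAL
     * single-cycle opcodes; nothing like them exists on real CPUs. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define IO_COMPLETE (1u << 1)   /* one wait-condition bit */

    static _Atomic uint64_t sem_register;  /* the global semaphore register */

    /* hypothetical "Wait" opcode: park until any bit in mask fires */
    static uint64_t wait_mask(uint64_t mask) {
        for (;;) {
            uint64_t v = atomic_load(&sem_register);
            if (v & mask) {
                atomic_fetch_and(&sem_register, ~(v & mask));
                return v & mask;          /* which condition(s) fired */
            }
            /* in hardware the thread would stay parked here, not spin */
        }
    }

    /* hypothetical "Set" opcode: flag a condition, waking any waiter */
    static void set_bits(uint64_t bits) {
        atomic_fetch_or(&sem_register, bits);
    }

    static void *worker(void *arg) {
        (void)arg;
        uint64_t fired = wait_mask(IO_COMPLETE);
        printf("woke on 0x%llx\n", (unsigned long long)fired);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        set_bits(IO_COMPLETE);  /* plays the role of the IO interrupt */
        pthread_join(t, NULL);
        return 0;
    }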

~~~
luckydude
Hey, same Joe from Sun? If so, howdy.

On my i7-5930K I'm seeing 7-10 usecs, quite a bit better than a millisecond.
Linus cares about this stuff, I'd be very surprised if he didn't measure it
regularly.
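
For anyone who wants to reproduce the number, here's a minimal sketch of the
classic pipe ping-pong measurement (roughly what lmbench's lat_ctx is built
on): two processes bounce a byte over a pair of pipes, and each round trip
forces two switches. Pin both to one core (e.g. taskset -c 0) so a switch
actually has to happen; the result also includes pipe overhead:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        int p1[2], p2[2];
        char buf = 'x';
        const long iters = 100000;

        if (pipe(p1) || pipe(p2)) { perror("pipe"); return 1; }

        if (fork() == 0) {              /* child: echo everything back */
            for (long i = 0; i < iters; i++) {
                if (read(p1[0], &buf, 1) != 1) exit(1);
                if (write(p2[1], &buf, 1) != 1) exit(1);
            }
            exit(0);
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            if (write(p1[1], &buf, 1) != 1) { perror("write"); return 1; }
            if (read(p2[0], &buf, 1) != 1) { perror("read"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        /* one round trip = two switches, plus pipe overhead */
        printf("~%.0f ns per context switch\n", ns / (2.0 * iters));
        return 0;
    }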

~~~
JoeAltmaier
Probably thinking of my brother, Rich?

usecs are a thousand times better than a ms of course. And a ns is a thousand
times better than that!

------
bogomipz
The author states:

>"In practice context switching is expensive because it screws up the CPU
caches (L1, L2, L3 if you have one, and the TLB – don't forget the TLB!)."

How does a context switch "screw up" the L1 through L3 caches? Yes, when there
is a context switch the TLB gets flushed and the kernel then needs to "walk"
the page tables, which is expensive, and yes, I can see the L1 needing to
flush some lines, but is that really screwing up the L1, L2 and L3 caches, the
latter two being fairly large these days?

~~~
Someone
A context switch changes the virtual-to-physical mapping. So, _if_ a cache
uses virtual addresses for determining whether data is in the cache, a context
switch has to flush the cache.

It looks like (1) x86 is virtually indexed and physically tagged (i.e. you
need the virtual address to find the cache line the data might be in, and the
physical address to check whether that line actually holds the data; see
[https://en.wikipedia.org/wiki/CPU_cache#Address_translation](https://en.wikipedia.org/wiki/CPU_cache#Address_translation)
for details).

(1) [http://www.realworldtech.com/sandy-bridge/7/](http://www.realworldtech.com/sandy-bridge/7/)
but that may have changed; they might use different strategies for the various
cache levels, or they may by now use some other tricks. Address translation is
very hairy in multi-core systems.
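
The reason a virtually indexed, physically tagged L1 needn't be flushed is
that its index bits can fall entirely inside the page offset, where virtual
and physical addresses agree. A back-of-the-envelope check, assuming the
common 32 KB / 8-way / 64-byte-line L1d and 4 KB pages:

    /* If the bytes covered by the index+offset bits fit within one
     * page, the "virtual" index is identical to the physical one,
     * so no aliasing and no flush is needed on a context switch. */
    #include <stdio.h>

    int main(void) {
        int size = 32 * 1024, ways = 8, line = 64, page = 4096;
        int sets = size / (ways * line);  /* 64 sets */
        int index_span = sets * line;     /* 4096 bytes: bits 0-11 */
        printf("index+offset span %d bytes, page %d bytes: %s\n",
               index_span, page,
               index_span <= page ? "index bits unchanged by translation"
                                  : "virtual aliasing possible");
        return 0;
    }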

~~~
bogomipz
>" So, _if_ a cache uses virtual addresses for determining whether data ..."

I was curious about your "if" caveat. I believe nearly all modern CPUs use the
virtual address, as the cache doesn't know about physical addresses, no?

~~~
pm215
The CPU does know the physical address, because the MMU is part of the CPU; so
you can have a physically indexed cache if you want. The problem is just that
you can't do the cache lookup in parallel with the MMU's virtual-to-physical
lookup, so it's slower than if you used a virtually-indexed cache.

~~~
bogomipz
Sure, I didn't articulate my comment very well. I meant to say that the L1-3
caches operate on virtual addresses because they don't see the physical
address; only the MMU does, since all memory accesses must transit through it.

------
mashmac2
Context switch in this article = within a computer and processor, not about
psychology

~~~
chrisseaton
Well, this is a computer technology news aggregator, isn't it? Not a
psychology one.

~~~
OJFord
To be fair, we do frequently talk about 'context switching' in a metaphorical
sense of flitting between programming and conversations/meetings, or among
different projects.

If anything, I think I agree with GP that it's more unusual to see the term
used in the 'real' (not metaphorical) sense on the front page.

------
bogomipz
The author also seems to be loose in specifying "context switches": there is a
difference between a context switch between two threads, which share
everything except a stack, and a context switch between two processes, which
have separate address spaces. Did anyone else find that odd?

~~~
mfukar
Context: Linux.

~~~
nkurz
Could you expand on your response? I presume you are saying that since in
Linux processes and threads are "the same", there is no need to distinguish
whether the process being switched in shares the same address space. Does this
mean that there is a missed optimization on Linux to preserve some virtually-
addressed caches and tables when switching to a thread/process that shares the
same address space? Or that the terminology is correct, but that Linux already
makes the possible optimization?

~~~
mfukar
> I presume you are saying that since in Linux processes and threads are "the
> same" there is no need to distinguish whether the process being switched in
> shares the same address space.

That's right. Threads share everything but their stack.

> Does this mean that there is a missed optimization on Linux to preserve some
> virtually-addressed caches and tables when switching to thread/process that
> shares the same address space?

I'm not entirely sure I understand your question. What scenario do you have in
mind?

------
otabdeveloper1
> Applications that create too many threads that are constantly fighting for
> CPU time (such as Apache's HTTPd or many Java applications) can waste
> considerable amounts of CPU cycles just to switch back and forth between
> different threads.

Completely wrong. Context switches happen at a set interval in the kernel.
Creating more threads won't make context switches happen more frequently; each
thread will just get a proportionally smaller share of CPU time.

Conceptually, the CPU scheduler can be thought of as a LIFO queue. An
interrupt fires at a fixed interval of time and switches to the first ready
thread (or process) on the queue.

What's more important is the fact that this queue is exactly equivalent to how
'asynchronous' operations are implemented in the kernel.

The only difference between creating threads and using async primitives is the
thread prioritization algorithm. With threads you're using whatever scheduling
heuristics are baked into the kernel, while with async you can roll your own.
(And pay the penalty of doing this in usermode.)
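
This is easy enough to check on Linux: getrusage() separates voluntary context
switches (the thread blocked on something) from involuntary ones (the
scheduler preempted it at the end of its timeslice). A minimal sketch; a pure
busy-loop never blocks, so it should accumulate only involuntary switches, at
the scheduler's own rate:

    #include <stdio.h>
    #include <sys/resource.h>
    #include <time.h>

    int main(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        do {   /* busy-spin for ~2 seconds without ever blocking */
            clock_gettime(CLOCK_MONOTONIC, &t1);
        } while (t1.tv_sec - t0.tv_sec < 2);

        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        printf("voluntary: %ld, involuntary: %ld\n",
               ru.ru_nvcsw, ru.ru_nivcsw);
        return 0;
    }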

~~~
chrisseaton
Does the CPU interrupt threads running happily on cores even when there are no
other threads which want to run or which have affinity that would allow them
to run on that core?

But creating more threads will cause each to run slower as their caches are
ruined by each context switch, won't it?

~~~
otabdeveloper
> Does the CPU interrupt threads running happily on cores even when there are
> no other threads which want to run or which have affinity that would allow
> them to run on that core?

Yes. Even if you are careful to only ever run one process (so: no monitoring,
no logging, no VMs, no 'middleware', etc.) and limit the number of threads to
exactly the number of processors, you still have background kernel threads
that force your process to context switch.

~~~
gpderetta
Though you can instruct the kernel not to run anything (not even interrupt
handlers) on specific cores, except for manually pinned processes.
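
The isolation half is done at boot (e.g. the isolcpus= and nohz_full= kernel
parameters); the pinning half is just a syscall. A minimal sketch of the
pinning, assuming core 3 has been isolated that way:

    /* Pin the calling process to core 3. The isolation itself comes
     * from boot parameters such as isolcpus=3 or nohz_full=3; this
     * only does the pinning half. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);                 /* core 3, assumed isolated */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... latency-sensitive work runs here without migration ... */
        return 0;
    }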

~~~
monocasa
At a minimum you still need timer interrupts and TLB shootdown IPIs hitting
every core, just to have a working SMP system.

~~~
gpderetta
The CPU does need to handle IPIs, of course, but I'm not sure about timers.

