
A tale of an impossible bug: big.LITTLE and caching - rodrigokumpera
http://www.mono-project.com/news/2016/09/12/arm64-icache/
======
pm215
Properly configured big.LITTLE clusters should be set up so that all CPUs
report the same cache line size (which might be smaller than the true cache
line size for some of the CPUs), to avoid exactly this kind of problem. The
libgcc code assumes the hardware is correctly put together.

There is a Linux kernel patchset currently going through review which provides
a workaround for this kind of erratum by trapping the CTR_EL0 accesses to the
kernel so they can be emulated with the safe correct value: [http://www.mail-
archive.com/linux-kernel@vger.kernel.org/msg...](http://www.mail-
archive.com/linux-kernel@vger.kernel.org/msg1227904.html) and it seems to me
that that's really the right way to deal with this.

~~~
pm215
Also, if I'm reading the proposed fix in the mono pull request correctly, it
doesn't deal with the problem entirely because there's a race condition where
the code might start execution on the core with the larger cache line size,
and then get context-switched to the core with the smaller cache line size
midway through executing its cache-maintenance loop. The chances of things
going wrong are much smaller, but they're still there...

(Edit: rereading the blog post, they say they need to figure out the global
minimum, but I can't see how their code actually does that, since there's
nothing that guarantees that the icache flush code gets run on every cpu
before it's needed in anger.)

~~~
hrydgard
It's fine if core migration happens during the invalidation loop - the core
migration itself surely must wipe the non-shared cache levels thoroughly,
otherwise nothing would work.

EDIT: Actually if the big and little cores are used together, and not
exclusively, then this might still be an issue, yeah.

~~~
pm215
No, in general Linux migrating processes between cores won't nuke the caches.
The hardware's cache coherency protocol between the CPUs in the cluster
ensures that they all stay sufficiently in sync that it's not needed.

~~~
hrydgard
My understanding was that the configurations currently in use usually power up
only the big or only the little cores at any one time, and that that kind of
migration has to wipe the caches, right? But that might be inaccurate, and you
are of course right in the general case.

~~~
pm215
The state of the art in Linux scheduler handling of big.LITTLE hardware has
moved through several different models, getting steadily better at extracting
the best performance from the hardware (Wikipedia has a good brief rundown:
[https://en.wikipedia.org/wiki/ARM_big.LITTLE](https://en.wikipedia.org/wiki/ARM_big.LITTLE)).
You're thinking about the in-kernel-scheduler approach, but global task
scheduling (where you just tell the scheduler about all the cores and let it
move processes around to suit) has been the recommended approach for a few
years now I think.

------
userbinator
Seeing bugs like this reminds me of how much nicer things are on x86 where
JITs do not need to flush caches. You can actually modify the instruction
immediately ahead of the currently executing one, and the CPU will naturally
"do the right thing"[1] --- it does slow down execution, as the CPU is
essentially detecting the write and flushing its cache/pipeline automatically,
but used sparingly it can be a great optimisation. The write can even come from another
core ("cross-modifying code") and everything will still work. Someone I know
used this to great effect in squeezing out the last bits of performance from
an application by eliminating checks on a few flag variables and the
associated branching in a tight loop --- it simply "poked" instruction bytes
from another core into the loop when it was time for that core to do something
else.

[1] With the exception of pre-Pentium CPUs, where modifying at various forward
offsets from the locus of execution could give insight into how big the
prefetch queue is. With the Pentium it was fully detected, and later
multithreaded/multicore CPUs react to cross-modifying code as described
above, which leads me to believe that Intel is very much supportive of these
things, as otherwise they could've just told programmers to do as ARM does.

Maybe what ARM needs, short of doing it the Intel way, is a "flush region"
instruction which takes both the address _and_ size, so it can automatically
flush the appropriate cache lines based on the current hardware's cacheline
size.

~~~
pm215
x86 is really the oddity here -- it has its no-explicit-cache-maintenance
design because of wanting to maintain backwards-compatibility with self-
modifying code that was written for x86 cores that had no caches at all.
Almost all other architectures have explicit cache maintenance because it's
more efficient (and requires less hardware), at the minor cost of requiring
the very few bits of software which do odd things like JITting to explicitly
tell the CPU what they're doing.

An instruction for flushing an entire region would potentially have a very
long execution time, which is awkward because you would want to be able to
interrupt and resume it. So it would need "how far have I got" state stored
somewhere. The obvious observation from a RISC-architecture point of view is
that you can get the equivalent effect without the pain of making a long-
running interruptible instruction, by having an "invalidate one cache line"
instruction plus an explicit loop in the code, and that's what most
architectures do.

~~~
userbinator
Actually, Intel explicitly broke backwards-compatibility starting with the
Pentium, by adding the hardware to make SMC work without additional effort.
The 486 and below needed an explicit branch to flush the prefetch queue, and
this effect has been exploited for various anti-debugging tricks and even this
amazing 8088-only optimisation:

[https://news.ycombinator.com/item?id=9340231](https://news.ycombinator.com/item?id=9340231)

 _An instruction for flushing an entire region would potentially have a very
long execution time, which is awkward because you would want to be able to
interrupt and resume it. So it would need "how far have I got" state stored
somewhere._

x86 has the REP prefix for this purpose: used with certain instructions, it
checks a count register and, while it's nonzero, executes the instruction and
decrements the count. The earlier implementations simply didn't update the
instruction pointer in this case, so the CPU would repeatedly fetch and
execute the same instruction, and it's interruptible between each step; the
register counts down how many iterations remain. Once the count reaches zero,
the instruction pointer moves to the next instruction. Modern x86 handles
this by generating uops in the decoder instead, but the basic functionality
is the same.

------
anarazel
Different cache line sizes for different cores seems like an absurdly bad
idea: partly because it opens you up to bugs like these, but also because it
makes optimization a lot harder. I have a hard time believing the savings
from a larger line size are worth it.

~~~
gcp
ARM's own designs (A53, A57, A72, A73) all have 64-byte cache line sizes and
avoid the problem entirely.

The one at fault appears to be Samsung, who designed the M1 Mongoose with
128-byte lines and packed it together with A53 cores in their SoC.

~~~
cesarb
Perhaps a better (and simpler) workaround, then, could be to clamp the
reported cache line size to 64 bytes. So even if the Samsung core reports
128-byte cache lines, the code would simply invalidate each line twice, and if
it is migrated in the middle of the invalidation loop, it would still work
correctly.

~~~
gcp
This is exactly the fix used in the kernel, yes. The trickiness and complexity
come from having to intercept the cache line size probe.

------
ChuckMcM
wow, just wow. That is a really awesome bug (and like the authors I have
issues with trying to sleep when that sort of puzzle is sitting there :-)

Still a bit hazy on _why_ they manually flush the cache for a given block of
memory (presumably for protecting disclosure?), but I'm also a bit curious how
it works if you get the sequence: big fetches a cache line, then you switch to
little, which fetches a line (half as long, changing half the bytes in the
cache), and then you switch back to big, which thinks it still has a full
valid cache line? Presumably there is some mechanism that invalidates cache
lines?

~~~
pm215
Handling of the case where big and little both want the same thing in their
cache should be dealt with by the usual cache-coherency traffic between the
CPUs that ensures they don't disagree about what's in their L1 caches (very
handwaved because I don't know the details).

The reason for the manual cache operations is because they're generating
JITted code -- on ARM to ensure that what you execute is the same thing you
just wrote you have to (1) clean the data cache, so your changes get out to
main memory[.] and then (2) invalidate the icache, so that execution will
fetch the fresh data from memory rather than using stale info. This clean-and-
invalidate operation is usually informally called a flush, though it isn't
really one in ARM terminology.

[.] not actually main memory, usually: only has to go out to the "point of
unification" where the iside and dside come together, which is probably the L2
cache.

~~~
brandmeyer
I don't see why they have to do this in userspace at all. If they did:

* allocate read/write buffer

* JIT instructions into it

* change mapping to read/execute

* run the JITted code

Then the kernel manages flushing the data caches on the mapping change, and
Mono gets to wrap a Somebody Else's Problem field around it. It sounds like
they are instead:

* allocate read/write/execute buffer

* JIT instructions into it

* manually flush relevant data caches (with an assumption that the cache line size is constant)

* run the JITted code

~~~
rodrigokumpera
That approach is harder to use in practice than it sounds. It's not like
people haven't tried it.

The OS only lets you allocate in large granules, like 4k or 16k, and the vast
majority of methods are significantly smaller than that, meaning a JIT
must colocate multiple methods in the same allocation block or waste a
significant amount of memory.

We could get around that by remapping memory between read/write and
read/execute and have the OS solve the problem for us. Except for a couple of
small details: modifying a memory mapping is very expensive (and we're, well,
in the performance business), and Mono is multi-threaded, so one thread
might be executing code from the exact page we just made non-executable.

This approach, IIRC, was tried by Firefox as it has some security advantages,
but discarded due to the measurable performance impact - and they don't even
have the second problem, as JS is single-threaded.

Full Disclosure: I'm part of the Mono team.

~~~
caf
How is this safe in the multithreaded case anyway? If a process has just
written a new JITted method and is flushing the i$ on the CPU it's executing
on, but then gets scheduled away part-way through the flush, if you were very
unlucky then couldn't another thread then get scheduled on that CPU and try to
execute the just-written method, which failed to be fully flushed from that
CPU's cache?

~~~
rodrigokumpera
Multi-threaded safety comes simply from the JIT controlling the visibility of
the newly compiled code: first flush, then make it visible for execution.
You can't go wrong with that, and scheduling won't matter.

Things get a lot more complicated when it comes to code patching, but the
principle is similar.

~~~
caf
I don't think that helps - the point is that the flush might not be effective
if the flushing thread gets scheduled away from the core which has the stale
I$ before it manages to fully issue the flush.

Or is the flush guaranteed to flush _all_ cores caches? That would be a fairly
unusual design.

~~~
oshepherd
IC IVAU instructions are broadcast to all cores in the same 'inner shareable
domain' (all cores running the same OS instance are in the same inner
shareable domain)

~~~
caf
That does seem like a good solution to let you do this kind of invalidation in
userspace. Thanks.

------
hexa00
I had this problem too with GDB on the Odroid UX4 big.LITTLE SoC.

GDB patches the instruction with ptrace to insert a breakpoint, for example.

See my blog post about it: [https://www.kayaksoft.com/blog/2016/05/11/random-
sigill-on-a...](https://www.kayaksoft.com/blog/2016/05/11/random-sigill-on-
arm-board-odroid-ux4-with-gdbgdbserver/)

Or the post on the GDB mailling list:
[https://www.sourceware.org/ml/gdb/2015-11/msg00030.html](https://www.sourceware.org/ml/gdb/2015-11/msg00030.html)

Too bad, however, that the kernel patchset mentioned in a previous post only
covers arm64, so it's still a problem for arm32.

------
wolfgke
> Worse, not even the ARM ISA is ready for this. An astute reader might
> realize that computing the cache line on every invocation is not enough for
> user space code: It can happen that a process gets scheduled on a different
> CPU while executing the __clear_cache function with a certain cache line
> size, where it might not be valid anymore.

I rather see the problem in the fact that there seems to be no way to tell
the Linux scheduler: only schedule this process/thread between cores that
have the same cache line size. Or add an attribute at thread creation giving
the cache line size of the cores it is allowed to run on, or an attribute
saying "allow an arbitrary cache line size, but don't migrate this thread
between cores with different sizes". That way it would suffice to check the
cache line size once at program or thread start.

~~~
tveita
That's too specific a feature to expose to developers - most people wouldn't
even know about the feature, and the ones that did might needlessly enable it
just to be conservative.

The intention of the big.LITTLE architecture is to let processes be migrated
seamlessly between the small and big core and let the unused core be turned
off to save power. The kernel and the hardware should work together to make
the core switching transparent and expose a safe way to invalidate the cache
independent of the current processor.

~~~
wolfgke
> The intention of the big.LITTLE architecture is to let processes be migrated
> seamlessly between the small and big core and let the unused core be turned
> off to save power.

On the other hand, flushing specific cache lines is a rather special feature.
If code uses such obscure low-level features (much more than a "typical"
application does), relying on the invariant that the cache line size stays
constant over the execution (which most applications really don't care
about), one can at least expect the developer to pass this information to the
scheduler, so that the scheduler can ensure the invariant is indeed
satisfied.

~~~
comex
Anything that uses a JIT necessarily uses this "obscure low-level feature",
including any program implemented in one of many programming languages.
Forcing such programs onto the heavy processor makes no sense; there just
needs to be a way to properly clear the cache.

~~~
wolfgke
On
[https://news.ycombinator.com/item?id=12483698](https://news.ycombinator.com/item?id=12483698)
I described a better way this might be implemented without causing this
problem. Nevertheless, even with my first, worse interface this should not be
a problem: spawn a thread that does the low-level stuff, sync with it, and
after that run the JITted code.

------
sjmulder
Huge props to the team for finding this out. What a nasty issue.

One question about the intro, which states that this is the first mass
produced AMP architecture, but isn't the PlayStation 3's Cell CPU one?

~~~
masklinn
Cell behaves more like a CPU + GPGPU system. big.LITTLE can schedule an
instruction stream on any core of the cluster (depending on the
configuration); on Cell you'd primarily use the general-purpose PPE, from
which you'd start (and chain) vector-based threads on the SPEs.

The PPE and SPEs don't even share an ISA; the SPEs have a custom-built
SIMD-oriented ISA.

------
Animats
_" first mass produced AMP architecture"_

Nope. Remember the Cell? The processor in the PlayStation 3? One main CPU with
8 little CPUs and no shared memory, just channels.

The PlayStation 4 isn't an AMP machine because programming the Cell was so
hard.

~~~
dfox
Cell is essentially a distributed-memory cluster on a single chip, because
each SPU has its own address space and cannot directly access main memory.
I'm not sure what the exact definition of AMP is, but Cell does not exactly
match my feeling of what AMP should be. In this regard the Wii seems more
like an AMP system, with two completely different CPUs (PPC and ARM) sharing
what essentially amounts to the same address space (and in the WiiU there are
3 PPC cores, where one of them is slightly different from the other two and
cache coherency between them can only be described as broken).

There's no question that programming for Cell was hard, but I think that's
mostly about middleware support (probably because the platform is so
different from PC and Xbox 360).

~~~
Animats
The real problem with the Cell was that each Cell SPE processor only has 256K
of local memory. It has bulk DMA access to main memory, but that's more like
I/O. 256K is too small for a video frame, a game level, or much else in a
modern game. So everything has to be done on an assembly line basis, where
data is pumped into a Cell processor, processed, and pumped out. Great for
audio, terrible for everything else. In comparison, the main processor had
access to 256MB of RAM.

If they'd had, say, 16MB per processor, it might have worked out. One CPU for
collision detection and physics, one for NPC management and AI, etc. But
giving each SPE processor only 0.1% of the total memory space was too
constraining.

------
dsp1234
It appears that the caching code was added in this patch:

[https://gcc.gnu.org/ml/gcc-
patches/2012-09/msg00076.html](https://gcc.gnu.org/ml/gcc-
patches/2012-09/msg00076.html)

Prior to that, the call:

    
    
      asm volatile ("mrs\t%0, ctr_el0":"=r" (cache_info));
    

was always made.

~~~
willvarfar
But the task can be rescheduled on a little core part-way through the
execution...

The big cores should report the smaller cache line size always.

~~~
dsp1234
I was just pointing out where the caching of the size was added. I'm not
making a comment on where the fix should be.

------
funny_falcon
[https://gcc.gnu.org/ml/gcc-
patches/2012-09/msg00076.html](https://gcc.gnu.org/ml/gcc-
patches/2012-09/msg00076.html)

~~~
pawadu
Well, that is the patch that causes this problem...

At the time this patch was submitted, ARM was running a program that rewarded
engineers for improving aarch64 performance.

------
SixSigma
There's no such thing as a simple cache bug. - Rob Pike

Caches are bugs waiting to happen. Rob Pike ‏@rob_pike 21 Mar 2014

~~~
akavel
_" There are only two hard things in Computer Science: cache invalidation and
naming things"_ ― Phil Karlton; not sure to what extent this quote is
compatible with the second one by Rob. Or does it mean by implication that
simply "Computer Science is bugs waiting to happen"?...

~~~
mikeash
It's a great quote, but it's wrong. There are actually _two_ hard things in
CS: cache invalidation, naming, and off-by-one errors.

~~~
deathanatos
The parent actually appears to have the quote both correct in content and
attribution. Supposedly, someone else added the "off by one"[1][2][3]. That
seems to be the extent of the Internet's knowledge on the quote, though the
Skeptics link notes that there's nothing direct to the supposed originator.

This is one of those quotes where I feel there's more than one right answer. I
like the addition of "off-by-one", and to make it a nice round three things, I
usually use this version:

    
    
      There are three hard things in computer science:
    
      1. Naming things
      2. Cache Invalidation
      3. Concurrency
      4. Off-by-one errors
    

[1]:
[https://twitter.com/timbray/status/506146595650699264](https://twitter.com/timbray/status/506146595650699264)

[2]: [https://skeptics.stackexchange.com/questions/19836/has-
phil-...](https://skeptics.stackexchange.com/questions/19836/has-phil-karlton-
ever-said-there-are-only-two-hard-things-in-computer-science)

[3]:
[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)

~~~
mikeash
The other one is the original, I just like the "off by one" version better.

I like the addition of "concurrency" too, but I'm not quite sure how to make
it flow....

------
AceJohnny2
As an embedded developer also working on big.LITTLE arm64...

I would do nasty, painful things to the designer of such a system.

------
randyrand
When would a programmer explicitly need to invalidate the CPU cache? Does this
not happen automatically on a context switch?

~~~
JoeAltmaier
One example is when doing a DMA operation to memory, which often bypasses the
cache. If the new data is to 'stick', the cache needs to be convinced it no
longer knows what's in that buffer. This is an ARM thing; Intel
architectures integrate DMA with the cache.

------
xxie24
From the pseudocode, what is the disadvantage of always calling
get_current_cpu_cache_line_size()?

~~~
_ihaque
That would create a race condition addressed at the bottom of the article: the
process can get switched onto another CPU between the invocation of
get_current_cpu_cache_line_size() and the invalidation.

    
    
      An astute reader might realize that computing the cache line on every
      invocation is not enough for user space code: It can happen that a process
      gets scheduled on a different CPU while executing the __clear_cache
      function with a certain cache line size, where it might not be valid
      anymore.

~~~
K0nserv
The follow-up doesn't make sense to me:

    
    
       Therefore, we have to try to figure out a global minimum of the cache line sizes across all CPUs.
    

Wouldn't this mean they'd always just end up clearing half the cache line on
the larger core anyway?

~~~
tveita
No, you just sometimes issue twice as many flush requests as necessary. You
can't flush or invalidate half a cache line, since the data is stored in units
of cache lines.

In theory I think you could just invalidate the addresses byte for byte,
ignoring the cache line size, but I assume the performance hit would be
noticeable.

------
Mizza
This is an excellent bug journey, and I'm even more impressed that the
resulting discovery has already been used to improve Dolphin. A testament to
the quality of both projects.

------
flamedoge
I wonder why no one tried to validate the Asymmetric MultiProcessing by first
checking all cases with only the little or only the big cores enabled, and
then bisecting further down when both are enabled.

~~~
PDoyle
Because that's the sort of thing you think of only when you already know the
answer.

------
GrumpyNl
I don't understand that level of coding; what I do understand is the great
way of debugging. It's all about deduction, Mr. Watson.

~~~
dsp1234
Watson was actually Dr. Watson. Which is/was also the name of a debugger in
Windows[0].

[0] -
[https://en.wikipedia.org/wiki/Dr._Watson_(debugger)](https://en.wikipedia.org/wiki/Dr._Watson_\(debugger\))

~~~
GrumpyNl
You are right, although I was referring to a Sherlock Holmes quote - but that
is also a Dr.

~~~
gens
Funny thing is that Sherlock Holmes usually used abductive reasoning, instead
of strictly deductive reasoning.

------
betolink
Can someone explain what cache flushing is used for on ARM, or in low-level
programming in general?

~~~
wolf550e
Overwriting executable machine code in memory is a rare case (only JITs need
it, only when emitting code). Some CPUs do wire the memory write path to the
instruction cache to make sure the instruction cache knows some code
changed, but ARM chose a simpler design where the instruction cache assumes
the code in memory never changes, and if this assumption is wrong, the
programmer has to use a special instruction to tell the instruction cache to
throw something out. The cache-clearing instruction throws out either 64
bytes aligned to a multiple of 64, or 128 bytes aligned to a multiple of 128,
depending on the core.

