
I.MX7 M4 Atomic Cache Bug - luu
https://rschaefertech.wordpress.com/2018/02/17/imx7-hardware-bug/
======
fest
I've never personally encountered a bug like this but I have hit my fair share
of weird/hard to track down bugs over my embedded software career.

Almost always, they leave me a) longing for the blissful ignorance of the low-
level details our whole computing infrastructure is built upon and b)
wondering how on Earth our technology works as well as it does, considering
there are layers upon layers of abstractions, each of which could have plenty
of issues that are either worked around or simply never hit in a particular
application.

~~~
jacquesm
> how on Earth our technology works as well as it does, considering there are
> layers upon layers of abstractions, each of which could have plenty of
> issues that are either worked around or simply never hit in a particular
> application.

Our technology works as well as it does _because_ of these layers upon layers
of abstraction. That's the only way you are going to be able to construct
something with a few billion components and a fighting chance at avoiding
unwanted interference between parts. The amazing thing is how often we get it
just right, not that there are super rare edge cases that were not taken into
account during the abstraction process that lead to bugs.

Every leaky abstraction is a bug in waiting: all it takes is for someone to
focus on the discrepancy with enough time, effort, and resources, and it might
lead to a crash or an exploit.

Also note that it is not as if we don't know that caching is a hard problem to
get right; it is one of the three things explicitly mentioned in the "there
are two hard things in computing" joke.

~~~
flamedoge
I feel like the more I learn, the more convinced I become that computing works
at all _only_ because we build stupidly impenetrable abstractions that keep us
from shooting ourselves in the foot. Yet I can't shake the feeling that we are
leaving so much optimization on the table.

~~~
jacquesm
That's true, but optimization is always an exercise in economics. If the money
is there, someone will do the optimization: in the Bitcoin mining arms race,
for instance, you could see the writing on the wall for CPUs long before the
jump to GPUs, FPGAs, and eventually ASICs.

In mobile phones I always expected battery life to cause a resurgence of
things like assembly programming, but it never happened; people are happy to
recharge their phones. I wonder what would happen if someone introduced a
smartphone OS based on old-school principles, jacking battery life up to five
days or so.

~~~
jl6
It could still happen. Mobile phones have been riding the CPU speed
improvement gravy train for a decade or so, but there are signs that this is
coming to an end like it did for desktop CPUs.

There will be increased demand for faster software when the hardware stops
getting faster.

Optimization is vertical integration. Guess which mobile phone manufacturer is
best placed to pull that off!

------
codys
Reading the NXP thread, it is not yet clear that NXP considers this an
erratum, only that it is behavior that is desirable to avoid.

Does anyone have a link to the changes to FreeRTOS & use of libclang mentioned
in the article?

NXP thread:
[https://community.nxp.com/thread/459977](https://community.nxp.com/thread/459977)

~~~
ChuckMcM
In that thread -- _After reproducing the issue and performing some tests, it
was found that the issue is because “LDREX” and “STREX” instructions
overlooked LMEM cache. That means those instructions always access external
memory directly, which leads to data inconsistency.

There’s no SW configuration to make the cacheable data consistent with those
atomic instructions, and design team will fix it in later CM4 integration._

It's a bug. But they see a workaround, so apparently they aren't in a hurry
to fix it.

~~~
Gibbon1
That reminds me of an article about a similar problem with the Xbox. Cache
consistency is extremely brittle; combine that with speculative execution,
and merely having instructions that break cache consistency present in memory
becomes dangerous.

~~~
Dylan16807
Here you go:

[https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-design-bug-in-the-xbox-360/](https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-design-bug-in-the-xbox-360/)

[https://news.ycombinator.com/item?id=16094925](https://news.ycombinator.com/item?id=16094925)

------
pslam
I’ve found my fair share of SoC bugs over the last couple of decades, and
cache coherency is by far the most common problem. It’s complex to implement,
and implementers are always messing with it to gain a cycle here and there.
They get it wrong, frequently.

I would go so far as to bet every mainstream SoC has at least one cache
coherency bug either already documented (errata) or undiscovered.

------
mbilker
This bug reminds me of the cache incoherency bug with the xdcbt instruction
of the Xbox 360's PowerPC CPU.

[https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-design-bug-in-the-xbox-360/](https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-design-bug-in-the-xbox-360/)

~~~
comex
There's an even more similar bug in the PowerPC CPU used by the Wii U: all
atomic operations have to perform a cache flush (dcbst) between the load-
linked (lwarx) and store-conditional (stwcx) instructions, or else they won't
work properly. (But I believe the issue is that the operations aren't being
propagated from per-CPU caches to main memory, so it's sort of the opposite of
the I.MX7 bug where operations are _skipping_ the cache.)

------
digikata
Nice writeup, covering both the discovery and the fix. It makes me wonder
where the best distribution point for this ends up. The most convenient, but
most hidden, is for it to land in some embedded dev kit somewhere. It might
instead go into FreeRTOS itself, but then it's not globally applicable to the
ARM platform, just to the i.MX7 (and likely not even all variants of the
i.MX7).

~~~
wallacoloo
I believe that the only generic fixes are to (a) ensure DRAM is never cached,
(b) ensure no atomic operations address DRAM (as the author proposed, by
using TCM), or (c) implement atomics by disabling interrupts, performing a
_normal_ RMW operation (i.e. no LL/SC, since even with IRQs disabled that
would cause cache incoherency), and re-enabling interrupts.

No library like FreeRTOS can guarantee (a). Not even the compiler can
guarantee (a), since the user can control the cache by memory-mapped
registers. (b) can't be guaranteed by a library, nor can it be guaranteed
statically by a compiler, since only the _linker_ knows where the atomic
variables will reside in memory (and, atomic operations could be performed on
an address that isn't a compile-time constant, e.g. dynamically allocated
memory).

(c) also can't be guaranteed by any library, but it _could_ be guaranteed by a
compiler that has access to the _full_ source of the binary. That's a hefty
limitation though, since it means you can't mix any other compiler/language
(e.g. assembly, which is almost always used for the startup sequence) into the
binary and still have these guarantees.

For method (c), I believe gcc allows one to override __atomic_load and the
other "builtins", the use case being that atomics can be implemented for new
or uncommon architectures without modifying the compiler itself. If this _is_
the case, then a potential fix could be shipped by a library (e.g. FreeRTOS)
which defines __atomic_load as something like

    
    
    void __atomic_load(type *ptr, type *ret, int memorder)
    {
    #ifdef IMX7_<partnumber>
        __disable_irq();
        *ret = *ptr;
        __enable_irq();
    #else
        /* insert code to perform a normal atomic load */
    #endif
    }
    

In fact, gcc allows one to "wrap" functions - it might be possible to do
something like

    
    
    void __wrap___atomic_load(type *ptr, type *ret, int memorder)
    {
    #ifdef IMX7_<partnumber>
        __disable_irq();
        *ret = *ptr;
        __enable_irq();
    #else
        __real___atomic_load(ptr, ret, memorder);
    #endif
    }
    

which would have the benefit that the library doesn't need to know how to
implement atomic ops on other platforms. This approach could also be used to
implement (b) by performing a _runtime_ assert that `ptr` lives in a cache-
coherent section of memory.

But again, this approach _only_ works if you're not relying on any binary
blobs that perform atomic ops. In the end, if you're doing anything nontrivial
(e.g. atomic ops on heap-allocated memory, or even stack-allocated memory),
it's impossible for a dev kit to completely hide this bug from the developer.

Alternatively, do these M4 processors have some type of updatable microcode
like X86 processors do? NXP might be able to push a fix that somehow patches
LL/SC primitives, or traps when they're encountered at runtime and allows the
user to decide how to handle them (e.g. putting them in a no-IRQ critical
section like above, but since it's done at runtime now you can mix multiple
languages / binary blobs, etc).

~~~
rschaefer2
The M4, to my knowledge, has no microcode like x86 processors.

For solution (b): while it can't be guaranteed by the compiler, it can be
guaranteed with external tooling that ensures every atomic variable carries a
gcc section attribute naming a linker section in the TCM, provided all
sources are available. This also prevents heap- and stack-allocated atomics,
as I believe the linker will error out when given a section attribute it
cannot respect.

Solution (c) is actually the approach developed for use with some M0
implementations that don't support LL/SC. It works with gcc because the
built-in function implementations carry the __weak attribute, meaning that
your implementation takes priority. An example override of fetch_add:

    
    
    uint32_t __atomic_fetch_add_4(uint32_t *addr, uint32_t value, int memmodel)
    {
        (void)memmodel;
        /* PRIMASK is nonzero when interrupts are already disabled. */
        uint32_t mask = __get_PRIMASK();
        __disable_irq();
        uint32_t temp = *addr;
        *addr = temp + value;
        /* Only re-enable interrupts if they were enabled on entry. */
        if (!mask) {
            __enable_irq();
        }
        return temp;
    }

------
unwind
Learning the LDREX/STREX instructions a couple of years back was a great "aha
moment". I was/am fairly new to the ARM platform, and never really dug into
x86 so I'm not very familiar with the corresponding instructions there.

But it's a really elegant model, and it was really fun to use them directly to
implement some primitives we needed.

Later, of course, I realized that since we build with GCC, we can use its
atomic/sync builtins instead, which compile down to LDREX/STREX but are more
high-level in the C code.

Great find, this must have been very frustrating.

~~~
cesarb
> and never really dug into x86 so I'm not very familiar with the
> corresponding instructions there.

The x86 has no corresponding instructions (other than perhaps the very recent
and not yet popular transactional extensions). Instead, the x86 world uses an
"atomic compare and exchange" instruction. Wikipedia articles:
[https://en.wikipedia.org/wiki/Compare-and-swap](https://en.wikipedia.org/wiki/Compare-and-swap) versus
[https://en.wikipedia.org/wiki/Load-link/store-conditional](https://en.wikipedia.org/wiki/Load-link/store-conditional)

------
epx
My case of a hardware bug was an FPU bug on a PC/104 platform that either
returned an absurd value for a floating-point division or crashed the program
with SIGFPE. It was the only FP operation in the program, and I was lucky
enough to log the result. I replaced it with a scaled integer division to
avoid the bug, because replacing thousands of boards was not an option.

