Hacker News
Memory-mapped I/O without mysterious macros (lwn.net)
70 points by ingve on Feb 27, 2019 | hide | past | favorite | 9 comments



On a tangentially related note, I love the PowerPC instruction to act as a mmio barrier: Enforce In Order Execution of I/O, or eieio.

For those anglophones who remember the kids' song "Old MacDonald Had a Farm", your code can pretty easily follow the song.

  read_a_register_right_here();
  eieio();
  then_write_a_register_right_here();
  eieio();

  bookkeep_here();
  bookkeep_there();
  a_lot_more_bookkeep_everywhere();

  then_write_a_register_right_here();
  eieio();


EIEIO is also an error code in GNU libc:

https://www.gnu.org/software/libc/manual/html_node/Error-Cod...

“Computer bought the farm.”


“While I/O memory looks like memory, and it can be tempting to access it by simply dereferencing a pointer, that does not always work on every architecture.”

Unfortunate that there are no examples, because that sounds more like something that is not actually MMIO (like the I/O space on x86), or a bug, since one of the reasons for using MMIO is that it behaves like memory.


Even memory doesn't behave like memory. The order in which DRAM sees your writes doesn't necessarily match the order in which you thought they were written, even though the CPUs themselves take extraordinarily complicated pains to make it look the same everywhere. You might think you're writing into a uniform space with byte granularity, but under the hood the bus is logically 512 bits wide and accessed almost exclusively in whole cache lines.

There are long specifications in every architecture that detail exactly how the memory model works. Nothing is simple.

And in the case of MMIO devices, they tend to care about things software doesn't (like, you must write 32 bit words aligned on 128 bit boundaries or it doesn't work -- stuff like that), so the rules get even hairier.


MMIO means something much narrower than "it behaves like memory". It really only means "it's addressable like memory." I can give you a bit of background in one case. This is just an "existence proof", if you will; I don't know how commonplace the situation is. Most of my experience is on microcontrollers that don't have caches, or in userspace.

I'm currently working on a Cortex-A9 platform: Zynq-7000. Somewhat out of date, but not outrageously so. One of this particular SoC's gotchas is that many of the memory masters on the system aren't coherent, either with each other or with the CPUs. Ethernet DMA? Not coherent. General purpose DMA? Not coherent. All but one of the ports from FPGA fabric into the hard processor? Not coherent.

In any case where you want to tell the other peripheral that it should be reading from what the CPU wrote, the CPU must therefore take measures to get it out of the CPU's caches first. Concretely: I want to give the PL330 DMA controller some instructions in the form of a little DMA program. Part of the setup is that I effectively give the DMA the start address of the program, and then the DMA reads the program afterwards. Therefore, I need a happens-before relationship from <constructing the dma program instructions> to <pass instruction pointer to dma engine>.

First idea: I'll use C11 atomics! Nope, no dice: none of the AXI slave ports into peripheral address spaces support an exclusive monitor. Many of the ARMv7 C11 atomics are based on LDREX/STREX (similar to load-linked/store-conditional), and without some bit of hardware to observe the exclusive access, it isn't exclusive. Actually it's a bit worse than that: LDREX doesn't have any way to communicate the failure to acquire an exclusive access, so STREX never passes. The C11 atomics are based on an assumption of forward progress. So now, `atomic_fetch_or()` is actually `while (true) {}`.

In kernel space.

OK, that's unfortunate, but very well, I'll use a barrier instruction, instead! I didn't really need exclusive access anyway, I was just doing that to get a portable release barrier. Well, that's not enough on its own, either. You see, barrier instructions enforce a kind of happens-before relationship, but it isn't the one I need. Yes that peripheral register was in Device memory, so it wasn't cached out from under you. But you placed your little DMA program in Normal memory (aka cacheable SDRAM). The master port on the PL330 doesn't pass through the processor's cache hierarchy at all. So that bit of hardware has no idea what you may have intended to put in the DMA program, it only gets to see whatever parts of it made it out to DRAM by the whimsical choices of the cache replacement policy, phase of the moon, and so on.

OK, fine then, be that way. I'll allocate a buffer for my DMA program from a chunk of memory that is specifically marked in the page tables as Device memory, even though it's actually in DRAM. Now the hardware works, but each individual write of my DMA program takes on the order of 10^2 machine cycles to complete. And since the PL330 uses a variable-length instruction set, and I am holding myself to the strict aliasing and alignment rules, I must write each instruction one byte at a time. Therefore, I can very easily take more time just sending each instruction out to DRAM than I did forming the entire thing in cache memory in the first place.

Note that all of this effort was consumed for the narrow purpose of describing the DMA transactions to perform. We haven't even gotten to the DMA transactions themselves. As near as I can tell, if a program on the CPU is either a source or sink of the data, I need to use cache maintenance instructions for the job. Fortunately, it isn't super hard to either invalidate or clean-and-invalidate the individual cache lines prior to consuming the data or right after sending it. But dammit, I'd just as soon not have to do either one of those things.

So yeah, the difficulties aren't insurmountable. There are moderately standard concoctions of barriers, cache operations, and custom memory allocators that can do the job. But it takes that much more time to deal with. And I couldn't help but notice that the successor generation (Zynq Ultrascale) has far more options to use coherent masters.


There are generally extra semantics needed to access even bog-standard MMIO, because of all the caching and write buffers between you and the device. Sometimes, as on x86 and ARM, those are encoded with the address in the page tables and MTRRs. Sometimes, as on PowerPC, parts of that are encoded into barriers in the instruction stream.


I need to take a look at the patches (on mobile now), but can anyone comment on how a barrier at unlock can provide ordering between mmio accesses under the same lock? Or is it required that every register have a unique lock?


It doesn't; this is about this case:

  CPU1:
  lock()
  write1
  unlock()
  
  CPU2, after CPU1:
  lock()
  write2
  unlock()
Until now it was not guaranteed on all platforms that the writes arrive in the same order, even if CPU2 entered the critical section after CPU1.

This was because spinlocks only implied regular memory barriers, not mmiowb barriers.


I don't think this is meant to remove the need for explicit barriers if a single function is performing a multi-step process of interacting with a MMIO device, such as a complicated initialization that requires writing some registers, reading the result from some others, and then writing to some more. Instead, this seems to be about synchronizing cases where multiple functions running on different threads may want to simultaneously interact with the same device.

I don't really have any idea of the relative frequency of those two cases.



