The point the paper is trying to make is that exposing microarchitecture to the ISA is a decision that has historically ended in problems. It then claims that some of the mainstays of modern ISAs still amount to exposed microarchitectural detail, namely the presentation of a finite register set. It doesn't exactly bode well when all your references for "this is a fruitful path to go down" are papers from the 1990s.
Seriously, hardware-accelerated context switching is a much more viable option than saying "screw registers, we'll make memory do absolutely everything." To make matters worse, in most languages you can't tell where the pointers are going before you address-generate (AG) them, which results in pointer-chasing pain.
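To illustrate the pointer-chasing point with a rough C sketch (my own example, not from the paper): in a linked-list walk, each address is only known once the previous load has completed, so the memory system can't generate addresses ahead of time.

```c
#include <stddef.h>

struct node {
    struct node *next;   /* the next address is itself data */
    int payload;
};

/* Each iteration must wait for the previous load before it can even
 * compute the next address, so the loads serialize on memory latency. */
int sum_list(const struct node *n) {
    int total = 0;
    while (n != NULL) {
        total += n->payload;
        n = n->next;     /* address generation depends on the loaded value */
    }
    return total;
}
```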
Stateless ISAs are interesting, but I seriously doubt the way forward is to make memory do everything for us.
Also, the biggest cost of context switching is not register file migration but flushing the TLBs; look to SASOS+VIVT to address that, not whatever this is proposing.
You're forgetting one very important thing: instructions are memory too. Right now it seems you're doubly stressing the hardware by forcing it to move data (instructions) in order to move data (actual data).
>When considering the period of rapid evolution that microarchitecture is about to face with the end of lithography scaling, abstractions that are free from underlying microarchitectural influence are critical to minimizing future disruption.
Ironically completely accurate! If chipmakers can successfully sell us on super-high-level ISAs (moving the scheduler and hypervisor inside their proprietary chips, as this article seems to suggest), they will be able to lock us in and easily prevent any "disruption" of their business model.
The greatest advantage for innovation in ISAs in the past 20 years has been the development of Linux and GCC, which allow any new chip to get a huge amount of working software with relatively little porting effort. Moving more logic out of this open source software portability layer and into the proprietary chip will just make it harder to build new chips from scratch.
It sounds a bit like SPARC, but more flexible? And not particularly difficult to port to.
The compiler would still need to do stack slot coloring. And, as I noted elsewhere, they also seem to suggest dropping memory coherency, which makes the stack here just a register file with single instructions that load and store the registers to memory.
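For anyone unfamiliar with the term: stack slot coloring is to spill slots what register allocation is to registers — values whose lifetimes never overlap can share a slot. A C-level sketch of the idea (illustrative only, nothing here is from the paper):

```c
#include <stdio.h>

static int produce(int x) { return x * 2; }

/* Uncolored: one conceptual slot per value, even though 'a' is dead
 * before 'b' is born, so the frame is larger than it needs to be. */
static void uncolored(void) {
    int a = produce(1);
    printf("%d\n", a);       /* 'a' is dead after this point */
    int b = produce(2);
    printf("%d\n", b);
}

/* Colored: the two non-overlapping lifetimes share one slot. On an ISA
 * where the stack *is* the register file, this allocation problem never
 * goes away; it just moves from registers to frame offsets. */
static void colored(void) {
    int slot = produce(1);
    printf("%d\n", slot);
    slot = produce(2);
    printf("%d\n", slot);
}

int main(void) { uncolored(); colored(); return 0; }
```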
> "We should instead consider how to be more helpful to the consumers of our interface. Looking upwards in the system stack, the first immediate consumers of the architecture abstraction are hypervisors, virtual machine monitors, microkernels, and
operating systems. How might we provide acceleration of
their operations? What might an architecture with a taste
of hypervisor consist of? How might we provide easy-touse
heterogeneous computation without forcing these software
layers to continually adapt to our innovations"
ARM has actually had some task-switch assistance in the architecture for decades: register bank switching. Like a very cut-down hyperthreading. I think these days it's only used for FIQ.
* The Myth of Sufficiently Smart Compiler (SSC)
* The Myth of Sufficiently Smart Virtual Machine (SSVM)
* The Myth of Sufficiently Smart Instruction Set Architecture (SSISA)
All these ideas have the same goal: how to preserve and use high-level information when lowering code, in a way that yields maximum performance.
In a hypothetical dreamland where all these exist, compilers, hypervisors, virtual machine monitors, microkernels, operating systems, ISAs and microcode would together form a "sufficiently smart" stack that provides a performance increase.
So the thrust of the paper is the idea of replacing registers with memory. Except:
* The overarching goal is making context switch overheads go away. Except isn't the expensive part of the context switch flushing the TLB and all the caches?
* A register file on an x86/POWER-kind of processor has something like a dozen ports and 1 clock cycle latency. The L1 cache has 4 clock cycle latency (on a hit) and about two ports. Ports are expensive, even more so on large memory banks (such as a cache).
* At three memory operations (two reads and one write) per instruction, and hopefully 1-2 instructions per cycle, you're really hammering on the cache coherency traffic (see the back-of-the-envelope numbers after this list). The paper does suggest making the cache incoherent.
* Given the expense of doing a virtual memory lookup, and the probable need for cache incoherence, it may make sense to make the main stack effectively physically tagged within a function and have the compiler emit offsets within the cache.
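Back-of-the-envelope numbers for those two bullets, using the figures quoted above (my arithmetic, not the paper's):

```c
#include <stdio.h>

int main(void) {
    /* Figures assumed from the discussion above, not measured anywhere: */
    const int mem_ops_per_insn = 3;  /* two reads + one write per instruction */
    const int insns_per_cycle  = 2;  /* optimistic sustained issue width      */
    const int l1_ports         = 2;  /* roughly what an L1 D-cache provides   */
    const int regfile_ports    = 12; /* "something like a dozen" on big cores */

    int demand = mem_ops_per_insn * insns_per_cycle;
    printf("L1 accesses demanded per cycle: %d\n", demand);
    printf("L1 ports available:             %d\n", l1_ports);
    printf("Register-file ports available:  %d\n", regfile_ports);
    /* ~6 accesses/cycle against ~2 L1 ports (at 4-cycle latency) is why the
     * "stack as register file" ends up needing register-file-like hardware. */
    return 0;
}
```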
At this point, you have a block of memory that's indexed independently of main memory, and doesn't communicate with main memory most of the time, and is meant to be accessed as frequently as a register file. Well, it basically is a register file at that point. The only thing you're adding is the ability to dump and read the register file to main memory at specified points.
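Which is more or less what operating systems already do at context-switch boundaries. A minimal sketch of "dump and read the register file to main memory at specified points" — the struct layout and names are invented for illustration, and a real kernel does this in assembly rather than via memcpy:

```c
#include <string.h>
#include <stdint.h>

/* A saved register file is just a block of memory that gets copied to and
 * from a task's save area at well-defined points. */
struct regfile {
    uint64_t gpr[32];
    uint64_t pc;
    uint64_t flags;
};

/* "Dump" the live register state into the outgoing task's save area... */
static void save_context(struct regfile *save_area, const struct regfile *live) {
    memcpy(save_area, live, sizeof *save_area);
}

/* ...and "read" it back for the incoming task. */
static void restore_context(struct regfile *live, const struct regfile *save_area) {
    memcpy(live, save_area, sizeof *save_area);
}
```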