This architecture tastes like microarchitecture [pdf] (wp3workshop.website)
71 points by fanf2 on Mar 10, 2018 | 15 comments

If you look at the context of the paper, it makes a little more sense. This is a one-day workshop that's effectively a retrospective look at computer architecture over the past 50 years and asking "what's next?" So the real point of the paper is the somewhat meandering discussion of "why exposing microarchitecture is bad" that's the first part of the paper, with the latter part being a suggestion of how to take that advice to the future. That said, it's (IMHO) the worst of the bunch.

The point the paper is trying to make is that exposing microarchitecture to the ISA is a decision that ended with problems. It then claims that some of the bulwarks of modern ISAs still act too much like exposed microarchitectural details, namely the presentation of a finite register set. It sort of doesn't bode well when your references for "this is totally a fruitful idea to go down" are all limited to papers from the 1990s.

Memory is the bottleneck... so let's use memory ever more intensely.


Seriously, hardware-accelerated context switching is a much more viable option than saying "screw registers, we'll make memory do absolutely everything." To make matters worse, in most languages you can't tell where the pointers are going before you AG them, which results in pointer chasing pain.

Stateless ISAs are interesting, but I seriously doubt the way forward is to make memory do everything for us.

Also, the biggest cost of context switching is not in the register file migration but in the flushing of the TLBs. Look to SASOS+VIVT to address that, not whatever this is proposing.

One point the authors make is that registers are memory. I’m not sure they’re saying screw registers, just that they’re counter to portability and in general only a yet faster cache for programs to use. The other point they seem to make is that a secondary reason we have architected registers is that renaming over an arbitrarily large space was (and in some ways still is) difficult. By using graph coloring and reg alloc in the compiler, we narrow the encoding space before execution and give the hardware a much simpler problem. Recent research results on registerless architectures show that there are quite a few benefits. There’s a lie in calling it registerless, though: the registers are still there, just transparent and hence portable.
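To make the "narrow the encoding space before execution" step concrete, here is a minimal sketch of greedy graph coloring over an interference graph. The value names, the graph, and the register count `k` are all made up for illustration; production allocators are considerably more involved than this.

```python
def color_registers(interference, k):
    """Greedy coloring: give each virtual value one of k registers,
    or None (spill) when every register is taken by a live neighbor."""
    assignment = {}
    # Visit the most-constrained values first (highest degree, ties by name).
    for v in sorted(interference, key=lambda n: (-len(interference[n]), n)):
        taken = {
            assignment[n]
            for n in interference[v]
            if assignment.get(n) is not None
        }
        free = [r for r in range(k) if r not in taken]
        assignment[v] = free[0] if free else None
    return assignment


# Hypothetical interference graph: an edge means both values are live at once.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
alloc = color_registers(graph, k=3)
```

With three registers nothing spills in this toy graph; with two, one member of the a/b/c triangle has to go to memory, which is the part a registerless design would hand to the hardware instead.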

> Stateless ISAs are interesting, but I seriously doubt the way forward is to make memory do everything for us.

You're forgetting one very important thing: instructions are memory too. Right now it seems you're doubly stressing the hardware by forcing it to move data (instructions) in order to move data (actual data).

Raising the abstraction level of the ISA and making more logic part of proprietary CPU microcode is exactly the wrong direction to go in for security. I'm not sure how the authors can possibly claim this will be beneficial for security. Thankfully, recent events (Meltdown and Spectre) have made the flaws in their philosophy abundantly clear.

>When considering the period of rapid evolution that microarchitecture is about to face with the end of lithography scaling, abstractions that are free from underlying microarchitectural influence are critical to minimizing future disruption.

Ironically completely accurate! If chipmakers can successfully sell us on super-high-level ISAs (moving the scheduler and hypervisor inside their proprietary chips, as this article seems to suggest), they will be able to lock us in and easily prevent any "disruption" of their business model.

The greatest advantage for innovation in ISAs in the past 20 years has been the development of Linux and GCC, which allow any new chip to get a huge amount of working software with relatively little porting effort. Moving more logic out of this open source software portability layer and into the proprietary chip will just make it harder to build new chips from scratch.

I don't think that's what they're suggesting. This is more about replacing most register references within opcodes with stack pointer offsets, for faster context switching. It only replaces the register-spilling part of a context switch, along with the register-allocation code in a compiler.
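A toy model of that reading, as a sketch only: operands are offsets from the stack pointer, so a task switch swaps one pointer instead of spilling a register file. The `ToyCore` class, its `add` instruction, and the frame addresses are hypothetical and not taken from the paper.

```python
class ToyCore:
    """Hypothetical core whose operands are stack-pointer offsets, not registers."""

    def __init__(self, mem_words=64):
        self.mem = [0] * mem_words
        # The only state this toy swaps on a task switch
        # (a real core would also have a PC, flags, and so on).
        self.sp = 0

    def add(self, dst, a, b):
        # "add [sp+dst], [sp+a], [sp+b]": two memory reads, one memory write.
        self.mem[self.sp + dst] = self.mem[self.sp + a] + self.mem[self.sp + b]

    def switch_to(self, new_sp):
        # Nothing to spill: the old task's values already live in memory.
        old_sp, self.sp = self.sp, new_sp
        return old_sp


core = ToyCore()
task_a, task_b = 0, 16  # made-up frame base addresses

core.sp = task_a
core.mem[task_a + 0], core.mem[task_a + 1] = 2, 3
core.add(2, 0, 1)       # task A: slot 2 = slot 0 + slot 1

core.switch_to(task_b)
core.mem[task_b + 0], core.mem[task_b + 1] = 10, 20
core.add(2, 0, 1)       # task B: same offsets, different frame

core.switch_to(task_a)  # task A's slots are untouched
```

Both tasks execute the identical `add(2, 0, 1)` encoding; only the base pointer differs, which is why the compiler no longer needs to pick register numbers in this model.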

It sounds a bit like SPARC, but more flexible? And not particularly difficult to port to.

> It only replaces the register-spilling part of a context switch, along with the register-allocation code in a compiler.

The compiler would still need to do stack slot coloring. And, as I noted elsewhere, they also seem to suggest dropping memory coherency, which makes the stack here just a register file with single instructions that load and store the registers to memory.

That's what I'm guessing too, something a bit like SPARC but more flexible. Seems you can get a data movement win if you don't mindlessly move the entire Linux task struct on every swap.

I don't think that's what anyone is suggesting, and certainly not for secrecy or lockin purposes.

> "We should instead consider how to be more helpful to the consumers of our interface. Looking upwards in the system stack, the first immediate consumers of the architecture abstraction are hypervisors, virtual machine monitors, microkernels, and operating systems. How might we provide acceleration of their operations? What might an architecture with a taste of hypervisor consist of? How might we provide easy-to-use heterogeneous computation without forcing these software layers to continually adapt to our innovations?"

ARM actually has had some task-switch architecture assistance for decades: register bank switching. Like a very cut-down hyperthreading. I think these days it's only used for FIQ.

Welcome to the club.

* The Myth of Sufficiently Smart Compiler (SSC)

* The Myth of Sufficiently Smart Virtual Machine (SSVM)

* The Myth of Sufficiently Smart Instruction Set Architecture (SSISA)

All these ideas have the same goal: how to preserve and use high-level information when transforming code to the lower level in a way that gives maximum performance.

In a hypothetical dreamland where all these exist, compilers, hypervisors, virtual machine monitors, microkernels, operating systems, ISAs and microcode would generate a "sufficiently smart" stack that provides a performance increase.



The author basically says "we've been doing it wrong" and doesn't provide an alternative.

Meh, for a retrospective paper, I think the authors probably provided more hints as to the future than they really needed to. I feel like they captured the exact reasons why ISA design is so difficult to do right. As for “doing it wrong”, I didn’t take it that way. I think the authors lament the influence of what they seem to feel is a mantra that dictates the programmer must deal directly with the hardware versus providing a sufficiently abstract ISA. Building circuits for ML accelerators, etc. is actually damned easy; exposing those to the programmer in a portable way that does not require rearchitecting the program every time you change the accelerator is tough. I literally loathe porting for Intel b/c the AVX insns behave differently and are essentially architecting in the microarch, passing the complexity directly to the programmer. I’d much rather the RISC-V solution or the ARM solution.

Tl;dr, as far as I can tell: Context switch performance can be improved by abolishing registers. Also, here's some indications that this won't necessarily completely annihilate processor performance.

It certainly took a long time for the paper to get to any sort of point. The beginning of the paper is a well-trod (for anyone who's looked at computer architecture) commentary that "VLIW doesn't work," "delay slots don't work," and "the MIPS let-the-compiler-figure-out-register-interlocking doesn't work."

So the brunt of the paper is the idea of replacing registers with memory. Except:

* The overarching goal is making context switch overheads go away. Except isn't the expensive part of the context switch flushing the TLB and all the caches?

* A register file on an x86/POWER-kind of processor has something like a dozen ports and 1 clock cycle latency. The L1 cache has 4 clock cycle latency (on a hit) and about two ports. Ports are expensive, even more so on large memory banks (such as a cache).

* At three memory operations (two reads and one write) per instruction, and hopefully 1-2 instructions per cycle, you're really hammering on the cache coherency traffic. The paper does suggest making the cache incoherent.

* Given the expense of doing virtual memory lookup, and the probable need for cache incoherence, it may make sense to make the main stack effectively physically tagged within a function and make the compiler emit offsets within the cache.
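A back-of-the-envelope check of the port argument in the bullets above, using only the rough figures quoted there (three operand accesses per instruction, 1-2 instructions per cycle, about two L1 ports, about a dozen register-file ports). These are ballpark numbers from this comment, not measurements of any particular chip.

```python
def port_demand(operand_accesses_per_insn, insns_per_cycle):
    """Operand accesses the core needs to service every cycle."""
    return operand_accesses_per_insn * insns_per_cycle


L1_PORTS = 2        # "about two ports", per the comment above
REGFILE_PORTS = 12  # "something like a dozen ports"

for ipc in (1, 2):
    demand = port_demand(3, ipc)
    print(
        f"IPC {ipc}: {demand} accesses/cycle, "
        f"{demand / L1_PORTS:.1f}x the L1 ports, "
        f"{demand / REGFILE_PORTS:.2f}x the register-file ports"
    )
```

Under these assumptions the demand is 3 to 6 accesses per cycle, i.e. 1.5x to 3x what a two-ported L1 can supply, while a dozen-ported register file would still have headroom.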

At this point, you have a block of memory that's indexed independently of main memory, and doesn't communicate with main memory most of the time, and is meant to be accessed as frequently as a register file. Well, it basically is a register file at that point. The only thing you're adding is the ability to dump and read the register file to main memory at specified points.

Which sounds a bit like implementing the LLVM IR as an ISA. You don't have registers, but instead you get an infinite number of temporary values.
