Design of the RISC-V Instruction Set Architecture [pdf] (berkeley.edu)
138 points by ingve on May 1, 2016 | 48 comments

This should be a good read. I aspired to do stuff like this in my undergrad: https://github.com/nickdesaulniers/Omicron/blob/master/Proje...

People like the author, Andrew Waterman, and others like Bunnie Huang who work towards making more of computing open are inspiring. I feel like the last piece of the puzzle is open FPGAs. I'm quite sure FPGAs are critical to the open hardware movement.

I should quit Google and solve this...

Also, some choice quotes I've pulled while reading chpt2 ("Why Develop a New Instruction Set?"):

* MIPS: ...MIPS remains a trademark of Imagination Technologies; MIPS compatibility cannot be claimed without their permission.

* SPARC: SPARC Intl. continues to grant licenses for v8 and v9...for $99...[but] continued development of Oracle SPARC is proprietary.

* Alpha: The Alpha also highlights an important risk of using commercial ISAs: they can die.

* ARMv7: Between ARM[v7] and THUMB, there are over 600 instructions...NEON adds hundreds more.

* ARMv8: The compact THUMB instruction set has not been brought along to the 64-bit address space...as we show in chpt5, it cannot compete in code size with a variable length ISA.

* 80x86: Intel [has done all these great things]...the design quality of the ISA is not one of them.

Good list. Let me add to it.

SPARC: Don't forget the only production-ready, GPL'd CPU from Gaisler & its proprietary improvements. Also, Fujitsu and Russia have SPARC processors. Not just Oracle. Fujitsu's are badass.



Alpha: Yes, it died. However, my digging found that Intel and Samsung spun it off into a dedicated company licensing Alpha to anyone interested. Only found one vendor that bought it for machines. Nobody major wanted it to exist enough to buy a license. So, the market killed it.

ARM: Similar issues to MIPS in that ARM will sue your ass any chance they get. There's a reason the clone went with ARMv2 and said it was not for profit. ;)

Intel did a great one in the past. Still available but crippled. Look here:


POWER and PPC stayed head to head with Intel for years with many PPC cores and boards in embedded. Even in FPGA's. They also accelerated decimal ops nice for business apps.


Dreamcast had a Super-H. Those are still around with their own cost-benefit analysis:


Just to add a little bit more - there's been a handful of developments around the SuperH as patents start to lapse. The Open Processor Foundation have a couple of SuperH implementations you can use on an FPGA (note: I haven't tried this) they've got a bit of information up at their website: http://0pf.org/

However this article is a better introduction IMO: https://lwn.net/Articles/647636/

Yep. Neat that it's getting a revival.

I'm curious why there wasn't more interest in the DEC Alpha AXP. It was designed from the beginning as a 64-bit chip (unlike x86, POWER, ARM, MIPS, HP PA-RISC, and almost everything else except Itanium). It consciously avoided features (and memory model features) that interfered with efficient parallel superscalar implementations.

The Ultrix, Linux, and VMS kernels didn't run in privileged mode, but instead made upcalls into PAL code (a kind of super lightweight hypervisor).

Any insights as to why the Alpha AXP didn't get more interest after it was dropped by DEC / HP?

I'd like to hear others chime in on this as I'm also curious. What I saw, though, was a general trend towards Intel due to its price/performance advantages and many apps moving to Windows. The UNIX workstations could easily be five digits for a few CPUs. Maybe price didn't come down enough.

Additionally, most UNIX users were going toward BSD or Linux as they picked up steam. They were available on cheap Intel boxes. On top of it, Alphas may (3rd party info) have had a hard time keeping power usage low trying to beat Intel on performance. IBM PPC chips having that problem is part of why Apple switched to x86 for laptops. Eventually, Compaq killed Alphas off citing the development cost (8-9 digits) and how sales didn't justify it. I.P. got spun off into an Intel/Samsung company, licensed maybe once, and idk from there.

PALcode ended up in Itanium but not sure if for Intel or developer use. Far as Alpha, crash-safe.org secure CPU's use that ISA for some reason. You could always, in theory, put an Alpha decoder or whatever in front of other RISC internals to emulate one.

I never had a chance to play with the i960, what made it great?

I seriously wish the SH-5 had actually been released and the new POWER chips were actually available in a bit of a cheaper form.

True on PPC and SH5. Far as i960, you need to know the context. Look at these:




The i432 was their most original work. It was a mainframe, capability-addressed, error recovery, and OS/apps in safe Ada. Did too much in hardware, though. BiiN had i960 which kept best parts of that. Think fast RISC with high reliability and security features.

> SPARC Intl. continues to grant licenses for v8 and v9...for $99...[but] continued development of Oracle SPARC is proprietary.

Can anyone explain this situation? There are multiple freely licensed (GPL or LGPL) SPARC IP cores, but SPARC is still described as proprietary?

It's going to be a very interesting space, IMO.

The thing about FPGAs is that they are deeply dependent on storage technology. Right now their gate configuration is kept in SRAM loaded in at boot, which is the worst of all worlds. Not only is it volatile, it's very power hungry and takes up more than half of the space on the chip. And flash memory hasn't helped much because (from what I've been able to figure out) having tiny sections of flash for each logic element is impractical due to the physics of how it works.

However, flash and SRAM aren't the end of the line, and advances in non-volatile memory will disproportionately affect FPGAs. It seems almost inevitable to me that the NVM landscape will be changing soon: several technologies are being worked on in parallel (e.g. NRAM and RRAM) - one of them is going to win, there's just too much demand for higher performance storage. When that happens, FPGAs will get to draft behind the huge tech investment made for the sake of SSDs etc.

FPGAs with tiny NVM will look a lot more like structured ASICs (a prefabbed grid of gates, with a custom metal interconnect layer) than their current form. Instead of needing ~10x power and supporting ~1/4 the clock speeds, it'd be about half as fast and twice the power - still not ideal, but within spitting distance of the advantages of FPGAs outweighing the disadvantages vs. ASICs, and certainly a qualitative difference from where we're at today.

This, combined with having the same thoughts as you about the criticalness to open computing, has gotten me to start working on improving the tooling for people just getting into FPGAs. It's hard to have an openness-based movement when there's not enough people, and it's hard to get people in if the first rung of the ladder's too high (e.g. installing a grab-bag of software and learning Verilog as step #1).

Here's my (very alpha) work so far: http://blinklight.io (I haven't publicized it yet because I'm going to be doing a big change to the pitch/target market soon). I'm well aware of the limitations of visual programming, but I think it's an excellent basis for developing intuition before moving on to HDLs (especially for people coming from software who are thrown off by Verilog's superficial similarities to C).

Haha, so Andrew's thesis is on here, too. If you like this, you might also be interested in the tech report we published just recently on RocketChip, our open-source RISC-V SoC generator.


I'm a grad student at the UC Berkeley architecture lab and worked with Andrew before he graduated. I'll also be interning at his new company, SiFive, over the summer. Happy to answer any questions you might have on RISC-V or RocketChip.

The best part of this paper is the thorough analysis of why existing ISAs suck (chapter 2).

It's interesting why ISAs suck or get rotten over time (for example x86). I mean, designing an ISA is not a job for interns, it has to be done by experienced engineers. So, do all these guys have no idea what they're doing? Or maybe some non-visible factors force certain decisions?

As is said in the abstract, the designers of RISC-V have the advantage of 3 decades of hindsight. So they could avoid stuff that was considered good back in the day but that history showed was more of a disadvantage, like delay slots or register windows.

The cynic in me likes to think that existing proprietary isa's keep adding new instructions partly to keep customers on the upgrade treadmill. If you design an open ISA from a clean slate, you can make a nice orthogonal one from the get go.

8086 was put together in just a few weeks, once it was clear Intel needed a "stop-gap" ISA, as their other design (432) was taking too many years to put together.

And it shows. The operations they blew 1-byte instructions on are astounding (e.g., ASCII Adjust after Addition...)

Lovely paper.

How about the silicon? By now we've seen quite a few more-or-less open microarchitecture designs (with various ISAs), and it sure must be nice to dream up a new, lean ISA on paper, even simulate it in software... but I would like to run Doom on it. And Quake. And not just in a simulation.

What are the chances that I can actually get to play with this design as I can with an AVR microcontroller on a breadboard or some ARM chip in one of these cheap evaluation boards? In, say, the following 5 or 10 years? Is it just a dream?

Will it be like the CPUs in the Lemote computers? Way underpowered, even compared to a high end smartphone chip, and power hungry enough to require a noisy fan?

lowRISC [0] is designing a RISC-V chip and intending to finish tape out this year.

[0] http://www.lowrisc.org/

I have often wondered, do we really need any general-purpose registers to be visible in the ISA? Aren't they an implementation detail, just like caching?

IMHO, a memory-to-memory architecture would make for a much simpler ISA and allow much easier code generation (no register allocation needed).

One problem would be instruction size. On a system with (e.g.) 32 registers, each register can be addressed with 5 bits, while a memory address is 32-64 bits.
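To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a three-operand instruction format (two sources, one destination), which is my assumption for illustration and not something the comment specifies; `operand_bits` is a hypothetical helper name.

```python
def operand_bits(num_operands: int, bits_per_operand: int) -> int:
    """Bits spent on operand fields alone, ignoring the opcode."""
    return num_operands * bits_per_operand

# Assumed three-operand format: dst, src1, src2.
register_form = operand_bits(3, 5)    # 32 registers -> 5 bits each
memory_form_32 = operand_bits(3, 32)  # full 32-bit addresses
memory_form_64 = operand_bits(3, 64)  # full 64-bit addresses

print(register_form, memory_form_32, memory_form_64)
```

Under these assumptions the register form needs 15 operand bits, while naive full-address operands need 96 or 192, which is why any memory-to-memory scheme would need some compact addressing mode.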

There could be a scheme for encoding large immediates in a compact way, just like ARM does. Besides, I don't think that instruction size or instruction cache pressure is the bottleneck nowadays.

The RISC-V designers implemented a compressed instruction extension (the most common instructions can be represented in half the size), and found a performance gain. So yes, instruction size matters. From the 1.9 draft of the RISC-V compressed instructions specification:

"Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size."

Larger caches are slower, so just doubling the cache size doesn't necessarily recover the performance lost by larger instructions.

> implemented a compressed instruction extension (the most common instructions can be represented in half the size), and found a performance gain.

...which makes me wonder why they didn't just make that the standard. x86 is dense and that's partly where its performance comes from.

In the paper they discuss how x86 isn't even dense, & barely competes with ARM's constant width instructions

They planned for it since the beginning. While the base RISC-V encoding uses only 32-bit instructions, it's actually a variable-length encoding, where the least significant bits define the instruction length. If the least significant bits are 00, 01, or 10, it's a 16-bit instruction.

The base RISC-V was designed to be minimal; for instance, hardware multiply and divide is the M extension. I expect that, once the C (compressed instructions) extension is standardized, most high-performance implementations will include it, since the extra area cost should be minimal compared to the out-of-order machinery; small in-order implementations where area is at a premium and performance is not so important will continue to use only the base 32-bit instructions. There's even a variant, RV32E, which omits half the registers to make the core even smaller.
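A minimal sketch of the length rule described above, in Python for readability. The function name `rv_insn_length` is made up for illustration. It only handles the 16-bit and 32-bit cases; the spec also reserves patterns for longer encodings, which this sketch simply refuses to decode.

```python
def rv_insn_length(first_halfword: int) -> int:
    """Return the instruction length in bytes, given its first 16 bits."""
    low2 = first_halfword & 0b11
    if low2 != 0b11:
        # Low bits 00, 01 or 10: a 16-bit (compressed) instruction.
        return 2
    if (first_halfword >> 2) & 0b111 != 0b111:
        # Low bits 11 and bits [4:2] not all ones: a 32-bit instruction.
        return 4
    # Bits [4:0] all ones mark encodings longer than 32 bits.
    raise NotImplementedError("longer-than-32-bit encodings not handled")

# 0x0013 is the low half of the standard NOP (addi x0, x0, 0 = 0x00000013).
print(rv_insn_length(0x0013))
# 0x0001 is the compressed NOP (c.nop).
print(rv_insn_length(0x0001))
```

The point is that a fetch unit only has to look at a few low bits of the first halfword to know where the next instruction starts.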

x86 is NOT dense - it averages nearly 4 bytes per instruction on workloads such as SPEC.

And not everybody wants to go through the design effort to support variable-length encoding (it's non-trivial and can potentially increase pipeline latency to support it). It very well could become a de facto RISC-V standard, but that remains to be seen.

Well, if you really want to push it, you can have a smaller, faster L1 cache, like with 32 words or so, and a set of special instructions, shorter, but only capable of addressing this small memory space...

Instruction fetching and parsing is an important bidimensional bottleneck that does restrict your architecture. It's not a bottleneck for run-time performance because it's predictable and optimized at design time.

Memory can be addressed by taking an offset from RSP (or e.g. [0x10]).

Here is how I implemented a small instruction set for the VM of my experimental language, without using GPRs.

Basically, instruction operands are divided into 4 classes:

- G (global): the operand refers to the global memory space and contains an offset and size.

- A (automatic): the operand refers to the current stack frame. The offset is relative to the current base pointer.

- I (immediate)

- T (temporary): this is where intermediate results from computations are stored. For example, the result of `a + b` from `(a + b) * c` would be stored in the T memory space. Upon the end of a statement in the HLL, all values in the temporary space are discarded.

Reference: https://github.com/bbu/quaint-lang/blob/master/src/exec.c
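For readers who want to see the four operand classes in action, here is a toy sketch. It is not taken from the linked exec.c; every name in it (`Cls`, `Operand`, `VM`, `end_statement`) is hypothetical, and sizes are ignored so that each slot holds one integer.

```python
from dataclasses import dataclass
from enum import Enum

class Cls(Enum):
    G = "global"     # offset into global memory
    A = "automatic"  # offset from the current base pointer
    I = "immediate"  # literal value
    T = "temporary"  # per-statement scratch slot

@dataclass(frozen=True)
class Operand:
    cls: Cls
    value: int  # offset for G/A/T, literal for I

class VM:
    def __init__(self, global_size=16, stack_size=16):
        self.globals = [0] * global_size
        self.stack = [0] * stack_size
        self.bp = 0       # base pointer of the current frame
        self.temps = {}   # temporary space, cleared after each statement

    def load(self, op):
        if op.cls is Cls.I:
            return op.value
        if op.cls is Cls.G:
            return self.globals[op.value]
        if op.cls is Cls.A:
            return self.stack[self.bp + op.value]
        return self.temps[op.value]

    def store(self, op, v):
        if op.cls is Cls.I:
            raise ValueError("cannot store to an immediate")
        if op.cls is Cls.G:
            self.globals[op.value] = v
        elif op.cls is Cls.A:
            self.stack[self.bp + op.value] = v
        else:
            self.temps[op.value] = v

    def add(self, dst, a, b):
        self.store(dst, self.load(a) + self.load(b))

    def mul(self, dst, a, b):
        self.store(dst, self.load(a) * self.load(b))

    def end_statement(self):
        self.temps.clear()  # discard all temporaries

# d = (a + b) * 4, with a global, b automatic, d global.
vm = VM()
vm.globals[0] = 2          # a
vm.stack[vm.bp + 1] = 3    # b
vm.add(Operand(Cls.T, 0), Operand(Cls.G, 0), Operand(Cls.A, 1))
vm.mul(Operand(Cls.G, 1), Operand(Cls.T, 0), Operand(Cls.I, 4))
vm.end_statement()
print(vm.globals[1])
```

The intermediate `a + b` only ever lives in the T space, so no instruction names a general-purpose register.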

> - A (automatic): the operand refers to the current stack frame. The offset is relative to the current base pointer.

Did you mean stack pointer? The base pointer gets set once, whereas the stack pointer changes between every function call.

> Here is how I implemented a small instruction set for the VM of my experimental language, without using GPRs.

A stack pointer is still necessary unless some memory region is predefined as the stack.

'A class' can be made more space efficient by dividing stack into 16 chunks and then instruction would only need 4bit to address.

According to the thesis, recent Intel microarchitectures cache the top several words of the stack[1]. The thesis references the Intel optimization manual but does not say where in the manual.

[1] pg 12, para 5.

I don't understand this comment, maybe because I can't picture what a "memory-to-memory architecture" would look like.

You MUST worry about registers due to the exponential increase in latencies between registers and larger memory communicating with the ALU. Also, the number of execution units is limited.

If all memory accesses cost the same in terms of latency, then yes, registers would be an implementation detail.

Eliminating GP registers from the ISA doesn't preclude any optimizations which could be done dynamically in the microarchitecture. The CPU would still be able to deal with registers and caches internally, it's just that they won't be visible to the programmer.

Think of how x86-64's 16 architectural registers are dynamically renamed to more than 120 physical registers.

That certainly is an interesting idea to think about.

I would think it's hard to beat the compiler though (when it knows how many GP registers there are). Consider register pressure. If the compiler assumes there are infinite GP registers (because they're hidden now, just an implementation detail), it can generate code that spends more time spilling and reloading registers than computing.

The compiler has seen the code before; it knows how many variables it will need to work with and how to schedule available resources. The processor (relatively) simply is executing a stream of instructions. It would have to look pretty far ahead (VLIW) or have sufficient execution units (superscalar) to have high throughput and not get bogged down spilling.

But, if the finite amount of resources, like number of processing units or GPRs become very high relative to the amount of symbolic variables, then I think the register names could be hidden.

> when it knows how many GP registers there are

But compilers don't know this! They know how many registers there are in the ISA, but not how many there really are in the physical architecture.

Physical architectures map ISA register names into register file locations. It's an abstraction! You can really use a register more than once at the same time because they will be renamed. Your parent is suggesting that we read and write from memory as normal, and just map memory locations into the register file, rather than ISA register names.

Registers may be more of an abstraction than you realise.

(Maybe compilers do know about how large the register rename buffer really is on certain physical chips and take this into account when allocating them. Sounds like something Intel's C compiler might do. I'm not an expert.)

> You can really use a register more than once at the same time because they will be renamed.

What do you mean by this?

Do you realize register renaming is only done for instruction scheduling purposes, to avoid bubbles in the pipeline?

I guess he refers to pipelining, which sort-of can give you a logical register twice. For example, you may be able to write a new value to a register before the instruction writing out the old value has completed.

Yes. I'm not an architecture expert but I understand that if you do something like:

    addl    %ebx, %eax
    movl    %eax, foo(%rip)
    addl    %ecx, %eax
    movl    %eax, bar(%rip)
Then the second add can start while the first and the first mov are still running, because the logical reuse of eax is independent of the first and so will receive a different entry in the rename buffer.

Happy to be confirmed or corrected by someone who knows more.
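A toy model of the renaming being discussed, applied to the four-instruction sequence above. This is only a sketch: it is not how any particular Intel or AMD core works, and the `rename` function and the `p0`, `p1`, ... physical register names are invented for illustration.

```python
from itertools import count

def rename(program, arch_regs=("eax", "ebx", "ecx")):
    """Rewrite (op, sources, destination) tuples to use physical registers.

    Every write to an architectural register gets a fresh physical one.
    """
    fresh = count()
    table = {r: f"p{next(fresh)}" for r in arch_regs}
    out = []
    for op, srcs, dst in program:
        # Sources read whatever the architectural name maps to right now.
        phys_srcs = tuple(table[s] for s in srcs)
        phys_dst = None
        if dst is not None:
            table[dst] = f"p{next(fresh)}"  # allocate a new physical register
            phys_dst = table[dst]
        out.append((op, phys_srcs, phys_dst))
    return out

program = [
    ("add", ("ebx", "eax"), "eax"),   # addl %ebx, %eax
    ("store_foo", ("eax",), None),    # movl %eax, foo(%rip)
    ("add", ("ecx", "eax"), "eax"),   # addl %ecx, %eax
    ("store_bar", ("eax",), None),    # movl %eax, bar(%rip)
]

for insn in rename(program):
    print(insn)
```

In this model the second add still reads the first add's result, so that true dependency remains. What renaming removes is the conflict on the name: the second add writes a different physical register than the one the first store reads, so it does not have to wait for that store to finish.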

Yes, on AMD64.

I'm a big fan of the Mill's idea of a specialized compiler that reencodes a general code into your architecture. It solves this problem, allows for JIT optimization of compiled languages, and is a nice portability layer available for virtualizing other architectures.

It also adds a nice layer for malware to hide, but well, we can't have everything.

You should look at the AS/400. They've actually been shipping a product using this scheme for almost 40 years.

> picture what a "memory-to-memory architecture" would look like

Only operate on RAM, let the processor figure loads and stores and register utilization as needed. I suspect leaving these decisions to the software as opposed to the hardware is less prone to bugs, obsolescence and whathaveyou.

The performance advantage isn't clear. It would require optimizations to the instruction coding and the memory access schemes, because plain 64 bit addresses take more space than register codes.

Stack Machines oppose Register Machines. Surely, register machines can emulate stack machines, but I suppose that would incur overhead.

in my (incredibly limited) experience, a fairly naive approach works ok. mips has lots of registers. for my C compiler, leave the arguments in a register. 5 args, 5 registers. pretend the next n registers are the top of the stack, 32 - 5 = 27, each variable gets its own register, just pretend the 27 available registers are simply the top of the stack. when var 28 comes along, it becomes complicated, you have to fool around with saving something to memory. but really, that's pretty rare.

the obvious penalty is the push/pop pair around each function call. as a first optimization pass, skip the pop when the pushed values aren't used - if a register isn't touched, you can get away with this. it's damn fast, because the push can rely on l1 cache. there's no stall when the processor promises to write that out eventually. you don't care till you need to do the corresponding pop. i had plans to skip the push , depending on usage, but didn't quite get that far.

anyway, the point is, just using registers as the top of the stack can get pretty good results with just a few man-months of effort.
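A rough sketch of the "registers as top of stack" assignment described above. It follows the comment's arithmetic and pretends all 32 registers are free, which a real MIPS ABI does not allow (some are reserved, e.g. for zero, the stack pointer and the return address). The `assign_slots` function is a hypothetical illustration, not the commenter's compiler.

```python
NUM_REGS = 32  # register count used in the comment's arithmetic

def assign_slots(num_args, num_locals, num_regs=NUM_REGS):
    """Map each argument and local to ('reg', n) or ('mem', spill_index)."""
    names = [f"arg{i}" for i in range(num_args)]
    names += [f"var{i}" for i in range(num_locals)]
    slots = {}
    next_reg = 0
    next_spill = 0
    for name in names:
        if next_reg < num_regs:
            slots[name] = ("reg", next_reg)     # still fits in a register
            next_reg += 1
        else:
            slots[name] = ("mem", next_spill)   # out of registers: spill
            next_spill += 1
    return slots

slots = assign_slots(num_args=5, num_locals=28)
print(slots["arg4"], slots["var26"], slots["var27"])
```

With 5 arguments, the first 27 locals land in registers and the 28th is the first to spill, matching the "when var 28 comes along" case.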

A Counter Machine with only one register would be an abstraction that doesn't need register annotation. It's turing complete with only two instructions.

Caching is not a transparent implementation detail of the processor, i.e. compilers and assembler programmers do take caching into account, as far as I can tell from reading HN. Register utilization is a concern as well. Register renaming is probably just an implementation detail, I wouldn't really know.

As far as I know, a L1 cache hit takes around 2-3 cycles. On a memory-to-memory architecture, every instruction would take that hit, multiple times, and that's if they hit; unless you use a slower fully associative cache, you can have cache line conflicts which reduce the hit rate. With registers, you can have a long sequence of operations without ever hitting the cache.

The tradeoff is probably run-time vs compile-time optimization.

It defines extensions but no way to detect them. Am I missing something?

Take a look at the mcpuid register (one of the CSR registers) in the draft of the privileged architecture specification.
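As a sketch of how software might interpret such a register: this assumes the extension field is a bitmask with one bit per letter (bit 0 for "A" through bit 25 for "Z"), which is my reading of the draft and should be checked against the spec itself. Reading the CSR requires machine-mode code, so the value here is just a hand-built example; `decode_extensions` and `has_extension` are hypothetical helpers.

```python
def decode_extensions(ext_bits: int) -> str:
    """Return the letters whose bits are set, assuming bit 0 = 'A' ... bit 25 = 'Z'."""
    return "".join(
        chr(ord("A") + i) for i in range(26) if (ext_bits >> i) & 1
    )

def has_extension(ext_bits: int, letter: str) -> bool:
    """Check a single extension letter under the same assumed layout."""
    return bool((ext_bits >> (ord(letter.upper()) - ord("A"))) & 1)

# Hand-built example value with the A, C, I and M bits set (not read from hardware).
example = (1 << 0) | (1 << 2) | (1 << 8) | (1 << 12)
print(decode_extensions(example))
print(has_extension(example, "M"), has_extension(example, "F"))
```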

Maybe try executing one and catch the exception.
