People like the author, Andrew Waterman, and others like Bunnie Huang who work towards making more of computing open are inspiring. I feel like the last piece of the puzzle is open FPGAs. I'm quite sure FPGAs are critical to the open hardware movement.
I should quit Google and solve this...
* MIPS: ...MIPS remains a trademark of Imagination Technologies; MIPS compatibility cannot be claimed without their permission.
* SPARC: SPARC Intl. continues to grant licenses for v8 and v9...for $99...[but] continued development of Oracle SPARC is proprietary.
* Alpha: The Alpha also highlights an important risk of using commercial ISAs: they can die.
* ARMv7: Between ARM[v7] and THUMB, there are over 600 instructions...NEON adds hundreds more.
* ARMv8: The compact THUMB instruction set has not been brought along to the 64-bit address space...as we show in Chapter 5, it cannot compete in code size with a variable length ISA.
* 80x86: Intel [has done all these great things]...the design quality of the ISA is not one of them.
SPARC: Don't forget the only production-ready, GPL'd CPU, from Gaisler, and its proprietary improvements. Also, Fujitsu and Russia have SPARC processors, not just Oracle. Fujitsu's are badass.
Alpha: Yes, it died. However, my digging found that Intel and Samsung spun it off into a dedicated company licensing Alpha to anyone interested. Only found one vendor that bought it for machines. Nobody major wanted it to exist enough to buy a license. So, the market killed it.
ARM: Similar issues to MIPS in that ARM will sue your ass any chance they get. There's a reason the clone went with ARMv2 and said it was not for profit. ;)
Intel did a great one in the past. Still available but crippled. Look here:
POWER and PPC stayed head to head with Intel for years, with many PPC cores and boards in embedded, even in FPGAs. They also accelerated decimal ops, which is nice for business apps.
Dreamcast had a Super-H. Those are still around with their own cost-benefit analysis:
However this article is a better introduction IMO: https://lwn.net/Articles/647636/
The Ultrix, Linux, and VMS kernels didn't run in privileged mode, but instead made upcalls into PAL code (a kind of super lightweight hypervisor).
Any insights as to why the Alpha AXP didn't get more interest after it was dropped by DEC / HP?
Additionally, most UNIX users were going toward BSD or Linux as those picked up steam, and they were available on cheap Intel boxes. On top of that, Alphas may (third-party info) have had a hard time keeping power usage low while trying to beat Intel on performance. IBM PPC chips having that problem is part of why Apple switched to x86 for laptops. Eventually, Compaq killed the Alpha off, citing the development cost (8-9 digits) and how sales didn't justify it. The IP got spun off into an Intel/Samsung company, licensed maybe once, and I don't know where it went from there.
PALcode ended up in Itanium, though I'm not sure whether for Intel or developer use. As far as Alpha goes, crash-safe.org's secure CPUs use that ISA for some reason. You could always, in theory, put an Alpha decoder or whatever in front of other RISC internals to emulate one.
I seriously wish the SH-5 had actually been released and that the new POWER chips were available in a somewhat cheaper form.
The i432 was their most original work. It was a capability-addressed, mainframe-class design with hardware error recovery and the OS/apps written in safe Ada. It did too much in hardware, though. BiiN had the i960, which kept the best parts of that: think fast RISC with high reliability and security features.
Can anyone explain this situation? There are multiple freely licensed (GPL or LGPL) SPARC IP cores, but SPARC is still described as proprietary?
The thing about FPGAs is that they are deeply dependent on storage technology. Right now their gate configuration is kept in SRAM loaded in at boot, which is the worst of all worlds. Not only is it volatile, it's very power hungry and takes up more than half of the space on the chip. And flash memory hasn't helped much because (from what I've been able to figure out) having tiny sections of flash for each logic element is impractical due to the physics of how it works.
However, flash and SRAM aren't the end of the line, and advances in non-volatile memory will disproportionately affect FPGAs. It seems almost inevitable to me that the NVM landscape will be changing soon: several technologies are being worked on in parallel (e.g. NRAM and RRAM), and one of them is going to win; there's just too much demand for higher-performance storage. When that happens, FPGAs will get to draft behind the huge tech investment made for the sake of SSDs etc.
FPGAs with tiny NVM will look a lot more like structured ASICs (a prefabbed grid of gates with a custom metal interconnect layer) than their current form. Instead of needing ~10x the power and supporting ~1/4 the clock speed, they'd be about half as fast at twice the power - still not ideal, but within spitting distance of the point where the advantages of FPGAs outweigh the disadvantages vs. ASICs, and certainly a qualitative difference from where we're at today.
This, combined with having the same thoughts as you about how critical they are to open computing, has gotten me to start working on improving the tooling for people just getting into FPGAs. It's hard to have an openness-based movement when there aren't enough people, and it's hard to get people in if the first rung of the ladder is too high (e.g. installing a grab-bag of software and learning Verilog as step #1).
Here's my (very alpha) work so far: http://blinklight.io (I haven't publicized it yet because I'm going to be doing a big change to the pitch/target market soon). I'm well aware of the limitations of visual programming, but I think it's an excellent basis for developing intuition before moving on to HDLs (especially for people coming from software who are thrown off by Verilog's superficial similarities to C).
I'm a grad student at the UC Berkeley architecture lab and worked with Andrew before he graduated. I'll also be interning at his new company, SiFive, over the summer. Happy to answer any questions you might have on RISC-V or RocketChip.
The cynic in me likes to think that existing proprietary ISAs keep adding new instructions partly to keep customers on the upgrade treadmill. If you design an open ISA from a clean slate, you can make a nice orthogonal one from the get-go.
And it shows. The operations they blew 1-byte instructions on are astounding (e.g., ASCII Adjust after Addition...)
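For anyone who hasn't run into it: here's roughly what that one-byte AAA instruction does, rendered as C from my reading of the Intel manual (an approximation, and it isn't even valid in 64-bit mode). It adjusts AL after adding two unpacked BCD digits:

    #include <stdbool.h>
    #include <stdint.h>

    struct flags { bool af, cf; };

    /* Approximate semantics of AAA (ASCII Adjust after Addition). */
    static void aaa(uint8_t *al, uint8_t *ah, struct flags *f) {
        uint16_t ax = (uint16_t)((*ah << 8) | *al);
        if ((*al & 0x0F) > 9 || f->af) {
            ax += 0x106;                 /* AX <- AX + 0x106 (AL += 6, carry into AH) */
            f->af = f->cf = true;
        } else {
            f->af = f->cf = false;
        }
        *ah = (uint8_t)(ax >> 8);
        *al = (uint8_t)(ax & 0x0F);      /* AL <- AL AND 0x0F: keep the low BCD digit */
    }

All that, for unpacked-BCD arithmetic nobody does anymore, occupying one of only 256 single-byte opcodes.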
How about the silicon? By now we've seen quite a few more-or-less open microarchitecture designs (with various ISAs), and it sure must be nice to dream up a new, lean ISA on paper, even simulate it in software... but I would like to run Doom on it. And Quake. And not just in a simulation.
What are the chances that I can actually get to play with this design as I can with an AVR microcontroller on a breadboard or some ARM chip in one of these cheap evaluation boards? In, say, the following 5 or 10 years? Is it just a dream?
Will it be like the CPUs in the Lemote computers? Way underpowered, even compared to a high end smartphone chip, and power hungry enough to require a noisy fan?
IMHO, a memory-to-memory architecture would make for a much simpler ISA and allow much easier code generation (no register allocation needed).
"Waterman shows that RVC fetches 25%-30% fewer instruction bits, which reduces instruction cache misses by 20%-25%, or roughly
the same performance impact as doubling the instruction cache size."
Larger caches are slower, so just doubling the cache size doesn't necessarily recover the performance lost by larger instructions.
...which makes me wonder why they didn't just make that the standard. x86 is dense and that's partly where its performance comes from.
The base RISC-V was designed to be minimal; for instance, hardware multiply and divide are in the M extension. I expect that, once the C (compressed instructions) extension is standardized, most high-performance implementations will include it, since the extra area cost should be minimal compared to the out-of-order machinery; small in-order implementations where area is at a premium and performance is not so important will continue to use only the base 32-bit instructions. There's even a variant, RV32E, which omits half the registers to make the core even smaller.
And not everybody wants to go through the design effort to support variable-length encoding (it's non-trivial and can increase pipeline latency). It very well could become a de facto RISC-V standard, but that remains to be seen.
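For what it's worth, RISC-V keeps the length determination itself cheap: the low bits of the first 16-bit parcel encode the instruction length. A minimal sketch in C, covering just the 16-bit and 32-bit cases (the spec reserves longer encodings, which I'm ignoring here):

    #include <stdint.h>

    /* Returns the instruction length in bytes, given the first 16-bit parcel. */
    static unsigned rv_insn_length(uint16_t parcel) {
        if ((parcel & 0x3) != 0x3)
            return 2;   /* low bits != 11: 16-bit compressed (C extension) instruction */
        if ((parcel & 0x1f) != 0x1f)
            return 4;   /* low bits == 11, bits [4:2] != 111: standard 32-bit instruction */
        return 0;       /* longer/reserved encodings, not handled in this sketch */
    }

The decode itself is trivial; the real design effort is in the fetch and realignment path, which is the pipeline-latency concern you mention.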
Instruction fetching and parsing is an important bidimensional bottleneck that does restrict your architecture. It's not a bottleneck for run-time performance because it's predictable and optimized at design time.
Basically, instruction operands are divided into 4 classes:
- G (global): the operand refers to the global memory space and contains an offset and size.
- A (automatic): the operand refers to the current stack frame. The offset is relative to the current base pointer.
- I (immediate): the operand's value is encoded directly in the instruction.
- T (temporary): this is where intermediate results from computations are stored. For example, the result of `a + b` from `(a + b) * c` would be stored in the T memory space. Upon the end of a statement in the HLL, all values in the temporary space are discarded.
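A rough sketch of how those four classes might look in an interpreter, with no GPRs anywhere (names, sizes, and layout here are my guesses for illustration, not the actual VM):

    #include <stdint.h>
    #include <string.h>

    enum operand_class { OP_G, OP_A, OP_I, OP_T };   /* global, automatic, immediate, temporary */

    typedef struct {
        enum operand_class cls;
        int32_t value;      /* offset into the G/A/T space, or the literal itself for I */
    } operand;

    static uint8_t global_mem[1 << 16];   /* G: global memory space */
    static uint8_t stack_mem[1 << 16];    /* A: stack frames, addressed via the base pointer */
    static uint8_t temp_mem[256];         /* T: per-statement temporaries */
    static uint32_t base_ptr;             /* current frame's base pointer */

    /* Resolve a non-immediate operand to the address it names. */
    static void *resolve(const operand *op) {
        switch (op->cls) {
        case OP_G: return &global_mem[op->value];
        case OP_A: return &stack_mem[base_ptr + op->value];
        case OP_T: return &temp_mem[op->value];
        default:   return NULL;           /* OP_I has no address */
        }
    }

    /* ADD dst, a, b: purely memory-to-memory; no register names appear anywhere. */
    static void vm_add32(const operand *dst, const operand *a, const operand *b) {
        int32_t x = (a->cls == OP_I) ? a->value : *(int32_t *)resolve(a);
        int32_t y = (b->cls == OP_I) ? b->value : *(int32_t *)resolve(b);
        int32_t r = x + y;
        memcpy(resolve(dst), &r, sizeof r);
    }

So for `(a + b) * c`, the add writes its result to a T operand, the multiply reads it back, and everything in the T space is discarded at the end of the statement, as described above.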
Did you mean stack pointer? The base pointer gets set once, whereas the stack pointer changes between every function call.
> Here is how I implemented a small instruction set for the VM of my experimental language, without using GPRs.
A stack pointer is still necessary unless some memory region is predefined as the stack.
The A class can be made more space efficient by dividing the stack frame into 16 chunks; then the instruction would only need 4 bits to address one.
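Something like this, if I'm reading the suggestion right (the chunk size is my own assumption):

    #include <stdint.h>

    #define CHUNK_SIZE 8   /* bytes per chunk; whatever granularity the VM picks */

    /* An A-class operand carries only a 4-bit chunk index (0..15); the effective
       address is formed from the current frame's base pointer. */
    static uint32_t a_class_addr(uint32_t base_ptr, uint8_t chunk_index) {
        return base_ptr + (uint32_t)(chunk_index & 0xF) * CHUNK_SIZE;
    }

The trade-off being that anything past 16 chunks into the frame would need a longer-form operand encoding.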
 pg 12, para 5.
You MUST worry about registers due to the exponential increase in latency between registers and the larger memories communicating with the ALU. Also, the number of execution units is limited.
If all memory accesses cost the same in terms of latency, then yes, registers would be an implementation detail.
Think of how x86-64's 16 architectural registers are dynamically renamed to more than 120 physical registers.
I would think it's hard to beat the compiler though (when it knows how many GP registers there are). Consider register pressure. If the compiler assumes there are infinite GP registers (because they're hidden now, no longer an implementation detail), it can generate code that spends more time spilling and reloading registers than computing.
The compiler has seen the code before; it knows how many variables it will need to work with and how to schedule available resources. The processor, relatively speaking, is simply executing a stream of instructions. It would have to look pretty far ahead (VLIW) or have sufficient execution units (superscalar) to keep throughput high and not get bogged down spilling.
But if finite resources, like the number of execution units or GPRs, become very high relative to the number of symbolic variables, then I think the register names could be hidden.
But compilers don't know this! They know how many registers there are in the ISA, but not how many there really are in the physical architecture.
Physical architectures map ISA register names into register file locations. It's an abstraction! You really can use a register name more than once at the same time, because the uses will be renamed. Your parent is suggesting that we read and write memory as normal, and just map memory locations into the register file, rather than ISA register names.
Registers may be more of an abstraction than you realise.
(Maybe compilers do know about how large the register rename buffer really is on certain physical chips and take this into account when allocating them. Sounds like something Intel's C compiler might do. I'm not an expert.)
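To make that concrete, here's a toy sketch of the renaming idea (my own illustration, not any real core; freeing physical registers at retirement is left out):

    #include <stdint.h>

    #define ARCH_REGS 16     /* e.g. x86-64's architectural GPRs */
    #define PHYS_REGS 128    /* the physical register file; 120+ on recent cores */

    static uint8_t rename_map[ARCH_REGS];   /* architectural reg -> current physical reg */
    static uint8_t free_list[PHYS_REGS];
    static int free_top;

    static void rename_init(void) {
        for (int a = 0; a < ARCH_REGS; a++)
            rename_map[a] = (uint8_t)a;          /* arch reg a starts out in phys reg a */
        free_top = 0;
        for (int p = ARCH_REGS; p < PHYS_REGS; p++)
            free_list[free_top++] = (uint8_t)p;  /* the rest are free */
    }

    /* A source operand just reads the current mapping. */
    static uint8_t rename_src(int arch_reg) {
        return rename_map[arch_reg];
    }

    /* Each instruction that writes an architectural register gets a fresh physical
       register, so two in-flight writes to the same ISA name never collide. */
    static uint8_t rename_dest(int arch_reg) {
        uint8_t phys = free_list[--free_top];
        rename_map[arch_reg] = phys;
        return phys;
    }

The point being: the ISA register name is just a handle the front end rewrites; whether those handles come from a register field in the instruction or (as suggested above) from memory addresses is a separate question.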
What do you mean by this?
Do you realize register renaming is only done for instruction scheduling purposes, to avoid bubbles in the pipeline?
addl %ebx, %eax        # writes %eax (call the physical register it lands in P1)
movl %eax, foo(%rip)   # reads P1
addl %ecx, %eax        # reads P1 but writes a fresh physical register P2, so it
                       # need not wait for the store above to finish reading %eax
movl %eax, bar(%rip)   # reads P2
Happy to be confirmed or corrected by someone who knows more.
I'm a big fan of the Mill's idea of a specializer: a compiler stage that re-encodes generic code into your particular architecture. It solves this problem, allows for JIT optimization of compiled languages, and is a nice portability layer for virtualizing other architectures.
It also adds a nice layer for malware to hide, but well, we can't have everything.
Only operate on RAM, and let the processor figure out loads, stores, and register utilization as needed. I suspect leaving these decisions to the hardware as opposed to the software is less prone to bugs, obsolescence, and whathaveyou.
The performance advantage isn't clear. It would require optimizations to the instruction encoding and the memory access schemes, because plain 64-bit addresses take more space than register numbers.
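To put rough numbers on that (my own back-of-envelope, assuming absolute 64-bit addresses in the encoding): a three-operand memory-to-memory add would need something like 1 opcode byte + 3 * 8 address bytes = 25 bytes, versus a single 4-byte instruction for a three-register RISC add. So the encoding would have to lean on base-relative offsets, small immediates, or other compact addressing forms to stay competitive.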
Stack machines are the main alternative to register machines. Sure, register machines can emulate stack machines, but I suppose that would incur overhead.
The obvious penalty is the push/pop pair around each function call. As a first optimization pass, skip the pop when the pushed values aren't used - if a register isn't touched, you can get away with this. It's damn fast, because the push can rely on the L1 cache: there's no stall when the processor promises to write that out eventually, and you don't care until you need to do the corresponding pop. I had plans to skip the push too, depending on usage, but didn't quite get that far.
Anyway, the point is, just using registers as the top of the stack can get pretty good results with just a few man-months of effort.
Caching is not a transparent implementation detail of the processor, i.e. compilers and assembly programmers do take caching into account, as far as I can tell from reading HN. Register utilization is a concern as well. Register renaming is probably just an implementation detail; I wouldn't really know.