Hacker News
BOOM v2: an open-source out-of-order RISC-V core (berkeley.edu)
170 points by ingve on Sept 26, 2017 | 93 comments

For those that want to read the original BOOM paper first, I'll save you a bit of time: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-...

Highlights: The version of BOOM in the paper (two years ago) had the same IPC as a Cortex-A15 in half the die area, on a process node two generations older, while using compiler ports which were immature at the time (and still probably have plenty of room to improve). Most impressive of all, it was designed and developed by three people (Chris Celio, David A. Patterson, Krste Asanović), two of whom (David and Krste) were not primarily focused on it.

Hi, author here. Just to point out to others (and to be fair to the A15), IPC is just one part of the final performance equation. =)

Chris, I worked hard to get you on that pedestal, don't go jumping off. ;- )

Fair enough, you didn't reach the same frequencies, but that's what the other 1.4mm² and the process shrink are for.

ARM[v7] maybe does a little bit more per instruction, what with those conditions and 14+ character non-mnemonic mnemonics; but ultimately instruction counts should be pretty close, right?

Update: also probably SIMD[or vectors], breakpoints, more interesting memory management, the handling of bizarre FP corner cases, maybe power management[high frequency dvfs? :- )], and other things go in that additional 1.4mm².

> Chris, I worked hard to get you on that pedestal, don't go jumping off. ;- )


> ARM[v7] maybe does a little bit more per instruction, what with those conditions and 14+ character non-mnemonic mnemonics; but ultimately instruction counts should be pretty close, right?

Great question. I just so happen to have written a tech report on this very topic! https://arxiv.org/abs/1607.02318

Basically, performance should be identical between ARMv8 and RISC-V, given the RISC-V core implements macro-op fusion to combine things like pair loads together.

> Great question. I just so happen to have written a tech report on this very topic! https://arxiv.org/abs/1607.02318 .

> Basically, performance should be identical between ARMv8 and RISC-V, given the RISC-V core implements macro-op fusion to combine things like pair loads together.

Yeah, I drew my conclusions from your papers. :- )

I really should diversify my sources, I bring nothing to this exchange.

Wasn't that comparison totally rigged by omitting a bunch of stuff (e.g. SIMD) from the BOOM core that wasn't used by a particular benchmark?

I was using Coremark in that report, which is what ARM used to market their cores. The 32b ARMv7 cores have NEON SIMD, but I believe no FMAs, whereas BOOM is 64b and includes double-precision FMA units.

Also there's no current RISC-V extension for SIMD or vector ops, so I didn't maliciously "omit" things to "totally rig" the comparison. But even running SPECint is not going to fire up the SIMD unit.

Uhhh yes ARMv7 NEON has vector FMAs.

SPEC with modern compilers will definitely use SIMD; utilization in hot regions is variable, but it's definitely beneficial.

Not all ARMv7 have FMA, only the newer models.

Models with FMA: Cortex-M4, Cortex-M7, Cortex-A5, Cortex-A7, Cortex-A15, Cortex-A17. I do not remember which Cortex-R have FMA. Models without FMA: Cortex-A8, Cortex-A9.

No numbers. They report a 20 percent REDUCTION in IPC in exchange for an undisclosed increase in clock speed (or is that implicit in the FO4?). I assume v3 will make the improvements hinted at to bring IPC back up while maintaining clock speed. Anyway, the original BOOM paper had lots of real data on IPC in comparison to commercial cores and also gave the actual BOOM clock speed. It's not clear to me where BOOM is going from this paper. Perhaps they're holding off until the conference? Maybe they don't actually have silicon to test yet?

I'd like to see their ideas on macro-op fusion implemented in BOOM. That should give a decent IPC increase too.

Hi phkahler! This is just a short workshop report with a focus on the microarchitecture (edit: oh! and the report mentions a 24% reduction in clock period). Unfortunately, some conferences have very strict pre-publication rules that prevent me from disclosing everything that I'd like to talk about.

Regarding benchmarking though, there is another CARRV workshop paper (coming up in October) that will be reporting some hard performance numbers of BOOM. =)

And I'd love to implement macro-op fusion, but I have too many bigger fish to fry right now. Pull requests accepted though. ;)

If your goal is to build anticipation or excitement in advance of the conference you're doing a mighty fine job!

> Unfortunately, some conferences have very strict pre-publication rules.

That's fine. I'm not sure if this is embargoed too, but: when can we start asking detailed questions and/or looking for new papers?

Yah, you can ask questions; I might just have to give evasive answers ;). This tech report itself isn't embargoed, or I wouldn't have posted it. The pre-publication issue is that the scope of what I could cover in this tech report was relatively narrow.

How will it perform relative to other ISAs, given that RISC-V is basically a slight variant of the MIPS --- the other "academic ISA" that probably received as much if not more attention and hype as RISC-V does today? Not long ago, a 4-way OoO MIPS fared poorly against ARM and x86:


It would be interesting to see that comparison performed again on more recent ARM and x86, as well as real RISC-V silicon.

The ISA is going to have close to zero impact on a processor's performance. I actually wrote up another tech report on this topic (https://arxiv.org/abs/1607.02318).

In short, you can rely on macro-op fusion to dynamically fuse some idioms that show up as instructions in other ISAs, like load-pair.

Wouldn't relying on macro op fusion increase i$ pressure?

No. The beauty of RISC-V is it has a compressed extension with 2-byte forms of the most common instructions. The average instruction size is 3.0 bytes on SPECint workloads. That's even better than x86.

So instead of creating a 4-byte "load-pair" instruction (or having to fuse two 4-byte loads), you can fuse two 2-byte loads into a single "load-pair" micro-op. Same performance, same code density, but a cleaner ISA (since not everybody wants the complexity of implementing load-pair).
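A toy sketch of what that fusion pass might look like, purely for illustration (this is not BOOM's actual decoder, and the instruction representation is made up): two adjacent 2-byte c.ld instructions reading consecutive doublewords off the same base register get combined into one "load-pair" micro-op.

```python
# Illustrative macro-op fusion pass, NOT BOOM's real decoder.
# Each instruction is a dict like {"op": "c.ld", "rd": 8, "base": 2, "off": 0}.

def fuse_load_pairs(insns):
    out, i = [], 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        if (b is not None
                and a["op"] == "c.ld" and b["op"] == "c.ld"
                and a["base"] == b["base"]
                and b["off"] == a["off"] + 8        # consecutive doublewords
                and a["rd"] != b["base"]):          # first load must not clobber the base
            # Emit a single fused micro-op covering both loads.
            out.append({"op": "ld.pair", "rd": (a["rd"], b["rd"]),
                        "base": a["base"], "off": a["off"]})
            i += 2
        else:
            out.append(a)
            i += 1
    return out

uops = fuse_load_pairs([
    {"op": "c.ld", "rd": 8, "base": 2, "off": 0},
    {"op": "c.ld", "rd": 9, "base": 2, "off": 8},
])
print(uops)
```

Since each c.ld is 2 bytes, the fused pair occupies 4 bytes of i-cache, the same footprint as a single ARMv8 LDP, which is the code-density point being made above.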

If you want to compare ISAs then RISC-V is also closer to ARM than to MIPS in some regards. For instance MIPS uses delay slots, while ARM and RISC-V do not.

Regarding that extremetech benchmark: how can you not normalize to same node size if you are interested in the ISA?

Shouldn't any ISA at 22nm basically be more power efficient than one at 90nm?

They do --- normalise to 1.5GHz at 45nm. If anything, that only makes those CPUs > 1.5GHz look worse and those < 1.5GHz look better, due to memory subsystem effects (which are nonlinear and dependent on details of the implementation, so I'm pretty sure they did a linear scaling); that fact is well known in the overclocking community, where e.g. a 20% increase in core clock almost always means less than 20% improvement in benchmarks.
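The nonlinearity is easy to see with a back-of-the-envelope two-component model (illustrative numbers only, not from the paper): memory latency is fixed in nanoseconds, so raising the core clock only shrinks the core-bound portion of runtime.

```python
# Toy Amdahl-style model: total time = core-bound time + memory-stall time.
# Only the core-bound fraction scales with clock frequency.

def speedup(core_frac, clock_ratio):
    """core_frac: fraction of baseline runtime that is core-bound.
    clock_ratio: new_clock / old_clock."""
    new_time = core_frac / clock_ratio + (1.0 - core_frac)
    return 1.0 / new_time

# A workload that is 70% core-bound, with a 20% faster clock,
# gains noticeably less than 20%:
print(round(speedup(0.7, 1.2), 3))
```

This is exactly the overclocking observation: unless the workload is almost entirely core-bound, a 20% clock bump buys less than a 20% benchmark improvement.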

The full paper is here: https://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa...

I apologize if this is a silly question but is there the possibility that this chip might see consumer availability outside of do-it-yourself FPGA implementations?

Anything is possible, but it sure looks like everybody is working on microcontrollers and nobody is even trying to ship real cores.

The privileged ISA is still a draft and is supposed to be finished later this year. The privileged ISA is needed to support operating systems. There are projects out there targeting this kind of processor, for instance lowRISC.

It can run TempleOS :)

That's just because microcontrollers are cheaper and easier to produce. It's going to take someone with deep pockets to make a desktop competitive RISC-V processor.

Lowrisc.org is between those two extremes. Nothing shipping yet though.

Nvidia had a presentation at the RISC-V workshop a few years back. I think they are building chips with it, though nothing the regular programmer would see.


Obviously the "given a sufficiently sophisticated compiler" joke applies, but it seems like the out-of-order optimization is computationally challenging. Now that we have more sophisticated programming languages (and more sophisticated versions of old programming languages), shouldn't that be done at compile time as a one-time (power, time) cost instead of a recurring cost across all computing (which will inevitably duplicate effort)? Can someone explain to me why this should be done on metal? (besides that it's always been done like that since the late 90s)

Out-of-order execution isn't only about instruction order, and static scheduling can't give you everything. One execution of a given chunk of code might have something in cache, another execution of that same code might not. One execution might consistently take one branch, another might consistently take another, because it varies based on workload.

Out-of-order execution helps with things like "I don't have the memory fetch done yet, but let's make a reasonable assumption and then speculatively do lots more that doesn't depend on that fetch".

While true in principle, in practice the majority (~88%) of the gains of OoO come not from reactions to dynamic events (e.g. cache misses) but rather from the instruction schedules themselves.

This is relatively recent research, and definitely against the common grain. See Daniel McFarlin's work at CMU (https://dl.acm.org/citation.cfm?id=2499368.2451143)

thanks! Cache miss latency! I forgot about that.

It means

   load [rp1] -> r1
   inc  r1
   load [rp2] -> r2
   inc  r2
may execute in a different order for every execution.
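A minimal sketch of why that snippet reorders (made-up latencies standing in for a cache hit vs. a miss): each "inc" waits only on its own load, so whichever load returns first lets its chain complete first.

```python
# Toy dependence model: completion time of each micro-op is just the
# latency of the load it depends on, plus one cycle for the inc.

def completion_order(lat_rp1, lat_rp2):
    events = [
        ("load r1", lat_rp1),
        ("inc r1",  lat_rp1 + 1),   # depends only on load r1
        ("load r2", lat_rp2),
        ("inc r2",  lat_rp2 + 1),   # depends only on load r2
    ]
    return [name for name, t in sorted(events, key=lambda e: e[1])]

print(completion_order(1, 10))   # [rp1] hits the cache: the r1 chain finishes first
print(completion_order(10, 1))   # [rp1] misses: the r2 chain finishes first
```

Same static code, two different dynamic orders, which is the thing no compile-time schedule can fully anticipate.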

Hi dnautics!

Dynamic scheduling isn't as complicated as it seems, especially now that we have so many transistors on a chip, and single-threaded (Amdahl's law) performance is still very critical for most applications.

However, the value of OoO isn't just about constructing the "best" schedule, it's about being able to change schedules on a dime.

For example, VLIW compilers must try to statically produce the best "trace scheduling" across basic blocks. But OoO, register renaming, and branch prediction allow you to build "trace schedules" on the fly. And branches tend to be far easier to predict dynamically than statically (or rather, branches tend to be more predictable dynamically than they are biased statically).

Unless you rely on a JIT to rebuild traces, and you can tolerate that overhead every time you fire up the JIT, I think OoO/dynamic scheduling will be with us forever.
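To make the dynamic-prediction point concrete, here is a sketch of the classic 2-bit saturating-counter predictor (illustrative only, not BOOM's actual predictor): a static compiler must commit to one direction per branch, while this adapts within a few executions when the branch's bias flips.

```python
# Classic 2-bit saturating counter branch predictor, one counter per branch.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # 0,1 = predict not-taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at 0 and 3 so one anomalous branch doesn't flip the prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
history = [True] * 20 + [False] * 20   # the branch's bias flips mid-run
for taken in history:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "/", len(history))
```

After the bias flips, the predictor is wrong only twice before re-converging, something a single static hint cannot do.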

The Mill architecture team is working on a static-scheduled architecture, but only time will tell if compilers can reasonably get good enough at it to make it viable. One major problem with static schedules is that if you want more IPC, you generally have to change the ISA or have machine-specific loaders which mutate the code at load time (this is the approach the Mill folks take).

Standard ISAs let you write a program with a linear model, and let the hardware get better at scheduling the micro ops over time, or just let it be linear (as is generally preferred on low end microcontrollers and low-power mobile cores).

> One major problem with static schedules is that if you want more IPC, you generally have to change the ISA or have machine-specific loaders which mutate the code at load time (this is the approach the Mill folks take).

This isn't a problem with static schedules, it's a problem with VLIWs (well actually any exposed pipeline machine).

(And really not that much of a problem anyways)

The Mill is certainly interesting, and while they still haven't provided a satisfactory answer to the ILP question, I think they may be successful enough anyways due to their other innovations.

> if compilers can reasonably get good enough at it

Do compilers, plural, have to get good at it, or just the Mill team's own LLVM backend?

What about the optimizing compilers in v8? SpiderMonkey (and friends)? JVM/Hotspot? Beam? the CLR/Roslyn? Maybe most of what compiles only on GCC (or MSVC) today could reasonably be ported to compile on Clang or DragonEgg, but there are a lot of compilers in the world these days.

What we're talking about is not the difficulty of making one sufficiently intelligent compiler, but probably something more like six of them, along with every other compiler anyone ever wants to write again.

I don't know about you, but I'm not about to assume that we'll somehow manage to develop and maintain six of something we've never even developed one of. The Mill is no Itanium, but it is very unusual. They haven't even tried to make a normal assembler, their assembler is a C++ template metaprogram which generates the assembler which you then run to get your program.

The impression I get is that their odd assembler is pretty much there for development expedience while they are hand-writing assembly. I can't imagine that their specializer (the thing that runs at install or run time and converts the generic load module to specialized concrete machine code for the processor you have) will emit assembly as an intermediate form that then goes through a C++ compiler... that would just be crazy. And the weird C++ assembler is definitely for the concrete machine code; the generic form is supposed to be very similar to LLVM IR.

Amongst other things it (a C++ assembler) would be slow as hell and require a huge amount of runtime. Ideally the specializer would be a single binary that would convert a generic load module to a specialized one with no intermediate steps. My guess is that they will generate such a specializer with their current generic specification tools.

You'd be right that the specializer doesn't do the same things as the assembler; and programmable assemblers are the norm, but the crucial difference is that their ISA and ABI make the assembler context-dependent.

Well, their exposed ISA and ABI are probably considered to be genAsm, which is the aforementioned mostly-LLVM-based IR. The concrete machine code will mostly likely only ever be generated by the specializer in the real world. Consider the specializer to be the static software equivalent of the x86-to-uOp decoding core in the hardware of a modern x86. On the one hand it can only make static scheduling decisions, whereas on the other hand it can look at the entire program being translated to make those decisions rather than just the window of instructions the x86 decoder can see during execution.

I really think this is the only path to make a long-term-viable statically-scheduled architecture. Otherwise you'll wind up with the Itanium problem, which is that you eventually need to rewrite/reorder code internally (in hardware) as your architecture evolves, which kind of removes the advantage of having the static scheduling in the first place.

> I really think this is the only path to make a long-term-viable statically-scheduled architecture.

You may be right. I think the challenges involved (especially as OOBC is proposing to solve them) defeat the goal. For systems that have no present or future interest in JIT, it might offer somewhat higher peak performance than whatever commodity computers are on the market; for those with JIT (especially with frequent compilation, like in a web browser) I can see it being prohibitive. I don't care how magical their specializer is, it will cost something to run, and cost a lot to integrate (imagine not being able to know the size or entry point of the basic block until you've specialized it, infuriating!). The way they talk about the specializer, it even gives the impression that they don't intend to share the source code, which will be a whole lot of fun when deciding whether or not you trust it to run in the middle of your application.

Yeah, who knows. I mostly think the Mill is a pile of interesting ideas that just might, if they're very lucky, eventually become a product. I'm certainly not holding my breath though.

Regarding their intentions, they've said repeatedly that they want to sell chips. It'd therefore be pretty stupid IMHO to not open up as much as they can the toolchain to get software to run on their chips, including the specializer. For that matter they should definitely also be releasing the chip specification data (insn/op bit patterns, functional unit slot arrangement, latencies, etc) that's used to create the specializer so people can roll their own if they want.

Then again this is the same world where Intel ME is an ultra-suspicious closed blob and you can't get datasheets on tons of chips to save your life, which makes no sense to me either. So I'm probably not a very good judge of what's reasonable.

Because you take a huge hit in IPC (somewhat less if you make your binaries extremely CPU-specific). Take even something simple as the load instruction: to know how long it will take, you have to know the state of ALL levels of cache (their size, associativity, contents), the TLB, etc. Otherwise it might take 0.3ns or 100ns. Being stuck doing nothing in the meantime (because you are in-order) is a huge bottleneck.

Doing OOO optimizations in the compiler (same applies for VLIW etc.) generally requires the ISA to expose its pipeline to the compiler: this is a complete paradigm shift from current architecture.

Also, special invariants are sometimes needed/helpful to avoid data hazards and simplify data dependencies: e.g. the Mill CPU effectively has no general-purpose registers, but a FILO stack of immutable items (this guarantees SSA form for the data the CPU is operating on).

> Doing OOO optimizations in the compiler (same applies for VLIW etc.) generally requires the ISA to expose its pipeline to the compiler: this is a complete paradigm shift from current architecture.

Not really, exposed pipeline is pretty common in the VLIW world.

VLIW is uncommon outside of DSPs.

The Mill belt is actually (conceptually) a fixed-size FIFO queue; LIFO stacks have been done reasonably often before (see stack machines) but since there's only one stack head they don't support parallel operations particularly well.

There have been CPUs with multiple datastacks e.g. https://bernd-paysan.de/4stack.html is a VLIW ISA with a very dense encoding and "just less ports" to the register file than a classic DSP.

Thanks, that's very interesting. Seems like it'd also be "interesting" to write a compiler for though, especially since you have to manage stack hygiene for four stacks simultaneously. And I'm not sure, but doesn't allowing each functional unit to directly read the top four entries of the other FUs' stacks reintroduce some of the kind of potential hazards that the Mill belt (with its SSA-style usage) tries to avoid?

The Very Long Instruction Word (VLIW) architectures used this philosophy. The most famous of which is Intel's Itanium ISA.

It turns out that it is really hard to create a compiler to effectively take advantage of instruction level parallelism at compile time.

you mean... it's really hard to create a non-x86 compiler that posts good results on {benchmarks designed to make x86 look good} if you're Intel. And/or no one can really be bothered to make their code compile particularly well for a proprietary foreign architecture if the chip already costs an arm and a leg.

it's really hard to create a non-x86 compiler that posts good results on {benchmarks designed to make x86 look good} if you're Intel.

When I used Itanium 2 systems for HPC, Intel's compilers produced the fastest binaries from C and Fortran, at least on the software I used (molecular dynamics and quantum chemistry).

What's meant by "benchmarks designed to make x86 look good"? Itanium was introduced against POWER, Alpha, SPARC, and MIPS chips. It was primarily benchmarked for server/workstation applications, where x86 processors weren't the fastest at the time. In 2003 a "Madison" Itanium 2 was significantly faster on numerical simulation code than a contemporary "Gallatin" x86 Xeon.

It was worse than that. The archives of comp.arch have a lot of detailed discussions on the gory details, including participation by senior designers at Intel among others.

Itanium 2 did finally get reasonable performance, but only by adding back in dynamic scheduling to the microarchitecture.

(This is all in terms of fundamental benchmarks that are not x86 specific).

Let us know when you have written your "sufficiently smart compiler". It is hard; see here [1].

[1] https://en.wikipedia.org/wiki/Itanium

yeah I acknowledged that joke, but if you don't think that, for example, say LLVM is light-years ahead of GCC in implementing plugins that let you do various forms of optimization, you're crazy. It stands to reason that further architectural developments will make things even better than what we have now.

Notwithstanding the fact that your answer is totally unhelpful: I thought about it for a second, and I think the answer is basically that if you have a multithreaded core and you're waiting on the results of another operation, then you can't necessarily know ahead of time what optimizations you can do, because you don't know what other threads are resident on your core, which will affect the on-the-fly reorganization.

Other people have pointed out that Itanium is the most recent example of a project that expected to be able to write a really smart compiler for an in-order CPU and failed.

If you can write better LLVM plugins then I'm sure Intel and HP will be happy to throw lots of money in your direction.

The field is littered with the tombstones of people that have tried to make compilers for [insert any foreign architecture]. Obviously in-order CPUs are a challenge on top of that. I do think there is a real 'go-to-market' problem. If you think about it, this is kind of solved by Nvidia (with CUDA): GPUs are in-order computational arrays with a custom compiler.

GPUs are vector processors under another name, that hide the latency of memory differently. They don't need to use the sort of compiler tricks you would with CPU code.

Yeah, but GPUs target only a small subset of the workloads and they are very hard to program even then. I prefer processors like the Rex Neo or Adapteva Epiphany in that respect actually.

have you tried programming rex neo?

Nope, I know you have experience with it though, so perhaps you can help there ;)

I've toyed around with the older Epiphanies though, and they were definitely easier than GPUs in many respects.

Also, as I understand it, GPUs are now out of order as well, aren’t they?

No, they are not.

OoO is much more expensive to implement than simply scheduling a different warp to hide latency.
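A toy utilization model of that warp-switching approach (made-up parameters, nothing like a real GPU scheduler's specification): when one warp stalls on memory, the scheduler simply issues a different ready warp instead of reordering any instructions.

```python
# Toy model of latency hiding by warp switching. Each warp alternates one
# compute cycle with a memory access that stalls for `stall_cycles`.

def run(warps, stall_cycles):
    """Return the execute unit's utilization over one stall window:
    during a stall of `stall_cycles` cycles, up to that many other warps
    can each issue their single compute cycle."""
    busy = min(warps, stall_cycles + 1)
    return busy / (stall_cycles + 1)

print(run(warps=4, stall_cycles=9))    # too few warps to cover a 9-cycle stall
print(run(warps=16, stall_cycles=9))   # enough warps: the unit stays fully busy
```

No reorder buffer, no renaming, just more threads, which is why it is so much cheaper in hardware than OoO.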

Absolutely not.

Maybe you're confusing the GPU's threading techniques for OoO?

Can I ask a very dumb question - is there a (simple) roadmap to some kind of open hardware nirvana - ie general purpose chip design that can be created on fabs 2 generations old (I think that's one of the features here)

Things like

- we still need to do X to this design
- we still don't have Y

I would be interested in how close or far we are from being able to drop things like Intel ME, and other nice-to-haves.

Interesting question! I think the FOSSi community (behind LibreCores and ORConf) are probably the most aware of what that road map is and helping us get there.

One of the issues is that cores are fun to build, but are only one piece of the puzzle (guilty!). So there are lots of FPGA-targeting softcores, but fewer cache-coherent, high-performance, multi-level cache systems, for example. I'm hoping that SiFive's TileLink protocol, a high-performance free and open bus standard, helps shore up the ecosystem. Some other gaps are devices, drivers, graphics, crypto-engines, more open-source testing/verification infrastructure... it's an exciting time for OSHW!

The multi-ported register file is said to be one of the big challenges. I wonder if it would be useful to split it into two or 3 identical copies (like the guy doing the 74xx TTL version). This would allow the different copies to be located closer to execution units, but would require the write data to travel further. So many tradeoffs...

Sadly no. You would still have to broadcast the 3 write ports to each copy, so "being closer" doesn't actually work out.

I think his point is that reading is faster, and the savings from the quadratic cost of the register file might still be useful, a la the DEC Alpha.

But each copy would only have 2 read ports instead of 6, and those 2 would feed directly into a single execution unit. The routing within each register file would be simplified. It's a local vs global routing tradeoff. I don't know if it would be better, but my suspicion is that it might.

Some Alpha AXP cores used a split register file and the register renaming logic took care of the synchronisation.

How does this compare to rocketchip and other RISC-V cores? Which is easiest to get started on with an FPGA? From my looking, rocketchip seems like it is seeing the most active development in open source, but I would love to see a comparison of all open RISC-V cores.

BOOM is a core that fits into the rocketchip SoC ecosystem. Or said another way, BOOM wears rocketchip's skin. :D [i.e., caches, memory system, uncore, test harnesses...]

Default rocketchip uses the Rocket core, which is a single-issue in-order core. If things are going well, a two-wide BOOM is 50% to 100% faster than Rocket [as measured using SPECint and Coremark].

If you want to get started, I'd recommend starting with rocketchip/sifive dev boards (they provide ready-to-go FPGA images). Rocketchip has a company supporting it, it's open-source, and it will boot Linux (if you use the RV64G version). But it's a very complex code base, so if you want to hack the RTL, it will be a steep learning curve.

If you just want to play with smaller, easier to grok cores, you can find some ready-to-go FPGA cores like picorv32. That's targeting high-frequency FPGA softcore applications and is a multi-cycle design (trading off IPC for higher frequencies).

Thanks for the info. I will have to start with the picorv32 (RV32IM) or Rocket Chip itself and see what will fit on my Papilio board. If I wanted to hack on RTL and a toolchain, is the GCC port or LLVM port easier to start with? There seem to be a couple of LLVM ports... I will go with LLVM-RISCV with RocketChip.

Also, any plans of an async RISC-V implementation?

The gcc port is mature and has been upstreamed. The LLVM port is still a work-in-progress.

I know there's been at least one talk presenting some thesis work on a clock-less RISC-V design at an early RISC-V workshop, but I'm not aware of any in-progress implementations.

So is anyone designing RISC-V cores without using Chisel? :)

Yes, I am about 2/3rds of the way through designing and building a single cycle, modified Harvard architecture, RV32E CPU. You can check it out here...


Although a projected speed of 1 MHz and the size might exclude it from your own project!

I meant more like... just using straight Verilog, but very impressive!

I'm not sure I would recommend using straight Verilog for something of that magnitude. I do know of some folks that used Perl to assemble Verilog into a computer chip, but they regretted the engineering debt that came with it (and the hire they made who instituted that decision).

> Perl to assemble verilog into a computer chip

This is absurdly common.

Yeah, there are a few out there. At VectorBlox we have ORCA, which we use as a cross-platform replacement for Nios/MicroBlaze.

I don't know of anyone who has built anything as sophisticated as BOOM though.

Nvidia said they're not using Chisel for the RISC-V cores they plan to embed in their GPUs.

see https://www.youtube.com/watch?v=g6Z_5l69keI&t=545

There is PicoRV32 [1].

[1] https://github.com/cliffordwolf/picorv32

Yep and there's https://github.com/SpinalHDL/VexRiscv

But admittedly, SpinalHDL is basically Chisel.

There is the PULP project (http://www.pulp-platform.org/) by ETH Zurich and University of Bologna with different open-source RISCV cores written in (System) Verilog.

Yes, see Syntacore's SCR1, Clifford Wolf's PicoRV32, etc.

Granted these are smaller, microcontroller-class cores, but still in pure Verilog. I'm sure there are others out there with more/less performance

Here's a collection of RISC-V cores in Bluespec: https://github.com/csail-csg/riscy

Plenty, though not as many open-source ones.

