
BOOM v2: an open-source out-of-order RISC-V core - ingve
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-157.html
======
openasocket
For those that want to read the original BOOM paper first, I'll save you a bit
of time:
[https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-167.html](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-167.html)

~~~
microcolonel
Highlights: The version of BOOM in the paper (two years ago) had the same IPC
as a Cortex-A15 in _half the die area_ on a process node _two generations
older_, while using compiler ports which were immature at the time (and still
probably have plenty of room to improve). Most impressive of all, it was
designed and developed by three people (Chris Celio, David A. Patterson, Krste
Asanović), two of whom (David and Krste) were not primarily focused on it.

~~~
_chris_
Hi, author here. Just to point out to others (and to be fair to the A15), IPC
is just one part of the final performance equation. =)

~~~
microcolonel
Chris, I worked hard to get you on that pedestal, don't go jumping off. ;-)

Fair enough, you didn't reach the same frequencies, but that's what the other
1.4mm² and the process shrink are for.

ARM[v7] maybe does a little bit more per instruction, what with those
conditions and 14+ character non-mnemonic mnemonics; but ultimately
instruction counts should be pretty close, right?

Update: also probably SIMD[or vectors], breakpoints, more interesting memory
management, the handling of bizarre FP corner cases, maybe power
management[high-frequency DVFS? :-)], and other things go in that additional
1.4mm².

~~~
_chris_
> _Chris, I worked hard to get you on that pedestal, don't go jumping off.
> ;-)_

O:-)

> _ARM[v7] maybe does a little bit more per instruction, what with those
> conditions and 14+ character non-mnemonic mnemonics; but ultimately
> instruction counts should be pretty close, right?_

Great question. I just so happen to have written a tech report on this very
topic! [https://arxiv.org/abs/1607.02318](https://arxiv.org/abs/1607.02318)

Basically, performance should be identical between ARMv8 and RISC-V, given the
RISC-V core implements macro-op fusion to combine things like pair loads
together.
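The load-pair fusion idea can be sketched as a decode-stage peephole. This is a minimal, illustrative sketch (made-up instruction tuples and fusion rule, not BOOM's actual decode logic):

```python
# Sketch of a macro-op fusion peephole: scan a decoded instruction
# stream and fuse adjacent loads from consecutive addresses into one
# "load-pair" micro-op. The tuple format and fusion condition are
# hypothetical, for illustration only.

def fuse_load_pairs(insns):
    """insns: list of (op, dest_reg, base_reg, offset) tuples."""
    fused = []
    i = 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        # Fuse two loads off the same base at consecutive 8-byte offsets.
        if (b and a[0] == "ld" and b[0] == "ld"
                and a[2] == b[2] and b[3] == a[3] + 8):
            fused.append(("ld_pair", (a[1], b[1]), a[2], a[3]))
            i += 2
        else:
            fused.append(a)
            i += 1
    return fused

stream = [("ld", "x10", "x2", 0), ("ld", "x11", "x2", 8),
          ("add", "x12", "x10", 0)]
print(fuse_load_pairs(stream))  # two loads become one ld_pair micro-op
```

The same two instructions flow through the back end as a single micro-op, which is how the fused core matches a dedicated load-pair instruction without adding one to the ISA.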

~~~
microcolonel
> _Great question. I just so happen to have written a tech report on this very
> topic! [https://arxiv.org/abs/1607.02318](https://arxiv.org/abs/1607.02318)_

> _Basically, performance should be identical between ARMv8 and RISC-V, given
> the RISC-V core implements macro-op fusion to combine things like pair loads
> together._

Yeah, I drew my conclusions from your papers. :-)

I really should diversify my sources, I bring nothing to this exchange.

------
phkahler
No numbers. They report a 20 percent REDUCTION in IPC in exchange for an
undisclosed increase in clock speed (or is that implicit in the FO4?). I
assume v3 will make the improvements hinted at to bring IPC back up while
maintaining clock speed. Anyway the original BOOM paper had lots of real data
on IPC in comparison to commercial cores and also gave the actual BOOM clock
speed. It's not clear to me where BOOM is going from this paper. Perhaps
they're holding off until the conference? Maybe they don't actually have
silicon to test yet?

I'd like to see their ideas on macro-op fusion implemented in BOOM. That
should give a decent IPC increase too.

~~~
_chris_
Hi phkahler! This is just a short workshop report with a focus on the
microarchitecture (edit: oh! and the report mentions a 24% reduction in clock
period). Unfortunately, some conferences have very strict pre-publication
rules that prevent me from disclosing everything that I'd like to talk about.

Regarding benchmarking though, there is _another_ CARRV workshop paper (coming
up in October) that will be reporting some hard performance numbers of BOOM.
=)

And I'd love to implement macro-op fusion, but I have too many bigger fish to
fry right now. Pull requests accepted though. ;)
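Putting the two figures quoted in this thread together (the 20% IPC reduction and the 24% clock-period reduction), the net effect is simple perf = IPC × frequency arithmetic:

```python
# Perf = IPC * frequency. Trading a 20% IPC drop for a 24% shorter
# clock period (frequency scales as 1/period) still nets a speedup.
# The two ratios are the ones quoted upthread; the rest is arithmetic.

ipc_ratio = 0.80       # v2 IPC relative to v1 (20% reduction)
period_ratio = 0.76    # v2 clock period relative to v1 (24% reduction)

speedup = ipc_ratio / period_ratio  # relative performance, v2 vs v1
print(round(speedup, 3))  # ~1.053: about 5% faster despite lower IPC
```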

~~~
exikyut
> _Unfortunately, some conferences have very strict pre-publication rules_.

That's fine. I'm not sure if this is embargoed too, but: when can we start
asking detailed questions and/or looking for new papers?

~~~
_chris_
Yah, you can ask questions, I might just have to give evasive answers ;). This
tech report itself isn't embargoed or I wouldn't have posted it. The pre-
publication issue is that the scope of what I could cover in this tech report
was relatively narrow.

------
userbinator
How will it perform relative to other ISAs, given that RISC-V is basically a
slight variant of MIPS --- the other "academic ISA", which in its day received
as much attention and hype as RISC-V does today, if not more? Not long ago, a
4-way OoO MIPS fared poorly against ARM and x86:

[https://www.extremetech.com/extreme/188396-the-final-isa-showdown-is-arm-x86-or-mips-intrinsically-more-power-efficient](https://www.extremetech.com/extreme/188396-the-final-isa-showdown-is-arm-x86-or-mips-intrinsically-more-power-efficient)

It would be interesting to see that comparison performed again on more recent
ARM and x86, as well as real RISC-V silicon.

~~~
_chris_
The ISA is going to have close to zero impact on a processor's performance. I
actually wrote up another tech report on this topic
([https://arxiv.org/abs/1607.02318](https://arxiv.org/abs/1607.02318)).

In short, you can rely on macro-op fusion to dynamically fuse some idioms that
show up as instructions in other ISAs, like load-pair.

~~~
monocasa
Wouldn't relying on macro op fusion increase i$ pressure?

~~~
_chris_
No. The beauty of RISC-V is that it has a compressed extension with 2-byte
forms of the most common instructions. The average instruction size is 3.0
bytes on SPECint workloads. That's even better than x86.

So instead of creating a 4-byte "load-pair" instruction (or having to fuse two
4-byte loads), you can fuse two 2-byte loads into a single "load-pair" micro-
op. Same performance, same code density, but a cleaner ISA (since not
everybody wants the complexity of implementing load-pair).
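A quick back-of-the-envelope check of the density claim (the 50/50 instruction mix below is illustrative, not the SPECint measurement):

```python
# Code-density arithmetic for RVC: with 2-byte compressed and 4-byte
# standard instructions, the average size depends on the mix.
# A 50/50 mix yields the 3.0-byte average; the mix itself is made up.

def avg_insn_bytes(frac_compressed):
    return 2 * frac_compressed + 4 * (1 - frac_compressed)

print(avg_insn_bytes(0.5))  # 3.0

# Two fused 2-byte loads occupy 4 bytes of I-cache, the same as one
# dedicated load-pair instruction would, so fusion adds no I$ pressure.
print(2 + 2)  # 4
```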

------
bogomipz
I apologize if this is a silly question but is there the possibility that this
chip might see consumer availability outside of do-it-yourself FPGA
implementations?

~~~
wmf
Anything is possible, but it sure looks like everybody is working on
microcontrollers and nobody is even trying to ship real cores.

~~~
legulere
The privileged ISA is still a draft and is supposed to be finalized later this
year. The privileged ISA is needed to support operating systems. There are
projects out there targeting this kind of processor, for instance lowRISC.

~~~
senatorobama
It can run TempleOS :)

------
dnautics
Obviously the "given a sufficiently sophisticated compiler" joke applies, but
it seems like the out-of-order optimization is computationally challenging.
Now that we have more sophisticated programming languages (and more
sophisticated versions of old programming languages) shouldn't that be done at
compile time as a one-time (power, time) cost instead of a recurrent cost
across all computing (which will inevitably duplicate effort)? Can someone
explain to me why this should be done on metal? (besides that it's always been
done like that since the late 90s)

~~~
microcolonel
The _Mill architecture_ team is working on a statically scheduled
architecture, but only time will tell if compilers can reasonably get good
enough at it to make it viable. One major problem with static schedules is
that if you want more IPC, you generally have to change the ISA or have
machine-specific loaders which mutate the code at load time (this is the
approach the Mill folks take).

Standard ISAs let you write a program with a linear model, and let the
hardware get better at scheduling the micro ops over time, or just let it be
linear (as is generally preferred on low end microcontrollers and low-power
mobile cores).

~~~
mwcampbell
> if compilers can reasonably get good enough at it

Do compilers, plural, have to get good at it, or just the Mill team's own LLVM
backend?

~~~
microcolonel
What about the optimizing compilers in V8? SpiderMonkey (and friends)?
JVM/HotSpot? BEAM? The CLR/Roslyn? I think maybe most of what compiles only on
GCC (or MSVC) today could reasonably be ported to compile on Clang or
DragonEgg, but there are a lot of compilers in the world these days.

What we're talking about is not the difficulty of making _one_ sufficiently
intelligent compiler, but probably something more like _six_ of them, along
with every other compiler anyone ever wants to write again.

I don't know about you, but I'm not about to assume that we'll somehow manage
to develop and maintain _six_ of something we've never even developed _one_
of. The Mill is no Itanium, but it is _very_ unusual. They haven't even tried
to make a normal assembler: their assembler is a C++ template metaprogram
which generates the assembler, which you then run to get your program.

~~~
phs2501
The impression I get is that their odd assembler is pretty much there for
development expedience while they are hand-writing assembly. I can't imagine
that their specializer (the thing that runs at install or run time to convert
the generic load module into specialized concrete machine code for the
processor you have) will emit assembly as an IR that then goes through a C++
compiler... that would just be crazy. And the weird C++ assembler is
definitely for the concrete machine code; the generic form is supposed to be
very similar to LLVM IR.

Amongst other things it (a C++ assembler) would be slow as hell and require a
huge amount of runtime. Ideally the specializer would be a single binary that
would convert a generic load module to a specialized one with no intermediate
steps. My guess is that they will generate such a specializer with their
current generic specification tools.

~~~
microcolonel
You'd be right that the specializer doesn't do the same things as the
assembler; and programmable assemblers are the norm, but the crucial
difference is that their ISA and ABI make the assembler context-dependent.

~~~
phs2501
Well, their exposed ISA and ABI are probably considered to be genAsm, which is
the aforementioned mostly-LLVM-based IR. The concrete machine code will most
likely only ever be generated by the specializer in the real world. Consider
the specializer to be the static software equivalent of the x86-to-uOp
decoding core in the hardware of a modern x86. On the one hand it can only
make static scheduling decisions, whereas on the other hand it can look at the
entire program being translated to make those decisions rather than just the
window of instructions the x86 decoder can see during execution.

I really think this is the only path to make a long-term-viable statically
scheduled architecture. Otherwise you'll wind up with the Itanium problem,
which is that you eventually need to rewrite/reorder code internally (in
hardware) as your architecture evolves, which kind of removes the advantage of
having the static scheduling in the first place.

~~~
microcolonel
> _I really think this is the only path to make a long-term-viable statically
> scheduled architecture._

You may be right. I think the challenges involved (especially as OOBC is
proposing to solve them) defeat the goal. For systems that have no present or
future interest in JIT, it might offer somewhat higher peak performance than
whatever commodity computers are on the market; for those with JIT (especially
with frequent compilation, like in a web browser) I can see it being
prohibitive. I don't care how magical their specializer is, it will cost
something to run, and cost a lot to integrate (imagine not being able to know
the size or entry point of the basic block until you've specialized it,
infuriating!). The way they talk about the specializer, it even gives the
impression that they don't intend to share the source code, which will be a
whole lot of fun when deciding whether or not you trust it to run in the
middle of your application.

~~~
phs2501
Yeah, who knows. I mostly think the Mill is a pile of interesting ideas that
just might, if they're very lucky, eventually become a product. I'm certainly
not holding my breath though.

Regarding their intentions, they've said repeatedly that they want to sell
chips. It'd therefore be pretty stupid IMHO not to open up the toolchain for
getting software to run on their chips as much as they can, including the
specializer. For that matter, they should definitely also release the chip
specification data (insn/op bit patterns, functional-unit slot arrangement,
latencies, etc.) that's used to create the specializer, so people can roll
their own if they want.

Then again this is the same world where Intel ME is an ultra-suspicious closed
blob and you can't get datasheets on tons of chips to save your life, which
makes no sense to me either. So I'm probably not a very good judge of what's
reasonable.

------
lifeisstillgood
Can I ask a very dumb question - is there a (simple) roadmap to some kind of
open hardware nirvana, i.e. a general-purpose chip design that can be created
on fabs two generations old? (I think that's one of the features here)

Things like

\- we still need to do X to this design

\- we still don't have Y

I would be interested in how close or far we are from being able to drop
things like the Intel ME, and other nice things to have.

~~~
_chris_
Interesting question! I think the FOSSi community (behind LibreCores and
ORConf) is probably the most aware of what that roadmap is and how to help us
get there.

One of the issues is that cores are fun to build, but are only one piece of
the puzzle (guilty!). So there are lots of FPGA-targeting softcores, but fewer
cache-coherent, high-performance, multi-level cache systems, for example. I'm
hoping that SiFive's TileLink protocol, a high-performance free and open bus
standard, helps shore up the ecosystem. Some other gaps are devices, drivers,
graphics, crypto-engines, more open-source testing/verification
infrastructure... it's an exciting time for OSHW!

------
phkahler
The multi-ported register file is said to be one of the big challenges. I
wonder if it would be useful to split it into two or three identical copies
(like the guy doing the 74xx TTL version). This would allow the different
copies to be located closer to the execution units, but would require the
write data to travel further. So many tradeoffs...

~~~
_chris_
Sadly no. You would still have to broadcast the 3 write ports to each copy, so
"being closer" doesn't actually work out.

~~~
deepnotderp
I think his point is that reading is faster, and the savings from the
quadratic cost of the register file might still be useful, a la the DEC Alpha.
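The quadratic cost mentioned here comes from a common first-order model (an assumption for illustration, not a number from the paper): register-file bitcell area grows roughly with wordlines × bitlines, i.e. with the square of the port count. A sketch under that assumed model:

```python
# First-order register-file area model: per-copy area grows roughly as
# (read_ports + write_ports)^2. Splitting into copies divides the read
# ports each copy serves, but every copy still needs ALL write ports
# (the broadcast Chris mentions). Model and port counts are assumptions.

def rf_relative_area(read_ports, write_ports, copies=1):
    reads_per_copy = read_ports / copies
    return copies * (reads_per_copy + write_ports) ** 2

unified = rf_relative_area(6, 3)          # one 6R3W file: (6+3)^2 = 81
split = rf_relative_area(6, 3, copies=2)  # two 3R3W copies: 2*(3+3)^2 = 72
print(unified, split)
```

Under this model the split file is somewhat smaller in total despite the duplication, which is the Alpha-style tradeoff being alluded to; the write-port broadcast is what limits the win.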

------
ece
How does this compare to rocketchip and other RISC-V cores? Which is easiest
to get started on with an FPGA? From my looking, rocketchip seems like it is
seeing the most active development in open source, but I would love to see a
comparison of all open RISC-V cores.

~~~
_chris_
BOOM is a core that fits into the rocketchip SoC ecosystem. Or said another
way, BOOM wears rocketchip's skin. :D [i.e., caches, memory system, uncore,
test harnesses...]

Default rocketchip uses the Rocket core, which is a single-issue in-order
core. If things are going well, a two-wide BOOM is 50% to 100% faster than
Rocket [as measured using SPECint and Coremark].

If you want to get started, I'd recommend starting with rocketchip/sifive dev
boards (they provide ready-to-go FPGA images). Rocketchip has a company
supporting it, it's open-source, and it will boot Linux (if you use the RV64G
version). But it's a very complex code base, so if you want to hack the RTL,
it will be a steep learning curve.

If you just want to play with smaller, easier to grok cores, you can find some
ready-to-go FPGA cores like picorv32. That's targeting high-frequency FPGA
softcore applications and is a multi-cycle design (trading off IPC for higher
frequencies).

~~~
ece
Thanks for the info. I will have to start with the picorv32 (RV32IM) or Rocket
Chip itself and see what will fit on my Papilio board. If I wanted to hack on
RTL and a toolchain, is the GCC port or LLVM port easier to start with? There
seem to be a couple of LLVM ports... I will go with LLVM-RISCV with
RocketChip.

Also, any plans of an async RISC-V implementation?

~~~
_chris_
The gcc port is mature and has been upstreamed. The LLVM port is still a work-
in-progress.

I know there's been at least one talk presenting some thesis work on a
clockless RISC-V design at an early RISC-V workshop, but I'm not aware of any
in-progress implementations.

------
gok
So is anyone designing RISC-V cores _without_ using Chisel? :)

~~~
PhilWright
Yes, I am about two-thirds of the way through designing and building a single-
cycle, modified-Harvard-architecture, RV32E CPU. You can check it out here...

[https://hackaday.io/project/18491-worlds-first-32-bit-homebrew-cpu](https://hackaday.io/project/18491-worlds-first-32-bit-homebrew-cpu)

Although a projected speed of 1 MHz and the size might exclude it from your
own project!

~~~
gok
I meant more like... just using straight Verilog, but very impressive!

~~~
dnautics
I'm not sure I would recommend using straight Verilog for something of that
magnitude. I do know of some folks who used Perl to assemble Verilog into a
computer chip, but they regretted the engineering debt that came with it (and
the hire they made who instituted that decision).

~~~
senatorobama
> Perl to assemble verilog into a computer chip

This is absurdly common.

