
Boom v2: Open-source out-of-order RISC-V core [video] - nickik
https://www.youtube.com/watch?v=toc2GxL4RyA
======
nickik
Many people predict RISC-V will mostly be used for IoT, but I hope they really
push forward high-performance implementations. BOOM will get industrial
support from Esperanto to push it to be a very high single-thread-performance
core.

We are really moving in a direction where it is possible to have an open
software/hardware stack running on your everyday working devices.

There will still be some third-party IP and such, but having an open-source
core (Rocket, BOOM and others), SoC (lowRISC), firmware (coreboot/heads),
kernel (Linux) and userland (GNU, Android) will be fantastic.

That third-party IP can be replaced step by step with open-source efforts from
the community, universities, government funding and businesses that want free
IP.

I really hope we see Raspberry Pi-style computers soon, and hopefully
developer laptops in the next 1-2 years. Purism would maybe do something like
that.

~~~
userbinator
_Many people predict RISC-V to be mostly for IoT but I hope they will really
push forward high performance implementation._

The reason for that prediction is simple: it's what happened to MIPS, which in
its heyday got as much if not more hype than RISC-V. Advocates thought it
would be _the_ "architecture of the future" for everything from tiny embedded
systems to high-performance supercomputers. Now, its popularity is only in the
former, because it turns out that getting high performance from an ISA with
only simple and relatively large instructions is actually rather difficult.
You can add SIMD and other extensions to look competitive in specific
benchmarks, but general-purpose code will still be behind compared to
something like x86 or even ARM.

~~~
rwmj
The "official" response is that everything will be fine because the simpler
instructions will be fused together automatically by high-performance CPUs.
(The technique is called macro-op fusion). In fact there was a talk last year
by Chris Celio (same person as in this video) on the subject:

[https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fusion-finalV2.pdf](https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fusion-finalV2.pdf)

Also note that RISC-V has a standardized compressed instruction extension
(RVC) which is expected to be present on just about any high-performance
64-bit chip, and that makes the case for macro-op fusion more persuasive,
since n x 16-bit instructions might be fused and still take the same space as
a specialized CISC instruction.

I guess we won't really know until the hardware exists though.
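To make the idea concrete, here is a toy model (in Python; the tuple format and "fused" op names are made up for illustration, and real decoders are nothing this simple) of what a fusing decoder does: it looks only at adjacent instruction pairs, and when a pair matches a known idiom — pairs like shift-then-add and `lui`+`addi` are among those discussed in Celio's slides — it emits a single internal macro-op. Note the fused pair still occupies both encodings' worth of fetch bytes, which is the point about two 16-bit compressed instructions matching one CISC instruction's footprint.

```python
# Toy model of macro-op fusion: only *adjacent* instruction pairs are
# examined, and a matched pair becomes one internal macro-op.
# Instruction names follow RISC-V; everything else is hypothetical.

# Idioms a fusing decoder might recognize (opcode pairs).
FUSIBLE = {
    ("slli", "add"): "fused.shadd",  # shift-then-add address computation
    ("lui", "addi"): "fused.li32",   # build a 32-bit constant
    ("add", "ld"): "fused.ldx",      # indexed load
}

def fuse(instrs):
    """instrs: list of (opcode, size_in_bytes). Returns the macro-op
    stream; a fused pair keeps the combined encoding size, since both
    instructions were still fetched."""
    out, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs):
            pair = (instrs[i][0], instrs[i + 1][0])
            if pair in FUSIBLE:
                out.append((FUSIBLE[pair], instrs[i][1] + instrs[i + 1][1]))
                i += 2
                continue
        out.append(instrs[i])
        i += 1
    return out

# Two 2-byte (compressed) instructions fuse into one macro-op that
# occupies 4 bytes of fetch -- comparable to one CISC instruction.
print(fuse([("slli", 2), ("add", 2), ("ld", 4)]))
```

Because only neighboring instructions are considered, an unrelated instruction between the two halves of an idiom defeats the fusion — which is why compilers that target fusing cores keep such pairs adjacent.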

~~~
userbinator
_because the simpler instructions will be fused together automatically by
high-performance CPUs_

...but then the longer instruction sequences will still take up cache space
and fetch bandwidth, with the effects becoming even more significant with
multiple cores.

 _since n x 16 bit instructions might be fused and still take the same space
as a specialized CISC instruction._

As the saying goes, "In theory, there's no difference between theory and
practice. In practice, there is." It's far easier to split a CISC instruction
into uops, than to attempt to detect the possibly near-infinite combinations
of RISC instructions that effectively do the same thing so that they can be
fused together. It's like the CPU has to almost decompile the program, at
runtime, to figure out what it's doing and how to do it more efficiently on
the hardware...

Consider a round of AES, for example. Or, for something simpler, a memory copy
loop. On x86, the latter is 2 bytes (and has been since the 8086), and the
former can be as small as 6 bytes on recent CPUs with the AES-NI extensions.
Maybe RISC-V can do the former in 4, but how often are AES operations
performed vs. bulk memory copying, and can RISC-V do the latter in 2? Those
are the sorts of things which benchmarks often don't show very well --- I have
no doubt RISC-V and ARM can (slightly) beat x86 in code density for vector
operations and such specialised things, but in practice everything else
matters too --- sequential, general-purpose, perhaps somewhat branchy code.

I lived through the first RISC hype. This feels like a repeat.

~~~
jabl
> It's far easier to split a CISC instruction into uops, than to attempt to
> detect the possibly near-infinite combinations of RISC instructions that
> effectively do the same thing so that they can be fused together.

Er, it's not like macro-op fusion is some theoretical thing that hasn't been
proven in practice. x86 CPUs have done it for quite a while for CMP+JMP, and
compilers optimize for it by not putting other instructions between the CMP
and the JMP. That is, macro fusion doesn't scan through some huge window of
instructions; it only works on neighboring instructions.

If you want to know more, I recommend reading through the slides that the
parent poster linked, and the article link I posted in a sibling comment.

> Consider a round of AES, for example.

FWIW, there is ongoing work to add a crypto extension to RISC-V, so this isn't
really an argument for or against RISC-V per se.

> Or for something simpler, a memory copy loop. On x86, the latter is 2 bytes
> (and has been since the 8086)

REP MOV* is a good example of what's wrong with CISC. It gets you good
instruction density, I grant you that, but it's also slower than alternative
implementations using, say, unrolled loops and vector instructions. Look into
a modern memcpy() implementation and prepare to be horrified. If it were that
easy to make REP MOV fast, one would have thought that Intel, with their
near-unlimited budget, would have made it so.
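A rough sketch of why those memcpy() implementations balloon (pure-Python model, with the chunk width standing in for a vector register; real libc versions add alignment fixup, non-temporal stores for huge copies, and `rep movsb` paths on top of this):

```python
# Minimal model of a "vectorized" memcpy: dispatch on size, copy the
# bulk in wide chunks, and fall back to a byte loop for the tail and
# for tiny copies. Each such special case adds code -- hence the size
# of real implementations.

CHUNK = 16  # stands in for a 128-bit vector register

def memcpy(dst, src, n):
    """Copy n bytes from src to dst (bytearrays), chunked like a
    vectorized memcpy."""
    i = 0
    if n >= CHUNK:
        # main loop: one wide "vector" move per CHUNK bytes
        while i + CHUNK <= n:
            dst[i:i + CHUNK] = src[i:i + CHUNK]
            i += CHUNK
    # tail (and the whole copy, for small n): byte at a time
    while i < n:
        dst[i] = src[i]
        i += 1
    return dst
```

A single REP MOVSB gets you the equivalent of all of this in 2 bytes of code, which is exactly the density-vs-speed trade-off being argued about here.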

~~~
userbinator
_Look into a modern memcpy() implementation and prepare to be horrified. If it
were that easy to make REP MOV fast, one would have thought that Intel with
their near-unlimited budget would have made it so._

Look up "enhanced REP MOVS" (Ivy Bridge), and before that, "fast strings"
(P6). It can copy entire cachelines at once and easily beats anything else
except in tiny microbenchmarks where the negative effects of huge unrolled
loops don't appear. In general, the historical performance of the string
instructions is interesting and worth researching more...

~~~
jabl
Yes, and for Ice Lake they're introducing the yes-this-time-we-really-mean-it-
fast-rep-mov, er, "fast short rep movs". Time will tell whether it beats a
software implementation.

But my point still stands: if you're developing a new ISA, does it really make
sense to add dedicated string-copy instructions, using up opcode space, adding
die area for microcode ROMs, design and verification costs, etc., considering
that even Intel with all their money can't make them an obvious win compared
to a software implementation?

But yes, for both x86 and RISC architectures, it's probably unwise to optimize
memcpy() implementations by staring at microbenchmarks that do nothing but
memcpy() of different sizes, thus ignoring the I$ pollution that a monster
implementation causes.

------
robert_foss
This video goes into the depths of CPU design, but is quite easy to follow.

Highly recommended!

------
us589
I don't know anything about chip design. How far away is this chip from
competing with an Intel Core i7 or AMD Ryzen 7?

Will it require hundreds of people and a budget in the hundreds of millions of
dollars? Or can small groups of researchers develop BOOM further until it can
compete in the desktop market?

