Fair enough, you didn't reach the same frequencies, but that's what the other 1.4mm² and the process shrink are for.
ARM[v7] maybe does a little bit more per instruction, what with those conditions and 14+ character non-mnemonic mnemonics; but ultimately instruction counts should be pretty close, right?
Update: also probably SIMD [or vectors], breakpoints, more interesting memory management, the handling of bizarre FP corner cases, maybe power management [high-frequency DVFS? :- )], and other things go in that additional 1.4mm².
> ARM[v7] maybe does a little bit more per instruction, what with those conditions and 14+ character non-mnemonic mnemonics; but ultimately instruction counts should be pretty close, right?
Great question. I just so happen to have written a tech report on this very topic! https://arxiv.org/abs/1607.02318
Basically, performance should be identical between ARMv8 and RISC-V, given the RISC-V core implements macro-op fusion to combine things like pair loads together.
> Basically, performance should be identical between ARMv8 and RISC-V, given the RISC-V core implements macro-op fusion to combine things like pair loads together.
Yeah, I drew my conclusions from your papers. :- )
I really should diversify my sources, I bring nothing to this exchange.
Also there's no current RISC-V extension for SIMD or vector ops, so I didn't maliciously "omit" things to "totally rig" the comparison. But even running SPECint is not going to fire up the SIMD unit.
SPEC with modern compilers will definitely use SIMD; utilization in hot regions varies, but it's clearly beneficial.
Models with FMA: Cortex-M4, Cortex-M7, Cortex-A5, Cortex-A7, Cortex-A15, Cortex-A17. I do not remember which Cortex-R have FMA. Models without FMA: Cortex-A8, Cortex-A9.
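(For context, FMA here means a fused multiply-add, i.e. a*b + c computed with a single rounding. As a rough illustration only, with arbitrary register choices, the ARMv7 VFPv4/FPv4 form looks something like:)

    vfma.f32 s0, s1, s2   @ s0 = s0 + (s1 * s2), rounded once at the end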
I'd like to see their ideas on macro-op fusion implemented in BOOM. That should give a decent IPC increase too.
Regarding benchmarking though, there is another CARRV workshop paper (coming up in October) that will be reporting some hard performance numbers of BOOM. =)
And I'd love to implement macro-op fusion, but I have too many bigger fish to fry right now. Pull requests accepted though. ;)
That's fine. I'm not sure if this is embargoed too, but: when can we start asking detailed questions and/or looking for new papers?
It would be interesting to see that comparison performed again on more recent ARM and x86, as well as real RISC-V silicon.
In short, you can rely on macro-op fusion to dynamically fuse some idioms that show up as instructions in other ISAs, like load-pair.
So instead of creating a 4-byte "load-pair" instruction (or having to fuse two 4-byte loads), you can fuse two 2-byte loads into a single "load-pair" micro-op. Same performance, same code density, but a cleaner ISA (since not everybody wants the complexity of implementing load-pair).
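To make that concrete, here's a rough sketch (register and offset choices are just illustrative) of the RVC idiom a fusing front end could look for:

    c.ld a0, 0(s1)    # 2-byte compressed load
    c.ld a1, 8(s1)    # adjacent 2-byte compressed load
    # same base register, consecutive offsets, distinct destinations:
    # a fusing decoder can turn these into one "load-pair" micro-op,
    # roughly what ARMv8 expresses directly as ldp x0, x1, [x19]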
Shouldn't any ISA at 22nm basically be more power efficient than one at 90nm?
The full paper is here: https://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa...
Out-of-order execution helps with things like "I don't have the memory fetch done yet, but let's make a reasonable assumption and then speculatively do lots more that doesn't depend on that fetch".
This is relatively recent research, and definitely against the common grain. See Daniel McFarlin's work at CMU (https://dl.acm.org/citation.cfm?id=2499368.2451143)
load [rp1] -> r1    # independent of the load below
load [rp2] -> r2    # neither load consumes the other's result, so the two can be overlapped
Dynamic scheduling isn't as complicated as it seems, especially now that we have so many transistors on a chip, and Amdahl's-law (single-thread) performance is still very critical for most applications.
However, the value of OoO isn't just about constructing the "best" schedule, it's about being able to change schedules on a dime.
For example, VLIW compilers must try to statically produce the best "trace scheduling" across basic blocks. But OoO, register renaming, and branch prediction allow you to build "trace schedules" on the fly. And branches tend to be far easier to predict dynamically than statically (or rather, branches tend to be more dynamically predictable than they are statically biased).
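A tiny made-up sketch of what that difference looks like (generic RISC-style pseudo-assembly; registers and labels are invented, not taken from any paper):

    beq  a0, zero, cold_path   # rarely taken at run time
    ld   a1, 0(a2)             # hot path: this load feeds the add
    add  a3, a1, a4
    ...
    cold_path:
    ...
    # A VLIW/trace-scheduling compiler has to commit to one schedule here,
    # e.g. hoisting the load above the branch only if it can show that's
    # safe and profitable on average. An OoO core with branch prediction
    # rebuilds the "trace" dynamically: it predicts the branch, renames
    # a1/a3, starts the load before the branch resolves, and discards the
    # work on a misprediction.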
Unless you rely on a JIT to rebuild traces, and you can tolerate that overhead every time you fire up the JIT, I think OoO/dynamic scheduling will be with us forever.
Standard ISAs let you write a program with a linear model, and let the hardware get better at scheduling the micro ops over time, or just let it be linear (as is generally preferred on low end microcontrollers and low-power mobile cores).
This isn't a problem with static schedules, it's a problem with VLIWs (well actually any exposed pipeline machine).
(And really not that much of a problem anyways)
Do compilers, plural, have to get good at it, or just the Mill team's own LLVM backend?
What we're talking about is not the difficulty of making one sufficiently intelligent compiler, but probably something more like six of them, along with every other compiler anyone ever wants to write again.
I don't know about you, but I'm not about to assume that we'll somehow manage to develop and maintain six of something we've never even developed one of. The Mill is no Itanium, but it is very unusual. They haven't even tried to make a normal assembler; their assembler is a C++ template metaprogram that generates the assembler, which you then run to get your program.
Amongst other things it (a C++ assembler) would be slow as hell and require a huge amount of runtime. Ideally the specializer would be a single binary that would convert a generic load module to a specialized one with no intermediate steps. My guess is that they will generate such a specializer with their current generic specification tools.
I really think this is the only path to make a long-term-viable statically-scheduled architecture. Otherwise you'll wind up with the Itanium problem, which is that you eventually need to rewrite/reorder code internally (in hardware) as your architecture evolves, which kind of removes the advantage of having static scheduling in the first place.
You may be right. I think the challenges involved (especially as OOBC is proposing to solve them) defeat the goal. For systems that have no present or future interest in JIT, it might offer somewhat higher peak performance than whatever commodity computers are on the market; for those with JIT (especially with frequent compilation, like in a web browser) I can see it being prohibitive. I don't care how magical their specializer is, it will cost something to run, and cost a lot to integrate (imagine not being able to know the size or entry point of the basic block until you've specialized it, infuriating!). The way they talk about the specializer, it even gives the impression that they don't intend to share the source code, which will be a whole lot of fun when deciding whether or not you trust it to run in the middle of your application.
Regarding their intentions, they've said repeatedly that they want to sell chips. It'd therefore be pretty stupid IMHO not to open up, as much as they can, the toolchain needed to get software running on their chips, including the specializer. For that matter they should definitely also be releasing the chip specification data (insn/op bit patterns, functional unit slot arrangement, latencies, etc.) that's used to create the specializer, so people can roll their own if they want.
Then again this is the same world where Intel ME is an ultra-suspicious closed blob and you can't get datasheets on tons of chips to save your life, which makes no sense to me either. So I'm probably not a very good judge of what's reasonable.
Also, special invariants are sometimes needed/helpful to avoid data hazards/simplify data dependencies: e.g. the Mill CPU effectively has no general-purpose registers, but a FILO stack of immutable items (this guarantees SSA on the data the CPU is operating on).
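A very rough illustration of the idea (invented pseudocode, not actual Mill syntax):

    add(b1, b2)   ; result drops at the front of the belt as the new b0
    shl(b0, 3)    ; reads the just-produced sum by its belt position
    ; existing belt entries are never overwritten -- older values just slide
    ; toward the end and eventually fall off -- so every value has exactly
    ; one definition, i.e. SSA by construction.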
Not really, exposed pipeline is pretty common in the VLIW world.
It turns out that it is really hard to create a compiler to effectively take advantage of instruction level parallelism at compile time.
When I used Itanium 2 systems for HPC, Intel's compilers produced the fastest binaries from C and Fortran, at least on the software I used (molecular dynamics and quantum chemistry).
What's meant by "benchmarks designed to make x86 look good"?
Itanium was introduced against POWER, Alpha, SPARC, and MIPS chips. It was primarily benchmarked for server/workstation applications, where x86 processors weren't the fastest at the time. In 2003 a "Madison" Itanium 2 was significantly faster on numerical simulation code than a contemporary "Gallatin" x86 Xeon.
Itanium 2 did finally get reasonable performance, but only by adding back in dynamic scheduling to the microarchitecture.
(This is all in terms of fundamental benchmarks that are not x86 specific).
Notwithstanding the fact that your answer is totally unhelpful: I thought about it for a second, and I think the answer is basically that if you have a multithreaded core and you're waiting on the results of another operation, then you can't necessarily know what optimizations you can do ahead of time, because you don't know what other threads are resident on your core, which will affect the on-the-fly reorganization.
If you can write better LLVM plugins then I'm sure Intel and HP will be happy to throw lots of money in your direction.
I've toyed around with the older Epiphanies though, and they were definitely easier than GPUs in many respects.
OoO is much more expensive to implement than simply scheduling a different warp to hide latency.
Maybe you're confusing the GPU's threading techniques for OoO?
- we still need to do X to this design
- we still don't have Y
I would be interested in how close or far we are from being able to drop things like the Intel ME, and other nice things to have.
One of the issues is that cores are fun to build, but are only one piece of the puzzle (guilty!). So there are lots of FPGA-targeting softcores, but fewer cache-coherent, high-performance, multi-level cache systems, for example. I'm hoping that SiFive's TileLink protocol, a high-performance free and open bus standard, helps shore up the ecosystem. Some other gaps are devices, drivers, graphics, crypto-engines, more open-source testing/verification infrastructure... it's an exciting time for OSHW!
Default rocketchip uses the Rocket core, which is a single-issue in-order core. If things are going well, a two-wide BOOM is 50% to 100% faster than Rocket [as measured using SPECint and Coremark].
If you want to get started, I'd recommend starting with rocketchip/sifive dev boards (they provide ready-to-go FPGA images). Rocketchip has a company supporting it, it's open-source, and it will boot Linux (if you use the RV64G version). But it's a very complex code base, so if you want to hack the RTL, it will be a steep learning curve.
If you just want to play with smaller, easier to grok cores, you can find some ready-to-go FPGA cores like picorv32. That's targeting high-frequency FPGA softcore applications and is a multi-cycle design (trading off IPC for higher frequencies).
Also, any plans of an async RISC-V implementation?
I know there's been at least one talk presenting some thesis work on a clockless RISC-V design at an early RISC-V workshop, but I'm not aware of any in-progress implementations.
Although a projected speed of 1 MHz and the size might exclude it from your own project!
This is absurdly common.
I don't know of anyone who has built anything as sophisticated as BOOM though.
But admittedly, SpinalHDL is basically Chisel.
Granted these are smaller, microcontroller-class cores, but still in pure Verilog. I'm sure there are others out there with more/less performance.