First of all, kudos to WD. This makes me feel good about spending $1500 on their spinning drives just last week.
But on a more practical note, what kind of board and toolchain does one need to get this going on an FPGA? Is there a readme somewhere that would walk one through the process?
You'd need a SoC variety of FPGA with a memory controller as I didn't see one in this code base. Putting this in an FPGA seems feasible.
But I see some challenges. It looks like it's a Harvard-architecture core. That means separate buses for data and instructions, which is not common outside of embedded or specialized systems. I'm sure you can set up GCC to work with this, but it would be a project.
You could build (or find) a memory controller that can multiplex separate instruction and data buses to a single memory space... decide data will be in memory range A and instructions in memory range B, and inform the linker where to put code and data.
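To sketch the linker side of that idea: a GNU ld script can pin code and data to the two ranges the bus multiplexer decodes. Everything here (region names, addresses, sizes) is made up for illustration:

```ld
MEMORY
{
  /* hypothetical ranges: range B for instructions, range A for data */
  IMEM (rx) : ORIGIN = 0x00000000, LENGTH = 256K
  DMEM (rw) : ORIGIN = 0x10000000, LENGTH = 256K
}

SECTIONS
{
  .text   : { *(.text*) }   > IMEM  /* code fetched over the instruction bus */
  .rodata : { *(.rodata*) } > DMEM  /* constants read over the data bus */
  .data   : { *(.data*) }   > DMEM
  .bss    : { *(.bss*) }    > DMEM
}
```

With that in place, the bridge only has to decode the high address bits to decide which bus a request belongs to.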
I'd probably start by downloading whatever free versions of the FPGA tools the vendors offer and seeing whether I can synthesize the code for any of the targets, and how well it fits (assuming someone else hasn't posted that info already). If it isn't going to fit in anything supported by the free version of the tools, I probably wouldn't go any further with it myself.
Assuming that it did fit, I would switch gears, build a simulation testbench, and start tinkering to see how it behaves compared to the docs. If it really is strictly Harvard, I'd build a bridge to the FPGA's memory controller that could map the two buses to a single memory space. If I got that far, I'd start working on setting up a compiler and linker to map code and data partitions into that memory space.
At this point you might be ready to build all this and load the FPGA, but you have no peripherals (like Ethernet or VGA). I'd consider slaving it to a Raspberry Pi or something like that. I saw a debug module in the GitHub repo, so that might be a good thing to expose to the Raspberry Pi. Or pick a simple bus like I2C and use that to get some visibility from the Pi into the RISC-V state and bridge over to the RAM.
---
Another direction you could take would be to get something like a snickerdoodle. I believe it can boot Linux with the ARM core in the FPGA, and it has the peripherals you need, like Ethernet (via Wi-Fi) and access to an SD card. So the direction I would take there is trying to supplant the ARM core with the RISC-V. The effort would be to disable the ARM core, which ought to be straightforward, and build a wrapper around the RISC-V core so it can talk to the peripherals in place of the ARM.
Given that it's an ARM core, I'm sure all the internal busing is AMBA (AHB/APB/AXI), so it's probably pretty reasonable to try this.
I wouldn't call this Harvard. It has ICCM/DCCM (closely-coupled memories) that appear to be strict (i.e. the core cannot ld/st to the ICCM), but everything is in the same (physical) address space, and the CCMs are just mapped into that space. In that sense, the ICCM is not much different from having a ROM in the memory map, which isn't usually seen as making something a Harvard architecture.
In my view, RISC-V pretty much precludes a hard Harvard architecture. The fact that RISC-V has a FENCE.I instruction that "synchronizes the instruction and data streams" means an implementation can't really be strictly Harvard: if stores could never reach instruction memory, there would be nothing to synchronize and no reason to invalidate the icache.
This core has an icache, but that alone doesn't make it a Harvard architecture. As long as the backing store of both the icache and the dcache is the same, it won't be any harder to work with than any other modern modified-Harvard architecture (read: pretty much every desktop, laptop, and phone of the last few decades).
> I'm sure you can setup GCC to work with this, but it would be a project.
FWIW, AVR is not only Harvard, but code-memory addresses point to 16-bit words while data pointers point to bytes. Yet gcc works great (mostly). It is mainly a matter of whipping up a good linker script and directing code and data to the appropriate sections. Not particularly hard, although the Harvardness does leak into your C code, mostly when taking pointers to functions or to literals stored in flash.
*mostly: gcc long ago stopped taking optimization for code size seriously. That's unfortunate for microcontroller users, since for small processors like the AVR, optimizing for size is pretty much also optimizing for speed. Gcc has had some serious code-size regressions in the past, mostly unnoticed by anyone not trying to shoehorn code into a tiny flash space.
For ARM, I have seen IAR emit unoptimized code half the size of GCC's output with speed optimizations enabled (which, oddly, led to slightly smaller code than its size optimization!). When optimizing for size, IAR shrank the code by another 1/3 to 1/2. When you are really struggling to fit functionality into a tight controller because your hardware engineers won't put in a bigger one, this is a dealbreaker.
I guess the tone came out wrong. I understand why we got the HW we got. In this case it wasn't even about cost. The power available to the device as a whole was so low that we were counting microamps. It was all incredibly tight:
Just switching on the wrong SoC feature would bring the entire thing outside the envelope. Even the contents of the passive LCD affected power consumption in adverse ways. Showing a checkerboard pattern could make the device fail.
It's surprising to me that the RISC-V ISA specification is loose enough that a core could be considered RISC-V-compliant and yet also need a linker script to accommodate its peculiarities.
On x86-64, the page tables control memory access permissions (rwx). The layout and semantics of these page tables are defined by the ISA and work the same on any processor. You wouldn't need to lay out your binary differently on Intel vs. AMD, for example.
It sounds like AVR, by contrast, has some parts of the address space that are (x) only and others that are (rw) only, even though other RISC-V processors don't have this restriction. That seems odd to me.
Maybe it's not as odd if you think of memory as a device that is mapped into your address space. ROM could be mapped into an x86-64 address space and be read-only -- even if the page table said it was writable it would probably throw some kind of hardware exception if you tried to actually write it.
> It sounds like AVR, by contrast, has some parts of the address space that are (x) only and others that are (rw) only, even though other RISC-V processors don't have this restriction. That seems odd to me.
AVRs have three separate address spaces (program, RAM/data, EEPROM), i.e. the address "0" exists three times. Additionally their registers live in RAM.
Every architecture has a gcc linker script, it's just that most of them have a default one hiding in an "arch" directory somewhere that works for normal use.
It occurs to me that the more likely explanation is that GCC devs aren’t up to date on the specs and that perhaps a unified situation is possible, just not implemented.
> It looks like it's a Harvard-architecture core. That means separate buses for data and instructions, which is not common outside of embedded or specialized systems.
I'd point out that WASM is also a Harvard architecture, so it's not so exotic anymore.
I don't know why WD released this, but it would be useful for people building SoC ASICs that don't want to license an ARM core. Depending on the licensing.
The core speaks AXI and AHB-Lite, so for an experienced FPGA/core guy, it probably wouldn't be too much work to integrate into their own FPGA flow. And since it doesn't have any floating-point, it will probably be able to fit on modestly-sized FPGAs.
From a cursory glance, they have operand read bypass and branch prediction. Operand read bypass speeds things up considerably (without it, a chain of dependent instructions can throttle the pipeline to roughly 1/N of full throughput, where N is the pipeline length), and the presence of branch prediction hints at speculative execution.
>"From cursory glance they have operand read bypass and branch prediction."
I'm not familiar with "operand read bypass", is this the same thing as "operand forwarding"? Is this a pipeline optimization? Might you have any link you could share on this?
This one doesn't have out-of-order execution, an FPU, or an MMU. It's also 32-bit where BOOM is 64-bit.
Edit: There is a little more detail available for an existing Marvell ARM SATA controller (similar to what WD uses now), which this processor is probably the replacement for[1]. That indicates it has 2 ARM Cortex R4 cores. On the Cortex R4, it looks like the FPU is an optional feature, and there is no MMU, just an MPU[2].
Basically, it looks like the BOOM CPU is intended to be a general purpose CPU, and this is intended to be a very fast/capable microcontroller. It wouldn't run Linux, other than maybe something like uClinux.
Because you’re comparing a workstation class out-of-order CPU design to a teeny tiny embedded chip lacking an FPU or even a unified code/data cache. It’s like saying “how does the Intel Atom compare against the AMD Ryzen Threadripper core?”
It would be their dream come true if the (very small) hardware OSS community took this up and developed it into something huge, the way big software projects grow. There is so much manpower needed to develop these kinds of things; what better way than OSS.
However: too little, too late, EE industry. An entire generation of top-tier college students has pretty much skipped EE.
Nowadays the Computer Engineering degree lets those of us who were torn between EE and CS get both. At least at Virginia Tech, CpE (Computer Engineering) easily outstrips CS, teaching the basics of CS, the fundamentals of EE, and the practical application of both.
Sure, that makes sense in theory, but of the computer engineers I've known, they prefer one end or the other and aren't as good at either as someone dedicated to just one.
However, this is a pretty small sample size (I helped interview at a small company), so I'm not sure if it's a trend.
Ya, most of us tend to prefer one side or the other, but this system tends to be a good way to create would-be EEs with a decent bit of exposure to CS, and vice versa.
It doesn't create perfectly mixed skill sets, but it gives working experience with both sides. For example, I hate semiconductor work, dislike circuit analysis, and much prefer software, but I enjoy working with HDLs and embedded software. I'll take Rust or Haskell over Verilog any day, but I have no problem working with VHDL given the need.
Lots of unsubstantiated claims here. WD doesn't need the OSS community; the community won't make or break the project. They're a 'nice to have' at best. Remember, if this really means savings of $X (that currently goes to ARM?) per unit of storage shipped, then given their volumes, they'll gladly fund this to the finish line. Claiming WD needs the OSS community is like claiming Chrome/Google needs the community. Sure, the community contributions are valuable, but the core funding and heavy lifting is always in-house.
As for a whole generation skipping EE, no evidence.
Looks like royalties are somewhere in the 2% range for ARM[1]. There's also whatever markup companies like Marvell charge to WD. It sounds like WD sells around 40 million drives per year. So, even a savings of 25 cents per unit would be $10M/year.
> An entire generation of top tier college students have pretty much skipped ee.
Umm, what? I just looked at MIT's statistics (the registrar publishes them online) and sure, there are a lot more CS students, but there are certainly a lot of EEs.
I would think MIT might be a bit special, given that both EE and CS are Course 6 - some other schools have a similar setup (i.e. one EECS department), but many do not.
From a cursory look, it appears there were roughly 55,000 CS graduates in the US in 2015, ~11,000 EE, and ~5,000 CE.
I don't agree that a "generation has skipped EE," but I do think it is probably true that CS graduates in general now have much less hardware experience/education than has historically been the case (no comment from me on whether that is good or bad).
The institute breaks out 6-1 and 6-2 (took me a while to figure that because the registrar writes VI-1 :-).
I've seen feedback correct this: shortly after I graduated, analog fell out of favor (for jobs), so almost all the analog engineers in industry were old guys... so when the worm turned, there was a shortage of experienced analog EEs.
I assume this reversion to the mean will happen more broadly for EEs. Then again, China and Europe are graduating a lot of EEs so it’s not like EE won’t be done, just not in the US.
The number of EE grads is going down in Europe too.
In Eastern Europe, students still study it but then move to software, as web dev pays twice as much. In Western Europe, students have given up on EE completely, since SW companies make HW ones look archaic.
No grad wants to work for us when he comes in for an interview and sees only old guys. My boss has only hired Eastern Europeans, Indians, and Chinese lately, since he couldn't find any EEs. The true problem is that he and other companies don't offer competitive wages to EEs while complaining of a shortage.
Offer competitive wages and grads will flock to this field.
This was clear 10 years ago, when EE and CS salaries started diverging significantly and EE companies remained fairly stingy, even with food, drinks, and perks at the workplace, let alone salaries.
It will end up costing them faaaar more than the salaries and benefits that were saved. Gg
I graduated CE and worked for 11 years with digital design and FPGAs in silicon valley. The salaries haven't kept up. I transitioned to SW and I'm not going back.
There are a lot fewer jobs in EE than in programming/IT. Half of the EEs I know went into programming because they couldn't find jobs doing what they wanted.
Would have been awesome if they had joined the effort of developing FIRRTL[1][2] and using it as a universal (LLVM-like) hardware intermediate language, better suited to modern chip design needs than Verilog or VHDL[3].
It's a neat idea, but to me, that's about it. For SW, LLVM serves two purposes: 1) target any (supported) arch you like with your language of choice, and 2) customize your toolchain to generate native machine code from any (supported) language (e.g. GPU shader languages).
I'd say putting yet another middleman between your logic compiler/netlist and what you actually use is not the best course of action for HW implementations. Even HLS, with its obvious, immediate advantages, has significant friction and is hardly being picked up.
I've been meaning to get started with RISC-V for some time now but can't find much on it for total beginners online. Can anyone recommend a starting point for a total noob?
Or an UPduino which is still cheaper. I've had good experiences with one, though I'm only at the "PWM-animated LED" stage right now.
Oh, and https://www.nand2tetris.org is great. You implement a simple CPU, and though the language isn't a real-world one you learn enough to probably be able to implement a similar CPU on an FPGA using VHDL or Verilog.
The error is understandable: the main prototype RISC-V core is written in Chisel, and the "Rocket Core Generator" is written in Chisel. The RISC-V team at Berkeley has a lot of overlap with the Chisel team, so it is easy to think RISC-V is all Chisel. I believe the order went Chisel, Chisel, RISC-V, Chisel, if that makes any sense.
Unfortunately, if you don't see an edit button on the comment then you can't edit it anymore. But props for admitting you're wrong and trying to correct it :-)
So this is the logic of the controller in Verilog. But I don't see any test scripts. I am not an expert in logic design, but it seems to me that validation is at least as expensive and time consuming as the actual creation because you cannot afford mistakes in the ASIC masks. Am I missing something here?
A full environment that could get you to a blinking LED on an FPGA would be the complete package, but I don't see the value in downplaying the publication of a bulk of code that everyone else has kept behind lock and key.
From a quick look, this is a reference model, not a testbench. There is a lot of work needed (writing the testbench environment, test cases, analyzing and writing coverage models, etc.) before the Verilog code can be considered verified and ready for ASIC tape-out.
From my experience, writing the design RTL is at most 25% of the man-hours. The rest is verification and some synthesis/backend work.
Great work, many thanks to WD. As far as I see only a subset of the SystemVerilog features is used. Is there a "coding standard" somewhere available specifying this subset with a rationale? Is it mostly to be compatible with Verilator? Is there information available why they used Verilator and how it has proven itself?
Pleeeeze, I want to have access as an end user. I'd like to do a predicate pushdown, be able to write into the DRAM buffer and then flush a commit, or pin certain blocks directly into DRAM. Pleeeeze!