For a few years I worked with the guy behind this project, Paul Campbell. He is a fearless coder, and moves between hardware and software design with equal ease.
An example of his crazy coding chops: he was frustrated by the lack of Verilog licenses at the place he worked back in the early 90s. His solution was to whip up a compliant Verilog simulator, then write a screen saver that would pick up verification tasks from a pending queue. They had many Macs around the office that were powered 24/7, and they could chew through a lot of work during the 16 hours a day when nobody was sitting in front of them. When someone sat down at their computer in the morning or came back from lunch, the screen saver would simply abandon the simulation job it was running, and that job would go back to the queue of work waiting to be completed.
Synthesizable verilog is a very small language compared to system verilog — especially in the 90s. Off the top of my head I know of six "just real quick" verilog simulators that I've worked with (one of which I wrote). I'm not sure how I feel about them. On one hand, I hate dealing with licenses; on the other hand, now you've got to worry that your custom simulation matches behavior with the synthesis tools. A lot of the "nonstandard" interpretation for synthesizable verilog from the bigs comes from practical understanding of the behavior for a given node. Most of that is captured in ABC files ... but not all of it.
It was more than simple synthesisable Verilog, but not by a lot - it was also a compiler rather than an interpreter. At the time VCS was just starting to be a thing and Verilog as a language was not at all well defined (lots of assumptions about event ordering that no one should have been making).
I was designing Mac graphics accelerators at the time. I'd built it on some similar infrastructure I'd written to capture traces from people's machines to try and figure out where QuickDraw was really spending its time - we ended up with a minimalistic graphics accelerator that beat the pants off of everyone else.
This is why I think Moore (LLHD), Verilator, and Yosys are such awesome tools. They move a lot more slowly than (say) GCC, but I personally think they're all close to the tipping point.
I wrote a second, much more standard Verilog compiler (because by then there was a standard) with the intent of essentially selling cloud simulation time (being 3rd to a marketplace means you have to innovate). Sadly I was a bit ahead of my time ('cloud' was not yet a word), and the whole California/Enron "smartest guys in the room" debacle kind of made a self-financed startup like that non-viable.
So in the end I open sourced the compiler ('vcomp') but it didn't take off
A lot of people have come up with something similar. Someone I know implemented the Condor scheduler to run models on workstations at night at a hedge fund. That Condor scheduler dates to the 80s. Smaller 3d animation studios commonly do this too.
The presentation was interesting, but I would like to share an idea that is tangentially related to this CPU.
I noticed that modern CPUs are optimized for legacy monolithic OS kernels like Linux or Windows. But having a large, multimegabyte kernel is a bad idea from a security standpoint. A single mistake or intentional error in some rarely used component (like a temperature sensor driver) can give an attacker full access to the system. Likewise, an error in any part of a monolithic kernel can cause system failure. And the Linux kernel doesn't even use static analysis to find bugs! It is obvious that using microkernels could solve many of the issues above.
But microkernels tend to have poor performance. One of the reasons for this could be high context switch latency. CPUs with high context switch latency are only good for legacy OSes and not ready for better future kernels. Therefore, either we will find a way to make context switches fast or we will have to stay with large, insecure kernels full of vulnerabilities.
So I was thinking about what could be done here. For example, one thing that could be improved is to get rid of the address space switch. It causes flushes of various caches and that hurts performance. Instead, we could always use a single mapping from virtual to physical addresses, but allocate each process a different virtual address range. To implement this, we could add two registers, which would hold the minimum and maximum accessible virtual addresses. It should be easy to check each address against them to prevent speculative out-of-bounds memory accesses.
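A rough Verilog sketch of the check I have in mind (module and signal names are made up, not from any real core) - the two limit registers would be rewritten by the kernel on every context switch:

    // Hypothetical per-process virtual address bounds check.
    // va_min/va_max are assumed to be kernel-written limit registers.
    module va_bounds_check #(parameter VA_BITS = 48) (
        input  wire [VA_BITS-1:0] va,      // virtual address being accessed
        input  wire [VA_BITS-1:0] va_min,  // lowest address this process may touch
        input  wire [VA_BITS-1:0] va_max,  // highest address this process may touch
        output wire               fault    // raise a fault, suppress the (speculative) access
    );
        assign fault = (va < va_min) || (va > va_max);
    endmodule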
By the way, the 32-bit x86 architecture had segments that could be used to divide a single address space between processes.
Another thing that can take time is saving/restoring registers on a context switch. One way to solve that could be to use multiple banks (say, 64 banks) of registers that can be quickly switched; another way would be to zero out registers on return from the kernel and let processes save them if they need to.
Or am I wrong somewhere and fast context switches cannot be implemented this way?
These days there are few caches that need to be flushed at context switch time - RISCV's ASIDs mean that you don't need to flush the TLBs (mostly) when you context switch.
VRoom! largely has physically tagged caches so they don't need to be flushed. The BTC is virtually tagged, but split into kernel and user caches; you need to flush the user one on a context switch (or both on a VM switch) - the trace cache (L0 icache) will also be virtually tagged. VRoom! also doesn't do speculative accesses past the TLBs.
Honestly saving and restoring kernel context is small compared to the time spent in the kernel (and I've spent much of the past year looking at how this works in depth).
Practically you have to design stuff to an architecture (like RISCV) so that one can leverage off of the work of others (compilers, libraries, kernels). Adding specialised stuff that would (in this case) get into a critical timing path is something that one has to consider very carefully - but that's a lot of what RISCV is about - you can go and knock up that chip yourself on an FPGA and start trialing it on your microkernel
ASID = Address Space Identifier. It's a tag that uniquely identifies each process's entries in the TLB. This ensures that your TLB lookups can be limited to the valid entries for the process, so you don't need to flush the TLB on a context switch.
I think the way to think of ASIDs is as each being a separate address space - in effect if you have 15 bits of ASID you have 2^15 (32k) address spaces.
One thing I've done in VRoom! which is an extension on to the RISCV spec is that if we have an N hart SMP CPU (for example a 2 cpu SMT system) we use log2(N) bits of the ASID to select which hart/cpu a TLB entry belongs to - from a programmer's point of view the ASID just looks smaller.
However there's a VRoom! specific config bit (by default off) that you can set if you know that the ASIDs you are going to use for all your CPUs effectively see the same address space - if you set that bit then the per-cpu portion of the ASID tags (in the TLB) becomes available (ie to the programmer the ASID looks bigger) - it's a great hack because it doesn't get into any critical paths anywhere
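A toy sketch of the tag selection being described (my own naming here, not the actual VRoom! RTL), assuming a 2-hart SMT part where one ASID bit gets borrowed for the hart index:

    // Illustrative only: form the TLB tag from the ASID, optionally
    // stealing the top log2(N) bits to identify the hart.
    module asid_tag #(parameter ASID_BITS = 15, parameter HART_BITS = 1) (
        input  wire [ASID_BITS-1:0] asid,        // ASID from satp
        input  wire [HART_BITS-1:0] hart_id,     // which hart owns the entry
        input  wire                 shared_asid, // config bit: harts share address spaces
        output wire [ASID_BITS-1:0] tlb_tag      // tag stored/compared in the TLB
    );
        assign tlb_tag = shared_asid
            ? asid                                        // full ASID visible to software
            : {hart_id, asid[ASID_BITS-HART_BITS-1:0]};   // ASID looks smaller per hart
    endmodule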
Long ago, we in the Newton project at Apple had that idea. We (in conjunction with ARM) were defining the first ARM MMU, so we took the opportunity to implement “domains” of memory protection mappings that could be quickly swapped at a context switch. So you get multiple threads in the same address space, but with independent R/W permission mappings.
I think a few other ARM customers were intrigued by the security possibilities, but the vast majority were more like “what is this bizarre thing, I just want to run Unix”, so the feature disappeared eventually.
It's similar to the original Mac OS, which used handles to track/access/etc. memory requested from the OS and swap it to disk as needed. First you request the space, then you request access, which pins it into RAM.
You should look into the Mill CPU architecture.[0] Its design should make microkernels much more viable.
* Single 64-bit address space. Caches use virtual addresses.
* Because of that, the TLB is moved after the last level cache, so it's not on the critical path.
* There's instead a PLB (protection lookaside buffer), which can be searched in parallel with cache lookup. (Technically, there's three: two instruction PLBs and one data PLB.)
Fundamental rethinks take time. The ideas expressed by the Mill folks have value independent of any specific implementation or absence thereof. Yosys is incredible, and the dropping cost and increasing availability of capable FPGA dev boards equally so. I wouldn't put it past a sharp CS major to whip up a toy Mill CPU in an FPGA these days just based on what's been shared publicly. It's a bit strange to me that I can still see echoes of the Datapoint 2200 in a modern machine.
There's been a lot of recapitulation and growth in the language space recently as well, showing up in languages like Zig and Rust, paving the way for better utilization across heterogeneous and many core architectures. I feel like Rust's memory semantics don't hurt the mill either, and may help a lot.
I went looking and it seems that they're making some progress. I wasn't previously aware of their wiki, which contains ISA documentation and more: http://millcomputing.com/wiki/Main_Page
SASOSes are interesting, sometimes extending a 64-bit address space to cover a whole cluster, but they aren't compatible with anything that calls fork().
The various variants of L4 have pretty good context-switch latency even on traditional CPUs, and seL4 in particular is formally proven correct on a few platforms. Spectre+Meltdown mitigation was painful for them, but they're still pretty good.
Lots of microcontrollers have no MMUs but do have MPUs to keep a user task from cabbaging the memory of the kernel or other tasks. Not sure if any of them use the PDP-11-style base+offset segment scheme you're describing to define the memory regions.
Protected-memory multitasking on a multicore system doesn't need to involve context switches, especially with per-core memory.
Even on Linux, context switches are cheap when your memory map is small. httpdito normally has five pages mapped and takes about 100 microseconds (on a 2.8GHz amd64 laptop) to fork, serve a request, and exit. I think I've measured context switches a lot faster than that between two existing processes.
Multiple register banks for context switching go back to the CDC 6600's peripheral processors (FEP) or maybe the TX-0 on which Sutherland wrote SKETCHPAD; the technique has a lot of advantages beyond potentially cheaper IPC. Register bank switching for interrupt handling was one of the major features the Z80 had over the 8080 (you can think of the interrupt handler as being the kernel). The Tera MTA in the 01990s was at least widely talked about if not widely imitated. Switching register sets is how "SMT" works and also sort of how GPUs work. And today Padauk's "FPPA" microcontrollers (starting around 12 cents IIRC) use register bank switching to get much lower I/O latency than competing microcontrollers that must take an interrupt and halt background processing until I/O is complete.
Another alternative approach to memory protection is to do it in software, like Java, Oberon, and Smalltalk do, and Liedtke's EUMEL did; then an IPC can be just an ordinary function call. Side-channel leaks like Spectre seem harder to plug in that scenario. GC may make fault isolation difficult in such an environment, particularly with regard to performance bugs that make real-time tasks miss deadlines, and possibly Rust-style memory ownership could help there.
What I would like to have is a context switch latency comparable to a function call. For example, if in a microkernel system bus driver, network card driver, firewall, TCP stack, socket service are all separate userspace processes, then every time a packet arrives there would be a context-switching festival.
As I understand, in microkernel OSes most system calls are simply IPCs - for example, network card driver passes incoming packet to the firewall. So there is almost no kernel work except for context switch. That's why it has to be as fast as possible and resemble a normal function call, maybe even without invoking the kernel at all. Maybe something like Intel's call gate, but fast.
> they aren't compatible with anything that calls fork().
I wouldn't miss it; for example, Windows works fine without it.
At the core of any protection boundary crossing is likely going to be a pipe flush (throwing away of tens or maybe 100+ instructions) - post spectre/meltdown we all understand that speculating past such a point into a differently privileged environment is very fraught.
I think this means we won't be seeing 'call gate' equivalents that perform close to subroutine calls on high end systems any time soon if at all
Though you certainly know more than I do about the subject, my understanding is that differently privileged environments can enqueue messages to each other without pipeline flushes, and general forms of that mechanism have performed better than subroutine calls on high-end systems since the early 01990s: Thinking Machines, MasPar, Tera, even RCU on modern amd64.
And specialized versions of this principle predate computers: a walkie-talkie has the privilege to listen to sounds in its environment, a privilege it only exercises when its talk button is pressed and which it does not delegate to other walkie-talkies, and the communication latency between two such walkie-talkies may be tens of nanoseconds, though audio communication doesn't really benefit from such short latencies. The latency across a SATA link is subnanosecond, which is useful, and neither end trusts the other.
oh totally, but then you aren't "making a procedure call" you're doing something different.
In this case your data is likely traversing the memory hierarchy far enough so that the message data gets shared (more likely the sending data goes into the sending CPU's data cache and the receiving one will use the cache coherency protocol to pull it from there) - that's likely to take on the order of a pipe flush to happen.
You could also have bespoke pipe-like hardware - that's going to be a fixed resource that will require management/flow control/etc if it's going to be a general facility
Agreed, but even in the cache-line-stealing case, those are latency costs, while a pipeline flush is also a throughput cost, no? Unless one of the CPUs has to wait for the cache line ownership to be transferred.
well if you're making a synchronous call you have to wait for the response which is likely as bad as a pipe flush (or worse, because you likely flood the pipe with a tight loop waiting for the response, or a context switch to do something else while you wait)
Also note that stealing a cache line can be very expensive. If the CPUs are both SMT with each other it's in the same L1, so almost zero cost; if they are on the same die it will be a few (4-5?) clocks across the L2/cache coherency fabric; but if they are on separate chiplets connected via a memory controller with L3/L4 in it then it's 4 chip boundary crossings - an order of magnitude or two more in cost
All that makes sense to me. So for high performance, collaboration across security boundaries needs to be either very rare or nonblocking?
Multithreading within a security boundary is one way to "synchronously wait" without incurring a giant context-switch cost (SMT or Tera-style or Padauk FPPA-style; do GPUs do this too, at larger-than-warp granularity?). Event loops are a variant on this, and io_uring seems to think that's the future. But the GreenArrays approach is to decide that the limiting resource is nanojoules dissipated, not transistors, so just idle some transistors in a synchronous wait. Not sure if that'll ever go mainstream, but it'd fit well with the trend to greater heterogeneity.
Before we learned how to make them fast, perhaps. They do now tend to be very fast[0][1].
>One of the reasons for this could be high context switch latency.
As multiserver systems pass a lot of messages around, the important metric is IPC cost. Liedtke demonstrated microkernels do not have to be slow, with L3 and later L4. Liedtke's findings have endured fairly well[2] through time. It helps to know that seL4[3] has an order of magnitude faster IPC relative to Linux.
You'd need it to do a lot (think thousands of times) more IPC for the aggregated IPC to be slower than Linux.
>So I was thinking what could be done here.
I don't have a link at hand, but there's some involvement and synergy between seL4 team and RISC-V. I am hopeful it is enough to prevent the bad scenario where RISC-V is overoptimized for the now obsolete UNIX design, and a bad fit to contemporary OS architectures.
Segment registers are precisely how NT does context switching. I think it may be restricted to just switching from user- to kernel- threads. I can't remember if there's thread-to-thread switching using segment registers — I feel like this was a thing, or it was just a thing we did when we tried to boot NT on Larrabee. (Blech.)
Citation needed. What kind of hit are we talking about? 5%? 90%? We have supercomputers from the future that have capacity to spare. I would be willing to take an enormous performance hit for better security guarantees on essential infrastructure (routers, firewalls, file servers, electrical grid, etc).
This is a very ambitious project, so respect and good luck.
I am wondering if the performance will pan out in practice, as it doesn't seem to have a very deep pipeline, so getting high clock speeds may be a challenge. In particular the 5 clock branch mispredict penalty suggests the pipeline design is fairly simple. Production CPUs live and die by the gate depth and hit/miss latency of caches and predictors. A longer pipeline is the typical answer to gate delay issues. Cache design (and register file design!) is also super subtle; L1 is extremely important.
As mentioned here I expect that reality will intrude and the pipe will get bigger - of course a good BTC (and spending lots of gates on it) is important because that's what mitigates that deep pipe.
I haven't published my latest work yet (end of the week) - I have a minor bump to ~6.5 DMIPS/MHz. Dhrystone isn't everything but it's still proving a useful tool to tweak the architecture (which is what's going on now)
Any thoughts about higher level HDLs embedded in software languages, like Chisel, nMigen, or others? Some other RISC-V core designers claim they've had increased productivity with those.
It seems that despite a lot of valid criticism against (System)Verilog, nothing really seems to be on a trajectory to replace it today. I'm not sure if that's purely inertia (existing tooling, workflows, methodologies), other HDLs not being attractive enough, or maybe Verilog is just good enough?
I think they're great - I earned my VLSI chops building stuff in the 90s and I can write Verilog about as fast as I can think, so it's my go-to language. I've also written a couple of compilers over the years so I know it really well (you can thank me for the '*' in "always @(*)"). That's just my personal bias.
Inertia in tooling is a REALLY BIG deal - if you can't run your design through simulation (and FPGA simulation), synthesis, layout/etc you'll never build a chip - it can take 5-10 years for a new language feature to become ubiquitous enough that you can depend on it enough to use it in a design (I've been struggling with this using SystemVerilog interfaces this month).
If you look closely at VRoom! you'll see I'm stepping beyond some Verilog limitations by adding tiny programs that generate bespoke bits of Verilog as part of the build process - this stops me from fat-fingering some bit in a giant encoder but also helps me make things that SV doesn't do so well (big 1-hot muxes, priority schedulers etc)
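For flavour, this is the general shape of 1-hot mux such a little generator might spit out (a hand-written toy here, not the actual generated VRoom! code) - each select bit gates one input onto an OR tree instead of a wide priority chain:

    // Illustrative only: 4-way one-hot mux of W-bit values.
    module onehot_mux4 #(parameter W = 64) (
        input  wire [3:0]   sel,                 // one-hot select
        input  wire [W-1:0] in0, in1, in2, in3,  // data inputs
        output wire [W-1:0] out
    );
        assign out = ({W{sel[0]}} & in0) |
                     ({W{sel[1]}} & in1) |
                     ({W{sel[2]}} & in2) |
                     ({W{sel[3]}} & in3);
    endmodule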
Possibly crazy thought. With wider CPUs needing more ports (read and write) on the register file, would it make sense to use accumulators as registers so basic boolean and math ops could be done locally with a single read port that the alu could tap?
Modern CPUs have a structure called the bypass network that lets different ALUs forward their outputs to another's inputs without having to hit the register file. It's not exactly like a local accumulator but it's something in the same direction.
The idea was to have as many simple ALUs as registers. Results are kept in the ALU/register, and all reads are essentially result forwarding. For example a simple RV32I requires 2 read ports and one write port on the register file. If we use 2 R/W ports and put an ALU on each register, we reduce from 3 to 2 busses and can also do operations 2 at a time when an instruction clobbers one of its inputs, or an ALU op along with a load/store.
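A toy cell to illustrate the idea (made-up module and signal names, purely a sketch, not a worked-out proposal): each register keeps its value locally and applies an op against a value broadcast on a shared operand bus, so only the one shared path feeds it.

    // Hypothetical accumulator-style register cell with its own tiny ALU.
    module acc_reg_cell #(parameter W = 32) (
        input  wire         clk,
        input  wire         we,       // perform an op this cycle
        input  wire [1:0]   op,       // 0:load 1:add 2:and 3:xor
        input  wire [W-1:0] operand,  // value broadcast on the shared bus
        output reg  [W-1:0] value     // architectural register / accumulator
    );
        always @(posedge clk) begin
            if (we) begin
                case (op)
                    2'd0:    value <= operand;
                    2'd1:    value <= value + operand;
                    2'd2:    value <= value & operand;
                    default: value <= value ^ operand;
                endcase
            end
        end
    endmodule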
As an aside, the latest and active development of nMigen has been rebranded a few months ago to Amaranth and can be found here: https://github.com/amaranth-lang/amaranth . In case people googled nMigen and came to the repository that hasn't been updated in two years.
What does your benchmarking workflow look like? I am interested in
* From a high level what does your dev iteration look like?
* Getting instruction traces, timing and resimulating those traces
* Power analysis, timing analysis (do you do this as part of performance simulation) ?
* Do you benchmark the whole chip or specific sub units?
* How do you choose what to focus on in terms of performance enhancements?
* What areas are you focusing on now?
* What tools would make this easier?
At the moment I'm just starting to work my way up the hierarchy of benchmarks. Dhrystone's been useful though it's nearing the end of its usefulness - I build the big FPGA version (on an AWS FPGA instance) to give me a place to run bigger things exactly like this.
I currently run low level simulations in Verilator where I can easily take large internal architectural trace, and bigger stuff on AWS (where that sort of trace is much much harder)
I haven't got to the power analysis stage - that will need to wait until we decide to build a real chip - timing will depend on final tools if we get to build something real, currently it's building on Vivado for the FPGA target.
Mostly I'm doing whole chip tests - getting everything to work well together is sort of the area I'm focusing on at the moment (correctness was the previous goal - being together enough to boot Linux). The past 3 months I've brought the performance up by a factor of 4 - the trace cache might get me 2x more if I'm lucky.
I spend a lot of time looking at low level performance, at some level I want to get the IPC (instructions per clock) of the main pipe as high as I can so I stare at the spots where that doesn't happen
Great questions - I'm using an open source UART from someone else, and for the AWS FPGA system I have a 'fake' disk driver plus timers/interrupt controllers etc
So far I haven't needed USB/ether/PCIe/etc I've sort of sketched out a place for those to live - I think that for a high end system like this one you can't just plug something in - real performance needs some consideration of how:
- cache coherency works
- VM and virtual memory works (essentially page tables for IO devices)
- PMAP protections from I/O space (so that devices can't bypass the CPU PMAPs that are used to manage secure enclaves in machine mode)
So in general I'm after something unique, or at least slightly bespoke.
I also think there's a bit of a grand convergence going on in this area around SerDes, which are sort of becoming a new generic interface - PCIe, high speed ether, new USBs, disk interfaces etc are all essentially bunches of SerDes with different protocol engines behind them - a smart SoC is going to split things this way for maximum flexibility
Not Paul Campbell, but I'll share what I know on the matter.
So GPL'd IO blocks - This is a great question, and something I have definitely been asking myself! One thing to keep in mind is that IO interfaces like PCIe, USB, and whatnot have a Physical interface ("Phy" for short.) Those contain quite a bit of analog circuitry, which is tied to the transistor architecture that's used for the design.
That being said, a lot of interfaces that aren't DRAM protocols use what's known as a SerDes Phy (short for Serializer De-serializer Physical interface.) More or less, they have an analog front end and a digital back end, and the digital back end that connects to everything else is somewhat standardized. So it wouldn't be unreasonable to try to build something like an open PCIe controller that only has the Transaction Layer and Data Link Layer. While there are various timing concerns/constraints when not including a Phy layer (lowest layer,) I don't think it's impossible.
The other big challenge is that anyone wanting to use an open source design will definitely want the test benches and test cases included in the repo (you can think of them like unit tests.) Unfortunately, most of the software to actually compile and run those simulations is cost prohibitive for an individual, because it's licensed software. Also, the companies that develop this software make a ton of money selling things like USB and PCIe controllers, so I'll let you draw your own conclusions about the incentives of that industry.
Even if you were able to get your hands on the software, the simulations are very computationally intensive, and contribution by individuals would be challenging ...though not impossible!
Despite those barriers, it's a direction that I desperately want to see the industry move towards, and I think it's becoming more and more inevitable as companies like Google get involved with hardware, and try to make the ecosystem more open. Chiplet architectures are also all the rage these days, so it would be less of a risk for a company to attempt to use an open source design.
I'd really be curious to hear Paul Campbell's take on this question though. He definitely knows a lot more than I do!
It's likely too big for those programs - I am (just now) starting a build with the Open Lane/Sky tools not with the intent of actually taping out but more to squeeze the architectural timing (above the slow FPGA I've been using for bringing up Linux) so I can find the places where I'm being stupidly unreasonable about timing (working on my own I can't afford a Synopsys license)
I'm just starting this week. I've recently switched to some use of SV interfaces and it does not like arrays of them - sv2v seems the way to go - but even without that yosys goes bang! Something's too big, even though Vivado compiles the same stuff - I rearchitected the bit that might obviously be doing this but no luck so far.
Do you think it is feasible to add some kind of sticky overflow detection bit for integer arithmetic, the way IEEE 754 specifies for FP? The hope is to be able to efficiently implement checked arithmetic as required by e.g. Ada. It came as a real disappointment that RiscV seems to have made that harder rather than easier, compared to the x86 and its ilk. The sticky bit is hopefully more efficient than traditional condition codes, since the compiler can emit checks for it at the end of a function or basic block, giving the hardware some time to catch up.
RISCV doesn't have condition codes, which makes building systems like this with lots of ALUs a lot easier, everything happens in the registerfile and the renaming system.
It does have 'sticky' state bits for FP and I can see how I'll implement them - the big problem is not setting them (because they can be accumulated in any order as instructions hit the commit stages), it's how you test them that effectively becomes a synchronising point in a pipe where you spend all your time trying not to do that - everything has to stop and line up in order before you can sense that state reliably.
Right, the idea of the sticky bits is that you can execute a number of instructions before hitting a sync point, as compared with traditional condition flags or exceptions which have to potentially sync on every instruction. The hope is for that have less effect on performance. I don't know if it causes issues with modern FPU's. Are imprecise exceptions (like on some old machines) still a thing? I wonder if there is a way to record the exception info so its origin can be reported even if it is not noticed til later.
If you don't synchronise it's not an issue :-) but if you have 100+ instructions in motion at any one time creating a bottleneck at the end of every basic block would likely have a big effect on how things run.
Normally we retire up to 8 instructions per clock - merging 8 bits of sticky state is some simple logic (some OR gates, it doesn't matter what order they get processed in); merging 8 saved PCs (which one do you save?) is harder - it probably means a priority encoder (order probably does matter) and a 63 bit mux (still doable in a clock though)
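To make the OR-merge half concrete, here's a toy version (names and widths are mine, not the real RTL) that accumulates the 5 IEEE flag bits from up to 8 retiring instructions - order really doesn't matter:

    // Illustrative sticky-flag accumulation at commit time.
    module fflags_accumulate (
        input  wire        clk,
        input  wire [7:0]  retire_valid, // which of the 8 commit slots retire this clock
        input  wire [39:0] retire_flags, // 8 x 5 bits of NV/DZ/OF/UF/NX, flattened
        input  wire        clear,        // CSR write clearing fflags
        output reg  [4:0]  fflags        // architectural sticky flags
    );
        integer i;
        reg [4:0] merged;
        always @(*) begin
            merged = 5'b0;
            for (i = 0; i < 8; i = i + 1)
                if (retire_valid[i])
                    merged = merged | retire_flags[i*5 +: 5];
        end
        always @(posedge clk) begin
            if (clear) fflags <= 5'b0;
            else       fflags <= fflags | merged;
        end
    endmodule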
Ah, well, you're better equipped than I am to figure this out. Anyway the FPU will have to deal with it. Maybe if the big float computations are mostly vector ops, though, that will help.
Very impressive.
Do you have any experience with designing Verilog or SystemVerilog MIPS implementations as well? If so, how does that compare to RISC-V? Which one was easier in terms of design, testability, SoC integration and overall understandability?
Sorry no, I've worked on VLIW CPUs and an (unreleased) x86 core.
In general though MIPS and RISCV are similar sorts of RISC architecture; they make some of the same design trade-offs (no condition codes for example), and RISCV avoids some of the mistakes (delay slots for example) - I'd guess they're about the same amount of work. I can imagine making a version of my CPU by switching out the instruction decoders (probably not really that simple).
As far as SoC it probably doesn't matter - that's more of an issue of which internal buses you choose to use for memory and peripherals
As a language VM implementor, I would really love to have a conditional call instruction, like arm32. AFAICT this would be a relatively simple instruction to implement in the CPU. Is that accurate?
1 - architectural - RISCV has a nice clean ISA, and it's adding instructions quickly. CMOV is a contentious issue there - I'm not an expert on the history so I'll let others relitigate it - it's easy to add new instructions to a RISCV machine, unlike Intel/ARM it's even encouraged - however adding a new instruction to ALL machines is more difficult and may take many years. But unlike Intel/ARM there IS a process to adopt new instructions that doesn't involve just springing them on your customers
2 - remember RISCV is a no-condition code architecture - that would make CMOV require 3 register file ports (the only such instruction that also requires an adder [for the compare]) - register file ports are extremely expensive, especially for just 1 instruction
3 - micro-architectural - on simple pipes CMOV is pretty simple (you just inhibit register write, plus do something special with register bypass) I'd have to think very hard about how to do it on something like VRoom! with out of order, speculative, register renaming - I can see a naive way to do it, but ideally there should be a way to nullify such an instruction early in the pipe which would mean some sort of renaming-on-the-fly hack
How do you feel about short forward optimizations? If I understand it correctly, BOOMv3 can convert instructions under a short forward branch shadow to be predicated.
conditional CALL is MUCH harder to implement well - it's because the call part essentially happens up at the PC/BTC end of the CPU while at the execution stage what you're doing is writing the saved PC to the LR/etc and the register compare (or accessing a condition code that may not have been calculated yet).
In many ways I guess it's a bit like a conditional branch that needs a write port - in RISCV, without condition codes, your conditional call relative branch distance will be smaller because the instruction encoding will need to encode 2-3 registers
I imagine something like that might be viable in the to-be-designed RISC-V J extension, as safety checks (mostly in JITs) would be close to the only thing benefiting from this.
Though, maybe instead of a conditional call, a conditional signal could do, which'd clearly give no expectation of performance if it's hit, simplifying the hardware effort required.
Yeah, I can imagine that being particularly easy to implement in VRoom! - exceptions are handled synchronously at the end of the pipe (with everything before them already committed, and everything after flushed). Instructions that can convert to exceptions (like loads and stores taking TLB misses) essentially hit two different functional units - a conditional exception would be tested in a branch/ALU unit and then either turn into an effective no-op or turn into an exception that synchronises the pipe when it hits the commit stage
8080 had it too, 8086 dropped it due to disuse. In a modern context it's just a performance hack, an alternative to macro-op fusion, but for high-performance RISC-V (or i386, or amd64, or aarch64) you need macro-op fusion anyway.
heh! - I'm a Kiwi who lived and worked in Silicon Valley for 20 years, moved back when the kids started high school, but mostly still work there - while I was there I started a company using "Taniwha" ... great for a logo, but a mistake because of course no one in the US knows how to pronounce it (pro-tip: the "wh" is closest to an English "f")
It's a personal project, but an expensive one (with very large AWS bills some months), needs a commercial sponsor to end up as a real chip you could put in real platforms and a real chip design team to build it
You could probably find a riscv intl member company willing to donate small fpga boards, but if this design barely fits on an AWS f1 instance, I think realistically that'd have to be a 5 digit price board
For reference, Digikey unit price for a VU440 floats around the $40-60k range
Next step up is probably a VU13P based board, there were a few on EBay a year ago ~$5k, almost bought one, but then bitcoin went up again ..... I'm hoping for a crypto crash ....
Not yet - I have a pretty generic combined bimodal/global predictor - there's a lot of research on BTCs - it's easy to throw gates at this area - I can imagine chips spending 20-30% of their area on the BTC just to keep the rest running
My next set of work in this area will be integrating an L0 trace cache into the existing BTC - that will help me greatly up the per-clock issue rate
Dhrystone's just a place to start; it helps me make quick tweaks, and I'm at that stage of the process. It's particularly good because it's somewhat at odds with my big wide decoders - VRoom! can decode bundles of up to 8 instructions per clock, while Dhrystone has lots of twisty branches and only decodes ~3.7 instructions per bundle - it's a great test for the architecture because it pushes at the things it might not be as good at.
Having said that, I'm reaching the end of the stage where it's the only thing - being able to run bigger, longer benchmarks is one of the reasons for bringing up Linux on the big FPGA on AWS
I'll add that freq scaled Dhrystone (DMIPS/MHz) is a particularly useful number because it helps you compare architectures rather than just clocks - you can figure out questions like "If I can make this run at 5GHz how will it compare with X?"
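For example, taking the ~6.5 DMIPS/MHz mentioned above: at a hypothetical 5GHz that would be roughly 6.5 × 5000 ≈ 32,500 DMIPS, a figure you can put directly alongside another architecture's frequency-scaled number.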
You could use verilator simulating VRoom! as a benchmark.
I haven't actually looked at the generated code, but I imagine it's thousands upon thousands of instructions in a row with no conditional branching at all. CHOMP.
The Architectural presentation linked from the GitHub repository for this project is an incredibly good resource on how these kinds of things are designed.
Yes, there is a huge lack of open and approachable information sources in micro-architecture.
Be aware though, the micro-architecture used here is very interesting but differs in many ways from state-of-the-art industrial high-end micro-architectures for superscalar out-of-order speculative processors.
I am quite curious about how the author came up with these choices
Well, everyone was building tiny RISCVs, I kind of thought "can I make a Xeon+ class RISCV if I throw gates at the problem ?" :-)
Seriously though I started out with the intent of building a 4/8 instruction/clock decoder, and an O-O execution pipe that could keep up - with the end goal of at least 4+ instructions/clock average (we peak now at 8) - the renamer, dual register file, and commitQ are the core of what's probably different here
Yes, the "dual register file" is probably the most intriguing to me.
This looks like a renaming scheme used in some old micro-architectures (Intel Core 2 maybe) where the ROB receives transient results and acts as a physical regfile, and at commit reg values are copied to an arch regfile.
But in your uarch the physical regfile is decoupled from ROB, which must correspond to your commitQ.
I wonder if this solution is viable for a very large uarch (8 way), because the read ports used to copy reg values from the physical regfile to the arch regfile are additional read ports that can be avoided with other (more complex) renaming schemes.
These additional read ports can be expensive on a regfile that already has a bunch of ports.
Any thoughts about this?
But I haven't read much of your code yet, that's just a raw observation
the commitQ entries are smart enough to 'see' the commits into the architectural file and request the data from its current location
It does mean lots of register read ports .... but you can duplicate register files at some point (reducing read ports but keeping the write ports) (you want to keep them close to the ALUs/multipliers/etc) - in some ways these are more implementation issues rather than 'architectural'
I see, there are indeed solutions like regfile duplication to handle large port number but it's expensive when physical regfile becomes large. I still think that the uarch's job is to ensure minimal implementation cost ;).
Thank you for your opinion and thought process, it's very valuable !
I think that one has to separate out architecture and implementation a bit, they're obviously a deeply intertwingled dance - but you have to start with the architecture and tweak from there to get the best result in the end - I'm probably halfway through that process now, starting to introduce deeper timing constraints to flesh out that stuff
BTW one great thing that sort of falls out of this architecture is that the commit register file gets shared between the integer and FP registers (and probably vector registers too), and duping just that may be an interesting architectural way to go
Well (author here) - this is a private project - typically such a project would be very proprietary - people don't get to show their work.
But I'm looking to find someone to build this thing; it's been a while since I last built chips (the last CPU I helped design never saw the light of day for reasons that had little to do with how well it worked). So I need a way to show it off, show it's real. GPLing it is a great way to do that - as is showing up on HN (thanks to whoever posted this :-).
In practice the RTL level design of a processor is only a part of making a real processor - a real VRoom! would likely have hand built ALUs, shifters, caches, register files etc those things are all in the RTL at a high level but are really different IP - likely they'd be entangled with GPL and a manufacturer might feel that to be an issue.
However I'm happy to dual license (I want to get it built, and maybe get paid to do it).
Also about half the companies building RISCVs are in China (I've been building open source hardware in China for a decade or so now, so I know there's lots of smart people there) - they have a real problem (in the West) selling something like this - all the rumors about supply chain/etc stuff - having an open sourced GPL'd reference that's cycle accurate is a way to help build confidence.
One other comment about why GPLing something is important for someone like me - publishing my 'secrets' is a great way to turn them into "prior art" - you read it here first, you can't patent it now - I can protect my ideas from becoming part of someone else's protected IP by publishing them.
I spent a few years working on an x86 clone; I had maybe 10 (now expired) patents on how to get around stupidly obvious things that Intel had patented (or around ways to get around ways to get around Intel that others had patented) - frankly from a technical POV it was all a lot of BS, including my patents
> One other comment about why GPLing something is important for someone like me - publishing my 'secrets' is a great way to turn them into "prior art" - you read it here first, you can't patent it now - I can protect my ideas from becoming part of someone else's protected IP by publishing them.
This is a great strategy!
> I spent a few years working on an x86 clone; I had maybe 10 (now expired) patents on how to get around stupidly obvious things that Intel had patented (or around ways to get around ways to get around Intel that others had patented) - frankly from a technical POV it was all a lot of BS, including my patents
It might be worthwhile to GPL implementations of those expired patents if they are at all likely to be useful. And perhaps then do a bit of procedural generation of various combinations of them for release under the GPL as well (because those would be newly patentable).
> It might be worthwhile to GPL implementations of those expired patents if they are at all likely to be useful.
Probably not. It's always something stupid that can be easily worked around.
The real problem is convincing a jury: no one wants to risk hundreds of millions of dollars based on what 12 random people think. nVidia caved to Intel after building a Transmeta style VLIW chip that could run x86 assembly because a decade long patent battle would have been costly and invalidated patents on both sides.
Also IANAL, but as I understand it, the HDL would compile down to a sequence of gates, and presumably we'd treat that the same way as a binary - a "Non-Source Form" as the GPL calls it. So anyone that receives a copy of those gates (either as a binary blob for a FPGA, or pre-flashed on a FPGA, or made on actual silicon) would be entitled to the source as per GPL3 section 6 "Conveying Non-Source Forms".
I don't think the GPL anti-tivoization clause has much bearing there, other than presumably you'd have to provide the full tool chain that resulted in the final gates - presumably this would affect companies producing actual chips the most, since you couldn't have any proprietary optimization or layout steps in producing the actual chip design, though also no DRM for FPGAs (is that even a thing?)
I guess you could argue that, if you bought a device with this CPU, you should be able to replace the CPU with one of your own that’s derived from this one.
I think that’s the spirit of the GPL in a hardware context, but I don’t think it’s a given (by a long stretch) that courts would accept that argument.
A somewhat clearer case would be if you bought a device that implements a GPL licensed design in a FPGA. I think you could argue such devices cannot disable the reprogrammability of the FPGA.
IANAL, but as far as I know it's very important it's GPLv3 which means the antitivoization clause, which means that hardware that uses this firmware must provide full source code and a way to let you use your own firmware.
If somehow this code is not in a firmware... No idea.
some of this design IS firmware - the lowest level bootstrap is encoded into an internal ROM - currently it's a very dumb bootstrap, a real implementation would boot from a number of possible sources. The sources are there on github.
All the ARM systems you can buy today have a similar embedded boot loader - almost all of them do not release that source, because it's the root of their secure boot chain.
IMHO this code should be public (but not the keys)
RMS wrote "I've considered selling exceptions acceptable since the 1990s, and on occasion I've suggested it to companies. Sometimes this approach has made it possible for important programs to become free software."
If you're at the point in your career where you're not sure which is the right textbook then "A Quantitative Approach" is likely to be really tough to get through.
Computer Organization and Design, by the same authors, is considered a better choice for a first book. I personally loved it and couldn't put it down the first time I read it.
This book definitely skews pragmatic, hands on and doesn't assume much. Covers both VHDL and Verilog. Has sections on branch prediction, register renaming, etc.
I personally am not into the verilog specific books. For me HDLs are hardware description languages, so first you learn to design digital hardware, then you learn to describe them.
> Eventually we'll do some instruction combining using this information (best place may be at entry to I$0 trace cache), or possibly at the rename stage
So much for "we will do only simplest of commands and u-op fusing will fix performance".
This is why I'm very suspicious of this argument from RISC-V proponents.
As far as I understand, RISC-V proponents want to have "recommended" instruction sequences for compilers, to avoid the situation where different RISC-V CPUs need different compilations. If different RISC-V implementations have different "fuseable" instruction sequences, we will be in the dreadful situation where you need the exact "-mcpu" for decent performance and binary packages will be very suboptimal.
And such "conventions" are bad idea, like comments in code, IMHO. It can not be checked by tools, etc.
-target <triple>
The triple has the general format <arch><sub>-<vendor>-<sys>-<abi>, where:
arch = x86_64, i386, arm, thumb, mips, etc.
sub = for ex. on ARM: v5, v6m, v7a, v7m, etc.
vendor = pc, apple, nvidia, ibm, etc.
sys = none, linux, win32, darwin, cuda, etc.
abi = eabi, gnu, android, macho, elf, etc.
"arch", "sys" and "eabi" are irrelevant to the core performance. You can not run "arm" on "i386" at all, and "eabi" and "sys" don't affect command scheduling, u-ops fusing and other hardware "magic".
So, only "sub" is somewhat relevant and it is exactly what RISC-V should avoid, IMHO, and it doesn't with its reliance on things like u-op fusion (and not ISA itself) to achieve high-performance.
For example, performance on modern x86_64 doesn't gain a lot if code is compiled for "-march=skylake" instead of "-march=generic" (I remember times, when re-compiling for "i686" instead of "i386" had provided +10% of performance!).
If RISC-V performance is based on u-op fusing (and that is what RISC-V proponents say every time the RISC-V ISA is criticized for performance bottlenecks, like the absence of conditional move or integer overflow detection), we will have a situation where "sub" becomes very important again.
It is OK for the embedded use of a CPU, as an embedded CPU and its firmware are tightly coupled anyway, but it is very bad for a general purpose CPU. Which "sub" should be used by the Debian build cluster? And why?
It is always frustrating when you have put in the work to optimize code, and turn out to have pessimized it for the next chip over.
The extremum for this is getting a 10x performance boost by using, e.g., POPCNT, and suffering instead a 10-100x pessimization because POPCNT is trapped and emulated.
I'm not sure "the point" is a well-defined term in this context.
Are you guessing that the extension is optional specifically so that nobody will need to emulate things they can't afford to implement in hardware?
But trapping and emulating is explicitly allowed. Maybe it should be possible to ask at runtime whether an extension is emulated. Maybe it is? But I have not seen any way to tell. I guess a program could run it a thousand times and see how long it takes... It would be a serious nuisance to need to do that for each optimization, and then provide alternate implementations of algorithms that don't depend on the missing features.
This is why leaving popcount out of the core instruction set is such a nuisance. It is cheap in hardware, and very slow to emulate.
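For illustration, the hardware side really is tiny - a toy synthesizable popcount is just an adder tree the tools will happily build:

    // Illustrative 64-bit population count; synthesizes to a small adder tree.
    module popcount64 (
        input  wire [63:0] x,
        output reg  [6:0]  count   // 0..64
    );
        integer i;
        always @(*) begin
            count = 7'd0;
            for (i = 0; i < 64; i = i + 1)
                count = count + x[i];
        end
    endmodule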
On organically-evolved ISAs, there are about N variants that correspond to releases. You can decide what is the oldest variant M you want to support, and use anything that is implemented in targets >=M; and the number of <M machines declines more or less exponentially with time.
With RISC-V, there are instead N=2^V variants, at all times, increasing exponentially with time. You too frequently don't know if your program might need to run on one that lacks feature X. So you (1) arbitrarily fail on an unknown fraction of targets, (2) fail on some and run badly on some others, with instructions you relied on for optimization instead emulated very slowly, (3) run non-optimally on all targets, or (4) have variant versions of (parts of) your program configured to substitute at runtime, for each of K features that might be missing. None of these choices is tenable.
The notion of "profiles" appears meant to reduce the load of this problem, but that makes it even more complicated.