I think it's very achievable.
I'd gladly switch to a completely open-source hardware architecture, even if it meant losing a significant amount of raw performance, provided that the hardware and OS are stable and it's not prohibitively expensive.
Grab one HiFive Unleashed for $999:
- 4 cores up to 1.5 GHz
- 8 GB DDR4 ECC RAM
- 1 Gbps Ethernet
Then grab one HiFive Unleashed Expansion Board for $1,999:
- SSD M.2 connector
- SATA3 connector
- x16 PCIe connector (4 lanes of PCIe 2.0)
- A bunch of other cool stuff you probably wouldn't use (SPI, FPGA, etc.)
Finally, grab an M.2 drive and a graphics card ($500, maybe?).
This would set you back a grand total of about $3,500, which is definitely way more expensive than the current mainstream but may fit within "non-prohibitively expensive" for some. The whole platform should be open source.
My issue would be paying all that, and then strapping closed source hardware to it.
I'm currently dealing with our legacy system, which means I have multiple virtualised systems running in a replica of our production system.
I'm currently using ~25 GB of RAM just to run the environments and an IDE.
Since the modern web is basically one of the world's most overengineered and crufty VM platforms, I don't expect it would run very well, and that would probably be enough to doom the system in the eyes of many, sadly.
Maybe people just demand more from their software these days, and all these little conveniences just add up more than you'd realize? That would also explain why the system is not "doomed in the eyes of many" as you claim it ought to be.
Laggy response times in aforementioned text editors, frivolous UI animations on certain OSs, terrible web apps like JIRA...
The complexity doesn't necessarily mean better software. It's quite often worse along many dimensions: performance, usability, maintainability, portability.
In reality I think it's just not a priority for browser devs because the overwhelming majority of users do not use huge numbers of tabs.
About a year ago FF was really awful with lots of tabs: CPU usage was very high and the UI would occasionally freeze up for 10+ seconds at a time. They've really been making some big performance improvements lately.
All things I've seen achieved on a 486... In fact, come to think of it, I've seen people doing all of the above on m68k-based systems too.
I highly doubt this is achievable now. Turn on NoScript and the vast majority of web sites refuse not just to work properly, but even to load their content.
The sites that need it are a small whitelist of ones I trust (e.g. the banks a sibling comment mentioned), which I enable.
...That said, a 75MHz RISC-V will be approximately comparable in performance to a 100MHz 486DX4, or a 40MHz Pentium.
Maybe less, should special-purpose assists be available, and given RAM address space sufficient to contain the tasks.
With 1680 RISC-V cores running in parallel at 250MHz, the result is impressive, even working in an FPGA!
You couldn’t run anything but small toy programs on this machine. This is more like what a student would build in an undergraduate course in computer architecture.
For example, there is no MMU, no debug support, no traps, no interrupts, no exception handling, no OS privilege levels, no FP, no memory controller, etc.
Of course, one wouldn’t implement all of these in a few hours.
The fact that this is RISC-V is somewhat of a red herring, as you could do a similar thing with a restricted subset of MIPS or ARM or even x86, as they do in UT Austin's comp arch class.
The Atmel CPU is more constrained but still has hardware breakpoints, IO instructions, watchdog timers and interrupt support. It also has far more complex addressing modes (more CISC-y) to save on instruction counts and a variable length instruction set encoding where memory space to store code is a first order concern.
I’m also sure there is a memory controller to control the SRAM.
So, even if you were to build a simple microcontroller, you'd need a lot more features and most likely higher performance (and power efficiency) than you would get from a trivial 2-stage pipeline. Not to mention there are no instruction or data caches in this RISC-V machine.
Would probably be a decent step in the right direction for validating/verifying the future of trusted computing.
Although... this gives rise to a 2nd thought. If it was _this easy_ to build a RISC-V implementation, is it all that special, technically speaking? I ask as someone naive about processor design. Is implementation relatively straightforward, but design hard?
However, if you want to build really high performance cores, there are plenty of challenging techniques you have to employ that add a lot of complexity that is hidden below the ISA abstraction layer (speculation for example).
So if you want to make RISC-V go fast, you have to employ more design tricks like "macro-op fusion". For example, scan for two load instructions in the fetch stream and fuse them into a single "load-pair" micro-op if they access adjacent addresses. There are a whole bag of tricks like this that are irrespective of the ISA and add a fairly high "skill-ceiling" to processor design.
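To make that scan concrete, here's a minimal Python sketch of fusing adjacent loads in a decoded fetch stream; the tuple representation and the adjacency rule are my assumptions for illustration, not any shipping core's logic:

    # Hypothetical decoded form: (op, dest, base, offset). A real fetcher
    # works on raw bits; this only illustrates the pairing rule.
    def fuse_load_pairs(stream):
        out, i = [], 0
        while i < len(stream):
            a = stream[i]
            b = stream[i + 1] if i + 1 < len(stream) else None
            # Fuse two loads off the same base register at adjacent offsets.
            if (b and a[0] == "lw" and b[0] == "lw"
                    and a[2] == b[2] and b[3] == a[3] + 4):
                out.append(("load_pair", (a[1], b[1]), a[2], a[3]))
                i += 2
            else:
                out.append(a)
                i += 1
        return out

    # lw x10, 0(x2); lw x11, 4(x2) -> one fused load-pair micro-op
    print(fuse_load_pairs([("lw", "x10", "x2", 0), ("lw", "x11", "x2", 4)]))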
Except for the C variant where they went to 110% complexity for maximum ICache efficiency: 32bit instructions aligned to 16bit??
I wonder if there are other RISC ISA which made the same choice.
The RISC-V ISA does great with a very small number of instructions, so playing around with encodings is rather easy. I'm reaching the conclusion that fixed 24-bit opcodes are close to optimal if immediate constants are allowed after.
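For what it's worth, the bit budget behind that hunch, as a quick Python sketch; the field sizes are my assumptions (three 5-bit register fields, matching RISC-V's 32 registers), not anything from a spec:

    # Field budget for a hypothetical fixed 24-bit encoding.
    INSN_BITS = 24
    REG_FIELDS = 3        # rd, rs1, rs2
    REG_FIELD_BITS = 5    # 32 architectural registers, as in RISC-V
    opcode_bits = INSN_BITS - REG_FIELDS * REG_FIELD_BITS
    print(opcode_bits, "opcode/funct bits ->", 2 ** opcode_bits, "encodings")
    # Immediates too large for the instruction word would trail as a
    # separate constant word, per the "allowed after" idea above.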
The advantage of working with RISC-V over ARM right now is that you can configure a RV core to have far less capability than an ARM core. You can license RTL from SiFive for RV64imc (64-bit address and data, baseline integer instruction set, hardware multiply, and 16-bit compressed instructions). Such a core simply does not exist in the ARM marketplace today, partially because NEON is mandatory in ARMv8.
First, the complexity comes in fully implementing the whole ISA. Yes, RISC-V has an advantage over ARM/x86 in that it will have less cruft, but the complexity in ARM/x86 doesn't exist for no reason. It's driven by real software requirements, so RISC-V will need similar complexity if it wants to seriously challenge either architecture (with the advantage that it can learn from previous mistakes and thus implement things more cleanly).
Second, it comes in verifying your design, especially around the fun corner cases. There are plenty of bugs that occur when a series of rare things happens at once (maybe involving some obscure areas of the ISA), which can be very hard to track down. Unless you can successfully hunt down and fix these you'll end up with a phone or a computer that occasionally crashes for no good reason.
Third, there's making it hit your power and performance targets. It's one thing to say you have a 256-bit data path and can issue 4 instructions per cycle with out-of-order execution. It's quite another to build such a design so it can actually sustain decent throughput on real software whilst hitting a decent clock frequency and remaining within power budget, especially when you have all of the various complex bits of ISA to deal with.
Like what? Are you talking about extensions that don't exist for RISC-V, like Transactional Memory?
I would say most of the reason for complexity is legacy, and there are few actual software requirements in that regard if you're doing a new architecture.
Well, for one thing your load may come in multiple sizes, can target different kinds of memory (e.g. device memory, non-cacheable memory, fully-cached memory; the ARM architecture actually lets you get quite specific about differing levels of shareability and cacheability too), and can be unaligned with respect to the access size (but you still need full performance with them). There are ordering requirements with respect to other loads and stores in the system (even within the same CPU, avoiding read-after-read ordering issues may not be simple) and various kinds of barriers that can affect loads. You get exclusive loads or atomics (some variations return the data seen, so they perform a load on top of the atomic op). It may be a vector load that needs to quickly feed the vector register file as opposed to the 'standard' register file. In a multi-processor system you can have various types of snoop operation coming in that could affect the load. You also need to work out whether you're actually allowed to do that load at all; modern page tables are pretty complex affairs, and the page table itself could be changing as the load executes. A decent chunk of the complexity is there for virtualisation support, but even without that there are various fiddly bits.
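To make the read-after-read point concrete, here's a toy Python enumeration of the classic message-passing litmus test; it illustrates the hazard only and is not a model of any real memory system:

    # P0 stores data then flag; P1 loads flag then data. If the hardware
    # may reorder P1's two loads, P1 can see flag == 1 with data == 0.
    from itertools import permutations

    def outcomes(reorder_p1_loads):
        p0 = [("st", "data", 1), ("st", "flag", 1)]
        p1 = [("ld", "flag"), ("ld", "data")]
        if reorder_p1_loads:
            p1 = [p1[1], p1[0]]  # hardware issues the data load first
        results = set()
        for order in set(permutations([0, 0, 1, 1])):  # all interleavings
            mem = {"data": 0, "flag": 0}
            regs, idx = {}, {0: 0, 1: 0}
            for cpu in order:
                op = (p0 if cpu == 0 else p1)[idx[cpu]]
                idx[cpu] += 1
                if op[0] == "st":
                    mem[op[1]] = op[2]
                else:
                    regs[op[1]] = mem[op[1]]
            results.add((regs["flag"], regs["data"]))
        return results

    print(outcomes(False))  # (flag, data) == (1, 0) never appears
    print(outcomes(True))   # (1, 0) now possible: the "impossible" result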
You also have speculative execution attacks to worry about. Certain loads may need to be very sure they should execute before doing anything, others may be free to speculate away and forward data into further speculative execution.
You can certainly build a perfectly functional ISA that avoids a lot of this (and avoid other things by keeping to a simple in-order microarchitecture), but that will lose you a lot of performance.
x86 has TSO and is still the fastest, so I think overall you are doing much better for yourself if you avoid massive complexity in your memory model, because that is going to cost you application complexity that you could be using for optimizations.
RISC-V has a privilege architecture and a vector architecture as well, and they of course add complexity, but they are still simpler than the corresponding functionality in ARM/x86 while doing many things better.
RISC-V was specifically designed as it was because the ISA really does not impact performance that much, and having something simple and understandable was not going to be a huge performance hit.
In RISC-V there are currently only 2-byte and 4-byte instructions, which you can brute-force your way through. The specification does technically allow for longer instructions, in which case sandsifter will work just fine. [And the RISC-V privileged specification has existed for years.]
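For the brute-force point: the length of a RISC-V instruction is determined entirely by the low bits of its first 16-bit parcel, which is what makes a sweep tractable. A Python sketch of the standard length-encoding rules (the formats beyond 32 bits are reserved for future use):

    # Decode instruction length from the low bits of the first 16-bit
    # parcel, per the RISC-V variable-length encoding scheme.
    def insn_length(parcel):
        if parcel & 0b11 != 0b11:
            return 2   # 16-bit compressed (C extension)
        if parcel & 0b11100 != 0b11100:
            return 4   # standard 32-bit instruction
        if parcel & 0b111111 == 0b011111:
            return 6   # 48-bit, reserved for future extensions
        if parcel & 0b1111111 == 0b0111111:
            return 8   # 64-bit, reserved for future extensions
        return None    # longer/reserved encodings

    print(insn_length(0x4501))  # c.li a0, 0 -> 2
    print(insn_length(0x0513))  # low parcel of addi a0, a0, 0 -> 4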
It is a draft which has not been finalized and contains a disclaimer that it might be modified in a non-backwards compatible way prior to final release.
Chips with MMUs have been taped out for years. Likewise the Linux port has existed for years.
The HDL (Verilog) code looks quite short and simple. If the partial implementation of the ISA is like that, it shouldn't be so bad for learning...
A major project of the course is to build a RISC-V emulator and implement a 2-stage pipeline in Logisim.
It does seem to highlight an increasing unease I have with RISC-V: implementations are many and cheap, but reusable verification is rare, and people don't use what is out there. They have maybe the riscv-tests suite working, but that's not enough to call your new CPU usable for anything other than a hobby project.
Fwiw, the riscv-formal package from Clifford Wolf is the closest thing to a turn-key solution for verifying a RISC-V core, even if people must remember it doesn't cover everything.
And with this and similar projects already spun up in other ways, RISC-V is becoming more interesting as time goes by.
IIRC this isn't the first open source RISC-V core but it's great to see another implementation.
How much would this increase if it used an ASIC instead of an FPGA? And how much would it cost at different batch sizes?
> how much would it cost
A CPU IC isn't very interesting until it has some I/O, so it's much more meaningful to talk about an SoC with one or more of these darkriscv cores.
Sorry, I don't have an answer other than to say "this isn't quite complete enough for it to be useful for most tasks." That said, there are probably tons of open-source implementations of DDR/SPI/PCI/USB interfaces (on opencores.org, e.g.), so it's "only" a matter of integrating them.
darkriscv@75MHz cache=off 0-wait-states 2-stage pipeline 2-phase clock: 6.40us
darkriscv@75MHz cache=on 3-wait-states 3-stage pipeline 1-phase clock: 9.37us
darkriscv@50MHz cache=on 3-wait-states 2-stage pipeline 2-phase clock: 13.84us
The first configuration works in a zero wait-state environment with separate instruction and data high-speed synchronous memories working on different clock phases (weeeeeird!). As long as there is no latency, this configuration works at 75 MIPS with a 2-stage pipeline, which means only one clock is lost when the pipeline is flushed by a branch.
The second configuration uses a small high-speed cache with 256 bytes for instructions and 256 bytes for data, a 3-stage pipeline, which means two clocks are lost when the pipeline is flushed by a branch, and a more conventional single-phase clock architecture, as well as a memory with 3 wait states or something like that. Although working at 75 MHz, the cache misses and the longer pipeline decrease the performance to around 51 MIPS.
The third configuration is the core configuration from the first scenario, but with the small high-speed cache from the second scenario and the 3 wait states. In this configuration, the clock decreased to 50 MHz and, according to my calculations, the performance is around 34 MIPS.
This way, if it is possible to work only with the internal FPGA memory, the first configuration is better; otherwise you can use the second configuration.
I guess it is possible to create a fourth configuration with the 3-stage pipeline and zero wait states (no cache), but I need to implement a two-clock load instruction. In that case, I guess it is possible to peak at around 100 MHz.
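If anyone wants to sanity-check those MIPS figures, here's a back-of-the-envelope Python model; the branch frequency and miss rate are my own placeholder guesses, not darkriscv measurements:

    # Effective MIPS = f / CPI, with CPI inflated by branch-flush cycles
    # and cache-miss wait states. Workload parameters below are guesses.
    def effective_mips(f_mhz, flush_cycles, branch_freq, miss_cycles, miss_rate):
        cpi = 1.0 + branch_freq * flush_cycles + miss_rate * miss_cycles
        return f_mhz / cpi

    print(effective_mips(75, 1, 0.00, 0, 0.000))  # config 1 peak: 75.0
    print(effective_mips(75, 2, 0.15, 3, 0.056))  # config 2: ~51 MIPS
    print(effective_mips(50, 2, 0.15, 3, 0.056))  # config 3: ~34 MIPS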
"after one week of exciting sleepless nights of work (which explains the lots of typos you will found ahead), the darkriscv reached a very good quality result"
Not commenting on the actual quality of the code, but I wonder how one can make typos due to sleep deprivation and yet produce "good quality results" in software.
I wonder when we, as a community, will stop praising all-nighters and rushed work.
In this case it just seems to indicate enthusiasm for the project rather than dangerous overwork.
I agree, but it sounds like the author wanted to claim that he did something impressive in a small timeframe, thus suggesting to the reader some level of technical prowess. If the author had instead claimed that he did it while well-rested over a couple of months, the achievement wouldn't seem so impressive.
On the Amiga we don't need a replacement for MC680## processors because we have the Vampire 2+ accelerator, which gives us a superscalar, 64-bit MC68080 with AMMX extensions.
Coming to an ATARI ST and Amiga 1200 near you, if the Apollo team keeps up this momentum.
The poster's intention is not to start producing hardware; at no point does his project mention taping it out and manufacturing. Obviously it is just a fun side project to implement the RISC-V core ISA in an FPGA. It was then made open source on GitHub so anyone else interested can look at it. Chill out mate.
Without being able to do that, it's going to be exceptionally difficult for me to contribute any more to the discussion, especially typing on a mobile telephone.
That's my contribution for now, I'm pointing out what to compare with. That point was obviously missed.
You have to bring something to the table too, instead of just demanding everything be spoon-fed and served to you on a silver platter.
I found the RISC-V spec enjoyable to read, especially with all the rationales; it felt clean, minimal, and well-thought-out. Writing a program in it was probably not much different from writing in x86-64 assembly. The encodings seemed much easier to deal with than x86-64, though as mentioned I didn't go as far as writing an encoder.
What kinds of important features are present in SPARC or MC68k but not in RISC-V? Are they absent from x86-64 as well?
- bitwise rotate (let them eat macro-op fusion; see the sketch after this list)
- byte and bit swapping (strictly missing from RV, although proposals exist)
- leading zero count, trailing zero count, and popcount (strictly missing from RV, proposals exist)
- efficient multiword arithmetic (let them eat macro-op fusion, or long dependency chains)
- base + [scaled] index addressing modes (you don't really need those)
- multi-register save/restore instructions (ARMv8 doesn't have them / RVC is equivalent in density to ARMv8, nanoMIPS is irrelevant, let them eat millicode)
So yeah, there are deficiencies. None of them are crippling, but I wouldn't say that RV is super-wham-o-dyne, either.
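To show what the rotate and popcount bullets cost in practice, here is what those operations look like when synthesized from base-ISA shifts, masks, and a multiply (Python over 32-bit values; each statement corresponds to a short RV32I instruction sequence unless the core fuses it):

    MASK32 = 0xFFFFFFFF

    def rotl32(x, n):
        # No rotate in base RISC-V: two shifts plus an OR, a classic
        # candidate for macro-op fusion.
        n &= 31
        return ((x << n) | (x >> (32 - n))) & MASK32 if n else x

    def popcount32(x):
        # No popcount either; the standard SWAR bit-twiddling sequence.
        x = x - ((x >> 1) & 0x55555555)
        x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
        x = (x + (x >> 4)) & 0x0F0F0F0F
        return ((x * 0x01010101) & MASK32) >> 24

    print(hex(rotl32(0x80000001, 1)))  # 0x3
    print(popcount32(0xF0F0))          # 8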
The assembler reads almost like a high-level programming language. The register scheme is intuitive as well, with a0-a7 being the address registers and d0-d7 the data registers.
Now, let's do a mental exercise: I'm going to load an effective address relative to the program counter, 32-bits wide, into the 1st address register. Then, I'm going to load an arbitrary value from an arbitrary memory location into the second address register. Do the same in RISC-V; compare intuitiveness.
lea MemoryAddress(pc), a0
move.l $00bfe001, a1
MemoryAddress: DC.l 0
; 1. lea MemoryAddress(pc), a0
auipc a0, [upper 20 bits of (MemoryAddress - label)]
addi a0, a0, [lower 12 bits of (MemoryAddress - label)]
; 2. move.l #$00bfe001, a1
lui a1, 0x00bfe ; a1 = 0x00bfe000 (lui takes the upper 20 bits)
addi a1, a1, 0x001
; 3. rts
jalr x0, x1, 0
If there's a canonical format for "pseudoinstructions", and all assemblers handle them in the same way, and the abstraction doesn't leak in any way (i.e. the only temporary registers you use are ones you overwrite fully by the end; it is true that now some "instructions" have longer encodings, but that comes with the compressed instructions anyway; and it is true that an interrupt could happen between the two halves of the "instruction", but I think that shouldn't make a difference), then I don't think there's much of a problem.
Any RISC processor has that issue, because encoding is fixed at 32-bits to keep the hardware simple. That's not what I'm referring to.
Look at that retardation, "auipc". Because "auipc" is intuitive, right? (For the record, I'm being sarcastic.) What the hell was the instruction designer smoking? Then there is the square-bracket notation, like on Intel, and Intel has some of the most idiotic, non-conventional assembler syntax -- and this thing mimics something that bad? Every other processor uses parentheses; that's a well-understood norm.
Then there is more Intel retardation in the form of dst, src (or dst, src, src on RISC). What kind of a warped, twisted mind came up with that? What was going on in that person's head to work in such twisted ways?
Then there's the "cherry on top":
jalr x0, x1, 0 ; because "jalr" is intuitive as well, it immediately tells you what it does?
Bad names are harder to work around. However, about those names: "auipc" is certainly letter salad, but (a) it stands for "Add Upper Immediate to Program Counter", and its effect is to add an immediate value (multiplied by 2^12) to the program counter and put the result in a register, so the name is entirely logical given its task; and (b) except in the rare case where the PC-relative offset is exactly a multiple of 2^12, an AUIPC will be immediately followed by an add (to load a full PC-relative address), or perhaps a load-with-offset (to load a value at a full PC-relative address), or a jump-with-offset (to jump to a label at a full PC-relative address). All three of these AUIPC+(add|load|jump) combinations are given as pseudoinstructions as well (e.g. "la" for "load address"), so the programmer will probably never need to write a bare "auipc".

As for "jalr", well, it stands for "jump and link register", which jumps to the "register" argument and stores the return address in the "link" argument. There's a set of pseudoinstructions based off x0 being the always-zero register and x1 being the conventional return-address register:
Pseudo       Real              Description
j offset     jal x0, offset    Jump
jal offset   jal x1, offset    Jump and link
jr rs        jalr x0, rs, 0    Jump register
jalr rs      jalr x1, rs, 0    Jump and link register
ret          jalr x0, x1, 0    Return from subroutine
Anyway, if your goal was to convince me there are serious flaws in RISC-V, then criticizing the naming conventions and surface syntax has led me away from the hypothesis that you know any.
Obviously more than a night, but hey, development is completely open so you can see for yourself.
> How much will it take to tapeout?
0 days. It is running on an FPGA.
> Who will pay for it?
See previous comment. Presumably the author paid for the FPGA, but maybe it was a gift?