(1) The amount of compute and/or data motion that can be achieved with a single instruction. This is really about amortizing the cost of decode and allowing the pipeline to be kept full by producing a lot of work from a single instruction.
(2) efficiency gains from vector processing. This both amortizes the cost of decode and reduces the amount of control circuitry relative to the number of ALUs --> more flops/area. It also generally favors larger sequential memory accesses, which is good for bandwidth.
(3) extracting parallelism from the instruction stream. The VLIW debate is about whether that should be done by the CPU itself, e.g., in the form of out-of-order execution, or whether it should be handled by the compiler. VLIW allows the compiler to do this work, which keeps the CPU simpler.
It's been clear for several years that larger vectors are a win, and that's been happening in the Intel and ARM space, not to mention GPUs. The VLIW debate is less clear, and has been going back and forth. I think that we have been doing a better job of getting a handle on when instruction complexity and diversity is beneficial versus not - remember that the initial RISC proposal was in contrast to the PDP instruction set, which was kind of ridiculously over-specialized for the technology of the time.
That's a good point and I expect that it still somewhat holds true today: many of the RISC supporters (and I'm one of them) effectively equate CISC with x86, since that's by far the most popular CISC instruction set out there these days. And x86 is the C++ of instruction sets: decades and decades of legacy features which shouldn't be used in modern code, and a large selection of paradigms copied from whatever other instruction set was popular at the time. You want to do BCD? We have opcodes for that (admittedly no longer supported in long mode, but it's there). Also, do you like prefixes? Asking for a friend.
But obviously that's a bit of a fallacy, one could design a modern CISC-y IA without all the legacy baggage of x86. It's not so much that I love MIPS or RISC V, it's that I don't want to have to deal with x86 anymore.
 We don't yet consider ARM CISC, right?
The amount of baggage x86 is carrying around nowadays shrank long ago to the point of not mattering.
And sure, if some ancient compiler spits out some opcode that Intel long ago stopped optimizing, it may take 10x the clock cycles it took back in the days of the 386, but it is still going to run 100x faster at the end of the day.
> We don't yet consider ARM CISC, right?
I think ARM64 was an attempt to prevent that from happening! I haven't looked into it at all so I am not sure how successful they were. But it was starting to get pretty close there for awhile.
Power9: I forget, but trust me, it's in there somewhere.
IIRC, they're all 64-bit carryless (or polynomial) multiplications. The funniest part is that ARM and Power9 still claim to be "reduced instruction set computers", despite having highly specialized instructions like these. Power9 is the funniest: implementing hardware-level decimal-floats (for those bankers running COBOL), quad-floats (for the scientific community), Polynomial-Multiplication and more.
Here's a CRC32 impl using POWER8/9 polynomial instructions: https://github.com/antonblanchard/crc32-vpmsum — coincidentally, you don't have to do this on ARM, ARM just has CRC32 instructions.
> hardware-level decimal-floats (for those bankers running COBOL)
IBM being IBM :) I've heard somewhere that they are sharing some internals between POWER and z/Architecture (mainframe) chips. No idea how true it is and how much is shared if true, but a scenario like "decimal floats were implemented because mainframes are using the same backend with a different frontend, so we exposed them in the POWER frontend too" sounds quite plausible.
Ain't that the truth. Reduced never meant anything; they never defined it. Patterson said they cooked up the name on the way to a grant presentation. I think they should call it Warmed Over Cray. What was new in RISC that wasn't in Cray? It was like the Geometry Engine; it was possible for an academic department to accomplish but that doesn't mean they accomplished anything that hadn't already been done.
To be fair, even back in the day decimal arithmetic could at least have an argument made for it on RISC chips. https://yarchive.net/comp/bcd_instructions.html
- 68010s were essentially a 68000 with this fixed - unix could do a real paging system
- 68020s dumped their entire microcode state on the kernel stack - to do signals correctly (i.e., recursively) unix systems had to be able to spill this to user space and validate it on the way back in
- 68030s and 68040s were saner and restarted exceptions in useful ways
After all, you don't need to be able to restart on a page fault if you just swap whole tasks out of and back into memory just before switching to them (which you know when you're going to do), and you still get virtual memory's benefit of the swapped-in tasks being able to occupy different physical, and discontiguous, memory regions without the task itself noticing.
So did they only design for that, or did they actually make a mistake?
Sun solved this problem in an expensive way. The MMU would make the CPU wait while a second 68000 was used to process the page fault.
Now that you mention it, I recall reading that two 68000s story before (possibly from another comment you made here?), and reading it this second time, it still amazes me. Is it possible to find the source code involved? Not sure what of Sun's kernels is available.
You do need to restart instructions if you want your stack to extend automatically; there's a hack you can do to make this happen on a 68000.
You could do a paging system on a 68010/68451 - you had to treat the 68451 as a software refilled TLB
Sun built their own MMU using SRAM. At the time there was a guy going around the Valley designing "sun-style" MMUs; for IP reasons every one was different, and some were broken (one I ported to couldn't protect the kernel from user mode).
Someone (maybe Sun?) managed to get 68000s to do paging by having two of them and freezing the primary on a page fault and having the second one load the page somehow.
68020s had external MMUs: Moto made the PMMU, which most people used; Philips tried to make one (it never really worked); Apple had a weird pseudo-MMU that fit in a PMMU socket on the Mac II - it emulated the 24-bit addressing of the 68000 (Apple had stupidly used the upper bits in their pointers)
Finally: I think that the 68000 not being able to restart a page fault was probably a mistake, something they hadn't thought all the way through
Though qualified with "until the design was almost complete", making the answer to my question (as often) much fuzzier than I thought.
And another commenter speculates it was Apollo with the two-68k implementation, but without sources and the commenter is unsure of it themselves.
In any case, very interesting stuff.
EDIT: Someone else claims it was Masscomp: http://www.dadhacker.com/blog/?p=1383#comment-3582
The '451 is more functional than his power-of-2 design (because it offers multiple powers of 2 that you can cobble together into arbitrary regions). The 'throwaway instructions' he mentions are indeed what we used for stack extension: you do a 'tstb offset(%sp)' in every subroutine entry, where 'offset' is the extent of the stack the subroutine will use. The results of the instruction aren't used, and the fault can be detected by recognizing the instruction at the fault PC offset and returning to the address following it.
Again as someone who somehow never got to touch 68k during its day, the level of creativity required in 68000 MM, and the internal state spilling of the 68010 and later, really got me curious (x86 of course always had plenty of curious quirks, but I think I'm too familiar with them at this point %) ).
I'd also still love to read more about the details of the two-68000 design, but everything I could find so far was essentially folklore.
Funny story: this is still kind of a problem today, where a single pointer dereference may hit an mmap'd file over a network, tunneled through SSH (or increasingly arbitrary indirections).
foo->next = bar->next->next can be surprisingly complicated. I don't think most programmers think to handle SIGBUS errors.
Whereas to me, it sounds like this polynomial evaluation instruction needed all the 21 pages within one (very "large") instruction, meaning you actually need to have those (again, pathological case) 21 different pages present at once?
EDIT: A sibling comment mentioned keeping partial state of (apparently) single instructions instead of fully restarting them. So that might be an option, too.
If you like x86 at all then prefixes are not a good thing but indeed a great thing. Prefixes are what allowed AMD to completely destroy Intel's Itanium. They simply took x86 and added the REX prefix to access 8 more registers. Yeah, they committed a few other architectural atrocities before the god of expediency, but they won, and Intel had to crawl back and beg forgiveness in the x86 space that they owned.
You are talking about ARM here, aren't you?
But seriously, there is no reason not to say ARM is CISC. It's not the old ISA it used to be anymore.
btw, one of the "best" x86 features is the string instructions:
> the destination operand symbol must specify the correct type (size) of the operand (byte, word, or doubleword), but it does not have to specify the correct location
I'm basing this on my knowledge from CS classes in the late 90s but isn't the point of RISC for the decode to be very simple because the instructions are very simple?
Much of the optimization of x86 has been in the decode stage, where a CISC instruction can be transformed into multiple, often parallel micro-instructions. And parallelism is where we've seen our gains and benefited the most in the last ~10-15 years of CPU design.
Does that mean in order for RISC to be performant some degree of parallelism will have to be done by the compiler? Isn't that what ultimately killed Itanium which was a VLIW processor?
(2) The compiler is critical for performance on any platform these days. But RISC/CISC and VLIW vs out-of-order/speculative are separable. You can do an out-of-order speculative RISC machine and it does the same thing an x86 CPU does. (At the micro-op level they're pretty similar). For an OOO processor, the compiler still needs to do a decent-ish job of grouping instructions so that the processor can find the parallelism, because its execution window is limited -- the CPU won't look ahead 200 instructions to find more parallel stuff to execute. The requirements are even more stringent with an in-order VLIW core, where the compiler is responsible for all of it.
I know of more production VLIW cores today than back in the Itanium days. Compilers have gotten better. The GPU manufacturers are an interesting case of this: AMD, for example, was VLIW until GCN in ~2012. NVidia tried VLIW in NV3x, but abandoned it in 2004-2005 with the move to more general-purpose computation on GPUs. Some of the AI hardware accelerators are looking again at VLIW now, probably because the regular, fixed things they do are more amenable to compiler-based scheduling, whereas very fine-grained dynamic scheduling is better done in hardware.
> Intel increased the reorder buffer by over 50% from 224 entries in Skylake to 352 entries in Sunny Cove. Likewise, the amount of inflight loads and stores has been increased by over 50% to 200 memory operations inflight.
The CPU won't look ahead 400 instructions! ;-)
(1) Because RISC chips were simpler they consumed fewer transistors. Because there were fewer transistors you could fit an entire CPU on a single chip of silicon, which drastically improved performance.
(2) It was much easier to apply pipelining to load-store architectures than to architectures where instructions could contain multiple memory accesses or indirect memory accesses and also retain precise interrupts. Pipelining dramatically increased the rate at which instructions could execute.
Also, back in the CISC days the limitations of the memory subsystems available made it supremely important to optimize for minimizing instruction size and instruction bandwidth. This was far less of a concern by the time the RISC paradigm rolled around compared to other issues.
You are correct in that GPUs have a degree of software-scheduled memory instructions. There are no out-of-order schedulers in AMD or NVidia. The best you get are "start memory load/store" and "synchronize" kinds of instructions. On AMD, you perform s_waitcnt to wait for memory to respond. NVidia has read/write barriers + stall counts encoded into each instruction that perform a similar task.
However, I argue that GPU advantage comes from the innate "warp" / "wavefront" model. The programmer will specify 100,000+ "SIMD threads" of execution. There are blatantly obvious ways to optimize the memory access when you have hundreds-of-thousands of threads stepping together.
In particular: with 32-threads operating at a time, you can stripe the data so that the 32-threads all access 128-byte chunks at a time. (Thread-local variable "uint32_t foo" for threads #0 through #31 can be stored in memory location #0 through #127. When threads #0 through #31 read "foo" they only need to perform 2-memory transfers: because modern memory reads/writes 64-byte blocks at a time).
"Striping" memory access, and grouping memory accesses together (aka coalescing), is very important for modern GPU (and CPU) memory optimization. It's far easier to do on GPUs due to the parallel model. In fact, all thread-local variables are optimized this way in GPU code; it's basically automatic.
CPUs attempt to do the same through auto-prefetching. But it doesn't get as far as the GPU-approach.
GPUs benefit from wide memory busses because of the wavefront / warp model of programming. 32+ threads grab memory, and stall on memory, at the same time. It's more efficient to calculate memory stalls for a batch of 32+ threads (one warp) rather than doing it one at a time like a CPU.
If you look at Patterson's initial paper that makes the argument for RISC, he compares it to the VAX instruction set that had support for managing doubly-linked lists, polynomial evaluation, string matching and CRC among other things!
Did a little explainer video on that: https://youtu.be/o14ecAoGN8w?t=105
There is little to no reason why high performance RISC-V implementations can not achieve performance comparable to modern "CISC" cores.
As other commenters have said, VLIW doesn't suddenly make the problem go away and no amount of compiler magic is going to fix the fundamental issue that memory access on a modern processor is highly non-uniform and scheduling is not something that can be done statically ahead of time.
RISC-V is designed to be extended with complex instructions and accelerators and that's the scalable part.
- Does it mean reducing the actual number of instructions in the ISA?
- Does it mean reducing the functionality of each instruction to its absolute minimum?
These are similar, and overlapping in places. You can also do one without the other.
I think that the base RISC-V ISA does both to a perhaps unhelpful degree. As soon as you want a RISC-V core to be competitive with other peer ISAs, you need a bunch of the extensions, which minimises the value in calling it reduced in the first place. At the least, it exposes a possible dichotomy between "reduced-ness" and "usefulness".
A comparable RISC-V core, in terms of features, will always have fewer instructions.
Also, RISC-V prefers less in the core spec because, if you have an application that really needs something, there will be standard extensions to add it.
Again: why should I care about that? As a user, I care about big desktop-class cores, not academic minimal cores that fit on FPGAs. Small decoder is not a benefit for real world usage.
> there will be standard extentions to add it
That is, there will be fragmentation.
Have you considered that the world doesn't revolve around you?
You are in fact wrong: many 'real world' users like how small these cores can be.
And given that many SoC now have lots and lots of small cores in them, having those cores be as small as possible is beneficial.
> That is, there will be fragmentation.
Yes. There will be fragmentation, because the industry is so broad - ranging from minimal soft cores to massive HPC systems - that a true one-size-fits-all would have been doomed from the beginning and was never a viable design goal.
RISC-V is designed to approach the problem of a universal open-source ISA. Its designers knew that avoiding fragmentation is impossible, and thus they tried to build something that makes fragmentation manageable, both in terms of the organisation of the standard and in terms of the tools.
RISC was invented as an alternative approach in an era when processors had really complex instructions, with an idea that high level languages could be efficiently compiled to them and assembly programmers would be efficient if they can do many things with one instruction. RISC philosophy was to make simpler instructions, let compilers figure out how to map high level languages to simple instructions, and therefore fit the processor on one die (yes, "processors" used to be several chips) and therefore run it at high clock speeds. RISC is not a dogma, it is a design philosophy.
On top of that exception handling in complex instructions is hard. Implementing complex instructions in hardware consumes considerable design and validation effort. RISC has won for these reasons.
Some things have changed: we can fit really complicated processors on a single die, and memory access is the bottleneck. The downsides of RISC in this reality are well known: it takes many more instructions to do the same thing, which means instruction cache is used inefficiently (anyone remember the THUMB instruction set of ARM?). It might be useful to add application-specific hardware acceleration features, because we now have the transistors to do it. How does that make RISC unscalable?
Many CISC machines (eg Intel's) are CISC in name only. The instructions are translated to micro-ops in the hardware. The micro-ops and the hardware itself, is RISC.
Register re-naming was invented to ensure that we enjoy the benefits of improvements in hardware without having to recompile. Let us assume you have a processor with 16 registers. You compiled software for it. Now we can put in 32. What do you do? Recompile everything out there or implement register re-naming?
VLIW failed because its designers took the stance that if we remove hardware-based scheduling, the extra transistors can be used for computation and cache; scheduling can be done by compilers anyway. The reason they weren't successful is that if a load misses the cache, you wait - unlike a superscalar, which would have found other instructions to execute. On top of that, if you had a 4-wide VLIW and then you wanted to make an 8-wide one, you had to recompile. And oh, the "rotating registers" in VLIW are a form of register re-naming.
Poorly informed article.
This is a half-truth at best. Unless you work at Intel/AMD/etc. as a chip designer odds are you do not know what is "really happening" behind the scenes, so the underlying implementation is whatever they "want it" to be. The underlying implementations can even change from microarchitecture to microarchitecture.
So we might guess that internal microcode must look more "RISC-y" after transformation, given the fact that "some transformation" must be happening. The larger internal (renamed-to) register file was a common trait of RISC architectures, after all, and a lot of microcoded systems historically have been regarded as "RISC-y." But the existence of modern optimizations like macro-operation fusion suggests that the internals of today's CPUs render things much more ambiguous with regards to "RISC-ness" vs. "CISC-ness."
Fun fact: the original ARM1 processor in 1985 was microcoded - https://en.wikichip.org/wiki/acorn/microarchitectures/arm1#D...
This is an oft-repeated but incorrect statement. Modern x86 CPUs perform macro-op fusion and micro-op fusion. As an example of the former, a comparison and a jump instruction can be fused into a single micro-op, which is decidedly non-RISC. As for the latter, some micro-ops perform a load from memory and an arithmetic operation with the retrieved value - also very non-RISCy.
Modern x86 CPUs are CISC above and below the surface.
And FWIW, I've heard that there are two distinct uOp formats inside a single core these days for quite a few of the uArchs. There's the frontend's view, which is concerned with amortizing decode costs (so wide, fixed-purpose instructions that otherwise look pretty CISC), and the backend's uOps, which are concerned with work scheduled on functional units. A lot of the fusion happens on the frontend, and a lot of the cracking happens on the backend; the frontend tends to be two-address, and the backend three-address. So a frontend's

and rax, [rbx + addr]

might crack into the backend's

ld_agu t0, rbx, addr
ld t1, t0
and rax, rax, t1
I'm not sure I quite buy that. In practice, what would happen is that a suitably large, optimized VLIW core would start fetching more than one wide instruction in a cycle and issuing the resulting ops in parallel with interlocks for dependencies, etc... Effectively, that is, VLIW would drop the explicit promise of the instruction set and turn into a speculative RISC core internally. And the cost for that translation would have been very comparable with what we see in all the existing very successful x86 CPUs.
But this never happened, because we never got that far. VLIW failed for other reasons. This particular problem had an obvious safety valve.
I can't even understand how OOO VLIW would work.
Regarding the OoO part, the dynamic scheduling is supposedly done by a JIT layer in firmware.
The other big difference is that Denver uses a HW ARM decoder vs Transmeta's SW decoder. That was smart since they could license the IP from ARM whereas Intel would fight them every step of the way.
Well at this point this would be quite insane to stay VLIW except for backward compat; but I suspect that the backward compat in question would have way more cost than e.g. the x86 tax. For example x86 is dense which is actually an advantage, but quite hard to decode (which is the main drawback of x86, but even then that's a quite relative drawback, because modern processes let us have complex decoders). What would VLIW have left for itself if internally turned into a classical modern speculative core? Maybe even load-time software translation would perform better...
Just run the software? Nothing forces a software to use all available registers, if yours was compiled for 16 and the CPU has 32, it uses 16 registers. Modern CPUs have plenty of additional instruction codes too and you don't need to recompile your software because one CPU has the RDRAND instruction and some other doesn't have it (minus AMD breaking RDRAND but that's a different topic).
>The micro-ops and the hardware itself, is RISC.
I would argue that µOps are VLIW wearing a RISC hat, it's very VLIW-y what's happening under the hood in x86, just less coherent.
>On top of it, if you had a 4-wide VLIW and then you wanted to make a 8-wide one, you had to recompile. And oh, the "rotating registers" in VLIW is a form of register re-naming.
As mentioned above, in that case the software doesn't perform as well as it could but there is no reason it would stop working if you properly designed it.
No, that's just a popular HN myth. First of all, micro-ops are not part of the public ISA, which is what the entire RISC vs CISC debate is about. If the public ISA does not matter and you can always convert to whatever is best, then RISC loses, because its raison d'etre is to enable optimizations through a better ISA. Second, using microcoding (not the same as micro-ops) to implement a huge number of instructions is an integral core of CISC; if you are saying microcode is RISC, then CISC did RISC way before RISC even existed. RISC was all about removing microcoding to simplify CPU designs, so clearly if your CPU has extensive microcode it is not RISC and definitely not "CISC in name only".
If CISC is RISC then why even bother with RISC? After all CISC does "RISC" better.
Yes, RISC was simpler than the VAX-11 (which was simpler than the 432). But what did RISC do that Cray hadn't already done? It was a Cray on a chip sans vector. Even the CDC 6600 (also co-designed by Cray) was load/store.
2. GPUs are the scalable architecture: The 2080 Ti has 4352+ SIMD cores (136 compute units). And NVidia can load 30+ threads per compute unit, so 130560+ conceptual threads (kinda like hyperthreading) can exist on a GPU executing at once.
3. VLIW seems like a dead end. AMD GPUs gave up the VLIW instruction set with GCN in ~2012. Instead, the SIMT AMD GCN or NVidia PTX model has been proven to be easier to program, easier to schedule, and easier to scale. If you want high scaling, you should do SIMD or SIMT, not VLIW. Intel, RISC-V, ARM, and Power9 have all chosen SIMD for scaling (AVX, ARM SVE, Power9 vector extensions, NVidia PTX, and AMD GCN).
4. I think VLIW might have an opportunity for power-efficient compute. Branch-prediction and Out-of-order execution of modern CPUs relies on Tomasulo's algorithm + speculation, which feels like it wastes energy (in my brain anyway). VLIW would bundle instructions together and require less scheduler / reordering overhead. If a company pushed VLIW as a power-efficient CPU design... I think I'd believe them. But VLIW just seems like it'd be too unscalable compared to SIMD or SIMT.
5. Can we stop talking about RISC vs CISC? Today's debate is CPU (latency-optimized) vs GPU (bandwidth-optimized). The most important point is both latency-optimized and bandwidth-optimized machines are important. EDIT: Deciding whether or not a particular algorithm (or program) is better in latency-optimized computers vs bandwidth-optimized computers is the real question.
> VLIW seems like a dead end.
Since you mentioned this in the same breath as GPUs, I feel I have to point out that according to some reverse engineering paper, Nvidia's Turing is a VLIW architecture. (I'm talking about the actual hardware here, not PTX.)
Presumably they have some reason for that that's unrelated to increasing IPC, since AFAIK their GPUs aren't superscalar.
NVidia Volta / Turing has been reverse engineered here: https://arxiv.org/abs/1804.06826 . EDIT: I'm talking about the actual hardware, the SASS assembly, not PTX.
It doesn't look like a VLIW architecture to me. Page 14 for the specific instruction-set details. I realize this is mostly a matter of opinion, but... those control-codes are very different from the VLIW that was implemented in the Itanium.
> On Volta, a single 128-bit word contains one instruction together with the control information associated to that instruction.
So NVidia has a 128-bit instruction (16-bytes) with a LOT of control information involved. The control information encodes read/write barriers, but there is still one-instruction per... instruction.
The "core" of VLIW was to encode more than one instruction-per-bundle. Itanium would encode maybe 3-instructions per bundle for example.
What NVidia is doing here is having the compiler figure out a bunch of read/write/dependency barriers so that the GPU won't have to figure it out on its own (I presume this increases power-efficiency). The only thing similar to VLIW is that NVidia has a "complicated assembler" which needs to figure out this information and encode it into the instruction stream. Otherwise, it is clearly NOT a VLIW architecture.
> Presumably they have some reason for that that's unrelated to increasing IPC, since AFAIK their GPUs aren't superscalar.
NVidia Turing can execute floating-point instructions simultaneously with integer-instructions. So they are now superscalar. The theory is that floating-point units will be busy doing the typical GPU math. Integer-instructions are primarily used for branching / looping constructs (and not much else in graphics-heavy code), so they are usually independent operations.
VLIW has been amazingly successful in DSP. All the world's baseband processors use VLIW machines. Many of the accelerated video decoders are also VLIW machines under the covers.
It's all CPU-based, GPU-based, or that weird "PEZY" from Japan (which is just... weird). Who knows what the future holds... but SIMD is currently making a big change to the supercomputing landscape (both in CPU design with AVX512 and GPU design with NVidia).
Although automatic compilers and instruction schedulers for VLIW machines exist, they aren't nearly as good as hand-tuning. VLIW machines have a niche where a fairly small amount of code is very arithmetic-intensive, predictable, and executed in a power-sensitive environment.
In HPC and most business workloads, you just can't dedicate the labor required to making your code work well on them.
Well, I think if such advantage is real, it leads to taking over the servers, HPC, and obviously, mobile.
I imagine some hybrid approach will win in the end. I do expect transparent graph processors to come from VLIW with real-time optimization done by software, OoO style. This combo feels unbeatable, but all the real problems hide in the details anyway.
Qualcomm Hexagon (aka: Snapdragon 855) is still used for a lot of mobile applications, and it's definitely a VLIW architecture. http://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_hotchips20...
But VLIW isn't popular anywhere else by my understanding.
Mill still provides speculation, but it is exposed as part of the user-visible architecture. Before Spectre was publicized, the Mill team proposed using speculation in a way that would make Mill systems vulnerable to Spectre:
There is an obvious fix for this (avoid feeding speculated values into address calculations), but they didn't say how much it costs in terms of performance.
I would like to see more exotic architectures out there, but I think I speak for many when I say that I am starting to question if the Mill architecture is going to land in a major way.
Personally I hope that Mill happens. If it will beat existing CPUs left and right or not can then finally be answered.
Deferred loads do have a massive advantage: your compiler knows best when loads are needed, so it can, for example, run the load for the next array value while still processing the current one, allowing the CPU to not stall at all and run at highest efficiency (because the compiler knows instruction latencies and can count out when to start the load).
Prefetching is an optimization of caching, it still doesn't beat the speed of signal in a silicon CPU, nor does it solve the scalability issues mentioned in the article.
History has shown repeatedly that in fact compilers do not know best outside some very specific use cases. Whether a load needs to be issued early, and how early, is a dynamic property of a program, and it is hard to compute statically.
Prefetching and deferred loads have existed for a long time, but they didn't help, e.g., Itanium.
Aliasing is also a dynamic property, but read ahead hardware should be able to handle it (but failure might stall the whole pipeline).
Saying that Mill is doing fine might be a tiny bit optimistic, seeing that the whole thing is still vaporware. Do they have a working compiler now?
I don't see why it matters to a deferred load being issued if the data is in L1, L2 or L3. The compiler emits a load, the caching hardware handles the rest, once the load arrives the pipeline unstalls.
The compiler can at least assume an L1 cache latency and then put some cycles between the load and the execution. The earlier you can issue the load the better; if too early, you might ruin it if you get preempted. Either way, it should not be hard to solve in VLIW.
The compiler also does not necessarily know instruction latencies since they change from one CPU to the next.
Out of order execution already does the job of deferred loads. Loads can be executed as soon as they are seen and other instructions can be run later when their needed memory has made it to the CPU. This is why Haswell already had a 192 instruction out of order buffer. OoO execution also schedules instructions to be run on the multiple execution units and ends up doing what compilers were supposed to do with VLIW CPUs.
> "Prefetching is an optimization of caching, it still doesn't beat the speed of signal in a silicon CPU, nor does it solve the scalability issues mentioned in the article."
None of this is true. Prefetching looks at access patterns and pulls down sections of memory before they are accessed. Caching is about saving what has already been accessed. I'm not sure what you mean by 'beating the speed of signal', but if you are talking about latency, that is exactly what it deals with. By the time memory is needed it is already over at the CPU. The article talks about issues that are due to memory latency (which many modern CPU features deal with one way or another) and prefetching directly confronts this. Instruction access that happens linearly can be prefetched.
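The hardware prefetcher does this transparently for linear patterns, but the same idea can be expressed explicitly in software. A minimal sketch using GCC/Clang's `__builtin_prefetch` (the function name and the `DIST` lookahead distance are assumptions for illustration; the right distance is machine-specific):

```c
#include <stddef.h>

/* Software analogue of a hardware stride prefetcher: while summing a[i],
 * hint the cache hierarchy to start pulling a[i + DIST]. By the time the
 * loop reaches that element, it should already be in cache. */
#define DIST 16  /* lookahead in elements; a tuning assumption, not a constant */

long sum_prefetched(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
#endif
        sum += a[i];
    }
    return sum;
}
```

On modern x86/ARM cores the hardware prefetcher usually catches a pattern this simple on its own, which is the parent's point: the hardware watches access patterns and does at least as good a job.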
HW looks at access patterns (as you say) and does at least as good a job.
The instruction latencies are actually all documented, and modern compilers do take care when generating assembly code to optimize the output a bit so as not to stall the pipeline too often.
I don't see how compilers, which have greatly improved since Itanium, shouldn't be able to tell when a load is best issued and be able to keep the pipelines full.
>None of this is true. Prefetching looks at access patterns and pulls down sections of memory before they are accessed.
That pretty much sounds like an optimization of caching to me. Granted, it's not precisely a caching algorithm, more of a cache predictor, but caching in the end.
Compilers know best how to order instructions to ensure that loads happen where they are needed, and sufficiently beforehand that the data arrives in time without having to stall any part of the pipeline. The CPU solves it at runtime; why can't the compiler do it at compile time when all the same information is available?
That, plus to my knowledge there is no single-core performance difference at the same clock between a 4-core and a 12-core AMD CPU, which there would be if instruction fetching were maxing out memory bandwidth.
But whether or not the CPU's uOPs are RISC or not isn't really relevant here. The article is talking about pressure on things like L1. The argument is that CISC becomes almost a form of compression. If the CPU internally splits it into multiple uOPs that's fine - you still got the L1 savings, and those uOPs can potentially be more specialized. The CPU doesn't need to look ahead to see if the intermediate calculation is kept or anything like that.
As in, it's overall more efficient to take a macro op and split it than take micro ops directly.
When PowerPC started adding complex vector instructions (AltiVec) and the Pentium turned x86 into a RISC core with translation layer around it you knew the distinction was pretty much dead.
RISC-V is designed to make pipelining very efficient, but there has always been a limit. RISC-V just helps you get close to that limit with limited complexity.
Beyond that, part of RISC-V will be the 'V' standard extension, which will give you access to an advanced vector engine that is an improvement on many of the ways we do SIMD now.
... as well as combining two (or even more?) instructions into one μOp.
The root of CISC's persistent dominance over true RISC instruction sets is that memory bandwidth is far lower than what would be needed to feed micro-ops directly into the CPU. It makes sense to solve that by compressing the instruction stream. RISC looks far better on paper in every other way if you ignore memory bandwidth and latency issues.
That being said, I've wondered for many years about whether a more conscious realization of this might lead to a more interesting design. Maybe instead of CISC we could have CRISC, Compressed Reduced Instruction Set Computer? Instead of CISC you'd have some kind of compression codec that defines macros dynamically. I'm sure X64 and ARM64+cruft are nowhere near optimal compression codecs for the underlying micro-op stream. If someone wants to steal that idea and run with it, be my guest.
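To make the "compression codec that defines macros dynamically" idea concrete, here is a toy sketch of such a decoder. Everything in it is invented for illustration (the escape opcode, the dictionary size, the stream format are all hypothetical, not any real ISA): the front end keeps a small dictionary of micro-op sequences, and the instruction stream is mostly one-byte references to dictionary entries, plus an escape that installs a new entry.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy "CRISC" front end. Stream format (all invented):
 *   0xFF idx len u0..u(len-1)  -- define dictionary entry idx
 *   0x00..0x0F                 -- emit the micro-op sequence of entry idx
 */
enum { OP_DEFINE = 0xFF };

typedef struct { uint8_t uops[4]; uint8_t len; } Macro;

/* Decode n stream bytes into out[], returning the number of micro-ops
 * emitted. Frequently used sequences cost one fetched byte each. */
size_t decode(const uint8_t *in, size_t n, uint8_t *out) {
    Macro dict[16] = {0};
    size_t emitted = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] == OP_DEFINE) {                 /* install a macro */
            uint8_t idx = in[i + 1] & 0x0F;
            uint8_t len = in[i + 2];
            for (uint8_t j = 0; j < len; j++)
                dict[idx].uops[j] = in[i + 3 + j];
            dict[idx].len = len;
            i += 3 + len;
        } else {                                  /* expand a reference */
            Macro *m = &dict[in[i] & 0x0F];
            for (uint8_t j = 0; j < m->len; j++)
                out[emitted++] = m->uops[j];
            i += 1;
        }
    }
    return emitted;
}
```

For example, a stream that defines one three-micro-op macro (6 bytes) and then references it twice (2 bytes) emits six micro-ops from eight fetched bytes, and the ratio only improves with reuse. Real CISC decode is essentially a fixed, hardwired version of this dictionary.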
For that reason alone, it might not be advantageous to compress a raw RISC stream of instructions rather than use higher-level instructions that expand into micro-ops.
But most modern RISCs do take a sort of Huffman-encoding perspective on ISA design, starting with SH, into Thumb(2), and into RVn-C. I do agree that there's probably farther we can go; stuff like memory-referencing ALU ops can be thought of as a way of addressing PRF registers without using any bits in the instruction stream, for instance.
Using the Intel VTune tools you can see how each port is utilized, so you could in theory change your code to mix instructions for best utilization beyond what the CPU's own reordering can do, so I can see some analogy with building a VLIW instruction group.
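A classic source-level version of this port-mixing is the multiple-accumulator trick: splitting a reduction into independent dependency chains so the scheduler can keep more than one execution port busy per cycle. A minimal sketch (whether two chains is the right number depends on the specific core's port layout and FP latency, so treat the unroll factor as an assumption):

```c
#include <stddef.h>

/* Two independent accumulators break the serial add dependency chain.
 * With one accumulator, each add must wait for the previous one; with
 * two, adds from the two chains can issue to different ports in the
 * same cycle. The final combine happens once, outside the loop. */
double dot_two_acc(const double *a, const double *b, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];      /* chain 0 */
        s1 += a[i + 1] * b[i + 1];  /* chain 1, independent of chain 0 */
    }
    if (i < n)                      /* odd-length tail */
        s0 += a[i] * b[i];
    return s0 + s1;
}
```

A port-utilization view in VTune (or `perf stat` counters) is how you'd check whether the extra chain actually filled an otherwise idle port.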
There's a crazy amount of performance counters you can look at (the perf tool can do that too, but just try running "perf list" to view available counters).
The (proven) modern high-perf microarchitecture for generalist CPUs is pretty much the same for everybody nowadays (well, some details vary of course, but the big picture is the same for everybody), so a RISC ISA is not necessarily very interesting anymore, but also not necessarily a big problem. However, if we take things that can still have an impact on the internals, I would prefer RMW atomic instructions to LL/SC most of the time (arguably LL/SC is the "RISC" way to go). Hell, I would even sometimes want posted atomics...
Parallel prefix, vector addition, vector transpose, 32x32 dense matrix multiply ...
FPGAs which are large and fast enough to make interesting CPUs are readily available, but they are far too expensive, compared with buying an equivalent CPU core. For perspective, the FPGAs you can rent on Amazon are listed at around USD $20,000 per individual FPGA chip!
Much cheaper and smaller FPGAs exist, but they aren't generally cost-effective and performance-competitive against an equivalent design using a CPU, even with the advantages provided by custom assembly languages and hardware extensions.
There are times when it's worth it. People do implement custom assembly languages on FPGAs quite often, for custom applications that benefit from it. I've done it, people I work with have done it, but each time in quite specialised applications.
(CPU manufacturers use arrays of FPGAs to simulate their CPU designs too.)
Processors in memory is another thing entirely. These are actively being worked on.
GPUs using HBM exist, where the HBM RAM is a stack of silicon dies laid directly on top of the GPU die, with large numbers of vertical interconnections. These behave similarly to processors-in-memory, because there are a lot of processing units (in a GPU), and a lot of bandwidth to reach the memory from all of the units.
Some studies show diminishing returns from increasing the memory bandwidth further, with the present GPU cores and techniques, so it's not entirely clear that intermingling the CPU cores with the RAM cores on the same die would bring much improvement.
There is a physical cost to mingling CPU and bulk RAM on the same die, which is that optimal silicon elements are different for CPUs than for bulk RAM, so manufacturing would be either more expensive, or make compromise elements.
nitpick: actually they're off to the side, connected via the interposer:
Here's what Wikipedia has to say about it:
The terms CISC and RISC have become less meaningful with the continued evolution of both CISC and RISC designs and implementations. The first highly (or tightly) pipelined x86 implementations, the 486 designs from Intel, AMD, Cyrix, and IBM, supported every instruction that their predecessors did, but achieved maximum efficiency only on a fairly simple x86 subset that was only a little more than a typical RISC instruction set (i.e. without typical RISC load-store limitations). The Intel P5 Pentium generation was a superscalar version of these principles. However, modern x86 processors also (typically) decode and split instructions into dynamic sequences of internally buffered micro-operations, which not only helps execute a larger subset of instructions in a pipelined (overlapping) fashion, but also facilitates more advanced extraction of parallelism out of the code stream, for even higher performance.
And Intel has an article on how instruction pipelining is done which covers CISC designs as well.
Access to RAM on a MOS 6502 is about 2-6 cycles, isn't it? The total amount of RAM in a 6502 system is usually roughly the size of a modern L1 cache as well. It would be interesting if you could just program with an L1 exposed and structured as a main memory rather than a cache ...
During the boot of the computer, you can. At least you could a few years ago; I'm not sure about brand new processors. Anyway, the boot was (and maybe still is) done in multiple phases, and at first the RAM is not available. So at first the code only uses registers. But this is extremely limiting (I'm not sure there are even compilers capable of emitting such code, so you have to write that shit in assembly language). So one of the first things to do, even before you have the memory controller programmed and calibrated, is to flip a bunch of model-specific registers, run a tiny loop to touch a range of addresses, and update the MSRs again to prevent the cache lines from being evicted, and BOOM! You go into Cache-as-RAM mode. Later you will initialize the memory controller and leave Cache-as-RAM. I'm not sure if you can hard-limit to L1 though (well, obviously you can if you restrict to a small enough range). Probably L1+L2 is used most of the time? Not really sure...
Maybe you can take a look at coreboot and actually program the application you want in there instead of under a traditional OS. Seems fun :P
One cycle. Of course, the cycles are a lot slower.
> It would be interesting if you could just program with an L1 exposed and structured as a main memory rather than a cache ...
The Cell processor, used in the PlayStation 3, provided something like that. What happened in practice was that programmers trying to get code working soon enough to be useful, just ended up leaving most of the theoretical performance on the table, which is why the next generation of consoles scrapped the idea in favor of a conventional x64 CPU.
The 6502 accesses the memory bus every cycle, although for some cycles of some instructions, the value read isn't used; as a result, RAM and ROM in a 6502 system needs to be fast relative to the CPU.
Perhaps you're thinking about the instruction latencies, which are 2-6 cycles, depending on the instruction and the addressing mode and what not. (BRK (0x00) aka software interrupt is 7, though)
The HDD of my old 486 from the early '90s almost fits in a modern L3 cache.
Also large caches need a lot of logic to ensure coherency between multiple cores - with a small cache the probability of conflict isn't that big, so you can afford to keep it simple. With a huge cache such conflicts between cores would be pretty much assured and you would have to dedicate a lot of silicon just for managing access.
In extremis this also means the mono-kernel has to go, but then we're talking a big change.
Anyway, I don't think the improvements are going to come from hardware or software only; we need to improve both simultaneously, which is complex and often requires one (or at least very few) person(s) to do the job.
Even that is super hard given that the leading producer of hardware interpreters for this VM shipped hardware with the SPECTRE vulnerability.
The RISC-V V extension is the result of this work and exploration into parallel architecture.
RISC-V was designed to be modular, and the V extension is just as much part of RISC-V as the F extension.
Seems that MIPS was scalable back then.
There was a story headline about a 16-core RISC-V just the other day.
The rest of the article is talking about how VLIW "solves" this problem, which it does not. Given an existing parallelizable problem, a VLIW architecture can encode the operations a little more efficiently, and the decode hardware required to interpret the instructions can be a lot simpler. But that's as far as it goes. If you have a classic CPU which can decode 4 instructions in a cycle, then it can decode 4 instructions in a cycle, and VLIW isn't going to improve that except by reducing die area.
VLIW also forces compilers to make a choice between crazy specificity to specific hardware architectures, or an insanely complicated ISA with a batching scheme a-la Itanium. This is largely why it failed, not that "compilers weren't smart enough".
There's nothing wrong with RISC. Or classic CISC. Even VLIW isn't bad, really. The sad truth is that ISAs just don't matter. They're a tiny bit of die area and a few pipeline stages in much larger machines, and instruction decode is largely a solved problem.
Programmers just like to yell about ISA because that's what software touches, and we understand that part.
The fundamental issue that CPU architects run into is that the speed of light isn’t getting any faster. Even getting an electrical signal from one end of a CPU to the other now takes more than one cycle.