RISC Is Unscalable (blackhole12.com)
192 points by signa11 76 days ago | 135 comments



The article is conflating three separate issues:

(1) The amount of compute and/or data motion that can be achieved with a single instruction. This is really about amortizing the cost of decode and allowing the pipeline to be kept full by producing a lot of work from a single instruction.

(2) efficiency gains from vector processing. This both amortizes the cost of decode and reduces the amount of control circuitry relative to the number of ALUs --> more flops/area. It also generally favors larger sequential memory accesses, which is good for bandwidth. (A tiny loop sketch of this is at the end of this comment.)

(3) extracting parallelism from the instruction stream. The VLIW debate is about whether that should be done by the CPU itself, e.g., in the form of out-of-order execution, or whether it should be handled by the compiler. VLIW allows the compiler to do this work, which keeps the CPU simpler.

It's been clear for several years that larger vectors are a win, and that's been happening in the Intel and ARM space, not to mention GPUs. The VLIW debate is less clear, and has been going back and forth. I think that we have been doing a better job of getting a handle on when instruction complexity and diversity is beneficial versus not - remember that the initial RISC proposal was in contrast to the PDP instruction set, which was kind of ridiculously over specialized for the technology of the time.
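To make (2) concrete, here's a toy C sketch of the same loop written scalar and with 4-wide SSE intrinsics - illustrative only, assuming an SSE-capable x86 and n divisible by 4 (function names are mine):

    #include <immintrin.h>  /* SSE intrinsics */

    /* Scalar: one decoded add per element. */
    void add_scalar(float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    }

    /* 4-wide SSE: one decoded add feeds four elements' worth of work,
       and the loads/stores are wider and sequential. */
    void add_sse(float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
        }
    }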


>remember that the initial RISC proposal was in contrast to the PDP instruction set, which was kind of ridiculously over specialized for the technology of the time.

That's a good point and I expect that it still somewhat holds true today: many of the RISC supporters (and I'm one of them) effectively equate CISC with x86, since that's by far the most popular CISC instruction set out there these days[1]. And x86 is the C++ of instruction sets: decades and decades of legacy features which shouldn't be used in modern code, and a large selection of paradigms copied from whatever other instruction set was popular at the time. You want to do BCD? We have opcodes for that (admittedly no longer supported in long mode, but it's there). Also do you like prefixes? Asking for a friend.

But obviously that's a bit of a fallacy, one could design a modern CISC-y IA without all the legacy baggage of x86. It's not so much that I love MIPS or RISC V, it's that I don't want to have to deal with x86 anymore.

[1] We don't yet consider ARM CISC, right?


The amount of silicon that legacy instruction support takes up shrinks more than linearly with each process shrink.

The amount of baggage x86 is carrying around nowadays shrunk long ago to the point of not mattering.

And sure, if some ancient compiler spits out some opcode that Intel long ago stopped optimizing, it may take 10x the clock cycles it took back in the days of the 386, but it is still going to run 100x faster at the end of the day.

> We don't yet consider ARM CISC, right?

I think ARM64 was an attempt to prevent that from happening! I haven't looked into it at all so I am not sure how successful they were. But it was starting to get pretty close there for a while.


Actually it was in contrast to the VAX instruction set ... the PDP-11 instruction set is something else (and some VAXes had a PDP-11 emulation mode ....)


VAX was a crazy-town ISA too, with stuff like single-instruction polynomial evaluation.


Crazy for its age, but all modern processors (x86, ARM, and Power9) implement polynomial multiplication.

ARM: vmull.p8

Intel: PCLMULQDQ

Power9: I forget, but trust me, it's in there somewhere.

IIRC, they're all 64-bit carryless (or polynomial) multiplications. The funniest part is that ARM and Power9 still claim to be "reduced instruction set computers", despite having highly specialized instructions like these. Power9 is the funniest: implementing hardware-level decimal-floats (for those bankers running COBOL), quad-floats (for the scientific community), Polynomial-Multiplication and more.
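Intel's one, at least, is easy to poke at from C. A minimal sketch (assuming a CPU with PCLMULQDQ and a compiler flag along the lines of -mpclmul; the function name is mine):

    #include <stdint.h>
    #include <immintrin.h>  /* PCLMULQDQ intrinsic */

    /* Carryless (GF(2)[x]) multiply of two 64-bit polynomials,
       giving the full 128-bit product. */
    __m128i clmul64(uint64_t a, uint64_t b) {
        __m128i va = _mm_set_epi64x(0, (long long)a);
        __m128i vb = _mm_set_epi64x(0, (long long)b);
        return _mm_clmulepi64_si128(va, vb, 0x00);  /* low qword of each */
    }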


Not GF polynomial multiplication, full Horner's method polynomial eval in a single instruction.

https://web.archive.org/web/20181126134850/http://uranium.va...
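For reference, POLY was basically Horner's rule as one instruction (argument, degree, and a coefficient table as operands). The plain-C equivalent is just this; whether the real VAX table ran highest- or lowest-order coefficient first I'd have to look up, so take the ordering here as illustrative:

    /* Horner's rule: evaluate c[0]*x^n + c[1]*x^(n-1) + ... + c[n]. */
    double poly_eval(double x, const double *c, int degree) {
        double r = c[0];
        for (int i = 1; i <= degree; i++)
            r = r * x + c[i];
        return r;
    }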


Well, the meaning of "reduced" has shifted from "not having specialized instructions" to "load-store and preferably fixed-length".

Here's a CRC32 impl using POWER8/9 polynomial instructions: https://github.com/antonblanchard/crc32-vpmsum — coincidentally, you don't have to do this on ARM, ARM just has CRC32 instructions.
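For scale, a hedged sketch of what that one hardware instruction (or the vectorized carryless-multiply trick) is standing in for - the standard bit-at-a-time CRC-32 (reflected 0xEDB88320 polynomial) in plain C:

    #include <stddef.h>
    #include <stdint.h>

    uint32_t crc32_sw(const uint8_t *buf, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int b = 0; b < 8; b++)                  /* one bit at a time */
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
        }
        return ~crc;
    }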

> hardware-level decimal-floats (for those bankers running COBOL)

IBM being IBM :) I've heard somewhere that they are sharing some internals between POWER and z/Architecture (mainframe) chips. No idea how true it is and how much is shared if true, but a scenario like "decimal floats were implemented because mainframes are using the same backend with a different frontend, so we exposed them in the POWER frontend too" sounds quite plausible.


Well, the meaning of "reduced" has shifted from "not having specialized instructions" to "load-store and preferably fixed-length".

Ain't that the truth. Reduced never meant anything; they never defined it. Patterson said they cooked up the name on the way to a grant presentation. I think they should call it Warmed Over Cray. What was new in RISC that wasn't in Cray? It was like the Geometry Engine; it was possible for an academic department to accomplish but that doesn't mean they accomplished anything that hadn't already been done.

preferably fixed-length

Or fixed-length/2.


IBM likes decimal floats because of COBOL (and maybe RPG?).

To be fair, even back in the day decimal arithmetic could at least have an argument made for it on RISC chips. https://yarchive.net/comp/bcd_instructions.html


Which resulted in a particular problem: worst case an instruction could cause 21 different page faults (because of all the operands). That meant that every process had to be able to have an in-core working set of a minimum of 21 (512-byte) pages in order to make progress, at a time when memory was not cheap and plentiful.


More importantly, too many operands with side effects (auto-increments) result in complexity in the handling of exceptions (do you restart, or save partial internal state? Compare the 68020 with the 68040, for example).


Yep. I'm playing with a 68k decoder frontend for a modified BOOM core. Indirect memory accesses are the bane of my existence at the moment.


- 68000s were just broken (couldn't restart a page fault) - we still put unix on it, hacked the compiler to probe the stack to do stack extension on swapping-only systems

- 68010s were essentially a 68000 with this fixed - unix could do a real paging system

- 68020s dumped their entire microcode state on the kernel stack - to do signals correctly (i.e. recursively) unix systems had to be able to spill this to user space and validate it on the way back in

- 68030s and 68040s were saner and restarted exceptions in useful ways


Wow. Having pretty much ventured from 6502 directly to the 386 and later ARM, 68k was something I never really took a close look at. Was that limitation in the original 68000 unintentional, or did they just not design for faulting in virtual memory at page granularity?

After all, you don't need to be able to restart on a page fault if you just swap whole tasks out of and back into memory just before switching to them (which you know when you're going to do), and you still get virtual memory's benefit of the swapped-in tasks being able to occupy different physical, and discontiguous, memory regions without the task itself noticing.

So did they only design for that, or did they actually make a mistake?


They didn't support any MMU at all. People would add one as a separate chip, right between the CPU and the RAM, but this wasn't supported. The CPU was unaware of it. There was no way for the MMU to tell the CPU to take a recoverable fault.

Sun solved this problem in an expensive way. The MMU would make the CPU wait while a second 68000 was used to process the page fault.


Ah, your comment about them being "broken" led me to believe that some form of virtual memory was expected in the design.

Now that you mention it, I recall reading that two 68000s story before (possibly from another comment you made here?), and reading it this second time, it still amazes me. Is it possible to find the source code involved? Not sure what of Sun's kernels is available.


There was no built-in MMU in early chips (68000-68020). An external MMU for the 68000/68010 did exist - a weird thing called a 68451 that mapped arbitrary power-of-two-aligned physical blocks of memory to arbitrary power-of-two-aligned virtual addresses (harder to use and particularly hard to allocate memory for) - it was VERY slow (7 clocks from memory) and was rumored to have been designed by a summer intern at Moto.

You do need to restart instructions if you want your stack to extend automatically; there's a hack you can do to make this happen on a 68000.

You could do a paging system on a 68010/68451 - you had to treat the 68451 as a software refilled TLB

Sun built their own MMU using SRAM; at the time there was a guy going around the Valley designing "Sun-style" MMUs - for IP reasons every one was different, and some were broken (one I ported to couldn't protect the kernel from user mode).

Someone (maybe Sun?) managed to get 68000s to do paging by having two of them and freezing the primary on a page fault and having the second one load the page somehow.

68020s had external MMUs: Moto made the PMMU, which most people used; Philips tried to make one (it never really worked); Apple had a weird pseudo-MMU that fit in a PMMU socket on the Mac II - it emulated the 24-bit addressing of the 68000 (Apple had stupidly used the upper bits in their pointers).

Finally: I think that the 68000 not being able to restart a page fault was probably a mistake, something they hadn't thought all the way through


I've since also found a comment on an interesting blog post in which one of the original designers is quoted as saying that virtual memory was not a consideration for the 68000 initially: http://www.dadhacker.com/blog/?p=1383

Though qualified with "until the design was almost complete", making the answer to my question (as often) much fuzzier than I thought.

And another commenter speculates it was Apollo with the two-68k implementation, but without sources and the commenter is unsure of it themselves.

In any case, very interesting stuff.

EDIT: Someone else claims it was Masscomp: http://www.dadhacker.com/blog/?p=1383#comment-3582


That article kind of mixes up AT&T releases - we'd ported V7 and System 3 to the Lisa (base-and-bounds MMU like the original PDP-11 the ports came from), and to the 68451 and Sun-style MMUs, by mid '84; SVR1 (System V Release 1) didn't exist yet. SVR2 was a full paging release from AT&T (competing with BSD 4.2 by then).

The '451 is more functional than his power-of-2 design (because it offers multiple powers of 2 that you can cobble together into arbitrary regions) - the 'throwaway instructions' he mentions are indeed what we used for stack extension: you do a 'tstb offset(%sp)' at every subroutine entry, where 'offset' is the extent of the stack the subroutine will use; the result of the instruction isn't used, and the fault can be detected by recognizing the instruction at the faulting pc and returning to the address following it.


Thanks for your insights, and for clarifying (I primarily linked to the article for its comments, which seemed to contain a collection of interesting historical tidbits.)

Again as someone who somehow never got to touch 68k during its day, the level of creativity required in 68000 MM, and the internal state spilling of the 68010 and later, really got me curious (x86 of course always had plenty of curious quirks, but I think I'm too familiar with them at this point %) ).

I'd also still love to read more about the details of the two-68000 design, but everything I could find so far was essentially folklore.


I'm afraid I never saw one; we did over 150 unix ports in the mid-to-late 80s and saw just about everything out there.


From what I could glean from the combined 68000+68010 data sheet, the 68010 also dumps some internal state on the stack to support restarts, though that now just makes me more curious about the differences between 68010 and 68020 in that regard.


> worst case an instruction could cause 21 different page faults

Funny story is that this is still kind of a problem today, where a single pointer dereference may hit an mmap'd file, over a network, tunneled through SSH (increasingly arbitrary indirections).

foo->next = bar->next->next can be surprisingly complicated. I don't think most programmers think to handle SIGBUS errors.
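A hedged sketch of the usual sigsetjmp/siglongjmp dance for the mmap'd-file case - illustrative only (a real version would restore the old handler, worry about threads, etc.; names are mine):

    #include <setjmp.h>
    #include <signal.h>

    static sigjmp_buf recover;

    static void on_sigbus(int sig) {
        (void)sig;
        siglongjmp(recover, 1);        /* unwind out of the faulting access */
    }

    /* Returns 0 on success, -1 if the dereference raised SIGBUS
       (e.g. the mmap'd file was truncated or the mount went away). */
    int safe_read_int(const volatile int *p, int *out) {
        struct sigaction sa;
        sa.sa_handler = on_sigbus;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGBUS, &sa, NULL);

        if (sigsetjmp(recover, 1) != 0)
            return -1;                 /* we got here via siglongjmp */
        *out = *p;                     /* the access that may fault */
        return 0;
    }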


Not exactly the same problem, though, since once you dereferenced one pointer, you technically don't need its page anymore. So in a truly pathological case, your bar->next->next could be "page in the page for bar, move pointer to register, throw page away, page in the page for next, move pointer to register ...".

Whereas to me, it sounds like this polynomial evaluation instruction needed all the 21 pages within one (very "large") instruction, meaning you actually need to have those (again, pathological case) 21 different pages present at once?

EDIT: A sibling comment mentioned keeping partial state of (apparently) single instructions instead of fully restarting them. So that might be an option, too.


Also do you like prefixes? Asking for a friend.

If you like x86 at all then prefixes are not a good thing but indeed a great thing. Prefixes are what allowed AMD to completely destroy Intel's Itanium. They simply took x86 and repurposed some one-byte opcodes as the REX prefix to access more registers. Yeah, they committed a few other architectural atrocities before the god of expediency, but they won and Intel had to crawl back and beg forgiveness in the x86 space that they owned.


> one could design a modern CISC-y IA without all the legacy baggage of x86

You are talking about ARM here, aren't you?

But seriously, there is no reason not to say ARM is CISC. It's not the old ISA it used to be anymore.


ARM is load-store and fixed-length, which seem to be the defining traits of "RISCiness" these days :)

btw, one of the "best" x86 features is the string instructions:

https://lobste.rs/s/m7ioti/risc_is_fundamentally_unscalable#...

> the destination operand symbol must specify the correct type (size) of the operand (byte, word, or doubleword), but it does not have to specify the correct location


In thumb2 mode, an ARM will not be fixed-length: instructions may be 16 or 32 bits long.


I'm talking about aarch64, I don't care about 32-bit stuff.


> The amount of compute and/or data motion that can be achieved with a single instruction. This is really about amortizing the cost of decode and allowing the pipeline to be kept full by producing a lot of work from a single instruction.

I'm basing this on my knowledge from CS classes in the late 90s but isn't the point of RISC for the decode to be very simple because the instructions are very simple?

Much of the optimization of x86 has been in the decode stage, where a CISC instruction can be transformed into multiple, often parallel micro-instructions. And parallelism is where we've seen our gains and benefited the most in the last ~10-15 years of CPU design.

Does that mean in order for RISC to be performant some degree of parallelism will have to be done by the compiler? Isn't that what ultimately killed Itanium which was a VLIW processor?


(1) That was part of the point of RISC. Other parts were that compiling for CISC was hard at the time.

(2) The compiler is critical for performance on any platform these days. But RISC/CISC and VLIW vs out-of-order/speculative are separable. You can do an out-of-order speculative RISC machine and it does the same thing an x86 CPU does. (At the micro-op level they're pretty similar). For an OOO processor, the compiler still needs to do a decent-ish job of grouping instructions so that the processor can find the parallelism, because its execution window is limited -- the CPU won't look ahead 200 instructions to find more parallel stuff to execute. The requirements are even more stringent with an in-order VLIW core, where the compiler is responsible for all of it.

I know of more production VLIW cores today than back in the Itanium days. Compilers have gotten better. The GPU manufacturers are an interesting case of this: AMD, for example, was VLIW until GCN in ~2012. NVidia tried VLIW in NV3x, but abandoned it in 2004-2005 with the move to more general-purpose computation on GPUs. Some of the AI hardware accelerators are looking again at VLIW now, probably because the regular, fixed things they do are more amenable to compiler-based scheduling, whereas very fine-grained dynamic scheduling is better done in hardware.


> the CPU won't look ahead 200 instructions to find more parallel stuff to execute

CPUs do:

> Intel increased the reorder buffer by over 50% from 224 entries in Skylake to 352 entries in Sunny Cove. Likewise, the amount of inflight loads and stores has been increased by over 50% to 200 memory operations inflight.

https://fuse.wikichip.org/news/2371/intel-sunny-cove-core-to...


Thanks - rats, I had it in my head it was 192, but that was Haswell.

The CPU won't look ahead 400 instructions! ;-)


At current dispatch rates, 400 instructions is how many nanoseconds? It isn't really looking ahead that far even now!


The future is INSANE.


Ease of decode was one bit but there were a couple of other important ones.

(1) Because RISC chips were simpler they used fewer transistors. Because there were fewer transistors you could fit an entire CPU on a single chip of silicon, which drastically improved performance.

(2) It was much easier to apply pipelining to load-store architectures, while retaining precise interrupts, than to architectures where instructions could contain multiple memory accesses or indirect memory accesses. Pipelining dramatically increased the rate at which instructions could execute.

Also, back in the CISC days the limitations of the memory subsystems available made it supremely important to optimize for minimizing instruction size and instruction bandwidth. This was far less of a concern by the time the RISC paradigm rolled around compared to other issues.


Yes, even if VLIW failed, GPUs are showing us that there is a lot of performance to be had in switching from hardware to software scheduled parallel memory accesses.


GPUs' advantage isn't from scheduling IMO.

You are correct in that GPUs have a degree of software-scheduled memory instructions. There are no out-of-order schedulers in AMD or NVidia GPUs. The best you get is "start memory load/store" and "synchronize" kinds of instructions. On AMD, you perform s_waitcnt to wait for memory to respond. NVidia has read/write barriers + stall counts encoded into each instruction that perform a similar task.

However, I argue that GPU advantage comes from the innate "warp" / "wavefront" model. The programmer will specify 100,000+ "SIMD threads" of execution. There are blatantly obvious ways to optimize the memory access when you have hundreds-of-thousands of threads stepping together.

In particular: with 32 threads operating at a time, you can stripe the data so that the 32 threads all access 128-byte chunks at a time. (Thread-local variable "uint32_t foo" for threads #0 through #31 can be stored in memory locations #0 through #127. When threads #0 through #31 read "foo" they only need to perform two memory transfers, because modern memory reads/writes 64-byte blocks at a time.)
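A rough plain-C picture of that layout (the struct names are made up; 32 threads per warp and 64-byte transactions assumed):

    #include <stdint.h>

    /* Unstriped (array-of-structs): thread i's foo is 64 bytes away from
       thread i+1's, so a 32-thread warp touches up to 32 different lines. */
    struct per_thread { uint32_t foo; uint8_t other[60]; };
    static struct per_thread aos[32];

    /* Striped (struct-of-arrays): the 32 copies of foo occupy bytes 0..127,
       so the warp's reads coalesce into two 64-byte transfers. */
    static struct { uint32_t foo[32]; /* other fields follow */ } soa;

    uint32_t read_aos(int tid) { return aos[tid].foo; }  /* scattered */
    uint32_t read_soa(int tid) { return soa.foo[tid]; }  /* coalesced */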

"Striping" memory access, and grouping memory accesses together (aka coalescing) is very important for modern GPU (and CPU) memory optimization. Its far easier to do on GPUs due to the parallel model. In fact, all thread-local variables are optimized as such in GPU-code, its basically automatic.

CPUs attempt to do the same through auto-prefetching. But it doesn't get as far as the GPU-approach.

GPUs benefit from wide memory busses because of the wavefront / warp model of programming. 32+ threads grab memory, and stall on memory, at the same time. It's more efficient to calculate memory stalls for a 32+ batch of threads (one warp) rather than doing it one at a time like a CPU.


GPUs are used in tasks where the memory access patterns are simple enough that you can predict them ahead of time. In cases where there isn't heaps of parallelism and memory access patterns are predictable people still use VLIW processors. Your smartphone probably has one.


Indeed.

If you look at Patterson's initial paper that makes the argument for RISC, he compares it to the VAX instruction set that had support for managing doubly-linked lists, polynomial evaluation, string matching and CRC among other things!

Did a little explainer video on that: https://youtu.be/o14ecAoGN8w?t=105


This is a relatively misinformed piece of writing.

There is little to no reason why high performance RISC-V implementations can not achieve performance comparable to modern "CISC" cores.

As other commenters have said, VLIW doesn't suddenly make the problem go away and no amount of compiler magic is going to fix the fundamental issue that memory access on a modern processor is highly non-uniform and scheduling is not something that can be done statically ahead of time.

RISC-V is designed to be extended with complex instructions and accelerators and that's the scalable part.


I think it depends on how you interpret "reduced"

- Does it mean reducing the actual number of instructions in the ISA?

- Does it mean reducing the functionality of each instruction to its absolute minimum?

These are similar, and overlapping in places. You can also do one without the other.

I think that the base RISC-V ISA does both to a perhaps unhelpful degree. As soon as you want a RISC-V core to be competitive with other peer ISAs, you need a bunch of the extensions, which minimises the value in calling it reduced in the first place. At the least, it exposes a possible dichotomy between "reduced-ness" and "usefulness".


Reduced in RISC means nothing today. It made sense in the '80s when a RISC ISA allowed implementing a fully pipelined cpu on a single chip. Today RISC basically means a load-store architecture with an easy to decode ISA, with usually but not necessarily fixed-size instructions.


RISC-V even with the basic extension set is still far smaller than the competition. And when you add the 'V' extension it is also considerably smaller than comparable SIMD instruction sets.

A comparable RISC-V core, in terms of features, will always have fewer instructions.


Indeed, but that just means the RISC-V designers have pursued the "fewer instructions" goal for its own sake. Having fewer instructions is not inherently beneficial. It can actually be detrimental: common operations require more instructions.

https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d99...


The RISC-V designers disagree. A number of typical patterns can be optimized easily by macro-op fusion in higher performance cores. The benefit is a decoder that has an incredibly small size, minimizing the minimal core for RISC-V.

Also, RISC-V prefers less in the core spec because, if you have an application that really needs something, there will be standard extensions to add it.


> minimizing the minimal core for RISC-V

Again: why should I care about that? As a user, I care about big desktop-class cores, not academic minimal cores that fit on FPGAs. Small decoder is not a benefit for real world usage.

> there will be standard extentions to add it

That is, there will be fragmentation.


> Again: why should I care about that?

Have you considered that the world doesn't revolve around you?

You are in fact wrong; many 'real world' users like how small these cores can be.

And given that many SoC now have lots and lots of small cores in them, having those cores be as small as possible is beneficial.

> That is, there will be fragmentation.

Yes. There will be fragmentation because the industry is so broad - ranging from minimal soft cores to massive HPC systems - that a true one-size-fits-all would have been doomed from the beginning and was never a viable design goal.

RISC-V is designed to approach the problem of a universal open-source ISA. It accepts that avoiding fragmentation is impossible, and thus they tried to build something that makes fragmentation manageable, both in terms of the organisation of the standard and in terms of the tools.


Well, RISC-V is ultimately just a trademark/brand, as is Advanced RISC Machines (ARM), nothing stops them standardizing a CISC-like extension and keeping the brand if needed.


The author fundamentally misunderstands what RISC is, what CISC is, what SIMD is, and what VLIW is - apart from misunderstanding every major computer architecture concept.

RISC was invented as an alternative approach in an era when processors had really complex instructions, with the idea that high level languages could be efficiently compiled to them and assembly programmers would be efficient if they could do many things with one instruction. The RISC philosophy was to make simpler instructions, let compilers figure out how to map high level languages to simple instructions, and therefore fit the processor on one die (yes, "processors" used to be several chips) and run it at high clock speeds. RISC is not a dogma, it is a design philosophy.

On top of that exception handling in complex instructions is hard. Implementing complex instructions in hardware consumes considerable design and validation effort. RISC has won for these reasons.

Some things have changed; we can fit really complicated processors on a single die. Memory access is the bottleneck. The downsides of RISC in this reality are well known: it takes many more instructions to do the same thing, which means the instruction cache is used inefficiently (anyone remember the THUMB instruction set of ARM?). It might be useful to add application-specific hardware acceleration features, because we now have the transistors to do it. How does that make RISC unscalable?

Many CISC machines (eg Intel's) are CISC in name only. The instructions are translated to micro-ops in the hardware. The micro-ops and the hardware itself, is RISC.

Register re-naming was invented to ensure that we enjoy the benefits of improvements in hardware without having to recompile. Let us assume you have a processor with 16 registers. You compiled software for it. Now we can put in 32. What do you do? Recompile everything out there or implement register re-naming?

VLIW failed because they took the stance that if we remove hardware-based scheduling, the extra transistors can be used for computation and cache, and scheduling can be done by compilers anyway. The reason they weren't successful is that if a load misses the cache, you wait, whereas superscalars would have found other instructions to execute. On top of it, if you had a 4-wide VLIW and then you wanted to make a 8-wide one, you had to recompile. And oh, the "rotating registers" in VLIW is a form of register re-naming.

Poorly informed article.


> Many CISC machines (eg Intel's) are CISC in name only. The instructions are translated to micro-ops in the hardware. The micro-ops and the hardware itself, is RISC.

This is a half-truth at best. Unless you work at Intel/AMD/etc. as a chip designer odds are you do not know what is "really happening" behind the scenes, so the underlying implementation is whatever they "want it" to be. The underlying implementations can even change from microarchitecture to microarchitecture.

So internally we might make a guess that microcode must look more "RISC-y" after transformation, given the fact that "some transformation" must be happening. The larger internal (renamed-to) register file was a common trait of RISC architectures, after all, and a lot of microcoded systems historically have been regarded as "RISC-y." But the existence of modern optimizations like macro-operation fusion suggests that the internals of today's CPUs are much more ambiguous with regard to "RISC-ness" vs. "CISC-ness."

Fun fact: the original ARM1 processor in 1985 was microcoded - https://en.wikichip.org/wiki/acorn/microarchitectures/arm1#D...


> Many CISC machines (eg Intel's) are CISC in name only. The instructions are translated to micro-ops in the hardware. The micro-ops and the hardware itself, is RISC.

This is an oft-repeated but incorrect statement. Modern x86 CPUs perform macro-op fusion and micro-op fusion. As an example of the former, a comparison and a jump instruction can be fused into a single micro-op [1], which is decidedly non-RISC. As for the latter, some micro-ops perform a load from memory and an arithmetic operation with the retrieved value – also very non-RISCy.

Modern x86 CPUs are CISC above and below the surface.

[1] https://en.wikichip.org/wiki/macro-operation_fusion#x86


Yeah, vertical microcode always looked pretty RISCy if you're not used to it.

And FWIW, I've heard that there are two distinct uOp formats inside a single core these days for quite a few of the uArchs. There's the frontend's view, which is concerned with amortizing decode costs (so wide, fixed-purpose instructions that otherwise look pretty CISC), and the backend's uOps, which are concerned with work scheduled on functional units. A lot of the fusion happens on the frontend, and a lot of the cracking happens on the backend; the frontend tends to be two-address, and the backend three-address. So, for example, a frontend's

   and rax, [rbx + addr]
is something like

   ld_agu t0, rbx, addr
   ld t1, t0
   and rax, rax, t1
on the backend.


I'd add that horizontal microprogramming well predates RISC. x86 μops are wide.


> The reason they weren't successful is that if a load misses the cache, you wait, whereas superscalars would have found other instructions to execute.

I'm not sure I quite buy that. In practice, what would happen is that a suitably large, optimized VLIW core would start fetching more than one wide instruction in a cycle and issuing the resulting ops in parallel with interlocks for dependencies, etc... Effectively, that is, VLIW would drop the explicit promise of the instruction set and turn into a speculative RISC core internally. And the cost for that translation would have been very comparable with what we see in all the existing very successful x86 CPUs.

But this never happened, because we never got that far. VLIW failed for other reasons. This particular problem had an obvious safety valve.


Denver (and Crusoe before it) implements OoO behaviour on top of a VLIW core via dynamic translation. It performed fairly competitively, but in the end the complexity moved from one place to the other and was hardly worth it.


The Denver team came from Transmeta but they didn't bring VLIW with them. Denver is a very wide (7+ wide) in-order superscalar pipeline.

https://en.wikichip.org/wiki/nvidia/microarchitectures/denve...

I can't even understand how OOO VLIW would work.


The rumour I have heard is that Denver is indeed a VLIW machine (in-order superscalar does not exclude that). It might be wrong of course.

Regarding the OoO part, the dynamic scheduling is supposedly done by a JIT layer in firmware.


Everything I've read/heard is that it's superscalar. They've gone from 7 wide to 10 wide with Carmel. That's a lot of width to schedule and/or waste.

The other big difference is that Denver uses a HW ARM decoder vs Transmeta's SW decoder. That was smart since they could license the IP from ARM whereas Intel would fight them every step of the way.


It seems that we had the same discussion 2 years ago on HN :). I can't find any authoritative source so I'll concede. The guys at RWT are pretty sure it is a VLIW though.


> I'm not sure I quite buy that. In practice, what would happen is that a suitably large, optimized VLIW core would start fetching more than one wide instruction in a cycle and issuing the resulting ops in parallel with interlocks for dependencies, etc... Effectively, that is, VLIW would drop the explicit promise of the instruction set and turn into a speculative RISC core internally.

Well, at this point it would be quite insane to stay VLIW except for backward compat; but I suspect that the backward compat in question would cost way more than, e.g., the x86 tax. For example, x86 is dense, which is actually an advantage, but quite hard to decode (which is the main drawback of x86, and even then it's a quite relative drawback, because modern processes let us have complex decoders). What would VLIW have left for itself if internally turned into a classical modern speculative core? Maybe even load-time software translation would perform better...


>Register re-naming was invented to ensure that we enjoy the benefits of improvements in hardware without having to recompile. Let us assume you have a processor with 16 registers. You compiled software for it. Now we can put in 32. What do you do? Recompile everything out there or implement register re-naming?

Just run the software? Nothing forces software to use all available registers; if yours was compiled for 16 and the CPU has 32, it uses 16 registers. Modern CPUs have plenty of additional instructions too, and you don't need to recompile your software because one CPU has the RDRAND instruction and some other doesn't (minus AMD breaking RDRAND, but that's a different topic).

>The micro-ops and the hardware itself, is RISC.

I would argue that µOps are VLIW wearing a RISC hat; it's very VLIW-y what's happening under the hood in x86, just less coherent.

>On top of it, if you had a 4-wide VLIW and then you wanted to make a 8-wide one, you had to recompile. And oh, the "rotating registers" in VLIW is a form of register re-naming.

As mentioned above, in that case the software doesn't perform as well as it could but there is no reason it would stop working if you properly designed it.


>Many CISC machines (eg Intel's) are CISC in name only. The instructions are translated to micro-ops in the hardware. The micro-ops and the hardware itself, is RISC.

No, that's just a popular HN myth. First of all, micro-ops are not part of the public ISA, which is what the entire RISC vs CISC debate is about. If the public ISA does not matter and you can always convert to whatever is best, then RISC loses, because its raison d'être is to enable optimizations through a better ISA. Second, using microcoding (not the same as micro-ops) to implement a huge number of instructions is an integral part of CISC; if you are saying microcode is RISC then CISC did RISC way before RISC even existed. RISC was all about removing microcoding to simplify CPU designs, so clearly if your CPU has extensive microcode it is not RISC and definitively not "CISC in name only".

If CISC is RISC then why even bother with RISC? After all CISC does "RISC" better.


RISC was invented as an alternative approach in an era when processors had really complex instructions

Yes, RISC was simpler than the VAX-11 (which was simpler than the 432). But what did RISC do that Cray hadn't already done? It was a Cray on a chip sans vector. Even the CDC 6600 (also co-designed by Cray) was load/store.


1. CPUs are about minimizing latency. CPUs aren't designed to scale, they're designed to execute your (presumably single-threaded) program as quickly as possible. This means speculatively executing "if" statements and speculatively predicting loops, renaming registers and more.

2. GPUs are the scalable architecture: The 2080 Ti has 4352+ SIMD cores (136 compute units). And NVidia can load 30+ threads per compute unit, so 130560+ conceptual threads (kinda like hyperthreading) can exist on a GPU executing at once.

3. VLIW seems like a dead end. AMD GPUs gave up their VLIW instruction set with the move to GCN around 2012. Instead, the SIMT model of AMD GCN or NVidia PTX has proven to be easier to program, easier to schedule, and easier to scale. If you want high scaling, you should do SIMD or SIMT, not VLIW. Intel, ARM, Power9, NVidia, and AMD have all chosen SIMD for scaling (AVX, ARM SVE, Power9 vector extensions, NVidia PTX, and AMD GCN).

4. I think VLIW might have an opportunity for power-efficient compute. Branch prediction and out-of-order execution in modern CPUs rely on Tomasulo's algorithm + speculation, which feels like it wastes energy (in my brain anyway). VLIW would bundle instructions together and require less scheduler / reordering overhead. If a company pushed VLIW as a power-efficient CPU design... I think I'd believe them. But VLIW just seems like it'd be too unscalable compared to SIMD or SIMT.

5. Can we stop talking about RISC vs CISC? Today's debate is CPU (latency-optimized) vs GPU (bandwidth-optimized). The most important point is both latency-optimized and bandwidth-optimized machines are important. EDIT: Deciding whether or not a particular algorithm (or program) is better in latency-optimized computers vs bandwidth-optimized computers is the real question.


All good points, but:

> VLIW seems like a dead end.

Since you mentioned this in the same breath as GPUs, I feel I have to point out that according to some reverse engineering paper, Nvidia's Turing is a VLIW architecture. (I'm talking about the actual hardware here, not PTX.)

Presumably they have some reason for that that's unrelated to increasing IPC, since AFAIK their GPUs aren't superscalar.


> Since you mentioned this in the same breath as GPUs, I feel I have to point out that according to some reverse engineering paper, Nvidia's Turing is a VLIW architecture. (I'm talking about the actual hardware here, not PTX.)

NVidia Volta / Turing has been reverse engineered here: https://arxiv.org/abs/1804.06826 . EDIT: I'm talking about the actual hardware, the SASS assembly, not PTX.

It doesn't look like a VLIW architecture to me. Page 14 for the specific instruction-set details. I realize this is mostly a matter of opinion, but... those control-codes are very different from the VLIW that was implemented in the Itanium.

> On Volta, a single 128-bit word contains one instruction together with the control information associated to that instruction.

So NVidia has a 128-bit instruction (16-bytes) with a LOT of control information involved. The control information encodes read/write barriers, but there is still one-instruction per... instruction.

The "core" of VLIW was to encode more than one instruction-per-bundle. Itanium would encode maybe 3-instructions per bundle for example.

What NVidia is doing here is having the compiler figure out a bunch of read/write/dependency barriers so that the GPU won't have to figure it out on its own (I presume this increases power-efficiency). The only thing similar to VLIW is that NVidia has a "complicated assembler" which needs to figure out this information and encode it into the instruction stream. Otherwise, it is clearly NOT a VLIW architecture.

> Presumably they have some reason for that that's unrelated to increasing IPC, since AFAIK their GPUs aren't superscalar.

NVidia Turing can execute floating-point instructions simultaneously with integer-instructions. So they are now superscalar. The theory is that floating-point units will be busy doing the typical GPU math. Integer-instructions are primarily used for branching / looping constructs (and not much else in graphics-heavy code), so they are usually independent operations.


> VLIW seems like a dead end.

VLIW has been amazingly successful in DSP. All the world's baseband processors use VLIW machines. Many of the accelerated video decoders are also VLIW machines under the covers.


While I recognize that DSPs are power-efficiency champs, I'm not aware of any DSP (or VLIW architecture) being used in top supercomputing lists, like TOP500 or GREEN500.

It's all CPU-based, GPU-based, or that weird "PEZY" from Japan (which is just... weird). Who knows what the future holds... but SIMD is currently making a big change to the supercomputing landscape (both in CPU design with AVX512 and GPU design with NVidia).


I'm pretty sure that's because they are hard to program for.

Although automatic compilers and instruction schedulers for VLIW machines exist, they aren't nearly as good as hand-tuning. VLIW machines have a niche where a fairly small amount of code is very arithmetic-intensive, predictable, and executed in a power-sensitive environment.

In HPC and most business workloads, you just can't dedicate the labor required to making your code work well on them.


> I think VLIW might have an opportunity for power-efficient compute.

Well, I think if such advantage is real, it leads to taking over the servers, HPC, and obviously, mobile.

I imagine some hybrid approach will win in the end. I do expect transparent graph processors to come from VLIW with real-time optimization done by software, OoO style. This combo feels unbeatable, but all the real problems hide in the details anyway.


From my understanding, mobile chips still have VLIW DSPs in them. So it seems like VLIW designs have some advantage in minimizing energy usage.

Qualcomm Hexagon (aka: Snapdragon 855) is still used for a lot of mobile applications, and it's definitely a VLIW architecture. http://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_hotchips20...

But VLIW isn't popular anywhere else by my understanding.


> (which, incidentally, also makes MILL immune to Spectre because it doesn’t need to speculate)

Mill still provides speculation, but it is exposed as part of the user-visible architecture. Before Spectre was publicized, the Mill team proposed using speculation in a way that would make Mill systems vulnerable to Spectre:

https://news.ycombinator.com/item?id=16125519

There is an obvious fix for this (avoid feeding speculated values into address calculations), but they didn't say how much it costs in terms of performance.


Is Mill happening? Has there been silicon yet? I have been reading about it for so long, but have not seen any peer reviewed work on it, silicon or soft cores.

I would like to see more exotic architectures out there, but I think I speak for many when I say that I am starting to question if the Mill architecture is going to land in a major way.


If it happens or not, the Mill team still produced some of the most interesting talk videos on youtube that you can watch. I would highly recommend anyone who hasn't watched the series to do so, it's very in depth and interesting.

Personally I hope that Mill happens. Whether or not it will beat existing CPUs left and right can then finally be answered.



Latest I saw on the forum was that they are writing a microkernel os so that they can run benchmarks and more importantly tests.


He talks about 'the laws of physics' meaning that RISC can't scale, neglects prefetching completely, and then talks about the vaporware Mill CPU as being some sort of solution because it does 'deferred loads that take into account memory latency'.


That sounds a bit dismissive.

Deferred loads do have a massive advantage; your compiler knows best when loads are needed, so it can, for example, run the load for the next array value while still processing the current one, allowing the CPU to not stall at all (because the compiler can know instruction latencies and count out when to start the load).
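A toy C version of that idea (a sketch of software pipelining done by hand, not Mill code): the next element's load is hoisted above the work on the current one, so the load latency overlaps useful computation.

    float sum_squares(const float *a, int n) {
        if (n == 0) return 0.0f;
        float cur = a[0], sum = 0.0f;
        for (int i = 0; i < n - 1; i++) {
            float next = a[i + 1];  /* start the next load early...          */
            sum += cur * cur;       /* ...while still working on the current */
            cur = next;
        }
        return sum + cur * cur;
    }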

Prefetching is an optimization of caching, it still doesn't beat the speed of signal in a silicon CPU, nor does it solve the scalability issues mentioned in the article.


> compiler knows best when loads are needed

History has shown repeatedly that in fact compilers do not know best outside some very specific use cases. Whether a load needs to be issued early, and how early, is a dynamic property of a program and is hard to compute statically.

Prefetching and deferred loads have existed for a long time, but they didn't help, e.g., Itanium.


How is loading a value from memory when an appropriate opcode comes in a dynamic property? Mill seems to be doing fine with their compiler, and since Itanium, compilers have gotten a lot smarter.


Whether a load is fulfilled from L1, L2, L3 or main memory is a dynamic property of the program and very hard to predict statically.

Aliasing is also a dynamic property, but read ahead hardware should be able to handle it (but failure might stall the whole pipeline).

Saying that Mill is doing fine might be a tiny bit optimistic, seeing that the whole thing is still vaporware. Do they have a working compiler now?


At least judging from the latest posts on their forums, they have a working C compiler and are working on a C++ compiler to get a basic kernel for test harnesses running.

I don't see why it matters to a deferred load being issued if the data is in L1, L2 or L3. The compiler emits a load, the caching hardware handles the rest, once the load arrives the pipeline unstalls.

The compiler can at least assume an L1 cache latency and then put some cycles between the load and execution. The earlier you can issue the load the better; if too early, you might ruin it if you get preempted. Either way, it should not be hard to solve in VLIW.


You have your facts almost completely backwards. Compilers don't know best and any company who has banked on that for general purpose CPUs has been burned hard. Your explanation of what a deferred load is, is actually more of a description of prefetching.

The compiler also does not necessarily know instruction latencies since they change from one CPU to the next.

Out of order execution already does the job of deferred loads. Loads can be executed as soon as they are seen and other instructions can be run later when their needed memory has made it to the CPU. This is why Haswell already had a 192 instruction out of order buffer. OoO execution also schedules instructions to be run on the multiple execution units and ends up doing what compilers were supposed to do with VLIW CPUs.

> "Prefetching is an optimization of caching, it still doesn't beat the speed of signal in a silicon CPU, nor does it solve the scalability issues mentioned in the article."

None of this is true. Prefetching looks at access patterns and pulls down sections of memory before they are accessed. Caching is about saving what has already been accessed. I'm not sure what you mean by 'beating the speed of signal', but if you are talking about latency, that is exactly what it deals with. By the time memory is needed it is already over at the CPU. The article talks about issues that are due to memory latency (which many modern CPU features deal with one way or another), and prefetching directly confronts this. Instruction access that happens linearly can be prefetched.


There's SW prefetch instructions and HW prefetching engines. SW prefetch instructions have largely been a bust. Linus is famous for ranting about them.

https://yarchive.net/comp/linux/software_prefetching.html

HW looks at access patterns (as you say) and does at least as good a job.


Yep, I've tried to use prefetch intrinsics multiple times and I've never been able to beat the CPU and speed things up.
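For anyone who hasn't tried it, this is roughly what such an attempt looks like with the GCC/Clang builtin - the prefetch distance (16 elements here) is a per-machine guess you have to tune, which is a big part of why it rarely beats the hardware prefetcher:

    /* __builtin_prefetch(addr, rw, locality) is only a hint, so prefetching
       a little past the end of the array is harmless in practice. */
    long sum_prefetched(const long *a, long n) {
        long s = 0;
        for (long i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/3);
            s += a[i];
        }
        return s;
    }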


>The compiler also does not necessarily know instruction latencies since they change from one CPU to the next.

The instruction latencies are actually all documented, and modern compilers do take care when generating assembly code to optimize the output a bit so as not to stall the pipeline too often.

I don't see how compilers, which have greatly improved since Itanium, shouldn't be able to tell when a load is best issued and be able to keep the pipelines full.

>None of this is true. Prefetching looks at access patterns and pulls down sections of memory before they are accessed.

That pretty much sounds like an optimization to caching to me. Granted, it's not precisely a caching algorithm - it's more of a cache predictor - but it's cache in the end.

Compilers know best how to order instructions to ensure that loads happen when needed, and sufficiently far beforehand that the data arrives in time without having to stall any part of the pipeline. The CPU solves it at runtime; why can't the compiler do it at compile time when all the same information is available?


The bandwidth of memory bus is still a limiting factor.


I find latency to be the much bigger problem. For the types of workloads where bandwidth is limiting you tend to find they run much better on GPUs.


I don't think memory bandwidth is the limiting factor for a modern CPU. If instruction fetching was not only consuming most bandwidth but also dominated it, then there would be little benefit to SIMD if you could just do multiple instructions instead as the bandwidth is taken up anyways (notably in multicore systems).

Plus, to my knowledge, there is no single-core performance difference at the same clock between a 4-core and a 12-core AMD CPU, which there would be if instruction fetching were maxing out memory bandwidth.


Reminder: all current, performant CISC CPUs just internally (in microcode) compile instructions down to smaller uOPs, which are then executed by a RISC-like core. CISC chips are just RISC chips with fancier interfaces.


CISC CPUs also take multiple instructions and combine them into larger uOPs through macro-op fusion.

But whether or not the CPU's uOPs are RISC isn't really relevant here. The article is talking about pressure on things like L1. The argument is that CISC becomes almost a form of compression. If the CPU internally splits an instruction into multiple uOPs that's fine - you still got the L1 savings, and those uOPs can potentially be more specialized. The CPU doesn't need to look ahead to see if the intermediate calculation is kept or anything like that.

As in, it's overall more efficient to take a macro op and split it than take micro ops directly.


It used to be an architectural distinction but now it's just marketing, if even that much.

When PowerPC started adding complex vector instructions (AltiVec) and the Pentium turned x86 into a RISC core with translation layer around it you knew the distinction was pretty much dead.


I don't agree with the proposition that vector instructions (SIMD) are inherently non-RISC. RISC is about whether the "I" is a reduced instruction set, not whether or not "D" is multiple.


This is nothing new, and unnecessarily competitive.

RISC-V is designed to make pipelining very efficient, but there has always been a limit. RISC-V just helps you get close to that limit with limited complexity.

Beyond that, part of RISC-V will be the 'V' standard extension, which will give you access to an advanced vector engine that is an improvement on many of the ways we do SIMD now.


The irony here is that most modern CISC designs break instructions into RISC-like μOps. Moore's law also means you have more transistors for the same area - now figure out how to use them creatively to increase performance. Workloads are constantly evolving and hardware evolves with them to make those workloads fast.


> The irony here is that most modern CISC designs break instructions into RISC-like μOps.

... as well as combining two (or even more?) instructions into one μOp.

https://en.wikichip.org/wiki/macro-operation_fusion


CISC since around the turn of the millennium is basically a custom tuned high decode speed data compression codec for RISC-like micro-ops. It's been a very long time since anyone designed a CISC processor that actually ran (non-trivial) CISC instructions directly in silicon.

The root of CISC's persistent dominance over true RISC instruction sets is that memory bandwidth is far lower than what would be needed to feed micro-ops directly into the CPU. It makes sense to solve that by compressing the instruction stream. RISC looks far better on paper in every other way if you ignore memory bandwidth and latency issues.

That being said, I've wondered for many years about whether a more conscious realization of this might lead to a more interesting design. Maybe instead of CISC we could have CRISC, Compressed Reduced Instruction Set Computer? Instead of CISC you'd have some kind of compression codec that defines macros dynamically. I'm sure X64 and ARM64+cruft are nowhere near optimal compression codecs for the underlying micro-op stream. If someone wants to steal that idea and run with it, be my guest.


The other advantage of the CISC is that it acts like a higher-level API. Many early RISC designs suffered because they were so low-level that early implementation details (like wait states) had to be "emulated" in later processors for compatibility.

It might not be advantageous to just compress a RISC stream of instructions instead of higher level instructions made up of micro-ops for that reason alone.


Dynamically swapping compression like that probably isn't worth it, as now the decode tables are extra state that needs to be compared against, all while inside a critical path.

But most modern RISCs do take a sort of Huffman encoding perspective on ISA design, starting with SH, into Thumb(2), and into RVn-C. I do agree that there's farther we can probably go; stuff like memory referencing ALU ops can be thought of as a way of addressing PRF registers without using any bits in the instruction stream for instance.


You mean like Thumb-2?


The micro-operations are dispatched to many parallel execution units, though, so it's really better described as VLIW.


No, that's just superscalar. The defining feature of VLIW is that the compiler schedules the dispatch.


The ports/schedulers have quite different capabilities however:

https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

Using the Intel VTune tools you can see how each port is utilized, so you could in theory change your code to mix instructions for better utilization beyond what the CPU's own reordering can achieve - so I can see some analogy with building a VLIW instruction group.

There's a crazy amount of performance counters you can look at (the perf tool can do that too, but just try running "perf list" to view available counters).


Note that the post seems to be discussing the old-school RISC reasoning, more than what we consider "RISC" in the modern world. Now RISC is just a style of ISA (and/or, more informally, a description of some internal aspects of CPU pipelines - regardless of the ISA - behind the decoder and register renaming; and even then that's very far from all aspects). And btw, just because an instruction is called "Floating-point Javascript Convert to Signed fixed-point", with "Javascript" in its name, does not disqualify it from appearing in a "RISC" ISA (or even an old-school RISC uarch). At all.

The (proven) modern high-perf microarchitecture for generalist CPUs is pretty much the same for everybody nowadays (well, some details vary of course, but the big picture is the same for everybody), so a RISC ISA is not necessarily very interesting anymore, but also not necessarily a big problem. However, if we take things that still can have an impact on the internals, I would prefer RMW atomic instructions to LL/SC most of the time (arguably LL/SC is the "RISC" way to go). Hell, I would even sometimes want posted atomics...

So back to the topic: if RISC is to be interpreted as microcode-like/near-microcode programming, yeah, the RISC approach failed for state-of-the-art high perf in the long run, and it did so a very long time ago -- but was it even thought by anybody that it would win? Not so sure. It was more a nice design point for another era, one that only lasted a few years. Anyway, the term has become overloaded a lot, and has produced crucial results, even if indirectly. And I agree a lot with the opinion that a Skylake/Zen-like uarch is the way to go, and even VLIW is dubious (or already known as a failure if we take Itanium as the main example) -- I don't even think the Web or anything can save it, to be honest. But conflating CISC with the presence of an instruction useful for Javascript is nowhere near the right interpretation of the term "RISC", regardless of which one is chosen. I mean, at a point you absolutely can have a wide variety of execution units without that disqualifying you from having the main "RISC" aspects.


Surprised with FPGAs that there isn't a wave of new assembly languages. Or that processors in memory haven't taken off.

Parallel prefix, vector addition, vector transpose, 32x32 dense matrix multiply ...


The basic reason is price, which appears to be down to commercial decisions, rather than manufacturing cost.

FPGAs which are large and fast enough to make interesting CPUs are readily available, but they are far too expensive, compared with buying an equivalent CPU core. For perspective, the FPGAs you can rent on Amazon are listed at around USD $20,000 per individual FPGA chip!

Much cheaper and smaller FPGAs exist, but they aren't generally cost-effective or performance-competitive against an equivalent design using a CPU, even with the advantages provided by custom assembly languages and hardware extensions.

There are times when it's worth it. People do implement custom assembly languages on FPGAs quite often, for custom applications that benefit from it. I've done it, people I work with have done it, but each time in quite specialised applications.

(CPU manufacturers use arrays of FPGAs to simulate their CPU designs too.)

Processors in memory is another thing entirely. These are actively being worked on.

GPUs using HBM exist, where the HBM RAM is a stack of silicon dies laid directly on top of the GPU die, with large numbers of vertical interconnections. These behave similarly to processors-in-memory, because there are a lot of processing units (in a GPU), and a lot of bandwidth to reach the memory from all of the units.

Some studies show diminishing returns from increasing the memory bandwidth further, with the present GPU cores and techniques, so it's not entirely clear that intermingling the CPU cores with the RAM cores on the same die would bring much improvement.

There is a physical cost to mingling CPU and bulk RAM on the same die: the optimal silicon elements are different for CPUs than for bulk RAM, so manufacturing would either be more expensive or have to compromise on the elements.


> HBM RAM is a stack of silicon dies laid directly on top of the GPU die

nitpick: actually they're off to the side, connected via the interposer:

https://images.idgesg.net/images/article/2019/02/amd-vega-ra...


Thanks!


About 1 ns of latency per foot at the speed of light. Having the CPU only six inches from the RAM already means 1 ns of round-trip latency. The only way to lower that cost is to physically move the CPU closer.


I was under the impression that CISC processors have utilized instruction pipelining for a long time already.

Here's what Wikipedia has to say about it:

The terms CISC and RISC have become less meaningful with the continued evolution of both CISC and RISC designs and implementations. The first highly (or tightly) pipelined x86 implementations, the 486 designs from Intel, AMD, Cyrix, and IBM, supported every instruction that their predecessors did, but achieved maximum efficiency only on a fairly simple x86 subset that was only a little more than a typical RISC instruction set (i.e. without typical RISC load-store limitations). The Intel P5 Pentium generation was a superscalar version of these principles. However, modern x86 processors also (typically) decode and split instructions into dynamic sequences of internally buffered micro-operations, which not only helps execute a larger subset of instructions in a pipelined (overlapping) fashion, but also facilitates more advanced extraction of parallelism out of the code stream, for even higher performance.[1]

And Intel has an article on how instruction pipelining is done which covers CISC designs as well.[2]

1. https://en.wikipedia.org/wiki/Complex_instruction_set_comput...

2. https://techdecoded.intel.io/resources/understanding-the-ins...


When RISC first came out, nobody knew how to do pipelining with a CISC ISA. But half a decade and many engineering-years of effort later, Intel was able to bring out a pipelined x86 processor. It probably would have taken even longer with a CISCier ISA than x86, which at least lacks memory-indirect loads and stores.


> The problem is that a modern CPU is so fast that just accessing the L1 cache takes anywhere from 3-5 cycles.

Access to RAM on a MOS 6502 is about 2-6 cycles, isn't it? The total amount of RAM in a 6502 system is usually roughly the size of a modern L1 cache as well. It would be interesting if you could just program with an L1 exposed and structured as a main memory rather than a cache ...


> It would be interesting if you could just program with an L1 exposed and structured as a main memory rather than a cache ...

During the boot of the computer, you can. At least you could a few years ago; I'm not sure about brand-new processors. Anyway, boot was (and maybe still is) done in multiple phases, and at first the RAM is not available. So at first the code only uses registers. But this is extremely limiting (I'm not sure there are even compilers capable of emitting such code, so you have to write that shit in assembly language). So one of the first things to do, even before the memory controller is programmed and calibrated, is to flip a bunch of model-specific registers, run a tiny loop to touch a range of addresses, and update the MSRs again to prevent the cache lines from being evicted, and BOOM! You're in Cache-as-RAM mode. Later you initialize the memory controller and leave Cache-as-RAM. I'm not sure if you can hard-limit it to L1, though (well, obviously you can if you restrict yourself to a small enough range). Probably L1+L2 is used most of the time? Not really sure...

Maybe you can take a look at coreboot and actually program the application you want in there instead of under a traditional OS. Seems fun :P


> Access to RAM on a MOS 6502 is about 2-6 cycles, isn't it?

One cycle. Of course, the cycles are a lot slower.

> It would be interesting if you could just program with an L1 exposed and structured as a main memory rather than a cache ...

The Cell processor, used in the PlayStation 3, provided something like that. What happened in practice was that programmers trying to get code working soon enough to be useful just ended up leaving most of the theoretical performance on the table, which is why the next generation of consoles scrapped the idea in favor of a conventional x64 CPU.


> Access to RAM on a MOS 6502 is about 2-6 cycles, isn't it?

The 6502 accesses the memory bus every cycle, although for some cycles of some instructions the value read isn't used; as a result, RAM and ROM in a 6502 system need to be fast relative to the CPU.

Perhaps you're thinking about the instruction latencies, which are 2-6 cycles, depending on the instruction and the addressing mode and what not. (BRK (0x00) aka software interrupt is 7, though)


With the embarrassment of riches in silicon real estate, I'm surprised there aren't SoC chips with a hundred or more megs of on-chip RAM. If we have room for 32 cores, then there have to be applications for 8 cores and RAM on-chip.

The HDD in my old 486 from the early '90s would almost fit in a modern L3 cache.


That RAM costs both space and power/heat, especially high-speed cache RAM. That would make the SoCs huge and expensive for no particularly good reason.

Also large caches need a lot of logic to ensure coherency between multiple cores - with a small cache the probability of conflict isn't that big, so you can afford to keep it simple. With a huge cache such conflicts between cores would be pretty much assured and you would have to dedicate a lot of silicon just for managing access.


Zen 2 has 16MB L3 per CCX, so e.g. 70MB total on the 12-core. Not a hundred, but quite a lot. It's Huge.


Just like the saying "never bet against JavaScript" has been proven true, "never bet against CISC" is also true. Just when you think it's become way too complicated, expensive, and inelegant, it keeps chugging along.


Might the solution be to give the programmer "manual" access to all levels of memory, so that we can choose which memory to use, when, and from which core? For the same reason that the OS should not decide which core runs which thread: you cannot make progress unless you give people the power to improve things.

In extremis this also means the mono-kernel has to go, but then we're talking about a big change.

Anyway, I don't think the improvements are going to come from hardware or software alone; we need to improve both simultaneously, which is complex and often requires one (or at least very few) person(s) to do the job.


The value of RISC these days is that the ops are sufficiently small that computer-assisted humans can write not a JVM but an x86-VM-code interpreter distributing execution across a number of functional units simultaneously.

Even that is super hard, given that the leading producer of hardware interpreters for this VM shipped hardware with the Spectre vulnerability.


...RISC-V supports SIMD though.


That's not the point of the article. It doesn't say that RISC-V doesn't support SIMD; it argues that in the real world we prefer big instructions like SIMD that do a lot of things "at once", which is not what RISC-V was designed for, even though it supports some.


RISC-V was designed from the beginning to work perfectly well with vector processing. The lab that developed RISC-V built it to have an architecture that would let them plug in different vector units and experiment with different vector architectures. The Berkeley Parallel Computing Lab is where they worked on RISC-V.

The RISC-V V extension is the result of this work and exploration into parallel architecture.

RISC-V was meant to be modular, and the V extension is just as much a part of RISC-V as the F extension.


I honestly don't have any opinion about it, as I'm not knowledgeable about this stuff. I was just restating what the article says, and it didn't say that RISC-V doesn't have SIMD.


I think SIMD is actually quite compatible with the RISC philosophy, since SIMD is just a straightforward extension of normal arithmetic primitives.
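A minimal sketch of that point, assuming the GCC/Clang vector_size extension: the vector add is the same arithmetic primitive as the scalar add, just applied to every lane.

    /* SIMD as an extension of scalar arithmetic: the same `+`, four lanes
     * at a time (GCC/Clang vector extension). */
    typedef int v4si __attribute__((vector_size(16)));   /* four 32-bit lanes */

    int  scalar_add(int a, int b)   { return a + b; }
    v4si vector_add(v4si a, v4si b) { return a + b; }    /* element-wise add */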


What about the SGI Onyx?

https://m.slashdot.org/story/12659

It seems that MIPS was scalable back then.

There was a headline about a 16-core RISC-V just the other day.


That's not the scalability the article is talking about. It's about instructions per cycle, and the fact that this gets pretty firmly capped by the bandwidth and latencies of the cache hierarchy, so even with ~6 parallel execution units it's pretty rare to be able to fill more than 2 of them in a cycle. And this is true, but it has absolutely nothing to do with instruction architecture.
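A toy sketch of why the cache hierarchy, not the ISA, caps IPC: the pointer chase below is serialized on load latency no matter how many execution units exist, while independent accumulators can actually keep several of them busy.

    #include <stddef.h>

    struct node { struct node *next; long payload; };

    /* Each load depends on the previous one, so issue rate is capped by
     * load latency and most execution units sit idle. */
    long chase(const struct node *n) {
        long sum = 0;
        while (n) { sum += n->payload; n = n->next; }
        return sum;
    }

    /* Four independent accumulators give the out-of-order core parallel
     * work, so several ALUs and load ports can be busy in the same cycle. */
    long wide_sum(const long *a, size_t len) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i + 4 <= len; i += 4) {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }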

The rest of the article talks about how VLIW "solves" this problem, which it does not. Given an existing parallelizable problem, a VLIW architecture can encode the operations a little more efficiently, and the decode hardware required to interpret the instructions can be a lot simpler. But that's as far as it goes. If you have a classic CPU which can decode 4 instructions in a cycle, then it can decode 4 instructions in a cycle, and VLIW isn't going to improve that except by reducing die area.

VLIW also forces compilers to choose between crazy specificity to particular hardware implementations, or an insanely complicated ISA with a batching scheme a la Itanium. This is largely why it failed, not that "compilers weren't smart enough".

There's nothing wrong with RISC. Or classic CISC. Even VLIW isn't bad, really. The sad truth is that ISAs just don't matter. They're a tiny bit of die area and a few pipeline stages in much larger machines, and instruction decode is largely a solved problem.

Programmers just like to yell about ISA because that's what software touches, and we understand that part.


Amen.



And POWER is also pretty scalable, or else I guess they would use something else for supercomputers.


from the article -

People still call ARM a “RISC” architecture despite ARMv8.3-A adding a FJCVTZS instruction, which is “Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero”.

The fundamental issue that CPU architects run into is that the speed of light isn’t getting any faster. Even getting an electrical signal from one end of a CPU to the other now takes more than one cycle.


No idea why you've been modded down. Yeah, the ARM architecture is not that RISC anymore; there are tons of instructions, some pretty specific like that one. I don't see much of a point in polarizing CISC vs RISC anymore in the first place. Both are borrowing from each other and gradually blending together.
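For reference, a rough C model of the ECMAScript ToInt32 conversion that FJCVTZS provides in hardware (a sketch, not ARM's architectural pseudocode; flag-setting and other details are omitted):

    #include <math.h>
    #include <stdint.h>

    /* Approximate semantics of the ARMv8.3 FJCVTZS instruction: NaN and
     * infinities map to 0, everything else is truncated toward zero and
     * wrapped modulo 2^32 into a signed 32-bit result. */
    int32_t js_to_int32(double d) {
        if (isnan(d) || isinf(d))
            return 0;
        double t = trunc(d);                /* round toward zero */
        double m = fmod(t, 4294967296.0);   /* wrap modulo 2^32 */
        if (m < 0)
            m += 4294967296.0;
        return (int32_t)(uint32_t)m;        /* reinterpret into signed range */
    }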



