
> a modern x86 decoder is smaller than a modern arm decoder

That's because the ARM ISA is not small by any stretch of the imagination either. By contrast, the instruction listing of the base RISC-V ISA and its standard extensions fits on a single PowerPoint slide.

http://riscv.org/workshop-jun2015/riscv-intro-workshop-june2...

I wasn't involved in any of the recent tape-outs, so I can't say exactly how big the decoder is. But it's quite small relative to the other chip components. Currently, the integer pipeline of the chip is roughly the same size as the FPU, and these two together are roughly the same size as the L1 cache. All of those components together are smaller than the L2 cache (depends on the size of the L2 cache, though). So decoder size doesn't really matter in the grand scheme of things.

Decoder speed probably does matter, though. Currently, we can decode an instruction in a single cycle (1 ns). The x86 decoder, on the other hand, can take multiple cycles depending on the instruction. But maybe this isn't a fair comparison, since x86 instructions are decomposed into uops. I have no idea about the performance of ARM decoders.
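
To illustrate why decoding a fixed-width encoding can be cheap, here is a minimal sketch of pulling the fields out of a 32-bit RISC-V R-type instruction. The bit positions follow the published base ISA encoding; the function name `decode_rtype` and the dictionary layout are hypothetical, and this is a software illustration, not a description of the hardware decoder discussed above.

```python
def decode_rtype(word: int) -> dict:
    """Split a 32-bit RISC-V R-type instruction word into its fields.

    Every field sits at a fixed bit position, so each one is just a
    shift and a mask with no dependence on the other fields.
    (Hypothetical helper for illustration only.)
    """
    return {
        "opcode": word & 0x7F,          # bits 6:0
        "rd":     (word >> 7) & 0x1F,   # bits 11:7
        "funct3": (word >> 12) & 0x07,  # bits 14:12
        "rs1":    (word >> 15) & 0x1F,  # bits 19:15
        "rs2":    (word >> 20) & 0x1F,  # bits 24:20
        "funct7": (word >> 25) & 0x7F,  # bits 31:25
    }


# 0x00B50533 encodes `add x10, x10, x11` (add a0, a0, a1).
fields = decode_rtype(0x00B50533)
print(fields)
```

A variable-length encoding such as x86 cannot do this directly, because the decoder first has to find where each instruction starts and ends.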




How can you be superscalar with a decoder that only does 1 op/cycle? Intel does 6:

> From the original Core 2 through Haswell/Broadwell, Intel has used a four-wide front-end for fetching instructions. Skylake is the first change to this aspect in roughly a decade, with the ability to now fetch up to six micro-ops per cycle. Intel doesn’t indicate how many execution units are available in Skylake’s back-end, but we know everything from Core 2 through Sandy Bridge had six execution units while Haswell has eight execution ports. We can assume Skylake is now more than eight, and likely the ability to dispatch more micro-ops as well, but Intel didn’t provide any specifics.

http://www.maximumpc.com/idf-2015-san-francisco-skylake-deep...


I think he meant that the decode latency is 1 cycle, not that the core can decode only one instruction per cycle.

That is, each baby takes 9 cycles to form, but the population can produce more than one baby every 9 cycles.


He was talking about latency; you're talking about throughput.
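
The latency/throughput distinction in this subthread can be sketched with a toy model. It assumes an idealized, fully pipelined decoder with no stalls; the function name `total_cycles` and the parameters `latency` and `width` are hypothetical and do not describe any particular chip.

```python
import math


def total_cycles(n_instructions: int, latency: int, width: int) -> int:
    """Cycles for an idealized pipelined decoder to finish n instructions.

    latency: cycles from an instruction entering decode to leaving it.
    width:   instructions that can enter decode in the same cycle.
    Assumes a new group of `width` instructions can start every cycle.
    """
    if n_instructions == 0:
        return 0
    groups = math.ceil(n_instructions / width)
    # The first group pays the full latency; each later group
    # completes one cycle after the previous one.
    return latency + groups - 1


# Same 1-cycle latency, different throughput:
print(total_cycles(12, latency=1, width=1))  # one instruction per cycle
print(total_cycles(12, latency=1, width=4))  # four instructions per cycle
# Higher latency, same width: only the startup cost changes.
print(total_cycles(12, latency=3, width=4))
```

In this model latency sets the startup cost, while width sets how quickly a long instruction stream drains, which is why a 1-cycle decode latency says nothing by itself about how many instructions are decoded per cycle.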



