That's a project of mine right now. When I took computer architecture in university, it was fairly light on hard details about things like superscalar and out-of-order designs. Sure, it covered the concepts, but there are so many interesting problems it didn't touch.

E.g. how do you, at a hardware level, actually build the reservation stations for an out-of-order design? How do you actually implement a register file with enough read and write ports to satisfy such a design without taking up a whole die for it?
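(For the curious: a reservation station entry is basically a slot holding an opcode plus operand tags, snooping a result bus until all its operands arrive. A toy software model of the idea follows; this is a sketch in C, not any real design, and all the names are mine.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy Tomasulo-style reservation station entry (names made up). */
    typedef struct {
        bool     busy;
        int      op;          /* opcode */
        int      src_tag[2];  /* producing entry's tag, -1 = value ready */
        uint64_t src_val[2];
        int      dst_tag;     /* tag broadcast when this entry completes */
    } rs_entry;

    /* A completing instruction broadcasts (tag, value) on the common
       data bus; every waiting entry snoops it and captures matches. */
    static void cdb_broadcast(rs_entry *rs, int n, int tag, uint64_t val) {
        for (int i = 0; i < n; i++) {
            if (!rs[i].busy) continue;
            for (int s = 0; s < 2; s++) {
                if (rs[i].src_tag[s] == tag) {
                    rs[i].src_tag[s] = -1;   /* operand now ready */
                    rs[i].src_val[s] = val;
                }
            }
        }
    }

    static bool ready_to_issue(const rs_entry *e) {
        return e->busy && e->src_tag[0] == -1 && e->src_tag[1] == -1;
    }

    int main(void) {
        rs_entry rs[4] = {0};
        /* entry waiting on tag 7 for its first operand */
        rs[0] = (rs_entry){ .busy = true, .op = 1,
                            .src_tag = { 7, -1 }, .src_val = { 0, 5 },
                            .dst_tag = 9 };
        printf("ready before: %d\n", ready_to_issue(&rs[0]));
        cdb_broadcast(rs, 4, 7, 42);   /* tag 7 completes with value 42 */
        printf("ready after:  %d\n", ready_to_issue(&rs[0]));
        return 0;
    }

In hardware each of those tag compares is a comparator per operand per entry per result bus, which is a big part of why these structures get expensive.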

I know there are a few Linux-capable soft-core RISC-V designs out there (VexRiscv, etc.) and microcontroller-class ones (PicoRV32, etc.). If my goal were to implement a system and it needed one of those things, sure, I'd use an off-the-shelf one. But I really want to understand how CPUs work, and the best way to do that is to do it myself without looking at the answer key.

Turns out register files are complicated and fascinating. I'd never come across "register file banking" in my architecture courses. It makes what I had to deal with in CUDA make a lot more sense now.
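(The gist, for anyone else who hadn't met it: split the register file into several smaller arrays with fewer ports each, and accept that two same-cycle accesses to the same bank conflict. A minimal sketch, with a made-up 4-bank layout:)

    #include <stdio.h>

    /* Toy register file banking: 32 registers spread across 4
       single-read-port banks by the low bits of the register number.
       Two reads in the same cycle conflict if they hit one bank. */
    #define NBANKS 4
    static int bank_of(int reg) { return reg % NBANKS; }

    int main(void) {
        /* e.g. add r5, r1, r9: r1 and r9 both map to bank 1, so one
           read must wait a cycle (or the scheduler avoids the pairing). */
        int srcs[2] = { 1, 9 };
        if (bank_of(srcs[0]) == bank_of(srcs[1]))
            printf("bank conflict in bank %d\n", bank_of(srcs[0]));
        else
            printf("no conflict\n");
        return 0;
    }

CUDA's shared memory plays the same trick, hence the bank conflicts you fight there.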




That sounds awesome! You should definitely post a Show HN on it.

I am going to comment on this, though:

> But I really want to understand how CPUs work, and the best way to do that is to do it myself without looking at the answer key.

I am right there with you on this; however, with experience I've come to appreciate that there is a lot of complexity in this topic, and I personally have a limit on how steep a learning curve I'm willing to climb in my spare time. As a result I've taken to isolating the thing I'm trying to learn by anchoring everything else to components that are known to work. Here is an example ...

In 2015 I discovered you could get software-defined radios for $25, and that there was a bunch of open source software one could use to play with them. I wanted to write my own modulator and demodulator code but kept running up against too many unknowns, which made it impossible to debug. Was it my code? Was it the radio setup? Was it the signal format?

I didn't start making real progress until I got a real spectrum analyzer and vector signal generator (VSG). That let me work from a known-good source: I could generate a signal with my own code and compare it on the spectrum analyzer against what the VSG produced. THAT let me find bugs in my code, and once I understood more of the basics of the DSP going on, I could branch into things like front-end selectors and polyphase filtering.

So I applaud your effort, and more power to you if you punch through and get this done. It will be huge. But if someone reading this were to think this is the only, or best, way to do something, I would encourage them to recognize that one can break these problems apart into smaller, sometimes more manageable chunks.


Another detail I never see covered is the implementation of variable-length instruction decoders. Almost every book seems to assume a classic RISC design with 32-bit-wide instructions, single-issue and non-microcoded. Are there any advanced undergraduate or graduate-level books that cover details like this?


Modern Processor Design: Fundamentals of Superscalar Processors is a great place to start, as it covers a lot of the details you're asking about. It uses the P6 microarchitecture (so Pentium Pro through Pentium III) as a case study and contrasts it against the PowerPC 620. That P6 arch is the basis for modern Intel cores after they dropped P4/NetBurst when Dennard scaling hit a wall. Yes, there have been updates since, but the book is still basically on point.

Real quick overview for some of the archs I know (which all happen to be x86): cache lines coming in from the I$ fill a shift register. Each byte of the shift register has logic that reports, in parallel, "if an instruction started here, how long would it be" (or "I don't know"). That information is used to select the instruction boundaries, which are then passed to the full instruction decoders (generally in another pipeline stage). After the recognized bytes are consumed, new bytes are shifted in and the process starts over.

This separation between length detection and full decode lets you have 16 (or however many) length decoders but only three or four full decoders. Additionally, the rare and very complex instructions are generally special-cased and only decoded by the byte-0 length/instruction decoder. And even then, the byte-0 decoder sometimes takes a few cycles to fully decode (as with lots of prefix bytes).
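Very roughly, in software terms, the scheme looks like this (the variable-length encoding below is made up for illustration, not real x86):

    #include <stdint.h>
    #include <stdio.h>

    /* Two-stage decode sketch. Made-up encoding (NOT x86):
       0x00-0x7f = 1 byte, 0x80-0xbf = 2 bytes, 0xc0-0xff = 3 bytes. */
    #define WINDOW 16

    /* Stage 1: "if an instruction started at this byte, how long
       would it be?" In hardware this is one small decoder per byte
       position, all working in parallel. */
    static int speculative_len(uint8_t b) {
        if (b < 0x80) return 1;
        if (b < 0xc0) return 2;
        return 3;
    }

    int main(void) {
        uint8_t buf[WINDOW] = { 0x05, 0x81, 0x22, 0xc3, 0x10, 0x20,
                                0x07, 0x90, 0x33, 0x01, 0x02, 0x03,
                                0x04, 0x05, 0x06, 0x07 };
        int len[WINDOW];
        for (int i = 0; i < WINDOW; i++)
            len[i] = speculative_len(buf[i]);

        /* Stage 2: chain from the known start byte; only the length
           marks at real boundaries are used, the rest are discarded. */
        for (int pos = 0; pos < WINDOW; pos += len[pos])
            printf("instruction at byte %d, length %d\n", pos, len[pos]);
        return 0;
    }

The real thing also has to deal with instructions spanning fetch windows and with prefix soup, but the split between cheap parallel length guesses and a handful of full decoders is the core idea.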

I imagine superscalar processors for other CISC archs have very similar decoders, maybe just aligned on halfwords rather than bytes if that's all the arch needs (as for 68k and S/390).


Right, most lecture notes I've seen on the internet only go as far as implementing a simple 32-bit RISC machine where everything happens in one cycle. For variable-length instructions I think you'll need a multicycle state machine: fetch the first instruction byte on the first cycle, then fetch the next instruction byte if needed on the second cycle, etc.
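Something like this, sketched as a C loop rather than HDL (made-up encoding: the top bit of the first byte set means a second byte follows):

    #include <stdint.h>
    #include <stdio.h>

    /* Multicycle fetch state machine, one state transition per
       "cycle". A real implementation would be an HDL process. */
    typedef enum { FETCH_FIRST, FETCH_SECOND, EXECUTE } state_t;

    int main(void) {
        uint8_t mem[] = { 0x12, 0x85, 0x34, 0x07 };  /* 0x85 -> 2 bytes */
        unsigned pc = 0;
        uint16_t insn = 0;
        state_t state = FETCH_FIRST;

        for (int cycle = 0;
             cycle < 10 && (pc < sizeof mem || state != FETCH_FIRST);
             cycle++) {
            switch (state) {
            case FETCH_FIRST:          /* cycle 1: first byte */
                insn = mem[pc++];
                state = (insn & 0x80) ? FETCH_SECOND : EXECUTE;
                break;
            case FETCH_SECOND:         /* cycle 2: extra byte if needed */
                insn = (uint16_t)((insn << 8) | mem[pc++]);
                state = EXECUTE;
                break;
            case EXECUTE:              /* then back to fetching */
                printf("cycle %d: execute 0x%04x\n", cycle, insn);
                state = FETCH_FIRST;
                break;
            }
        }
        return 0;
    }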

I came across these notes a while ago when trying to implement something similar: https://my.eng.utah.edu/~cs6710/slides/mipsx2.pdf (it's an 8-bit MIPS, so it needs 4 cycles to fetch an instruction).


Note that you can easily 'cheat' on this, especially if you already have the hardware to support unaligned reads. Just load-unaligned a (say) 64-bit 'instruction' and then only increase IP/PC by however many bytes you actually used.
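In software-model terms the cheat looks like this (made-up encoding again; memcpy stands in for the unaligned-load hardware):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* The 'cheat': always do one unaligned 64-bit load at PC, decode
       out of that window, and advance PC by the bytes actually used. */
    static uint64_t fetch64(const uint8_t *mem, unsigned pc) {
        uint64_t w;
        memcpy(&w, mem + pc, sizeof w);   /* portable unaligned read */
        return w;
    }

    int main(void) {
        uint8_t mem[16] = { 0x05, 0x81, 0x22, 0xc3, 0x10, 0x20, 0x07, 0x90 };
        unsigned pc = 0;
        while (pc + 8 <= sizeof mem) {
            uint64_t window = fetch64(mem, pc);
            uint8_t first = window & 0xff;
            /* same made-up length rule as upthread */
            unsigned len = first < 0x80 ? 1 : first < 0xc0 ? 2 : 3;
            printf("pc=%u first=0x%02x len=%u\n", pc, first, len);
            pc += len;   /* only advance by what was consumed */
        }
        return 0;
    }

The hardware analogue needs a wide fetch path and a rotator/mux after the load, which is where the cost hides.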


That will carry a pretty heavy performance penalty, though.


Yes, that's what makes it cheating, rather than good parsimonious design that intelligently reuses preexisting resources.



