
There are plenty of scenarios where synchronization overheads between cores dwarf the performance gain, but OoO execution can help.

But maybe instead of having more cores, we should expose the different execution units within a CPU core at the architectural level? That, however, brings back memories of Itanium, and the general fact that compilers just can't do static scheduling well enough.




I've started to think Itanium might have been sort of on the right track, but ahead of its time and in some ways poorly executed.


I still don't think so. Exposing these microarchitectural concerns at the architectural level limits flexibility. For a compiler to schedule multiple execution units efficiently, it needs to know the exact latency of every instruction. That may be doable for arithmetic, though even those latencies vary greatly from one processor generation to the next. And a compiler definitely cannot know the latency of a load: from a few cycles in L1 cache, to a few hundred cycles in DRAM, to millions of cycles if there's a page fault. These things vary a lot, not just between processor generations but even within a single run on the same processor.




