Is there a reason that machine languages don't allow the programmer/compiler/runtime to explicitly control the CPU's cache & instruction pipeline?

Presumably they have access to much better information about the code's intent and future behaviour than the CPU does.



There are a number of reasons this is not openly available. Just a few:

- It may lock the programmer into a specific CPU.
- The CPU cache/pipeline is less generic than you might think, which makes defining a standard way to change it from within a program problematic.
- Multiple programs may be running at the same time. Having one program dictate cache behaviour would affect all running applications.
- Arbitrary changes to behaviour would have performance implications whenever the change is made. With multiple applications running, that could mean flip-flopping behaviour.

Generally, though, for most applications the benefit of manual control wouldn't be worth the cost, risk, and loss of platform support. CPU behaviour is already very good, and it would probably be cheaper to upgrade to a more expensive CPU than to spend the time developing improvements to the algorithm.

In the rare case where you do need specific behaviour and it is worthwhile, it would be better to speak to the vendors than to do it yourself.


In addition to what others have said (mainly, it would be specific to a particular version of the CPU, which is probably the biggest thing -- imagine if no x86 binary older than ~1 year could run on your chip), one more point:

The programmer doesn't necessarily have better knowledge than the CPU, because the programmer can't see or respond to runtime behavior. There are a lot of cases where behavior is input-data-dependent, or simply too complicated to reason about analytically a priori. The big wins in computer architecture over the past two decades have all been in mechanisms that adapt dynamically in some way: dynamic instruction scheduling (out-of-order), branch prediction based on history/context, all sorts of fancy cache eviction/replacement heuristics, etc.
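
A tiny illustration of the input-data-dependence point, as a hedged C sketch (my example, not from the thread): the branch below is identical in the source on every run, but whether the predictor can learn it is decided only by the data that shows up at run time, which is information neither the programmer nor the compiler has when the binary is built.

    #include <stdint.h>
    #include <stdlib.h>

    /* Sums the elements above a threshold. The source is fixed, but the
     * branch's behaviour depends entirely on the data that arrives at run
     * time: with sorted or uniform input the predictor learns it almost
     * perfectly; with random input it mispredicts about half the time. */
    static int64_t sum_above(const int32_t *data, size_t n, int32_t threshold)
    {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (data[i] > threshold)      /* outcome is input-data-dependent */
                sum += data[i];
        }
        return sum;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        int32_t *data = malloc(N * sizeof *data);
        if (!data) return 1;
        for (size_t i = 0; i < N; i++)
            data[i] = rand() & 0xff;      /* random values: branch is unpredictable */
        int64_t s = sum_above(data, N, 128);
        free(data);
        return s == 0;                    /* keep the result live */
    }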

Itanium tried "VLIW + compiler smarts" and the takeaway, I think, was that it was just too hard to build good enough static analysis to beat a conventional out-of-order CPU.


The problem with OoO is that it's way too expensive (in terms of power and area) in many cases. You won't ever see OoO in GPUs and DSPs, and it's unlikely in microcontrollers. So VLIW and the other "stupid core, smart compiler" approaches are still legitimate and will always remain valuable.


Yup, in some domains it definitely still makes sense. GPUs work well for highly-data-parallel applications (they're essentially vector machines, modulo branch divergence) and VLIW-style DSPs work because the code is numerical and easy to schedule at compile time.

I've worked mostly in the "need performance as high as possible for general-purpose code" domain, so I may be biased!


Ever hear of this processor called MIPS? The design behind MIPS (Microprocessor without Interlocked Pipeline Stages) was originally based on the notion that every time the hardware needed to introduce a bubble, the CPU would instead continue executing instructions, giving rise to load delay slots and branch delay slots.

The finding from practice was that these new slots were hard to fill (i.e., mostly nops), and, instead of making hardware simpler, they made it much more complex if you ever decided to adjust microarchitectural details.

There is some limited support for poking the cache (prefetching, flushes, bypass, load streaming), but all of those techniques tend to be very coarse-grained.
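
For a sense of what that coarse-grained cache poking looks like in practice, here is a hedged C sketch using the GCC/Clang `__builtin_prefetch` builtin and the x86 `_mm_clflush` intrinsic; both are compiler- and ISA-specific, which is rather the point, and the stride of 16 is an arbitrary illustrative value.

    #include <stddef.h>
    #include <x86intrin.h>                 /* _mm_clflush: x86-specific */

    /* Hint that data[i + STRIDE] will be wanted soon. It is only a hint:
     * the CPU may ignore it, and a badly chosen stride can easily make
     * things slower rather than faster. */
    void scale(float *data, size_t n, float k)
    {
        const size_t STRIDE = 16;          /* arbitrary; has to be tuned per CPU */
        for (size_t i = 0; i < n; i++) {
            if (i + STRIDE < n)
                __builtin_prefetch(&data[i + STRIDE], 0, 1);  /* read, low locality */
            data[i] *= k;
        }
    }

    /* Explicitly evict the cache line containing p from every level, e.g.
     * before handing a buffer to a non-coherent device. Coarse-grained:
     * a whole line, all levels, no say in what replaces it. */
    void evict_line(const void *p)
    {
        _mm_clflush(p);
    }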


Such architectures have been produced, but x86's dominance has always been about fungibility and idiot-proofness while maintaining acceptable performance. It's never been the fastest possible or cheapest possible architecture; it's the architecture whose success has been built upon hiding all the complicated bits behind the curtains, so that software people can worry about software and not hardware.

There actually are some performance upsides to that too. It allows the chip architects to radically change the underpinnings of the chip with impunity, doing whatever works best on the current process node and using the latest advancements from computing research.


Another reason might be security: there are already known attacks that use the CPU cache and/or instruction pipeline to break the boundaries of virtualization. With today's virtualizable processors, it becomes more and more important that users cannot access too much of the processor's implementation details.


There are architectures that require manual pipeline control. MIPS, for example, or any VLIW architecture (including many modern GPUs). Cache is also often programmable (see scratch memory in SPARC, local memory in GPUs, etc.).
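
To make the "programmable cache" part concrete, here is a hedged C sketch of the software-managed scratchpad pattern that GPU local/shared memory and DSP scratch RAM expose: the program explicitly stages a tile of its working set into a small fast buffer, computes on it, and writes it back, instead of letting a hardware cache guess. On a plain CPU the `tile` array below is just ordinary stack memory; on those architectures it would be placed in the dedicated on-chip RAM.

    #include <stddef.h>

    #define TILE 256   /* sized to fit the on-chip scratchpad / shared memory */

    /* Software-managed "cache": copy a tile in, compute on it, copy it out.
     * Placement and eviction are entirely under program control, which is
     * exactly what GPU shared memory and DSP scratch RAM give you. */
    void scale_tiled(float *dst, const float *src, size_t n, float k)
    {
        float tile[TILE];                      /* stands in for the scratchpad */
        for (size_t base = 0; base < n; base += TILE) {
            size_t len = (n - base < TILE) ? (n - base) : TILE;
            for (size_t i = 0; i < len; i++)   /* stage in */
                tile[i] = src[base + i];
            for (size_t i = 0; i < len; i++)   /* compute locally */
                tile[i] *= k;
            for (size_t i = 0; i < len; i++)   /* stage out */
                dst[base + i] = tile[i];
        }
    }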



