
The “high-level CPU” challenge (2008) - pw
http://yosefk.com/blog/the-high-level-cpu-challenge.html
======
userbinator
_Suppose you want to support strings and have a string comparison instruction.
You might think that "it's done in the hardware", so it's blindingly fast. It
isn't, because the hardware still has to access memory, one word per cycle. A
superscalar/VLIW assembly loop would run just as quickly; the only thing you'd
save is a few bytes for instruction encoding._

Actually, if you've looked at things like SSE memory copying/comparison
routines, it's not "a few bytes", it's more like a factor of 100x+. REP MOVSB
is 2 bytes; a highly unrolled SSE memcpy - incidentally, also a bad idea for
modern CPUs - is easily a few hundred. I've seen ones over a kilobyte(!) The
former is essentially the same speed as the latter, but takes up far less
instruction cache, which is increasingly important today.
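
To make the size difference concrete, here's a minimal sketch (assuming
x86-64 and GCC/Clang inline-asm syntax): the entire copy "loop" is the 2-byte
REP MOVSB instruction, versus the hundreds of I-cache bytes an aggressively
unrolled SSE memcpy occupies.

```c
#include <stddef.h>

/* Sketch only: the whole copy is one 2-byte instruction.
 * Assumes the SysV ABI convention that the direction flag is clear. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      : /* no extra inputs */
                      : "memory");
}
```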

 _You see, RISC happened for a reason._

When memory bandwidth was not the bottleneck, it was a good idea. Now, not so
much. Even ARMs which are considered "RISC" are growing into uop-decoding
decoupled OoO machines like x86, and adding more instructions with each new
generation. I like to mention this article, where the "purest" RISC, MIPS,
turns out to be the least power-efficient:

[http://www.extremetech.com/extreme/188396-the-final-isa-show...](http://www.extremetech.com/extreme/188396-the-final-isa-showdown-is-arm-x86-or-mips-intrinsically-more-power-efficient)

I'd say that CPUs are certainly getting more high-level and CISCy, but not in
the same way as their original proponents imagined. If you don't think so, try
beating POPCNT, CRC32, AESENC, etc. with a sequence of simpler instructions...

~~~
renox
I find your point about ARM vs MIPS strange, given that ARMv8 seems much more
'MIPSy' than previous ARM ISAs..

~~~
sklogic
Why? It's very intentionally _not MIPSy_. ARMv8 removes anything that might
dictate microarchitecture decisions, while MIPS exposes them in the ISA
(delay slots and all that). Even predication has been killed for this reason
(it's not good for OoO anyway).

~~~
_yosefk
Yeah, but

* MIPS didn't have predication to begin with

* ARM didn't have delay slots to begin with

* and then AFAIK 64-bit ARM uses more registers and removes shifter operands

So definitely more MIPSy.

~~~
sklogic
If "simpler" = mipsy then yes. If "mipsy" = "closer to hardware" then
definitely not.

~~~
_yosefk
MIPSy = closer to MIPS, the actual ISA.

------
_chris_
2008? So... anything happen since? I assume nobody took him up on his offer!

"Alan Kay: [...] "Just as an aside [...] a benchmark from 1979 at Xerox PARC
runs only 50 times faster today. Moore’s law has given us somewhere between
40,000 and 60,000 times improvement in that time. So there’s approximately a
factor of 1,000 in efficiency that has been lost by bad CPU architectures.""
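
A quick sanity check of the arithmetic behind the quoted factor of 1,000:

```latex
% 40,000x to 60,000x from Moore's law, divided by the observed 50x speedup:
\[
  \frac{40{,}000 \ldots 60{,}000}{50} \;\approx\; 800 \ldots 1{,}200 \;\approx\; 10^{3}
\]
```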

Is he complaining that machines today aren't a bunch of micro-coded, multi-
chip computers like in the good old days, or does he have a better proposal to
the modern multi-core, superscalar OoO processors coupled to GPGPU multi-
threaded engines?

Because complaining about Moore's Law feels a bit meaningless if DRAM is still
100ns away.

~~~
_yosefk
In 2015 I find the claim that you could beat Intel by 1000x with a better arch
much more idiotic than I did in 2008. Because apart from it being idiotic from
a technical standpoint I'm now dumbfounded by the extent of its idiocy from
the business angle. If you're so smart, what's stopping you from making a
trillion dollars off it?

You need to be an Alan Kay to utter such bullshit repeatedly and still have
crowds of worshippers taking each of your utterances as gospel while chanting
that you singlehandedly invented computing as we know it.

~~~
sklogic
It's trivial to beat a generic CPU by 1000x with a specialised ASIC or even an
FPGA in many cases (think cryptography, for example).

~~~
adwn
Make that _a few cases_ for FPGAs. Modern desktop CPUs are incredibly fast and
run circles around modern FPGAs, even for many specialized, repetitive,
parallel, integer-computation-bound algorithms.

Of course, there are a couple of applications where FPGAs are actually faster
than CPUs.

Source: I'm an FPGA developer with a software engineering background.

~~~
kragen
That's fascinating and startling! Why is that? Is it that the FPGA's cost per
bit operation is just so many times larger than the CPU's that the FPGA can't
compete? Where do GPUs fit in?

~~~
adwn
Several factors:

1) FPGAs are much slower due to the overhead of the flexibility they provide.
With a lot of effort, you can get some FPGA designs (not even all of them!) to
maybe 400 MHz on the more expensive FPGAs, and then it will do much less per
cycle compared to an ASIC (CPUs are ASICs), so you need deep pipelines, with
all the cost, complexity, and restrictions this entails.

2) CPUs have a lot of memory on-chip in the form of cache (several MiB), and
they have sophisticated prefetch logic. In FPGAs, memory comes in several
hundred blocks of, say, 4 kiB each (depending on the FPGA model). All memory
management is purely manual.

3) CPUs tend to have a much higher external memory bandwidth.

4) Complex operations that are not directly provided by the FPGA fabric, like
floating-point arithmetic or large memories with more than two ports, are
expensive in terms of area and performance.

5) Modern desktop CPUs are incredibly well optimized: high clock frequency,
out-of-order execution, superscalarity with several execution units, SIMD,
automatic cache management, branch prediction, ... It's just very hard to beat
that, and this is only possible for certain, very restricted applications.
Highly parallel DSP (Digital Signal Processing) with integer/fixed-point
arithmetic comes to mind.

------
sourceless
[https://www.cs.york.ac.uk/fp/reduceron/](https://www.cs.york.ac.uk/fp/reduceron/)

It's still fairly young technology but there has been work on processors for
HLLs. The one above (if my understanding is correct) is effectively a
combinator processor, and its input language is a subset of Haskell.

(and now I notice it's already linked in the article comments, but I'll keep
it here for those interested.)

~~~
_chris_
Yah, the author does a follow-up blogpost (albeit, only a few days later, and
not years, hehe).

[http://yosefk.com/blog/high-level-cpu-follow-up.html](http://yosefk.com/blog/high-level-cpu-follow-up.html)

~~~
sourceless
I don't 100% agree with the author that the Reduceron is a "Haskell Machine"
- it seems very much to be the motivation for the project, but I think it's
more of a 'strongly typed functional language machine'.

I'd be inclined to think that the more general machine Yosef wishes for would
come with time and more languages implemented. Heck, maybe restricting the
kind of languages best supported is a necessary feature - you can't say that
von Neumann doesn't best support imperative & procedural languages after all.

------
chetanahuja
A plausible tl;dr of the post could be: "Dreaming up high-level instruction
sets is useless because today's CPUs spend most of their time waiting for
memory access (except for rare applications doing lots of math on a small
amount of data ... e.g. bitcoin mining)."

~~~
userbinator
_Dreaming up high-level instruction sets is useless because today's CPUs
spend most of their time waiting for memory access_

...which doesn't make much sense, because the benefit of small but powerful
multicycle instructions is precisely that you can spend the time decoding and
executing them (and less time fetching them) while the CPU is waiting for some
memory access. One factor that many people seem to forget is that instructions
also need to be fetched from memory, and the less bandwidth that takes, the
better.

~~~
chetanahuja
Not defending the OP's thesis, but clarifying my tl;dr summary of it: fetching
instructions is not the main problem; it's the data that the instructions have
to act upon (hence the exception for compute-heavy, data-light applications).
And since most of the compute power in the world today is basically processing
and moving massive amounts of data around, I can see how that thesis would
make sense in the general case.

------
sklogic
The future might be in an FPGA fabric tightly coupled with generic RISC cores
(things like Zynq SoCs, but with a tighter integration, allowing a direct
access from FPGA to the CPU internal registers, pipelines, etc.). Bigger macro
cells (like entire ALUs, generic caches, etc.) might be useful too.

This way hot spots can be optimised (even by a JIT engine at runtime) into
hardware circuits.

I played a bit with some crazy ideas like mixing HDLs with low-level C code
(see the Mandelbrot example here:
[https://github.com/combinatorylogic/soc/tree/master/backends...](https://github.com/combinatorylogic/soc/tree/master/backends/small1/sw/long_hdltests)
) - this sort of thing can become practical with a tighter integration
between a CPU core and FPGA fabric (in my case it was a soft core, of course).

------
moron4hire
All the EEs I know think 1000 lines of code is "a lot". I've never met any EEs
that used source control. I've never met any that tracked issues beyond
forwarding emails back and forth. Their idea of documentation is separate PDFs
for electrical properties and programming APIs.

So if we software engineers don't know the hardware side well enough to design
hardware, I'm not really sure electrical engineers understand the software
side well enough to design hardware, either.

Not saying there aren't EEs that understand software. Just saying I've never
met any.

EDIT: wait, I know one EE who uses source control. I finally got my wife to
start using Git. She kept looking over my shoulder at home, asking me "what is
_that_?" pointing at my changelog in SourceTree. "Oh, you know, it's just an
annotated record of every change every member of the project has ever made,
including when they did it, in such a way that we can share these changes with
each other without having to have long discussions about which particular
folder on which particular network drive is the 'latest' code. Oh, you think
that would be helpful in your work? Who would have thought?"

~~~
sklogic
Every EE company I've worked for used source control (Perforce, Git, or SVN).
And a typical Verilog module is far longer than 1000 lines.

~~~
moron4hire
If it weren't a meme, it wouldn't be a meme.

------
Symmetry
Well, the obvious place to start is by using separate memory channels. If you
have separate load/store queues for the stack and the heap, you can get
substantially higher peak performance for the same area/power, since the cost
of that structure scales super-linearly. But that means you can't have arrays
with a mix of pointers to heap- and stack-allocated objects.

That's one of the things the Reduceron (mentioned elsewhere) does.

~~~
_yosefk
I plan to write about it; the thing is, the Reduceron is not a more performant
machine when all is said and done. Memory costs you a lot in area and power,
and throwing channels at the problem and saying you've solved the von Neumann
bottleneck is a worse idea than trying to avoid banging at memory needlessly;
using C is part of a working strategy to achieve the latter.

~~~
Symmetry
Breaking up the load/store queue would basically just be a way of increasing
parallelism at the core level for free. You wouldn't be changing the amount of
memory in caches, just increasing the rate at which you can issue loads or
stores and the number of outstanding accesses the core could support. Well,
you might consider a split L1D but there are lots of tradeoffs with that and
you'd have to simulate to figure out if that'd be a good idea.

------
sriku
Aren't GPUs a good enough example of hardware that can be programmed to do
things way faster than what is possible with CPUs of the same era?

Also, what about FPGAs? If compiling a specific program down to an FPGA is
possible, what kinds of optimizations can we bring in? (I am indeed waving my
hands here, but we have had genetic-programming-optimized FPGAs in research
for over 15 years.)

~~~
_yosefk
Isn't it obvious after all these years that GPUs or FPGAs give HLLs nothing??
Nobody actually runs HLLs there, even! And everyone has had years and years to
try! On these machines HLL performance in reality blows much, much harder than
on CPUs, and at what point does it become obvious that nobody doing something
must be due to reasons that can be learned and understood instead of ignored?

I wrote TFA to explain some of those reasons and nobody seems to read it; they
just keep saying what they always do.

Someone is wrong on the Internet!!

~~~
sriku
I didn't state my point clearly. My point was not that GPUs give some new
capability to HLLs, but that GLSL itself is higher level for a restricted
domain than what is afforded by a general purpose processor ... and, owing to
their specialization, GPUs are faster than normal CPUs.

FPGAs blur what it means to "run a HLL", since you can optimize the circuitry
to a point where there is no explicit representation of the "code" or
"microcode" anywhere so the question becomes pointless. Heck even what part of
_physics_ is used is blurred! [1,2]

[1]
[https://en.wikipedia.org/wiki/Evolvable_hardware](https://en.wikipedia.org/wiki/Evolvable_hardware)
(Ref: Adrian Thompson's work)

[2] [http://www.damninteresting.com/on-the-origin-of-circuits/](http://www.damninteresting.com/on-the-origin-of-circuits/)

------
rthomas6
Surely there is _some_ improvement to be made by automatically parallelizing
map and filter functions, and for side-effect free languages, automatically
parallelizing functions that do not share state? Maybe this is more on the
compiler side than the hardware side, but wouldn't this potentially provide a
significant speedup if you have a lot of cores to take advantage of? A speedup
that you can't get from procedural programming languages?
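
For what it's worth, a minimal sketch of that idea in C (assuming a compiler
with OpenMP support, e.g. gcc -fopenmp; the function name and types here are
made up for illustration). Because the body is side-effect free and iterations
share no state, the "map" can be split across cores mechanically:

```c
#include <stddef.h>

/* A side-effect-free map: each iteration touches only its own element,
 * so the runtime is free to divide the range across however many cores exist. */
void map_square(const double *in, double *out, ptrdiff_t n)
{
    #pragma omp parallel for
    for (ptrdiff_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}
```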

------
baq
> in fact lots of standard big O complexity analysis assumes a von Neumann
> machine – O(1) random access.

I don't see the logic here.

~~~
lsd5you
Memory lookup can be considered O(log(m)), where m is the amount of memory you
are using; e.g. if the addressable memory space exceeds the word size, it
takes two cycles to do a load from memory.

In practice memory lookup is O(1), but there is a hard limit to the amount
that can be used. Except of course this is not really true either - there are
costs to using more memory - because of the cache hierarchy.
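
Spelling that argument out (a sketch, assuming a word size of w bits):

```latex
% Addressing m words needs ceil(log2 m) address bits; once that exceeds the
% word size w, merely forming the address takes multiple words, hence cycles.
\[
  t_{\text{load}} \;\approx\; \left\lceil \frac{\log_2 m}{w} \right\rceil
  \text{ cycles} \;=\; O(\log m)
\]
```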

~~~
kragen
If we're _really_ going to talk about memories of over 2⁶⁴ bytes, in the limit
it's O(m^⅓) if you're talking about latency, because larger memories need to
be physically further away, on average.
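
The scaling argument, sketched (assuming a fixed volumetric bit density rho
and signals limited to speed c):

```latex
\[
  m \approx \tfrac{4}{3}\pi r^{3}\rho
  \;\Rightarrow\;
  r \propto m^{1/3}
  \;\Rightarrow\;
  t_{\text{latency}} \gtrsim \frac{r}{c} = O\!\bigl(m^{1/3}\bigr)
\]
```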

~~~
repsilat
The newest physics is even more pessimistic, putting it at O(m^½) because of
the holographic principle.

Essentially, the theory goes that the amount of data you can cram into a
spherical volume of space scales in proportion to its surface area, IIRC with
a density on the order of bits per Planck area.

[https://en.wikipedia.org/wiki/Bekenstein_bound](https://en.wikipedia.org/wiki/Bekenstein_bound)

[https://en.wikipedia.org/wiki/Holographic_principle#Limit_on...](https://en.wikipedia.org/wiki/Holographic_principle#Limit_on_information_density)
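
Same sketch with the area bound swapped in for the volume bound (assuming on
the order of one bit per Planck area):

```latex
\[
  m \lesssim \frac{4\pi r^{2}}{\ell_P^{2}}
  \;\Rightarrow\;
  r \gtrsim \ell_P \sqrt{\frac{m}{4\pi}} \propto m^{1/2}
  \;\Rightarrow\;
  t_{\text{latency}} = O\!\bigl(m^{1/2}\bigr)
\]
```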

~~~
kragen
I don't understand the Bekenstein bound well enough to comment very
intelligently, but isn't that a bound on the information _content_ of your CPU
rather than its information _throughput_ or latency?

~~~
repsilat
> a bound on the information _content_ of your CPU

Sure, though I'd have said "RAM". When you have a bound on information
density, you naturally get a bound on worst-case memory latency provided you
obey a fixed speed of light.

~~~
kragen
Oh! Good point!

