
To reinvent the processor - Veedrac
https://medium.com/@veedrac/to-reinvent-the-processor-671139a4a034
======
timerol
This talks about two major performance concerns in CPUs: control flow and
memory access. The transition between the two was my favorite part of the
article. It's both funny and accurate.

"If the previous section gave the impression that, say, control flow is a
particularly tough problem, but inventive methods can give powerful and
practical tools to tackle it… good, that was the intent. In this section I
want you to take that expectation and dash it against the rocks — memory is
hard."

~~~
hinkley
In the '90s there was a group experimenting with putting small processors
directly on the memory chips and doing data processing there.

~~~
flohofwoe
This is also the idea behind the Pixel Planes rendering architecture from the
'80s; they called it "logic-enhanced memory chips". It's not a "universal CPU"
on the memory chips, but each chip can "process" and store 128 pixels, with
the processing implemented as a hardwired linear expression.

Here's a video on Pixel Planes 4 from 1987:

[https://www.youtube.com/watch?v=7mzpZ861wEw](https://www.youtube.com/watch?v=7mzpZ861wEw)

~~~
leereeves
Sounds rather similar to GPU architecture.

~~~
thechao
Pixel Planes is the ancestor of GPUs, related to them the way the Titans are
related to the gods in Greek mythology.

------
Veedrac
I am reading the comments, and am open to critique or general suggestions. “I
found this part confusing” is also very welcome.

This is quite an information-dense work, and a little rough around the edges,
though I am pretty happy with it overall.

~~~
h2odragon
I almost understand some of these words :) Truly this is fascinating and makes
me want to learn about the many things I do not grasp yet. Thank you.

> Register hazards are an unavoidable consequence from using a limited number
> of names to refer to an unbounded number of source-level values

Yet I seem to recall seeing compiler writers lament the paucity of registers
too. Perhaps "registers" itself as a concept is something we should trade in
for named memory spaces or something.

Gimme instruction-level access to all of what is now cache (chuck all the
cache-handler logic, by the way). Let me page sections of that in and out to
main memory or the inter-CPU or I/O buses, but otherwise it's all storage and
the real difference is latency.

If we want to burn hardware, there could be special graded-coherency memory
areas, or transactional or content-addressable ones.

~~~
deepnotderp
> Perhaps "registers" itself as a concept is something we should trade in for
> named memory spaces or something

Please no, however hard you think register renaming is, memory disambiguation
is harder.
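
A minimal C sketch of the contrast (my own illustration, not from the article
or the thread): register dependencies are visible in the instruction's names
alone, but whether a load may be hoisted above a store depends on runtime
addresses the hardware has not computed yet.

```c
#include <stdio.h>

/* The load from src[j] can only be reordered above the store to dst[i]
   if the two addresses differ, and that is unknowable until both are
   computed at runtime. That is what makes memory disambiguation hard. */
void store_then_load(int *dst, const int *src, int i, int j) {
    dst[i] = 42;            /* store to dst + i */
    int x = src[j];         /* load from src + j: may alias the store */
    printf("loaded %d\n", x);
}

int main(void) {
    int buf[4] = {1, 2, 3, 4};
    store_then_load(buf, buf, 0, 0); /* aliases: the load must see 42   */
    store_then_load(buf, buf, 1, 2); /* no alias: reordering was legal  */
    return 0;
}
```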

~~~
h2odragon
Hey, I'll argue for self-modifying code if you let me :). Don't let it be
ambiguous. I may be trying to express "don't add abstractions; let the
hardware limits show", in ignorance of what the hardware limits are.

------
raphlinus
I like the branch-predict / branch-verify split. It reminds me strongly of the
ldrex/strex strategy in ARM for splitting (for example) a compare-exchange in
two parts, so that in the non-contended case it basically just flows straight
through.
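
A hedged sketch of that split in C11 atomics (my example, not raphlinus's
code): on ARMv7 the compare-exchange below typically lowers to an LDREX/STREX
pair, where LDREX marks the address for exclusive access and STREX succeeds
only if no other core wrote it in between, so the uncontended case flows
through the loop exactly once.

```c
#include <stdatomic.h>

/* Atomic increment built from compare-exchange. On ARMv7 this is an
   LDREX/STREX loop: load-exclusive, compute, store-exclusive, and
   retry only if another core intervened. */
int atomic_increment(atomic_int *p) {
    int expected = atomic_load(p);
    /* On failure, `expected` is reloaded with the current value and
       we simply retry with a fresh increment. */
    while (!atomic_compare_exchange_weak(p, &expected, expected + 1)) {
    }
    return expected + 1;
}
```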

Having recently done a lot of GPU programming, I'm wondering what your
thoughts are on a GPU-like approach, with tons of in-order cores. It's hard to
program, but maybe we need to figure that out in order to get performance.

~~~
api
I lean strongly toward your second paragraph. If we can make many-core
programming easier, I think we can dispense with a ton of complexity and just
scale out cores with transistor counts.

I think an under-researched area is the application of deep learning to
compilers. What could be done there for parallelism?

We are still kind of stuck in programming models developed back when core
frequencies were scaling geometrically. We hit that wall in the early 2000s
and still haven't quite assimilated it.

We are tackling it at the macro scale of multiple discrete systems via modern
devops but the systems (e.g. Kubernetes) are baroque, clunky, and hard to
manage.

~~~
jcranmer
> I think an under-researched area is the application of deep learning to
> compilers. What could be done there for parallelism?

I don't think it's under-researched; it's that we already know it's only ever
going to do a lousy job. The problem of autoparallelization is mostly stymied
by the difficulty of computing the legality of transformations, and
secondarily by the problem of tuning parameters [1]. We've realized that it's
easier to just get the users to tell us about these things, sidestepping the
problem for the most part.

> We are still kind of stuck in programming models developed back when core
> frequencies were scaling geometrically.

The programming model for parallelism is dataflow. We already know that: the
tricky part of parallelism is communication, so annotating all of the edges
of communication makes the scheduling problem much, much easier.
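
A trivial sketch of what explicit edges buy you (my example, not jcranmer's):

```c
#include <stdio.h>

/* Toy dataflow graph. Every communication edge is explicit in the data
   dependencies, so a scheduler can read the parallelism right off the
   graph: `a` and `b` share no edge and could run concurrently, while
   `sum` must wait for both. */
static int square(int x) { return x * x; }
static int cube(int x)   { return x * x * x; }

int main(void) {
    int src = 3;
    int a = square(src); /* edge: src -> a */
    int b = cube(src);   /* edge: src -> b, independent of a */
    int sum = a + b;     /* edges: a -> sum, b -> sum */
    printf("%d\n", sum); /* 9 + 27 = 36 */
    return 0;
}
```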

[1] Theoretically, machine learning is good at tuning parameters. In practice,
you get good results with the dumbest algorithms _already_, and the available
headroom for performance is generally swamped by the fact that small changes
such as function size can dramatically alter the performance characteristics
of code elsewhere in the application.

------
gumby
The Mill folks seem nice (at least the ones I've met), but after 14 years with
no silicon and nobody poaching their ideas for other designs, is there
actually a there there?

Why is there no SiFive for Mill?

~~~
ansible
SiFive works with the RISC-V architecture, which is completely open,
intentionally so.

The Mill folks have been careful to patent everything interesting before
publicly talking about it.

~~~
gumby
patent-driven own goal.

~~~
ansible
Well, they want to make money. The design is revolutionary enough that if it
does make it into production, there's a decent chance of success.

------
Ericson2314
I fear the general difficulty of this means nothing will happen... until
something really different happens. In the meantime, let's get our programs
into really high-level languages that can compile to wildly different
architectures.

I've compiled Haskell to CPUs and Haskell to FPGAs, but the idioms don't
overlap except for the most foundational libraries (Functor, Applicative,
Monad). That's still great, but it's nowhere near "let me compile my CRUD app
to my FPGA" yet. More FRP for the CPU could change that, though. FPGAs can
even do self-rewriting circuits, since they are reprogrammable.

------
bogomipz
The author states:

>"A Skylake CPU has 348 physical registers per core, split over two different
register files."

Just looking at some literature on recent Intel CPUs, I can only see a
fraction of this 348 number. I see the following:

* 16 general-purpose registers
* 6 segment registers
* 1 flags register
* 8 x87 registers
* 16 SSE registers

Could someone explain exactly where the post's figure of 348 registers comes
from?

~~~
pkaye
Those are the architectural registers. Physical registers also include the
extra ones needed to implement register renaming, which eliminates false data
dependencies.
[https://en.wikipedia.org/wiki/Register_renaming#Architectura...](https://en.wikipedia.org/wiki/Register_renaming#Architectural_versus_physical_registers)
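
A toy sketch of the idea (my own illustration, not from the post): the loop
below reuses the single architectural name `t` on every iteration, which is a
write-after-write hazard if taken literally. The renamer maps each write of
`t` to a fresh physical register, so multiplies from different iterations can
be in flight at once; keeping many such in-flight values is why the physical
file is so much larger than the 16 architectural GPRs.

```c
#include <stdio.h>

/* Each iteration writes the same architectural register holding `t`.
   Renaming gives every write a fresh physical register, letting an
   out-of-order core overlap iterations instead of serializing them. */
int dot4(const int *a, const int *b) {
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        int t = a[i] * b[i]; /* a new physical register per iteration */
        sum += t;
    }
    return sum;
}

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("%d\n", dot4(a, b)); /* 5 + 12 + 21 + 32 = 70 */
    return 0;
}
```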

~~~
bogomipz
Thanks, right, I was confusing ISA-level registers with those at the
implementation/microarchitecture level. Cheers.

------
ezconnect
I think the future of the CPU will be FPGAs dedicated to individual pieces of
software. Hardware will be so cheap that your desktop, or whatever computer
you have, will have lots of FPGAs, or enough programmable logic to run a
certain amount of software directly. You could also allocate some of it for
general computing so you can run legacy software.

~~~
zackmorris
I have a computer engineering degree, and what you are saying about using
FPGAs for general-purpose computing was correct when I graduated 20 years ago.
It was 10-100 times more correct a decade after that, and is another 10-100
times more correct today. So somewhere between 100 and 10,000 times more
correct now than in the 90s. And between 1000 and 1,000,000 times more correct
in 2029!

But we're still using the same 3 GHz (effectively single-threaded) chips today
as then. Sure, RAM frequency is higher, but so is latency. Computers today are
closer to 10 times faster than computers of the 90s, not 10,000 times faster
like they should be. Except for video cards, which really are 100 or 1000
times faster, because they break with the single-threaded model.

FPGAs would work great for general-purpose computing, but we're still missing
a proper language for programming them. VHDL/Verilog is more like assembly
language than the functional language we would need to do it effectively.

I haven't really kept up on this because it's been too depressing, but here
are some promising approaches:

[https://github.com/dgrnbrg/piplin](https://github.com/dgrnbrg/piplin)

[https://catherineh.github.io/programming/2016/12/26/haskell-...](https://catherineh.github.io/programming/2016/12/26/haskell-on-a-xilinx-fpga)

[https://www.eetimes.com/document.asp?doc_id=1329857&page_...](https://www.eetimes.com/document.asp?doc_id=1329857&page_number=6)

[https://news.ycombinator.com/item?id=14546535](https://news.ycombinator.com/item?id=14546535)

So we'd need:

* An HDL wrapper that guarantees that any hardware description downloaded to an FPGA won't short it out. [might already exist, but needs proven examples/unit tests of pathological edge cases]

* A Lisp to HDL compiler, preferably with optimization. [time-consuming, but not difficult]

* A high-level functional or vector or stream language (Clojure/Elixir/MATLAB/Erlang/Go) to Lisp transpiler. [somewhere between trivial and straightforward, might already exist]

* An example implementation of a CPU written in a functional/vector/stream language, probably MIPS [not difficult, just time consuming to convert an existing spec to code]

It would be good to keep the transistor count under 1 million (say 100,000
gates), and possibly start with an 8-, 16- or 32-bit implementation and
emulate 64-bit and higher operations in microcode. Most CPU gates are wasted
on out-of-order execution and cache, which aren't as important in parallel and
stream-based/DSP computing:

[https://en.wikipedia.org/wiki/Transistor_count](https://en.wikipedia.org/wiki/Transistor_count)

When I graduated, I saw a brand-new chip where 75% of the die was for cache,
although I can't remember which model number it was. Chips today are probably
worse, since their speed-to-transistor-count ratio has mostly gone down.

It looks like FPGAs might have stopped reporting gate count since they've
doubled down on the proprietary route. A Xilinx Virtex-7 2000T FPGA from 2011
had 6.8 billion transistors which could implement roughly 20 million ASIC
gates:

[https://www.eetimes.com/document.asp?doc_id=1316816](https://www.eetimes.com/document.asp?doc_id=1316816)

So 8 years later with Moore's law that should be 5 doublings (32 times), or
roughly 200 billion transistors and 640 million ASIC gates on a 2019 FPGA. The
article says it takes about 340 transistors to form a gate on an FPGA, which
seems high to me. But conservatively, 640 million gates would allow us to put
6,400 of our 100,000 gate cores on one FPGA.
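
A back-of-envelope check of those figures, using only the numbers already
quoted in this comment (nothing independently measured):

```c
#include <stdio.h>

int main(void) {
    double transistors_2011 = 6.8e9; /* Virtex-7 2000T                   */
    double scaling          = 32;    /* 5 Moore's-law doublings, 2011-19 */
    double per_gate         = 340;   /* EETimes transistors per gate     */
    double core_gates       = 100e3; /* the 100,000-gate core from above */

    double transistors = transistors_2011 * scaling;
    double gates       = transistors / per_gate;
    printf("%.1fB transistors, %.0fM gates, %.0f cores\n",
           transistors / 1e9, gates / 1e6, gates / core_gates);
    /* prints: 217.6B transistors, 640M gates, 6400 cores */
    return 0;
}
```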

Maybe someone ambitious can figure out how much RAM could be allocated per
core in a 2D grid, based on the total core count. I'd like at least 1 MB per
core, but that might conservatively require something like 10 million gates,
which might limit us to 64 cores in RAM alone:

[https://www.quora.com/How-many-transistors-flip-flops-and-ga...](https://www.quora.com/How-many-transistors-flip-flops-and-gates-are-required-to-store-one-byte-of-data-in-a-memory-chip-online)

It’s unfortunate that there is so much hand-waving around transistor, gate and
logic-cell counts. But it looks like we could put something like 64, 640 or
6,400 cores on a single FPGA today depending on how much RAM we allocate per
core, and if we had the right software. And this would be a true parallel
computer, probably running between 100 MHz and 1 GHz. So we could play around
with things like copy-on-write (COW), content-addressable memory and
embarrassingly parallel computation in all the areas that today's computers
are weak at (things like ray tracing, voxels, genetic programming, neural nets
and so on) and at least see how classic (UNIX-style) programming compares with
the bare-hands/burdensome languages like CUDA and OpenCL.

If an FPGA can do all that, then surely it could devote some of its gates to
the kind of reprogrammable logic you mentioned (perhaps for video transcoding,
bitcoin mining, emulating video game consoles, things of that nature).

~~~
erichocean
> _Computers today are closer to 10 times faster than computers of the 90s,
> not 10,000 times faster like they should be._

This isn't even close to true for many classes of problems.

For 3D rendering software running on CPUs, this is observably not true:
today's machines are computing 10,000x faster than what was possible in 1994,
when _Toy Story_ was rendered. Not to mention that massively larger data sets
are being rendered now, thanks to storage, memory, and general I/O
improvements.

In fact, _path tracing_ (now the dominant 3D rendering method, including at
Pixar) was considered elegant but impractical in the 90s because it required
way too many CPU cycles and way too much memory (the whole scene, roughly,
needs to be in memory at all times). It took a 1000x (now 10,000x) increase in
computer performance across all dimensions to make path tracing viable.

Does any of that sound like just a roughly 10x improvement in 25 years?

~~~
zackmorris
I was excluding video cards, and comparing a 1999 computer (say a 300 MHz
single-core Intel Pentium II with 100 MHz RAM) to a 2019 computer (say a 3 GHz
8-core Intel i9 with 1 GHz RAM). There is some wiggle room there on clock and
bus speeds and widths, but all within a similar order of magnitude:

[https://www.cpubenchmark.net/singleThread.html](https://www.cpubenchmark.net/singleThread.html)

This chart doesn’t go as low as we need, but we can extrapolate the Pentium II
to a score of about 50 and the i9 to 2,500 (although if we include the i9’s 8
cores, we get a score more like 20,000:
[https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i9-9900K...](https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i9-9900K+%40+3.60GHz&id=3334)
). So we’re looking at about 50x faster single-threaded performance between
1999 and 2019.

That means that since 2019 clocks are 10x faster than 1999 clocks, there has
been only a 5x performance increase per cycle (most of it probably due to
longer pipelines, larger caches and wider buses, which are evolutionary rather
than revolutionary technologies).

Yes, there have been major price decreases, but only minor increases in
performance. Computers have exponentially more transistors, but only linearly
better processing power and bus speed. Had performance kept up with the 100x
increase in transistor count per decade (10,000x every 20 years), the i9 with
8 cores should have scored about 50 × 10,000 × 8 = 4,000,000. But it only
scored 20,000, so it’s only reaching about half a percent of its potential
performance.

We don’t hear any of this talked about much because it’s an inconvenient
truth. But from my perspective, Moore’s law ended around 2000. If it were up
to me, I would scrap the rat race toward incrementally faster single-threaded
performance. One divided by half a percent indicates that chips today should
have roughly 8 × 50 = 400 cores to adequately make use of their potential
processing power. For a hardwired (non-FPGA) chipset, I’d reach higher than
that and target 1024 cores running at 3 GHz, with as much as 16 GB of on-chip
RAM running at 1 GHz (16 MB per core, so they can be programmed with local
copies of their own operating system), for a chip with roughly the same
transistor count as today’s, costing under $1000.

My training was for designing a MIPS-style chip running at roughly 100 MHz. I
don’t see any mystery in designing the scaled-up chip I’m suggesting. In fact,
it would be far simpler and cheaper to design, because we’d only have to do it
once, for perhaps a 4-6 stage pipeline with little or no branch-prediction or
cache logic, and then repeat that in a 2D mesh. A single core would run up to
5x slower than an i9’s, but we’d have over 1,000 more of them to do whatever
we wanted with.

~~~
erichocean
In case I wasn't clear, the kind of 3D rendering I'm describing _doesn't use
graphics cards_. It's all on the CPU.

As for ideas to make a faster CPU, sure, knock yourself out. But there's no
question that computers today are much closer to 10,000x faster after 25 years
of development, on actual, real-world code being run to earn billions of
dollars. Your claim of approximately 10x faster after 25 years is just wrong,
at least for the software systems I'm familiar with.

------
ravenouswolves6
Also worth a mention, the Swarm architecture:
[https://people.csail.mit.edu/sanchez/papers/2015.swarm.micro...](https://people.csail.mit.edu/sanchez/papers/2015.swarm.micro.pdf)

------
Animats
Link leads to a login page.

