
The Reduceron, a CPU designed for functional programs - omaranto
http://www.cs.york.ac.uk/fp/reduceron/
======
fragsworth
From the abstracts and conclusions, I'll summarize a few main points that I
got from this:

1\. The functional CPU is implemented on an FPGA (a type of programmable chip)
clocked at 96 MHz

2\. It performed at about 25% of the speed of a Core 2 Duo clocked at 3 GHz,
which, while slow, is an order of magnitude better than procedural
implementations on FPGAs

3\. It took advantage of significant parallelism at the circuit level which is
not available to modern processors

It seems that if some effort went into perfecting these chips as we do with
procedural chips, we could see vast performance increases. Maybe they could be
implemented as an additional unit in the computer to accelerate functional
programs.
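To make point 3 concrete: the Reduceron's core operation is template
instantiation during graph reduction, and the circuit-level win is that it can
read and write many nodes of a function body in a single clock cycle. Below is
a toy normal-order reducer (my own sketch, not the Reduceron's actual machine)
showing the step that the hardware parallelizes; in software the substitution
is a sequential tree walk.

```haskell
-- A toy expression language and a normal-order reducer. The "step"
-- function performs one reduction: instantiating a function body with
-- its argument. This is the operation the Reduceron does in wide
-- parallel memory accesses; here it is an ordinary sequential walk.
data Expr = Var String
          | App Expr Expr
          | Lam String Expr
  deriving (Eq, Show)

-- Substitute e for x in a body (no capture handling: toy example only).
subst :: String -> Expr -> Expr -> Expr
subst x e (Var y)   | x == y    = e
                    | otherwise = Var y
subst x e (App f a) = App (subst x e f) (subst x e a)
subst x e (Lam y b) | x == y    = Lam y b
                    | otherwise = Lam y (subst x e b)

-- One reduction step: reduce the leftmost-outermost redex, if any.
step :: Expr -> Maybe Expr
step (App (Lam x b) a) = Just (subst x a b)
step (App f a)         = fmap (\f' -> App f' a) (step f)
step _                 = Nothing

-- Reduce to normal form by repeating single steps.
reduce :: Expr -> Expr
reduce e = maybe e reduce (step e)

main :: IO ()
main = print (reduce (App (Lam "x" (Var "x")) (Var "y")))
-- prints: Var "y"
```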

~~~
DanielBMarkham
Thanks for the recap.

Seems to me like matching up FPGAs with FP is a killer application for FP. You
could even scale that out with multiple "cores" representing multiple
programs running on the same machine. I'm just guessing, but it seems like you
should see an order-of-magnitude increase on similarly-specified chips. Plus
power requirements should dramatically decrease, the O/S should simplify, etc.

Hope somebody picks this up and runs with it.

------
mbell
The x86 architecture is a modified Harvard architecture: close to the CPU,
memory (the L1 cache) is divided into 'instructions' and 'data'; further from
the CPU the memory is joined. L2, L3, and RAM are generally 'unified' and can
contain either 'instructions' or 'data'.

This thesis proposes that separating memory into 'instructions', 'stack', and
'heap' results in a performance increase for functional languages.

Additionally, this is targeted at software that makes a large number of
function calls, such that the time expense of function calls on current
architectures is higher than the actual computational time expense of the
code.
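The payoff of that three-way split is that the machine can touch all three
memories in the same clock cycle, since they are physically separate RAM
blocks with their own ports. A minimal sketch of the idea (all names here are
my own invention, not the thesis's):

```haskell
-- A hedged model of split memories: instruction, stack, and heap
-- stores are independent structures, so one "cycle" can read all
-- three at once with no port contention. In hardware these would be
-- three separate block RAMs, not one shared bus.
data Machine = Machine
  { code  :: [(Int, String)]  -- instruction memory: addr -> instruction
  , stack :: [Int]            -- stack memory
  , heap  :: [(Int, Int)]     -- heap memory: addr -> node
  }

-- One cycle: fetch an instruction, peek the stack, and read a heap
-- node simultaneously.
cycleRead :: Machine -> Int -> Int -> (Maybe String, Maybe Int, Maybe Int)
cycleRead m pc hAddr =
  ( lookup pc (code m)                             -- instruction fetch
  , case stack m of { (t:_) -> Just t; [] -> Nothing }  -- top of stack
  , lookup hAddr (heap m)                          -- heap read
  )

main :: IO ()
main = print (cycleRead (Machine [(0, "UNWIND")] [7] [(3, 99)]) 0 3)
-- prints: (Just "UNWIND",Just 7,Just 99)
```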

Personal opinion: maybe this is faster at the innermost levels of memory if
the code makes lots of function calls. At the outermost levels of memory, many
things rely on the ability of 'memory' to be treated as either 'data' or
'instructions' - JIT compilation, for instance. This would imply that to run
general code there needs to be a separation process similar to what occurs
between the L2 and L1 caches in current processors. I'm not sure this would
ultimately result in a performance increase for general-purpose processors.

~~~
Symmetry
Presumably if they were doing anything as complicated as a modern CPU they
would have a unified last level of memory and some mechanism for guaranteeing
synchronization.

Honestly, even C execution could probably be speeded up by having separate
stack and heap memory pipelines. To do that efficiently on an OoO machine
you'd probably need three sets of memory-access instructions: two for when
you know ahead of time which part of memory you're dealing with (stack or
heap) and one for when you don't. The first two would be there to help out
the scheduler.

Thinking about the C ABI a bit more, and how you can have a function call with
a pointer to who-knows-what place in memory, maybe this isn't actually such a
great idea in practice for C, but it should be great for languages which can
provide more guarantees.
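The three-flavour idea above can be sketched in a few lines (names are mine,
purely hypothetical): loads that statically name the stack or the heap only
need to be ordered against their own load/store queue, while an "unknown"
load must synchronise with both, which is what the scheduler wants to avoid.

```haskell
-- Hypothetical sketch of three memory-access flavours. A load tagged
-- with a known region checks only that region's load/store queue; an
-- untagged load must be ordered against both queues.
data Region = StackMem | HeapMem | Unknown
  deriving (Eq, Show)

data MemOp = LoadFrom Region Int  -- region tag, address
  deriving Show

-- Which load/store queues must this operation synchronise with?
queuesNeeded :: MemOp -> [Region]
queuesNeeded (LoadFrom Unknown _) = [StackMem, HeapMem]
queuesNeeded (LoadFrom r _)       = [r]

main :: IO ()
main = do
  print (queuesNeeded (LoadFrom StackMem 0))  -- prints: [StackMem]
  print (queuesNeeded (LoadFrom Unknown 0))   -- prints: [StackMem,HeapMem]
```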

~~~
thesz
>Honestly, even C execution could probably be speeded up by having separate
stack and heap memory pipelines.

Later versions of the Alpha AXP architecture did almost exactly that: they
reordered memory accesses based on addresses. The Alpha could execute a load
after a store without blocking, provided the load address did not clash with
the store address.

It helped them a lot, given that they had 80 registers at their disposal.
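The disambiguation rule being described reduces to a simple address check; a
toy model of it (my own sketch, not Alpha's actual logic, which also has to
handle partial overlaps and speculation):

```haskell
-- Toy model of load/store disambiguation: a load may be reordered
-- past a pending store only when their addresses provably differ;
-- on a clash the load must wait (or take the store's forwarded value).
data Op = Store Int Int  -- address, value
        | Load  Int      -- address
  deriving Show

-- Can a load at this address bypass the given pending operation?
mayBypass :: Op -> Int -> Bool
mayBypass (Store a _) addr = a /= addr  -- only if addresses don't clash
mayBypass (Load _)    _    = True       -- loads don't block loads

main :: IO ()
main = do
  print (mayBypass (Store 100 1) 200)  -- prints: True
  print (mayBypass (Store 100 1) 100)  -- prints: False
```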

~~~
Symmetry
Any modern x86 processor can do that too. The thing is, the complexity of the
circuitry involved increases faster than linearly, so being able to break it
out into two sets of load/store queues would let you increase bandwidth by a
fair bit.

------
sparky
The authors touch on the hardware benefits you get from doing away with
pointers, but I think you could go much, much further. Yes, you know which
things are heap-allocated and which are stack-allocated, so you have some
language-level disambiguation that way, but within a functional operation like
map, reduce, fold, etc., you have perfect knowledge about which data is
touched in what way, so you could potentially do something like use a
compiler-managed scratchpad for it instead of a cache. The compiler would have
perfect timing, bank-conflict, etc. information, so it should be able to do an
awesome job of scheduling and use much less energy than a hardware-managed
cache.

It will be interesting to see how much of this kind of thing is done as we
come up with ways to compile high-level languages directly down to
performance- or efficiency-oriented hardware (not through C). Most modern
architectures are designed to do a good job when the compiler has very little
idea what the code is doing; you can save a lot of area and power if you know
more.
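To illustrate the "perfect knowledge" point: the memory trace of a map over n
elements is fully determined by n alone - element i is read once and result i
written once, in order - so a compiler could emit a static scratchpad
schedule ahead of time instead of relying on a cache to discover the pattern.
A small sketch (the trace format is my own invention):

```haskell
-- The access pattern of map is statically known: for an input of
-- length n, element i is read exactly once and result i written
-- exactly once, in order. A compiler with this trace in hand can
-- schedule scratchpad banks with no runtime discovery needed.
accessTrace :: Int -> [(String, Int)]
accessTrace n = concat [ [("read", i), ("write", i)] | i <- [0 .. n - 1] ]

main :: IO ()
main = print (accessTrace 2)
-- prints: [("read",0),("write",0),("read",1),("write",1)]
```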

------
Jach
Awesome! And hey, the dissertation references John Backus' Turing Award
Lecture. :) That's the only reference my 3-week project shares though. (I was
trying to build a formal functional programming machine on an FPGA that shared
a lot of similarities with Lisp.) Here's a link for Backus:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.2622&rep=rep1&type=pdf)

------
patrickgzill
I quickly skimmed the paper but did not see any references to the Symbolics
LISP machine CPUs, which, as I understand it, had programmable or
reconfigurable microcode that supported the LISP machine features. Even
device drivers could be written in Lisp.

~~~
ramchip
AFAIK, this assumes a purely functional language, namely Haskell. I'm not sure
about the variant of Lisp used in the LISP Machines, but it's probably rather
different (e.g. it uses assignment, etc).

