From the abstracts and conclusions, I'll summarize a few main points that I got from this:
1. The functional CPU is implemented in an FPGA (a type of programmable chip) at 96Mhz
2. It performed about 25% as fast as a Core 2 Duo clocked at 3Ghz, which, while slow, is an order of magnitude better than procedural implementations on FPGAs
3. It took advantage of significant parallelism on the circuit level which is not available to modern processors
It seems that if some effort went into perfecting these chips as we do with procedural chips, we could see vast performance increases. Maybe they could be implemented as an additional unit on the computer to take advantage of functional programs.
Seems to me like matching up FPGAs with FP is a killer application for FP. You should even scale that out with multiple "cores" representing multiple programs running on the same machine. I'm just guessing, but seems like you should see an order-of-magnitude increase on similarly-specified chips. Plus power requirements should dramatically decrease, the O/S should simplify, etc.
The x86 architecture is a modified Harvard architecture where close to the CPU (L1 cache) memory is divided into 'instructions' and 'data', further from the CPU the memory is joined. L2, L3 and RAM are generally 'unified' or can contain either 'instructions' or 'data'.
This thesis proposes that separating memory to 'instructions', 'stack', and 'heap' results in a performance increase for functional languages.
Additionally this is targeted at software that makes a large number of function calls such that the time expense of function calls with current architectures is higher than the actually computational time expense of the code.
Personal opinion: Maybe this is faster at the inner most levels of memory if the code makes lots of function calls. At the outer most levels of memory many things rely on the ability of 'memory' to be treated as either 'data' or 'instructions'. JIT compilation for instance. This would imply that to run general code there needs to be a separation process similar to what occurs between the L2 and L1 caches in current processors. I'm not sure this ultimately would result in a performance increase for general purpose processors.
This is an extension on another paper from a few years ago where they described the split architecture; this paper describes some optimizations for the hardware and the compiler they use with it. The new optimizations increase performance by a factor of about 6 compared to their earlier work.
Additionally, I don't think that this processor executes code quite linearly. Its hardware can detect and break down functional code and run it in a parallel manner; they make full use of their multiport split memory to do something like eight times as much work/cycle as a (heavily pipelined!) Core 2 Duo. I admit that it probably won't work on iterative code, but there's enough functional code floating around that this could see some use as a coprocessor.
Presumably if they were doing anything as complicated as a modern CPU they would have a unified last level of memory and some mechanism for guaranteeing synchronization.
Honestly, even C execution could probably be speeded up by having separate stack and heap memory pipelines. To do that efficiently on an OoO machine you'd probably need three sets of instructions for memory accesses to the stack - two sets for where you know ahead of time which part of memory you're dealing with and one for where you don't. The first two would be there to help out the scheduler.
Thinking about the C ABI a bit more and how you can have a function call with a pointer to who-knows-what place in memory maybe this isn't actually such a great idea in practice for C, but it should be great for languages which can provide more guarantees.
>Honestly, even C execution could probably be speeded up by having separate stack and heap memory pipelines.
Later versions of Alpha AXP architecture did almost exactly that. They reorder memory access based on addresses. Alpha could execute load after store without blocking, providing load address did not clash with store address.
It helped them a lot, given that they had 80 registers at their disposal.
Any modern x86 processor can do that too. The thing is, the complexity of the circuitry involved increases faster than linearly so being able to break it out into two sets of load/store queues would let you increase bandwidth by a fair bit.
The authors touch on the hardware benefits you get from doing away with pointers, but I think you could go much, much further. Yes, you know which things are heap-allocated and which are stack-allocated, so you have some language-level disambiguation that way, but within a functional operation like map, reduce, fold, etc., you have perfect knowledge about which data is touched in what way, so you could potentially do something like use a compiler-managed scratchpad for it instead of a cache. The compiler would have perfect timing, bank conflict, etc. information, so it should be able to an awesome job of scheduling and use much less energy than a hardware-managed cache. Will be interesting to see how much of this kind of thing is done as we come up with ways to compile high-level languages directly down to performance- or efficiency-oriented hardware (not through C). Most modern architectures are designed to do a good job when the compiler has very little idea what the code is doing; you can save a lot of area and power if you know more.
Awesome! And hey, the dissertation references John Backus' Turing Award Lecture. :) That's the only reference my 3-week project shares though. (I was trying to build a formal functional programming machine on an FPGA that shared a lot of similarities with Lisp.) Here's a link for Backus: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72....
I quickly skimmed the paper but did not see any references to the Symbolics LISP machine CPUs - which as I understand it had programmable or reconfigurable microcode, which supported the LISP machine features. Even device drivers could be written in Lisp.
AFAIK, this assumes a purely functional language, namely Haskell. I'm not sure about the variant of Lisp used in the LISP Machines, but it's probably rather different (eg. uses assignment, etc).
1. The functional CPU is implemented in an FPGA (a type of programmable chip) at 96Mhz
2. It performed about 25% as fast as a Core 2 Duo clocked at 3Ghz, which, while slow, is an order of magnitude better than procedural implementations on FPGAs
3. It took advantage of significant parallelism on the circuit level which is not available to modern processors
It seems that if some effort went into perfecting these chips as we do with procedural chips, we could see vast performance increases. Maybe they could be implemented as an additional unit on the computer to take advantage of functional programs.