1. The functional CPU is implemented in an FPGA (a type of programmable chip) running at 96 MHz
2. It performed about 25% as fast as a Core 2 Duo clocked at 3 GHz, which, while slow, is an order of magnitude better than procedural implementations on FPGAs
3. It took advantage of significant parallelism at the circuit level that is not available to modern processors
It seems that if some effort went into perfecting these chips, as has gone into procedural chips, we could see vast performance increases. Maybe they could be implemented as an additional unit in the computer to take advantage of functional programs.
Seems to me like matching up FPGAs with FP is a killer application for FP. You could even scale that out with multiple "cores" representing multiple programs running on the same machine. I'm just guessing, but it seems like you should see an order-of-magnitude increase on similarly specified chips. Plus power requirements should decrease dramatically, the OS should simplify, etc.
Hope somebody picks this up and runs with it.
This thesis proposes that separating memory into 'instructions', 'stack', and 'heap' results in a performance increase for functional languages.
Additionally, this is targeted at software that makes a large number of function calls, such that the time cost of the function calls on current architectures is higher than the actual computational cost of the code.
Personal opinion: maybe this is faster at the innermost levels of the memory hierarchy if the code makes lots of function calls. At the outermost levels, many things rely on the ability of memory to be treated as either 'data' or 'instructions', JIT compilation for instance. This would imply that to run general code there needs to be a separation process similar to what occurs between the L2 and L1 caches in current processors. I'm not sure this would ultimately result in a performance increase for general-purpose processors.
Additionally, I don't think this processor executes code all that linearly. Its hardware can detect and break down functional code and run it in parallel; it makes full use of its multiport split memory to do something like eight times as much work per cycle as a (heavily pipelined!) Core 2 Duo. I admit that it probably won't work on iterative code, but there's enough functional code floating around that this could see some use as a coprocessor.
Honestly, even C execution could probably be sped up by having separate stack and heap memory pipelines. To do that efficiently on an out-of-order machine, you'd probably need three classes of memory-access instructions: two for when you know ahead of time which part of memory (stack or heap) you're dealing with, and one for when you don't. The first two would be there to help out the scheduler.
Thinking about the C ABI a bit more, and how a function call can receive a pointer to who-knows-what place in memory, maybe this isn't actually such a great idea in practice for C, but it should be great for languages that can provide more guarantees.
Later versions of the Alpha AXP architecture did almost exactly that: they reordered memory accesses based on addresses. Alpha could execute a load after a store without blocking, provided the load address did not clash with the store address.
It helped them a lot, given that they had 80 registers at their disposal.