Author here. I think today that apart from making my case in a bit of an obnoxious tone, I also somewhat overstated it: while it's true that many "high-level" constructs have a cost that will not magically go away, at least not fully, due to any logic built into hardware, it also ought to be true that a lot can be done in hardware to make software's life easier given a particular HLL programming model, and I'm hardly an expert on this. My real interests are in accelerator development, i.e. starting at the GPU and moving further away from the CPU, which is lower-level and gnarlier than C in terms of programming model.
I will however say that the Reduceron and in general the idea of doing FP in hardware in the most direct way are a terrible waste of resources and I'm pretty sure it loses to a good compiler targeting a von Neumann machine on overall efficiency.
The way to go is not to make a hardware interpreter; that is no better than a processor with a for-loop instruction added to better support C. The trick is to carefully partition sw and hw responsibilities, as in the model to which C+Unix/RISC+MMU converged.
I'm curious whether you think the ideal boundary between SW/HW might've shifted in the last ~40 years, as the things we use computers for have drastically changed?
I know basically nothing about hardware, but I know the software layer from the OS/compiler up through the UI. There's a fair bit of evidence that things we've traditionally assumed belong in the kernel actually belong in userspace, and they're being reinvented in userspace as a result. For example, most modern languages & frameworks put some form of scheduler in the standard libs - we're reimplementing the abstraction of a thread as promises or fibers or async/await or callbacks. Many big Internet companies disable virtual memory in their production servers, because once the box begins swapping you might as well count it as down. Many common business apps program to a database, not a filesystem, and then the database uses block-based data structures like B-trees and SSTables but then has to implement them on top of filesystems.
At the same time, the classic OS protection boundary is the process, but the unit of code-sharing in the open-source world is the library. As a result, the protection mechanisms that OSes have gotten very good at are largely useless at preventing huge security violations from careless coding in a library dependency.
Most of these came from computers being used outside of the original domains that the system software developers assumed, eg. nobody in the 1970s would've imagined 10 million GitHub users of widely different skill levels all swapping code. Knowing what we do now about the big markets for computation, are there additional operations we'd want to put in hardware, or things currently done in hardware that should be moved to software?
There is a fascinating talk by Cliff Click, "A JVM Does That?" At the end he shares some opinions about what should be done by the JVM or the OS and what should change.
I remember a talk where he also talked about hardware; it does not seem to be this one. For example, a time register would be useful: syscalls like clock_gettime are too slow, and CPU cycle counters fail with dynamic frequency scaling.
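Roughly what I mean, sketched in C for x86/Linux (a toy example of mine, not from the talk; the TSC read still needs calibration and an invariant TSC to be trustworthy):

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

int main(void) {
    /* The portable route: goes through the vDSO (or a real syscall), costs tens of ns. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);

    /* The closest thing to a "time register" today: reading the TSC costs a
       handful of cycles, but you have to calibrate it against wall-clock time
       and hope it ticks at a fixed rate under frequency scaling. */
    uint64_t t0 = __rdtsc();
    uint64_t t1 = __rdtsc();

    printf("clock_gettime: %lld.%09ld s\n", (long long)ts.tv_sec, ts.tv_nsec);
    printf("back-to-back rdtsc: %llu ticks\n", (unsigned long long)(t1 - t0));
    return 0;
}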
Nit: disabling swapping is not the same thing as disabling virtual memory. Virtual memory is what makes swapping possible, but it does not require swapping.
> There's a fair bit of evidence that things we've traditionally assumed belong in the kernel actually belong in userspace, and they're being reinvented in userspace as a result.
That's basically the pitch given by unikernels like Mirage: functionality which was traditionally implemented by an OS, like storage, is turned into a library that gets compiled into the application like any other. If an application wants to access a filesystem on the disk it can use an appropriate library; if instead it wants to manage the data being stored on (some section of) the disk directly, it just needs a different library. That way, applications like relational databases can claim their own section of the disk and read/write it directly, to avoid the performance and reliability (e.g. caching/flushing) penalties of going via a filesystem.
I think all of your points are valid, but they focus on one part, virtual memory, which you suggest removing (and that can be done today by simply not using that part of the hardware; the penalty of carrying the unused hardware is AFAIK fairly small). My original point was that adding (or changing) hardware to accommodate HLLs is not going to buy you as much performance as people think, and that concerns a different part of the sw/hw boundary: basically, what compilers/interpreters/runtimes should be doing versus what should be handled at the ISA level. Your points are about what protection mechanisms we want and how hardware, software and the OS should divide that work. I guess I should have said C/RISC and kept Unix/MMU out of it, as I did in TFA.
What will actually happen with protection mechanisms I don't know; certainly Unix-style mechanisms are spreading to ever more places, with, say, HSA's idea of accelerators and CPUs being aware of the same virtual memory maps. Compatibility is a very strong force here. On the other hand, there's a lot of stuff happening with memory protection disabled, as you described. My predictions here are going to be less educated than many others', to be honest, because I deal with embedded systems, whereas most of the exciting stuff here happens in servers, I'd guess. (But I can tell you that in automotive embedded systems, of all places, not only are Unix-style processes gaining traction right now, but so are hypervisors with multiple actual OSes, some of them POSIXy, sharing chips. So that's a data point showing a trend in the "more of the same" direction.)
A promising alternate architecture places some features previously handled in software into hardware [1]. The execution model still closely matches current architectures.
See also different approaches to programming that make space-time tradeoffs more explicit [2], and use natural-law-like principles to distribute computing across a simpler but highly connected computing fabric.
Alan Kay takes part in discussions here occasionally. He seems like a pretty easygoing guy (but you might want to rein it in a bit), so you could probably just email him...
Alan Kay likes to fail at least 90÷ of the time (shows you aim high enough) and says the industry is too dumb to digest good ideas. I like to succeed at least 90÷ of the time so that the dumb industry keeps employing me. I'm afraid we have irreconcilable differences. (And this is me reining it in A LOT right here. Don't get me started...)
To be clear, you mean 90%, correct? If so, the ÷ symbol (which I just learned is called an obelus) is typically used for division—I've never seen it used to mean percent. Is this a locale difference? A keyboard issue? I've seen that Android users will sometimes mistype ℅ for % due to their proximity on a certain keyboard, for example.
Can you elaborate a bit on why you think the Reduceron, or using an FPGA alongside a CPU, is not a good idea? I thought that since clocks aren't gonna get much higher, that is the future: maybe compilers will start generating some kind of VHDL that can make the app you spend most of your CPU time on much faster (the theoretical possibilities seem great with big enough FPGAs).
Speaking as someone who programs FPGAs for a living, they are good for three things (I'm simplifying a bit here):
* interfacing with digital electronics, implementing low-level protocols, and deterministic/real-time control systems
* emulating ASICs for verification
* speeding up a small subset of highly specialized algorithms
Of those, only the last one would apply in the context of this thread. However, due to their structure and inherent tradeoffs, they are completely incapable of speeding up general-purpose computation. As for specialized computation: if it relies heavily on floating-point ops, a GPU will nearly always be faster and cheaper.
I think the new Stratix might well beat GPUs, not?
The Reduceron specifically tries to quickly perform application of lambda expressions that GHC will try to avoid generating in the first place. The Reduceron speeds things up using several memory banks etc. but it still does things that shouldn't be done at all and the overhead is there at least in area and power.
I think FPGAs are too expensive to be used for general purpose computing. If on top of the chip price you add the development time it's just not cost effective. A high-end FPGA will cost you thousands of dollars and you won't be able to easily convert software code to HDL. A very high end GPU will be cheaper and easier to develop for.
There are situations where an FPGA is better suited, of course (very low latency real-time signal processing, for instance), but for general-purpose computing FPGAs are not exactly ready for primetime IMO.
> I think the new Stratix might well beat GPUs, not?
Adding to what simias said, even an FPGA with built-in floating-point primitives can beat a GPU (in floating-point-heavy computations when the measure is performance/cost) only if the algorithm doesn't fit well onto the GPU architecture – for example, if you can make use of the highly flexible SRAM banks on the FPGA. I suppose there exist such workloads, but they're rare.
Also, keep in mind that no FPGA comes even close to the raw external memory bandwidth of modern GPUs.
Even #2 is questionable. I had a front-row view of a company that made a chip very quickly, and one way they were able to do it was by not bothering to emulate the ASIC in an FPGA. There are some very nice open-source hardware development tools that basically obviate that need.
Once you get to the point where you need ASIC emulation, there are no open-source tools that are up to the task.
You don't need emulation for simple stuff like Bitcoin miners or other small and/or highly regular chips. You use it if you're developing a large SoC that takes tens of millions of dollars to build. It takes months after finishing your HDL code before you get the first silicon from the fab, and you don't want to wait that long before you can start testing your custom software.
So, no, #2 isn't questionable, it's routine practice. In fact, the largest FPGAs by Xilinx and Altera are structured explicitly with that use case in mind.
Emulation via FPGA or dedicated HDL emulator (a special supercomputer designed for running Verilog/VHDL, very fast, very good, very expensive) is also essential for functional verification of things like CPUs.
For example, booting Linux and running a simple application can take many billions of cycles. You simply can't simulate that many; you need something faster. (You can simulate a few billion overnight with an appropriate server farm, but that's across many tests using many simulator instances.)
Unless you're NVIDIA, Intel, AMD, or Qualcomm, Apple, Samsung (you get the idea) why would you ever want to build anything besides "small and/or highly regular chips"?
I think there's a lot of interest in designing "highly regular" chips. And it's definitely possible to go quite far with open source tools. I've seen a 16-core general purpose chip with full 64-bit IEEE FP, ALU, and memory instructions operating at 1 MHz (real speed) as a gate-level simulation on a desktop computer. This could potentially be "running linux" at a reasonable (if sluggish) speed.
> Unless you're NVIDIA, Intel, AMD, or Qualcomm, Apple, Samsung (you get the idea) why would you ever want to build anything besides "small and/or highly regular chips"?
What's your point? The ASIC emulation market exists, there are several companies that build and sell ASIC emulators, and Xilinx and Altera cater to that market with dedicated FPGA devices. I'm not sure why you're arguing here.
I'm just a curious hardware development newbie passing by but would you be willing to share the open-source development tools used? It would be really interesting to take a peek at something that was used to develop a chip very quickly. Most of the hardware stuff seems to be quite complicated and not all that open.
I'm not the original poster, but perhaps he's thinking of https://chisel.eecs.berkeley.edu/ I believe it can (or could?) generate C++ code which compiles into a program that simulates your design.
If your architecture meets these requirements, I'll consider a physical implementation very seriously (because we could use that kind of thing), and if it works out, you'll get a chip
Fabbing someone else's idea sounds expensive. What did you have in mind?
People put experimental digital blocks in ASICs all the time; it isn't necessarily that expensive. It's a bit like taking an extra pair of shoes on holiday - in general, I'm a bit concerned about reaching the airline's baggage weight limit. If you ask me to add your shoes to my bag before I start packing, I'm going to say no. If you ask at the end, and I've got some space left, then fine.
However, he's probably talking about an FPGA implementation. That'd be sufficient to prove the concept. Once you've gone that far, you can normally do some simulations to predict the energy consumption on a real chip.
I was at the time of writing, and still am, an accelerator architect, and I'd gladly use someone's idea in a mass-market product (ASIC) if they didn't mind. However, working on any real product means that many valid ideas useful in some contexts will not be useful for me, and I guess this is true for many ideas for speeding up higher-level programming models; perhaps it was misleading of me to fail to point this out. (As I said, I don't love the tone of that article, but it is unfortunately very effective: my articles written in that tone around 2008 tend to resurface more often than the nicer, more balanced and far more technically detailed articles from around, say, 2012-2013. What my takeaway is with respect to future writing, I'm still not quite sure.)
What do you mean by "for loop instruction added"? How would that look from a developer's perspective and what could be done in hardware to improve efficiency?
I used that as an example of a bad idea; I don't have details on this bad idea, but you could have an instruction that looks at init, bound and increment registers plus a constant telling it where the loop ends, and voila, the processor runs for loops without needing lower-level increment and branch instructions, and it shaves off one instruction (not a cycle, necessarily, but an instruction):
FOR counter_reg, init_val_reg, bound_reg, step_reg, END_OF_LOOP
...
END_OF_LOOP:
I was saying that this obviously not-so-good idea is not much different in spirit from building hardware for quickly creating and applying lambda terms, which is what the Reduceron does. Lowering lambda calculus to simpler operations so that lambda expressions are not represented in a runtime data structure at all much of the time, the way GHC and other compilers approach the problem, is a better idea.
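To spell out what the hypothetical FOR instruction would actually buy, here's the counted loop compilers already handle fine, with a rough sketch of the usual lowering in the comments (exact codegen varies; this is only an illustration):

/* A plain counted loop; the compiler lowers the bookkeeping to roughly
   an increment, a compare and a conditional branch per iteration: */
void scale(float *a, int n, float k) {
    for (int i = 0; i < n; i++) {   /* loop:                    */
        a[i] *= k;                  /*   ...loop body...        */
    }                               /*   add  i, i, 1           */
                                    /*   cmp  i, n; blt loop    */
}
/* The FOR opcode would fold that bookkeeping into one instruction. It saves
   fetch/decode work, not necessarily cycles: a decent core already predicts
   the branch and overlaps the increment with the loop body. */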
This is OT, but - most of linear algebra dies quickly even in single precision, meaning that your equation-system solver produces a solution that doesn't really solve the equations, etc. One exception is neural networks, where Google's TPU is just the start, and where GPUs, while beating CPUs, leave a lot of room for improvement.
Actor-based dynamic language author here. (Doesn't matter which one; I think I speak for all of us.) Thank you for being honest with us; we are not a very performance-oriented group sometimes.
We're generally in favor of things which accelerate message passing between shared-nothing concurrent actors. Hardware mailboxes or transactional memory are nifty. OS-supported message queues are nifty; can those be lowered to hardware in a useful way?
I'm not sure whether today's coherent caches, atomic operations, etc. are a poor fit for what you want to do, leaving much room for improvement. (I'm sure someone familiar with, say, the Go stack will be able to say more. I can say that for computational parallelism everything is fine with current hardware, but there 100K tasks would map to dozens of threads, tops; if you want 100K concurrent actors, maybe things look different. At any rate, I don't see how the shared-nothing part creates a problem hardware can solve here; maybe there are problems in the "lots and lots of concurrent actors" part, but I'm not sure.)
Incidentally, IMO shared-nothing is an inherently inefficient model for multiple actors cooperating to perform a single computation, and nothing done in hardware can fully eliminate the cost introduced by the model (if anything can be done, it can be done by code analysis transforming the code into a more efficient shared-memory model.) This is not to say that there's no value in such a system - far from it - just that it's a poor fit for some things, which can only be mapped onto it with overhead that hardware cannot eliminate.
To my knowledge, this is still the only hardware-assisted message passing scheme that is virtualisable (ie compatible with a "real" OS like Linux).
Hardware mailboxes are great, but time-sharing OSs can't deal with finite hardware resources that can't be swapped out easily. Software-based queues die a fiery death thanks to cache coherency - reading something that another core just wrote will block you for hundreds of cycles.
Virtualizable hardware-assisted message passing is awesome. (MIPS for instance had a big fat ISA extension for hardware-assisted message passing and cheap hardware multithreading which Linux couldn't use and they then threw out the window exactly when they introduced hardware virtualization of the entire set of processor resources.)
As to software-based queues dying a fiery death - in what scenarios? As I said in a sister comment, I (think that I) know that things work out in computational parallelism scenarios where many tasks are mapped onto a thread pool, TBB-style, that is, I don't think the hardware overhead is ridiculously large in these systems. Where do things go badly? 100K lightweight threads communicating via channels, Go-style?
Whoa - MIPS virtualised their message passing hardware? How?!
Software-based queues die a fiery death when the latency of a send/receive is critical, because you end up stalling on a really slow cache-coherence operation. So, for example, anything like a cross-thread RPC call takes ages (you wait for the transmission and wait for a response, so it's much slower than a function call and often a system call - the Barrelfish research OS suffers a bunch from this). There are also algorithms you just can't parallelise because you can't split them into large chunks, and if you split them into small chunks the cost of communicating so frequently destroys your performance. (Eg there was a brave attempt to parallelise the inner loop of bzip2 - which resists coarse parallelisation thanks to loop-carried dependencies - this way).
Software based queues perform just fine on throughput, though - if you're asynchronous enough to let a few messages build up in your message queue, you'll only pay the latency penalty once per cache line when you drain it (and with a good prefetcher, even less than that).
The examples you cite are actually both instances of software cunningly working within the limits of slow inter-core communication. Work-queue algorithms typically keep local, per-core queues and only rebalance tasks between cores ("work stealing") infrequently, so as to offset how expensive that operation is. Lightweight threads with blocking messages (like Go or Occam or some microkernels) work by turning most message sends into context switches within one core - when you send a message on a Go channel, you can just jump right into the code that receives it. Again, they can then rebalance infrequently. (For an extra bonus, by making it easy to create 100k "threads", they hope to engage in latency-hiding for individual threads - and once you're in "throughput" mode it's all gravy).
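To make the "once per cache line" point concrete, here's a minimal single-producer/single-consumer ring buffer sketch in C11 (names and sizes made up, not taken from my thesis code); the batching in drain() is what amortizes the cross-core miss:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QCAP 1024  /* power of two */

struct spsc {
    /* The producer's and consumer's indices live on separate cache lines, so
       the coherence traffic is mostly the message data itself. */
    _Alignas(64) _Atomic uint64_t head;  /* written by producer */
    _Alignas(64) _Atomic uint64_t tail;  /* written by consumer */
    _Alignas(64) uint64_t buf[QCAP];
};

bool push(struct spsc *q, uint64_t msg) {
    uint64_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint64_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QCAP) return false;                 /* full */
    q->buf[h % QCAP] = msg;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

/* Drain everything that's there in one go: the consumer eats the cache misses
   on buf[] once per 64-byte line (8 messages here), not once per message. */
size_t drain(struct spsc *q, uint64_t *out, size_t max) {
    uint64_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint64_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    size_t n = 0;
    while (t + n != h && n < max) {
        out[n] = q->buf[(t + n) % QCAP];
        n++;
    }
    atomic_store_explicit(&q->tail, t + n, memory_order_release);
    return n;
}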
> Whoa - MIPS virtualised their message passing hardware? How?!
No, I meant to say that they simply obsoleted that part of their architecture when they added virtualization, because they couldn't virtualize it.
> Eg there was a brave attempt to parallelise the inner loop of bzip2 - which resists coarse parallelisation thanks to loop-carried dependencies - this way.
So you say you can do hardware-assisted message passing that can be virtualized and can speed up bzip2 by parallelizing? How few instructions per RPC call does it take for you to still be efficient vs today's software-based messaging? (This is getting fairly interesting and it should be particularly interesting to serious CPU vendors.)
This is getting deep in an ageing thread - do you want to take this to email? (It's in my profile)
Pipelined bzip2 wasn't in the evaluation for my research, but I bet remote stores would get considerably better results than software queues. Parallelising one algorithm is something of a stunt, and gets you just a single data point. Instead, I did a bunch of different benchmarks (microbenchmarks for FIFO queues and synchronisation barriers; larger benchmarks including a parallel profiler and a variable-granularity MapReduce to measure how far remote stores could move the break-even point for communication vs computation; and an oddball parallel STM system that I'd previously demonstrated on dedicated (FPGA) hardware). I got around an order of magnitude on all of them (some a little less, some much more).
Looking back, I seriously regret not taking more time to sit down and write it up more clearly, because I do think this should be interesting to serious CPU vendors. However, by then I had reached the point of "I'm fed up with this PhD; I'm going home now". As I knew I didn't want to stay in academia, I published in a mediocre venue rather than revising for a better one, and went off to Silicon Valley instead. Your comments have made me re-read my old work, and it's painful to wonder how much further it could have gone if I had explained it better.
If you do them in hardware, they always come bounded. At most n elements of size m bytes and both numbers usually single digit. If you want to lift that limitation, it usually is just as slow as doing it in software.
Unbounded queues are arguably not a good idea (although single digit bounds are possibly too low?), at the very least there would probably need to be some concept of back pressure.
Have you tried or heard anyone trying to tinker with caching attributes (ie. uncached, write-combining, etc) for message passing buffers? I think you do need to be in kernel mode to be able to change the attributes, but you can access the memory in userspace once set up.
Sounds to me like you could get some improvement in shared-nothing message passing by avoiding traffic on the CPU core interconnect due to unnecessary caching.
That said, I only have experience of tinkering with caching attributes in CPU to GPU communication and I'm not very well familiar with the internals of CPU interconnects so take my words with a grain of salt.
There are various levels of caching, which have very different performance characteristics with different access patterns from multiple cores. Don't dismiss it outright. You can reduce the pressure on the cache coherency protocol by making different tradeoffs.
Yes there are, and there were a bunch of interesting research machines in the 90s that played with weird cache modes for shared data. (If you're interested, I can go dig out the literature review section of my thesis for you. There were some weird and wacky schemes, none of which saw industrial deployment).
But the bottom line is that the cache options available in modern desktop/server processors won't really help. They all basically disable the cache in one way or another. (They're really intended for controlling memory-mapped devices.) And while the latency of fetching something out of another core's L1 cache is nasty, going all the way out to DRAM for every access is really, really slow.
So, I'll stand by my summary, albeit less caustically. While it's an interesting idea, you really don't want to do what you suggest.
It seems to me that Intel TSX and general work on improved atomics is what you want. The high level constructs themselves probably shouldn't be directly in hardware.
The history of "higher level" instructions isn't good. The DEC VAX had an assembly language intended to make life easier for assembly programmers, but it slowed the machine down. The Intel IAPX 432 had lots of bells and whistles, but was really slow. The RISC machines with lots of registers turned out not to be all that useful, and too much register saving and restoring was required. RISC is a win until you want to go superscalar and have more than one instruction per clock. Then it's a lose. Stack machines that run some RPN form like Forth or Java code have been built, but don't go superscalar well.
A useful near-term feature would be zero-cost hardware exceptions on integer overflow. This is an error in both Java and Rust, and tends to be turned off at compile time because it has a performance penalty. The problem is that people will want to be able to unwind and recover, which means exact exceptions and a lot of compiler support for them.
If you could figure out how to do zero-cost subscript checking, that would be a step forward. That check needs additional info about bounds, which usually means a delay.
I used to be a fan of schemes for safely calling from one address space to another. i386 almost has this, with call gates, which don't quite do enough to be useful. A few machines have had hardware context switching, but that hasn't been a big performance improvement. All that has to be tightly integrated with the OS or it's a lose. It's an enhancement to Plan 9, not anything anybody uses.
The same is true of fancy schemes for inter-CPU communication, but that probably needs more attention. Like it or not, we have to figure out what to do with large numbers of non-shared-memory CPUs. Some way to set up memory-safe message passing between non-shared-memory CPUs without involving the OS after setup would be useful.
An IOMMU that allows drivers in user space with minimal performance degradation is a good thing. Those exist.
> Stack machines that run some RPN form like Forth or Java code have been built, but don't go superscalar well.
I've been interested in Forth (and related stack) processors for a while, and my armchair observations over a few months have suggested that the (much-vaunted) performance gains associated with such processor designs are apparently not straightforward to relate to real programs or take advantage of.
I remember (unfortunately not sure where right now) reading how the GA144 was built at a time when 18-bit memory was the current trending novelty and that it's not really a perfect processor design. I'm still fascinated by it though (sore lack of on-chip memory notwithstanding).
What sort of scale are you referring to when you say "superscalar"? 144 processors? 1000? Do stack-based architectures remain a not-especially-practical-or-competitive novelty, or are they worth pursuing outside of CompSci?
(FWIW, everything else you've written is equally interesting, but slightly over my current experience level)
Pure Forth machines were interesting when the CPU clock and the memory ran at about the same speed, and the number of gates was very limited. Chuck Moore's original Forth machine had a main memory, a data stack memory, and a return stack memory, each of which was cycled on each clock. It took only about 4K gates to make a CPU that way.
Today, the speed ratio between CPU and main memory is several orders of magnitude. The main problem in CPU design today is dealing with that huge mismatch. That's why there are so many levels of cache now, and vast amounts of cleverness and complexity to try to make memory accesses fewer, but wider.
The next step is massive parallelism. GPUs are today's best examples. Dedicated deep learning chips should be available soon, if they're not already. That problem can be implemented as a small inner loop made massively parallel.
Where's the Forth machine getting its operands from?
Sure, if you constrain your programs to use a tiny area of memory you might be able to achieve theoretical speed, but what workloads can you achieve with that?
If you were to write a browser in Forth, presumably it would still have to store all its DOM in DRAM?
You probably want to parallelize the browser for energy efficiency. Then you can distribute the DOM across the scratchpads of hundreds of cores, maybe?
So each core has its own JS engine, or when you iterate across the DOM with a selector you have to query across all the nodes? This doesn't sound great.
(The "pile of cores with scratchpads" exists e.g. Tilera and Netronome, and they're a right pain to program for)
Superscalar is not about the number of processors, but about a single core's ability to run multiple instructions in parallel. The most well-known example of this is SIMD.
SIMD is an example of superscalar design. It runs multiple instructions in parallel; they all just happen to be the same instruction. I could have said MIMD, but that is not a term that is well known.
I wouldn't call SIMD superscalar. The complexity of a superscalar design is being able to track multiple instructions, their dependencies and their out of order completion [1]. Classical SIMD machines run every lane in lock step.
[1] not OoO issue, that would be a proper Out Of Order CPU.
It's a classic example of how much the definition matters. Though, if we define superscalar to mean that the instructions can run independently (i.e. not in lockstep), then I agree that a single SIMD unit is not superscalar. But a design with 2 SIMD units that operate on 2 different data streams independently would be a superscalar design.
I cannot call myself an expert, but I do have some experience in the domain. The classic example of ILP in a superscalar design is:
a = b + c
d = e + f
g = a + d
Where calculation of a and d is executed in parallel.
Wikipedia is not entirely clear either, but the entire page gives the impression that it should send instructions to multiple execution units in parallel, in which SIMD would be a single execution unit.
> Where calculation of a and d is executed in parallel.
Why not try to reason from the assembler/machine-code standpoint, then?
The parallel calculation above could be done in different ways:
a) the compiler would emit two "scalar" ADD instructions following one another (allocating registers so that independent execution is possible).
b) the compiler would coalesce both additions b+c and e+f into one vector operation (let's assume the data layout makes such an optimization useful) and emit only one "vector" (SIMD) instruction.
In case a), the two scalar instructions would be fetched "sequentially" by the prefetch unit but executed in parallel, in two separate instances of the adder ==> "super-scalar".
In the other case, the vector operation would not be paired with another instruction from the computation you mentioned above.
EDIT: corrected typo in operands of the coalesced additions.
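Sketched in C, just to make the two cases concrete (x86 SSE intrinsics assumed for case b; this is an illustration, not production code):

#include <emmintrin.h>   /* SSE2 */

/* Case a): two independent scalar adds. The compiler emits two ADD
   instructions; a superscalar core can issue both in the same cycle on two
   ALUs - instruction-level parallelism. */
int case_a(int b, int c, int e, int f) {
    int a = b + c;    /* add #1 */
    int d = e + f;    /* add #2, independent of #1 */
    return a + d;     /* depends on both, issues a cycle later */
}

/* Case b): the same work coalesced into a single SIMD instruction (PADDD),
   with b and e packed into one register and c and f into the other. One
   instruction on one execution unit - data-level, not instruction-level,
   parallelism. */
__m128i case_b(__m128i b_e, __m128i c_f) {
    return _mm_add_epi32(b_e, c_f);
}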
> RISC is a win until you want to go superscalar and have more than one instruction per clock.
Uh? ARM, SPARC, POWER are doing just fine in the superscalar domain (heck, power8 is 8 wide!). I don't think they have any particular advantage with regard to CISCy x86 (other than a simpler, more scalable, decoder) but don't have any disadvantage either.
> A useful near-term feature would be zero-cost hardware exceptions on integer overflow.
I see this mentioned regularly. Has anybody actually measured that performance penalty? I'd really like to see the benchmarks and the code samples (both the high-level code and the machine code). Because at the machine-code level it's just checking a single flag after the operation that could have resulted in an overflow, and only if there is an overflow does more exception code have to be executed. And nobody should be writing execution paths where the exceptions happen much more often than not, right? If there's a need for an explicit check (like when implementing bignum routines), there should be a real language feature for that; that is, something the language designers should do, not the CPU designers.
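For concreteness, the kind of check I mean, sketched with the GCC/Clang builtin (a toy example, not taken from any benchmark; on x86 it compiles to an add followed by a jump-on-overflow):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Checked addition: the common case falls straight through; the overflow case
   jumps to a cold error path. The open question is what that almost-never-taken
   branch costs once it's on every arithmetic operation in a program. */
static int32_t checked_add(int32_t x, int32_t y) {
    int32_t sum;
    if (__builtin_add_overflow(x, y, &sum)) {
        fprintf(stderr, "integer overflow: %d + %d\n", x, y);
        abort();
    }
    return sum;
}

int main(void) {
    printf("%d\n", checked_add(2000000000, 100000000));  /* overflows, aborts */
    return 0;
}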
The Rust people did; it then became a discussion about sufficiently smart compilers.
I'd imagine that from a compiler engineering standpoint it's pretty hard to make this performance-neutral. It makes all the arithmetic loops into non-leaf nodes in the call graph and so has second-order effects in optimization passes.
* Expose the data dependency graph directly to the processor, rather than forcing the processor to infer it from the instructions.
* Annotate when data is read-only, to reduce communication between cores (for the sake of avoiding latency, not for bandwidth savings).
* Add a mechanism for much cheaper (if more limited) parallelism where a single core would work on multiple, related thunks in parallel, even if those units of work would be far too small to be worth coordinating with another thread to offload. This would likely be largely implicit, taking advantage of the first feature.
* Instructions for graph traversal. The CPU could (for some uses) order the traversal in a way that improves cache locality, based on how the graph is actually laid out in memory (and prefetch uncached nodes while working on cached ones).
* Something like map & reduce, where you can apply a small (pure) function to a list of data. Again, this would likely be done in parallel.
> What format would you propose? How is it different to instructions?
It would still be instructions, but with their operands reworked and a destination introduced (instruction address + operand port). Another option that could maybe work is a second, mirrored instruction stream that contains only the dataflow/dependencies.
> What overhead would you save? On the cache coherence protocol level?
The current strategy is write-invalidate, I believe. In some heavy-contention situations (e.g. lots of spin-loops on one variable), using a 'write-once' instruction would reduce traffic. Thinking more radically: with an overhaul of the entire virtual-memory system, a write-once instruction becomes potentially very interesting (Jack Dennis had an interesting paper on the subject: http://www.cs.ucy.ac.cy/dfmworkshop/wp-content/uploads/2014/...).
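The contention pattern I have in mind, sketched in C11 (the usual software workaround today is to spin on a local read; a write-once/update-style primitive would attack the same traffic from the hardware side):

#include <stdatomic.h>
#include <stdbool.h>

/* Naive spinlock: every failed attempt is a read-modify-write, so under
   contention each core keeps invalidating the line in everyone else's cache. */
void lock_naive(atomic_flag *l) {
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* coherence traffic on every iteration */
}

/* Test-and-test-and-set: spin on a plain load, which stays in the local cache
   in Shared state; only attempt the write once the lock looks free. */
void lock_ttas(atomic_bool *l) {
    for (;;) {
        while (atomic_load_explicit(l, memory_order_relaxed))
            ;  /* local spin, no interconnect traffic while the line stays Shared */
        if (!atomic_exchange_explicit(l, true, memory_order_acquire))
            return;
    }
}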
A VLIW doesn't "work on multiple, related thunks in parallel" but instead works on multiple instructions from a stream. What the GP is talking about would be something that can parralelize across function calls or loop iterations and there's now way in a conventional VLIW to do that since each loop iteration contains the same instructions.
1. Eliminate cache coherency protocols (replacing it with cache manipulation/inter-CPU communication instructions)
2. Eliminate virtual memory (replacing it with nothing)
I'm not a CPU designer, but my understanding is that removing features allows for a denser/faster CPU. Well, these are two features that a suitably high-level language has no need for, because a high-level language doesn't expose "memory" to the programmer.
Edit: Though, I 100% agree with what I believe is the core point of the author. We should not implement high-level features in hardware. In fact we should implement as little as possible in hardware, moving as much as possible into software. If Intel would let third parties generate microcode for their CPUs, we could move a lot further in that direction...
Boy, do I have a new processor architecture for you. For the past 3 years my company (REX Computing), has been working on a new architecture that not only covers both of your points, but removes hardware managed caching entirely and replaces it with software managed (physically addressed) scratchpad memory. By removing all of the unnecessary logic (MMU, TLB, everything required for virtual addressing, prefetching, coherency, etc) associated with the onchip memory system, we can fit more memory on the chip itself, have lower latency to that memory, and use a lot less power. As we fully expose the memory system to our software tools, we can make very good decisions at compile time that replicate most of the features you get in having a hardware managed caching system.
We have silicon back, and are now working with early customers in showing a 10 to 25x energy efficiency improvement for high performance workloads.
Definitely watching the video later when I have time. Thoughts from my initial impression:
1. Please put benchmarks up on HN when you can :)
2. Would be awesome if you could get this into the hands of low level kernel developers who can tinker with this and see what Linux is like on it - if it's capable of doing that much? (Need to watch the video!)
3. Conservativism FTW (IMO) when thinking about offers [that come in after you've done (1) :P]
1: Around the 30:40 mark in the video I transition to the live hardware demo, where I learned to never do live demos. While it did not work there in person, I do show pictures from it working earlier in the day along with the results from our 1024-point in place FFT assembly test. It was written and tested Monday and Tuesday of last week (finished the night before the presentation), so it is not fully optimized, but we are getting a very good 25 double precision GFLOPs/watt. As a comparison, Intel and NVIDIA's advertised FP64 numbers are around 8 to 12 GFLOPs/watt while they are on a more advanced process node (They are on 14/16nm, while our test silicon was on 28nm).
We're in the process of getting cross platform benchmarks (HPL, HPCG, FFTW, Coremark, etc) up and running, but I'm hoping we'll be posting results shortly.
2. Since Linux 4.2, there has been a mainline port for STM32 which does not have a MMU, so porting linux is technically possible, though not on our priority list. Something like uCLinux would probably be easier, but not as useful. I have no doubt our cores would be able to handle it, it is just that our current customers already expect their software to be running on bare metal.
I didn't watch the talk in full, so forgive me if you addressed this somewhere in the Q&A: What are the differences between your architecture and the Epiphany chips? As far as I can tell they have approximately the same amount of local scratch pad memory and also a 2D mesh routing infrastructure and general overall design philosophy, one major difference seems to be the Serdes design that you have.
Edit: Ah ok, one major difference seems to be that you have quad-issue VLIW as opposed to RISC and apparently you are 10x faster than Epiphany, that is really impressive.
On your edit: RISC and VLIW are not mutually exclusive. I would say we are a RISC architecture (load/store based, everything happens on registers, fixed instruction word size, shallow pipeline) that happens to have a much simpler instruction decoder and higher instruction-level parallelism, since our compiler guarantees that the instruction bundle (the 64-bit Very Long Instruction Word containing 4 "syllables" of instructions) will only contain instructions that do not conflict with each other or with anything already in the pipeline.
Beyond the fact that we are faster (20% higher clock plus 2x higher IPC) and more efficient (Epiphany is single-precision only, and we are twice as efficient as them; we also support FP64), there are three quick points:
1. We actually have double the local scratchpad of Epiphany per core, and our SPM banking and register port scheme actually enables us to operate all four functional units within a core simultaneously, while also having data go in and out of the core on the Network on Chip. With Epiphany, you are very limited in what instructions can run together primarily due to port conflicts... the biggest difficulty is that you can't do any sort of control instructions with anything else.
2. As far as the Network on Chip goes, we are able to guarantee all latencies through a strict, deterministic static-priority routing scheme. Epiphany had 3 levels of its Network on Chip: one for stores, one for read requests (8x slower than stores), and one for off-chip communications. We have a (patent pending) way of simplifying all of this greatly while reducing latency and having greater bit efficiency.
3. Off chip memory bandwidth is extremely important to us. Even on our test chip, we have 4x higher bandwidth than the Epiphany IV, plus lower latency... our chip to chip and memory interface also uses the exact same protocol as our NoC, simplifying things even further.
There are a handful of smaller things, though my biggest gripe with Epiphany has always been the lack of bandwidth both on and especially off chip. If you are targeting DSP and similar applications like both Epiphany and we are, you really really need to have the ability to saturate your networks and match compute capabilities with it.
Our current plans are for very simple SIMD modes that reutilize the same hardware to maximize Area and power efficiency. Right now, we can separately load/store the upper half and lower half of a 64 bit register with 32 bits of data, and while we did not have the time to implement it on this silicon, the plan for the future is to have a mode switch to allow the user to use the same set of instructions/hardware/registers to do double, single, or half precision floating point operations.
The other thing we are looking forward to directly testing/comparing is unums, specifically the new "type 3" ones known as Posits, which are useful (for some definitions of useful) all the way down to 4 bits, and have a greater dynamic range plus greater precision than IEEE floats while using fewer bits and, theoretically, lower area/power on a chip.
I've had a chance to watch the video, and I have to say I'm really impressed (and really irritated that so much of the content is over my head :) hardware design is really fascinating!). I think it's pretty amazing that I got to learn about this new CPU literally a week after it went public!
I have very little exposure to networked chip designs so I'm not sure how widespread they are, but the Neo architecture reminds me somewhat of the GA144 Forth microcontroller, except "done right." For example, the GA144 lets cores communicate their own stacks (= data) to their immediate neighbors, but with the Neo you can simply access the memory on another node. That allows for the worst-case scenario of "I need to access this data and I know it'll be a bit slow" - the compiler can simply work to organize memory so that happens as little as possible. The GA144 affords no such option; the other cores need to be instructed to send the data over. Cute design, but much harder to use (even ignoring the fact that the GA144 is Forth).
I'm curious if the address space accessible on a given Neo core is "windowed", aka if the address can be remapped. 17:28 says "static routing", but I'm unsure if this affects the memory map. I have the notion that being able to remap would allow for some interesting optimizations, but would add a nontrivial amount of overhead to the chip design.
I think it's awesome you're going to be as open as you can (1:14:40) and release the simulator (1:13:27)!
After thinking about the simulator for a bit, I thought it might be an interesting idea to get it into the education sector for generic "learn RTL" type classes: the hands-down fastest execution/turnaround time makes for a very good pitch and could be an interesting way to get your foot in the door of what I'm presuming is a well-established/entrenched industry, and from there that could be a nice way to springboard into other directions.
Plus, not only does the fastest simulator mean better student engagement, there's also the nice property that everyone who trains on it will collectively groan and weep when they encounter the current "cutting edge" at their first job... best case scenario, you'll wind up with a pile of people who learned on your simulator and want to do real-world stuff at that speed. :)
Idea you've probably thought of but which I'll mention just in case: offering your proprietary optimization stack as an in-house enterprise version for $oh_hey_everyone_can_retire_now, along with a $cheaper subscription compilation-as-a-service version that accepts LLVM bitcode and spits back binaries ready for the "load" command. This would neatly solve many IP issues at once because it's very, very hard to "rip off" an optimizing black box. That said, this makes for some interesting security and trust considerations.
While thinking about all of this I was distinctly reminded of the way POWER8 has propagated. I don't read too much about it on HN, and as Generic Tinkerer #23817 I'm not really too aware of its progress. I think I can safely assume the architecture itself has a laundry list of facepalms; my hope is invested in the fact that it's fairly unencumbered compared to x86, and I hope it goes far. Thing is, it's really hard for me to play with it. I know RunAbove used to have one or two (?); I think either IBM China or a university group in China is offering confusingly restricted public access to some; and you can watch the POWER8 nodes in openSUSE's build farm. That's about it; there was also the Talos Secure Workstation, but that failed its crowdfunded seed round, which is a huge shame.
My point is that, the simulator is likely going to suffice for most people's interests and purposes, but for the few that want >1MHz (or ±300kHz on the kinds of machines that are likeliest to be widespread in generic education), well, it would be kind of cool for this not to wind up as yet another thing that only hotshots in big industries get to play with, because this thing runs at 1GHz already and looks really interesting.
I envision a possible solution as a simple, fast pipeline that accepts LLVM bitcode (or source code - that works too), compiles it, sends it to a dev board (sitting in a pool of 3 or 4), and returns the result. Besides queue waits and actual CPU runtime, I can't see the extra steps taking too long, and in ideal circumstances users could probably do several iterations per minute. 4 or 5 cards could very probably fit in a 2U enclosure, with 1U more for an x86 blade to run a web server. I have no idea if this sort of thing would be appropriate for this architecture; I'm only thinking about general exposure, educational access, etc. - I'd expect that the industry has well-established communication channels and that most of the people likely to be most interested in this architecture either know about it or will know very soon.
Regarding running Linux, I totally get the focus on bare metal DSP-type applications, but I do wonder if there are any potential performance gains to be had from a compiler and kernel (Linux or another) designed to understand software-defined memory management. I wouldn't be surprised if Linux wasn't ideal as a general-purpose OS; besides not really being great at hard realtime, I suspect an "SDMMU" would be difficult to elegantly wedge into the memory management because of architectural assumptions made by all of the code that might thwart the optimizations that could otherwise be afforded.
Finally - after hearing the bit about the CDC 6800 in 1:18:21, I thought I'd mention this just in case (might be boring/irrelevant): a few months ago I stumbled on a (real) PDP-11/70 running RSX-11M-PLUS and TCP/IP, with an open guest account that anybody can use. I mention this in case you're curious to study its I/O characteristics or whatnot (since it's a real machine). You could probably do pretty much any kind of test you wanted on it - I've found that its owners have no issues with rebooting/fixing crashes. Very happy to share details (to anybody!); my email is in my profile, I fear the machine would get trampled if I mention it publicly (on here) :)
(And shoutout to using Nedit. Haven't seen that in a while!)
This is so cool to hear. I've long been holding out hope that we won't be trapped in the current CPU paradigm forever. Thank you for the work you're doing.
Replacing software with dedicated hardware usually tends to speed things up, and I don't think it's obvious that using explicit communication would actually be faster. Most memory operations don't require any form of sharing, and having the sharing that is required happen automatically seems efficient.
Getting rid of virtual memory is potentially a big win, especially for architectures where you can't make the L1D cache virtually indexed but physically tagged. And in general there are a lot of special cases you don't even have to think about if different memory addresses can't alias to the same memory. You do lose out on a lot of software tricks there, though.
> Most memory operations don't require any form of sharing
Maybe that was the point? Most of the time it isn't needed, but you are still paying for the logic to detect when it does, and you are paying for false sharing when the automagic gets it wrong.
#2 saves a little but not much, and it precludes unsafe low-level code completely. #1 I think impairs many HLLs, certainly all the multithreaded imperative shared-memory ones (and IMO nothing comes close to these in terms of efficiency on multicore); which languages can work well without cache coherence? (I've worked in C++ on a multicore with no hw coherence, btw. Quite the cruel and unusual punishment.)
>#2 saves a little but not much and precludes unsafe low-level code completely.
Sure, in the same way most people today don't run their entire application in ring 0, but instead run on top of a kernel. Privilege levels like that are another thing that high level languages don't need. A compiler or JIT can generate low-level code just fine.
>certainly every multithreaded imperative shared memory ones (and IMO nothing is close to these in terms of efficiency on multicore)
You're begging the question here. Why is shared memory multithreading the fastest way available? It's because cache coherence is heavily optimized in hardware. But, of course, cache coherence is built on shared-nothing message-passing under the hood. So why not just use message-passing directly? Plenty of languages where that's the native primitive.
> cache coherence is built on shared-nothing message-passing under the hood
Only sometimes, and only for nearer caches. Data still often gets shared by both sources reading and writing a shared memory location (whether a farther-out cache or main memory or disk).
It is not obvious to me that a compiler or JITter can generate low level code that is (A) free of bugs (compilers are notoriously buggy, as in hundreds of real bugs found by a single fuzzer like CSmith) and I don't want those bugs to hose my filesystem as it could on DOS machines, and (B) always as efficient as lower-level code. MS for instance did interesting experiments with removing virtual memory and I'm not saying it's necessarily a bad idea, just that it saves you very little and the real-world risks are higher than the theoretical risks.
Shared-memory multithreading is fastest because explicitly communicating all of your tasks' inputs and outputs without any caching will transfer more data than using caches, where several tasks can read the same thing from the same place. Cache coherence is not only built on shared-nothing message-passing under the hood, it's built on caches. If you cache your messages/accesses/whatever somewhere, you'll have coherence problems; if you don't, you'll have efficiency problems.
Just because you have a language with a native primitive doesn't mean this language doesn't throw efficiency out the window by building things around this primitive. Say, plenty of languages have cons cells for all the wrong reasons (I think Clojure is the one language in its lineage that escaped this), and still cons cells are fundamentally inefficient no matter what you do in hardware; I elaborated on this part in TFA. Just like having a language with a primitive solving the halting problem doesn't solve the halting problem, you can't have a language with arbitrary primitives and a magic machine making the sw+hw system as efficient as any alternative.
Regarding #2, eliminating context switch overhead and system call latency by running everything in one address space would be huge for some applications.
Enormous effort has been spent to allow HPC applications to talk directly to the network interface hardware without going through the OS (for instance using the Infiniband verbs API), because system calls are way too slow. This puts a lot of burden on the hardware vendors, because hardware implementation bugs might expose security vulnerabilities that could be exploited by malicious software, and there's no OS layer in between to sanity-check the hardware commands.
If a system call is just calling a function, then that's not an issue anymore; you can just use system calls for everything, because it's fast. The compiler might even inline them. The OS can insert any sanity checks it wants and expose a safe API to the application.
Regarding unsafe code, it's true that unsafe code blocks can violate the memory safety of the whole system, but that doesn't mean you can never use them; it just means that you need to use them carefully. An unsafe code block could be treated like kernel code or a setuid root binary. In a language like Rust, ideally many of the unsafe code patterns would be migrated into the standard library behind a safe interface, and the unsafe low-level hardware interaction that the OS needs to do would be another set of unsafe code blocks. Normal applications could be forbidden from using their own unsafe blocks (unless the superuser says it's okay), but they could still call into the kernel and into the standard library. (The way I imagine this working is that the operating system would have a policy of not executing any code that wasn't compiled by a trusted compiler, with compiler options dictated by some security policy. Code that was compiled by an untrusted compiler could run inside an emulator.)
If having unsafe code blocks anywhere in the system seems wrong, consider that operating system kernels are essentially large unsafe code blocks. If we can abstract out the bits that truly need to be unsafe and write the rest as safe code, the end result may be far fewer corners for memory corruption bugs to lurk in, and an overall improvement in security and stability.
>> Regarding #2, eliminating context switch overhead and system call latency by running everything in one address space would be huge for some applications.
That sounds like basically running node.js in kernel space, which is entirely doable. I think people have put HLLs into kernel space before.
Yeah, it's similar idea. Rust makes this a bit more interesting because the memory safety doesn't come along with a massive performance penalty.
One could also go in the opposite direction and write an OS that just runs as a process inside another OS. Kind of like virtualization, but you don't need any of the specialized virtualization hardware features to do it. An advantage to doing this is that you could in theory setup the same environment on Windows, Linux, MacOS, or whatever, and be able to run one application on any of those platforms inside your process-level "guest OS" without modification. Sort of like Java, but with more OS-like features (like a shell, process and user management, a filesystem, and maybe a desktop environment), native code generation, and no GC.
> Eliminate virtual memory (replacing it with nothing)
That would require all code running on the system to be run through a security verifier. The only non-research language that has even attempted this at the system level is Java. And you can get Java bytecode smartcards that do this.
It also precludes running Windows or Linux on it, reducing your market to the embedded market only straight out of the gate.
You could use the bus (NoC) for isolation instead of the MMU. For example, take away the capability to access memory from a core, while others can still communicate with it somehow.
Re. 2, it only makes sense for specialized workloads. Doing away with it means the OS needs to keep track of who is allowed to access what memory, and that carries a far greater performance penalty.
It failed due to very poor performance. There is an excellent paper by Bob Colwell about why the performance turned out the way it did. Prior HN discussion: https://news.ycombinator.com/item?id=9447097
The i960 was a much better attempt. The baseline version had just enough smarts to improve safety and reliability while still being a fast RISC overall. It got some customers in embedded.
To quote Ken Thompson (from memory) – "Lisp is not a special enough language to warrant a hardware implementation. The PDP-10 is a great Lisp machine. The PDP-11 is a great Lisp machine. I told these guys, you're crazy."
Intel itself claimed that the async CPU performed better than the sync one (http://stackoverflow.com/a/530494), but they didn't pursue the project further because they couldn't see it being profitable at scale.
That's specialized for just one language, though. In general you can always speed things up, sometimes by quite a bit, if you're willing to make your general purpose computer somewhat less general purpose.
Some of what the Mill folks are doing with hardware assisted stack operations might fall under the category of higher level instructions but those are for C just as much as any other language.
EDIT: Oh, and Linus likes to wax eloquent about the wonders of rep movs, and I think he sort of has a point about having good facilities to call routines specific to the hardware, using instructions that aren't exposed in the public ISA. But again, while that's a high-level function in hardware, it isn't specific to a high-level language, and it's mostly about accelerating C.
In large applications a bump-pointer allocator plus generational gc is really fast (yes, stack allocation is fast too, but you can't always use it). A compacting gc avoids the need for arenas, object reuse, or other awfulness. gc enables lock-free / CASed data structures; otherwise memory ownership is too complex to implement (though there's a Rust guy doing really cool stuff [1]). And gc in a threaded program is wildly easier. Unless you liked e.g. COM-style explicit ref-counting.
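To make the bump-allocation point concrete, here's a toy arena in Rust; all it shows is that allocation is a pointer increment plus a bounds check, and it leaves out the collector (and the generational machinery) that makes the scheme viable:

    // Toy bump allocator: allocation is just "round up, bump, bounds-check".
    // A real generational GC would run a minor collection when `alloc` fails.
    struct BumpArena {
        buf: Vec<u8>,
        next: usize,
    }

    impl BumpArena {
        fn new(size: usize) -> BumpArena {
            BumpArena { buf: vec![0; size], next: 0 }
        }

        // `align` must be a power of two; returns an offset into the arena.
        fn alloc(&mut self, len: usize, align: usize) -> Option<usize> {
            let start = (self.next + align - 1) & !(align - 1);
            let end = start.checked_add(len)?;
            if end > self.buf.len() {
                return None; // nursery full
            }
            self.next = end;
            Some(start)
        }
    }

    fn main() {
        let mut arena = BumpArena::new(1024);
        let a = arena.alloc(16, 8).unwrap();
        let b = arena.alloc(32, 8).unwrap();
        println!("allocated at offsets {} and {}", a, b);
    }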
As for lisp plus large programs, the large program I worked on did end up with a bespoke (unfortunately) type-free internal language in order to orchestrate itself. Large = low millions of LOC of c++.
> I have images. I must process those images and find stuff in them. I need to write a program and control its behavior. You know, the usual edit-run-debug-swear cycle. What model do you propose to use?
Looks like you need a GPU, not a CPU. Much image processing (and also much neural network work) maps very well onto the programming models of modern GPUs. For a first prototype, buy an nVidia GPU and use CUDA. That likely won't work for embedded stuff, but if it works OK on your PC with CUDA, there's an almost 100% chance you'll be able to do it in OpenCL/DirectCompute/whatever.
I admit I'm not too well versed in hardware tech. One thing that comes to mind is using associative memory to implement objects, namespaces, etc. (e.g. http://www.vpri.org/pdf/tr2011003_abmdb.pdf ); although that general approach seems to be mentioned in the comments.
The article doesn't seem to consider the SOAR & SPARC family of RISC CPUs, they had instructions to do arithmetic on tagged integers and trap if the tags were incorrect. The feature has been dropped from 64-bit SPARC though.
Wait - Pure functional languages making lots of copies and lacking side effects is supposed to be a bad thing from a hardware perspective? As I understand it, synchronising shared memory is a massive source of complexity and stalls in modern processors. You get better performance in multithreaded code when you churn through new memory rather than mutate what you've already got, and even better in a stream processor where you know that is what is going to happen. Still, I suppose that's not a great argument for custom hardware because code that would benefit could most likely be shoehorned onto a GPU. Or maybe that's an argument that GPUs are already pretty close to being the custom hardware that we need.
> synchronising shared memory is a massive source of complexity and stalls in modern processors
Correct.
> making lots of copies is supposed to be a bad thing from a hardware perspective
Yes, because memory accesses are costly. You want your data to be packed tightly in memory instead of chasing pointers all the time. An array is more efficient than a linked list mostly due to caches.
The balance is key. You want one copy per core, but not one copy per data update.
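A tiny Rust illustration of the locality point (no timing code, just the two layouts): summing a contiguous Vec streams through cache lines, while summing the same values in a boxed linked list takes a pointer chase, and potentially a cache miss, per element:

    struct Node {
        value: u64,
        next: Option<Box<Node>>,
    }

    // Contiguous: the hardware prefetcher sees a simple streaming pattern.
    fn sum_vec(xs: &[u64]) -> u64 {
        xs.iter().sum()
    }

    // Pointer chasing: each node can live anywhere on the heap.
    fn sum_list(mut node: Option<&Node>) -> u64 {
        let mut total = 0;
        while let Some(n) = node {
            total += n.value;
            node = n.next.as_deref();
        }
        total
    }

    fn main() {
        let xs: Vec<u64> = (0..10_000).collect();

        // Build the equivalent linked list (in reverse, so the order matches).
        let mut head = None;
        for &v in xs.iter().rev() {
            head = Some(Box::new(Node { value: v, next: head }));
        }

        assert_eq!(sum_vec(&xs), sum_list(head.as_deref()));
        println!("sums match: {}", sum_vec(&xs));
    }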
I would love to know whether alternatives to floating-point, such as unums or quote notation, are worth considering for those languages which have rationals in their numeric towers.
Well, unums in particular (especially the new type 3 "posit" ones) are very interesting and practical for hardware implementation. John Gustafson, the creator of unums, gave the first talk on type 3 unums at Stanford earlier this month, and there are already multiple hardware implementation efforts (one of which I am indirectly overseeing). While many people have recently hyped up low-precision floating point for things like machine learning, posits provide greater dynamic range AND greater accuracy for very few bits (as low as four, with good ML results starting at 8 to 10 bits). On top of all that, they cost less area and power in silicon than IEEE hardware.
Not to beat this analogy to death, but the reason Alan Kay et al. are so quick to discuss alternative computing methods should be quite obvious to anyone who doesn't limit their worldview to the concepts humans are already using.
Right now most processors are ridiculously general. They take a handful (ok, a couple thousand or so) of instructions and do their best to parallelize them, both within a single core and across multiple cores. These instructions are of the "add, multiply, load, store" variety, with a few additional instructions for machine learning[1] and whatever HP wants[2].
This is it. This is the state of computing. How do bees work? Why can spiders hunt? When did crows start using tools? What makes us different than bonobos? How are all of these creatures so capable, yet so energy efficient?
We are taking a single solution, RISC/CISC architecture, and brute-forcing the hell out of it. Rather than build adaptive or purpose-built hardware, we're stuck on this concept of compile everything to x86/ARM and shrink the transistors (or try and offload parallel number crunching to the GPU).
What the author fails to realize is that computers are just fancy looping mechanisms. We use "HLLs" to compile abstract loops into instructions that run on general purpose machines. That's it.
The "apparently credible" people see the world in this light. They understand that the solution we've chosen is subpar, but the physics will make it work for some time.
A few other commenters have mentioned FPGAs. I'm not here to pitch a future on FPGAs; the die is still flat, the gates can only be reprogrammed so many times and they're generally "expensive."
I will say that we need better tools. FPGAs are a good start. Intel knows this[3]. Microsoft knows this[4].
With an FPGA you can dynamically program the exact logic a given operation will need. Whether it's real-time signal analysis, AI-built logic, or memcached, your logic will run exactly as specified.
Using purpose-built logic to run functions "natively" will drastically improve the efficiency of computation; both in time and energy.
It's really hard to build a horse that will fly to the moon. It's a lot easier to build a spaceship that can carry a horse to the moon.
If you want a "high-level" computer that can easily run high-level languages you can try building a stack machine. Check out the old LISP machines from the eairly 80s.
They were high-level and fast for the time, with some of the most interesting compiler design work.
Some other possible crazy ideas:
* Write instructions for forcing a read/write of memory into L1 cache
Allow me to tell the CPU to keep a chunk of memory in L1 for the lifetime of my program. I'll save you the transistors for figuring this out on the chip, and it'll be easy to implement in a compiler. This is stuff that was done back in the NES and Game Boy days (although by hand) with hi and lowmem.
* Put an easily interfaced FPGA on the die. Get rid of the stupid "one-size-fits-all" vectorization hardware we have today. We can write our own vectorization ourselves at the compiler level. Just make a big enough FPGA and we'll do all the math as fast as we want, with whatever parallelization we want.
This removes a very job-specific block of transistors and lets it be used for pretty much any problem that's complex enough to warrant attempting vectorization. One use of a combination of these ideas is an image filtering algorithm: a program that loops through the rows of my image, runs the "L1 cache" instruction from earlier, then passes the data to my FPGA code, which applies some complex filter, then writes it back to memory and continues. With traditional systems you'd be limited to 128-bit-wide vector registers, but I could presumably configure my FPGA segment to read the entire block of cache I want, and write back to it, all as fast as possible.
This is an extremely hard thing to build but on-die FPGAs would change the face of computing. When they get big enough, who needs GPUs?
* Make core-shared fully atomic (large) registers for extremely fast but crude IPC
This is how most high-level languages operate, so I'd imagine it could definitely be made use of (a rough software analogue using today's atomics is sketched after this list).
* Add core-to-core MPI via interrupts, and allow it to be configurably "synchronous" (page the program out until the interrupt is handled)
This enables calling a function on an object that runs in parallel with "this" object, with the result coming back when that function returns.
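For the shared-register idea, the closest software analogue today is an atomic word shared between threads; a hypothetical core-shared hardware register would give the same programming model without the cache-coherence traffic. A rough Rust sketch of the model:

    // Rough software analogue of the "core-shared atomic register" idea, using
    // an ordinary atomic word shared between two threads. This only shows the
    // programming model, not the proposed hardware.
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;

    fn main() {
        // 0 means "empty"; any other value is a (very crude) message.
        let mailbox = Arc::new(AtomicU64::new(0));

        let producer = {
            let mailbox = Arc::clone(&mailbox);
            thread::spawn(move || {
                for msg in 1..=5u64 {
                    // Wait for the consumer to drain the previous message.
                    while mailbox.load(Ordering::Acquire) != 0 {
                        std::hint::spin_loop();
                    }
                    mailbox.store(msg, Ordering::Release);
                }
            })
        };

        let mut received = 0;
        while received < 5 {
            let msg = mailbox.swap(0, Ordering::AcqRel);
            if msg != 0 {
                println!("got message {}", msg);
                received += 1;
            } else {
                std::hint::spin_loop();
            }
        }
        producer.join().unwrap();
    }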
These are really crazy ideas. If I had infinite money, time, and resources I probably still couldn't pull something this "out there" off. I think the on-die FPGA is possible to build but very hard to get right. Most features on a modern processor that take up space are a combination of functions that could all be done on an FPGA rather than wasting transistors on each specific function.
In a server do I really need 10 million transistors on my CPU for H264 decoding? Replace that with cache and a live-reprogrammable FPGA and we can do that all in software when and if we need it.
> keep a chunk of memory in L1 for the lifetime of my program
That is essentially what a scratchpad does. However, context switching is an issue, because the OS now has to maintain an even larger chunk of data in addition to the registers.
That should be handled at the OS level, and the processor should provide facilities to directly store/page a program into long-term memory. The more the hardware can facilitate the lie that our program lives "forever" in RAM, the better. I have a feeling we could also just add some memory chips to the board to act as a super-fast in-hardware swap file, a staging area for things on their way between memory and disk.
The L1 cache should be configurable for the entire existence of a program, and I'm pretty sure we'd see a lot less paging and far fewer page faults then.
How far away from your function is $some_global_variable? It doesn't matter if we can force the machine to always keep some_global_variable in cache.
But in short, it would definitely end up being like you say, people screwing over other people, if there weren't other facilities in place to keep it safe and fast. It's a job for someone smarter than I am.
The MIT Lisp Machine was basically a RISC architecture with an extra instruction to do multi-way switch, it ran an interpreter for the higher-level Lisp instruction set on top of this.
The claim that static types (in regards to Haskell) make things low level strikes me as wrong. It's probably a difference in the interpretation of the word "type", which is unfortunately used to describe many different things.
From a C perspective, the word "type" pretty much means "memory layout", e.g. the difference between "char" and "int" is that the latter (may) use more memory than the former; the difference between "int" and "float" is that the bits are interpreted in a different way; a struct describes how to lay out a chunk of memory; and so on. Static checks can be layered on top of this, but it mostly boils down to 'not misinterpreting the bits'. I've seen this concept distinguished by the phrase "value types".
In Haskell land, types have no particular relationship to memory size/layout; e.g. we don't really care what bit pattern will be used to represent something of type "Functor f => (a -> b) -> (forall c. c -> a) -> f b". Unlike C, there's no underlying assumption that "it's all just bits" which we must be careful to interpret consistently; instead, it's all left abstract, grammatical and polymorphic, leaving it up to the compiler to map such concepts to physical hardware however it likes.
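To make the contrast concrete in one language (Rust happens to be able to express both views): the `#[repr(C)]` struct below is a type in the "memory layout" sense, while the generic function's type says nothing about bits at all:

    use std::mem::size_of;

    // The C-ish view: the type pins down size, alignment, and field order.
    #[allow(dead_code)]
    #[repr(C)]
    struct Pixel {
        r: u8,
        g: u8,
        b: u8,
        a: u8,
    }

    // The abstract view: any `T`, any `f`; no representation is implied, and
    // the compiler may monomorphize, inline, or box however it likes.
    fn apply_twice<T, F: Fn(T) -> T>(f: F, x: T) -> T {
        f(f(x))
    }

    fn main() {
        assert_eq!(size_of::<Pixel>(), 4);
        println!("{}", apply_twice(|n: u32| n + 1, 40));
    }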
I certainly think it's a mistake to think in terms of "high level == dynamic types"; there's the obligatory https://existentialtype.wordpress.com/2011/03/19/dynamic-lan... and I'd also consider Homotopy Type Theory to be a very high-level language. It's also a rather constraining simplification too, as it ignores the many other dimensions of a language.
Haskell's garbage collection is an obvious abstraction over memory which has nothing much to do with static/dynamic types (linear/affine/uniqueness types are very related, but are yet another overloading of "type" ;) ). As another example, I would count Prolog as more high-level than (say) Python since it abstracts over "low level" details like control flow; again, nothing to do with their types. Likewise, a message-passing language, operating transparently over a network would be more high-level than a language which communicates by (say) opening sockets on particular ports of particular IP addresses and serialising/deserialising data across the link when instructed to by the programmer; we'd be abstracting over the ideas of physical machines, locations and networks.
Calling Haskell low level because of its types ignores such other dimensions; in the context of hardware design, machine code, von Neumann architecture, etc. I'd say that abstracting over control-flow with non-strict evaluation is enough to make Haskell high-level.