
The “high-level CPU” challenge (2008) - panic
http://yosefk.com/blog/the-high-level-cpu-challenge.html
======
_yosefk
Author here. I think today that apart from making my case in a bit of an
obnoxious tone, I also somewhat overstated it: while it's true that many
"high-level" constructs do have a cost that will not magically go away due to
any logic built into hardware, at least not fully, it also ought to be true
that a lot can be done in hardware to make software's life easier given a
particular HLL programming model, and I'm hardly an expert on this. My true
interests are in accelerator development, starting at the GPU and moving
further away from the CPU, which means lower level and gnarlier than C in
terms of programming model.

I will however say that the Reduceron and in general the idea of doing FP in
hardware in the most direct way are a terrible waste of resources and I'm
pretty sure it loses to a good compiler targeting a von Neumann machine on
overall efficiency.

The way to go is not to make a hardware interpreter; that is no better than a
processor with a for-loop instruction added to better support C. The trick is
to carefully partition software and hardware responsibilities, as in the model
to which C+Unix/RISC+MMU converged.

~~~
BuuQu9hu
Actor-based dynamic language author here. (Doesn't matter which one; I think I
speak for all of us.) Thank you for being honest with us; we are not a very
performance-oriented group sometimes.

We're generally in favor of things which accelerate message passing between
shared-nothing concurrent actors. Hardware mailboxes or transactional memory
are nifty. OS-supported message queues are nifty; can those be lowered to
hardware in a useful way?

~~~
meredydd
Well, I never thought I'd be plugging my PhD research here, but:

"Asynchronous Remote Stores for Inter-Core Communication"
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.592...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.592.1178&rep=rep1&type=pdf)

To my knowledge, this is still the only hardware-assisted message passing
scheme that is virtualisable (ie compatible with a "real" OS like Linux).

Hardware mailboxes are great, but time-sharing OSs can't deal with finite
hardware resources that can't be swapped out easily. Software-based queues die
a fiery death thanks to cache coherency - reading something that another core
just wrote will block you for hundreds of cycles.

~~~
_yosefk
Virtualizable hardware-assisted message passing is awesome. (MIPS for instance
had a big fat ISA extension for hardware-assisted message passing and cheap
hardware multithreading, which Linux couldn't use and which they then threw
out the window exactly when they introduced hardware virtualization of the
entire set of processor resources.)

As to software-based queues dying a fiery death - in what scenarios? As I said
in a sister comment, I (think that I) know that things work out in
computational parallelism scenarios where many tasks are mapped onto a thread
pool, TBB-style, that is, I don't think the hardware overhead is ridiculously
large in these systems. Where do things go badly? 100K lightweight threads
communicating via channels, Go-style?

~~~
meredydd
Whoa - MIPS virtualised their message passing hardware? How?!

Software-based queues die a fiery death when the latency of a send/receive is
critical, because you end up stalling on a really slow cache-coherence
operation. So, for example, anything like a cross-thread RPC call takes ages
(you wait for the transmission and wait for a response, so it's much slower
than a function call and often a system call - the Barrelfish research OS
suffers a bunch from this). There are also algorithms you just can't
parallelise because you can't split them into large chunks, and if you split
them into small chunks the cost of communicating so frequently destroys your
performance. (Eg there was a brave attempt to parallelise the inner loop of
bzip2 - which resists coarse parallelisation thanks to loop-carried
dependencies - this way).

Software based queues perform just fine on throughput, though - if you're
asynchronous enough to let a few messages build up in your message queue,
you'll only pay the latency penalty once per cache line when you drain it (and
with a good prefetcher, even less than that).
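
To make that concrete, here is a minimal single-producer/single-consumer queue
sketch in C++ (the textbook shape, not anything from the paper above). The
shared indices and the freshly written slots are where the coherence misses
land; draining several entries per wakeup amortises them to roughly one miss
per cache line of payload.

    // Minimal SPSC ring buffer (illustrative sketch only).
    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>
    class SpscQueue {
        std::array<T, N> slots_{};
        // Indices live on separate cache lines so producer and consumer do
        // not false-share; the slots themselves still bounce between cores.
        alignas(64) std::atomic<std::size_t> head_{0};  // advanced by consumer
        alignas(64) std::atomic<std::size_t> tail_{0};  // advanced by producer

    public:
        bool push(const T& v) {  // producer side
            auto t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
            slots_[t % N] = v;
            tail_.store(t + 1, std::memory_order_release);  // publish
            return true;
        }

        std::optional<T> pop() {  // consumer side
            auto h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
            T v = slots_[h % N];  // often a coherence miss on a freshly written line
            head_.store(h + 1, std::memory_order_release);
            return v;
        }
    };

If the consumer calls pop() the instant each message lands, every call eats a
cross-core transfer; if it lets a few messages queue up and drains them in a
burst, most of the reads hit lines that are already local.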

The examples you cite are actually both instances of software cunningly
working within the limits of slow inter-core communication. Work-queue
algorithms typically keep local, per-core queues and only rebalance tasks
between cores ("work stealing") infrequently, so as to offset how expensive
that operation is. Lightweight threads with blocking messages (like Go or
Occam or some microkernels) work by turning most message sends into context
switches within one core - when you send a message on a Go channel, you can
just jump right into the code that receives it. Again, they can then rebalance
infrequently. (For an extra bonus, by making it easy to create 100k "threads",
they hope to engage in latency-hiding for individual threads - and once you're
in "throughput" mode it's all gravy).

~~~
_yosefk
> Whoa - MIPS virtualised their message passing hardware? How?!

No, I meant to say that they simply obsoleted that part of their architecture
when they added virtualization, because they couldn't virtualize it.

> Eg there was a brave attempt to parallelise the inner loop of bzip2 - which
> resists coarse parallelisation thanks to loop-carried dependencies - this
> way.

So you say you can do hardware-assisted message passing that can be
virtualized and can speed up bzip2 by parallelizing? How few instructions per
RPC call does it take for you to still be efficient vs today's software-based
messaging? (This is getting fairly interesting and it should be particularly
interesting to serious CPU vendors.)

~~~
meredydd
This is getting deep in an ageing thread - do you want to take this to email?
(It's in my profile)

Pipelined bzip2 wasn't in the evaluation for my research, but I bet remote
stores would get considerably better results than software queues.
Parallelising one algorithm is something of a stunt, and gets you just a
single data point. Instead, I did a bunch of different benchmarks
(microbenchmarks for FIFO queues and synchronisation barriers; larger
benchmarks including a parallel profiler and a variable-granularity MapReduce
to measure how far remote stores could move the break-even point for
communication vs computation; and an oddball parallel STM system that I'd
previously demonstrated on dedicated (FPGA) hardware). I got around an order
of magnitude on all of them (some a little less, some much more).

The writeup starts on page 59 of [https://www.cl.cam.ac.uk/techreports/UCAM-
CL-TR-831.pdf](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-831.pdf) and
the evaluation on page 65.

Looking back, I seriously regret not taking more time to sit down and write it
up more clearly, because I do think this should be interesting to serious CPU
vendors. However, by that point I had reached the point of "I'm fed up with
this PhD; I'm going home now". As I knew I didn't want to stay in academia, I
published in a mediocre venue rather than revising for a better one, and went
off to Silicon Valley instead. Your comments have made me re-read my old work,
and it's painful to wonder how much further it could have gone if I had
explained it better.

------
Animats
The history of "higher level" instructions isn't good. The DEC VAX had an
assembly language intended to make life easier for assembly programmers, but
it slowed the machine down. The Intel iAPX 432 had lots of bells and whistles,
but was really slow. On RISC machines, having lots of registers turned out not
to be all that useful; too much register saving and restoring was required.
RISC is a win until you want to go superscalar and have more than
one instruction per clock. Then it's a lose. Stack machines that run some RPN
form like Forth or Java code have been built, but don't go superscalar well.

A useful near-term feature would be zero-cost hardware exceptions on integer
overflow. This is an error in both Java and Rust, and tends to be turned off
at compile time because it has a performance penalty. The problem is that
people will want to be able to unwind and recover, which means exact
exceptions and a lot of compiler support for them.

If you could figure out how to do zero-cost subscript checking, that would be
a step forward. That check needs additional info about bounds, which usually
means a delay.
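
For scale, here is roughly what those two checks cost in software today: a
hedged C++ sketch (the function names are just illustrative) using the
GCC/Clang overflow builtin plus an explicit subscript test, each of which adds
the compare-and-branch that a zero-cost hardware trap would remove.

    #include <cstdint>
    #include <cstdio>
    #include <stdexcept>
    #include <vector>

    // Software checked add: the extra test-and-branch per operation is the
    // overhead a hardware overflow exception would make free.
    int64_t add_checked(int64_t a, int64_t b) {
        int64_t out;
        if (__builtin_add_overflow(a, b, &out))  // GCC/Clang builtin
            throw std::overflow_error("integer overflow");
        return out;
    }

    // Software subscript check: needs the bounds to hand and a branch per access.
    int64_t sum_first(const std::vector<int64_t>& v, std::size_t n) {
        int64_t total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i >= v.size())  // the bounds check
                throw std::out_of_range("index out of bounds");
            total = add_checked(total, v[i]);
        }
        return total;
    }

    int main() {
        std::vector<int64_t> v{1, 2, 3};
        std::printf("%lld\n", static_cast<long long>(sum_first(v, 3)));
    }

Compilers can hoist or eliminate many of these checks, but the residual
branches, and the exact-exception semantics they imply, are a big part of why
the checks so often get switched off in release builds.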

I used to be a fan of schemes for safely calling from one address space to
another. i386 almost has this, with call gates, which don't quite do enough to
be useful. A few machines have had hardware context switching, but that hasn't
been a big performance improvement. All that has to be tightly integrated with
the OS or it's a lose. It's an enhancement to Plan 9, not anything anybody
uses.

The same is true of fancy schemes for inter-CPU communication, but that
probably needs more attention. Like it or not, we have to figure out what to
do with large numbers of non-shared-memory CPUs. Some way to set up memory-
safe message passing between non-shared-memory CPUs without involving the OS
after setup would be useful.

An IOMMU that allows drivers in user space with minimal performance
degradation is a good thing. Those exist.

~~~
i336_
> _Stack machines that run some RPN form like Forth or Java code have been
> built, but don 't go superscalar well._

I've been interested in Forth (and related stack) processors for a while, and
my armchair observations over a few months suggest that the (much-vaunted)
performance gains associated with such processor designs are apparently not
straightforward to pin down, relate to, or take advantage of.

I remember (unfortunately not sure where right now) reading how the GA144 was
built at a time when 18-bit memory was the current trending novelty and that
it's not really a perfect processor design. I'm still fascinated by it though
(sore lack of on-chip memory notwithstanding).

What sort of scale are you referring to when you say "superscalar"? 144
processors? 1000? Do stack-based architectures remain a not-especially-
practical-or-competitive novelty, or are they worth pursuing outside of
CompSci?

(FWIW, everything else you've written is equally interesting, but slightly
over my current experience level)

~~~
Animats
Pure Forth machines were interesting when the CPU clock and the memory ran at
about the same speed, and the number of gates was very limited. Chuck Moore's
original Forth machine had a main memory, a data stack memory, and a return
stack memory, each of which was cycled on each clock. It took only about 4K
gates to make a CPU that way.

Today, the speed ratio between CPU and main memory is several orders of
magnitude. The main problem in CPU design today is dealing with that huge
mismatch. That's why there are so many levels of cache now, and vast amounts
of cleverness and complexity to try to make memory accesses fewer, but wider.

The next step is massive parallelism. GPUs are today's best examples.
Dedicated deep learning chips should be available soon, if they're not
already. That problem can be implemented as a small inner loop made massively
parallel.

~~~
qznc
> the speed ratio between CPU and main memory is several orders of magnitude

What if you compare scratchpad SRAM and an energy-efficient CPU?

~~~
pjc50
Compare in what way? e.g. the L1 cache is SRAM running at core speed with low
latency (4 cycles for Haswell).

~~~
qznc
A factor of 4 instead of orders of magnitude means a Forth machine might still
be worthwhile. No?

~~~
pjc50
Where's the Forth machine getting its operands from?

Sure, if you constrain your programs to use a tiny area of memory you might be
able to achieve theoretical speed, but what workloads can you achieve with
that?

If you were to write a browser in Forth, presumably it would still have to
store all its DOM in DRAM?

~~~
qznc
You probably want to parallelize the browser for energy efficiency. Then you
could distribute the DOM across the scratchpads of hundreds of cores, maybe?

~~~
pjc50
So each core has its own JS engine, or when you iterate across the DOM with a
selector you have to query across all the nodes? This doesn't sound great.

(The "pile of cores with scratchpads" exists e.g. Tilera and Netronome, and
they're a right pain to program for)

------
piinbinary
Some assorted ideas:

* Expose the data dependency graph directly to the processor, rather than forcing the processor to infer it from the instructions.

* Annotate when data is read-only, to reduce communication between cores (for the sake of avoiding latency, not for bandwidth savings).

* Add a mechanism for much cheaper (if more limited) parallelism where a single core would work on multiple, related thunks in parallel, even if those units of work would be far too small to be worth coordinating with another thread to offload. This would likely be largely implicit, taking advantage of the first feature.

* Instructions for graph traversal. The CPU could (for some uses) order the traversal in a way that improves cache locality, based on how the graph is actually laid out in memory (and prefetch uncached nodes while working on cached ones).

* Something like map & reduce, where you can apply a small (pure) function to a list of data. Again, this would likely be done in parallel (a rough software analogue is sketched after this list).
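
For the last point, today's nearest software analogue is a parallel
transform-reduce; a small C++17 sketch of the shape (not a proposal for the
hardware instruction itself), applying a pure function across a list and
folding the results:

    #include <cstdio>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> xs(1'000'000, 1.5);

        // "map" a small pure function over the data, then "reduce" the results;
        // the parallel execution policy lets the library spread the work across
        // cores, and the purity of the lambda is what makes that safe.
        double sum_of_squares = std::transform_reduce(
            std::execution::par_unseq,
            xs.begin(), xs.end(),
            0.0,
            std::plus<>{},                    // reduce step
            [](double x) { return x * x; });  // map step (pure)

        std::printf("%f\n", sum_of_squares);
    }

The open question in the proposal above is whether exposing this shape
directly to the CPU would beat compiling it down to ordinary loads, stores and
SIMD, as happens today.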

~~~
qznc
> data dependency graph directly to the processor

What format would you propose? How is it different to instructions?

> Annotate when data is read-only

What overhead would you save? On the cache coherence protocol level?

> a single core would work on multiple, related thunks in parallel

VLIW
[https://en.wikipedia.org/wiki/Very_long_instruction_word](https://en.wikipedia.org/wiki/Very_long_instruction_word)

> instructions for graph traversal

Your Intel CPU can already do prefetching. Others have special instructions
for it. Hyperthreading does the "work on cached stuff first" part.

> like map & reduce

Use the GPU.

~~~
yvdriess
Answering with my own take on the subject.

> What format would you propose? How is it different to instructions?

It would still be instructions, but one option is to rework what their
operands are and introduce a destination (instruction address + operand port).
Another option that could maybe work is a second, mirroring instruction stream
that contains only the dataflow/dependencies.

> What overhead would you save? On the cache coherence protocol level?

The current strategy is write-invalidate, I believe. In some heavy-contention
situations (e.g. lots of spin-loops on one variable), using a 'write-once'
instruction would reduce traffic. Thinking more radically: with an overhaul of
the entire virtual-memory system, a write-once instruction becomes potentially
very interesting (Jack Dennis had an interesting paper on the subject:
[http://www.cs.ucy.ac.cy/dfmworkshop/wp-content/uploads/2014/...](http://www.cs.ucy.ac.cy/dfmworkshop/wp-content/uploads/2014/08/DFM2014-9-On-the-Feasibility-of-a-Codelet-Based-Multi-core-Operating-System.pdf)).

> VLIW

Only if you mean something like Mill
([https://en.wikipedia.org/wiki/Mill_architecture](https://en.wikipedia.org/wiki/Mill_architecture)),
instead of Itanium.

------
catern
1\. Eliminate cache coherency protocols (replacing it with cache
manipulation/inter-CPU communication instructions)

2\. Eliminate virtual memory (replacing it with nothing)

I'm not a CPU designer, but my understanding is that removing features allows
for a denser/faster CPU. Well, these are two features that a suitably high-
level language has no need for, because a high-level language doesn't expose
"memory" to the programmer.

Edit: Though, I 100% agree with what I believe is the core point of the
author. We should not implement high-level features in hardware. In fact we
should implement as little as possible in hardware, moving as much as possible
into software. If Intel would let third parties generate microcode for their
CPUs, we could move a lot further in that direction...

~~~
trsohmers
Boy, do I have a new processor architecture for you. For the past 3 years my
company (REX Computing) has been working on a new architecture that not only
covers both of your points, but removes hardware managed caching entirely and
replaces it with software managed (physically addressed) scratchpad memory. By
removing all of the unnecessary logic (MMU, TLB, everything required for
virtual addressing, prefetching, coherency, etc) associated with the onchip
memory system, we can fit more memory on the chip itself, have lower latency
to that memory, and use a lot less power. As we fully expose the memory system
to our software tools, we can make very good decisions at compile time that
replicate most of the features you get in having a hardware managed caching
system.
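
As a generic illustration of what "decisions at compile time" can look like
with a software-managed scratchpad, here is a hedged sketch of classic
double-buffering. The dma_start/dma_wait calls and the TILE size are
hypothetical stand-ins, not REX's actual toolchain or API: the point is that
the transfer of the next tile overlaps with computation on the current one,
which is the work a hardware cache and prefetcher would otherwise do
implicitly.

    #include <cstddef>
    #include <utility>

    // Hypothetical scratchpad/DMA interface -- stand-ins only.
    constexpr std::size_t TILE = 1024;  // words per scratchpad tile
    extern "C" void dma_start(float* dst, const float* src, std::size_t words);
    extern "C" void dma_wait();  // wait for the last dma_start to complete

    // Sum a large DRAM-resident array through two scratchpad tiles,
    // overlapping the DMA of tile i+1 with computation on tile i.
    // (Assumes n is a non-zero multiple of TILE to keep the sketch short.)
    float sum_array(const float* dram, std::size_t n,
                    float* scratch0, float* scratch1) {
        float* cur = scratch0;
        float* nxt = scratch1;
        float total = 0.0f;

        dma_start(cur, dram, TILE);  // fetch the first tile
        for (std::size_t base = 0; base < n; base += TILE) {
            dma_wait();  // current tile is now resident in scratchpad
            if (base + TILE < n)
                dma_start(nxt, dram + base + TILE, TILE);  // overlap next fetch

            for (std::size_t i = 0; i < TILE; ++i)
                total += cur[i];  // compute entirely out of scratchpad

            std::swap(cur, nxt);
        }
        return total;
    }

Whether a compiler can derive schedules like this automatically for messier
access patterns is exactly the bet such architectures make.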

We have silicon back, and are now working with early customers, showing a 10x
to 25x energy efficiency improvement for high performance workloads.

I gave a talk at Stanford last week that covers the hardware architecture and
shows off our development hardware:
[https://www.youtube.com/watch?v=ki6jVXZM2XU](https://www.youtube.com/watch?v=ki6jVXZM2XU)

~~~
i336_
Wow. (Not op)

Definitely watching the video later when I have time. Thoughts from my initial
impression:

1\. Please put benchmarks up on HN when you can :)

2\. Would be awesome if you could get this into the hands of low level kernel
developers who can tinker with this and see what Linux is like on it - if it's
capable of doing that much? (Need to watch the video!)

3\. Conservatism FTW (IMO) when thinking about offers [that come in after
you've done (1) :P]

~~~
trsohmers
1: Around the 30:40 mark in the video I transition to the live hardware demo,
where I learned to never do live demos. While it did not work there in person,
I do show pictures from it working earlier in the day along with the results
from our 1024-point in place FFT assembly test. It was written and tested
Monday and Tuesday of last week (finished the night before the presentation),
so it is not fully optimized, but we are getting a very good 25 double
precision GFLOPs/watt. As a comparison, Intel and NVIDIA's advertised FP64
numbers are around 8 to 12 GFLOPs/watt while they are on a more advanced
process node (They are on 14/16nm, while our test silicon was on 28nm).

We're in the process of getting cross platform benchmarks (HPL, HPCG, FFTW,
Coremark, etc) up and running, but I'm hoping we'll be posting results
shortly.

2\. Since Linux 4.2, there has been a mainline port for STM32, which does not
have an MMU, so porting Linux is technically possible, though not on our
priority list. Something like uClinux would probably be easier, but not as
useful. I have no doubt our cores would be able to handle it, it is just that
our current customers already expect their software to be running on bare
metal.

~~~
orbifold
I didn't watch the talk in full, so forgive me if you addressed this somewhere
in the Q&A: What are the differences between your architecture and the
Epiphany chips? As far as I can tell they have approximately the same amount
of local scratch pad memory and also a 2D mesh routing infrastructure and
general overall design philosophy, one major difference seems to be the Serdes
design that you have.

Edit: Ah ok, one major difference seems to be that you have quad-issue VLIW as
opposed to RISC and apparently you are 10x faster than Epiphany, that is
_really_ impressive.

~~~
trsohmers
On your edit: RISC and VLIW are not mutually exclusive. I would say we are a
RISC architecture (Load/store based/everything happens on registers, fixed
instruction word size, shallow pipeline) that happens to have a much simpler
instruction decoder & higher instruction level parallelism since our compiler
guarantees that the instruction bundle (the 64 bit Very Long Instruction Word
containing 4 "syllables" of instructions itself) will only give instructions
that will not conflict with anything it is given with or in the pipeline.

Other than the fact that we are faster (20% higher clock plus 2x higher IPC)
and more efficient (Epiphany is single precision only, and even there we are
twice as efficient; we also support FP64), there are three quick points:

1\. We actually have double the local scratchpad of Epiphany per core, and our
SPM banking and register port scheme actually enables us to operate all four
functional units within a core simultaneously, while also having data go in
and out of the core on the Network on Chip. With Epiphany, you are very
limited in what instructions can run together primarily due to port
conflicts... the biggest difficulty is that you can't do any sort of control
instructions with anything else.

2\. As far as the Network on Chip goes, we are able to guarantee all latencies
through a strict deterministic static priority routing scheme. Epiphany had 3
levels of its Network on Chip, one for stores, one for read requests (8x
slower than stores), and one for off chip communications. We have a (patent
pending) way of simplifying all of this greatly while reducing latency and
having greater bit efficiency.

3\. Off chip memory bandwidth is _extremely_ important to us. Even on our
_test_ chip, we have 4x higher bandwidth than the Epiphany IV, plus lower
latency... our chip to chip and memory interface also uses the exact same
protocol as our NoC, simplifying things even further.

There are a handful of smaller things, though my biggest gripe with Epiphany
has always been the lack of bandwidth both on and especially off chip. If you
are targeting DSP and similar applications like both Epiphany and we are, you
really _really_ need to have the ability to saturate your networks and match
compute capabilities with it.

~~~
crest
How would your Neo ISA compare to the proposed RISC V Cray style vector
extension for FFTs?

~~~
trsohmers
Our current plans are for very simple SIMD modes that reutilize the same
hardware to maximize area and power efficiency. Right now, we can separately
load/store the upper half and lower half of a 64 bit register with 32 bits of
data, and while we did not have the time to implement it on this silicon, the
plan for the future is to have a mode switch to allow the user to use the same
set of instructions/hardware/registers to do double, single, or half precision
floating point operations.

The other thing we are looking forward to directly testing/comparing is unums,
specifically the new "type 3" ones known as Posits, which are useful (for some
definitions of useful) all the way down to 4 bits, and have a greater dynamic
range plus greater precision than IEEE floats while using fewer bits and
theoretically lower area/power on a chip.

------
rogerbinns
Intel did try to introduce a high level CPU in 1981:
[https://en.wikipedia.org/wiki/Intel_iAPX_432](https://en.wikipedia.org/wiki/Intel_iAPX_432)

It failed due to very poor performance. There is an excellent paper by Bob
Colwell about why the performance turned out the way it did. Prior HN
discussion:
[https://news.ycombinator.com/item?id=9447097](https://news.ycombinator.com/item?id=9447097)

~~~
nickpsecurity
The i960 was a much better attempt. Baseline version had just enough smarts to
improve safety or reliability while still overall a fast RISC. Got some
customers in embedded.

------
jeffsco
I loved this (paraphrased) quote:

To quote Ken Thompson (from memory) – "Lisp is not a special enough language
to warrant a hardware implementation. The PDP-10 is a great Lisp machine. The
PDP-11 is a great Lisp machine. I told these guys, you're crazy."

------
CapacitorSet
This is a good time to mention asynchronous architectures:
[https://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchron...](https://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchronous_CPU)

Intel itself [claimed](http://stackoverflow.com/a/530494) that the async CPU
performs better than the sync one, but they didn't pursue the project further
for lack of large-scale profitability.

------
Symmetry
Well, the Reduceron seems to count as an example. I'm not sure I'm convinced
about its performance, though.

[https://www.cs.york.ac.uk/fp/reduceron/](https://www.cs.york.ac.uk/fp/reduceron/)

That's specialized for just one language, though. In general you can always
speed things up, sometimes by quite a bit, if you're willing to make your
general purpose computer somewhat less general purpose.

Some of what the Mill folks are doing with hardware assisted stack operations
might fall under the category of higher level instructions but those are for C
just as much as any other language.

[https://millcomputing.com/](https://millcomputing.com/)

EDIT: Oh, and Linus likes to wax eloquent about the wonders of rep movs and I
think he sort of has a point about having good facilities to call routines
specific to the hardware, using instructions specific to the hardware that
aren't exposed in the public ISA. But again, while that's a high-level
function in hardware, it isn't specific to a high-level language and it's
mostly about accelerating C.

------
x0x0
I think yosef misinterpreted JWZ.

In large applications a bump pointer allocator plus generational gc is really
fast (yes, stack allocation is fast too, but you can't always do it). A
compacting gc avoids the need for arenas, object reuse, and other awfulness. gc enables
lock-free / CASed data structures; otherwise memory ownership is too complex
to implement (though there's a Rust guy doing really cool stuff [1]). And gc
in a threaded program is wildly easier. Unless you liked eg COM-style explicit
ref-counting.
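
Bump allocation really is just a bounds check plus a pointer increment, which
is why (together with a compacting collector that keeps resetting it) it beats
a free-list malloc on the fast path. A toy sketch of the idea, not any
particular runtime's implementation:

    #include <cstddef>
    #include <vector>

    // Toy nursery for a bump-pointer allocator.  Allocation is a bounds check
    // plus a pointer bump; when the nursery fills, a real runtime runs a minor
    // GC that evacuates survivors and then resets `top_` back to the start.
    class Nursery {
        std::vector<std::byte> heap_;
        std::byte* top_;
        std::byte* end_;

    public:
        explicit Nursery(std::size_t bytes)
            : heap_(bytes), top_(heap_.data()), end_(heap_.data() + bytes) {}

        void* allocate(std::size_t bytes) {
            bytes = (bytes + 15) & ~std::size_t{15};  // keep 16-byte alignment
            if (end_ - top_ < static_cast<std::ptrdiff_t>(bytes))
                return nullptr;                       // full: time for a minor GC
            void* obj = top_;
            top_ += bytes;                            // the entire fast path
            return obj;
        }

        void reset() { top_ = heap_.data(); }         // what a collection ends with
    };

Compare that with a general-purpose malloc, which has to find a suitably sized
hole in a free list and later cope with the fragmentation it leaves behind.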

As for lisp plus large programs, the large program I worked on did end up with
a bespoke (unfortunately) type-free internal language in order to orchestrate
itself. Large = low millions of LOC of C++.

[1]
[https://aturon.github.io/blog/2015/08/27/epoch/](https://aturon.github.io/blog/2015/08/27/epoch/)

------
greenyoda
The comments following this article (which span a period from 2008 to 2015!)
are also very interesting.

~~~
kevinwang
And interesting comments from 2008 and 2015 here:
[https://hn.algolia.com/?query=The%20“high-
level%20CPU”%20cha...](https://hn.algolia.com/?query=The%20“high-
level%20CPU”%20challenge&sort=byDate&dateRange=all&type=story&storyText=false&prefix&page=0)

------
Const-me
Great article.

> I have images. I must process those images and find stuff in them. I need to
> write a program and control its behavior. You know, the usual edit-run-
> debug-swear cycle. What model do you propose to use?

Looks like you need a GPU, not a CPU. Much image processing stuff (and also
much neural network stuff) is very suitable for the programming models of
modern GPUs. For a first prototype, buy an nVidia GPU and use CUDA. That
likely won't work for embedded stuff, but if it works OK on your PC with CUDA,
there's an almost 100% chance you'll be able to do it in
OpenCL/DirectCompute/whatever.

------
chriswarbo
I admit I'm not too well versed in hardware tech. One thing that comes to mind
is using associative memory to implement objects, namespaces, etc. (e.g.
[http://www.vpri.org/pdf/tr2011003_abmdb.pdf](http://www.vpri.org/pdf/tr2011003_abmdb.pdf)
); although that general approach seems to be mentioned in the comments.

------
rjsw
The article doesn't seem to consider the SOAR & SPARC family of RISC CPUs,
which had instructions to do arithmetic on tagged integers and trap if the
tags were incorrect. The feature has been dropped from 64-bit SPARC, though.

------
razorunreal
Wait - Pure functional languages making lots of copies and lacking side
effects is supposed to be a bad thing from a hardware perspective? As I
understand it, synchronising shared memory is a massive source of complexity
and stalls in modern processors. You get better performance in multithreaded
code when you churn through new memory rather than mutate what you've already
got, and even better in a stream processor where you know that is what is
going to happen. Still, I suppose that's not a great argument for custom
hardware because code that would benefit could most likely be shoehorned onto
a GPU. Or maybe that's an argument that GPUs are already pretty close to being
the custom hardware that we need.

~~~
qznc
> synchronising shared memory is a massive source of complexity and stalls in
> modern processors

Correct.

> making lots of copies is supposed to be a bad thing from a hardware
> perspective

Yes, because memory accesses are costly. You want your data to be packed
tightly in memory instead of chasing pointers all the time. An array is more
efficient than a linked list mostly due to caches.

The balance is key. You want one copy per core, but not one copy per data
update.
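
A micro-illustration of the array-versus-linked-list point in C++: summing the
same values from a contiguous array and from a linked list. The array walk is
a stream of sequential loads the prefetcher can predict; each list step is a
dependent load, so in a real program (where the nodes end up scattered across
the heap) a miss on `next` stalls the whole chain.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Node { std::int64_t value; Node* next; };

    // Contiguous data: sequential, prefetcher-friendly accesses.
    std::int64_t sum_array(const std::vector<std::int64_t>& xs) {
        std::int64_t total = 0;
        for (std::int64_t x : xs) total += x;
        return total;
    }

    // Pointer chasing: each load depends on the previous one.
    std::int64_t sum_list(const Node* n) {
        std::int64_t total = 0;
        for (; n != nullptr; n = n->next) total += n->value;
        return total;
    }

    int main() {
        std::vector<std::int64_t> xs{1, 2, 3, 4};
        std::vector<Node> nodes(xs.size());
        for (std::size_t i = 0; i < xs.size(); ++i)
            nodes[i] = {xs[i], i + 1 < xs.size() ? &nodes[i + 1] : nullptr};
        std::printf("%lld %lld\n",
                    static_cast<long long>(sum_array(xs)),
                    static_cast<long long>(sum_list(nodes.data())));
    }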

------
BuuQu9hu
I would love to know whether alternatives to floating-point, such as unums or
quote notation, are worth considering for those languages which have rationals
in their numeric towers.

~~~
trsohmers
Well unums in particular (especially the new type 3 "posits" ones) are very
interesting and practical for hardware implementation. John Gustafson, the
creator of unums, gave the first talk on type 3 unums at Stanford earlier this
month, and there are already multiple hardware implementation efforts (one of
which I am indirectly overseeing). While many people have recently hyped up
low precision floating point for things like machine learning, Posits provide
greater dynamic range AND greater accuracy for very few bits (as low as four,
with good ML results starting at 8 to 10 bits). On top of all of that, they
cost less area and power in silicon than IEEE hardware.

Check out the talk:
[https://www.youtube.com/watch?v=aP0Y1uAA-2Y](https://www.youtube.com/watch?v=aP0Y1uAA-2Y)

------
kainolophobia
The author is stuck on building a better horse.

Not to beat this analogy dead, but the reason Alan Kay et al. are so quick to
discuss alternative computing methods should be quite obvious to anyone who
doesn't limit their worldview to concepts that humans are already using.

Right now most processors are ridiculously general. They take a handful (ok, a
couple thousand or so) of instructions and do their best to parallelize them,
both within a single core and across multiple cores. These instructions are of
the "add, multiply, load, store" variety, with a few additional instructions
for machine learning[1] and whatever HP wants[2].

This is it. This is the state of computing. How do bees work? Why can spiders
hunt? When did crows start using tools? What makes us different than bonobos?
How are all of these creatures so capable, yet so energy efficient?

We are taking a single solution, RISC/CISC architecture, and brute-forcing the
hell out of it. Rather than build adaptive or purpose-built hardware, we're
stuck on this concept of compile everything to x86/ARM and shrink the
transistors (or try and offload parallel number crunching to the GPU).

What the author fails to realize is that computers are just fancy looping
mechanisms. We use "HLLs" to compile abstract loops into instructions that run
on general purpose machines. That's it.

The "apparently credible" people see the world in this light. They understand
that the solution we've chosen is subpar, but the physics will make it work
for some time.

A few other commenters have mentioned FPGAs. I'm not here to pitch a future on
FPGAs; the die is still flat, the gates can only be reprogrammed so many times
and they're generally "expensive."

I will say that we need better tools. FPGAs are a good start. Intel knows
this[3]. Microsoft knows this[4].

With an FPGA you can dynamically program the exact logic a given operation
will need. Whether it's real-time signal analysis, AI-built logic, or
memcached, your logic will run exactly as specified.

Using purpose-built logic to run functions "natively" will drastically improve
the efficiency of computation; both in time and energy.

It's really hard to build a horse that will fly to the moon. It's a lot easier
to build a spaceship that can carry a horse to the moon.

[1][http://lemire.me/blog/2016/10/14/intel-will-add-deep-
learnin...](http://lemire.me/blog/2016/10/14/intel-will-add-deep-learning-
instructions-to-its-processors/)

[2][https://en.wikipedia.org/wiki/IA-64](https://en.wikipedia.org/wiki/IA-64)

[3][https://www.bloomberg.com/news/articles/2015-06-01/intel-
buy...](https://www.bloomberg.com/news/articles/2015-06-01/intel-buys-altera-
for-16-7-billion-as-chip-deals-accelerate)

[4][https://www.wired.com/2016/09/microsoft-bets-future-chip-
rep...](https://www.wired.com/2016/09/microsoft-bets-future-chip-reprogram-
fly/)

~~~
pjmlp
I agree, and hope VHDL has a brighter future in the FPGA world than Ada did in
the mainstream software landscape.

------
gravypod
If you want a "high-level" computer that can easily run high-level languages,
you can try building a stack machine. Check out the old LISP machines from the
early 80s.

High level and fast for the time with some of the most interesting compiler
design work.

Some other possible crazy ideas:

    
    
* Write instructions for forcing a read/write of memory into L1 cache
    

Allow me to tell the CPU to keep a chunk of memory in L1 for the lifetime of
my program. I'll save you the transistors for figuring this out on the chip
and it'll be easy to implement in a compiler. This is stuff that was done back
in the NES and Game Boy days (although by hand) with hi and lowmem.
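
Today's closest widely available knob is a prefetch hint rather than pinning;
a hedged C++ sketch using the GCC/Clang builtin (__builtin_prefetch takes an
address, a read/write flag, and a temporal-locality hint; it asks the CPU to
pull the line into cache early but, unlike the instruction proposed above,
guarantees nothing about keeping it there):

    #include <cstddef>

    // Walk a hot table while hinting that upcoming entries should already be
    // in cache by the time we touch them.  locality = 3 requests maximum
    // temporal locality (keep the line in cache as long as possible).
    long sum_with_prefetch(const long* table, std::size_t n) {
        constexpr std::size_t kAhead = 16;  // prefetch distance, in elements
        long total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + kAhead < n)
                __builtin_prefetch(&table[i + kAhead], /*rw=*/0, /*locality=*/3);
            total += table[i];
        }
        return total;
    }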

    
    
* Put an easily interfaced FPGA on the die. Get rid of the stupid "one-size-fits-all" vectorization hardware we have today. We can write our own vectorizations ourselves at the compiler level. Just make a big enough FPGA and we'll do all the math as fast as we want and with the parallelism we want.
    

This removes a very specific, job-oriented bit of transistors and allows it to
be used for pretty much any problem that's complex enough to warrant
attempting to use vectorization. One use of a combination of these ideas is
writing an image filtering algorithm. It's a program that loops through the
rows of my image, runs the "L1 cache" instruction from earlier, then passes
the data to my FPGA code, which applies some complex filter, then writes it
back to memory and continues. With traditional systems you'd be limited to
128-bit-wide segments, but I could presumably configure my FPGA segment to
read the entire block of cache I want to load into, and write back to it, all
as fast as possible.

This is an extremely hard thing to build but on-die FPGAs would change the
face of computing. When they get big enough, who needs GPUs?

    
    
* Make core-shared, fully atomic (large) registers for extremely fast but crude IPC
    

This is how most high level languages operate, so I'd imagine this could
definitely be made use of.

    
    
* Make the idea of a core-to-core MPI via interrupts and allow them to be configurably "synchronous" (page the program out until handled)
    

This allows the idea of calling a function on an object running in parallel
with "this" object and having the result come back when that function returns.

These are really crazy ideas. Even if I had infinite money, time, and
resources, I probably couldn't pull something this "out there" off. I think
the on-die FPGA is possible but impossible to get right. Most features on a
modern processor that take up space are a combination of functions that could
all be done on an FPGA rather than wasting transistors on that _specific_
function.

In a server do I really need 10 million transistors on my CPU for H264
decoding? Replace that with cache and a live-reprogrammable FPGA and we can do
that all in software _when_ and _if_ we need it.

~~~
smitherfield
_> Write instructions for forcing a read/write of memory into L1 cache_

Mightn't optimizing compilers create something of a "crabs in the bucket"
syndrome where all the running programs are trying to hog the L1 cache?

~~~
gravypod
That should be handled at the OS level, and the processor should provide
facilities to directly store/page a program into long-term memory. The more
the hardware can facilitate the lie that our program is living "forever" in
RAM, the better. I have a feeling that we could also easily add some memory
chips onto the computer board to act as a super fast in-hardware swapfile,
holding things waiting to be paged into memory or out to disk.

The L1 cache should be configurable for the entire existence of a program, and
I'm pretty sure we'd see a lot less paging and far fewer page faults then.

How far away from your function is $some_global_variable? It doesn't matter if
we can force the machine to always keep some_global_variable in cache.

But in short, it would definitely end up being like you say, people screwing
over other people, if there weren't other facilities in place to keep it safe
and fast. It's a job for someone smarter than I.

------
chriswarbo
The claim that static types (in regards to Haskell) make things low level
strikes me as wrong. It's probably a difference in the interpretation of the
word "type", which is unfortunately used to describe many different things.

From a C perspective, the word "type" pretty much means "memory layout", e.g.
the difference between "char" and "int" is that the latter (may) use more
memory than the former; the difference between "int" and "float" is that the
bits are interpreted in a different way; a struct describes how to lay out a
chunk of memory; and so on. Static checks can be layered on top of this, but
it mostly boils down to 'not misinterpreting the bits'. I've seen this concept
distinguished by the phrase "value types".

In Haskell land, types have no particular relationship to memory size/layout;
e.g. we don't really care what bit pattern will be used to represent something
of type "Functor f => (a -> b) -> (forall c. c -> a) -> f b". Unlike C,
there's no underlying assumption that "it's all just bits" which we must be
careful to interpret consistently; instead, it's all left abstract,
grammatical and polymorphic, leaving it up to the compiler to map such
concepts to physical hardware however it likes.

I certainly think it's a mistake to think in terms of "high level == dynamic
types"; there's the obligatory
[https://existentialtype.wordpress.com/2011/03/19/dynamic-
lan...](https://existentialtype.wordpress.com/2011/03/19/dynamic-languages-
are-static-languages) and I'd also consider Homotopy Type Theory to be a
_very_ high-level language. It's also a rather constraining simplification
too, as it ignores the many other dimensions of a language.

Haskell's garbage collection is an obvious abstraction over memory which has
nothing much to do with static/dynamic types (linear/affine/uniqueness types
are very related, but are yet another overloading of "type" ;) ). As another
example, I would count Prolog as more high-level than (say) Python since it
abstracts over "low level" details like control flow; again, nothing to do
with their types. Likewise, a message-passing language, operating
transparently over a network would be more high-level than a language which
communicates by (say) opening sockets on particular ports of particular IP
addresses and serialising/deserialising data across the link when instructed
to by the programmer; we'd be abstracting over the ideas of physical machines,
locations and networks.

Calling Haskell low level because of its types ignores such other dimensions;
in the context of hardware design, machine code, von Neumann architecture,
etc. I'd say that abstracting over control-flow with non-strict evaluation is
enough to make Haskell high-level.

