
A 32nm 1000-Processor Array - lossolo
http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf
======
mrb
It is surprising that the paper makes no comparison with GPUs. So I will do it.

For starters, it looks like they are talking about integer operations (I only
skimmed the paper, and it mentions an ALU, not an FPU), whereas my GPU numbers
below are for single-precision floating point. So it is apples vs. oranges.

A modern 14-16 nm GPU like the Tesla P100 or RX 480 does about 5 to 10
trillion ops/sec at 200-300 W, which works out to approximately 30 pJ/op. So
GPUs are about 5x less power efficient; the paper's authors did a good job.
However, GPUs are not optimized for absolute power efficiency, but mostly for
performance per dollar.
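
As a sanity check on those figures (my own back-of-the-envelope arithmetic,
using the rough numbers above, not measured values):

    # energy per op = power / throughput
    watts = 250            # midpoint of the 200-300 W range
    ops_per_sec = 8e12     # midpoint of 5-10 trillion ops/sec
    print(watts / ops_per_sec * 1e12)  # ~31 pJ/op, matching the ~30 pJ/op quoted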

~~~
duckingtest
(Dynamic) power consumption goes up with the square of voltage, so a
comparison against a GPU's maximum power-efficiency point (almost certainly
underclocked and undervolted) could change the comparison significantly.
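
For reference, the textbook first-order CMOS model (not from the paper; the
alpha and C values below are made-up illustrative numbers):

    # P_dyn = alpha * C * V^2 * f  (activity factor, switched capacitance,
    # supply voltage, clock frequency)
    def dyn_power(alpha, C, V, f):
        return alpha * C * V**2 * f

    # Dropping 1.0 V to 0.8 V at the same clock cuts dynamic power ~36%:
    print(1 - dyn_power(0.2, 1e-9, 0.8, 2e9) / dyn_power(0.2, 1e-9, 1.0, 2e9))

And since the attainable clock itself falls with voltage, power at full speed
scales closer to V^3, which is why modest undervolting buys outsized efficiency.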

~~~
stcredzero
_Power consumption goes up with square of voltage_

This needs to become common knowledge amongst nerds and tech types.

~~~
theparanoid
It was more widely known back when every geek overclocked; no longer.

~~~
stcredzero
I was underclocking when everyone else was overclocking. Because I wanted
_quiet_!

------
willvarfar
So how do you program such a beast? What progress is being made on that front?

Cache coherency seems really hard to give up, and CPU-GPU cache coherency is
becoming the expected norm, with even ARM delivering it.

~~~
RachelF
It's very hard.

In the mid-1980s there was a CPU called the "Transputer" [1], made by some of
the people who later moved to ARM. These CPUs could be connected together in
huge networks and talk to each other directly.

The network of CPUs could auto-discover its topology, but coding for so many
CPUs was difficult. Some specific algorithms scaled well with the number of
CPUs, but most did not.

[1] [https://en.wikipedia.org/wiki/Transputer](https://en.wikipedia.org/wiki/Transputer)

~~~
jacquesm
Occam did it quite nicely.

I think the reason transputers didn't 'make it' is not that they were super
hard to program (it was only a little harder than programming a regular
computer), but that the price premium you paid for a transputer set-up was
too high, and x86 got faster _very_ rapidly.

This is right around the time when the first 386 machines were launched and in
a very short time we went from 12-20 MHz 286 boxes (and some 68K machines for
the lucky ones) to 33 MHz 386 machines with a ton of RAM.

So the advantage that transputers had was eroded very quickly and I don't
think INMOS was ready to match pace.

Now that we've reached the end of the line for that kind of speed increase,
we are seeing renewed interest in multi-CPU fabric architectures, of which
the transputer was an instance.

~~~
bluedino
>> x86 got faster very rapidly.

Is ARM (or something else) going to surpass x86 (amd64, whatever you want to
call it) in the near future?

~~~
digi_owl
Just a layman, but I get the impression as of late that things are running
into thermal and feature-size issues.

And that these will affect all ISAs equally.

------
imtringued
This reminds me of the Ambric Am2045 with 336 cores, from 2007. They put 40
of these into an X-ray machine for a total of 13,440 cores.

[https://en.wikipedia.org/wiki/Ambric](https://en.wikipedia.org/wiki/Ambric)

[http://www.embeddedinsights.com/epd/Diagrams/nethra-am2045.jpg](http://www.embeddedinsights.com/epd/Diagrams/nethra-am2045.jpg)

~~~
petra
There's a paper (1) talking about getting ~60 TFLOPS from ~130k compute
elements via 3D assembly with 128 MB of RAM. But it requires microfluidic
cooling, so it would be very expensive.

In another paper (2), the authors say that offering this chip, even at a high
price, could start building the ecosystem and environment for a
Moore's-law-like race to solve the heat issue via photonics.

(1) [https://www.semanticscholar.org/paper/Fft-on-Xmt-Case-Study-of-a-Bandwidth-intensive-Edwards-Vishkin/4f9bd24ea869f64c4457f69ad6d9dc2b3ddb93ac/read/page/7/panel/1](https://www.semanticscholar.org/paper/Fft-on-Xmt-Case-Study-of-a-Bandwidth-intensive-Edwards-Vishkin/4f9bd24ea869f64c4457f69ad6d9dc2b3ddb93ac/read/page/7/panel/1)

(2) [http://drum.lib.umd.edu/handle/1903/17153](http://drum.lib.umd.edu/handle/1903/17153)

------
white-flame
I think that [http://venraytechnology.com/](http://venraytechnology.com/) is
the only high-CPU-density design I've seen that seems useful for
general-purpose, large-footprint computing, but their launch model (sell to a
DRAM manufacturer and cash out) didn't go anywhere.

So many of these tons-of-cores chips can't really do much per chip, and are
only suited to streaming algorithms like encryption, data-packet routing,
video-stream processing, etc. They also have nowhere near the memory
bandwidth to compete with GPUs, or to feed that many processing units with
unique data per unit.

Would somebody please think of the memory requirements? :-P (or Venray, please
seek investment and start pushing your designs yourself)

------
JoeAltmaier
Thinking out loud.

Instead of imagining applications per processor, I imagine this device could
map threads or message handlers to processors. It could work better with a
functional language, or at least some language that manages parallelism in
the runtime rather than explicitly in code. Free the app writer to code just
the algorithm, not thread synchronization.

E.g., imagine each timer wait being a processor spinning; each I/O loop being
a processor that blocked/looped on an I/O pin state. With so many processors
to schedule, it wouldn't stall application progress to spin or block an
individual strand (until you ran out of processors). To make this efficient,
they'd want interrupts and semaphore state to be hardware-supported. Instead
of polling a memory location, block on a masked shared register where each
bit is a condition. So instead of a 'kernel call' it'd be an opcode or a
small loop, and wakeup latency then becomes about one machine cycle.
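
A toy sketch of that wakeup model (Python threads standing in for dedicated
processors, with an Event playing the hypothetical hardware condition bit):

    import threading

    # One "processor" per strand; each blocks on a single condition bit.
    condition_bits = [threading.Event() for _ in range(8)]

    def strand(bit, name):
        bit.wait()                      # blocked in "hardware", not polling
        print(name, "woke and ran its handler")

    procs = [threading.Thread(target=strand, args=(bit, "proc%d" % i))
             for i, bit in enumerate(condition_bits)]
    for p in procs:
        p.start()

    condition_bits[3].set()             # an I/O pin change sets the bit
    procs[3].join()
    for bit in condition_bits:          # release the rest so the demo exits
        bit.set()
    for p in procs:
        p.join()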

I imagine that with the right runtime support this could be a useful device
for a large I/O server. It could reduce the latency of processing each client
message to just the execution time: no time burned in kernel calls, process
switching, stack copying, or interrupt/event latency.

~~~
endergen
I'm an ex game-engine developer and I bristle anytime anyone claims any
existing functional language is better for multicore. Garbage collection
alone will generally make a language an order of magnitude slower per core.
Also, the C/C++ game-development community at least has great approaches to
multicore that make C/C++ scale linearly with cores to boot; see for example:
[http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine](http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine).

I love functional languages, more for thinking in them and prototyping ideas,
especially compilers/visualizers, etc. But any language that adds garbage
collection and immutable data structures (way more operations per write, and
crazy memory thrashing/alignment issues) is going to pay an order-of-magnitude
performance loss unless those features are used sparingly or in a mixed
paradigm (ugh, except maybe Scala/Clojure).

Mind you, there are tricks: use a systems language (C/C++/Rust/D, etc.) for a
lot of the heavy lifting, with the application core being functional. That
gets you closer to the best of both worlds.

~~~
zbobet2012
Garbage collection is an idea. It can be slower, or faster, than other
memory-management techniques depending on the implementation and the specific
usage.

Functional programming languages do not _require_ a GC. They just largely
have one.

"Fibers" like you linked in the presentation (m:n green thread scheduling)
have been in use for decades. Many, many languages other than C++ have had
them for over a decade. Go is built on them.

Functional languages _can be_ better for multicore because of referential
transparency. As a game dev you are used to working on 4-8 cores. Some of us
work on 40-80 cores * 10k machines and have been for years. Much of your
complaints such as immutable data overhead make sense if it let's you work on
10x more cores at the same time. I will also point out that immutable data
_really_ is not 10x slower, unless you think those Haskell micro bench marks
are all lies.
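
A minimal illustration of the referential-transparency point (a generic
sketch, not tied to any particular language runtime): a pure function can be
farmed out to any number of workers with no locks, because no call can
observe another call's effects.

    from multiprocessing import Pool

    # Pure function: the result depends only on the argument, so calls can
    # run on any core, in any order, with no locks or shared state.
    def f(x):
        return x * x + 1

    if __name__ == "__main__":
        with Pool(4) as pool:
            print(pool.map(f, range(8)))  # [1, 2, 5, 10, 17, 26, 37, 50]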

~~~
valarauca1
>Functional programming languages do not _require_ a GC. They just largely
have one.

No, they just require infinite memory [1] XOR a GC.

Pure functional programming has no concept of alloc/dealloc, let alone the
concept that a binding/assignment can fail. These things are real. To quote
James Mickens [2]:

>Pointers are real. They’re what the hardware understands. Somebody has to
deal with them. You can’t just place a LISP book on top of an x86 chip and
hope that the hardware learns about lambda calculus by osmosis. Denying the
existence of pointers is like living in ancient Greece and denying the
existence of Krackens and then being confused about why none of your ships
ever make it to Morocco, or Ur-Morocco, or whatever Morocco was called back
then. Pointers are like Krackens—real, living things that must be dealt with
so that polite society can exist.

[1] Infinite memory simply means more memory than the program can ever
consume... but the halting problem exists, so you can't actually know how
much memory your program will consume :P

[2] [http://scholar.harvard.edu/files/mickens/files/thenightwatch.pdf](http://scholar.harvard.edu/files/mickens/files/thenightwatch.pdf)

~~~
nickpsecurity
PreScheme was a LISP to replace C for low-level programming. Used manual,
memory management instead of GC. Very fast and efficient.

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.4...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.4031&rep=rep1&type=pdf)

Formally verified, plus a Scheme48 interpreter, as part of the VLISP project.

[https://en.wikipedia.org/wiki/PreScheme](https://en.wikipedia.org/wiki/PreScheme)

ATS has no garbage collector that I'm aware of. I've seen it used in device
drivers and on 8-bit MCUs.

[https://en.wikipedia.org/wiki/ATS_(programming_language)](https://en.wikipedia.org/wiki/ATS_\(programming_language\))

LinearML is GC-less, functional, parallel programming.

[https://github.com/pikatchu/LinearML](https://github.com/pikatchu/LinearML)

So, functional languages require neither GCs nor infinite memory. Two of them
also combined low-level efficiency with easy, formal verification, versus C
programs. That's a closer mapping of idea to code, high efficiency, and
better safety all at once.

"You can’t just place a LISP book on top of an x86 chip"

I believe I just did with PreScheme. Microsoft goes further to straight-up use
a theorem prover to do x86 coding.

[http://research.microsoft.com/en-us/um/people/nick/coqasm.pdf](http://research.microsoft.com/en-us/um/people/nick/coqasm.pdf)

~~~
valarauca1
Well, yeah, but these languages aren't purely functional, which is what the
original discussion was centered on. If you throw imperative elements into FP
they're very useful, yes. But you've broken the functional paradigm.

Also, if all these things are amazing, why isn't anyone using them?

~~~
nickpsecurity
The lambda calculus and Turing machines are equivalent in power. You can, as
many academics have, model imperative programs as functional ones. You can
compile functional programs, as almost all compilers do, to imperative code
in C or assembler. We often draw the distinction on syntax and structure, but
both are passing state into functions that optionally produce state. Edit to
add that my forays into hardware show it's all functional (analog)
underneath: mathematical functions running continuously, without memory,
emulating abstract machines.
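
A tiny example of that equivalence (my own illustration): the same loop
written imperatively and as a pure fold that threads state through a
function.

    from functools import reduce

    # Imperative: mutate an accumulator.
    def sum_squares_imp(n):
        total = 0
        for i in range(n):
            total += i * i
        return total

    # Functional: pass state in, get new state out; no mutation.
    def sum_squares_fun(n):
        return reduce(lambda acc, i: acc + i * i, range(n), 0)

    assert sum_squares_imp(10) == sum_squares_fun(10) == 285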

"Also if all these things are amazing why aren't anyway using them?"

Social and economic factors, as usual. See Gabriel's essay "Worse is Better":

[https://www.jwz.org/doc/worse-is-better.html](https://www.jwz.org/doc/worse-is-better.html)

Just take the C language. I have its history in detail and with citations. It
was literally an engineered language chopped up to run on bad hardware,
chopped again with arbitrary alterations for bad hardware, and slightly
extended for bad hardware again. Most people had bad hardware. It worked well
on that. It spread like a virus, with gradual improvements. It's still
nowhere near what engineered languages can pull off across the various
tradeoffs. Yet almost everything is written in it now, thanks to it working
on half-assed hardware, a MULTICS chop called UNIX doing so, UNIX being
distributed freely, and UNIX being written in C. Social and economic factors
spread it like a virus, then improved it to approximate solutions designed
under the cathedral model with better properties.

[http://pastebin.com/UAQaWuWG](http://pastebin.com/UAQaWuWG)

Meanwhile, alternatives sprang up that kicked both their butts in
capabilities. The LISP machines, functional programming's answer to whole
systems, had a flow and consistency you still can't match with modern stacks.
The B5000, a HW/OS combo designed for safe languages, would've given hackers
hell. The Amiga combined SW and HW offloaders for excellent performance...
like today's servers & game consoles. BeOS screamed in concurrency,
multimedia, and ease of use while popular Windows and Mac boxes couldn't do
but a fraction of it. AS/400s and VMS boxes ran, ran, and then ran some more,
with at least one person forgetting how to reboot them, haha. Some
high-security systems survived NSA pentesting, while things that get easily
smashed by amateurs prevail today for _security-critical work._ I think it's
clear a language or system's technical superiority has almost no causal
relationship with mainstream adoption by laypeople or technical people.

Actually, and sadly, it's usually better to bet on Worse-is-Better approaches
with slight improvements for success or adoption. Occasionally you can bet on
The Right Thing and win, as Mozilla is doing with Rust. Heck, even the
Burroughs B5000 lives on in Unisys mainframes. OpenVMS lives on in the
Windows NT family, which cloned it for desktops minus the strong focus on
quality (sighs). ZeroMQ is a good example in middleware. Nix applies
database-style principles to package management. Technically superior stuff
occasionally goes mainstream, but not often. Human nature usually wins. :(

------
nickpsecurity
It's important for people to know these are _eight-bit_ CPUs. At least,
that's what every other incarnation has been.

[https://en.m.wikipedia.org/wiki/Kilocore](https://en.m.wikipedia.org/wiki/Kilocore)

Asking what to do with 1,000 32- or 64-bit cores != 1,000 8-bit cores.
Suddenly the value proposition doesn't seem so great. Such designs were used
successfully, though, in both neural networks and genetic algorithms. They
could probably handle well the stuff that normally goes on DSPs.

------
mpweiher
This sort of thing, taken to even greater extremes, could be an interesting
match for Actor languages, with an actor mapped to each processor. The
overhead would likely be _ridiculous_, as most actors would be blocked most
of the time.

However, if it is true that we can now put more transistors on a chip than we
can power, that might actually be a feature rather than a bug. Assuming that
power management is good and these actor-processors consume little or no
energy when blocked, the power/performance ratio for most applications should
be awesome.

------
foota
Does anyone have a link to, or the name of, another weird architecture that
was posted a while ago? (~3 months ago, maybe?) I remember that there were a
large number of cores that all communicated with each other in some unusual
way, and that they didn't have main system memory, or something like that...

~~~
trsohmers
Pretty sure you mean us (REX Computing;
[http://rexcomputing.com](http://rexcomputing.com)), though I would not
consider our network on chip "weird" :P

The big difference you were trying to remember is our use of scratchpad
memory, which, simply stated, is this: we can radically reduce power
consumption, increase density, and increase the speed of on-chip memory
(SRAM) by removing the traditional hardware caching system. We instead use a
purely software-managed memory system, through some very fancy (dare I say
"smart") compiler techniques enabled by a very simplified architecture and
the ability to guarantee latency for all memory operations. We can still have
main system memory (meaning DRAM); it's just that instead of having a bunch
of complex hardware that burns a lot of power and wastes a lot of space
automatically fetching pages out of DRAM, we structure the code for each core
to pull in only the data necessary, when it is needed.
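
A minimal sketch of the software-managed idea (my own illustration, not REX's
actual toolchain): double-buffer tiles of DRAM-resident data through a small
scratchpad, so the fetch of the next tile is scheduled by the program rather
than by cache hardware.

    # Toy model of scratchpad double-buffering. In real hardware the
    # "fetch" would be an asynchronous copy the compiler schedules to
    # overlap with compute; here it is just a slice copy.
    def process(dram, tile):
        scratch = [dram[0:tile], None]        # prefetch the first tile
        total = 0
        for i in range(0, len(dram), tile):
            cur = (i // tile) % 2
            if i + tile < len(dram):          # start fetching the next tile
                scratch[1 - cur] = dram[i + tile:i + 2 * tile]
            total += sum(scratch[cur])        # compute on the current tile
        return total

    print(process(list(range(100)), 10))      # 4950, same as sum(range(100))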

~~~
gnufx
Perhaps you've noticed the architecture of the new top TOP500 system. There's
currently an HPC guessing game of filling in the blanks in
[http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf](http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf),
but of course the CPU comparisons stop at Knights Landing.

One of the things I wondered about is how Linux is adapted to such an
architecture. I couldn't find anything from REX on operating-system support.

~~~
trsohmers
Yep, have seen it... it seems not to be exactly what was described by the
Chinese at ISC or SC last year (their original description being much more in
line with a DSP), though there is not much data available. All of the
articles I am seeing today are calling it DEC Alpha "like", though the only
real source for this seems to be the Wikipedia article for that family of
processors, last updated in 2013. Dongarra even specifically says in the
paper you linked that it is _NOT_ related to the Alpha ISA, so it seems all
of the media articles are incorrect ;)

As for REX, as dnautics said, we've been mostly focused on running raw
compute kernels on the current simulated versions (software and FPGA) and on
the soon-to-be-in-hand silicon (coming this fall)... one of our internal
projects is to port the L4 microkernel, plus a telecom-focused RTOS, but that
is as far as our operating-system plans go for the near future. I'd also love
to get a Plan 9/Inferno demo running on it for fun, but we've got more
important work to do at the moment.

~~~
gnufx
Oh, so I guess the Sunway thing isn't as much like Neo conceptually as it
looked to me from what little information I had. (I wasn't thinking about the
ISA.) Of course, the media are mainly useful as pointers to primary sources
worth chasing.

The point about the OS related to the Linux-based one for Sunway, but maybe
that only runs on the management processor anyhow, with work just offloaded
to the others. I'd commented in that respect that in an ideal world we really
don't want something like Linux, so I'm pleased to see the mention of L4.

Thanks for the comments, and good luck.

------
ableal
Dug around a bit for the paper, found this 2 page PDF:

[http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2...](http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf)

------
samfisher83
The paper doesn't really explain how it's supposed to handle memory. You have
1,000 processors and 12 memory modules of 64 KB each. If you have to access
real RAM, how will that be done? You would need a massive bus, like a GPU's,
to keep all the cores fed with data. How do you keep the processors from
starving for data?

I understand it's an academic project, but I wish they would explain what the
real-life use of this thing would be.

Here is another project I found:
[https://en.wikipedia.org/wiki/Kilocore](https://en.wikipedia.org/wiki/Kilocore)

Not the same thing, but I wonder how many they sold? More like 1,000 small
processing units instead of full CPUs.

------
greggman
I'm certainly hopeful this will lead somewhere useful. Intel tried with
Larrabee but apparently couldn't get the performance they wanted, especially
in comparison to GPUs.

It would be great to know what the trade-offs of this architecture are.

~~~
alexnewman
You mean Xeon Phi. I think it's still around. It is plenty fast, but not much
better in TDP than a GPU (which is a commodity; read: cheap).

~~~
deadgrey19
Xeon Phi is around and used in various financial computations. The problem
with GPUs is not TDP but programming flexibility. Programming a GPU well for
generic computations is very hard, whereas x86-like CPUs have millions of
programmers who can do a pretty decent job.

------
chongli
Would a language like Erlang make sense for a machine like this? Run
thousands of communicating processes without any shared data.
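
For illustration, a share-nothing sketch (Python's multiprocessing standing
in for Erlang-style processes, since the idea is language-agnostic): workers
that own no shared data and communicate only by message passing.

    from multiprocessing import Process, Queue

    # Each worker owns its own state; all communication is via messages.
    def worker(inbox, outbox):
        x = inbox.get()
        outbox.put(x * x)

    if __name__ == "__main__":
        inbox, outbox = Queue(), Queue()
        procs = [Process(target=worker, args=(inbox, outbox))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for x in range(4):
            inbox.put(x)
        print(sorted(outbox.get() for _ in procs))  # [0, 1, 4, 9]
        for p in procs:
            p.join()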

~~~
loxs
I don't think Erlang's VM is that optimized. Currently (as far as I remember)
the upper limit is 1024 schedulers (threads), but I imagine it certainly
won't be 10x more efficient than running 100 schedulers on a 100-core
machine.

~~~
qaq
Architecturally it's a perfect match for such CPUs. If you listen to Joe
Armstrong's talks, he was talking about designing for 1,000-core CPUs more
than 10 years ago. It is his stated opinion that Erlang is a perfect match
for such situations.

~~~
loxs
Of course it's (much) better than anything else... though my experience is
that it's quite hard to scale your application that much... You always need
some shared resource, and it gets ugly. I also suspect Erlang has some
implementation quirks (for example, the algorithm for deciding which
scheduler gets which process) that will prevent it from scaling that far.
Yeah, I do believe that some day we'll reach such scalability, probably just
not today.

~~~
qaq
Each scheduler has its own run queue; there is higher-level migration logic
to balance the run queues based on collected statistics.

------
lossolo
Link to the press release:
[https://www.ucdavis.edu/news/worlds-first-1000-processor-chip](https://www.ucdavis.edu/news/worlds-first-1000-processor-chip)

------
beagle3
AIseek [0] had a 10,000-processor chip in 2006, with simple connectivity and
simple ops perfectly matched to graph algorithms. Their demo [1] would not be
impressive today, but doing it in real time in 2006 was quite a feat, and
basically impossible without specialized hardware. Unfortunately, they're no
longer in business.

[0] [http://www.extremetech.com/extreme/75766-new-ai-chip-would-make-games-smarter](http://www.extremetech.com/extreme/75766-new-ai-chip-would-make-games-smarter)

[1] [https://www.youtube.com/watch?v=VeSQI2hinp0](https://www.youtube.com/watch?v=VeSQI2hinp0)

------
i336_
> _KiloCore processors ... store data and instructions inside i) local memory,
> ii) an arbitrary number of nearby processors, iii) on-chip independent
> memory modules, or iv) off-chip memory._

Ah, like GreenArrays' GA144, which contains 144 discrete cores - each with a
small amount of memory - that can communicate with each other. The rationale
with that chip is that you store data in the memory of one or more
neighboring cores if it won't fit in the ~64 bytes (IIRC) of the core you're
working in.

------
mntmn
Where can I buy a sample?

------
rntz
Arguably the first 1000-processor chip was the CM-1, made in the 1980s:
[https://en.wikipedia.org/wiki/Connection_Machine](https://en.wikipedia.org/wiki/Connection_Machine)

They also bury the lede somewhat. Apparently the KiloCore is a 1.78
_Terahertz_ chip:

> The energy-efficient “KiloCore” chip has a maximum computation rate of 1.78
> trillion instructions per second

Edit: they later say it's 1.78 GHz. I guess "trillion" is just a mistake?

~~~
tlrobinson
Saying 1000 cores running at 1.78 GHz are a "1.78 THz chip" is a bit like
saying 1000 cars driving at 65 mph are a "65,000 mph car"
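
The arithmetic behind the press release's number, for what it's worth (my own
reconstruction):

    cores, clock_hz = 1000, 1.78e9
    # 1.78e12 aggregate instructions/sec, assuming one instruction per
    # cycle per core -- "1.78 THz" only in the 65,000-mph-car sense above.
    print(cores * clock_hz)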

~~~
DigitalJack
It is. But while it sounds absurd, there are cases where it makes sense. For
loads that are parallelizable, both the 1.78 THz chip and the 65,000 mph car
work as similes.

I wouldn't say they _are_ a 1.78 THz chip or a 65,000 mph car, but they are
like them.

~~~
matt_d
It's a modern version of the megahertz myth:
[https://en.wikipedia.org/wiki/Megahertz_myth#Modern_adaptations_of_the_myth](https://en.wikipedia.org/wiki/Megahertz_myth#Modern_adaptations_of_the_myth)

I think this makes even less sense as we go to 1,000 processing elements.
Even Amdahl's law is too optimistic for the workloads deployed in these
scenarios: an embarrassingly parallel workload doesn't help if we still need
to access memory, including going through the shared interconnect(s) while
keeping cache coherence in mind, and most workloads have phases that must be
synchronized, assuming we eventually want to write the results of our
computation somewhere:
[http://blogs.msdn.com/b/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx](http://blogs.msdn.com/b/ddperf/archive/2009/04/29/parallel-scalability-isn-t-child-s-play-part-2-amdahl-s-law-vs-gunther-s-law.aspx)
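
To make the Amdahl point concrete (standard textbook arithmetic, nothing
specific to KiloCore):

    # Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), s = serial fraction.
    def amdahl(s, n):
        return 1 / (s + (1 - s) / n)

    for s in (0.001, 0.01, 0.1):
        print("serial %4.1f%%: %5.0fx on 1000 cores" % (s * 100, amdahl(s, 1000)))
    # serial  0.1%:   500x / serial  1.0%:    91x / serial 10.0%:    10x

Even a 1% serial fraction caps 1,000 cores at under 100x.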

