
Forth Inventor Chuck Moore's $20 144 core CPU now in full production - csmeder
http://www.greenarraychips.com/
======
ChuckMcM
Well if you ever wanted to be 'out there' Chuck's processor would be a great
place to start. Chuck Moore invented Forth and has been building machines that
can run Forth efficiently for years. Think of it as a Turing machine that can
do useful work. He has pushed the edge of computation per watt the whole time.

That being said, I've heard him talk about these chips for years and it is
great to see them finally see the light of day. If you were familiar with the
Transputer technology, this is a better take on that; if you are familiar with
FPGAs, you can think of it as an FPGA with processors instead of CLBs.

Things it can do are similar to what Intel is doing with Larrabee or some of
the CUDA stuff that nVidia has done. It doesn't have a GDDR3 interface to a GB
of memory, so you can't custom-build your own GPU, but you could do your own
PhysX-type engine with it.

It also makes a helluva differential cryptanalysis tool, or a signals analysis
tool in general.

~~~
joelthelion
It might make a good bitcoin mining platform?

~~~
iwwr
Not really; it's underpowered, and mining lends itself better to
vector-oriented processors (like GPUs) than to CPUs.

------
scottyallen
Not being much of a hardware geek, I'm having a hard time evaluating how fast
each core is.

From the site: "With instruction times as low as 1400 picoseconds and
consuming as little as 7 picojoules of energy, each of the 144 computers can
do its work with unprecedented speed for a microcontroller and yet at
unprecedentedly low energy cost, transitioning between running and suspended
states in gate delay times. When suspended, each of the computers uses less
than 100 nanowatts."

How does this instruction time compare to other modern processors?

It sounds to me like some of the benefit here may come from the low power
required per core. Power and associated cooling are a MAJOR source of cost for
datacenters, so if this really does offer significantly lower power
consumption (again, I don't know how much power the alternatives use), then it
could have a big impact on the cost of commodity computation.

~~~
_delirium
If 1400 picoseconds is the time it takes for a clock cycle (and therefore the
minimum instruction time for 1-cycle instructions), that'd be about 700 MHz,
which actually seems pretty high compared to what I would've guessed.
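
A quick back-of-envelope check of that arithmetic, using only the figures
quoted above and assuming one instruction per 1400 ps with all 144 cores busy
(a best case, not a measurement):

    t_instr = 1400e-12        # seconds per instruction (quoted)
    e_instr = 7e-12           # joules per instruction (quoted)

    rate = 1 / t_instr        # ~7.14e8 instructions/s per core, i.e. ~714 MHz
    print(rate / 1e6)

    p_core = e_instr * rate   # ~5 mW per core running flat out
    p_chip = p_core * 144     # ~0.72 W with every core busy
    print(p_core * 1e3, p_chip)

That ~0.72 W worst case squares with the "well less than 650 mW in most
practical applications" figure from GreenArrays' product brief quoted later in
this thread.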

~~~
neuraxon77
As far as I know, the cores aren't clocked. Instead, Chuck designed his own
transistors, using his own OKAD II VLSI tools, to be efficient on the CMOS
process node, with switching speeds dictated by the transistor type and the
electrical properties of the interconnect. He then designed the cores so that
they only switch those transistors when they do actual work.

~~~
Symmetry
Ooh, awesome. Clockless computing seems like a really nifty idea, but it's
difficult. I imagine it might go mainstream when Moore's law finally runs out.

~~~
sliverstorm
"difficult" does not even begin to encompass the half of it. It is really cool
in theory, but in reality a ridiculously difficult challenge that existing
tools are in no way whatsoever up for.

~~~
thesz
Take a look at the Balsa design system:
<http://apt.cs.man.ac.uk/projects/tools/balsa/>

My colleague uses it for research purposes and says it is pretty mature.

~~~
sliverstorm
I suspect that if the only issue were translating architecture to logic, we'd
be doing it already.

------
honkybozo
Glad to see this thread. The G144A12 does better than you'd expect for 18-bit
ALUs doing 32-bit circular shifts and adds, but that costs enough extra
instructions that, for this particular algorithm, at any Bitcoin-useful
combination of throughput/energy/cost we can't compete with the genuine 32-bit
ALUs of the bigger ATI GPUs. Ya can't be perfect for _all_ problems _all_ the
time :) Nevertheless we'll be posting an app note eventually on SHA256 as an
illustration of techniques in pipelining.
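
(For context on those 32-bit circular shifts: SHA-256 leans heavily on 32-bit
rotates, which a narrower machine has to stitch together from pieces. A rough,
hypothetical Python illustration -- not GreenArrays' code, and assuming the
value is carried as two 16-bit halves:)

    MASK32 = 0xFFFFFFFF

    def rotr32(x, n):
        """32-bit circular right shift, the workhorse of SHA-256."""
        return ((x >> n) | (x << (32 - n))) & MASK32

    def rotr32_split(hi, lo, n):
        """The same rotate when the value lives in two 16-bit halves,
        modeling the extra stitching a narrow ALU has to do."""
        x = ((hi << 16) | lo) & MASK32
        r = rotr32(x, n)
        return (r >> 16) & 0xFFFF, r & 0xFFFF  # new (hi, lo) halves

    print(rotr32_split(0x1234, 0x5678, 7))  # rotate 0x12345678 right by 7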

The $20 price is for small quantities. Standard exponential decay curves apply
for production quantities; we want to see our chips in people's products and
are priced to encourage that.

As for 20-somethings: nobody in our company gets a paycheck (yet), so someone
has to be willing to work for nothing, but if you have a _practical_ idea for
an app note and want to work with us to get it done and published, please
email greg at greenarraychips dot com and let's discuss it. Thanks for your
interest, folks - Greg Bailey, GreenArrays, Inc.

------
thesz
I certainly should voice my opinion here.

I've done an analysis of the GA144 before:
<http://news.ycombinator.com/item?id=1810641>

Most of Chuck Moore's designs (I reviewed several, starting with the M17) can
be described by a quote from The Devil Wears Prada: "the same girl- stylish,
slender, of course... worships the magazine. But so often, they turn out to
be- I don't know- disappointing and, um... stupid". Chuck Moore's designs are
all slick, slender, stylish, and worship Forth, but they turn out to be
disappointing and stupid in the end. Often the only beneficiary is Chuck Moore
himself. You just cannot apply his experience to other places in the world of
computing.

Let's see what we have here in GA144.

The memory inside each core is way too small for general-purpose programs,
even if you split your program into 144 parts and spread them across the
cores: 64 words of RAM, i.e., roughly 128 bytes. 128 bytes times 144 cores is
about 18 KB. The same goes for the (program) ROM, and you have to factor
communication code in there; communication costs eat into RAM as well.

They offer no compiler for a high-level language like C. You have to learn a
specific dialect of Forth and some bizarre (albeit small) assembly language.

The only benefit to the general population from this affair is the relative
ease of the design of asynchronous hardware.

<http://en.wikipedia.org/wiki/Asynchronous_system>

~~~
bitcracker
You are focusing on classic applications only.

The GA144 is so different from the classic way of computing that it requires
new approaches to development. One very interesting feature of the GA144 is
that all 144 cores can share instructions over I/O lines. That means every
core can send instructions to its neighbors, which execute them directly
without conversion.

I/O is fast enough. I guess it should be possible to have (someday) an
external interface to SRAM which circumvents the low-memory problem.

> They offer no compiler from high-level language like C.

That's right, and that's the real weak spot of the GA144. Almost every
microcontroller board today comes with C or the like. Mr. Moore loves Forth,
but I doubt that there are many developers out there who would like to be
forced to learn Forth just for this single platform. I know Forth; it was one
of the first languages I ever learned. It is perfectly suitable for embedded
systems, but you have to learn a lot to master it.
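
For anyone who has never touched Forth: the core idea is a data stack
manipulated in reverse Polish order. A toy evaluator in plain Python (nothing
GA144-specific; the word set here is a made-up minimal subset) gives the
flavor:

    def forth_eval(source, stack=None):
        """Evaluate a tiny Forth-like RPN program against a data stack."""
        stack = stack if stack is not None else []
        words = {
            "+":    lambda s: s.append(s.pop() + s.pop()),
            "*":    lambda s: s.append(s.pop() * s.pop()),
            "dup":  lambda s: s.append(s[-1]),
            "swap": lambda s: s.extend([s.pop(), s.pop()]),
        }
        for token in source.split():
            if token in words:
                words[token](stack)       # execute a defined word
            else:
                stack.append(int(token))  # everything else is a literal
        return stack

    print(forth_eval("3 4 + dup *"))      # (3 + 4)^2 -> [49]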

~~~
thesz
I prototyped a dynamic dataflow machine which (in theory) could be scaled to
hundreds of cores (corelets - something very small that does not even have a
jump command). In my experiments, readying information to be sent accounts for
a hefty 30%+ of the code.

<http://thesz.mskhug.ru/svn/hhdl/previous/HSDF/CoreletTest.hs>

The link above contains a simple "Hello, world!" program, in five "big
instructions" which contain 21 corelet instructions in total. 8 of those 21
instructions are send and front-advancing instructions - their only purpose is
to establish communication between program parts. My machine sends pointers to
"big instructions", up to 32 bytes long (up to 32 instructions if you're
lucky), while the GA144 can send only 4 instructions max.

30% of 4 instructions is roughly 1 instruction. Another of those 4 is a loop
or jump or something like that. So you have two instructions left to perform
program logic. And that is the best case.

So I again express my dislike of the GA144 as a computing machine. And I again
express my gratitude to Chuck Moore for proving that clockless design works.

~~~
bitcracker
> In my experiments readying information to be sent accounts for hefty 30%+ of
> code.

Unfortunately I don't have time to dive into your design, but AFAIK the GA144
doesn't need 30% preparation code, because every instruction can be executed
immediately by neighbor nodes.

That means (correct me if I am wrong) that if core X has to evaluate a Forth
function of, say, five arguments, it could pass all five arguments to its
neighbors (without any preparation) by sending them the code addresses of the
arguments, wait until they have finished, and then use their results to
compute the function result. These neighbor nodes could themselves evaluate
(or delegate) subexpressions to other (free) nodes, and so on.

This form of parallelization would require efficient shared-memory access.
That problem needs to be solved because, AFAIK, the I/O ports are accessible
only by the edge cores. It doesn't make much sense to transport every piece of
shared data through several columns or rows of cores.

~~~
thesz
I think you're wrong about sending many arguments in a command to another
core.

You can send one word, i.e., a "big command" composed of four MISC commands.
One of the MISC commands in a "big command" can retrieve data from another
core.

So most of the time you will be waiting to send a command or to receive some
data.
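
To make that concrete: as I understand the F18A instruction format (this
packing sketch is my reading of it, and the opcode values are invented), an
18-bit word carries four instruction slots -- three 5-bit slots and one 3-bit
slot, so the last slot can only hold a subset of the instruction set:

    def pack_word(s0, s1, s2, s3):
        """Pack four opcodes into one 18-bit 'big command'.
        Slots 0-2 are 5 bits wide; slot 3 gets only 3 bits."""
        assert s0 < 32 and s1 < 32 and s2 < 32 and s3 < 8
        return (s0 << 13) | (s1 << 8) | (s2 << 3) | s3

    word = pack_word(0x04, 0x1F, 0x00, 0x05)  # hypothetical opcode values
    print(f"{word:018b}")                     # the whole 18-bit word

With one slot often spent fetching data from a neighbor and another on control
flow, the two-instructions-of-actual-logic budget above follows directly.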

------
Symmetry
In past Forth threads I've complained that the Forth model of a computer seems
just too at odds with how modern CPUs actually work for me to want to learn
it. Well, I look at this thing and it's like the soft draft of the future
slipping underneath the door, whispering that maybe I should learn Forth after
all.

~~~
_sh
> maybe I should learn Forth after all

<http://factorcode.org/>

Make it easy on yourself.

~~~
wx77
That is probably making it harder on yourself, as the recommended route for
learning Factor is to read a Forth book and then read the Factor docs. (Unless
something has changed and a book or big tutorial has been written.)

With that said, Factor is really cool.

------
neopanz
Love the guy and his passion for his chips, but this 144-computer chip looks
like a solution in search of a problem: what is it trying to solve? How is the
inter-core communication handled? On the other hand, creating weird chips like
these, just because you can, is awesome and stimulates a hacker's mind.

~~~
6ren
Some of the industry's greatest advances have begun as a solution in search of
a problem, such as the microprocessor (which its inventor, Intel, didn't think
much of compared with memory chips, where the real money was). Most fail, of
course.

If someone can make many-core, in this form, do something useful that can't be
done elsewhere (unlike DSP and GPUs, which are already many-core), it will
fundamentally upend computing.

~~~
queensnake
> greatest advances have begun as a solution in search of a problem

.. and the laser(!)

------
csmeder
You can buy a 10-pack of evaluation chips from GreenArrays' web site, or a
single chip from Schmartboard's website:
[http://www.schmartboard.com/index.asp?page=products_csp&...](http://www.schmartboard.com/index.asp?page=products_csp&id=532)

------
jxcole
I think this sounds ridiculously cool and at $20 I could probably afford to
play around with it. But I have no knowledge of hardware, just programming. To
me chips just look like thin green rectangular prisms. How would I get it to
actually, you know, do stuff?

~~~
ars
You would probably want to buy the Evaluation Board which lets you control the
machine using USB ports and an ASCII console.

The thing is, it's just a computer. The real value of this is hooking it to
hardware to actually do something. So I suggest starting here instead:
<http://www.greenarraychips.com/home/documents/budget.html> and learning how
to work with hardware as well.

~~~
makmanalp
You can do this for about $30:
<http://www.greenarraychips.com/home/documents/budget.html>

"suggestion for a complete system"

------
neopanz
I think the only thing Chuck is missing is a few 20-something hackers, bent on
solving an 'impossible' problem with his chips.

~~~
csmeder
That's why I'm posting it here.

Some possible ideas

====================

\- Sound processing (think 144 cores performing Fourier transforms)

\- Pico satellites (this chip has a surprisingly low power requirement)

\- Wireless communication (a small power requirement means small batteries)

\- Computer vision processing. Imagine toys or tools that can process visual
information faster than an Xbox.

\- Basically anything that can be done in parallel, is suited for small size
and low power, and doesn't require much on-chip memory. (A toy sketch of this
pipelined style follows below.)
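
A minimal sketch of that pipelined style in plain Python -- the "cores" and
stage functions here are invented for illustration, not GA144 code; each stage
keeps almost no state and just hands results to its neighbor:

    from functools import reduce

    def core(stage):
        """Model one tiny core as a function applied to a sample stream."""
        return lambda samples: [stage(x) for x in samples]

    pipeline = [
        core(lambda x: x - 128),                  # remove DC offset
        core(lambda x: x * 2),                    # apply gain
        core(lambda x: max(-255, min(255, x))),   # clip to range
    ]

    samples = [0, 100, 200, 255]
    print(reduce(lambda data, c: c(data), pipeline, samples))
    # -> [-255, -56, 144, 254]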

~~~
ippisl
I had a rough look at its instruction set and didn't see anything resembling
DSP instructions, so I'm not sure it would be great for all those uses.

On the other hand, picoChip sells a 200-300 core DSP that is being used in the
wireless industry.

~~~
csmeder
From this PDF
[http://www.greenarraychips.com/home/documents/greg/PB001-100...](http://www.greenarraychips.com/home/documents/greg/PB001-100503-GA144-1-10.pdf)
GreenArrays seems to think it would support DSP:

"SUITABILITY: The GA144 is designed to support the largest and most demanding
computing challenges that can be addressed with a modest sized die in a
relatively inexpensive and easy to use package while still using well less
than 650 mW in most practical applications. The geometry allows for generous
numbers of parallel paths and/or pipeline stages, or for complex flowgraphs in
control, simulation, or DSP applications. Clusters of nodes devoted to
functions such as cryptographic algorithms are easily placed in the large
array, and the cluster needed to control external memory and run a high level
language from it is well out of the way but has good surface area for
interaction with other functions. Use it also as a universal prototyping
platform for applications destined to run on our smaller chips. "

~~~
sliverstorm
There is a huge difference between supporting DSP and being good at DSP. My
$0.50 MSP430, clocked at 32kHz, "supports" DSP.

~~~
dkersten
At the very least, I'd expect instructions for things like saturated addition.
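
For anyone unfamiliar with the term: a saturating add clamps at the limits of
the word instead of wrapping around. A sketch of that behavior for the GA144's
18-bit words, emulated in Python (as far as I can tell the chip has no such
instruction, which is the point):

    INT18_MIN, INT18_MAX = -(1 << 17), (1 << 17) - 1   # signed 18-bit range

    def sat_add18(a, b):
        """Add two signed 18-bit values, clamping instead of wrapping."""
        return max(INT18_MIN, min(INT18_MAX, a + b))

    print(sat_add18(131000, 500))     # 131071: pinned at the positive rail
    print(sat_add18(-131000, -500))   # -131072: pinned at the negative rail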

------
kiba
Let me get this straight: it's basically a microcontroller times 144?

Can anyone explain it to a non-hardware geek?

~~~
femto
It's the future, representing a convergence between Field Programmable Gate
Arrays (FPGAs) and the microprocessor.

Gate arrays are vast arrays of logic gates, which can be wired together in
almost arbitrary patterns by a sea of "fuses", typically controlled by state
stored in on-board SRAM. They are real time and blindingly fast due to their
massive parallelism, achieving supercomputer type speeds when applied to the
right type of problem and programmed well. They are more difficult to program
than a microprocessor. One way of looking at an FPGA is as an array of tens of
millions of very simple computing engines.

Over the years, the number of transistors on an FPGA has been rocketing up.
Generally these transistors have been put to use providing more and more
simple logic blocks. We are now at the point where we have almost more gates
than we know what to do with, and the chip is being dominated by interconnect.
This has led to a move towards including a limited number of elaborate
hard-wired blocks, such as CPUs and multipliers, in addition to the array of
logic.

The logical evolution is to stop providing more blocks and instead make each
block more complex as transistor counts go up. Eventually we will see arrays
of tens of millions of microprocessors, rather than tens of millions of logic
blocks. There will be no distinction between a multicore CPU and an FPGA.

It's worth noting that the first Xilinx FPGAs, thirty years ago, provided
arrays of around 144 logic blocks, similar to the processor count in this
chip. Extrapolate 30 years and we will have an array of 10 million
microprocessors.
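
That extrapolation is roughly self-consistent with Moore's law; a quick check
in Python (illustrative arithmetic only):

    import math

    growth = 10_000_000 / 144        # ~69,444x growth over 30 years
    doublings = math.log2(growth)    # ~16.1 doublings
    print(30 / doublings)            # ~1.9 years per doubling, close to the
                                     # classic 18-24 month Moore's law cadence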

~~~
VladRussian
>We are now at the point where we have almost more gates than we know what to
do with,

This shows the limits of the von Neumann architecture (and its underlying
mathematics of recursive functions) as a mental framework for our thinking
about computing.

~~~
calebmpeterson
You've piqued my curiosity; could you elaborate please?

~~~
tomjen3
I shall try, but this stuff gets really hairy, really really fast.

The von Neumann architecture is what almost all computers use today: you have
(very roughly) an ALU (arithmetic logic unit) hooked up to a memory bank which
stores both the program's data and the instructions the program consists of.

Now you can add a couple of cores to that, but you pretty soon start to run
into problems -- threads which try to access the same data, race conditions,
etc.

But the biggest problem is that under the von Neumann architecture all memory
is shared, so any thread can access any other thread's memory. This puts
rather drastic limits on how much benefit you can get from new cores.

You also run into issues like the limited speed at which the main memory banks
can be accessed, etc. This can be compensated for to some degree with caches,
but those have their own problems.

But the fundamental problem is that these designs are from, and of, an era
when clock speeds kept increasing and increasing.

Today we have a situation where transistors keep getting smaller and smaller.
But if you use these new transistors to make a traditional CPU, all it gets
you is a really small chip.

What we need is an architecture inspired by something else. Personally I am
kind of hoping it will be some form of message sending -- you run a lot of
small (green) threads which each have their own memory, as well as the ability
to send and receive packages of information to/from the other cores on the
CPU.

You can have access to a (comparatively large but slower) shared memory bank
too (like RAM today).

I like it because it works well with how you would design a cluster of
computers (where you cannot afford the illusion of shared memory), with how
computation is organized under the actor model (which I prefer to threads),
and because it could be implemented without that many changes to the CPU.
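
A minimal sketch of that model in Python -- here each "core" is just a thread
with private state, communicating only through queues (an assumption-laden
toy, not a CPU simulation):

    import queue
    import threading

    def core(inbox, outbox, state):
        """One 'core': private memory, no locks, talks only via messages."""
        while True:
            msg = inbox.get()
            if msg is None:          # shutdown signal
                break
            state["seen"] += 1       # private state, never shared
            outbox.put(msg * 2)      # send the result to a neighbor

    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=core, args=(inbox, outbox, {"seen": 0}))
    t.start()
    for x in (1, 2, 3):
        inbox.put(x)
    print([outbox.get() for _ in range(3)])   # [2, 4, 6]
    inbox.put(None)
    t.join()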

~~~
calebmpeterson
That wasn't a "try" - that was a success. Thank you!

If I may attempt a paraphrase: CPU caches stop being a bandage for slow access
to RAM and become a valuable first-class citizen for each core of the CPU when
coupled with the actor model.

Did I understand you correctly? Again, thank you.

~~~
tomjen3
Well, you could do that today if you, as a programmer, could manually tell the
system "please load addresses x, y and z into the cache".

But if the cores of the CPU start to communicate via the actor model, then you
wouldn't be using the memory close to the cores as a cache, but as a storage
area for messages that haven't been sent/processed yet, as well as possibly
for thread-local storage.

------
protomyth
The Propeller chip <http://www.parallax.com/propeller/> is another
embedded-market chip in the same vein, although with nowhere near as many
cores. Something like this makes for some interesting embedded designs.

Also, check the site for the arrayForth stuff.

On another note, it would be interesting if something like this existed in a
64-bit size. Larrabee is interesting, but if it were a simpler stack machine
at around this price point, then perhaps some work could be done on different
ways to do parallelism.

~~~
ippisl
Tilera does a 64-bit CPU with 64/100 cores. If they offered access to it using
a pay-per-hour model, that would be interesting.

~~~
dkersten
I spoke to one of the Tilera guys a few years back about evaluating their
boards for use in a telco project I was working on at the time, but gave up
because I wasn't prepared to drop $15K on a development board just to evaluate
it.

The technology itself was extremely interesting, though, and would probably
have been a good fit for what I was doing. Also, the cost of the dev board
included technical support and from speaking to the guy, I got the impression
that they actually help you port your code to their platform.

------
neopanz
Maybe if you use 100+ of these (~14,400 'computers') you could build a machine
with emergent, brain-like behavior?

~~~
ypcx
I knew it. The name "Moore" sounded suspicious to me immediately. Now I know
his real name is Chuck Testa!!!

On a more serious note, what we need much more of is not processing speed.
What we need much more of is what I call "memory processability", which
roughly means "how many times per second you can process the whole memory" -
or something like that. Basically, how much CPU there is per unit of RAM.
Indexing is a great hack, but a hack it still is. Memory and processing cells
must be merged into a single, massively parallel chip.
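
Putting a number on that metric: full passes over memory per second is just
bandwidth divided by capacity. The figures below are assumed, illustrative
values for a typical desktop, not measurements of any particular machine:

    bandwidth = 25e9              # bytes/s of memory bandwidth (assumed)
    capacity = 16e9               # bytes of RAM (assumed)
    print(bandwidth / capacity)   # ~1.6 full passes over RAM per second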

~~~
thisrod
That was done in the '80s: <http://en.wikipedia.org/wiki/Connection_Machine>.
Richard Feynman had to help the designers with the hard bits.

Users recall them fondly. In the end, clusters became too cheap for custom
hardware like this to compete.

------
ww520
What software stacks are available on the chip? Forth must be there. Anything
else?

------
sausagefeet
Is this something I would use in place of something like an Arduino (for
example, putting a brain in an RC car)?

------
rbanffy
Wondering if it could emulate an Apple II in real time and, within it, run
GraFORTH.

------
mung7
This is all new to me. Does anyone know if these chips could be used for
bitcoin mining?

If it is possible, though, I believe you would have to write completely new
programs.

------
Don_Wallace
Imagine a Beowulf cluster of those.

