
Parallella, a $99 Linux Supercomputer - microwise
http://www.zdnet.com/parallella-the-99-linux-supercomputer-7000014036/
======
phaet0n
I'm really disappointed about how shallow the discussions about Adapteva are,
and have been, on HN.

To remind everyone, the H = hacker. This device is a godsend, as far as I'm
concerned. For the first time ever I get fully documented access to a compute
array on a chip. No, the architecture wasn't designed for anything specific,
like graphics, but that means I don't get bogged down in details I don't care
about, like some obscure memory hierarchy.

The chip is plain, simple, low-power, and begging for people to have an
imagination again. Stop asking what existing things you can do with it, ask
what future things having something like this on a SoC would enable.

Also, you should really be thinking about the chip at the instruction level,
writing toy DSL-to-asm compilers. Thinking along the lines of "oh yeah, I'll
use OpenCL so I can be hardware agnostic" is never going to let you see what
can be possible with it. If you read the docs you'll see what a simple and
regular design it is, perfect for writing your own simple tooling.
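
A sketch of what the smallest possible version of that idea could look like
(pure Python; the mnemonics and register names here are made up for
illustration, not real Epiphany assembly):

```python
# A toy "DSL to asm" compiler: walk a tiny expression tree and emit
# three-address pseudo-assembly. fmul/fadd/r0/r1 are placeholder names.

def compile_expr(expr, regs=None, code=None):
    """Compile a nested tuple like ('add', 'a', ('mul', 'b', 'c')) into a
    list of pseudo-asm strings. Returns (result_register, code)."""
    if regs is None:
        regs = iter('r%d' % i for i in range(64))
        code = []
    if isinstance(expr, str):   # named input, assumed already in a register
        return expr, code
    op, lhs, rhs = expr
    l, _ = compile_expr(lhs, regs, code)
    r, _ = compile_expr(rhs, regs, code)
    dst = next(regs)
    code.append('f%s %s, %s, %s' % (op, dst, l, r))
    return dst, code

reg, asm = compile_expr(('add', 'a', ('mul', 'b', 'c')))
print('\n'.join(asm))
```

From there it's a small step to targeting the real instruction set from the docs.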

It's been a long time, but I feel like a kid again. Like when I first
discovered assembly on my 8086. Finally a simple device I can tinker with,
play, and wring performance out of.

Hallelujah! :)

~~~
wmf
_ask what future things having something like this on a SoC would enable._

I asked that and came up blank. And I haven't seen answers from anyone else,
either. Has Adapteva themselves shown any examples where their chip beats a
GPU?

~~~
podperson
In the comment thread on the article someone points out that the Adapteva chip
doesn't do double precision floating-point, which limits its usefulness (to
put it mildly). If the goal is to provide people with a low-cost platform to
experiment with parallel programming, surely a decent NVidia card gives you
less expensive access (given you can plug it into a PCI slot and it will work)
to more cores that run faster and do more.

~~~
eliasmacpherson
It took a long time for GPUs to get double-precision floating point, and
plenty of GPGPU work was done with them prior to that, so it's not a deal
breaker.

Not sure if it was a world first or just AMD's first, but it was around this
timeframe, 2007: "AMD Delivers First Stream Processor with Double Precision
Floating Point Technology" <http://phys.org/news113757140.html>

------
swalsh
There's so much negativity in this thread. Wasn't the whole idea that these
guys had plans and an architecture to scale up to the order of a teraflop by
2014, and 20 by 2022? And look: they're shipping. This first chip may not be
impressive, but I'll welcome a new player to the market who has big plans to
innovate.

~~~
rayiner
I don't get the negativity either. If you look at the architecture manual,
this is like a cheap Tilera. It's an interesting programming model (lots of
cores in shared-memory SMP with weak memory ordering), and the CPUs are
pretty vanilla RISC architectures. For $99, it's a great way to play with
something that has the properties of the kinds of CPUs you might see in a
future supercomputer.

~~~
new299
I wrote my notes up here last time this was doing the rounds:
[http://41j.com/blog/2012/10/my-take-on-the-adapteva-parallella/](http://41j.com/blog/2012/10/my-take-on-the-adapteva-parallella/)

I'm pretty skeptical: having played with the Tilera, I'm not sure it gives you
enough of a benefit to warrant the extra effort. The Parallella also looks _a
lot_ like a Tilera; I do wonder if there might be IP issues there down the
line.

I also still think our best bet for this kind of thing is multicore ARM
systems.

------
Xcelerate
As someone who uses supercomputers, I'm not sure I entirely understand the
market for this product. It's really cool and I'd love to have one to tinker
with, but given its high degree of parallelism, I see no benefit to using this
over a graphics card. I'm not sure $99 can get you a GPU that reaches 90
GFLOPS, though... perhaps that's where the benefit lies.

EDIT: After reviewing their website, I notice they state

> One important goal of Parallella is to teach parallel programming...

In this respect, I can see how this is useful. Adapting scientific software to
GPUs can be difficult and isn't the easiest thing to get into for your average
person. This board, with its open-source toolkit and community could make this
process a lot easier.

~~~
trotsky
I think this may just be a novel way to sell a dev board for their custom
silicon and get some of that heavy kickstarter press coverage.

If you figure that what they're really trying to do is get people familiar
with it and see how well it might augment one of their existing ARM products
it starts to make a lot of sense.

For instance, I have a low-end 4-bay ARM-based NAS. Its insanely modest specs
(1.6GHz single core + 512MB RAM) are actually quite sufficient for most NAS
tasks. But it's really more like a home server platform, as they have all
sorts of addons that include things like CCTV archiving, DVR, IP PBX; you get
the picture. But if you really start treating it like a general-purpose server,
you quickly realize that some common workloads perform horribly on that ARM
core, and it's frustrating.

It can easily push 800Mbps or so with NFS, SMB, or CIFS, but if you want
rsync+ssh you're looking at less than a tenth of that because of the various
FP needs of that chain. Native rsync with no ssh/no compression does somewhat
better, but still poorly due to its heavy use of cryptographic hash functions
for delta transfers.

There are plenty of other examples: file system compression, repairing
multipart files with par2 (kind of like RAID for file sets), face detection,
file integrity hashing. And if it could do on-the-fly video transcoding (don't
even think about it), it could happily replace another full system I have
running Plex server.

There are probably a lot of devices where the designers default to ARM but
have to skip features that are heavy on FP. If somebody in the firm has played
around with a chip you can just drop in without changing your SoC or
toolchain, that starts sounding pretty good, I'd guess, and likely still far
cheaper than an Atom SoC.

~~~
sliverstorm
Interesting observations. I know AMD is making an ARM chip
([http://www.anandtech.com/show/6418/amd-will-build-64bit-arm-based-opteron-cpus-for-servers-production-in-2014](http://www.anandtech.com/show/6418/amd-will-build-64bit-arm-based-opteron-cpus-for-servers-production-in-2014)).
Have they said anything about FP?

------
rys
I find it incredibly dishonest of Adapteva to equate it to a "theoretical 45
GHz CPU". There are much better ways to talk about the performance level of
their hardware than that metric, especially given that the rest of the text in
their Kickstarter pitch is aimed at people who inherently need to understand
the hardware's execution model in order to program it effectively.

The computing industry has established language and metrics to discuss
computing performance and, while the waters often get muddied when the
hardware is wide, that's a step too far.

~~~
mtrimpe

> This board should deliver about 90 GFLOPS of performance, or — in terms PC
> users understand — about the same horse-power as a 45GHz CPU.

That doesn't seem too outrageous to me.

Edit: They state the real fact and then give another figure explicitly stating
it's an attempt to translate this into a metric the average user can somewhat
relate to.

According to <http://en.wikipedia.org/wiki/FLOPS#Computing> it seems that
they're off by a factor of two, but I'm guessing that's just an honest
mistake.

Second edit: I was under the impression that this was the result of dumbing
down by a journalist; however, it seems it's from Parallella itself. That is a
bit disingenuous indeed.

~~~
Tuna-Fish
A single Ivy Bridge core has 8 Flops/MHz of computing power. 45GHz Ivy Bridge
would be able to do 360GFlops.

~~~
helpbygrace
If your first clause (8 Flops/MHz) is taken literally, a 45GHz Ivy Bridge core
has 360k Flops (8 Flops/MHz * 45,000 MHz = 360,000 Flops); presumably the
intended unit was Flops/cycle.

~~~
stephencanon
8 double-precision flops/cycle/core is the correct figure for Ivy Bridge and
Sandy Bridge. With Haswell adding FMA, that figure doubles again(!)

~~~
mrb
Hum, no. Sandy/Ivy Bridge can only execute 4 double-precision operations per
cycle per core, in the form of two SSE instructions per cycle (one instruction
doing adds, the other doing muls, executed by different units).

Doing 8 double-precision operations per cycle would require either four
128-bit SSE instructions, or two 256-bit AVX instructions per cycle, which is
not possible (unless I haven't kept track of the latest AVX capabilities).

------
mrb
_"this board should deliver about 90 GFLOPS of performance, or --in terms PC
users understand-- about the same horse-power as a 45GHz CPU."_

This is wrong.

A 4-core 3.0 GHz x86-64 processor delivers _more_ GFLOPS than the Parallella:
96 GFLOPS with SSE instructions, because each core can execute 8
single-precision operations (4 adds and 4 muls) per cycle. And yes, when
Parallella claims 90 GFLOPS, they mean single-precision.

For example, for the same price as the Parallella, you can get a $100 Phenom
II X4 965 (4-core, 3.4 GHz, 125W) delivering 109 GFLOPS. Count $200 to include
a minimal mobo/RAM/PSU (if all you care about is raw GFLOPS).
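
Those peak figures are just cores × clock × FP ops per cycle; a quick sanity
check in Python, using the numbers above:

```python
# Peak single-precision GFLOPS = cores * clock (GHz) * FP ops/cycle/core.
# With SSE, each core does 4 adds + 4 muls = 8 single-precision ops per cycle.

def peak_gflops(cores, ghz, ops_per_cycle=8):
    return cores * ghz * ops_per_cycle

quad_3ghz = peak_gflops(cores=4, ghz=3.0)  # the 96 GFLOPS figure above
phenom_x4 = peak_gflops(cores=4, ghz=3.4)  # ~109 GFLOPS for the Phenom II X4 965
print(quad_3ghz, phenom_x4)
```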

The main advantage that Parallella has with its exotic architecture over
x86-64 is a better GFLOPS/Watt figure. But if you care about that metric you
should consider GPUs, which beat the Parallella:
[http://parallelis.com/parallela-supercomputing-for-all-of-us/](http://parallelis.com/parallela-supercomputing-for-all-of-us/)

Parallella may not beat anything on GFLOPS/Watt and GFLOPS/$, but if they can
maintain ease of development (x86-64's stronghold) while doing not too badly
on these 2 metrics (dominated by GPUs), they may be a good compromise and may
have a shot at succeeding in the HPC market.

~~~
m_mueller
Exactly right. ARM's lure isn't really its current performance for
supercomputing; it's rather the expectation that it will hit the next big
performance wall much later than x86, because its simple architecture is
suited to packing the maximum number of cores per unit of die area. Give it
2-3 years and we might have the big step in supercomputing architecture at
hand.

------
6ren
Note: the $99 version has 16 cores, not 64 cores.
[http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone#faq_40886](http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone#faq_40886)
(+ 2 ARM cores)

------
trotsky
Can anyone explain the practical differences between something like this and a
GPGPU approach? It doesn't sound particularly performant compared to modern
GPUs otherwise. Maybe they add some more general-purpose instructions for a
little more flexibility?

~~~
vidarh
A typical GPU can execute a small number of threads on a large number of
streams of data carefully laid out in memory. Every time you want to do
something conditionally on just one data stream, you waste a lot of capacity.

In contrast, the Epiphany chips can execute individual threads on each core in
parallel on data either local to the core (fastest), on any other core, or in
separate main memory.

The current Epiphany chips aren't too spectacular, since the core count is
"low": they can "only" execute 16 individual instruction streams in parallel.
But that's on a chip the size of your fingernail, and their roadmap is aiming
for 1024-core chips.

They're effectively aiming for people to find ways of making effective use of
simple, small, power-efficient cores for problems that are not "data parallel"
enough to be done efficiently on GPUs.
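
To make that distinction concrete in plain Python (purely illustrative, not
either vendor's programming model; the "core" functions are invented): a
GPU-style data-parallel step applies one operation across every element in
lockstep, while MIMD-style cores can each run a different instruction stream.

```python
# Data-parallel (GPU-like): one operation applied to many elements in lockstep.
data = [1.0, 2.0, 3.0, 4.0]
squared = [x * x for x in data]

# A per-element branch is where SIMD hardware wastes capacity: both paths
# get executed with lanes masked off, so divergent code runs slowly.
branchy = [x * x if x > 2 else x + 100 for x in data]

# MIMD (Epiphany-like): each core runs its own program on its own data.
def core0(x): return x * x               # one core squares
def core1(x): return sum(range(int(x)))  # another does something unrelated
def core2(x): return x + 100             # a third does something else again
results = [f(4.0) for f in (core0, core1, core2)]
print(squared, branchy, results)
```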

~~~
UnoriginalGuy
This might be a really stupid question:

How difficult would it be, in a practical sense, to keep all of the cores on
something like this "fed" with enough information to get benefits from its
concurrency?

I mean, to "feed" all 64 cores enough data/code so they can all "do something"
concurrently is one hell of a job all on its own!

~~~
vidarh
Depends a lot on the type of problem, and I think that's going to be what
makes or breaks them. They have some good examples, but you're right, it's a
hard problem and one of the reasons it's so important for them to get these
dev boards out.

------
andyjohnson0
Previous discussions:

[1] <https://news.ycombinator.com/item?id=4635618>

[2] <https://news.ycombinator.com/item?id=4705487>

------
melling
I backed this project simply because it's a great idea to build a very small
highly parallel computer that runs on very little power. Maybe this one won't
hit it out of the park but it might give other people ideas. Building the
first one of anything is always hard. Add a little serendipity and we might
get an entirely new use for computers.

Just saying that I could do more with a $99 graphics card sort of misses the
point.

~~~
wmf
It's such a great idea that AMD Kabini already did it.

~~~
melling
Ok, so we should stop there and not encourage further development in the
market? I want a 1000 core "Raspberry Pi" that sells for $50.

Let me know if there's anything else that I can do to help.

------
yatsyk
How many megahashes per second does this hardware compute?

~~~
DanBC
They say 90 GFlops.

(<https://bitcointalk.org/index.php?topic=26824.20;wap2>)

> For example a Radeon 6990 has 5.2 gigaFLOPS of computing power[1] and yields
> roughly 800 megahash/s in bitcoin mining.

That was in July 2011. Mining is harder now.

~~~
mas921
The Radeon 6990 is 5.1 TERAflops (5099 GigaFLOPS)... nearly two orders of
magnitude faster than this thing.

[http://en.wikipedia.org/wiki/Radeon_HD_6000_Series#Northern_...](http://en.wikipedia.org/wiki/Radeon_HD_6000_Series#Northern_Islands_.28HD_6xxx.29_series)
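
For scale, the gap works out to roughly 57x, using the figures as quoted
above:

```python
# Peak single-precision throughput ratio: Radeon HD 6990 vs the Parallella board.
radeon_gflops = 5099.0    # from the Wikipedia table linked above
parallella_gflops = 90.0  # the board's advertised figure
ratio = radeon_gflops / parallella_gflops
print(round(ratio, 1))    # ~56.7x
```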

------
w34
While I find this quite exciting from a pure developer perspective, it also
reminded me that I haven't had anything I'd call a Desktop box in quite some
time.

If I were to ever get a desktop machine again, it would have to be cheap and
light; I definitely don't want anything clunky, otherwise a laptop seems
preferable to me. There don't seem to be many products that fill that gap:
Intel's NUC is too expensive, the Raspberry Pi too slow. Apple's Mac mini
seems like the best proposition in this segment.

I wonder if the Parallella could be used not only as a development board, but
also as a desktop computer? It won't run any fancy games, that's clear, but it
may actually be usable for browsing, watching videos, and office duties.

------
micheljansen
Wow, I remember seeing the original Kickstarter for this and thinking "this
will never see the light of day", yet here it is. I still find it a bit of an
odd product, neither for hobby nor business, but it sure is cheap.

~~~
vidarh
It's a developer board. The product is the chips, not this board. This board
is there mainly to get a dev board in the hands of people who might want to
build cool stuff with it.

That they've actually managed to get it price-competitive with a lot of cheap
ARM computers, despite sporting a Zynq (an ARM SoC with a built-in FPGA), is
amazing.

~~~
sliverstorm
Can't help but wonder if they are in fact taking a loss, backed by Adapteva.

~~~
qdog
They seem to actually have support from some of the hardware manufacturers.
From Update #31 "Much gratitude goes out to the component manufacturers who
really “got it” (Xilinx, Analog Devices, Intersil, Micron, Microchip, Samtec
all deserve special thanks). Without their help we would be losing $100 per
board!"

So, the backers are getting a Very Good Deal, with the hope that a successful
launch will make demand high enough to make the $99 price viable at volume.

------
api
I can think of some amazing uses for this. I'm tempted to get one just to port
this old hack of mine to it:

<http://adam.ierymenko.name/nanopond.shtml>

~~~
fmdud
That's very cool!

------
mas921
A supercomputer is a cluster of machines connected by a high-throughput,
low-latency interconnect.

Hundreds of servers connected together with 1 Gigabit Ethernet is still a
"grid cluster" .. you need at least 10 Gigabit Ethernet (over iWARP) or
InfiniBand (RDMA) to be considered a supercomputer.

This is marketing B.S.! The B.S. is "emphasized" by the 90GFLOPS = 45GHz
thing. 90 GFLOPS would be a single 45GHz "ALU" (perhaps an ALU doing a
multiply-add, MADD, op), not a full-fledged CPU (like an i7 or Xeon, which has
4-8 cores with each core having 3 ALUs), as readers might infer.

For example, the i7 3770K does 121.6 GFLOPS @ "only" 3.5GHz (ref: table, page
2 of
[http://elrond.informatik.tu-freiberg.de/papers/WorldComp2012/PDP2833.pdf](http://elrond.informatik.tu-freiberg.de/papers/WorldComp2012/PDP2833.pdf))

Measuring performance with GHz is soooo Pentium III! The whole thing is very
misleading, and I don't like that!

Supercomputer? Not even funny. It's a Super-"Raspberry Pi". That's it!

------
backprojection
How does this compare to the new Intel MIC (Xeon Phi) co-processor boards? I
think they claim 1 TFLOP. Can we think of this as a low-powered alternative?

<http://en.wikipedia.org/wiki/Intel_MIC>

~~~
qb45
The general idea is similar: lots of cores with distributed SRAM and some
shared DRAM, all sitting on a 2D mesh network. The main difference is that
Epiphany is made of simple custom RISC cores, while Xeon Phi uses 1st-gen
Pentium cores with huge SIMD FPUs slapped on for higher FP throughput (and
TDP).

~~~
backprojection
Interesting.

It looks like (from info on the Wikipedia pages) the Xeon Phi 3100 gets about
3.3 GFLOPS/Watt, whereas the Epiphany E64G401 manages about 50 GFLOPS/Watt.

So something like 10 of these might compare to one Xeon Phi, and still be
cheaper in terms of hardware, and much cheaper in terms of power consumption.
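
Those efficiency figures follow from the headline specs (the ~1003 GFLOPS /
300 W for the Phi 3100 and ~102 GFLOPS / 2 W for the E64G401 are taken from
the Wikipedia pages mentioned, so treat them as approximate):

```python
# GFLOPS per watt = peak GFLOPS / TDP (spec-sheet figures, so approximate).
def gflops_per_watt(gflops, watts):
    return gflops / watts

phi_3100 = gflops_per_watt(1003.0, 300.0)  # ~3.3
e64g401  = gflops_per_watt(102.0, 2.0)     # 51.0
print(round(phi_3100, 1), round(e64g401, 1))
```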

------
iso-8859-1
Epiphany Architecture Reference:
[http://www.adapteva.com/wp-content/uploads/2012/10/epiphany_arch_reference_3.12.10.03.pdf](http://www.adapteva.com/wp-content/uploads/2012/10/epiphany_arch_reference_3.12.10.03.pdf)

------
zmmmmm
I wish these things had just a bit more memory. Most of the interesting
algorithms I work with (bioinformatics) really want 4GB of memory. A lot of
them you can squeeze down to 2GB, but 1GB is just out of the question.

------
dharma1
I think this is cool, but wouldn't learning OpenCL be more future-proof for
someone wanting to get into parallel processing? There seems to be more drive
behind GPU development than behind specialist hardware like this.

------
jasonkolb
This is really cool. Since it runs Linux, I assume it can run the JVM,
correct? That's incredibly powerful, as even GPU programming requires bridge
libraries. And what, $99? That's incredible. I'm going to get one...

~~~
wmf
IIRC Linux does not run _on_ Adapteva. Linux runs on the ARM which is _next
to_ the Adapteva chip.

------
peripetylabs
This is perfect for numerical computing applications like software-defined
radio or image processing, which can now be done on embedded platforms. I'll
definitely be ordering a board when they're available.

------
mmanfrin
Can someone explain to me how a $99 computer can have 45GHz of processing
power, while an i7 costs 3x that for a tenth of that clock speed? What does
this $99 board miss out on that my i7 is capable of?

~~~
qb45
First of all, this 45GHz figure definitely isn't valid for modern x86 chips:
thanks to multiple cores and SIMD instructions, they reach a few dozen GFLOPS
at stock frequencies.

Furthermore, x86 chips pack all of their performance into a low number of
cores, which makes them much more useful for common scalar code. And if
20-times-higher scalar performance isn't enough to convince you to pay a
premium, the complexity required to achieve that level of scalar performance
is definitely enough to discourage Intel from selling you i7s for $99.

------
umsm
I'm not familiar with this so I have a question: Can you interconnect a few of
these boards to create a more powerful unit? I notice they have "expansion
connectors"...

------
joeblau
I only want to know one thing. How fast can it mine Bitcoins?! I feel like
that's the new "...but can it run Crysis?"

------
dsdjung
It is not easy finding people with good parallel programming skills.
Hopefully, this will help things along.

~~~
wmf
Of course if you learn on Adapteva then your knowledge may not translate to
the "worse" architectures that are used in the real world. If you want to
learn parallel programming, the computer you already have supports threads,
CSP, actors, OpenMP, OpenCL, etc.

------
protomyth
It might be interesting to have a go at writing a version of Connection
Machine Lisp for it.

------
nbdbvcrea
Cool. I don't know any other cheap way to experiment with optimization for 64
cores.

------
pratik661
Hmm I wonder if there is a way to bypass your graphics card and use this as a
GPU?

------
D9u
I want one, or two, maybe more. I'm totally fascinated with parallel
computing.

------
madsravn
So where do I buy one?

~~~
iso8859-1
Sign up on the site and they'll mail you when you can order. If you only need
FPGA, you can get the Mojo (see link elsewhere in thread) from May.

------
gatehead
Can it run XBMC?

~~~
iso8859-1
Yes.

------
dharma1
usb 3.0 would have been nice

------
NewAccnt
I wonder how those in the performance computing sector feel about running a
proprietary supervisor with built-in DRM on each and every CPU? Raspberry Pi
users might not care for hobbyist applications, but I doubt any serious
scientist is going to overlook that.

[http://www.arm.com/products/processors/technologies/trustzon...](http://www.arm.com/products/processors/technologies/trustzone.php)

~~~
trotsky
Intel platforms have a very similar risk via SMM and the platform code &
controller. It's less advanced, but it can easily exert full control over the
system without the OS allowing it, minus access to some registers and on-die
cache. It could DMA in or out of GPU memory as well.

Whether your SoC vendor forces a secure supervisor to load is up to them, and
I'd be surprised if an HPC builder had trouble finding vendors to supply parts
with a totally controllable boot chain.

I'm sure there are ways to obscure it, but there are just as many ways on x86
platforms; the only real difference is that you could pull the EPROM, reflash
it, and inspect the other board components. There are also plenty of evil
things you can put in an SoC without relying on TrustZone.

Bottom line: you have to trust your vendor. If you want an SoC integrated and
fab-monitored by a business/state that is politically aligned with yours, it
is probably just a matter of paying a premium.

------
pmorici
Can it mine BitCoin competitively?

~~~
tempaccount9473
> Can it mine BitCoin competitively?

Prior to the popularity of mining using GPUs, it would have been the shizzle.

Today's ASIC-based systems will hash circles around it.

~~~
qdog
It's only pulling 2W, so it really depends on the performance per watt. Maybe
that'll be my first project...

------
kingmanaz
Could you imagine a Beowulf cluster of these?

~~~
eleitl
Sure thing:
[http://www.adapteva.com/white-papers/building-the-worlds-first-parallella-beowulf-cluster/](http://www.adapteva.com/white-papers/building-the-worlds-first-parallella-beowulf-cluster/)

------
lucb1e
So how fast is this, really? It doesn't sound like much of a supercomputer to
me. If it were so super for $99, it'd have been hyped everywhere already and
gamers would not buy desktops anymore. It sounds more like a platform to
practice multithreading on.

~~~
Leszek
Supercomputer =/= "a fast computer".

~~~
lucb1e
Then what is it?

Wikipedia also seems to say it's "a fast computer": _"a computer at the
frontline of current processing capacity, particularly speed of calculation"_

<https://en.wikipedia.org/wiki/Supercomputer>

~~~
VLM
A supercomputer is a machine that's I/O bound instead of CPU bound, at least
as a first approximation.

You'll get lots of specs thrown at you, like "in mid-2013 a supercomputer
means using X, Y, and Z technologies", but that is just a longer-format
version of the above.

A pessimist usually warps the definition to a machine that's primarily
programmer-limited rather than CPU- or I/O-limited, LOL.

Over the decades, as parallelism has become popular, the definition has
drifted toward being financially limited more than anything else; in the long
run this is probably going to be the new definition: an overall system whose
performance is limited solely by economics. You might think that's all
computers, but not so: there are plenty which are inherently limited by
architecture to low performance, or limited by programming to single-core /
single-thread tasks.

The biggest bummer of supercomputers in the parallel era is that no one is
doing anything about latency. It's nice that your 2000-processor design with
20-deep pipelines can, after enormous latency, really churn stuff out, but the
olden days' pursuit of low latency as the path to speed was pretty interesting
technologically. Hilariously, you'll even get noobs who don't understand the
difference between latency and throughput, or claim there isn't one.

------
st8ic
These guys are completely dishonest. I saw their Kickstarter video, where they
said that for $99 you could have "a computer many times faster than anything
on the market, ZOMG".

Yeah, maybe it's faster for all those times during the day when you calculate
matrix chain products. But for largely single-threaded tasks, like EVERYTHING
you do on a day-to-day basis, it's going to be significantly slower than your
average dual-core i3.

~~~
vidarh
I backed them on Kickstarter, and I don't remember seeing any claim like the
one you describe.

To me it was always clear that the current models are not particularly fast.
They may be fast "per watt", and if they succeed in their roadmap, then their
future 1024-core chips may be fast for the subset of problems they are
suitable for.

In the meantime, the Kickstarter page is/was careful to present this first and
foremost as a stepping stone and a developer platform for playing with the
technology, not as some incredibly fast computer for end users.

If anything, they've provided an extreme amount of data, down to cycle counts
for memory accesses and the instruction set; they've dumped a lot of code in
our laps, including drivers etc.; and the final unit actually comes with a
faster version of the Zynq SoC than they promised, after Xilinx apparently
gave them an amazing deal.

