Supercomputing on the cheap with Parallella (oreilly.com)
62 points by ingve on Dec 10, 2013 | 26 comments

The Parallella Gen-1 has single-precision throughput of 90 GFLOPS, power draw of 5W, and unit cost of $99. This yields 0.91 GFLOPS per dollar and 18 GFLOPS per watt.

By comparison, the NVIDIA GTX 780 Ti GPU has single-precision throughput of 5046 GFLOPS, maximum TDP of 250W, and unit cost of $699. This yields 7.22 GFLOPS per dollar and 20 GFLOPS per watt.

This makes the GPU board about 8 times cheaper per unit of throughput and at least 1.1 times more power-efficient than the Parallella board.
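For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes those ratios from the six figures quoted above. The dictionary layout and the `efficiency` helper are just my own naming for illustration, and the inputs are peak/nominal numbers, not measurements.

```python
# Figures quoted above: single-precision GFLOPS, watts, US dollars.
devices = {
    "Parallella Gen-1": {"gflops": 90.0, "watts": 5.0, "usd": 99.0},
    "GTX 780 Ti": {"gflops": 5046.0, "watts": 250.0, "usd": 699.0},
}


def efficiency(spec):
    """Return (GFLOPS per dollar, GFLOPS per watt) for one device."""
    return spec["gflops"] / spec["usd"], spec["gflops"] / spec["watts"]


per_dollar = {}
per_watt = {}
for name, spec in devices.items():
    per_dollar[name], per_watt[name] = efficiency(spec)
    print(f"{name}: {per_dollar[name]:.2f} GFLOPS/$, {per_watt[name]:.1f} GFLOPS/W")

# How far ahead the GPU is on each metric.
cost_ratio = per_dollar["GTX 780 Ti"] / per_dollar["Parallella Gen-1"]
power_ratio = per_watt["GTX 780 Ti"] / per_watt["Parallella Gen-1"]
print(f"GPU advantage: {cost_ratio:.1f}x per dollar, {power_ratio:.2f}x per watt")
```

The per-dollar ratio comes out just under 8 and the per-watt ratio just over 1.1, which is where the rounded figures come from.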

Note that the average cost of electrical power in the US is about 10.5 cents per kilowatt-hour (kWh), which means a single GPU board running 24 hours at peak utilization would cost about 63 cents per day - so it would take over 3 years of 24/7 peak operation for the nominal power cost to equal the upfront unit cost. So while power efficiency is very important, that doesn't mean the unit cost can be ignored.
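The break-even estimate can be reproduced with a few lines of Python. This is only a sketch under the same simplifying assumptions as the paragraph above: the card draws its full 250W TDP around the clock, the electricity rate is a flat 10.5 cents per kWh, and host system power and cooling are ignored.

```python
TDP_WATTS = 250.0     # GTX 780 Ti maximum TDP, as quoted above
PRICE_USD = 699.0     # GTX 780 Ti unit cost, as quoted above
USD_PER_KWH = 0.105   # approximate US average rate, as quoted above

# Energy used in one day of continuous peak operation.
kwh_per_day = TDP_WATTS / 1000.0 * 24.0
cost_per_day = kwh_per_day * USD_PER_KWH

# Days of 24/7 operation until electricity spend equals the purchase price.
breakeven_days = PRICE_USD / cost_per_day
breakeven_years = breakeven_days / 365.0

print(f"{kwh_per_day:.1f} kWh/day, ${cost_per_day:.2f}/day")
print(f"break-even after {breakeven_days:.0f} days ({breakeven_years:.2f} years)")
```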

In addition to FLOPS/$ and FLOPS/W, one must also consider the processing power density, which essentially measures how many "FLOPS per cubic foot" are yielded by each device. This is important because these machines take up physical space and physical space (i.e. real estate) is costly. A single GPU board has 56 times the throughput of a single Parallella board, and I would say that a GPU board is only slightly larger than the Parallella board (mainly because GPU boards include massive cooling components which Parallella lacks). A machine that requires 56 times more real estate to achieve the same performance is clearly not competitive.
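As a rough sketch of the density argument, the snippet below counts how many Parallella boards it would take to match one GPU board on peak throughput. It assumes perfect linear scaling across boards, which is optimistic, and it says nothing about actual board dimensions since I don't have those numbers.

```python
import math

GPU_GFLOPS = 5046.0         # GTX 780 Ti peak, as quoted above
PARALLELLA_GFLOPS = 90.0    # Parallella Gen-1 peak, as quoted above

# Throughput ratio between one GPU board and one Parallella board.
throughput_ratio = GPU_GFLOPS / PARALLELLA_GFLOPS

# Whole boards needed to reach at least the GPU's peak throughput,
# assuming (optimistically) that throughput adds up linearly.
boards_needed = math.ceil(throughput_ratio)

print(f"ratio: {throughput_ratio:.1f}x, boards needed: {boards_needed}")
```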

So far we've only looked at floating-point throughput. One must also consider memory capacity and bandwidth of each device. The GTX 780 Ti has 3GB GDDR5 memory, and I understand that Parallella has 1GB memory. I don't know how the memory bandwidth compares between the two.

It's clear that the Parallella board has a long way to go before it could be competitive for supercomputing applications. And as Parallella tries to catch up, the industry will keep moving - GPUs (and other accelerators, like Xeon Phi) will continue to improve along all these dimensions, and I imagine that NVIDIA/Intel/AMD have vastly larger R&D budgets than Parallella/Adapteva. So it is difficult to see how this could ever be a viable supercomputing platform and not just an interesting hobbyist board.

Everything you said is correct. However, this was not a unit meant to go head to head with GPUs for industrial purposes. This is a way of battle-testing their parallel processing technology, which is significantly different from GPGPU. On their web site you can see that today you can apparently buy a solution that scales to 4K cores at 5.6 TFLOPS and 70 GFLOPS/W.

If that's true, and if it turns out their programming model is better than GPGPU programming, then they could disrupt the parallel processing space. However, they need to actually ship something, and the more time they take doing that, the better Intel gets with MIC and AMD gets with HSA, and at that point they would lose any advantage their architecture might have.

One advantage Parallella has is that I can put it inside a small robot (backyardrobots.com) for, say, image processing. I just can't do that with an NVIDIA card, particularly once you include the desktop that it connects to. There are probably other applications where a little card with that kind of processing is helpful.

But you make a good point, if I were mining bitcoins, I would take the NVIDIA hands down.

... or use ASICs and skip the general-purpose supercomputer altogether.

For $99?

We were talking about card-based GPUs, but sure: http://www.amazon.com/ASICMiner-Block-Erupter-336MH-Sapphire...

There are far better things to do with a Parallella than mining, and far better things that work better for mining. I'd make a miner out of a Parallella system because I am curious about the tech and want practice at writing software for it, not because I want a competitive miner.

I don't even see any advantage for Parallella in embedded devices compared to something like an AMD Temash.

Fair enough, that's an interesting device. How does it compare with a Parallella? Is there anything as available as, say, the Raspberry Pi? Or compared to the Parallella after it comes into general availability? Would this chip necessarily run things like computer vision or deep learning better in the embedded space? How about if you were a hobbyist or a maker? That is, what if you weren't looking at industrial embedded applications and were instead looking more at DIY?

Well, one advantage for a person like me is that I can buy a small development board with a Parallella chip on it. It hasn't been delivered yet, but let's assume it will be soon. I don't know where I can buy a small development board with an AMD Temash. If there is one, I am interested!

If you're going to look at creating a large supercomputer cluster, sure, based on those numbers, GPU is the way to go.

However, it would be more challenging to put the GPU board on a quadcopter drone than it would be the Parallella.

How exactly do you run the NVIDIA GTX 780 Ti GPU by itself? Are its capabilities the same? Aren't you forced to use OpenCL or something?

Since it's NVIDIA, I imagine you'd use CUDA (since CUDA is a bit more mature than OpenCL, and tailored specifically to NVIDIA cards).

How does this compare when you include the cost of running the computer driving the GPU?

Also, your analysis only considers the "use in cluster" case. There may be cases where the Parallella fits the power and size budget needed for a single device. For example, using this in endpoint devices in control systems, sensor arrays, etc., may make a lot of sense.

I think that's a fair point. NVIDIA GPUs are co-processors and need a host system, which adds significantly to the cost, power draw and size. You can try to minimize this (e.g. by maximizing GPU density in a single server; you can apparently get 16 GPUs in a blade server format) but it's definitely always a significant factor. My guess is that NVIDIA sees this problem and will move away from the co-processor model and towards a System-on-a-Chip design with GPU, ARM CPU, NIC and unified memory in a single package. If Parallella saw large usage in the areas you mention (endpoint devices in control systems, sensor arrays, etc.) then the competition could be good for the market and force NVIDIA's hand. Of course Intel is working in a similar direction with Knights Landing, and AMD as well with their APUs (Accelerated Processing Units).
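To get a feel for how much the host matters, here is a small Python sketch that folds a host system into the GPU's FLOPS/$ and FLOPS/W figures. To be clear, `HOST_USD` and `HOST_WATTS` are placeholder values I made up purely for illustration, not real quotes, and the `with_host` helper assumes throughput scales linearly with the number of cards in one host.

```python
def with_host(gflops, card_usd, card_watts, host_usd, host_watts, cards=1):
    """Return (GFLOPS per dollar, GFLOPS per watt) for `cards` identical
    cards sharing one host, assuming linear throughput scaling."""
    total_gflops = cards * gflops
    total_usd = cards * card_usd + host_usd
    total_watts = cards * card_watts + host_watts
    return total_gflops / total_usd, total_gflops / total_watts


# GTX 780 Ti figures quoted earlier in the thread.
GPU = {"gflops": 5046.0, "card_usd": 699.0, "card_watts": 250.0}

# HYPOTHETICAL host figures, for illustration only. Substitute real ones.
HOST_USD = 500.0
HOST_WATTS = 100.0

for cards in (1, 4, 16):
    per_dollar, per_watt = with_host(
        host_usd=HOST_USD, host_watts=HOST_WATTS, cards=cards, **GPU
    )
    print(f"{cards:2d} card(s): {per_dollar:.2f} GFLOPS/$, {per_watt:.1f} GFLOPS/W")
```

Whatever host numbers you plug in, the shape is the same: the overhead is amortized as you add cards per host, which is why density matters so much in the co-processor model.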

You mean like the NVIDIA Tegra?

I get more and more concerned every time I see another delay. The good thing is that Adapteva has been totally transparent every time there has been a delay; the bad news is that there have been lots of them. As a Kickstarter backer I am not too concerned: eventually my new toy will show up and I can try some fun hacks and projects with it. However, part of the deal when they proposed the Kickstarter was that we were kick-starting a company that was going to do awesome things in the parallel processing space, and it's taken two years now just to get a commercial 64-core solution out the door. While I am confident they will deliver the Parallella board as promised, I am now much less confident that Adapteva will be the vanguard of next-generation multi-core processing.

"there have been lots of them"

Welcome to hardware development. Turnaround times for the software guys are stereotypically measured in minutes; for hardware it's measured in days. Just kinda how it is.

I see no point in calling anyone out, but pretty much any time someone who's never shipped hardware tries to ship their first hardware, it takes two to ten times longer than estimated. If these guys are still together doing version 4 or something, those estimates will probably be close to reality. It's just a hardware design pattern to always be almost an order of magnitude too optimistic. They ALL do it: the RF guys, analog guys, digital guys. RF guys are by far the worst because of EMI/EMC and licensing reqs, if it makes you feel any better (LOL).

Which is the problem. If you're going to have success in shipping hardware, you need to be able to deliver, which is why I am concerned: not because it's taking a long time to get a product out the door, but because every setback they have here means a longer delay until a commercially sustainable product is available.

If they could have gotten the 16-core Parallella out the door in 1Q 2013, then we could have hit the 2014 target of a 1K-core Parallella at 1.4 TFlops, which is much more useful.

I'm also a backer. I really haven't given it a second thought. I hope they hit it out of the park but you can never tell. Transmeta, for instance, looked like they were going to change the world at one point but we know how that ended.

It's just nice to help a group of people with a vision take a shot at moving the needle.

I think today's the last day to claim a free case or t-shirt for your Parallella.

I stumbled across a mail from andreas@adapteva in my spam box the night before last. Just an apology for the delays, looking like a Feb delivery now. You've to send a mail to sales@ with your preference for t-shirt or case.

May I ask when that email is dated?

I received my e-mail on Dec. 3rd at 3:58 PM (US Eastern time).

Thank you for your reply. I had sadly received no such email :(

A "supercomputer" is defined as being within 10% of the world's fastest computers. Since those are at 20,000,000 GFlops these days, a super is at least 1,000,000 GFlops (a petaflop).

Yes, when people use "supercomputer on a chip" marketing like this, they mean "as powerful as a supercomputer of X years ago" for some value of X, as well as, frequently, "using high levels of parallelism like a supercomputer does."

I just checked the Top500 list from November 1999 (just an arbitrary year to pick) and the lowest on the list hits 38.5 GFlop/s, so if the Parallella gets 90 GFlop/s, it's well within the range of supercomputers of 14 years ago. In fact, it looks like it would still make the cut in the November 2000 list, but is hitting the edge in the June 2001 list and is solidly off the November 2001 list.

All, of course, assuming that the 90 GFlop/s quoted is comparable to that measured on the Top500 machines; but even if you toss in a factor of 2 or 4, that still only pushes you a few more years.
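Here is a quick Python sketch of that derating check against the one cutoff I quoted (38.5 GFlop/s for the bottom of the November 1999 list). The divisors 1, 2 and 4 are just the guesses from above for how far a peak figure might overstate a Top500-style measurement; they are not measured numbers.

```python
QUOTED_GFLOPS = 90.0       # Parallella peak figure from this thread
NOV_1999_CUTOFF = 38.5     # lowest entry on the November 1999 list, per above

# For each guessed derating factor, would the board still make that list?
results = {}
for factor in (1, 2, 4):
    effective = QUOTED_GFLOPS / factor
    results[factor] = effective >= NOV_1999_CUTOFF
    verdict = "makes" if results[factor] else "misses"
    print(f"derated by {factor}x: {effective:.1f} GFlop/s, {verdict} the Nov 1999 list")
```

So a factor of 2 still clears the November 1999 cutoff, and a factor of 4 drops below it, which means going back further than 1999 for the comparison to hold.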

Of course, if you stretch that X back a few more years, almost any modern processor is a supercomputer on a chip. I remember when the Cray Y-MP was a supercomputer to dream about, and nowadays its performance would be considered mid-tier for a smartphone.

Could this be used for scrypt-based altcoin mining?
