Building a supercomputer from 64 Raspberry Pis and Lego (southampton.ac.uk)
138 points by Sukotto on Sept 12, 2012 | 49 comments

A technical description and how they got it running together with more pictures:


As you can see from the pictures each of the devices uses its own power supply, so there should be quite some possibility to improve overall power consumption.

Anybody with more experience using MPI?

It is almost always the case that programs written for clusters -- including those which use MPI -- are communication bound. Right now the limiting factor is not these little processors, but rather the interconnect speed. If I remember correctly, Raspberry Pi has 10/100 megabit ethernet. [Edit: just checked, this is the case] So while this looks like a lot of fun, it's not very useful for anything meaningful yet.

Of course it's not fair to compare this to an infiniband cluster (that's not the point of this exercise), but I'd really be interested to see a cluster built on $0.50 ARM chips with at least a gigabit ethernet interconnect. A couple of years from now -- given the low entry cost and lower infrastructure costs (cooling/power consumption/etc) -- that could be a game changer.
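To put rough numbers on the interconnect point, here is a back-of-the-envelope sketch comparing 100 Mbit/s and gigabit Ethernet. The 100 MB payload is a made-up figure for illustration, and the model ignores latency and protocol overhead, so real transfers would be slower than this.

```python
def transfer_seconds(num_bytes: int, link_mbit_per_s: float) -> float:
    """Ideal time to push num_bytes over a link (no latency, no protocol overhead)."""
    bits = num_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


# Hypothetical payload: 100 MB exchanged between two nodes (assumed, not from the thread).
payload = 100 * 1_000_000

fast_ethernet = transfer_seconds(payload, 100)   # 10/100 Ethernet at its full rate
gigabit = transfer_seconds(payload, 1000)        # gigabit Ethernet

print(f"100 Mbit/s: {fast_ethernet:.1f} s, 1 Gbit/s: {gigabit:.1f} s")
```

Even in this best case the slower link spends ten times as long moving the same data, which is time the processors sit idle in a communication-bound program.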

There are a few different companies that have ARM + custom interconnect systems out there or in development. They're not necessarily cost-competitive yet, but they're an interesting start.

Dell's "Project Copper" - http://content.dell.com/us/en/enterprise/d/campaigns/project...

Boston Viridis - http://www.boston.co.uk/solutions/viridis/default.aspx

Before the GFC killed them, SiCortex had a design where MIPS cores were grouped together on a single die with MPI-specific fabric logic.

This is the reason why Cray is still relevant. Their proprietary interconnects are what you're paying for and not the CPUs.

Oddly enough, Cray recently sold their interconnect tech to Intel [0]. Intel seems to be planning to integrate it on-chip down the road [1], which seems to leave Cray serving as a somewhat quirky system integrator longer-term.

[0] http://newsroom.intel.com/community/intel_newsroom/blog/2012...

[1] http://www.hpcwire.com/hpcwire/2012-09-10/intel_weaves_strat...

It's even worse because the Ethernet is connected via USB. What I would love to see is 16 or 32 ARM cores on a single card, connected via a high-speed bus such as Infiniband, with 4 or 8 of these cards packed into a chassis.

I found it amusing because at Blekko I was playing with one and talked about building a cluster with them. I think it would be tremendously valuable as a teaching tool to build smallish (24 - 96 machine) clusters and teach folks to write distributed algorithms. It's a stretch to call it a 'super computer' but it is quite educational.

One of my favorite systems questions is to have someone walk through the design and implementation of a system where all the machines in the system respond to a query Q based on the contents of a linked list L. The system has an API which consists of L <- M(op) (mutate the list), R <- Q(id) (respond to a query based on the contents of the list), and R <- S() (report on the stability of the list). Start with M(op) being idempotent, then non-idempotent, etc. Folks who've had a good introduction to state machines will immediately recognize a number of problems that arise as you control correctness. If folks get through the whole sequence we're talking about a function f(C) which takes a correctness coefficient from 1.0 (fully correct) to 0.0 (unspecified), and looking at the performance of the system across that range.

That kind of stuff you could easily do on a 48 node Pi Cluster.
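For anyone who wants a starting point, here is a toy single-process sketch of the idempotent case. Everything in it is an assumption for illustration: the `Replica` class, the use of operation IDs to detect replays, and "stable" meaning "all replicas hold identical lists" are one possible reading of the M(op) / Q(id) / S() interface, not a reference design.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One machine's copy of the list L (hypothetical toy model)."""
    items: list = field(default_factory=list)
    seen_ops: set = field(default_factory=set)

    def mutate(self, op_id: str, value: int) -> None:
        """M(op): append value once; replaying the same op_id is a no-op."""
        if op_id in self.seen_ops:
            return
        self.seen_ops.add(op_id)
        self.items.append(value)

    def query(self, index: int):
        """Q(id): answer from the local copy, or None if the entry is missing."""
        return self.items[index] if 0 <= index < len(self.items) else None


def stable(replicas: list) -> bool:
    """S(): report whether every replica currently holds the same list."""
    return all(r.items == replicas[0].items for r in replicas)


replicas = [Replica() for _ in range(3)]

# The mutation has reached only two of the three machines so far.
for r in replicas[:2]:
    r.mutate("op-1", 42)
before = stable(replicas)

# The third machine receives the mutation twice (duplicate delivery).
replicas[2].mutate("op-1", 42)
replicas[2].mutate("op-1", 42)
after = stable(replicas)

print(before, after)
```

Dropping the `seen_ops` check turns this into the non-idempotent case, where the duplicate delivery leaves the third replica with an extra entry and the replicas never agree.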

I programmed MPI about 15 years ago, during a summer at the MIT AI Lab.

I implemented neural network feature creation for a backgammon agent ("Automated feature selection to maximize learning in artificial intelligence").

Nowadays, I mainly do parallel machine learning on machines with higher network latency. I haven't used MPI since.

What do you use now instead of MPI?

He talked about higher latency so I'm guessing just sockets


With the sort of work I do nowadays (large-scale ML and NLP), I generally need very little synchronization, i.e. my tasks are usually embarrassingly parallel. I typically save final results in a centralized store (DB or NFS) and look at it there.

I also use Hadoop where appropriate.

MPI is essential for molecular dynamics simulations. You split the "box" of atoms/molecules up into different domains -- one on each processor. Occasionally you'll have particles wander into the next box. The information of these ghost particles must be passed around and MPI facilitates this.
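A rough single-process sketch of that idea in one dimension, with plain Python lists standing in for what would be MPI messages between ranks. The function names, the 1D box, and the `cutoff` distance are all assumptions made for illustration, and "ghost" here means a particle owned by a neighbouring domain that sits within `cutoff` of the boundary.

```python
def decompose(positions, num_domains, box_length):
    """Assign each particle to the domain (one per processor) that contains it."""
    width = box_length / num_domains
    domains = [[] for _ in range(num_domains)]
    for x in positions:
        owner = min(int(x / width), num_domains - 1)
        domains[owner].append(x)
    return domains


def ghost_particles(domains, box_length, cutoff):
    """For each domain, collect neighbours' particles within `cutoff` of its edges.

    In a real code each of these lists would be sent to the neighbouring rank
    with MPI; here the "message" is just a Python list.
    """
    n = len(domains)
    width = box_length / n
    ghosts = [[] for _ in range(n)]
    for d in range(n):
        lo, hi = d * width, (d + 1) * width
        for neighbour in (d - 1, d + 1):
            if 0 <= neighbour < n:  # no periodic wrap-around in this sketch
                for x in domains[neighbour]:
                    if lo - cutoff <= x < lo or hi <= x < hi + cutoff:
                        ghosts[d].append(x)
    return ghosts


positions = [0.5, 2.4, 2.6, 4.9, 5.1, 9.0]
domains = decompose(positions, num_domains=4, box_length=10.0)
ghosts = ghost_particles(domains, box_length=10.0, cutoff=0.5)

for d, (own, ghost) in enumerate(zip(domains, ghosts)):
    print(f"domain {d}: owns {own}, ghosts {ghost}")
```

Every entry in `ghosts` is data that has to cross the network on each step, which is why the interconnect matters so much for this kind of workload.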

I ran some numerical simulations on my Raspberry Pi and on my laptop. With the LCD on, my laptop used 20.4 watt seconds to do the calculation and the RPi used 17.1 watt seconds. The RPi drew 3 watts and my laptop drew about 60 watts. I think, even with my screen on, that my laptop would be more efficient if I had used 2 cores instead of 1 in the calculation (or if I had just turned off the LCD).

My conclusion, the RPi doesn't even win in FLOPS/Watt let alone $/FLOPS.
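For reference, the arithmetic behind those figures is just energy = power x time. A small sketch, taking the quoted numbers at face value (including the watt-second units as written), recovers the run times they imply:

```python
def runtime_seconds(energy_watt_seconds: float, power_watts: float) -> float:
    """Run time implied by an energy figure and an average power draw."""
    return energy_watt_seconds / power_watts


# Figures as quoted above; units taken at face value.
laptop_runtime = runtime_seconds(20.4, 60.0)
rpi_runtime = runtime_seconds(17.1, 3.0)

print(f"laptop: {laptop_runtime:.2f} s, RPi: {rpi_runtime:.2f} s")
print(f"RPi takes about {rpi_runtime / laptop_runtime:.1f}x as long")
```

So with those numbers the laptop draws 20 times the power but finishes roughly 17 times sooner, which is why the two energy totals end up so close.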

That might matter for a cluster, but that's (obviously) not the target market for a Pi.

If you only have less than $100 to spend, a better $/FLOPS ratio of a MacBook (or anything else, really) doesn't matter. One is available, one is plainly not.

For people that CAN spend enough this usually is just a (third? nth?) gadget to play with. Like in the article (because, it's really just a neat way of playing with gadgets and lego, not 'useful' in any sense that can be quantified).

Does it have to? It sounds like they're using this as a learning platform rather than a serious $/FLOP/Watt cluster.

In that case you'd want to make it easy to get funding, and be optimising for initial cost. What would be the cost of 64 of your laptops?

It certainly doesn't have to! I just wanted to share the results of an experiment I did with numerical computing on the RPi.

Certainly laptops aren't the ideal platform to compare against anyway. I'd compare to a 1U node that you would buy for a compute cluster. One can buy a 64-core 1U node with 64 GB of RAM for about $7k. If you want to play around with cluster computing, you can emulate an entire cluster using one of those!

Of course that may (or may not) change if the RPi Foundation manages to release the programming documentation for the DSP/GPU part of the processor, as a recent blog post hinted.

Well, you are missing the point a bit: the RPi's mission is to be an accessible platform for teaching kids to code on. Its role as a toy for former-80s-8-bit-geeks is secondary.

what a pleasant surprise to see our little OSS project mentioned/used... surprised as this is a Windows/VS Python IDE & they're running Linux on the nodes - http://pytools.codeplex.com

I wonder how this compares to a Microwulf cluster [0] in terms of cost / gigaflop. I'm guessing it isn't very efficient.

[0]: http://www.calvin.edu/~adams/research/microwulf/

RasPi is ~ 175 MFLOPS per unit (CPU only, discounting the GPU). So, this cluster works out to 11 GFLOPS, with 16GB of RAM for > $2500 USD.

For comparison, you could buy a motherboard, 32 GB of RAM, and an Intel i5 processor for $500 that will do over 20 GFLOPS.

So, it doesn't really stack up well from a price/performance standpoint. The value of these systems is more teaching students how to work with parallel code.
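Spelling out the price/performance arithmetic from the figures above (the helper name is mine; the source gives "> $2500" and "over 20 GFLOPS", so the real gap is at least this large):

```python
def dollars_per_gflops(cost_usd: float, gflops: float) -> float:
    """Cost of one GFLOPS of peak compute."""
    return cost_usd / gflops


# 64 boards at ~175 MFLOPS each, CPU only.
cluster_gflops = 64 * 175 / 1000

cluster = dollars_per_gflops(2500, cluster_gflops)   # lower bound: cost is "> $2500"
desktop = dollars_per_gflops(500, 20)                # upper bound: "over 20 GFLOPS"

print(f"cluster: {cluster_gflops:.1f} GFLOPS, ${cluster:.0f}/GFLOPS")
print(f"desktop: ${desktop:.0f}/GFLOPS")
print(f"cluster costs about {cluster / desktop:.1f}x more per GFLOPS")
```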

> The value of these systems is more teaching students how to work with parallel code.

There are no practical applications to this, it's just to show that it can be done.

You can spawn 64 processes on a Core i7 and it would be about the same, just faster. Or 64 VMs.

Maybe the real value of these systems is to teach students to design and calculate before building it.

Yes, from a performance standpoint there's no reason to go after this type of system. No one is arguing this is a good system for production work.

The application of this is a teaching model. It's a lot easier to demonstrate parallelism gains on this type of platform. Scaling beyond a single ARM core is going to give you immediate performance benefits. Scaling further out to the entire cluster will continue to show returns.

With a single desktop, once you go beyond ~4 cores the gains will drop off too quickly. You just won't be able to see gains out to 64 threads on a single CPU, where on this you should.

It also doesn't hurt to have a quirky architecture to get students excited by. And yes, you could also spend some time discussing the architectural trade-offs and why this is not a cost-effective system for production use.

You are right. I'm still new to this site and don't know how to upvote comments, so I wrote this.

I couldn't get this joke. What is this supposed to mean?

It was no joke. How do I upvote something here?

Click the arrow that is to the left of the commenter's name. It seems you are one of today's lucky 10,000: http://xkcd.com/1053/

I made a bad assumption because your account is almost one year old. I'm sorry. Now I see that information is not given anywhere on this site. I think this voting method was taken from Reddit.


No, 256 MB of RAM on each rasp. It's 16GB of storage on each rasp.

64 rasps x 256 MB = 16 GB total RAM

I was trying to find a real comparison of MIPS of this thing compared to an i7 and landed on this page for performance data for the RPi. The most interesting thing to me is the power consumption data at the bottom, which shows idle with network is around 370mA. Ignoring power supply efficiencies, that should mean 64 of these things use about 120W at idle.


If those are switch-mode power supplies, most of that current will be drawn out of phase (i.e. reactive, or "imaginary", power); not important to the domestic customer, as the meter probably won't read it, but important to the supplier, who will come and make you fit power factor correction equipment to your supply :-)

My BeagleBone consumes about 1.7W so that sounds about right.

One interesting thing is that the USB port being activated (even idle) can blow any power consumption by the CPU itself out of the water.

How do you get 120W? I get 24W.

0.37amps X 5volts X 64

Oh, I'm an idiot. Misread it as 370mW.
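For anyone following along, both numbers fall out of the same arithmetic. A quick sketch using the figures quoted above (370 mA idle draw, 5 V supply, 64 boards, supply losses ignored):

```python
def total_watts(current_amps: float, volts: float, units: int) -> float:
    """Total power for identical boards, ignoring power supply efficiency."""
    return current_amps * volts * units


# 370 mA per board at 5 V, times 64 boards.
idle_cluster = total_watts(0.37, 5.0, 64)

# The misreading: treating 370 as milliwatts per board instead of milliamps.
misread_cluster = 0.370 * 64

print(f"370 mA at 5 V x 64: {idle_cluster:.1f} W")
print(f"370 mW x 64:        {misread_cluster:.1f} W")
```

The two results differ by exactly the 5 V factor, which is where the 120W versus 24W disagreement came from.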

Would 64 Raspberry Pis actually be faster than a single current Intel CPU? Just asking... it probably was a fun project anyway...

No, not even close. This is more of a fun decoration you have in your home for visitors, as a conversation piece.

This is pretty cool.

On the other hand, stop sucking up all the supply for Raspberry Pis! My order keeps getting delayed and it's making me sad.

Nifty... but did you really write 64 SD cards, one after another? This is 1995-level clustering, you gotta PXE boot.

Can RasPis netboot? I honestly don't know. I still haven't gotten one; I've been waiting since placing the order.

The amazing thing about this project? Someone managed to get 64 RasPis all in the same place!

They cannot netboot. The bootloader is crap.

Well, to be explicit, they cannot netboot directly. I've built a u-boot that boots from the network via NFS on other ARM-based systems. You still need an MMC card to start it all off, but once you get the network boot loader loaded and running you can do whatever you want. I started with some code for the Pandaboard that used that trick to boot from a USB-attached drive.

I will have to look into this: u-boot on the Raspberry Pi, http://kernelnomicon.org/?p=133

Ugh. It's the same with the Efika MX hardware, otherwise nice systems, made incredibly painful by the lack of 1970s technology.

more info (and build plans) here: http://www.southampton.ac.uk/~sjc/raspberrypi/

I submitted the same story from a different source four hours before . . .


But you should have submitted this source instead.

From http://ycombinator.com/newsguidelines.html

"Please submit the original source. If a blog post reports on something they found on another site, submit the latter."
