Nobody ever got fired for buying a cluster (microsoft.com)
72 points by jpmc on April 24, 2013 | 55 comments



> a single “big memory” (192 GB) server we are using has the performance capability of approximately 14 standard (12 GB) servers.

That's 168 GB RAM total, within a single upgrade unit of their 192 GB single server, suggesting the problem is dominated by RAM.

If I had $6638 + 2640 = $9278 to spend on computing hardware from NewEgg, how about:

    1 at $1000 of HP ProLiant DL360e Gen8 Rack Server System Intel Xeon E5-2403 1.8GHz 4C/4T 4GB http://www.newegg.com/Product/Product.aspx?Item=N82E16859107943 http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/15351-15351-3328412-241644-241475-5249570.html?dnr=1 (12 DIMM slots)
    4 at $70 of Kingston 8GB 240-Pin DDR3 SDRAM ECC Registered DDR3 1333 Server Memory Model KVR13LR9S4/8 http://www.newegg.com/Product/Product.aspx?Item=N82E16820239540
    1 at $54 of Seagate Barracuda ST250DM000 250GB 7200 RPM 16MB Cache SATA 6.0Gb/s 3.5" Internal Hard Drive http://www.newegg.com/Product/Product.aspx?Item=N82E16822148765
    $1334 ea server, 7 servers = $9338
So we could get 7 of these low-end name-brand servers for the same money, each with 32 GB of added DIMMs, giving us 224 GB of RAM.
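For anyone checking the math, here's the back-of-the-envelope in runnable form (using only the prices and part counts listed above):

    # Back-of-the-envelope for the 7-server option priced above
    base_server = 1000        # HP ProLiant DL360e Gen8, ships with 4 GB
    dimms = 4 * 70            # four 8 GB Kingston ECC RDIMMs
    disk = 54                 # Seagate Barracuda 250 GB
    per_server = base_server + dimms + disk   # $1334
    print(per_server, 7 * per_server)         # 1334 9338
    print(7 * 4 * 8)                          # 224 GB of added DIMMs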

> MR++ runs on 27 servers whereas the standalone configurations are a single server running a single-threaded implementation

Sure, nothing will beat a single system at message-passing algorithms when the entire graph fits in main memory. But when the dataset outgrows that (and it will), we can triple the RAM in the empty slots, or add more servers in units of $1334, instead of having to rewrite the whole analysis.


A bit unfair, esp. when Microsoft budgeted $2000+ on SSDs alone.

I think the paper has a good point: scale-up should definitely be considered. I see systems that support 1.5 TB of RAM now... do you really expect your dataset to grow beyond that?

And as long as the dataset fits within 1.5 TB of RAM, scale up will be more scalable than scale out.


Yeah, ignore my numbers and I agree that scale-up should be considered and it's worthwhile to read a paper where somebody reality-checks the conventional wisdom.

> I see systems that support 1.5 TB of RAM now... do you really expect your dataset to grow beyond that?

According to Figure 1, about 20% of their jobs are already beyond that. If you spend your entire hardware budget on a single server and all your analytic processing jobs end up being coded with the (gasp) single threaded assumption, what are you going to do next?

Most of us have been there before, because it's where you end up by accident ("Do you think if we max out the RAM on the database server it will return to acceptable performance?"). This is why people love parallel map-reduce-based systems.


Certainly, I don't mean to contradict the conventional wisdom :-p. Scaling out with ~$1500 2U servers makes sense a lot of the time.

I'm more concerned about another problem. There are people out there who think they can scale out with a Hadoop cluster built on top of Atoms (or, more recently, ARMs). And while yes... there are some applications where that is a decent scaling strategy, I doubt it works in most cases.

But yeah, I know that wasn't your argument. I'm just musing a bit on random stuff.


I admit to being ARM/Atom M/R curious. Sometimes I wonder about Hadoop for a few seconds until I remember it's heavily JVM-based. Poking around on the web you find folks who sound serious about porting it to Android.


Curious is good. Things definitely move quickly in the tech industry, and I'm willing to bet that within 5 years things will have changed. And I'd support building up the skills / infrastructure needed to hedge against the rise of ARM.

But at the moment, the evidence is solidly in favor of the typical 2U Xeon (if only because ARM servers barely exist right now; maybe next year ARM will have something ready).

Atom looks mostly dead, however, with Intel beginning to advertise 16W Xeons. Can two 8W Centerton Atoms compete against a 16W Xeon? I doubt it. Intel only has the resources to back one "hero" chip in the server space, and it looks like it will be the Xeon again.


Someday somebody is going to realize they are in possession of a trash barge full of 5 million discarded smartphones and break into the top supercomputer list with them.


That idea is a non-starter. Consider the cost of transportation, the cost of connecting all those ridiculously slow, mutually incompatible CPUs together, powering them, and programming for them. Building a supercomputer from standard x86_64 CPUs will probably be cheaper.


Here's an example: http://www.androidauthority.com/samsung-galaxy-s3-sales-20-m... Within a few years there will be 20 million Samsung Galaxy S III's with dead batteries and/or cracked screens. That's a dual-core 1 GHz ARM with 1 GB of RAM, 8 GB of flash storage, and a hardware GPU.

Right now, there are professional first-world developers writing apps for all those "incompatible CPUs". So it's possible, and even cost-effective, under the right circumstances.

Yet someone is currently being paid to haul away this stuff as electronic recycling.


A great example of why Xeons are superior. :-)

Snapdragon S4 completes the Linpack benchmark with 460 MegaFlops: http://www.androidauthority.com/snapdragon-s4-pro-vs-exynos-.... Snapdragon is rumored to use ~3 to 5 Watts of power.

Now consider a Xeon E5-2690, which gets you 347,000 MegaFlops of performance in 135 Watts. http://www.intel.com/content/www/us/en/benchmarks/server/xeo...

The Xeon just blows the ARM chip out of the water in performance per watt. A modern-day Xeon gives you somewhere on the order of 20x more performance per watt than a Snapdragon S4 Pro. On top of that, all the electricity it takes to run a supercomputer built out of crap chips makes it more costly to run in the long term.
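To make the ~20x figure concrete, here's a rough sketch with the numbers quoted above (the 3-5 W Snapdragon figure is only a rumor, so treat the ratio as an order-of-magnitude estimate):

    # Rough performance-per-watt ratio from the figures quoted above
    snapdragon_mflops = 460
    snapdragon_watts = 4.0      # midpoint of the rumored 3-5 W range (assumption)
    xeon_mflops = 347000        # E5-2690 Linpack result cited above
    xeon_watts = 135.0

    snap_eff = snapdragon_mflops / snapdragon_watts   # ~115 MFlops/W
    xeon_eff = xeon_mflops / xeon_watts               # ~2570 MFlops/W
    print(round(xeon_eff / snap_eff, 1))              # ~22x in the Xeon's favor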


Does Linpack even touch the memory bus?


I don't know, but the Snapdragon has 1 MB of cache, while the Xeon has 20 MB. Furthermore, the Xeon actually has a multi-socket interconnect... two Xeons can talk to each other at 25.6 GB/s through QPI (QuickPath Interconnect).

So my bet on memory is that Xeon wins that too.

On the multi-processor communications issue... I doubt that the Snapdragon would have PCIe connections... let alone a high-bandwidth low-latency connection like the Xeon's QPI.

There's a reason why supercomputers stick with Intel or AMD: they've got the processing power, the performance per watt, and the I/O bandwidth to communicate with other processors. A small embedded chip like the Snapdragon just doesn't have the features to build a supercomputer out of.


OK, I'm sold.


ARM does have promise though; it's just not in Snapdragons. As noted, the primary issue in supercomputers is interconnect bandwidth (so that the processors can talk to each other).

Calxeda, for example, has spent a lot of time trying to solve this interconnect issue. Each Calxeda card is composed of 4 CPUs, and every CPU has 8 10Gb (roughly 1 GB/s) Ethernet links, 2 8x PCIe connections, and more.

It's a relatively young project, but they're moving in the right direction. There are rumors that Facebook is deploying a Calxeda server.


If 20% of your jobs are beyond 1.5 TB of RAM, ideally you want a heterogeneous environment: some scale-up machines and some scale-out machines. Mix and match as appropriate, and don't give in to the siren song of scale-out where it isn't appropriate.


You can't compare old pricing against new. Now you can get 384 GB for under $9K and 768 GB would be maybe $15K.


Oh, the paper is dated January 2013 but I see now something about "prices converted to US$ as of 8 August 2012".

I'm confused. I think the link changed to a different version of the paper. The numbers I went by aren't even in there any more!


Yeah, I convinced the moderators to update the link to a newer version of the paper. But keep in mind that a paper published in 2013 probably describes research performed in 2012 on hardware purchased in 2011 (if they're lucky). Getting back to hardware, the premium for 4-socket servers appears to be much smaller than it used to be, so now you can get a ton of DIMM slots for a reasonable price.


Off-topic, but since younger HNers may not know what the title is riffing on, https://www.google.com/search?q=nobody%20got%20fired%20for%2...


That one sentence sums up about a decade of professional contention. And I mean specifically relating to IBM software and hardware.

I should have left that job sooner.


Thanks, I had no clue the title was intended to be riffing on anything at all.


Nice try, but a few thoughts from someone who now spends a lot of time in Hadoop/MapReduce. I will admit it took me a while to warm up to the whole concept, but I'm now so familiar with it that I can hardly remember how I thought about compute before Hadoop.

I'm able to fire off pretty intensive MapReduce jobs on an Amazon Elastic MapReduce cluster with many nodes for a fraction of the price mentioned in the post (less than $100).

While I imagine I could repurpose all my MapReduce/Hadoop code to run on a single box - especially since Amazon does offer several high-memory instances today - I would be loath to.

MapReduce provides a really nice framework that lets me scale compute out horizontally rather than vertically, and that is really handy at terabyte data volumes (data warehousing and large-data analytics).


MapReduce is handy at terabyte-data volumes, but how often do we run jobs with input size > 1 TB? According to the paper:

at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB, and 90% of jobs on a Facebook cluster have input sizes under 100 GB

The other point the authors make is that DRAM costs are following Moore's law and terabyte-data workloads should "soon" be cost-feasible on single servers with DRAM.


The thing is that MapReduce is still a very good programming paradigm for a single box.

Machines have 8-64 cores these days -- you don't want to write multi-threaded code every time you want to do an analysis. So you can write MapReduce, but use a multicore framework instead of a cluster framework (hadoop).
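As a sketch of what that could look like (a made-up word count written map-reduce style on top of Python's multiprocessing, not any particular framework):

    # Hypothetical word count written map-reduce style, run on local cores
    from collections import Counter
    from multiprocessing import Pool

    def map_phase(line):
        # emit per-record word counts for one input record
        return Counter(line.split())

    def reduce_phase(partials):
        # merge per-record counts into one result
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    if __name__ == "__main__":
        lines = ["to be or not to be", "be here now"]
        with Pool() as pool:          # uses all local cores
            counts = reduce_phase(pool.map(map_phase, lines))
        print(counts.most_common(3))  # e.g. [('be', 3), ('to', 2), ('or', 1)]

The same mapper/reducer logic could in principle be dropped into a cluster framework later if the data outgrows one box.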

The unfortunate thing is that there is no popular open source implementation of a multicore mapreduce, so people use Hadoop, which is wasteful on small data sets, as mentioned.

But the great part is that you can use the same application code for both. I fully expect that in 5 years or so people will be running multicore mapreduce jobs on 100- or 1000-core boxes.


Not all algorithms map well to MapReduce, which is the authors' main point. They explain one such example in the paper (section 3).


There is Disco[1], an MR framework written in Python and Erlang. It's open source and pretty awesome, and if I'm not mistaken it will leverage multi-core processors, so there's no need for a cluster.

[1] http://discoproject.org/about


Speaking of Erlang, I've got my eye on Riak pipes. Now that the Basho folks have a large object store, I wonder if more in that department is a natural path.


> The unfortunate thing is that there is no popular open source implementation of a multicore mapreduce

Not popular, I'd agree, but I have had a lot of success with one-off Akka projects. My mappers and reducers are usually under 10 lines of Scala (more if I'm stuck writing Java, obviously).


It sounds like you're using Hadoop correctly, which is fine. But a lot of people are using "big data" that isn't very big (<1TB) and crunching it with small clusters that are less powerful than a single server due to the massive overhead of Hadoop.


It's less about the number of bytes than the number of records produced by the map step(s). Sometimes 10 GB of input data will produce many billions of records to reduce (if you're looking at combinations of things).
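A rough illustration of that fan-out, with made-up record and group sizes just to show the orders of magnitude:

    # How 10 GB of input can fan out into billions of intermediate records
    input_bytes = 10 * 1024**3
    bytes_per_record = 100                     # assumed average record size
    records = input_bytes // bytes_per_record  # ~107 million input records

    group_size = 100                           # assumed items per key
    pairs_per_group = group_size * (group_size - 1) // 2   # 4950 pairs
    intermediate = (records // group_size) * pairs_per_group
    print(records, intermediate)               # ~1.1e8 inputs, ~5.3e9 to reduce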

Basically, if your computation would never exceed memory on a single machine, then it is more processor-efficient to code a simpler multiprocessing method and run it on a single box than to code a map-reduce job on a cluster.

But what if you're not sure of the input data size? Processors are cheap. Engineers are expensive. Code the thing once for map reduce and you don't have to worry about making the transition later.


I agree that you should code your analytics once. I think the lesson from this work is that the market could benefit from a Hadoop-compatible but non-clustered runtime, which would be easier to run and 10x faster.


What happens if you run Hadoop on that single machine?


It works but it's fairly slow because of things like HDFS that you don't need.


It's a tradeoff: clarity and simplicity in exchange for limited expressiveness. Not every problem that requires distributed computation fits map-reduce well, yet many try to force the problem into it from the start, instead of shaping the right distributed solution for the particular problem.


Do you have any websites, tutorials, PDFs, etc. that would help me reduce Hadoop's latency?


Totally off-topic typographic comment: anyone else seeing the alt-font fi ligature in the title? It's quasi-bolded on my machine (Safari, OS X), so it sticks out like a sore thumb.


Yes. Chrome 27, OS X. I spent a couple of seconds wondering if it was some typographical quirk inserted on purpose by someone.


The ligature appears for me, but the font only seems to mess up in Firefox. Which is weird, because I would have expected Firefox and IE to have the same behaviour owing to DirectWrite... (?)

http://imgur.com/a/HNH5H From top to bottom: Chrome 27/beta, Firefox 16, IE 10 on Windows 7 x64.


Yes, same thing, Firefox 20, Linux.


I saw it, but I don't understand why it's happening. The text doesn't have any special markup, so why doesn't the font display it as "fi" (two letters) if it doesn't have the correct ligature? Is it a bug in the font?

Edit: if I reduce the style to a single font, wf_segoe-ui_light, then it still appears that way. So I guess it's an issue with the font, not the browser.


I noticed that too and was confused.


Didn't notice anything. You got a good eye.


Seconded. Chromium, Linux.


"At 100 GB, scale-up still provides the best performance/$, but the 16-node cluster is close at 88% of the performance/$ of scale-up."

If these message-passing graph algorithms are representative of a "bad fit" to the parallel map-reduce model, I'd say a 12% penalty is not a bad price to pay at all in return for all the benefits of the parallel cluster in other cases.


I've done a handful of big-memory workloads on the JVM, and I have never seen it not choke on allocation/GC for anything above 300 GB of memory. Does the paper address this limitation?


Yes, section 3.4: "Heap size - By default each Hadoop map and reduce task is run in a JVM with a 200 MB heap within which they allocate buffers for in-memory data. When the buffers are full, data is spilled to storage, adding overheads. We note that 200 MB per task leaves substantial amounts of memory unused on modern servers. By increasing the heap size for each JVM (and hence the working memory for each task), we improve performance. However too large a heap size causes garbage collection overheads, and wastes memory that could be used for other purposes (such as a RAMdisk). For the scale-out configurations, we found the optimal heap size for each job through trial and error. For the scale-up configuration we set a heap size of 4 GB per mapper/reducer task (where the maximum number of tasks is set to the number of processors) for all jobs."


I think they ran one JVM per core. If you're willing to spend money, large heaps are supposed to be solved by Azul.


It sure sounds like parallel map/reduce tasks are begging for a 'just kill the process and re-fork a fresh worker' solution instead of garbage collection.
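Some runtimes expose exactly that knob; for example, Python's multiprocessing can recycle a worker after a fixed number of tasks. A minimal sketch (nothing Hadoop-specific, just illustrating the pattern):

    # "Kill and re-fork" workers instead of relying on GC to reclaim memory
    from multiprocessing import Pool

    def task(n):
        # stands in for a map/reduce task that allocates a lot of memory
        return sum(range(n))

    if __name__ == "__main__":
        # maxtasksperchild=1 replaces each worker process after one task,
        # so its memory is reclaimed by process exit rather than by a GC
        with Pool(processes=4, maxtasksperchild=1) as pool:
            print(pool.map(task, [10**6, 10**6, 10**6]))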


One of the great ideas in Erlang.




Having just recovered from a 72+ hour outage (about a dozen Hyper-V based hosts on our hosting provider) caused by a single (large) machine attached to a single (large) storage appliance, I think I'll pass on this idea of just scaling up instead of scaling horizontally.

edit: "pass on" (thanks, marshray)


False dichotomy. The paper is primarily about programming models, and about the truth that single scale-up machines grossly exceed the needs of all but the rarest edge cases (what people call "big data" is often very little data). If you have a virtualization cluster, of course you need redundancy for all components; that is just basic fundamentals.


I'm surprised they didn't select any tasks that couldn't easily fit on a single machine (the largest input set was <200 GB). That said, if scale-up performance can be improved without compromising scale-out, why not?


"... There IS the occasional savage beating, and more than their fair share of suicides. But that has "statistical clustering" all over it. "



