The title was better before the edit: "Stranger in a Strange Land" is a work of fiction whose title has been borrowed in many contexts, and the additional title material clarified that the post was borrowing it to talk about an experience in a particular field.
I'm always sad when titles get edited to "match the post" in a way that actually removes clarity as to what the linked material is.
> since "Stranger in a Strange Land" is a work of fiction
A tad older: "And she bare him a son, and he called his name Gershom: for he said, I have been a stranger in a strange land." ( Exodus 2:22, e.g. http://www.bartleby.com/108/02/2.html#S2 )
A good portion of the titles in English are taken either from the KJV Bible or Shakespeare, not that there is much difference.
They were contemporaneous, and both pretty much hailed as the best use of English around. To my (foreign) ear, they sound similar. Some writers even go so far as fictionalizing Shakespeare sticking his finger in the KJV - I think Anthony Burgess did that.
For my part, I'm always sad when a meta comment that has nothing to do with the content of the post ends up getting voted overwhelmingly to the top. It's less your fault than that of the overwhelming upvoters. But it's an unfortunate glitch that reliably mars HN discussions, and I don't know of a good answer to it.
By the way, the reason we reverted the title in the first place is that the editorialized subtitle immediately provoked a bikeshed debate about the "communities" and whether they're really distinct or not—when the meat of the post is about things like fault tolerance in distributed computing.
(This is of course wandering down the same meta path, but...)
I've always wondered if it would be possible to tag comments with tags like "tangent" or "meta" (or whatever else seems good), and then apply sorts based on that (probably in some user selectable way).
> By the way, the reason we reverted the title in the first place is that the editorialized subtitle immediately provoked a bikeshed debate about the "communities" and whether they're really distinct or not—when the meat of the post is about things like fault tolerance in distributed computing.
I disagree that the comment was a bikeshed debate: the article spends several paragraphs (the first four, in fact!) talking about the relationship of the two communities, and the discussion of that relationship (both in the article and here) explained some of the ways that things I knew about math-intensive computing (and related technologies, such as GPUs) related to problems I hadn't previously worked on.
Exploring the relationships between subdisciplines, especially closely related ones like these, is a benefit both in the article and in the related discussion.
I've said this before and I'll say it again: please either stop mod-editing titles, or remove the title field on submissions and set the title automatically from the linked page. Even when a mod changes a title to a better one, it causes enough confusion and disagreement to be a net negative.
I would argue that they aren't so much "separate communities" as two sides of the same coin. And to the extent that that isn't entirely true, I'd argue that it's becoming more true over time.
Of course, I'm biased, as I went to school and did a degree specifically in "HPC" back when the NSF was pouring money into the field in the mid-2000s and pushing colleges to implement degree programs on the topic. But my background prior to that, and what I've mostly done career-wise even after getting that degree, is largely focused around the Java / Open Source / Hadoop / etc. world.
So now I build data warehouses using Hadoop and friends, while being informed by the things I learned writing tons of MPI and OpenMP code back in the day. And now that YARN has positioned Hadoop as a general-purpose cluster framework, not tied specifically to Map/Reduce (you can implement MPI on top of YARN now), it starts to look more and more like these worlds really are just different aspects of the same thing.
Where there is a noticeable schism, to me, is the point made in TFA about how some jobs are more about IO and some are more about CPU cycles. IOW, sometimes a job is "big data" because the actual amount of data is literally very large, but other times it's "big data" (or "HPC") because of the nature of the calculation being performed, even though the actual data might not be that big.
I think there will always be a place for the Beowulf-type clusters and MPICH and OpenMP and their ilk, but I am really excited about the things you can do in the Hadoop-and-friends world these days. With the newfound flexibility in Hadoop with YARN, and things like Hama (for BSP), Storm / S4 (for streaming data), and SAMOA & Mahout (machine learning / analytics), it's a fun place to play.
I'd say that the two communities here are just people working at opposite ends of a continuum of applications that run on large clusters. The HPC community is all the way at the end with tightly coupled applications, low data to computation ratios, and diverse communication patterns. The big data community is characterized by giant data to computation ratios, highly constrained and regular communication patterns, and loose coupling.
The fundamental problems are similar (fault tolerance, load balancing, scheduling) but the best approaches depend on where you are on that continuum.
I see that you talk about "large clusters". The HPC community often makes a distinction between clusters and supercomputers, where the latter implies a fast interconnect between the nodes, allowing synchronization of data between steps to be fast. Such a fast interconnect is often required for workloads like weather forecasting or simulation of biomolecules. On clusters without a fast interconnect, it is not possible to parallelize such problems beyond a dozen processors. Real supercomputers can often be an order of magnitude or more expensive per CPU, but they are required for such workloads. For these workloads, it is very important how data is moved around, and that is probably what they meant by data locality.
On the other hand, many other HPC tasks can be spread across cluster nodes, and for those tasks clusters are sufficient. In fact, you will often be denied access to supercomputers for such workloads and be told to use a cluster instead.
Many clusters have tight interconnect, yet I don't call them supercomputers.
Supercomputers are more defined by their capabilities and max capacities: they tend to be orders of magnitude larger in their max memory, and their ability to do X, Y, or Z. It's really just a term at this point, not something truly differentiating.
I agree. We have a 128 node cluster where I work (that has infiniband interconnect, etc), but I wouldn't call it a supercomputer. Some of my colleagues, however, have access to machines at various national labs (ORNL, e.g.) that I would call supercomputers. I suppose it's all relative, though. For someone who's only ever developed on a dual-core 2 GHz machine, a 128 multi-core node cluster might be considered the equivalent of The WOPR.
I agree that a 128-node computer would not be called a supercomputer regardless of interconnect; my point was more the other way around: a computer without a fast interconnect would still be called a cluster, and not a supercomputer, regardless of the number of nodes.
I've observed that people doing big data typically have systems backgrounds, while those doing HPC are usually specialists in something else (graphics, speech recognition) who stumble on HPC because it's necessary. Of course, this is changing...machine learning people now need big data.
Companies are also forcing the two communities onto the same clusters, so much hilarity ensues when trying to satisfy both of their IO, fault tolerance, affinity, and processing needs together. Also, god forbid if the HPC people are into CUDA as well as MPI...now each machine needs a couple of pricey GPUs.
It depends on if you are counting hardware costs or software costs as well. The rough estimate I heard recently is that an hour of programmer time costs about the same as an hour of a 100 Teraflop computer.
Right now you can't get any closer to the metal than you can with CUDA [1]; there are very few memory abstractions and programming is imperative in the worst possible way; you are basically scheduling your own memory bandwidth! In the short term, I don't see a way around this. Functional programming won't be effective unless some kind of advanced synthesis takes hold (you can write GPU code with Haskell, but the ceiling is low).
[1] Or OpenCL, though I don't know anyone that uses it...too slow.
I did some HPC at the university in distributed computing classes, back when PVM was still a thing in 1997.
Followed by some work at CERN a few years later on ATLAS HLT group.
Nowadays, just plain boring consulting enterprise work.
What I see as a consequence of the big data and HPC fields converging is increased pressure on the Java world to improve its tooling, given how performance-focused the HPC folks are and how much Java code there is in big data frameworks.
A visible consequence of this is the ongoing discussions in the Java world about adding value types, replacing JNI with something more approachable, replacing HotSpot with Graal/Truffle, supporting HSAIL, promoting Unsafe to official status, and so on.
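To make the Unsafe part concrete, here is a minimal, hedged sketch of the kind of off-heap memory management big data frameworks currently reach for via sun.misc.Unsafe (the class name and sizes are illustrative, and it assumes a JDK where theUnsafe is still reachable via reflection):

    // Sketch only: OffHeapDemo and the sizes are illustrative; assumes a JDK
    // where sun.misc.Unsafe is still reachable via reflection.
    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class OffHeapDemo {
        public static void main(String[] args) throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long bytes = 8L * 1024 * 1024;            // 8 MB outside the Java heap
            long base = unsafe.allocateMemory(bytes); // no GC involvement, no bounds checks
            try {
                unsafe.putLong(base, 42L);            // raw write at an absolute address
                System.out.println(unsafe.getLong(base));
            } finally {
                unsafe.freeMemory(base);              // manual lifetime management
            }
        }
    }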
I would argue that the ATLAS HLT group was bloated enough to be incapable of being performance-focused, along with all the other computing groups at CERN. Both technically and organizationally. Not to mention the failed world of the grid: LCG, EGEE and company, which serve as a testament to my previous statement. It has been largely taken over by the AWS/GOOG model (of about the same era) in its OpenStack-clone incarnation.
Obviously there were/are attempts, but with none of the dedication, competence, and seriousness of, for example, the key gaming-industry people of the same era, i.e. id Software (FPS, GPU, etc.) or Naughty Dog (GOOL, PS1 register allocation optimization).
Actually, most HPC "novelty" (GPUs, CUDA, Open[C|G]L) is coming from the gaming-industry end, and most of HPC is now assembled from stock COTS parts (or almost, just top-bin), instead of driving the industry as it used to a few decades ago. It resembles the IT purchasing department and operations of a large company more than ever.
It is just plain boring consulting enterprise work, and even that done as cheaply as possible.
I remember the grid, even attended the first summer school about it.
As for being bloated, kind of. When I was there, a huge issue was every group having their own threading libraries and the multiple failed attempts to merge them into a single one.
You bring up a really good point. Being a bit younger, I hadn't seen much of HPC. I recently attended Big Data Innovation Summit and saw a joint talk between Intel and Dell.
It didn't surprise me that a lot of the stuff Hadoop did wasn't really new per se, but it was great seeing the two converge and understanding the differences in their philosophies.
YARN as a whole is going to be a great enabler for many kinds of platforms.
It's a great time to be doing distributed systems. I might be biased in that I have a few cards on the table as well in that space, but I like getting up every morning so..meh.
In general, there's nothing preventing you from using threading in MapReduce. For example, your Map() function can act as a capture buffer (calls to map just save data into a thread-safe queue), you can have multiple worker threads reading items from that queue and communicating, and finally a thread that calls Flush() on the results of the thread pool/queue.
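Here is a minimal sketch of that pattern against Hadoop's newer Mapper API; the class name, worker count, and toy per-record work are illustrative, the Flush() above maps onto Hadoop's cleanup() hook, and everything is emitted from the single framework thread because Context is not thread-safe:

    // Sketch only: ThreadedMapper, WORKERS, and the per-record work are illustrative.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ThreadedMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final int WORKERS = 4;
        private static final String POISON = "__END_OF_INPUT__"; // shutdown marker

        private final BlockingQueue<String> input = new LinkedBlockingQueue<>(10_000);
        private final ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        private final List<Thread> workers = new ArrayList<>();

        @Override
        protected void setup(Context context) {
            for (int i = 0; i < WORKERS; i++) {
                Thread t = new Thread(() -> {
                    try {
                        for (String rec = input.take(); !rec.equals(POISON); rec = input.take()) {
                            results.add(rec.trim().toLowerCase()); // stand-in for the CPU-heavy work
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                t.start();
                workers.add(t);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws InterruptedException {
            // map() is just a capture buffer: hand the record to the worker pool.
            input.put(value.toString());
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (int i = 0; i < WORKERS; i++) {
                input.put(POISON); // one marker per worker
            }
            for (Thread t : workers) {
                t.join();
            }
            // Emit everything from the single framework thread (the Flush() step).
            for (String r : results) {
                context.write(new Text(r), new LongWritable(1));
            }
        }
    }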
This makes a lot of sense. Around the time when I was doing MPI in C, I was looking into a Java implementation just out of curiosity. The only real framework people recommended at the time was Hadoop.
Despite the marketing stuff below, it's interesting to see the big players in each space partnering; the two communities have intersected for a while now:
"Programming against an abstraction of a reliable machine" is just plain old separation of concerns. This is good engineering practice. You don't worry about low level failures in high level code, and you shouldn't introduce more high level failure modes than are necessary. Addressing failures in the appropriate layer is perfectly apt. If ECC is insufficient then revise ECC. Declaring the approach to be a fundamentally flawed is to challenge one of the pillars of good engineering.
This comment is a middlebrow dismissal. It dismisses the entire article with a woolly generality like "good engineering practice". Not only that, but it dismisses the entire debate that the article is about, as if it were an obvious non-issue.
This article is really substantive and deserves much better.
I beg your pardon: separation of concerns is a fundamental principle of good engineering practice, and that seems to be what the debate is about. I gave a couple of examples in my comment, not just a vague dismissal (and it's not like SoC needs me to defend it).
OP's article made a few unsubstantiated claims about ECC and silent corruption. These are presented as reasons to abandon the entire approach of "programming to an abstraction of a perfect machine" rather than just addressing the allegedly insufficient failsafes directly.
I used to work on supercomputers (Cray T3E to Seaborg era) and then joined Google.
The difference between big data and hpc is rapidly disappearing. Many aspects that people ascribe to MR have been removed or varied to enable more HPC-like computing.