
Stranger in a Strange Land: “Big Data” programmer meets HPC community - drjohnson
http://www.machinedlearnings.com/2014/02/stranger-in-strange-land.html
======
DerpDerpDerp
The title was better before the edit, since "Stranger in a Strange Land" is a
work of fiction, has been borrowed in many contexts, and the additional title
material clarified that it was borrowing that title to talk about an
experience in a particular field.

I'm always sad when titles get edited to "match the post" in a way that
actually removes clarity as to what the linked material is.

~~~
dang
Ok, I restored the "original" title.

For my part, I'm always sad when a meta comment that has nothing to do with
the content of the post ends up getting voted overwhelmingly to the top. It's
less your fault than that of the overwhelming upvoters. But it's an
unfortunate glitch that reliably mars HN discussions, and I don't know of a
good answer to it.

By the way, the reason we reverted the title in the first place is that the
editorialized subtitle immediately provoked a bikeshed debate about the
"communities" and whether they're really distinct or not—when the meat of the
post is about things like fault tolerance in distributed computing.

~~~
DerpDerpDerp
(This is of course wandering down the same meta path, but...)

I've always wondered if it would be possible to tag comments with tags like
"tangent" or "meta" (or whatever else seems good), and then apply sorts based
on that (probably in some user selectable way).

> By the way, the reason we reverted the title in the first place is that the
> editorialized subtitle immediately provoked a bikeshed debate about the
> "communities" and whether they're really distinct or not—when the meat of
> the post is about things like fault tolerance in distributed computing.

I disagree that the comment was a bikeshed debate: the article spends several
paragraphs (including the first four!) on the relationship between the two
communities, and the discussion of that relationship (both in the article and
here) explained some of the ways my knowledge of math-intensive computing (and
related technologies, such as GPUs) applies to problems I hadn't previously
worked on.

Exploring the relationships between subdisciplines, especially closely related
ones like these, is a benefit both in the article and in the related
discussion.

------
mindcrime
I would argue that they aren't so much "separate communities" as two sides of
the same coin. And to the extent that that isn't _entirely_ true, I'd argue
that it's becoming more true over time.

Of course, I'm biased, as I went to school and did a degree specifically in
"HPC" back when the NSF was pouring money into the field in the mid-2000s and
pushing colleges to implement degree programs on the topic. But my background
prior to that, and what I mostly did career-wise even after getting that
degree, is largely focused on the Java / Open Source / Hadoop / etc. world.

So now I build data warehouses using Hadoop and friends, while being informed
by the things I learned writing tons of MPI and OpenMP code back in the day.
And now that YARN has positioned Hadoop as a general-purpose cluster
framework, not tied specifically to Map/Reduce (you can implement MPI on top
of YARN now), it starts to look more and more like these worlds really are
just different aspects of the same thing.
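
To make the contrast concrete, here's a toy sketch in plain Python (not real
MPI or Hadoop code; the function names are made up) of the two styles: an
allreduce-like collective where every rank ends up holding the global result,
versus independent map tasks feeding a single reduce:

```python
# Toy sketch of the two styles; plain Python, not real MPI or Hadoop.
from functools import reduce

def allreduce_sum(per_rank_data):
    """MPI-style: after the collective, *every* rank holds the global sum,
    ready for the next tightly coupled step."""
    total = sum(sum(chunk) for chunk in per_rank_data)
    return [total] * len(per_rank_data)

def mapreduce_sum(records):
    """Hadoop-style: independent map tasks, then one reduce over the results."""
    mapped = map(int, records)                 # map phase
    return reduce(lambda a, b: a + b, mapped)  # reduce phase

ranks = [[1, 2], [3, 4], [5, 6]]
print(allreduce_sum(ranks))                        # every "rank" gets 21
print(mapreduce_sum(x for r in ranks for x in r))  # a single result: 21
```

Same arithmetic either way; the difference is who needs the answer next, and
how often the nodes have to synchronize to get it.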

Where there is a noticeable schism, to me, is the point made in TFA about how
some jobs are more about IO and some are more about CPU cycles. IOW, sometimes
a job is "big data" because the actual _amount_ of data is literally very
large, but other times it's "big data" (or "HPC") because of the nature of the
calculation being performed, even though the actual data might not be _that_
big.

I think there will always be a place for the Beowulf-type clusters and MPICH
and OpenMP and their ilk, but I am really excited about the things you can do
in the Hadoop-and-friends world these days. With the newfound flexibility in
Hadoop with YARN, and things like Hama (for BSP), Storm / S4 (for streaming
data), and SAMOA & Mahout (machine learning / analytics), it's a fun place to
play.

~~~
greatzebu
I'd say that the two communities here are just people working at opposite ends
of a continuum of applications that run on large clusters. The HPC community
is all the way at the end with tightly coupled applications, low data to
computation ratios, and diverse communication patterns. The big data community
is characterized by giant data to computation ratios, highly constrained and
regular communication patterns, and loose coupling.

The fundamental problems are similar (fault tolerance, load balancing,
scheduling) but the best approaches depend on where you are on that continuum.

~~~
alephnil
I see that you talk about "large clusters". In the HPC community a distinction
is often made between clusters and supercomputers, where the latter implies a
fast interconnect between the nodes, allowing data synchronization between
steps to be fast. Such a fast interconnect is often required for workloads
like weather forecasting or simulation of biomolecules. On clusters without
one, it is not possible to parallelize such problems beyond a dozen
processors. Real supercomputers can often be an order of magnitude or more
expensive per CPU, but are required for such workloads. For those workloads,
it is very important how data is moved around, and that is probably what they
meant by data locality.
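
The scaling limit can be illustrated with a back-of-the-envelope model (the
cost numbers below are made up for illustration): if each step costs work/p
plus a synchronization term that grows with the node count, a slow
interconnect caps, and eventually reverses, the speedup:

```python
import math

def speedup(p, work=1.0, alpha=0.0):
    """Fixed total work split over p nodes, plus a per-step
    synchronization cost alpha * log2(p) (e.g. a tree allreduce)."""
    t_parallel = work / p + alpha * math.log2(p)
    return work / t_parallel

# Made-up per-step sync costs: slow interconnect (0.01) vs. fast (0.0005).
for p in (8, 64, 512):
    print(p, round(speedup(p, alpha=0.01), 1), round(speedup(p, alpha=0.0005), 1))
```

With the slow interconnect, speedup peaks and then falls as nodes are added,
while the fast interconnect keeps scaling; that's the regime where the
expensive machines earn their price per CPU.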

On the other hand, many other HPC tasks can be spread across cluster nodes,
and for those tasks clusters are sufficient. In fact, you will often be denied
access to supercomputers for such workloads and be told to use a cluster
instead.

~~~
dekhn
Many clusters have tight interconnect, yet I don't call them supercomputers.

Supercomputers are more defined by their capabilities and max capacities: they
tend to be orders of magnitude larger in their max memory, and their ability
to do X, Y, or Z. It's really just a term at this point, not something truly
differentiating.

~~~
metaobject
I agree. We have a 128-node cluster where I work (that has InfiniBand
interconnect, etc.), but I wouldn't call it a supercomputer. Some of my
colleagues, however, have access to machines at various national labs (ORNL,
e.g.) that I would call supercomputers. I suppose it's all relative, though.
For someone who's only ever developed on a dual-core 2 GHz machine, a
128-node multi-core cluster might be considered the equivalent of the WOPR.

~~~
alephnil
I agree that a 128-node computer would not be called a supercomputer
regardless of interconnect; my point was more the other way: a computer
without a fast interconnect would still be called a cluster, and not a
supercomputer, regardless of the number of nodes.

------
agibsonccc
Despite the marketing stuff below, it's interesting to see the big players in
each space partnering; the two communities have intersected for a while now:

[http://www.cloudera.com/content/cloudera/en/solutions/partne...](http://www.cloudera.com/content/cloudera/en/solutions/partner/cloudera-and-sgi.html)

[http://www.microway.com/hpc-tech-tips/hadoop-isnt-just-for-w...](http://www.microway.com/hpc-tech-tips/hadoop-isnt-just-for-web-2-0-big-data-anymore-hadoop-for-hpc/)

[http://www.cloudera.com/content/cloudera/en/solutions/partne...](http://www.cloudera.com/content/cloudera/en/solutions/partner/Dell.html)

[http://inside-bigdata.com/2012/10/16/bringing-hadoop-to-hpc-...](http://inside-bigdata.com/2012/10/16/bringing-hadoop-to-hpc-panasas-partners-with-hortonworks/)

[http://hortonworks.com/partner/sgi/](http://hortonworks.com/partner/sgi/)

------
gfody
"Programming against an abstraction of a reliable machine" is just plain old
separation of concerns. This is good engineering practice. You don't worry
about low-level failures in high-level code, and you shouldn't introduce more
high-level failure modes than are necessary. Addressing failures in the
appropriate layer is perfectly apt. If ECC is insufficient, then revise ECC.
Declaring the approach to be fundamentally flawed is to challenge one of the
pillars of good engineering.
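
A minimal sketch of the idea (hypothetical names, plain Python for brevity):
push the failure handling into one low-level layer, the way ECC or TCP
retransmission does, so high-level callers can assume a reliable machine:

```python
import random

random.seed(0)  # make the flaky example deterministic

def unreliable_fetch(key):
    """Stand-in for a flaky low-level operation (hypothetical)."""
    if random.random() < 0.3:  # simulated transient fault rate
        raise IOError("transient failure")
    return key.upper()

def reliable_fetch(key, retries=5):
    """The failure handling lives here, in its own layer;
    callers above see an abstraction of a reliable machine."""
    last_error = None
    for _ in range(retries):
        try:
            return unreliable_fetch(key)
        except IOError as e:
            last_error = e
    raise last_error

# High-level code stays free of failure-mode clutter:
print(reliable_fetch("hostname"))  # prints "HOSTNAME"
```

The point of the layering is exactly that the last line doesn't know or care
how many attempts it took.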

~~~
dang
This comment is a middlebrow dismissal. It dismisses the entire article with a
woolly generality like "good engineering practice". Not only that, but it
dismisses the entire debate that the article is about, as if it were an
obvious non-issue.

This article is really substantive and deserves much better.

~~~
gfody
I beg your pardon: separation of concerns is a fundamental principle of good
engineering practice, and that seems to be what the debate is about. I gave a
couple of examples in my comment, not just a vague dismissal (and it's not
like SoC needs me to defend it).

OP's article made a few unsubstantiated claims about ECC and silent
corruption. These are presented as reasons to abandon the entire approach of
"programming to an abstraction of a perfect machine" rather than just
addressing the allegedly insufficient failsafes directly.

------
dekhn
I used to work on supercomputers (Cray T3E to Seaborg era) and then joined
Google.

The difference between big data and hpc is rapidly disappearing. Many aspects
that people ascribe to MR have been removed or varied to enable more HPC-like
computing.

