HPC is dying, and MPI is killing it (dursi.ca)
225 points by ljdursi on April 7, 2015 | 123 comments



I came into the essay with suspicion. A map-reduce system like Hadoop isn't a good fit for HPC problems, and I thought it would argue that MPI is old => it's stuck in the past.

Instead, and to my joy, it was a well-reasoned essay with good, solid points.

My only quibble is that Charm++ is not "a framework for particle simulation methods". While the molecular dynamics program NAMD has been using it for 20 years, which is why I know of Charm++, it wasn't designed specifically for particle simulation methods, nor is it restricted to that topic. Quoting from http://charm.cs.illinois.edu/newPapers/08-09/paper.pdf :

> NAMD, from our oldest collaboration, is a program for biomolecular modeling [2]; OpenAtom is a Car-Parinello MD program used for simulation of electronic structure (in nanomaterials, as well as biophysics) [3]; ChaNGa, an astronomy code [13]; and RocStar, a code for simulating solid-propellant rockets, such as those in the space shuttle solid rocket booster


Thanks! I'm probably mischaracterizing Charm++ a bit, because I'm most familiar with it in particle context (OpenAtom, ChaNGa, NAMD). I guess it's probably particularly used in that context just because it's so good with very fine-grained distributions of work units. I'll edit that line in the article.


Another nitpick: 128 cores equals ~ 4 nodes only if your problem is not bound by memory bandwidth. If it is, 128 cores equals ~ 16 nodes, and then the interconnect matters a lot.

Great writeup though. I do think we need to get more people into the mindset that MPI won't be the standard in 10 years, otherwise it will still be the standard in 10 years.


"won't be the standard in 10 years, otherwise it will still be the standard in 10 years." -- nicely said.


(Thanks Andrew Dalke. You remember it after all these years!) Jonathan, actually, even in that set, OpenAtom is not a particle code. It's a quantum chemistry code where each electronic state is often represented by a large 3D array spread over processors. For more recent examples of representative miniApps, see http://charmplusplus.org/benchmarks/ or our upcoming workshop (sorry for the plug: http://charm.cs.illinois.edu/charmWorkshop )

The broader article deserves further thought, and I hope to find time to respond. But it is clear that raising the level of abstraction beyond MPI is necessary.


(Indeed I do!)


People need to be careful about talking about Hadoop as just a map-reduce system.

Its YARN container system is flexible enough to run any JVM application. For example, we use it to run an autoscaling ElasticSearch cluster alongside our Hadoop workloads. And we are actively investigating using it to run our Scala microservices.


Even years after Amazon started selling a lot more than just books, if people were asked "what does Amazon sell?", the answer was often "books."

I just looked at YARN now; I hadn't heard of it before. It doesn't look like it has anything to do with the topic at hand. How would one build an explicit solver for a 1D diffusion equation, corresponding to the examples given in the "HPC is dying, ..." article, using YARN?

How do you do checkpointing so you can restart your 10 million atom simulation should there be a system fault after 2 weeks of run-time? (Checkpoints need about 220 MB; each atom has an x,y,z position as well as a vx,vy,vz velocity vector. Also, it needs to be at the same timestep across the entire distributed machine.)

Instead, it looks like YARN is designed for service-based components, where the components are relatively independent from each other, and where failure recovery is mostly a matter of starting a new service and resending the request.

If my understanding is correct, then it's certainly more capable than map-reduce. But not in a direction that's relevant for most current HPC.


YARN is just a resource manager on top of which Hadoop jobs are run e.g. Hive, Pig.

It is analogous to a set of Docker containers distributed across nodes. The same methods you would use to synchronize state in that situation you could use with YARN. For example, using a persistent distributed system, e.g. Hazelcast, to handle system failures and checkpointing.

I am not saying this is some amazing solution to every HPC problem, only that Hadoop is far, far more flexible than many people give it credit for.


I understand your last paragraph. Looking this time at Hazelcast, what I see is layers of code to understand before being able to do something simple. It really does look like all of the technology you are pointing to is solving a different problem. It's not related to any of the HPC needs I've heard of.

Parts of my simulation are out of phase. I need some gather step to collect the data from individual nodes, when a given timestep is reached, and save the state. A simple solution is to do a barrier every ~30 minutes, send to the master node, and have it save the data.
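For concreteness, here's a rough sketch of that barrier-then-gather scheme in Fortran with MPI. It's purely illustrative: names like nlocal, xyz and vel are placeholders, and it assumes every rank holds the same number of atoms (a real code would use MPI_Gatherv or parallel I/O instead).

    ! Illustrative checkpoint: barrier, gather everything to rank 0, write one file.
    subroutine write_checkpoint(step, nlocal, xyz, vel)
      use mpi
      implicit none
      integer, intent(in) :: step, nlocal
      real, intent(in)    :: xyz(3, nlocal), vel(3, nlocal)
      integer :: rank, nprocs, ierr, u
      real, allocatable :: allxyz(:,:), allvel(:,:)

      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! make sure every rank has reached the same timestep
      call MPI_Barrier(MPI_COMM_WORLD, ierr)

      ! collect positions and velocities on the master node
      allocate(allxyz(3, nlocal*nprocs), allvel(3, nlocal*nprocs))
      call MPI_Gather(xyz, 3*nlocal, MPI_REAL, allxyz, 3*nlocal, MPI_REAL, &
                      0, MPI_COMM_WORLD, ierr)
      call MPI_Gather(vel, 3*nlocal, MPI_REAL, allvel, 3*nlocal, MPI_REAL, &
                      0, MPI_COMM_WORLD, ierr)

      ! master writes one restart file tagged with the timestep
      if (rank == 0) then
        open(newunit=u, file="checkpoint.dat", form="unformatted", &
             access="stream", status="replace")
        write(u) step, nlocal*nprocs, allxyz, allvel
        close(u)
      end if
      deallocate(allxyz, allvel)
    end subroutine write_checkpoint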

When I look at Hazelcast I see what looks to be a different sort of clustering - using clusters for redundancy, and not for CPU power. Eg, I see "Hazelcast keeps the backup of each data entry on multiple nodes", and I think "I don't care." If a node goes down, the system goes down, and I restart from a checkpoint. It's much more likely that one of the 512 compute nodes will go down than some database node.

I'll withdraw the "map-reduce" part of my original statement and say simply "a system like Hadoop isn't a good fit for HPC problems".

Here's a lovely essay which agrees with me ;) http://glennklockwood.blogspot.com.au/2014/05/hadoops-uncomf... . It considers the question:

> Why does Hadoop remain at the fringe of high-performance computing, and what will it take for it to be a serious solution in HPC?


Why does the system go down if a node goes down? Shouldn't the system keep chugging along at a slightly reduced capacity until the node comes back online?

Sorry, I'm not involved in HPC at all. I know a little bit about Hadoop. I'm mostly interested in building online message processing and blended real-time/historical analytics. Our problem domain wouldn't want to lose all capacity if part of the system became unavailable.


There are several different aspects which make recovery hard. HPC tries to push the edge of what's possible with hardware. It does this by throwing redundancy out the window.

First, the simulation can be set up to match the hardware. One simulation program I used expected that the nodes would be set up in a ring, so that messages between i and (i+1)%N were cheap. It ran on hardware with two network ports, one forwards and one backwards in the ring. In fact, the only way to talk between non-neighbors was to forward through the neighbors.
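In MPI terms that ring pattern is just a nearest-neighbour exchange; a minimal illustrative sketch (not from that program) looks like this:

    program ring_exchange
      use mpi
      implicit none
      integer :: ierr, rank, nprocs, left, right
      integer :: status(MPI_STATUS_SIZE)
      double precision :: outgoing, incoming

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! neighbours in the ring: i talks only to (i-1)%N and (i+1)%N
      left  = mod(rank - 1 + nprocs, nprocs)
      right = mod(rank + 1, nprocs)

      outgoing = dble(rank)
      ! send "forwards" and receive from "backwards" in one deadlock-free call
      call MPI_Sendrecv(outgoing, 1, MPI_DOUBLE_PRECISION, right, 0, &
                        incoming, 1, MPI_DOUBLE_PRECISION, left,  0, &
                        MPI_COMM_WORLD, status, ierr)

      print *, "rank", rank, "received", incoming, "from", left
      call MPI_Finalize(ierr)
    end program ring_exchange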

If a node goes down, then the ring is broken, and the entire system goes down.

This is very different than a cluster with point-to-point communications, where a router can redirect a message to a backup node should one of the main nodes go down.

The reason for this architecture is that there's a lot of inter-node traffic. When I was working on this topic back in the 1990s, we were network limited until we switched to fiber optic/ATM. When you read about HPC you'll hear a lot about high-speed interconnects, and using DMA-based communication instead of TCP for higher performance. All of this is to reduce network contention.

Suppose there's 1GB/s of network traffic for each node. (High-end clusters use InfiniBand to get this performance.) In order to have a backup handy, all of that data for each node needs to be replicated somewhere. That's more network traffic. Presumably there are many fewer spare nodes than real nodes, since otherwise that's a lot of expensive hardware that's only rarely used. If there are 512 real nodes and 1 backup node, then that backup node has to handle 512GB/second. Of course, the backup node can die, so you really want to have several nodes, each with a huge amount of bandwidth.

Even then, the messages only exchange part of the state data. For example, in a spatial decomposition, each node might handle (say) 1,000 cells of a larger grid. The contents of a cell can interact with each other, and with the contents of its neighbor cells, up to some small radius away. (For simplicity, assume the radius is only one cell away, so there are 26 neighbors for each cell.)

If one node hosts one cell and another node hosts another, then at each step they will have to exchange cell contents in order to compute the interactions. This requires network overhead.

On the other hand, a good spatial decomposition will minimize the amount of network traffic by putting most neighbors on the same machine. After all, memory bandwidth is higher than network bandwidth, and doesn't have the same contention issues.

But this means that the node has mutating state which isn't easily observed by recording and replaying the network. Instead, the backup node needs to get a complete state update of the entire system.

This is a checkpoint. But notice how I used a spatial decomposition to minimize network usage by not sending all of the data all of the time? I've thrown that out of the window. Now I need to checkpoint all of the time, and have the ability to replay the network requests that the node is involved in, should it go down.

This is complicated, and will likely exceed what the hardware can do, given that it's already using high-end hardware for the normal operations.


Ah, I see. I didn't realize so much of the problem domain was at the network layer.

For our domain, we'll gladly accept the increased network cost and node redundancy for durability because most of work ends up being not involved with other nodes (most of our computations can occur wherever the data is stored and mutations, aside from append, are infrequent).

Thank you for giving me some context.


MPI is Message Passing Interface, for those who don't know. Author never explicitly says what MPI stands for.

https://en.wikipedia.org/wiki/Message_Passing_Interface


"MPI, the Message Passing Interface, began as a needed standardization above a dizzying array of..."

The capitalization and apposition makes it pretty explicit.


I triple checked before posting this, looks like someone edited the blog ;)


Way too far into the article for it to be useful.


Yeah, I had to go back to google after failing to find it defined anywhere on the page. If you don't use something regularly yourself, you may remember the concept but forget the acronym. For want of a couple of sentences of context at the outset, the rest of the article was quite inaccessible. Maybe there's a lesson for the author here.


It wouldn't be hard to define up front, but in context I don't think assuming that the audience knows what MPI is poses much of a problem for the goals of the piece. This isn't a general-interest piece about HPC, but specifically an advocacy piece attempting to convince members of the HPC community that their strong attachment to MPI is detrimental to the field, and that they should refocus their efforts elsewhere. If someone doesn't know what MPI is, they are probably not strongly attached to it (and probably not in HPC), so aren't the people the author is trying to convince.


a lesson for the author xor the reader.


Thank you. Using TLAs without defining them should be a bannable offense.


Indeed he does not.


I've heard there is a new effort being led by Torsten Hoefler to modernize MPI and address a number of the issues mentioned in this article.

http://htor.inf.ethz.ch/

I was at a talk of his last year and there are a number of fault-tolerant message-passing algorithms being drawn in. MPI hasn't been updated in ages; I don't think that necessarily means we need to ditch it, it just means the standard needs to be modernized. I don't feel very strongly about this, since working with MPI is a huge pain in the ass and it seems like the challenge of modernizing it is just gargantuan.

Also I'm not familiar with Spark, but isn't Chapel a decade old at this point and barely working at all? I tried their compiler last summer and it took 5 minutes to compile hello world; hopefully it's improving.


Certainly Prof Hoefler has done a lot of work driving the design of updated remote-memory access for MPI-3, and any further progress would be welcomed; but I don't think any modernizing of MPI can fix the basic problem. At the end of the day, it's just too low-level for application developers, while being too high-level for tool developers. There are parts of MPI which don't share this problem so much - the collective operations, and especially MPI-IO; but the disconnect between what people need to build either tools or scientific applications and what MPI provides just seems too great.

For Chapel, it depends on what you count; it very heavily borrows from ZPL, which is much older, but Chapel itself was only released in 2009. It is already competitive with MPI in performance in simple cases, while operating at a much higher level of abstraction. Whether Chapel, or Spark, are the right answers in the long term, I don't know; but there's a tonne of other options out there that are worth exploring.


Again I'm not sure if I agree or disagree with this. My hatred of MPI is only outweighed by the fact that I can use it... and my code works.

I think a large part of the inertia behind MPI is legacy code. Often the most complex part of HPC scientific codes is the parallel portion and the abstractions required to implement it (halo decomposition etc). I can't imagine there are too many grad students out there who are eager to re-write a scientific code in a new language that is unproven and requires developing a skill set that is not yet useful in industry (who in industry has ever heard of Chapel or Spark??). Not to mention that re-writing legacy codes means delaying getting results. It's just a terrible situation to be in.


I work in a medium sized telco, we have had Spark for prototypes for over a year and have now got a roadmap to put it in production.

I think Spark will totally displace map-reduce in the next 12 months (because it's got map reduce in it, but in memory).


>who in industry has ever heard of Chapel

Chapel's made by Cray. If what you're saying is true then Cray's not done a very good job of advertising Chapel. God knows they have the capability to advertise properly.


Oh, sure. I don't think anyone should start rewriting old codes; but as new projects start, I think we have a lot more options out there than we did 10 years ago, and it's worth looking closely at them before starting, rather than defaulting to something. Especially since, once you start, you're probably pretty much locked into whatever you chose for a decade or so.


So, say you wanted to write a weather model, or engineering fluid mechanics model. Which options (besides MPI) you would look at?


Chapel has been used for incompressible moving-grid fluid dynamics, so it's certainly feasible. For that problem the result was ~33% the lines of code of the MPI version. There is a performance hit, but the issues are largely understood; if (say) a meteorological centre were to put its weight behind it, a lot of things could get done.

It's also pretty easy to see how UPC or co-array fortran (which is part of the standard now, so isn't going anywhere any time soon) would work. They'd fall closer to MPI in complexity and performance.

You couldn't plausibly do big 3d simulations in Spark today; that's way outside of what it was designed for. Now analysing the results, esp of a suite of runs, that might be interesting.


What do you guys think of Chapel? It compiles so slowly.


A decade is no age for a language, though. Creating a language with compiler can be done very quickly, but creating a good language with a good compiler and a good standard library takes time. And then it needs to catch on. This requires about a decade++.

Scala is 12 years old, Go is 6 years old and Clojure is 8 years old.


Yeah good point. I just felt it might be misleading of the author to suggest Chapel as an alternative when you cannot possibly write a useful program with it.


There are numerous benchmarks implemented in Chapel, some of which are competitive with other implementations (see paper reference in article). There is a growing standard library and literally thousands of test codes that represent a broad set of functionality. That said, Chapel is not yet a production-grade language, nor is it promoted as such.

Chapel may not be an appropriate replacement for all MPI programs, but it can be used for some programs today.


I agree with this, but it also sort of risks being a self-fulfilling prophecy; everyone uses MPI because everyone uses MPI, and no one uses Chapel yet because no one uses Chapel yet. At some point, we who are willing to be early adopters need to just start.


The Fortran Standards Committee is attempting to make HPC easier through the use of coarrays, which are essentially massive abstractions over MPI.
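For anyone who hasn't seen the syntax, here is a toy sketch of the coarray style - each image owns a slab of an array and reads its neighbours' halo cells directly, with no explicit sends or receives (illustrative only, not production code):

    program coarray_halo
      implicit none
      integer, parameter :: n = 100
      real :: u(0:n+1)[*]          ! local slab plus two halo cells, one per image
      integer :: me, np

      me = this_image()
      np = num_images()
      u  = real(me)

      sync all                     ! make sure everyone has initialised u
      ! grab halo values straight out of the neighbouring images' memory
      if (me > 1)  u(0)   = u(n)[me-1]
      if (me < np) u(n+1) = u(1)[me+1]
      sync all

      if (me == 1) print *, "image 1 right halo =", u(n+1)
    end program coarray_halo

Compiled with a coarray-capable compiler (e.g. Intel, or gfortran with OpenCoarrays), it launches like an MPI job, but the communication is just array syntax.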

I really wish people would give Fortran a second chance. It has come a long way from the ancient, all-caps days.


I spent a few months learning (modern) Fortran a year or two ago. My chief obstacle was the difficulty involved in finding modern tutorials. I don't want to have to read tutorials written in 1994 whose focus is getting people used to F77 up to speed. I've yet to find a tutorial that approached teaching F08 as if it was a new language, which is what I feel is needed.

Even in F08 there's a lot of backwards-compatibility cruft still left in the language, too. The IO model still provides very little abstraction and is based on tape drives. You can/have to "rewind" files. There are obscure "unit descriptors" that manifest themselves as integer literals in most code posted online, which makes it a chore to learn from. As far as I can tell there is no functionality that approximates the behaviour of C++'s streams.

It's fast as hell, and the GNU compiler is mature and well-developed, but Fortran remains a horrid language for doing any sort of interactive programming. It's best used if you just give it some arguments, let it run free, and then have it return some object or value that a more sane language can then interpret and present to the user for a decision.

There is little reason to learn a language where the only sane choice for doing input/output involves calling your Fortran module from a python script and letting the python handle i/o.


> You can/have to "rewind" files.

This isn't necessarily a Fortran-specific thing. The standard C library includes a rewind(stream) function, roughly equivalent to fseek(stream, 0, SEEK_SET).


"horrid for interactive" is ironic, since interactive and HPC are pretty much disjoint. (well, viz...) From an HPC perspective, Fortran IO should be performed by HDF5...


I think you're reinforcing my point.

Fortran might be used more widely, like C++ is, if it wasn't so awful for doing things other than shitting out numbers at insane speeds.


+1


I also think we need someone to put some real effort into making gdb usable for Fortran dynamic/automatic arrays. That's a real PITA currently.


> There are obscure "unit descriptors" that manifest themselves as integer literals in most code posted online which makes it a chore to learn from.

Well, you can think of a "unit descriptor" (or somewhat more Fortranny, "file unit number") as something roughly equivalent to a POSIX file descriptor, which is also an integer. The problem, as you allude to, is that classically unit numbers were assigned by the programmer rather than the OS or runtime library, so you could end up with clashes e.g. if you used two libraries which both wanted to do I/O on, say, unit=10. Modern Fortran has a solution to this, though, in the NEWUNIT= specifier, where the runtime library assigns a unique unit number.

> As far as I can tell there is no functionality that approximates the behaviour of C++'s streams.

As of Fortran 2003, there is ACCESS="stream", which is a record-less file similar to what common operating systems and programming languages nowadays provide.
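Putting that together with the NEWUNIT= mentioned above, a modern Fortran open/write looks something like this (small illustrative snippet):

    program stream_io_demo
      implicit none
      integer :: u
      real :: data(1000)

      data = 0.0
      ! NEWUNIT= lets the runtime pick a free unit number; ACCESS="stream"
      ! gives a plain byte stream instead of the old record-oriented files
      open(newunit=u, file="output.bin", form="unformatted", &
           access="stream", status="replace")
      write(u) data
      close(u)
    end program stream_io_demo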

> It's fast as hell, and the GNU compiler is mature and well-developed, but Fortran remains a horrid language for doing any sort of interactive programming.

Personally, I'm hoping for Julia to succeed, but we'll see..


I think coarrays have a lot of promise; it's less ambitious than the single unified picture of a program that Chapel or UPC has, but maybe that's a feature, not a bug, for incrementally changing how we do things. One problem has been that implementations of coarrays depended on MPI-2 features which were brittle and not super optimized (because they weren't widely used); but with MPI-3's RMA, or GASNet (which I think gfortran is starting to use?), it could start being more practical for production use.


Maybe after that, the cool kids would "permit" us to try Perl again, if they declare it cool again.

Modern Perl is pretty nice. 90s Perl still not so good.


As the old saying goes, a good developer can write FORTRAN in any language.


Writing, sure. Getting it to run as fast -> replace any language with C and add some boilerplate.


I LIKE PROGRAMMING IN ALL-CAPS. IT MAKES EVERYTHING I WRITE LOOK OFFICIAL, NO-NONSENSE AND GENERALLY HARD-CORE. INSTEAD OF HAVING A CONVERSATION WITH THE COMPILER I INSTEAD SHOUT!!! COMMANDS AT IT.


Oh so clever to make fun of FORTRAN.

This is what a modern Fortran Hello World looks like...

     program hello
          print *, "Hello World!"
     end program hello


I was actually partly serious. I voluntarily program in all-caps fortran because I think it makes my code look old/funny/the things I said.


Fortran has not been all caps since 1991, when the Fortran 90 standard came out. Your knowledge is 24 years old.


Yeah, I know. It's just that I like programming in an old-school way for fun. I briefly considered not using structured programming at all, but that's kinda too much.


Yes, "high performance computing" is dying. There's no commercial market for it. Check the list of the top 500 supercomputers in the world.[1] The top 10 are all Government operations. In the top 25, there are a few oil companies, mostly running big arrays of Intel Xeons.

CPU clock speeds maxed out between 3-4GHz a decade ago. Nobody develops special supercomputing CPUs any more. The market is tiny. Old supercomputer guys reminisce about the glory days when IBM, Cray, Control Data, and UNIVAC devoted their best R&D efforts to supercomputers. That ended 30 years ago.

Supercomputers have poor price-performance. Grosch's Law [2] stopped working a long time ago. Maximum price/performance today is achieved with racks of midrange CPUs, which is why that's what every commercial data center has. Now everybody has to deal with clusters of machines. So cluster interconnection has become mainstream, not the province of supercomputing.

[1] http://www.top500.org/list/2014/11/ [2] http://en.wikipedia.org/wiki/Grosch%27s_law


The article is somewhat confusingly written (or maybe I was just confused by "killing it" colloquially meaning "doing well"), but even a cursory glance shows that it is not arguing that non-distributed high performance computing is relevant. I'm not sure what your post is responding to.

It is instead arguing that traditional HPC is being made irrelevant because traditional HPC uses MPI (the first successful distributed/parallel computing library), which is increasingly irrelevant in favor of newer libraries for the same task.


Did you RTFA? It's about MPI:

"MPI is a language-independent communications protocol used to program parallel computers."

Runs fine on commodity clusters.


>Runs fine on commodity clusters.

Kind of.... For simple, low communication jobs this is true. But when you start trying to find the eigenvectors of a large sparse matrix, communication becomes your bottleneck, at which point MPI on commodity clusters (those without a really fancy interconnect) "works", but not fast enough to be useful.


I don't think "really fancy interconnects" makes a cluster not commodity, since the original post in this thread is about supercomputer processors. You can put Infiniband in any system with a PCI-e 3.0 bus.


Communication would become a bottleneck regardless of whether you're using MPI or something else. The problem you're talking about has nothing to do with MPI; it is intrinsic to distributed computing itself.


And what system on commodity clusters is fast enough in this case?


What would you define as a "commodity cluster"? To me it's a 512-core vendor-specific blade server with special interfaces to get more bandwidth at lower latency across longer links. But maybe I'm just an old fogey.


blades were never more than a marketing trick: they offer nothing that can't be achieved in a standard chassis. there were a few multi-chassis SMP/NUMA machines that had cache coherency over external interfaces, but that was neither commodity nor HPC.


1) Not vendor specific 2) Not blades 3) 10GbE, not special interfaces

That is a commodity cluster.


Did you fucking read TFA?

Its title literally is: "HPC is dying, and MPI is killing it".

His comment shows more understanding of the article's main point (about the demise of HPC) than your "it's about MPI"...


CPUs any more

CPUs, err, schmee-PUs. It's all about the interconnect and people can and do make special interconnects.


Yeah, the 5-dimensional torus network used at Mira is just too cool not to bring up here. Modern supercomputers are becoming less and less discrete.

https://computing.llnl.gov/tutorials/bgq/


What's a 5 dimensional torus? I know a 3D is Circle x Circle, is a 5-d torus a Circle x Circle x Circle x Circle?

If so, that could be interpreted as simply a 4D square grid which wraps around the edges, right? (just as a 3D torus is a 2D grid which wraps)


Think of it as loops in 5 dimensions (x,y,z,a,b). I believe each node connects to 10 different neighboring nodes (2 in each of 5 dimensions), although I could be mistaken on that. You can actually tune which direction you prefer the nodes to communicate over by passing certain flags when you submit a job.
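If it helps, the neighbour relation is just ±1 modulo the size of each dimension; here's a toy sketch that enumerates the 10 neighbours of one node (the dimension sizes and coordinates are made up):

    program torus_neighbours
      implicit none
      integer, parameter :: ndim = 5
      integer :: dims(ndim) = (/ 4, 4, 4, 8, 2 /)   ! made-up machine shape
      integer :: me(ndim)   = (/ 0, 1, 2, 3, 0 /)   ! coordinates of one node
      integer :: nbr(ndim), d, step

      do d = 1, ndim
        do step = -1, 1, 2
          nbr    = me
          nbr(d) = mod(me(d) + step + dims(d), dims(d))   ! wrap around the torus
          print *, "neighbour:", nbr
        end do
      end do
    end program torus_neighbours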

Also here's an image... which I admit is not terribly useful, but its what the national lab people put out. https://computing.llnl.gov/tutorials/bgq/images/5Dtorus.400p...


Oh, so it's actually a 5D grid that wraps around. Confusingly, that would make the standard torus a 2D torus; they're referring to the surface dimension instead of the Euclidean space it can be embedded into.


> CPU clock speeds maxed out between 3-4GHz a decade ago.

Even for x86 this isn't true (4+GHz is at least possible), let alone platforms like POWER which have already pushed beyond 5GHz. Fancier things like vacuum-channel transistors, graphene transistors, etc. could push that even further once they break into commercial viability.

Not that clock speed alone really matters all that much compared to the other performance benefits of high-performance RISC architectures like POWER and SPARC...

> Nobody develops special supercomputing CPUs any more.

Today I learned that Blue Gene was a figment of my imagination :)

Special supercomputing CPUs are still being developed. The reason why they seem insignificant is because their market size has remained relatively constant, while the markets for general-purpose, non-supercomputing-specific platforms have grown much more rapidly. This doesn't mean supercomputing is dead necessarily, just like how the invention of the microwave oven doesn't mean that ordinary ovens are suddenly dead. Rather, it's just an indicator of different use cases, and the different markets thereof.

> The top 10 are all Government operations.

It's a bit misleading (though I suppose technically accurate) to list academic institutions (like the University of Texas, which holds the #7 spot) as "Government operations"; they're government-funded, yes, but there's a big difference between that and, say, an actual government agency directly managing such an installation. I also fail to see how even a majority of those being government installations has anything to do with anything; governments typically have much greater capital to spend on such things - and greater need for such things - than all but the most massive commercial entities.

HPC was never really the purview of commercial enterprises anyway (unless they had extreme computational requirements). The uptick in the use of COTS products for high-performance computing among enterprises (particularly big Internet-reliant ones like Google) wasn't really at the expense of the HPC crowd losing potential users; it's rather just something that formed very recently alongside HPC already being a niche topic.

Basically, by your arguments, "high-performance computing" has been dying for basically as long as it's existed.

> Grosch's Law [2] stopped working a long time ago.

Only because the world switched to clustering, where Grosch's Law doesn't quite apply, and hasn't addressed the limitations of current transistor technology (like the above-mentioned vacuum-channel and graphene transistor technologies, among many others).

> Maximum price/performance today is achieved with racks of midrange CPUs, which is why that's what every commercial data center has.

That's what "every commercial data center has" (this isn't exactly true, but we'll go with it for now) more because of price alone than because of an actually-calculated price/performance ratio. Businesses tend to think in terms of short-term investments much easier than they tend to think in terms of long-term investments (in contrast with academic and often government institutions, which tend to think in the opposite direction, and therefore have entirely different sets of problems in many cases).

Meanwhile, the big businesses that really do actively calculate an optimal price/performance ratio (like Google) aren't the ones using COTS solutions; they usually have the financial capability to invest in homegrown solutions and cut out any unnecessary expense, and are certainly not just buying a bunch of prebuilt servers from Dell. Google in particular has started to invest heavily in IBM's Open POWER initiative, probably due to a perception that POWER will offer a better price/performance ratio than x86 in their already-very-customized hardware stack.


"Today I learned that Blue Gene was a figment of my imagination" .. and I learned that Anton doesn't exist either.


NEC is still making their weird vector computers. I have no idea who buys them, but there's a hotchips about it.


These days Crays run Intel Xeons.


In high-performance computing, there are (1) 3-dimensional simulations (weather, fluid dynamics, structural mechanics, all kinds of physics simulations, like magnetic storms in space or nuclear reactors etc.) and then there is (2) everything else, like data mining, machine learning, genomics etc.

Some of the sparse matrix computations in structural mechanics and in some machine learning algorithms have some overlap. But mostly, group 2 has little reason to be interested in what group 1 is doing.

Now, group 2 obviously has more modern tools than the 3d-simulation community, because machine learning came into common use much later than numerical fluid mechanics.

But do 3d-simulation people also have much reason to be interested in what the machine learning people are doing?

The "machine learning / big data" people are probably not doing anything that makes a weather prediction model to run faster? Or are they?


They are doing it (interesting things) for lower capex and lower development costs. On opex, good for operations, bad for power consumption (relatively).

In terms of absolute performance HPC is absolutely faster. In terms of bangs for bucks, Big Data is hands down faster. Also in terms of accessibility Big Data is hugely easier - I can build you a 100 core big data system for $300k


But your Big Data system is only good for Big Data. If I want to run a weather prediction model, it is not going to help.

My point is, the big data and the physics simulation people probably do not have a lot of common interests - besides using large amounts of computing power.


There are big data people running on supercomputers too. I know there are people writing custom asynch job managers to handle big data type problems because the top supercomputers have low memory latency.

Also I think the dichotomy you're looking for is IO bound vs CPU bound problems. Although certainly there are a plethora of different kinds of IO bound problems (asynch vs synch or disk bound vs memory bound vs cache bound).


that's silly: HPC has been pinching pennies before big data was a thing. and the computer industry is biz: you get what you pay for. if you can live with Gb performance, you can drop around $2k (IB card, cables, switches) off your price. But it's not as if the hardware is any different, faster or more accessible.


I think it's economics: GPUs are sold by the million, supercomputer interconnects are sold by the thousands. Commodity kit is mass produced, spreading design, VVT and manufacturing tooling costs.

The hardware is different in terms of the layout. Aggregations of small cores on boards (gpus) vs. very high speed large cores with lots of local memory. Highly localised connections vs. an interconnect fabric.

And it is more accessible because it's affordable, and you can get at it in the cloud; this means that skills building is easier for more people and it also means that a wider user base is possible.


Huh, where did GPU's enter the discussion?

Anyway, for a typical HPC cluster, it's bog standard x86 hardware, the only remotely exotic thing is the Infiniband network. Common wisdom says that since Infiniband is a niche technology, it's hugely expensive, but strangely(?) it seems to have (MUCH!) better bang per buck than ethernet. A 36-port FDR IB (56 Gb/s) switch has a list price of around $10k, whereas a quick search seems to suggest a 48-port 10GbE switch has a list price of around $15k. So the per-port price is roughly in the same ballpark, but IB gives you >5 times better bandwidth and 2 orders of magnitude lower MPI latency. Another advantage is that IB supports multipathing, so you can build high bisection bandwidth networks (all the way to fully non-blocking) without needing $$$ uber-switches on the spine.


That's interesting, things may have changed with IB since I last looked.

The GPU thing seems to have fallen out of my original comment, I meant to write "I can build you a 100,000 core system for $300k" but somehow the decimal point jumped left three times! To do that I would definitely have to use GPUs...

I am seriously lusting after such a device, I feel that there is much to be done.


The article briefly mentions Erlang with its focus on message passing, and the focus on being fault tolerant should be another benefit. I could find a few mentions of Erlang used for simulations, such as here[1], but I am curious whether there is much actual usage in scientific computing, or whether there are some problems in practice.

[1] https://books.google.com/books?id=p0h9vAb1m7IC&pg=PA365#v=on...


Erlang itself is not good for the type of numerical computation typically done in HPC. It is amazing as a backend, and there are several ways it can call code written in other languages through the use of ports or NIFs, but if you try to do massive number-crunching using its own libraries then you're going to be unhappy with the results.


How about in simulations where a large number of cellular automata are interacting with each other, but individually only carrying out simple computations?


You cannot make an efficient fluid mechanics simulation on a 4000x4000x4000 grid if you set up a separate process for each individual gridcell. More efficient to just store your numbers in 3d arrays.


why not?


If you store the data in arrays, you can use matrix multiplication libraries such as Intel's MKL or OpenBLAS, which are written to be exceptionally optimized for use on multiple cores. I cannot emphasize enough how much time and effort has been put into these libraries to multiply matrices as fast as can possibly be done.
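For example, from Fortran it's a single call into whichever optimized BLAS you link against (illustrative sketch; the sizes are arbitrary and you'd link with MKL or OpenBLAS):

    program gemm_demo
      implicit none
      integer, parameter :: n = 512
      double precision :: a(n,n), b(n,n), c(n,n)

      call random_number(a)
      call random_number(b)
      ! C := 1.0*A*B + 0.0*C, dispatched to whatever BLAS is linked in
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      print *, "c(1,1) =", c(1,1)
    end program gemm_demo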

If you use processes such as in the Erlang VM, they're doing calculations, sure, but they're also sending messages back and forth, and they're acting as supervisors, and they're being shuffled around by the VM. There's a lot going on. And that extra stuff that's going on takes away from the time you could be multiplying stuff. And even then, there's been no optimization done for this sort of calculation. There are a lot of tricks you can do. Heck, the better matrix multiplication libraries have individual optimizations for CPUs.


You certainly can do that. In fact, I believe I saw an example of that in a tutorial somewhere talking about what Erlang/Elixir would be good for (I include Elixir because it's the same VM controlling processes underneath, but a "nicer" syntax on top). Now, would it be the best language for that? Depends. If you're looking just for speed and massive computations, then no. In the end, C++ pretty much rules everything in that regard (except for maybe Fortran in some instances). But, I would say that it might be more fun to set it up in Erlang/Elixir. And you could pretty easily expand the whole thing just by adding more processors and simply telling it to spawn more processes (the Erlang VM is pretty awesome). I would almost say that an Erlang version of it would feel more life-like. You could probably experiment with it on the fly more easily, too. Kill off a process here or there and see what happens, etc.


I actually started going through Programming Elixir this past week because I was thinking it would eventually be a fun way to explore models through ad hoc substitution of rules and exogenous conditions. There are a lot of models available in NetLogo, but it doesn't look like it would be very powerful in terms of running larger scale simulations, and the scope of the language's usefulness is pretty limited compared to that of a general purpose language.

While I have been using C++/Rcpp to extend R as an occasional time saver (for analysis rather than simulations), it's only been little snippets written badly.

Anyway, since it's so easy to come up with a too-long list of technologies to learn, then only scratch the surface, hearing that Erlang/Elixir isn't the completely wrong tool is helpful.


As an example, here's the game of life written in Elixir:

https://github.com/spikegrobstein/elixir-conway

How much better that is than a C++ version, I don't know. But if you want to get better at C++ and give it a try, forget what you know and read this book: http://www.amazon.com/Programming-Principles-Practice-Using-...


erlang is not competitive in math calculation performance, and fault tolerance is not as important when your program has a relatively short, defined lifetime and well-understood inputs and outputs. As a message router and underlying infrastructure for a cluster it'd probably work pretty well, but then you'd have an impedance mismatch and operational concerns between the routing infrastructure and the calculation infrastructure.


Erlang is like Python in that you normally use C for the single-thread numerical stuff.


I am trying to get into distributed computing so this article is particularly interesting to me. I may be mistaken so please excuse my naivety if my points are off the mark. I thought MPI was mainly geared towards communication-heavy tasks where the underlying network is specialized, for example infiniband or a bus between CPUs. One use of MPI is to manage distributed memory tasks between different physical CPUs while threads run on multiple cores of the same CPU. Spark, I believe, doesn't handle cases like this well because the JVM hides low level details. I have read papers that propose to layer MPI over RDMA rather than expose a flat memory model, which came as a surprise to me but it shows the flexibility of MPI. One thing unclear to me is what performance we can expect from MPI when we use commodity network gear, and how it compares to Spark. The article is absolutely correct that MPI leaves robustness to the user, and that is an oversight today.

The modern Hadoop ecosystem is designed for a different workload from MPI's. It emphasizes co-localizing data and computation, seamless robustness, and trades off raw power for simple programming models. MapReduce turns out to be too simple, so Spark implements graph execution, which is nothing new to HPC. As far as I know Spark's authors don't believe it is ready for distributed numerical linear algebra yet. But a counterpoint is that I am seeing machine learning libraries using Spark, so perhaps things are improving.

One thing I have learnt today is that MPI isn't gaining popularity. I just have a hard time picturing a JVM language in overall control in HPC where precise control of memory is paramount to performance.


> I thought MPI was mainly geared towards communication-heavy tasks where the underlying network is specialized

The beauty of MPI is:

  * its definition is completely open

  * it segregates the high-level message passing interface from the low-level stuff
This means that code that was written on cheap old commodity network gear over TCP/IP will work on brand new specialised hardware using its own protocol. Because it's fully open, any hardware vendor can provide an MPI driver for their hardware at virtually no cost.


I think you've omitted something. PVM had that same beauty, but MPI won over PVM. Do you have any idea why?


I'm not sure... My only exposure to HPC was the install of a cluster with MPI over Infiniband. At the time I looked into MPI a bit and played with it at home (over wifi and ethernet).


I agree that languages that rely on tracing GC seem like they're fundamentally at a disadvantage when it comes to pushing the envelope of single-node performance; the best article I've read arguing this was actually in the context of mobile games, rather than HPC, but I can't for the life of me find the article now.

I don't know if Spark itself is the right way forward; but it's an example of a very productive high-level language for certain forms of distributed memory computing. And some of these issues - like the JVM - aren't fundamental to Spark's approach; there's no inherent reason why something similar couldn't be built based on C++ or the like.


I'd be very wary of criticising Spark based on the technologies it's built on.

In particular, a decent garbage collector will give you performance dependent on the number of live objects (typically low) and not on the number of allocations and deallocations, as you might see in a non-garbage collected language. This gives great allocation performance and reduces overheads.

The disadvantages can be (potentially long) GC pauses and higher overall memory requirements, but in practice this isn't usually a problem for non-interactive systems.

Of course, if you do have a device with low memory and low tolerance for GC pauses (i.e. mobile gaming) there might be a problem.

The main disadvantage seems to be less predictable performance, which could be a problem in domains which require good timing performance, but that's not really Spark's problem.

A GC'd language is also generally easier to program in, since one doesn't have to (in general) worry about memory management, so it's generally a lot easier to program very large systems with lots of moving parts.


I totally agree that the programming model of Spark is the right direction. I dream of the day when compiler and OS cooperate to expose a simple interface to distributed memory and an optimal execution-communication system, kind of like Cilk but for clusters.

BTW, thanks for a thought provoking article. You have given me a lot to ponder.


RDMA doesn't really provide a flat memory model - all it's really doing is minimizing copies when you send a message. More like "put this 100K string into that node at <address>".
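That "put this buffer into that node's memory" style is pretty much what MPI's one-sided (RMA) interface exposes too; a minimal illustrative sketch (window size and target are made up):

    program rma_put_demo
      use mpi
      implicit none
      integer, parameter :: n = 1024
      integer :: ierr, rank, nprocs, win, tgt
      real :: localbuf(n), window_mem(n)
      integer(kind=MPI_ADDRESS_KIND) :: winsize, disp

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! every rank exposes window_mem for remote access
      winsize = n * 4                  ! bytes, assuming 4-byte default reals
      call MPI_Win_create(window_mem, winsize, 4, MPI_INFO_NULL, &
                          MPI_COMM_WORLD, win, ierr)

      localbuf = real(rank)
      tgt  = mod(rank + 1, nprocs)
      disp = 0

      ! one-sided: no matching receive call anywhere on the target
      call MPI_Win_fence(0, win, ierr)
      call MPI_Put(localbuf, n, MPI_REAL, tgt, disp, n, MPI_REAL, win, ierr)
      call MPI_Win_fence(0, win, ierr)

      call MPI_Win_free(win, ierr)
      call MPI_Finalize(ierr)
    end program rma_put_demo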


Too bad he didn't talk about whether GPGPU is killing MPI too. I don't know enough to say.

I'm not familiar with the HPC space but I thought a lot of new work, at least in machine learning, was migrating to GPGPU instead of traditional CPUs. The compute per $ or per watt payoff is too large to ignore.


I wouldn't say GPUs are killing off MPI. You still need some way to pass data between nodes/GPUs (most of these datasets can't fit within a single GPU). What you are seeing, though, is less and less use of the CPU. If code developers decide to use GPUs, they try and move their data onto the GPU and keep it there for as long as possible (data movement across PCIe is a killer for performance). ORNL's new machine Summit [1] will have 1/4 the nodes of their current machine, Titan, but multiple GPUs per node. Thus emphasizing the move away from CPUs and onto GPUs. Again though, there are still 3K nodes and you need some efficient way to pass data between those nodes.

[1] https://www.olcf.ornl.gov/summit/


GPUs have a large advantage in a very narrow niche: regular and very intensive ops on regular and compact data. ML is not completely ideal (because it's not that compute-intensive), but MC simulations often are. Most things are not ideal for GPUs, especially since it's often not obvious how to nicely scale across GPUs. MPI's strongest point is that it lets you take very good advantage of any topology of general-purpose computers: multicore, NUMA, distributed-memory. Models that emphasize data-parallel programming (co-array fortran, etc) suit GPUs much better. But nothing is going to change the fact that GPU registers are much faster than GPU (card) memory, which is faster than any possible interconnect.


I think with GPGPU, the issue is a little different; it's local computation, so a bit orthogonal to distributed-memory coordination. But it is interesting to see how many higher-level libraries and other tools (like OpenMP4/OpenACC) are springing up around GPU usage. It's hard not to be a bit jealous...


This made me feel old, as I remember the days when we got to learn PVM at distributed computing classes and MPI was presented as something that some people were kind of working on.


If it makes you feel better, I started with PVM 2.


I used MPI (Message Passing Interface...) back in the day.

But it was a pain, especially since our code was a mix of C (which was easy to MPI) and Ada (not so easy). It's pretty low-level stuff (I think we used Open MPI). All the nodes need to have MPI set up and configured - fine if you have a team willing to do it, but these days....

mpiexec -n 10 myprocess

I think we liked it because the processes would be put to sleep by the MPI daemon until a message arrived. You can sleep and wait for a message with sockets now, I think. It's been a while since I've used Unix IPC (interprocess communication).

I don't think I'll miss it.


This is a good article.

I built a tiny cluster in my basement (a prototype) and looked at MPI and decided that it was way too complicated, so I just built something that pushes the essential bits, pretty much without abstraction, to the nodes and was done. The cluster is a very specific solution, so I felt justified in not looking at MPI. And now the decision feels even more justified.


John: You make some solid points. Most, though, seem more to support the idea that more research investment/emphasis is badly needed for HPC programming models. From an application perspective, one sees such a dominant reliance on MPI+X primarily because typically the value proposition just isn't there yet for alternatives (at least in the areas where I work, where we have undertaken new developments recently and done fairly exhaustive evaluations of available options). Though the coding can be somewhat tedious and rigid, in the end these shortcomings have been outweighed by the typical practical considerations -- low risk, extremely portable, tight control over locality and thus performance, etc. It's obviously not all or nothing - as you say we could choose something even lower level and possibly get better performance, but when seen from the perspective of the near and mid-term application goals, it's hard to make a different choice unless explicitly tasked with doing so.


Thank goodness someone said it. I get tired of using ancient software that's older than me for relatively simple tasks that should have long ago been coded into a higher level of abstraction. I programmed a pair-correlation function calculation routine using MPI once -- yech.

Fortran, MPI, even C to an extent -- can we please move on? I don't understand why the scientific community is so reluctant to embrace change. It seriously doesn't take that long to learn a new language or a platform like Github (yeah, that's still considered "new" in the scientific community), and the time investment more than pays itself back many times over.


> I don't understand why the scientific community is so reluctant to embrace change.

Let's assume the opposite were true, and it was fast to embrace change. How much time would be spent on this change -- relearning, rewriting, refighting old bugs -- vs. actual work done?

Change is overhead. You do as little of it as necessary, and only when not changing starts costing a lot. Which means you change, but slowly.

As to Fortran, it will go away when something better comes along, and then it will do so slowly, for the aforementioned reasons.


The lion in the room is that the DOE National Laboratories have a huge amount of code tied up in MPI and they continue to spend millions of dollars both on hardware and software to support this infrastructure. If you look at the top 500 list:

http://www.top500.org/lists/2014/11/

Four out of the ten computers are owned by DOE. That's a pretty significant investment, so they're going to be reluctant to change over to a different system. And, to be clear, a different software setup could be used on these systems, but they were almost certainly purchased with the idea that their existing MPI codes would work well on them. Hell, MPICH was partially authored by Argonne:

http://www.mcs.anl.gov/project/mpich-high-performance-portab...

so they've a vested interest in seeing this community stay consistent.

Now, on the technical merits, is it possible to do better? Of course. That being said, part of the reason that DOE invested so heavily in this infrastructure is that they often solve physics-based problems based on PDE formulations. Here, we're basically using either a finite element, finite difference, or finite volume based method and it turns out that there's quite a bit of experience writing these codes with MPI. Certainly, GPUs have made a big impact on things like finite difference codes, but you still have to distribute data for these problems across a cluster of computers because they require too much memory to store locally. Right now, this can be done in a moderately straightforward way with MPI. Well, more specifically, people end up using DOE libraries like PETSc or Trilinos to do this for them and they're based on MPI. It's not perfect, but it works and scales well. Thus far, I've not seen anything that improves upon this enough to convince these teams to abandon their MPI infrastructure.
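For readers who haven't seen the inside of one of these codes: on structured grids the distributed-memory pattern mostly boils down to a halo exchange plus a local stencil update. Here's a toy sketch of one explicit 1-D diffusion step in that spirit (illustrative only; kappa stands for D*dt/dx^2, and physical boundary values are assumed to be set by the caller):

    subroutine diffuse_step(nlocal, u, unew, kappa, rank, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: nlocal, rank, nprocs
      double precision, intent(inout) :: u(0:nlocal+1)
      double precision, intent(out)   :: unew(0:nlocal+1)
      double precision, intent(in)    :: kappa
      integer :: left, right, ierr, i
      integer :: status(MPI_STATUS_SIZE)

      left  = rank - 1
      right = rank + 1
      if (left  < 0)       left  = MPI_PROC_NULL
      if (right >= nprocs) right = MPI_PROC_NULL

      ! swap guard cells with the neighbouring ranks
      call MPI_Sendrecv(u(nlocal), 1, MPI_DOUBLE_PRECISION, right, 0, &
                        u(0),      1, MPI_DOUBLE_PRECISION, left,  0, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_Sendrecv(u(1),        1, MPI_DOUBLE_PRECISION, left,  1, &
                        u(nlocal+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                        MPI_COMM_WORLD, status, ierr)

      ! local explicit update on the interior points
      do i = 1, nlocal
        unew(i) = u(i) + kappa * (u(i-1) - 2.0d0*u(i) + u(i+1))
      end do
    end subroutine diffuse_step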

Again, this is not to say that this setup is perfect. I also believe that this setup has caused a certain amount of stagnation (read: a huge amount) in the HPC community and that's bad. However, in order to convince DOE that there's something better than MPI, someone has to put together some scalable codes that vastly outperform MPI (or are vastly easier to use, code, or maintain) on the problems that they care about. Very specifically, these are PDE discretizations of continuum mechanics based problems using either finite difference, finite element, or finite volume methods in 3D. The 1-D diffusion problem in the article is nice, but 3-D is a pain in the ass, everyone knows it, and you won't get even a casual glance for anything shy of 3-D problems. That sucks and is not fair, but that's the reality of the community.

By the way, the oil industry basically mirrors the sentiment of DOE as well. They're huge consumers of the same technology and the same sort of problems. If someone is curious, check out reverse time migration or full wave inversion. There are billions of dollars tied up in these two problems and they have a huge amount of MPI code. If someone can solve these problems better using a new technology, there's a huge amount of money in it. So far, no one has done it because that's a huge investment and hard.


The sad thing is that MPI-3 doesn't have wide acceptance yet. A lot of organizations mainly use MPI-1/2, while running RHEL 6.x. Shout-out to Red Hat for maintaining ancient software.


W-why are they using that picture at the start when their source is an essay on exactly the way they're misusing it?


This happened with Smalltalk.

I was a Smalltalk coder. I thought it was the best thing since sliced bread. It has always been clear to me that Smalltalk is far superior to Java.

I left the company after a little while, to do C++ graphics. I later heard that my former employer rewrote their Smalltalk application in Java.

Now no one uses Smalltalk anymore. While Objective-C is based on Smalltalk, Smalltalk was far easier to use; yet lots of people use Objective-C, and no one uses Smalltalk.

How could it have been different? My friend Kurt Thames once said that "Smalltalk is the way object-oriented programming SHOULD be done." I have always agreed with that.

But when new methods (!) of OOP arose, all the Smalltalk crowd did was gripe about how Smalltalk was far better than Java or Objective-C.


What "new methods" are you thinking of?

What let Java get ahead of Smalltalk for me personally, as someone getting into programming in 1996, was that i could write it in the text editor i already had, compile it with a compiler i could get for free, and then post the source code on Geocities (actually, Xoom - remember that?) to share with others.

Whereas when i tried to get into Smalltalk, the first thing i had to do was learn my way around this wacky environment with its strange class browser and ultra-retro window manager, and get my head around the fact that my source code wasn't anywhere particular, and yet was everywhere, and that if i wanted to share my code, i had to somehow "file out", and then hope that my internet friends could successfully "file in" to their own potentially modified images. Once i'd got hold of the tools at all, that is.

Which is not to say that the Smalltalk environment was not better than Notepad/DOS box/javac, because of course it was. It just didn't lend itself to adoption and spread nearly as well. It was a tool for masters, with affordance for apprentices.

Also, Java had pretty good networking right in the standard library, and networking was really exciting in 1996.


Hmm, I don't know. I'm pretty damn productive just programming in notepad(equivalent).


I'd say that your ability to write Java source in any text editor at all was, all by itself, a new method of doing Object-Oriented Programming.

We used Visual Smalltalk Enterprise. It had the cool feature that, at the end of the workday, I could make what amounted to a core dump, then the following morning I would load my core dump into a running program, and there would be all my open windows with the cursors in the right places in the source documents and so on.

That was quite cool and I really enjoyed it; however, that environment was profoundly non-portable. I expect that much of the success of Java as opposed to Smalltalk was the simple ability one had to post a tarball full of source code on one's FTP site.


That wasn't new, though. C++ had that. Eiffel had that. Every programming language that isn't Smalltalk has that.


I'm not dead certain but I think smalltalk may have come before C++. For sure it was in widespread use before C++ was in widespread use.


Smalltalk started in '71 and was used for research purposes in '72. Smalltalk '80 was the definitive version for most. This means that Smalltalk slightly predates C. By comparison, C++ didn't start until 1983.

You wrote "your ability to write Java source in any text editor at all was, all by itself, a new method of doing Object-Oriented Programming".

Simula, considered the first OO language, is text based in the same way that Java is, and not image based like Smalltalk. Since Smalltalk is inspired by Simula, I do not believe one can say using any text editor is a new method of doing OO programming.


Ah, but I think that the comparison doesn't hold. In my opinion, Smalltalk was the right way to do all of what we do, and Java took the enterprise mindset by storm. There was an enormous project at a very large Insurance company near Chicago that was written in Smalltalk, but got abandoned for some obscure reason.

I think that Smalltalk has the right level of abstraction and a lot of very good other things about it.

MPI was, as the article points out, the wrong abstraction for the problem. If MPI dies, I am ok with that.

I am sad that Smalltalk is not more widely used.


"Java took the enterprise by storm"

Smalltalk's demise no doubt had a lot to do with Sun's marketing people convincing a bunch of Pointy-Haired Bosses that garbage collection means that you have no memory leaks, as well as that Java was the only way to do cross-platform development.

IMHO Java is one of the very worst ways to do cross-platform, however when I ported a Mac OS Pascal program to Java so that it could be run on both Windows and Mac, the client was completely convinced that Java was the only way that could possibly be done - this despite my loud and frequent protests that the state of Java at the time was quite poor, that the Java interpreter was dog-slow, that Java sucked the memory dry, and that I knew a whole bunch of ways to write cross-platform native code that would be far faster and use far less memory.

The reason that Smalltalk specifically suffered from this, is that Sun made most of its money by selling servers to the enterprise. Sun Workstations were favored by scientists and engineers, however Sun's real money came from enterprise servers.

Right around that time, Smalltalk was largely used for enterprise applications. Enterprise application developers loved Smalltalk absolutely to death; however, the bean counters and the PHBs were more inclined to listen to marketdroids' promises about garbage collection making them immune to memory leaks.

Garbage collection and memory leaks are orthogonal.


While I agree with your conclusion, the "Smalltalk's demise no doubt had a lot to do with Sun's marketing people convincing" part doesn't match what I observed. I was at Allstate at the time, and when the groundswell of attention paid by the rank and file was sufficiently overwhelming (in a positive way), the PHBs had to say "chill, we need to investigate this" in true Gartner fashion. Later, it did get adopted, but there is a lot of .NET there.



