
Erlang and IBM Power8 in the cloud: super-high single-system parallelism - davidw
http://erlang.org/pipermail/erlang-questions/2014-October/081407.html
======
jacques_chester
For those as confused as I was at first, the critical line is "total time".
2.8s for the P8 vs 38.7s for the x86.

Otherwise the x86 comes out looking a lot better -- lower 95th percentile,
lower max, lower average.

(Modulo the usual complaints about benchmark porn: single run, lack of standard
deviation, unknown configuration differences, etc.)

~~~
baq
thanks for the explanation, had major cognitive dissonance about that post.

~~~
felixgallo
yeah, sorry, it was just a quick test; I didn't even run any erlang besides
building it from scratch, starting it up and seeing that BEAM recognized the
right number of schedulers. It would be interesting to run a more
comprehensive test suite than a single timing run, for sure.

------
throwaway_yy2Di
How does POWER8 compare to x86, e.g. Haswell? Just skimming some of the
architecture details...

* 4x hardware threads per core (8-way SMT vs. 2)

* 1/4th FP throughput per core (8 SP flops/cycle vs. 32)

* 3x bandwidth to RAM (230 GB/s vs. 68) [edit: updated for Haswell-EP]

[https://en.wikipedia.org/wiki/POWER8#Specifications](https://en.wikipedia.org/wiki/POWER8#Specifications)

[http://www.redbooks.ibm.com/abstracts/tips1153.html](http://www.redbooks.ibm.com/abstracts/tips1153.html)

What is it good for?

~~~
apaprocki
Worth mentioning that POWER is the only chip I'm aware of that has hardware
DFP instructions.
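
For context on why hardware decimal support matters: binary floating point cannot represent most decimal fractions exactly, which is why financial code falls back to decimal arithmetic — in software on commodity chips, in hardware via DFP instructions on POWER. A quick Python illustration of the software-decimal behavior those instructions accelerate:

```python
from decimal import Decimal

# Binary floating point accumulates representation error on decimal cents:
print(0.1 + 0.2 == 0.3)  # False

# Decimal arithmetic is exact for these values; this is the computation DFP
# hardware speeds up (Python's decimal module does it in software):
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```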

~~~
polack
Sounds interesting! What's needed from the user in order to use this feature?
Will it be possible to use it from high-level languages like Java, or do you
have to use assembly? I can think of quite a lot of applications that would
benefit A LOT from this.

~~~
desdiv
High level languages like Python, Ruby, and Go could take advantage of this
without having to make any code changes, provided that the appropriate
libraries are updated to use this new hardware.

Java, however, is a special case. Since Java bytecode does not have a
BigDecimal type, there's no way to update the JVM to take advantage of this
hardware, unfortunately.

~~~
wmf
I would assume that the IBM JVM uses DFP instructions to implement the
BigDecimal _class_.

~~~
apaprocki
Yes, according to IBM's presentations, BigDecimal in their JVM uses 64-bit DFP
instructions on their hardware. Minimum hardware level is POWER6 or Z10 (Z9
supported via microcode).

[http://www.ibm.com/developerworks/rational/cafe/docBodyAttac...](http://www.ibm.com/developerworks/rational/cafe/docBodyAttachments/3212-102-1-6349/DecimalFloatingPoint.pdf)

------
desdiv
Previous HN discussion:
[https://news.ycombinator.com/item?id=8481851](https://news.ycombinator.com/item?id=8481851)

------
Erwin
A while ago (here's an article from 2002:
[http://lwn.net/Articles/6367/](http://lwn.net/Articles/6367/) ), the big new
thing was going to be ibm mainframes running linux VMs. Whatever became of
that? Is anyone still doing that?

~~~
PhuFighter
the last I heard, Linux on IBM mainframes was driving > 50% of their
sales.

------
neurotixz
While the performance seen here is nice, I'm curious to see the
price/performance ratio. Running against an 8-core Xeon would not make sense if
the closest Intel system price-wise is a quad 12-core Xeon... Obviously we are
talking cloud here, so it might not even apply.

In my experience with Power7, the price/performance ratio is much lower on
Power than on Intel systems. Maybe that has changed, but I'm not holding my
breath, even if IBM seems much more aggressive on pricing with P8 than they
were with P5-P7.

Quick calculation, absolutely unscientific:

Seeing that the price is $0.14/hour for the 6-core Xeon and $1.08/hour for the
176-core P8, the P8 would have to be roughly 8-10x faster to justify the cost
difference; I'm not sure that will be the case.
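
The break-even math can be sketched quickly (prices from this comment, runtimes from the single timing run quoted upthread, so treat it as illustrative only):

```python
# Break-even speedup for the P8 instance to match the Xeon instance on
# cost per unit of work. Cost per job is price * runtime, so the P8
# breaks even when its speedup equals the price ratio.
xeon_price = 0.14   # $/hour, 6-core Xeon instance
p8_price = 1.08     # $/hour, 176-core POWER8 instance

break_even_speedup = p8_price / xeon_price
print(round(break_even_speedup, 1))  # 7.7

# Total-time numbers from the benchmark post (single run, caveats apply):
observed_speedup = 38.7 / 2.8
print(observed_speedup > break_even_speedup)  # True on this one run
```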

~~~
felixgallo
the thing you're getting here is primarily throughput on a single image. Even
if it's more expensive per-core per-hour, you can't discount that you'd have
to work a lot harder to get the equivalent 30-box distributed solution to work
properly, and even then it would have certain disadvantages owing to network
latency.

~~~
nonsequ
This is interesting and I'd like to hear more opinions on it. My impression is
that distributed computing has been eating Power/Sparc/Z processors' lunch for
a long time now because software has made up for the difficulty of
coordinating 30 boxes. Do you, or does anyone else, believe that we are at an
inflection point where the pendulum swings back in the direction of 'high-
performance' processors like Power8, or will improvements in 'scale-out' ease-
of-use and economies of scale continue to win the day?

~~~
felixgallo
The dominant use case for the last decade or so has been web servers hitting
caches to do low-CPU low-causality CRUD operations. That looks unlikely to
change in the next decade, so keep your Intel stock.

That said, for a lot of interesting use cases, like that king-hell postgres
database sitting in the middle of the swarm, or video processing, or streams
processing, or indeed any situation in which thousands-to-millions of
simultaneous actors need to work on the same shared state, this sort of system
starts looking real interesting.

As a thought experiment, think of this system like a GPU, except every single
processor is a fully capable 2 GHz i5 running Unix, and instead of having to
deal with the CUDA or OpenCL API, you can just write erlang (or haskell; .. or
whatever) code and it will run. And instead of having 2-8G of RAM, you have
48G. And instead of having arcane debug tools, you have recon and gdb and ddd.

I don't think there is a pendulum, I think there's a spectrum and has always
been one; pragmatism should always rule, and your use case is not my use case.
There isn't going to be an objective winner ever, no matter how close Intel
may get to covering much of the sweet spot.

------
dschiptsov
It is not just about Erlang, but about any language+runtime designed around a
well-known set of sound principles (immutability, share-nothing,
message-passing).

As long as order of evaluation does not matter (as in pure-functional code),
some [major] parts of a program could be evaluated in parallel by the runtime
without any changes to the code (especially when higher-order function
composition - map/filter/reduce/etc. - is the primary pattern). So Haskell, for
example, will do it too.
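
As a rough illustration of that point (a hypothetical Python sketch, not Erlang or Haskell): when the work is a pure function applied over independent inputs, the executor is free to fan the map out across cores without the mapping code itself changing:

```python
from multiprocessing import Pool

def square(x):
    # Pure function: no shared state, so evaluation order is irrelevant
    # and the runtime may evaluate the calls in parallel.
    return x * x

if __name__ == "__main__":
    with Pool() as pool:
        # Same map shape as the sequential list(map(square, range(8)));
        # only the executor changed.
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```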

~~~
dragonwriter
> So, Haskell, for example, will do it too.

By "will do it" you mean "would, in theory, support a compiler that
parallelized significant amounts of idiomatic code with no source changes"?
Maybe. I don't think any existing Haskell compiler does that, though.

~~~
wyager
I've had GHC automagically parallelize tree-traversal code before. However, I
think in general you have to put a bit of work into it.

Of course, you could theoretically just write an actor library that did the
exact same thing as erlang.
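
A minimal sketch of that idea (a toy mailbox-per-actor pattern in Python rather than a real Erlang-style library; the `Actor` class and its methods are made up for illustration):

```python
import queue
import threading

class Actor:
    """Toy actor: one thread, one mailbox, share-nothing message passing."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, msg):
        # The only way to interact with an actor: drop a message in its box.
        self._mailbox.put(msg)

    def stop(self):
        self._mailbox.put(None)  # poison pill: ask the actor to exit
        self._thread.join()

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg is None:
                break
            self._handler(msg)

# Usage: an actor that accumulates the messages it receives.
received = []
a = Actor(received.append)
for n in (1, 2, 3):
    a.send(n)
a.stop()
print(received)  # [1, 2, 3]
```

Real actor runtimes (BEAM especially) add supervision, preemptive scheduling, and per-process heaps on top; this only shows the mailbox discipline.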

------
PhuFighter
Interesting simple CPU test. I can't wait to see other, more comprehensive
applications.

------
listic
Which one is that plan that allows you to "spend about $1"?

~~~
felixgallo
[https://cloud.runabove.com/login/?launch=power8](https://cloud.runabove.com/login/?launch=power8)
leads to the power8 pages. Not a super well organized site imo.

