

Brawny cores still beat wimpy cores, most of the time [pdf] - gruseom
http://research.google.com/pubs/archive/36448.pdf

======
gwern
> Third, smaller servers can also lead to lower utilization. Consider the task
> of allocating a set of applications across a pool of servers as a bin-
> packing problem—each of the servers is a bin, and we try to fit as many
> applications as possible into each bin. Clearly that task is harder when the
> bins are small, because many applications might not completely fill a server
> and yet use too much of its CPU or RAM to allow a second application to
> coexist on the same server. Thus, larger bins (combined with resource
> containers or virtual machines to achieve performance isolation between
> individual applications) might offer a lower total cost to run a given
> workload.

Hey, that's a pretty good analogy! 'When you have lots of small processors,
it's like packing for a trip with a lot of small bags - where do you fit your
skis?'
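
A toy first-fit packing run makes the stranded-capacity effect concrete (all
sizes below are made up):

    # Toy first-fit bin packing: the same applications on big vs. small servers.
    # Sizes are invented; the point is the capacity stranded in small bins.
    jobs = [5, 5, 5, 5, 5, 5]   # e.g. RAM units needed per application

    def first_fit(jobs, bin_size):
        bins = []
        for job in jobs:
            for b in bins:
                if sum(b) + job <= bin_size:
                    b.append(job)
                    break
            else:
                bins.append([job])   # nothing fits; bring up another server
        return bins

    for size in (16, 8):
        bins = first_fit(jobs, size)
        util = sum(map(sum, bins)) / (len(bins) * size)
        print("server size %2d -> %d servers, %.0f%% utilized"
              % (size, len(bins), util * 100))
    # server size 16 -> 2 servers, 94% utilized
    # server size  8 -> 6 servers, 62% utilized

Each 5-unit job leaves 3 units of an 8-unit server that no second job can
use, which is exactly the effect the paper describes.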

------
pjscott
The meta-problem here is automatically determining how to allocate tasks among
different types of processors in a large, heterogeneous data center. There
_are_ workloads for which wimpy cores have a serious advantage, and ones where
they don't fare as well. Also, assigning jobs to cores that aren't perfectly
suited for them may still be a good idea if you need to meet load spikes. The
profiling and scheduling could be tricky, but for a company the size of
Google, it could save a lot of power and money.

~~~
miratrix
I think you may have missed one of the central points that Urs was trying to
make:

"Software development costs often dominate a company’s overall technical
expenses, so forcing programmers to parallelize more code can cost more than
we’d save on the hardware side."

There could be savings to be had from a very smart scheduler that figures out
what kind of workload a given task is by analyzing the code and dispatches it
accordingly. As you mentioned, though, that's not an easy problem to solve,
and it will also be part of the software development cost.

From the programmer's point of view, that'll be yet another variable out of
their control that may break the computing abstraction that makes these
massively parallel machines possible.

~~~
pjscott
That's part of the big win of job abstraction systems like MapReduce:
wrapping jobs in higher-level abstractions makes automated profiling and
fancy scheduling feasible.

It doesn't even have to be anything that fancy. It could be along the lines of
"run the job a hundred times on each type of machine, and measure throughput,
latency, and energy usage; then try to allocate each job to the kind of
processor that suits it best." That shouldn't break anything.
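
As a rough sketch (the machine types, run_once, and the scoring policy below
are placeholders, not any real scheduler's API):

    # Profile each job type on each machine type, then place by best score.
    from collections import defaultdict

    PROFILE_RUNS = 100
    profiles = defaultdict(dict)   # job_type -> {machine_type: avg score}

    def profile(job_type, machine_types, run_once, score):
        # run_once(job_type, machine_type) -> (throughput, latency, joules)
        for m in machine_types:
            samples = [run_once(job_type, m) for _ in range(PROFILE_RUNS)]
            profiles[job_type][m] = sum(score(s) for s in samples) / len(samples)

    def best_machine(job_type):
        # Highest average score wins for this job type.
        return max(profiles[job_type], key=profiles[job_type].get)

    # Example policy to pass as `score`: throughput per joule, zeroed when
    # latency misses a (made-up) 200 ms target.
    def throughput_per_joule(sample):
        throughput, latency, joules = sample
        return throughput / joules if latency < 0.2 else 0.0

You'd still want to leave headroom for load spikes, as mentioned upthread, but
the profiling part is mechanical.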

~~~
gruseom
How do you tell what kind of job you have without running it a hundred times?

Or do you mean run small portions of a large job before deciding where to
commit the rest?

~~~
pjscott
I meant each _type_ of job; sorry for the ambiguity. So, for example, if you
have a job type that is "look over a chunk of text and return term
frequencies", you could test this on a few hundred chunks of text, and then
use this profiling information to guide the scheduling any other time you run
a job of this type.

~~~
cma
Google has this covered in spades:
<http://research.google.com/pubs/pub36575.html>

------
justincormack
I think the space of applications that could run on wimpy machines is
reasonably large, although of course it doesn't cover everything. Google,
though, has a huge installed base of software, so the cost of moving could be
very high. The examples are based on code that is already at the limit of
sequential execution speed, but lots of software is not, and has room to run
slower.

------
spullara
One thing you can do to combat the issue is use a templating framework that is
multithreaded, like mine: <http://github.com/spullara/mustache.java> It
spreads all of a request's work over multiple cores, so single-request latency
doesn't suffer when you have a lot of cores.
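
Not mustache.java's actual API, but the general shape of the idea, sketched in
Python with a process pool so the per-section work really lands on separate
cores (function and section names are hypothetical):

    # Render independent sections of one response in parallel, so per-request
    # latency is roughly the slowest section rather than the sum of all of them.
    from concurrent.futures import ProcessPoolExecutor

    def render_section(section):
        # Placeholder for the real template + data rendering work.
        return "<div>%s</div>" % section

    def render_page(sections):
        with ProcessPoolExecutor() as pool:
            return "".join(pool.map(render_section, sections))

    if __name__ == "__main__":
        print(render_page(["header", "results", "ads", "footer"]))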

~~~
kevingadd
How do you address the scheduling and synchronization issues described in the
paper to keep request latencies low?

------
zdw
I wonder what the vendors of massively parallel CPUs (the Opteron 6xxx series,
the SPARC T-series) would have to say about this.

Also, overall power usage isn't given much consideration, just response time.

~~~
codedivine
I would say a core in the Opteron 6xxx series, at least, still fits the
definition of a brawny core. It's more the Atom and current ARM style of wimpy
core that we are talking about.

------
ldar15
Takeaway: parallel programming is hard even for Googlers.

~~~
kennystone
Google is mostly C++, Java, and Python, so indeed it is. Use Erlang (or
something similar) and parallelism is easy.

~~~
lukesandberg
I don't know a lot about Erlang, but I doubt it makes parallelism easy all of
the time. I'm sure there are many cases where the default actor model works
great, but it's not a reasonable model for _every_ problem. Shared-memory
multithreading (for all of its faults) can be very effective for certain kinds
of problems, so it can be worth the difficulty.

