
Computing at scale, or, how Google has warped my brain - adambyrtek
http://matt-welsh.blogspot.com/2010/10/computing-at-scale-or-how-google-has.html
======
jedbrown
_Forget about complex shared-memory or message passing architectures: that
stuff doesn't scale, and is so incredibly brittle anyway (think about what
happens to an MPI program if one core goes offline)._

MPI is not good at fault tolerance, though MPI-3 will help some. MapReduce and
Sawzall are well-suited to the problems they were made for: largely
independent operations on very large data sets with occasional reductions and
weak synchronization. Usually disk is a significant factor and strong
scalability is not especially important.

But they really can't compete with MPI for other problem domains. Claiming
that MPI doesn't scale is demonstrably false, as evidenced by PDE solvers
running on 200k - 300k cores (Jaguar, Jugene, etc). This is a comfortable
order of magnitude larger than Google is running, and the communication
requirements of those algorithms are far more demanding, with essential all-
to-all communication as well as low-latency global reductions (for which Blue
Gene has dedicated hardware). The need for these low-latency all-to-all
operations is actually very fundamental (formally proven). The algorithms also
tend to have much stricter synchronization requirements, for example, nothing
can be done until everyone gets the result of a dot product, because the next
operation needs to happen in a consistent subspace (though MPI has fine
asynchronous support). There has been plenty of work trying to make algorithms
less synchronous, but weakly synchronous algorithms almost uniformly have
worse algorithmic scalability (number of iterations to solution increases by
more than a constant factor). And since problem sizes are usually chosen to
fit in memory, the network doesn't get a free pass due to the disk being slow.
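
To make the synchronization point concrete, here is a minimal sketch (my own,
not taken from any of the solvers mentioned above) of the blocking global
reduction a Krylov iteration needs before it can proceed; the sizes and values
are purely illustrative:

    /* Minimal sketch of a global dot product: no rank can continue the
     * iteration until the reduced value has reached everyone.  Build with an
     * MPI C compiler (e.g. mpicc); sizes here are illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n_local = 1000;             /* locally owned entries */
        double x[1000], y[1000], local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; i++) { x[i] = 1.0; y[i] = 2.0; }

        for (int i = 0; i < n_local; i++)     /* local part of the dot product */
            local += x[i] * y[i];

        /* Low-latency global reduction: every rank blocks here. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("dot = %g on %d ranks\n", global, size);
        MPI_Finalize();
        return 0;
    }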

If you care about strong scalability, MapReduce and Sawzall are probably not
good choices. The same goes if your problem domain requires low-latency all-
to-all communication, reductions, or "ghost exchange".

~~~
Retric
_200k - 300k cores... a comfortable order of magnitude larger_

I would assume Google has well over 20,000 cores. Or did you mean something
else?

~~~
jedbrown
From talking with someone at Google Zurich who is involved in page-rank
computations, those jobs run on ~10k to 20k cores. Of course Google has many
more distributed around the world, but unless something has changed in the
last few months, they are not running single jobs on more than about 20k
cores.

~~~
Retric
In many ways pagerank is a special case where you need a lot of resources for
a short period of time. In such cases you could use PMI but that's not the
hard part of scaling for Google.

From what I have read, Google treats its global compute infrastructure in a
reasonably abstract fashion. To the point where their production systems share
resources across datacenters so that losing a datacenter has minimal impact.
Thus, when Google people talk about "scale" they often mean getting it to work
for long periods of time, efficiently, on flaky hardware, spread across the
world, which PMI does not do.

~~~
jedbrown
You use "PMI" consistently, but I assume you mean "MPI". MPI is not a silver
bullet, and if your problem can be reasonably solved with nearly asynchronous
algorithms such that fault tolerance is a bigger problem than network
performance, then MPI is probably not a good choice. But this is a rather
narrow definition of "scale" (and much different from the standard definition
over the last several decades of scientific computing), and for more tightly
coupled problems, MPI can beat these alternative paradigms by huge margins.

~~~
Retric
The more cores you have, the more important dealing with failure becomes.
Multi-CPU supercomputers have been dealing with these issues for a while: 2-3
CPU failures per day is fairly common, and losing a rack in the middle of a
job is way too common.

~~~
jedbrown
Yes, however the algorithms are such that failure cannot be made strictly
local. For example, when solving a stiff PDE, there is unavoidable and quite
complex global communication at every time step (indeed, within the nonlinear
solves required in each time step). There is no hope of "continuous local
checkpointing" or the like because the program state is changing to the tune
of gigabytes per second per node.

The most common response is to perform global checkpoints often enough that
hardware failure is not too expensive, but infrequently enough that your
program performance doesn't suffer too much. Bear in mind that our jobs are
not normally hitting disk heavily, so frequent checkpointing could easily cost
an order of magnitude. When hardware failure occurs, the entire global job is
killed and it is restarted from the last checkpoint. There are libraries (e.g.
BLCR) that integrate with MPI implementations and resource managers for
checkpoint, restart, and migration.
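
For a back-of-the-envelope feel for that tradeoff, here is a small sketch
using Young's approximation for the checkpoint interval, T_opt ~ sqrt(2 * C *
MTBF); the numbers are made up for illustration, not measured on any
particular machine:

    /* Back-of-the-envelope checkpoint-interval estimate using Young's
     * approximation.  C is the cost of writing one global checkpoint, MTBF is
     * the mean time between failures for the whole job.  Illustrative numbers. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double checkpoint_cost = 300.0;        /* seconds per global checkpoint */
        double node_mtbf = 5.0 * 365 * 86400;  /* seconds, per node             */
        double nodes = 20000.0;
        double job_mtbf = node_mtbf / nodes;   /* failure rate grows with nodes */

        double interval = sqrt(2.0 * checkpoint_cost * job_mtbf);
        double overhead = checkpoint_cost / interval;  /* rough fraction lost   */

        printf("job MTBF      : %.1f hours\n", job_mtbf / 3600.0);
        printf("interval      : %.1f minutes\n", interval / 60.0);
        printf("ckpt overhead : %.1f%% (plus rework after each failure)\n",
               100.0 * overhead);
        return 0;
    }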

Note that it is not possible to locally reconstruct the state of the process
that failed because there just isn't enough information unless you store all
incoming messages, which can easily be a continuous gigabyte per second per
node. Even if you had a fantastic storage system that gave you ring storage
with bandwidth similar to the network, local reconstruction wouldn't do much
good because the rest of the simulation could not continue until the
replacement node had caught up, so the whole machine would sit idle during
this time. If you have another job waiting to run, you could at least
get something done, but the resource management implications of such a
strategy are not a clear win.

So if hardware failure becomes so frequent that global checkpointing is too
costly, the only practical recourse is to compute everything redundantly.
Nobody in scientific computing is willing to pay double to avoid occasionally
having to restart from the last checkpoint, so this strategy has not taken off
(though there are prototype implementations).

~~~
Retric
There are three reasons why that approach tends to work well in scientific
computing circles.

First off individual computational errors are rarely important. EX: when
simulating galactic evolution, as long as all the velocities stay reasonable,
each individual calculation is fairly unimportant, and bounds checking is
fairly inexpensive.

Second, there is a minimal time constraint: losing 30 minutes of simulation
time a day is a reasonable sacrifice for gaining efficiency in other areas.

Third, computational resources tend to be expensive, local, and fixed. AKA, a
Cray Jaguar, not all those spare CPU cycles running Folding@home.

However, if you’re running VISA or World of Warcraft then you get a different
set of optimizations.

~~~
jedbrown
_First off individual computational errors are rarely important._

This is completely wrong. Arithmetic errors (or memory errors) tend not to
just mess up insignificant bits. If it occurs on integer data, then your
program will probably seg-fault (because integers are usually indices into
arrays) and if it occurs in floating point data, you are likely to either
produce a tiny value (perhaps making a system singular) or a very large one.
If you are solving an aerodynamic problem and compute a pressure of 10^80,
then you might as well have a supernova on the leading edge of your wing.
And flipping a minor bit in a structural dynamics simulation could easily be
the difference between the building standing and falling.
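
A small standalone program makes the point; the pressure value and the
particular bit are arbitrary choices of mine:

    /* Flipping a single bit in a double rarely perturbs only insignificant
     * digits: flip one exponent bit of an ordinary value and you get an
     * absurd one.  Example values are arbitrary. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        double pressure = 101325.0;    /* Pa, ordinary atmospheric pressure */
        uint64_t bits;
        memcpy(&bits, &pressure, sizeof bits);

        bits ^= (uint64_t)1 << 61;     /* flip one bit in the exponent field */
        double corrupted;
        memcpy(&corrupted, &bits, sizeof corrupted);

        printf("original : %g\n", pressure);
        printf("corrupted: %g\n", corrupted);   /* around 1e+159 */
        return 0;
    }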

I would argue that data mining is actually more tolerant of such undetected
errors because they are more likely to remain local and may stand out as
obviously erroneous. People are unlikely to die as the result of an arithmetic
error in data mining.

 _Second, there is a minimal time constraint,_

There is not usually a _real-time_ requirement, though there are exceptions,
e.g. <http://spie.org/x30406.xml>, or search for "real-time PDE-constrained
optimization". But by and large, we are willing to wait for restart from a
checkpoint rather than sacrifice a huge amount of performance to get continual
uptime. If you need guaranteed uptime, then there is no choice but to run
everything redundantly, and that still won't allow you to handle a nontrivial
network partition gracefully. (It's not a database, but there is something
like the CAP Theorem here.)

~~~
Retric
For the most part you can keep things reasonable with bounds checking. If
pressure in some area is 10x that of the areas around it, then there was a
mistake in the calculation. If you're simulating weather patterns on earth
over time, having a
single square mile 10 degrees above what it should be is not going to do
horrible things to the model. Clearly there are tipping points but if it's
that close to a tipping point the outcome is fairly random anyway.
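
Something like the following rough sketch is what I mean by bounds checking;
the field values and the 10x threshold are made up for illustration:

    /* Rough sketch of a sanity check: flag cells whose value is more than 10x
     * the average of their neighbours.  Field and threshold are illustrative. */
    #include <math.h>
    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double pressure[N] = {1.01, 1.02, 1.00, 25.0, 1.03, 0.99, 1.01, 1.02};

        for (int i = 1; i < N - 1; i++) {
            double neighbour_avg = 0.5 * (pressure[i - 1] + pressure[i + 1]);
            if (fabs(pressure[i]) > 10.0 * fabs(neighbour_avg))
                printf("cell %d looks wrong: %g vs neighbour average %g\n",
                       i, pressure[i], neighbour_avg);
        }
        return 0;
    }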

Anyway, if you could not do this and accuracy is important, then you really
would need to double check every calculation because there is no other way to
tell if you had made a mistake.

------
retube
When I was at CERN 10 years ago, the set up was similar. Instead of Macs the
desktops were PCs. The code was Fortran, Perl, and C++. All work was on remote
Linux boxes (typically exporting a desktop over X), although physical not
virtual OS instances. You'd ssh to a gateway IP and be logged into one of the
hosts. All data and user accounts were stored on a distributed filesystem AFS.
There was job scheduling and management middleware that'd batch process your
analysis across hundreds of hosts in the datacentre.

"The cloud" has been around for a while, although implementation has changed.

~~~
irahul
Yahoo uses a similar setup. Instead of virtual hosts, the hosts here are
physical machines. Search team had a dedicated lab (not sure what's happening
now with search migration) and all developers would ssh to their assigned lab
machines, fire up their vim/emacs and crank code.

Search team had a mixture of Windows laptops, Macs, Linux desktops, and Linux
laptops... and it didn't matter because everyone just ssh'd to a RHEL box and
worked.

Yahoo used to be a BSD shop, then moved to Linux. The search team used both
Debian and RHEL in the beginning, then transitioned to RHEL full-time.

I don't understand the obsession with RHEL. In my experience, Debian/Ubuntu
will give the same benefits as RHEL sans the cost.

The rationale for moving from BSD to Linux was practical. The gist was that
most of the vendors either didn't support BSD or treated it as a second-class
citizen, and BSD wasn't as quick as Linux in catching up with esoteric
hardware. But discarding Debian didn't actually make sense to me.

~~~
retube
CERN used Red Hat too, version 6 when I was there. I believe they've now
switched to
Scientific Linux.

Actually I don't understand why Google would use virtualisation. Why would
they split a physical box up? Making many hosts look like one, yes (is such a
thing possible?), but not the other way around.

And AFS was cool: it was actually a global filesystem that many institutions
used. You could cd to /, ls, and browse to nasa.gov amongst many thousands of
others (much was protected!) (CERN was mounted on /cern.ch).

~~~
mechanical_fish
_I don't understand why Google would use virtualisation. Why would they split
a physical box up?_

Ease of administration. It turns out to be much easier to have ten virtual
machines running one daemon each than one virtual machine running ten daemons.
Fewer interactions, cleaner process boundaries (the boundary between one VM
and another is pretty thick), easier debugging. Easier to swap out parts
without affecting other parts. Much easier scaling. (If you need more servers
for Service X, launch more VMs; if your machine needs more RAM to hold those
VMs, launch the VMs on a different physical box -- the system doesn't need to
be changed to support that because it's already architected to assume separate
"machines" for everything.)

For more on the subject, ask Ezra Zygmuntowicz or see his blog and books. I
stole all of this material from him.

~~~
gaius
_It turns out to be much easier to have ten virtual machines running one
daemon each_

It's funny: when you needed one OS instance per network server daemon on
Windows in the 90s, everyone laughed at it. Now it counts as cutting-edge
thought.

~~~
borism
Everyone laughed at it because at the same time there were OSes available that
could run an unlimited number of daemons simultaneously.

Now it's a matter of choice for the system owner. And of course all those
single-daemon VMs run inside an OS that lets numerous VMs run at the same
time.

------
jessriedel
>The idea that you need a physical machine close by to get any work done is
completely out the window at this point. ... I ssh into the virtual machine to
do pretty much everything: edit code, fire off builds, run tests, etc.
...Wide-area network latencies are low enough that this works fine for
interactive use, even when I'm at home on my cable modem.

This situation sounds fine if you're a developer and your job requires only
low-bandwidth text streams between you and your physical computer. But for
anyone who does anything graphical, or for anyone at home who wants to enjoy
media, the bandwidth and latency are big problems.

Maybe I should just assume that because this is on HN and written on a
developer's blog, context is assumed. But it sure seems like a disconnect
between hardcore programmers and the other 99% of the population.

------
regularfry
I am reminded of a William Gibson quote: "The future is here, it's just not
evenly distributed."

------
edw519
_So, printf() is your friend. Log everything your program does, and if
something seems to go wrong, scour the logs to figure it out. Disk is cheap,
so better to just log everything and sort it out later if something seems to
be broken. There's little hope of doing real interactive debugging in this
kind of environment, and most developers don't get shell access to the
machines they are running on anyway._
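
In practice that style is often nothing more than a timestamped printf
wrapper, something like this sketch (my own guess at common practice, not any
particular Google tool):

    /* Minimal sketch of the "log everything, sort it out later" style: a
     * printf-based macro that stamps each line with time and source location. */
    #include <stdio.h>
    #include <time.h>

    #define LOG(fmt, ...)                                                    \
        do {                                                                 \
            time_t now = time(NULL);                                         \
            char stamp[32];                                                  \
            strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S",               \
                     localtime(&now));                                       \
            fprintf(stderr, "%s %s:%d " fmt "\n",                            \
                    stamp, __FILE__, __LINE__, __VA_ARGS__);                 \
        } while (0)

    int main(void)
    {
        LOG("starting shard %d of %d", 3, 128);
        LOG("processed %ld records so far", 1000000L);
        return 0;
    }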

1985: Interactive debuggers suck. PRINT() is your friend.

1990: Interactive debuggers have matured. No one uses PRINT() any more.
Debugger questions are now even part of technical interviews.

2008: Concurrent processing and cloud computing have made interactive
debugging difficult.

2010: printf() is your friend again.

Developing software is getting to be like fashion. Keep those old skinny jeans
and workarounds in your closet. Sooner or later, they'll be in style again.

~~~
hvs
This just tells me that we need to figure out a new way to debug
concurrent/cloud processes more easily. The traditional debuggers just don't
cut it anymore.

~~~
aristidb
How about trying to find errors as early as possible, by for example using the
type system? That will not catch everything, but it will catch quite a lot.

~~~
hvs
That's one of the trade-offs when deciding between a statically or dynamically
typed language. Static typing isn't really going to help you with
debugging/avoiding concurrency issues, though.

~~~
jedbrown
Well, the type system can be used to isolate side effects, thus ruling out
large classes of concurrency bugs (especially with threads, as opposed to
separate processes or strict message passing, e.g. Erlang). It can also prove
that
typing errors cannot occur, whereas with a dynamic language, it is possible
for the program to be ill-typed in only certain circumstances (a function
could return a different type depending on input). But yes, there are plenty
of concurrency issues that are not greatly mitigated by static typing.

------
swah
Huh? His "dumb terminal" is a Mac laptop that will run most less-than-Google-
sized tasks on a single machine...

~~~
mechanical_fish
Even if you are avoiding Google-sized tasks, there turn out to be advantages
to treating your local machine as a dumb terminal.

One is that remote machines can be rented instead of being bought.

Another is that the configuration of everything becomes easier. You buy new
laptops by taking them out of the box, running a handful of app installers,
and moving over an SSH key or two and an encrypted password file. Your career
as amateur sysadmin might just be over. (If you miss the chance to build your
own Linux kernel, of course, there's plenty of work left for you as a cloud
sysadmin. But cloud system admin is also usually easier, because machines in
data centers come with nifty administration tools, like SANs and dashboards,
that benefit from economies of scale.)

Security: When you lose the "dumb terminal" you lose no data. You do lose the
keys that are on the machine, but you can revoke those. And it is much, much
easier to get consistent automated backups on a datacenter box than on a
laptop.

There are definite problems with the "cloud" approach, such as "who owns the
data and how do you trust it is being handled properly", but the advantages
are strong these days.

------
brown9-2
_I do all of my development work on a virtual Linux machine running in a
datacenter somewhere -- I am not sure exactly where, not that it matters. I
ssh into the virtual machine to do pretty much everything: edit code, fire off
builds, run tests, etc. The systems I build are running in various datacenters
and I rarely notice or care where they are physically located._

I love moments like this when you realize as a developer that the bits that
are the fruit of your labor are off executing in some far off physical
location you'll never know about.

------
global_hero
This guy is a total lamer!

