Computing at scale, or, how Google has warped my brain (matt-welsh.blogspot.com)
119 points by adambyrtek on Oct 21, 2010 | 43 comments



So, printf() is your friend. Log everything your program does, and if something seems to go wrong, scour the logs to figure it out. Disk is cheap, so better to just log everything and sort it out later if something seems to be broken. There's little hope of doing real interactive debugging in this kind of environment, and most developers don't get shell access to the machines they are running on anyway.

1985: Interactive debuggers suck. PRINT() is your friend.

1990: Interactive debuggers have matured. No one uses PRINT() any more. Debugger questions are now even part of technical interviews.

2008: Concurrent processing and cloud computing have made interactive debugging difficult.

2010: printf() is your friend again.

Developing software is getting to be like fashion. Keep those old skinny jeans and workarounds in your closet. Sooner or later, they'll be in style again.
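
Concretely, the "log everything and sort it out later" style from the quoted article looks something like this in Python (just a sketch; the logger name, fields, and do_work() are made up):

    import logging

    # Log everything, including debug-level detail; disk is cheap, sort it out later.
    logging.basicConfig(
        filename="worker.log",
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s pid=%(process)d %(message)s",
    )
    log = logging.getLogger("worker")

    def do_work(request):
        return {"id": request["id"], "status": "done"}  # stand-in for the real work

    def handle(request):
        log.debug("handling request id=%s", request["id"])
        try:
            result = do_work(request)
            log.info("request id=%s ok", request["id"])
            return result
        except Exception:
            log.exception("request id=%s failed", request["id"])
            raise

    handle({"id": 42})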


This just tells me that we need to figure out a new way to debug concurrent/cloud processes more easily. The traditional debuggers just don't cut it anymore.


How about trying to find errors as early as possible, for example by using the type system? That won't catch everything, but it will catch quite a lot.


You can definitely get a lot out of using better abstractions, as the success of MapReduce has shown. I'll list a few ideas, with the warning that they may be half-baked:

* Strong, stern type checking has always been kind of a pain in the ass, but it's very effective at catching miscellaneous stupid errors. If debugging in the cloud is hard enough, then maybe overbearing type systems are the lesser of two hassles.

* Interactive programming is a big win. One of the best things about Lisp is the REPL, and how well-integrated it is with the editor. (At least, if you use something like SLIME.) If I'm programming on a distributed system, I definitely want to be able to quickly test out my code, preferably in a concurrent environment. You could solve this with a dedicated pool of test servers, or some fancy simulation environment, or a bunch of virtual machines, or something. The point is, I want errors to show up fast.

* Really good log analysis. Ideally, any framework you're using would automatically log everything it does, and then you would use some nice tools to figure out what happened, and where it went wrong. Maybe the Loggly guys will come up with some good stuff.

* Data-flow oriented abstractions like what Apache Pig offers. Pig lets you define a directed acyclic dataflow graph, in which each edge is some operation like "group by field x", or "filter by function f(fields)". It compiles all this into MapReduce jobs and runs them on Hadoop. It's probably a lot harder to mess up a query in Pig than with raw MapReduce, for sufficiently complicated queries. I'm not sure how much of a performance hit you take, though.
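
To make the Pig point concrete, here's a rough sketch in plain Python of the kind of dataflow a Pig script expresses declaratively (the field names and toy records are made up, and real Pig compiles this into MapReduce jobs instead of running it in one process):

    from collections import defaultdict

    # Toy records: (user, url, latency_ms)
    records = [
        ("alice", "/search", 120),
        ("alice", "/maps", 80),
        ("bob", "/search", 300),
    ]

    # "GROUP BY user" stage of the dataflow.
    by_user = defaultdict(list)
    for user, url, latency in records:
        by_user[user].append(latency)

    # "FILTER BY f(fields)" stage: keep users whose average latency is high.
    slow_users = {
        user: sum(ls) / len(ls)
        for user, ls in by_user.items()
        if sum(ls) / len(ls) > 100
    }
    print(slow_users)  # {'bob': 300.0}; alice averages exactly 100 and is filtered out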


That's one of the trade-offs when deciding between a statically and a dynamically typed language. Static typing isn't really going to help you with debugging or avoiding concurrency issues, though.


Well, the type system can be used to isolate side effects, thus ruling out large classes of concurrency bugs (especially with threads instead of separate processes or strict message passing (e.g. Erlang)). It can also prove that typing errors cannot occur, whereas with a dynamic language, it is possible for the program to be ill-typed in only certain circumstances (a function could return a different type depending on input). But yes, there are plenty of concurrency issues that are not greatly mitigated by static typing.
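
As a tiny illustration of the "different type depending on input" hazard, here's a hypothetical Python function that a static checker such as mypy rejects outright, while the dynamic runtime only complains on certain inputs:

    def parse_port(raw: str) -> int:
        if raw.isdigit():
            return int(raw)
        return "invalid"  # mypy: incompatible return type (str, not int); Python runs it anyway

    print(parse_port("8080") + 1)  # fine: prints 8081
    # parse_port("oops") + 1       # only blows up at runtime, and only for this kind of input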


> Developing software is getting to be like fashion. Keep those old skinny jeans and workarounds in your closet. Sooner or later, they'll be in style again.

Or is developing software like business? Sell those skinny jeans and workarounds when they are in demand, then buy some back when they go out of style. You're now ready for their comeback and you have some mad money in your skinny jeans' pocket.


Actually around 1985, Mark Williams invented the C Source-Level Debugger, which certainly did not suck, and was quickly imitated by other larger outfits.

However, my use of debuggers has always been quite minimal, perhaps because despite the CSD experience, I am often behind the times. Or perhaps it was due to my early (not quite childhood) experience building real-time interrupt-rich programs in Sigma 5 assembly language.

I also read recently that most of the grownups don't use debuggers but have always stuck with printf or equivalent.

It kind of reminds me of an early attempt on my part, not wanting to do laundry in my early bachelor years, to start a wrinkled shirt fashion trend. I was not successful, at least for a large number of years. It finally did become fashion.

So my advice would be to stick with printf and build useful but lean logging habits.


All projects are different. Use the most useful tool for your job, and pay little attention to other developers. They may be using something better, or they may just have different needs.


It seems even more interesting to compare the Ruby community (which, from what I've seen, is massively test-driven and shuns the use of puts()) to the use of Python at Google. Of course, most of the Ruby stuff I mentioned is Rails-based, where there isn't that much concurrency happening and things are set up well for testing, but it's just an observation.


Google is extremely test-oriented: every project is covered by a test suite, and there is a powerful distributed continuous integration infrastructure in place that executes tests "in the cloud" and doesn't allow committing code that doesn't pass. Not to mention the public Google Testing Blog[1] and the famous Testing on the Toilet[2].

People in the Ruby community also use temporary puts/logging statements for exploratory debugging, but of course nobody in their right mind commits such code to the repository.

[1] http://googletesting.blogspot.com/

[2] http://googletesting.blogspot.com/2007/01/introducing-testin...


Forget about complex shared-memory or message passing architectures: that stuff doesn't scale, and is so incredibly brittle anyway (think about what happens to an MPI program if one core goes offline).

MPI is not good at fault tolerance, though MPI-3 will help some. MapReduce and Sawzall are well-suited to the problems they were made for: largely independent operations on very large data sets with occasional reductions and weak synchronization. Usually disk is a significant factor and strong scalability is not especially important.

But they really can't compete with MPI for other problem domains. Claiming that MPI doesn't scale is demonstrably false, as evidenced by PDE solvers running on 200k - 300k cores (Jaguar, Jugene, etc). This is a comfortable order of magnitude larger than Google is running, and the communication requirements of those algorithms are far more demanding, with essential all-to-all communication as well as low-latency global reductions (for which Blue Gene has dedicated hardware). The need for these low-latency all-to-all operations is actually very fundamental (formally proven).

The algorithms also tend to have much stricter synchronization requirements: for example, nothing can be done until everyone gets the result of a dot product, because the next operation needs to happen in a consistent subspace (though MPI has fine asynchronous support). There has been plenty of work trying to make algorithms less synchronous, but weakly synchronous algorithms almost uniformly have worse algorithmic scalability (the number of iterations to solution increases by more than a constant factor). And since problem sizes are usually chosen to fit in memory, the network doesn't get a free pass due to the disk being slow.

If you care about strong scalability, MapReduce and Sawzall are probably not good choices. The same goes if your problem domain requires low-latency all-to-all operations, reductions, or "ghost exchange".
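
For anyone who hasn't used MPI, that global dot product is a one-line collective; a minimal sketch with mpi4py (assuming an MPI installation, run with something like mpirun -n 4 python dot.py):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank owns a slice of the global vectors.
    np.random.seed(rank)
    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)

    local_dot = np.dot(x, y)

    # Low-latency global reduction: every rank blocks here until the combined
    # result is available; this is the synchronization point described above.
    global_dot = comm.allreduce(local_dot, op=MPI.SUM)

    if rank == 0:
        print("global dot product:", global_dot)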


200k - 300k cores... a comfortable order of magnitude larger

I would assume Google has well over 20,000 cores. Or did you mean something else?


From talking with someone at Google Zurich who is involved in page-rank computations, those jobs run on ~10k to 20k cores. Of course Google has many more distributed around the world, but unless something has changed in the last few months, they are not running single jobs on more than about 20k cores.


In many ways pagerank is a special case where you need a lot of resources for a short period of time. In such cases you could use PMI but that's not the hard part of scaling for Google.

From what I have read, Google treats its global compute infrastructure in a reasonably abstract fashion. To the point where their production systems share resources across datacenters so that losing a datacenter has minimal impact. Thus, when Google people talk about "scale" they often mean getting it to work for long periods of time, efficiently, on flaky hardware, spread across the world, which PMI does not do.


You use "PMI" consistently, but I assume you mean "MPI". MPI is not a silver bullet, and if your problem can be reasonably solved with nearly asynchronous algorithms such that fault tolerance is a bigger problem than network performance, then MPI is probably not a good choice. But this is a rather narrow definition of "scale" (and much different from the standard definition over the last several decades of scientific computing), and for more tightly coupled problems, MPI can beat these alternative paradigms by huge margins.


The more cores you have, the more important dealing with failure becomes. Multi-CPU supercomputers have been dealing with these issues for a while: 2-3 CPU failures per day is fairly common, and losing a rack in the middle of a job is way too common.


Yes, however the algorithms are such that failure cannot be made strictly local. For example, when solving a stiff PDE, there is unavoidable and quite complex global communication at every time step (indeed, within the nonlinear solves required in each time step). There is no hope of "continuous local checkpointing" or the like because the program state is changing to the tune of gigabytes per second per node.

The most common response is to perform global checkpoints often enough that hardware failure is not too expensive, but infrequently enough that your program performance doesn't suffer too much. Bear in mind that our jobs are not normally hitting disk heavily, so frequent checkpointing could easily cost an order of magnitude. When hardware failure occurs, the entire global job is killed and it is restarted from the last checkpoint. There are libraries (e.g. BLCR) that integrate with MPI implementations and resource managers for checkpoint, restart, and migration.
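
A minimal sketch of the application-level version of this, assuming mpi4py (the step counts and file naming are made up, and real codes would use parallel I/O rather than one file per rank):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    CHECKPOINT_EVERY = 500            # tuned to balance I/O cost against lost work
    state = np.zeros(1_000_000)       # this rank's share of the solution

    for step in range(10_000):
        state += 1e-3                 # stand-in for the real time step
        if step % CHECKPOINT_EVERY == 0:
            # Everyone checkpoints the same step so a restart is globally consistent.
            comm.Barrier()
            np.save(f"ckpt_rank{rank}_step{step}.npy", state)

    # On failure, the whole job is killed and relaunched; each rank then reloads
    # its latest ckpt_rank*_step*.npy and the simulation resumes from that step.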

Note that it is not possible to locally reconstruct the state of the process that failed because there just isn't enough information unless you store all incoming messages, which can easily be a continuous gigabyte per second per node. Even if you had a fantastic storage system that gave you ring storage with bandwidth similar to the network, local reconstruction wouldn't do much good because the rest of the simulation could not continue until the replacement node had caught up, so the whole machine would be idle during this time. If you have another job waiting to run, you could at least get something done, but the resource management implications of such a strategy are not a clear win.

So if hardware failure becomes so frequent that global checkpointing is too costly, the only practical recourse is to compute everything redundantly. Nobody in scientific computing is willing to pay double to avoid occasionally having to restart from the last checkpoint, so this strategy has not taken off (though there are prototype implementations).


There are three reasons why that approach tends to work well in scientific computing circles.

First off, individual computational errors are rarely important. For example, when simulating galactic evolution, as long as all the velocities stay reasonable, each individual calculation is fairly unimportant, and bounds checking is fairly inexpensive.

Second, there is a minimal time constraint: losing 30 minutes of simulation time a day is a reasonable sacrifice for gaining efficiency in other areas.

Third, computational resources tend to be expensive, local, and fixed: a Cray Jaguar, not all those spare CPU cycles running Folding@home.

However, if you’re running VISA or World of Warcraft then you get a different set of optimizations.


First off individual computational errors are rarely important.

This is completely wrong. Arithmetic errors (or memory errors) tend not to just mess up insignificant bits. If it occurs on integer data, then your program will probably seg-fault (because integers are usually indices into arrays), and if it occurs in floating point data, you are likely to either produce a tiny value (perhaps making a system singular) or a very large one. If you are solving an aerodynamic problem and compute a pressure of 10^80, then you might as well have a supernova on the leading edge of your wing. And flipping a minor bit in a structural dynamics simulation could easily be the difference between the building standing and falling.
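
To see why a single flipped bit is catastrophic rather than a rounding-level nuisance, here's a small Python illustration (the choice of bits is arbitrary):

    import struct

    def flip_bit(x: float, bit: int) -> float:
        # Reinterpret the float as 64 raw bits, flip one of them, reinterpret back.
        (bits,) = struct.unpack("<Q", struct.pack("<d", x))
        (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
        return y

    pressure = 101325.0              # roughly 1 atm, in pascals
    print(flip_bit(pressure, 3))     # low mantissa bit: still ~101325.0000000001
    print(flip_bit(pressure, 62))    # top exponent bit: ~6e-304, the pressure effectively vanishes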

I would argue that data mining is actually more tolerant of such undetected errors because they are more likely to remain local and may stand out as obviously erroneous. People are unlikely to die as the result of an arithmetic error in data mining.

Second, there is a minimal time constraint,

There is not usually a real-time requirement, though there are exceptions, e.g. http://spie.org/x30406.xml, or search for "real-time PDE-constrained optimization". But by and large, we are willing to wait for restart from a checkpoint rather than sacrifice a huge amount of performance to get continual uptime. If you need guaranteed uptime, then there is no choice but to run everything redundantly, and that still won't allow you to handle a nontrivial network partition gracefully. (It's not a database, but there is something like the CAP Theorem here.)


For the most part you can keep things reasonable with bounds checking. If the pressure in some area is 10x that of the areas around it, then there was a mistake in the calculation. If you're simulating weather patterns on Earth over time, having a single square mile 10 degrees above what it should be is not going to do horrible things to the model. Clearly there are tipping points, but if the system is that close to a tipping point, the outcome is fairly random anyway.

Anyway, if you couldn't do this and accuracy were important, then you really would need to double-check every calculation, because there would be no other way to tell whether you had made a mistake.
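
A rough sketch of that kind of bounds check with numpy (purely illustrative; the grid, the injected error, and the 10x threshold are all made up):

    import numpy as np

    np.random.seed(0)
    pressure = np.random.uniform(90_000, 110_000, size=(64, 64))
    pressure[10, 10] *= 100           # inject a bad value, as if from a flipped bit

    # Mean of the four neighbours of every interior cell.
    neighbours = (pressure[:-2, 1:-1] + pressure[2:, 1:-1] +
                  pressure[1:-1, :-2] + pressure[1:-1, 2:]) / 4.0

    # Flag interior cells that are 10x above or below their surroundings.
    ratio = pressure[1:-1, 1:-1] / neighbours
    suspect = np.argwhere((ratio > 10) | (ratio < 0.1))
    print(suspect + 1)                # the bad cell at (10, 10), plus (with this crude test) its four neighbours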



When I was at CERN 10 years ago, the setup was similar. Instead of Macs, the desktops were PCs. The code was Fortran, Perl and C++. All work was on remote Linux boxes (typically exporting a desktop over X), although these were physical rather than virtual OS instances. You'd ssh to a gateway IP and be logged into one of the hosts. All data and user accounts were stored on AFS, a distributed filesystem. There was job scheduling and management middleware that would batch-process your analysis across hundreds of hosts in the datacentre.

"The cloud" has been around for a while, although implementation has changed.


Yahoo uses a similar setup. Instead of virtual hosts, the hosts here are physical machines. The search team had a dedicated lab (not sure what's happening now with the search migration), and all developers would ssh to their assigned lab machines, fire up vim/emacs, and crank out code.

The search team has a mixture of Windows laptops, Macs, Linux desktops, and Linux laptops, and it didn't matter because everyone just ssh'd to a RHEL box and worked.

Yahoo used to be a BSD shop, then moved to Linux. The search team used both Debian and RHEL in the beginning, then transitioned to RHEL full-time.

I don't understand the obsession with RHEL. In my experience, Debian/Ubuntu will give the same benefits as RHEL sans the cost.

The rationale for moving from BSD to Linux was practical. The gist was that most vendors either don't do BSD or treat BSD as a second-class citizen, and BSD wasn't as quick as Linux in catching up with esoteric hardware. But discarding Debian didn't actually make sense to me.


Yahoo used to be a BSD shop, then moved to linux.

Yahoo still uses BSD. They're not completely BSD, of course; but they're not completely linux either.


Yep, Yahoo still uses BSD, but earlier a lot of work was put into BSD. There were BSD kernel teams, and teams which managed BSD ports. AFAIK, almost all of the new development is on RHEL, and there aren't any dedicated BSD teams now. A lot of BSD boxes are still around; some have been running for the last 10 years or so.

Some new teams use BSD for their webapp deployment and database servers; I can't confirm (no links), but I think those teams were of the opinion that MySQL on BSD was more stable.

But as I said, Yahoo was earlier big-time BSD, investing in it and submitting patches, which is no longer the case. The search team did a mass migration to RHEL about 2 years back. BSD is still used here and there as a deployment platform. My personal preference would be to go with BSD and save some bucks, but apparently management views it differently.


I've been a FreeBSD developer since 2003 and I'm on the FreeBSD core team. I know the history of Yahoo and FreeBSD. ;-)


I knew about your FreeBSD background. Didn't know you were involved with Yahoo dealings though.


CERN used RH too, 6 when I was there. I believe they've now switched to Scientific Linux.

Actually I don't understand why Google would use virtualisation. Why would they split a physical box up? Making many hosts look like one, yes (is such a thing possible?), but not the other way around.

And AFS was cool: it was actually a global filesystem that many institutions used. You could cd to /, ls, and browse to nasa.gov among many thousands of others (much was protected!). (CERN was mounted on /cern.ch.)


I don't understand why Google would use virtualisation. Why would they split a physical box up?

Ease of administration. It turns out to be much easier to have ten virtual machines running one daemon each than one virtual machine running ten daemons. Fewer interactions, cleaner process boundaries (the boundary between one VM and another is pretty thick), easier debugging. Easier to swap out parts without affecting other parts. Much easier scaling. (If you need more servers for Service X, launch more VMs; if your machine needs more RAM to hold those VMs, launch the VMs on a different physical box -- the system doesn't need to be changed to support that because it's already architected to assume separate "machines" for everything.)

For more on the subject, ask Ezra Zygmuntowicz or see his blog and books. I stole all of this material from him.


It turns out to be much easier to have ten virtual machines running one daemon each

It's funny, when you needed one OS instance per network server daemon on Windows in the 90s, everyone laughed at it. Now it counts as cutting edge thought.


Everyone laughed at it because at the same time there were OSes available that could run an unlimited number of daemons simultaneously.

Now it's a matter of choice for the system owner. And of course all those single-daemon VMs run inside an OS that lets numerous VMs run at the same time.


Not to mention that you can do cool things with virtual machines like live migration and fail-over from one physical machine to another in a fully transparent way.


Would Solaris containers or Linux VServer work as well here? One would need good management tools, of course.


I haven't gotten super in-depth with containers (zones or LDoms), but I don't think they inherently offer nearly as much flexibility as VMware.

Not to mention Solaris 10 can be...interesting.


that's a great response, cheers


>The idea that you need a physical machine close by to get any work done is completely out the window at this point. ... I ssh into the virtual machine to do pretty much everything: edit code, fire off builds, run tests, etc. ...Wide-area network latencies are low enough that this works fine for interactive use, even when I'm at home on my cable modem.

This situation sounds fine if you're a developer and your job requires only low-bandwidth text streams between you and your physical computer. But for anyone who does anything graphical, or for anyone at home who wants to enjoy media, the bandwidth and latency are big problems.

Maybe I should just assume that because this is on HN and written on a developer's blog, context is assumed. But it sure seems like a disconnect between hard programmers and the other 99% of the population.


I am reminded of a William Gibson quote: "The future is here, it's just not evenly distributed."


Huh? His "dumb terminal" is a Mac laptop that will run most less-than-Google-sized tasks on a single machine...


Even if you are avoiding Google-sized tasks, there turn out to be advantages to treating your local machine as a dumb terminal.

One is that remote machines can be rented instead of being bought.

Another is that the configuration of everything becomes easier. You buy new laptops by taking them out of the box, running a handful of app installers, and moving over an SSH key or two and an encrypted password file. Your career as an amateur sysadmin might just be over. (If you miss the chance to build your own Linux kernel, of course, there's plenty of work left for you as a cloud sysadmin. But cloud system administration is also usually easier, because machines in data centers come with nifty administration tools, like SANs and dashboards, that benefit from economies of scale.)

Security: When you lose the "dumb terminal" you lose no data. You do lose the keys that are on the machine, but you can revoke those. And it is much, much easier to get consistent automated backups on a datacenter box than on a laptop.

There are definite problems with the "cloud" approach, such as "who owns the data and how do you trust it is being handled properly", but the advantages are strong these days.


It's more about convenience than resources. For example, I'm very happy that I don't have to carry my laptop with me anymore, as I can always reconnect to my virtual machine from home and continue at the same place where I left off.


I do all of my development work on a virtual Linux machine running in a datacenter somewhere -- I am not sure exactly where, not that it matters. I ssh into the virtual machine to do pretty much everything: edit code, fire off builds, run tests, etc. The systems I build are running in various datacenters and I rarely notice or care where they are physically located.

I love moments like this when you realize as a developer that the bits that are the fruit of your labor are off executing in some far off physical location you'll never know about.


This guy is a total lamer!




