
Finding a problem at the bottom of the Google stack
https://cloud.google.com/blog/products/management-tools/sre-keeps-digging-to-prevent-problems
======
londons_explore
A small tilt like that wouldn't impact most water cooling systems. Most
cooling systems would run with a 45 degree tilt, and some are completely
sealed and would work any way up.

I suspect this _isn't_ a water cooling problem, but instead a heat pipe
system, with a phase-change material inside (often butane). Heat pipes are used
to conduct heat from CPUs to heatsinks in most laptops. They contain a liquid
which boils, condenses at the other end of the pipe, and then flows
back again as liquid. Heat pipes usually look like a thick bar of copper, but
are in fact way more thermally conductive than copper.

The inside of the pipe usually has a felt-like material to 'wick' water from
the wet end to the dry end, but wicking is quite slow compared to guaranteeing
the pipe is perfectly level and using gravity to just let the liquid flow
downhill.

I'm 99% sure that's the reason this system doesn't work with a slight slope.

~~~
hinkley
So that weird laptop stand that holds your laptop up at an angle... more
cooling from below, but maybe the heat pipes aren’t as happy?

------
tgtweak
We had a very similar situation recently where the colocation facility had
replaced some missing block-out panels in a rack and it caused the top-of-rack
switches to recirculate hot air... the system temps of both switches were north of
100°C and to their credit (dell/force10 s4820Ts) they ran flawlessly and
didn't degrade any traffic, sending appropriate notices to the reporting
email.

Something as benign as that can take out an entire infra if unchecked.

I've seen racking fail in the past (usually someone loading a full 42U with
high density disk systems, putting the rack over its weight capacity by a notable
factor) and it is definitely a disaster situation. One datacenter manager
recounted a story of a rack falling through the raised flooring back in the
days when that was the standard (surprise - it kept running until noticed by a
tech on a walkthrough).

Good story but comes across as back-patting a bit.

~~~
hinkley
Last time this line of discussion came up someone confessed that they
discovered they’d accidentally made an exhaust fan into an intake fan thereby
blowing hot air across their cpu into their drives and nearly baking them.

~~~
dfox
Most rackmount switches are not really designed to be used in a datacenter rack,
as they do not have front-to-back airflow (and even if they did,
networking hardware tends to have all its connectors on the wrong side of the box).
But almost everyone routinely does that anyway, which means you need to know that
installing all the air baffles would lead to exactly this issue.

~~~
dnautics
> Most rackmount switches are not really designed to be used in datacenter
> rack as they do not have front-to-back airflow

It's my understanding that switchable fans are standard on cisco, arista, and
mellanox switches (have no idea about juniper) and besides, front-to-back or
back-to-front depends on whether you mount in the back or in the front (both
are valid) AND even if you go the "wrong way" patch panels are pretty easy to
find and are cheap (at the expense of 1U)

~~~
dfox
On high-end-ish switches that are truly designed for this application. And
having switchable fans will not help you with a device that has left-to-right
(or even weirder) airflow.

------
m0zg
The most epic bug I've seen investigated at Google was a CPU bug in the CPUs
that Intel made specifically for Google at the time. Only some chips had the
bug. And only some cores on those chips had it, and to make matters worse it'd
only manifest non-deterministically. I don't recall the full details, nor
would I share them even if I did, but what struck me is that this mindset ("at
Google one-in-a-million events happen all the time") and the software engineering
mindset that goes with it (checksumming all over the place, overbuilding for
fault tolerance and data integrity) were basically the only reason this bug
was identified and fixed. In a company with a lesser calibre of engineering
staff it would linger forever, silently corrupting customer data, and the
company would be totally powerless to do anything about it. In fact they
probably wouldn't even notice it until it was far too late. At Google, not only
did they notice it right away, they were also able to track it all the way down
to the hardware (something that software engineers typically consider infallible)
in a fairly short amount of time.

------
Nextgrid
I wonder why it took the actual software running on the machine failing, with
user-facing consequences, for them to notice that something was wrong. With all
that bragging about how good they are, why
didn't they have alerts that would let them know the temperature was higher
than normal _before_ it got to a level critical enough to affect software
operation?

~~~
Forge36
> an abnormally high number of errors

I wonder what the initial errors being logged were. Based on the post:

> These errors indicated CPU throttling

It sounds like the first tool is "show error counts".

It might have been "production is slower than normal, please investigate"

Your suggestion sounds like a good way to flag future issues ("1 in 50 racks
for X is running a temp"), but I'm curious what the false positive rate
would be. What if it isn't causing problems? Maybe you've spotted smoke.

~~~
jldugger
If BGP is busted, those errors were probably stuff like connection timeouts on
services running inside borg/k8s.

But like, it's not impossible to measure temp on devices. It's entirely
possible to design a temp monitor that tracks the physical grouping of
equipment: datacenter -> aisle -> rack -> server
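
A minimal sketch of what that kind of grouping might look like (the names,
readings, and threshold here are invented for illustration, not anything
Google actually runs):

    # Hypothetical sketch: roll per-server temperature readings up the
    # physical hierarchy so an alert can point at a whole rack, not just
    # one machine.
    from collections import defaultdict
    from statistics import mean

    # (datacenter, aisle, rack, server, temp_celsius)
    readings = [
        ("dc1", "aisle3", "rack07", "srv01", 71.0),
        ("dc1", "aisle3", "rack07", "srv02", 69.5),
        ("dc1", "aisle3", "rack08", "srv11", 44.2),
    ]

    by_rack = defaultdict(list)
    for dc, aisle, rack, server, temp in readings:
        by_rack[(dc, aisle, rack)].append(temp)

    RACK_TEMP_LIMIT_C = 60.0  # made-up threshold
    for (dc, aisle, rack), temps in by_rack.items():
        if mean(temps) > RACK_TEMP_LIMIT_C:
            print(f"hot rack: {dc}/{aisle}/{rack} avg={mean(temps):.1f}C")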

The real question is one of prioritization: if a rack's temp has risen but
there's no customer error, is it an immediate problem?

~~~
cowsandmilk
> if a rack's temp has risen but there's no customer error, is it an
> immediate problem?

Here, the problem was on a machine running a google application, so they
noticed. But this is a post on the google cloud blog. This just makes me think
that Google isn’t monitoring the health of the hardware they provide to
customers in the cloud. It is a change you have to make when you change the
layer at which you are providing services to customers. If I’m using the
google maps web site, I don’t care if they are monitoring cpu temperature if
layers above insulate me from impact. If I’m spinning up a virtual machine, I
will be directly impacted.

~~~
jldugger
Netflix keynotes have described how entire AWS AZs can and will go offline, and
how to induce failures to exercise recovery paths, so why is the evaluation
criterion for GCP 'my single-point-of-failure VM can't ever go down'?

~~~
AlphaSite
Most companies are not Netflix and I’m not sure I understand why we are
discussing AWS?

That post is more a statement on how errors which can be handled at the app
layer can have catastrophic effects on lower-level components. You cannot
assume end customers are running things at the scale, or with the fault
tolerance, of Google or Netflix.

~~~
jldugger
Most companies are not Netflix, but all cloud customers can learn from their
public design discussions. The only reason I mentioned AWS is that Netflix is
a high-profile AWS customer, and their lessons in cloud architecture
apply pretty cleanly to GCP. You cannot assume an SLA of 100 percent, even if
it works out that way on shorter time scales. It's really no different from
running your own datacenter, so I don't know where this 'monitoring will
prevent catastrophe' fight is coming from.

> You cannot assume end customers are running things at the scale, or with the
> fault tolerance, of Google or Netflix.

Correct, but there's a gradient between 'we have 10 copies of the service in
10 different countries and use Akamai GTM in case of outage' and Dave's
one-off VM. One-off VMs are fine if you know what you're getting into, and I use
that setup for my personal, low-stakes & zero-revenue website. But if you are a
paying cloud customer, it makes sense to pay attention to availability zones
regardless of scale.

And sure, there might be a market somewhere for a more durable VM setup. At a
past non-profit job we provided customers the illusion of a single HA VM using
Ganeti ([http://www.ganeti.org/](http://www.ganeti.org/)). But it's not clear
to me that the segment is viable -- customers at the low and top end don't
need the HA.

------
dilippkumar
This was a fun story. Could've been told in a more engaging way, but still a
good read.

I can't imagine a better way the phrase "bottom of the Google stack" could
have been used. That phrase can now be retired.

------
SahAssar
> In this event, an SRE on the traffic and load balancing team was alerted
> that some GFEs (Google front ends) in Google's edge network, which
> statelessly cache frequently accessed content, were producing an abnormally
> high number of errors

What does "statelessly cache" mean? Stateless means it has no state, and cache
means it saves frequently requested operations. How can it save anything
without state?

~~~
gouggoug
It's stateless in the sense that it isn't persisted to storage and isn't an
issue if lost.
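
In other words, losing the cache only costs you a round trip to the origin,
never correctness. A toy sketch of that property (names are hypothetical, this
is obviously not GFE code):

    # An in-memory cache whose entire state can vanish without breaking
    # anything: a miss simply falls through to the authoritative origin.
    cache = {}

    def fetch_from_origin(key):
        # stand-in for the backend that actually owns the data
        return f"content for {key}"

    def get(key):
        if key not in cache:       # cold or wiped cache? just refill it
            cache[key] = fetch_from_origin(key)
        return cache[key]

    print(get("/index.html"))
    cache.clear()                  # "losing" all cache state is harmless
    print(get("/index.html"))      # same answer, just slightly slower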

~~~
am44
Any cache should have this property

------
nwallin
This is great work, but I feel like the wrong thing triggered the alert.

The temperature spike should have been the first alert. The throttling should
have been the second alert. The high error count should have been third.

If you're thermal throttling, you have many problems that could give all
sorts of puzzling indications.

~~~
thethethethe
> The temperature spike should have been the first alert. The throttling
> should have been the second alert. The high error count should have been
> third.

This is completely backwards. What you are describing is cause-based
alerting, which is strongly discouraged by SRE. SREs prefer symptom-based
alerting (i.e. are users seeing errors?) because you only get alerted when you
know that there is a problem that affects your business. If you only have
cause-based alerts, you will get false alarms and un-actionable pages all of
the time.

Imagine being an SRE who manages the GFEs in this situation. Now imagine
getting a page telling you that the temperature in a rack where your job is
running is hot. What are you supposed to do? Is the issue affecting users? If
so, how many? Is it anomalous? Now go poke around the system and try to figure
out what is wrong. What would be the first thing you check? Probably error
rate/ratio, the thing that you actually care about. If that is your workflow,
why not just alert on your error rate in the first place and figure out the
rest as you go along?
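
A minimal sketch of the difference, with made-up numbers and thresholds (not
any real SRE config): page on the user-visible error ratio, and let
cause-level signals like temperature feed dashboards and tickets instead.

    # Symptom-based alerting sketch: page only when the error ratio users
    # actually experience crosses a threshold.
    def should_page(errors: int, requests: int, threshold: float = 0.01) -> bool:
        if requests == 0:
            return False
        return errors / requests > threshold

    print(should_page(errors=5, requests=10_000))    # False: users are fine
    print(should_page(errors=500, requests=10_000))  # True: users are hurting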

Edit: typos

~~~
closeparen
"The temperature is on a trajectory to breach the machine's operating limit in
N minutes" seems pretty actionable.

I follow Google's symptom-based alerting philosophy in general, but will make
an exception when there's a chance to catch something getting dangerously
close to an unambiguous hard-failure limit (e.g. 90% quota utilization).
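
A rough sketch of that kind of trajectory alert, assuming simple linear
extrapolation over recent samples (the limit and readings below are invented):

    # Estimate minutes until temperature breaches the operating limit.
    def minutes_to_breach(samples, limit_c):
        """samples: list of (minute, temp_c) pairs, oldest first."""
        (t0, y0), (t1, y1) = samples[0], samples[-1]
        slope = (y1 - y0) / (t1 - t0)   # degrees C per minute
        if slope <= 0:
            return None                 # not trending toward the limit
        return (limit_c - y1) / slope

    samples = [(0, 62.0), (5, 66.5), (10, 71.0)]
    eta = minutes_to_breach(samples, limit_c=85.0)
    if eta is not None and eta < 30:
        print(f"temperature on track to breach limit in ~{eta:.0f} min")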

~~~
thethethethe
> "The temperature is on a trajectory to breach the machine's operating limit
> in N minutes" seems pretty actionable.

Do you really think that the folks in the Google datacenters are not
monitoring rack temp? Implementing symptom-based alerting does not mean you
should not be monitoring other system metrics. Their monitoring system
probably filed a P1 or P2 ticket for someone to go take a look at it at some
point. But should a person be paged at 2 A.M. to repair this? Absolutely not.

------
nitwit005
Only hours later did I realize the title of this was a pun.

------
nn3
So they didn't monitor the temperature of all systems by default to catch cooling
problems?

Sounds more embarrassing to me.

Those are fairly basic mistakes.

~~~
gnarbarian
Hindsight is 20/20. And there is at least some sort of survivorship bias.

In a huge system with lots of failover you won't notice all the contingencies
they identified, planned for, and successfully mitigated. Then something breaks
which you thought would be covered by another redundancy, or was covered
until a recent change (a firmware change to the HVAC system), or which you
didn't plan for, etc...

And everyone points at the failure and says "that seems obvious", meanwhile
the mountain of tests and monitoring and redundancy goes unnoticed.

~~~
alexandercrohde
Well, to that point: at the very least the SRE post-mortem should have said "And we
considered distributing monitoring software to all our datacenters for future
alerting"

[Not to say that every company out there would do this, nor that this wasn't
necessarily considered, but definitely questioning the merit/objectivity of a
brag-piece that is trying to rebuild the waning google hype]

------
thedance
The most interesting part of this piece is how it strongly implies that the
BGP announcement thing in question is single-threaded. The "overloaded"
machine is showing utilization just below 1 CPU, while the normal ones are
loafing along at < 10% of 1 CPU.

~~~
Dylan16807
What's interesting about that? Making it parallel is tricky and unnecessary,
so they didn't.

------
perlgeek
It's cool to see them track down the errors like this, but I'd like to point
out some weird things along the way:

  * why do the racks have wheels at all? Doesn't seem like a standard build, and turns out to be risky

  * there should be at least daily checks on the data center, including a visual inspection of cooling systems and the like. I don't know if daily visual inspection of racks is also a thing, but it should find this pretty quickly.

  * monitoring temperatures in a data center is pretty essential, though I must admit I don't know whether rack-level temperature monitoring would have caught the overheating of CPUs in the rack.

~~~
cranekam
> * why do the racks have wheels at all? Doesn't seem like a standard build,
> and turns out to be risky

They probably aren't "standard" in any kind of common sense. Google and the
other big tech firms design their own hardware and likely have it assembled
remotely and delivered as a complete rack. Wheels make positioning it easier.
It's also not uncommon to need to move racks of machines around (say, during a
cluster refit or to change capacity in different DC halls). Not having wheels
is likely more of a pain than the potential gains of omitting them.

> * there should be at least daily checks on the data center, including a
> visual inspection of cooling systems and the like. I don't know if daily
> visual inspection of racks is also a thing, but it should find this pretty
> quickly.

Google has millions of machines and dozens of datacenters. Its techs are
likely already 100% busy replacing HDDs and other routine stuff. Noticing a
single rack failing (within error budget) isn't going to be a priority so it's
unlikely someone is employed to do that. Cooling/power systems are definitely
monitored on a more macro level.

> * monitoring temperatures in a data center is pretty essential, though I
> must admit I don't know whether rack-level temperature monitoring would
> have caught the overheating of CPUs in the rack.

There's likely DC-level temperature monitoring. Again, systems are designed to
cope with some amount of failure so it's not important to monitor everything
and alert someone every time anything changes. When dealing with millions of
things you need to look at aggregations.
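
Aggregation here might mean something as simple as flagging racks that are
outliers against the fleet, rather than alerting on every individual sensor.
A hypothetical sketch (thresholds and data invented):

    # Flag racks whose average temperature deviates sharply from the
    # fleet-wide median, instead of paging on any single hot sensor.
    from statistics import median

    rack_avg_temps = {"rack01": 45.1, "rack02": 44.8, "rack03": 71.3,
                      "rack04": 46.0}

    fleet_median = median(rack_avg_temps.values())
    for rack, temp in rack_avg_temps.items():
        if temp - fleet_median > 15.0:   # made-up deviation threshold
            print(f"{rack} is {temp - fleet_median:.1f}C above fleet median")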

------
CoffeeDregs
This is a fun read but should read as a description of a fairly common method
of developing products and remediating "deviations". A lot of words are spent
on describing what is essentially a root-cause-analysis (see GxP, ISO 9000,
fishbone diagram). Hopefully, you thought "check", "check", "check", as you
read it.

If you thought "couldn't Bob have shimmed it with a block of plywood?", you
might want to read up on continuous improvement. Have Bob put the shim in to
fix the problem right quick, then start up the continuous improvement
engine...

------
williamDafoe
Why doesn't Google have a visualization system for the temperature of the racks,
with monitoring and alarms? It seems the problem went undetected because SREs
did a poor job to begin with... No DC has more than a few tens of thousands
of machines, which is easily handled by 3 borgmon or 100 monarch machines LOL
...

------
aeyes
> All incidents should be novel.

Ah, how nice would it be if every company had infinite engineering capacity.

------
bluedino
Do people not visit their datacenters often enough to notice a tilted rack?

~~~
mrep
The big companies have millions of servers. At that scale, they have software
teams monitoring every part of the datacenter, from temperatures on each
server/rack down to where every drive is and its health. It is also
highly automated, with systems directing technicians on what/where/how to
repair broken parts, so they don't have people just patrolling racks
looking for problems. They also use special badges to access the data centers,
so people generally cannot just visit datacenters; only a select few people
with business justification can.

------
londons_explore
Another way to tell the same story:

"Someone at Google bought cheap casters designed to hold up an office table,
and put them in the bottom of a server rack that weighed half a tonne. They
failed. Tens of thousands of dollars were spent replacing them all"

~~~
johnghanks
Why is everyone here so negative all the time?

~~~
hackmiester
It's hard to say, but I'm feeling somewhat negative about this article too.

A more neutral way to tell it might be: "Some of our servers were overheating.
We took them out of service, then discovered they were overheating because of
an issue where the rack hardware had failed and tilted the rack. We eliminated
this problem from this and all other racks."

The way it's currently written seems to try to set Google apart, but I'm not
sure why it wouldn't work this way at any organization. Maybe I have missed
the point.

------
jeffrallen
Wow, I was really NOT expecting that.

------
rdxm
this is awesome. one of the by-products of the public cloud era is the loss of
having to consider the physical side of things as part of the
operational art.

