Finding a problem at the bottom of the Google stack (cloud.google.com)
325 points by 9nGQluzmnq3M on March 15, 2020 | 123 comments



A small tilt like that wouldn't impact most water cooling systems. Most cooling systems would run with a 45 degree tilt, and some are completely sealed and would work any way up.

I suspect this isn't a water cooling problem, but instead a heat pipe system, with a phase-change working fluid inside (often butane). Heat pipes are used to conduct heat from CPUs to heatsinks in most laptops. They contain a liquid which boils at the hot end, condenses at the other end of the pipe, and then flows back again as liquid. Heat pipes usually look like a thick bar of copper, but are in fact far more thermally conductive than copper.

The inside of the pipe usually has a felt-like material to 'wick' the liquid from the wet end back to the dry end, but wicking is quite slow compared to keeping the pipe properly level and just letting gravity carry the liquid back downhill.

I'm 99% sure that's the reason this system doesn't work with a slight slope.


Interestingly, if you have a heat pipe with a bend in the middle (most laptop heat pipes have a bend), then if you orient the laptop so the bent bit is down, you can make a 'liquid lock' so vapor can't flow from one end of the pipe to the other. If that happens, the heat pipe pretty much stops conducting heat, and you'll see far worse thermal performance.


So that weird laptop stand that holds your laptop up at an angle... more cooling from below, but maybe the heat pipes aren’t as happy?


We had a very similar situation recently where the colocation facility had replaced some missing block-out panels in a rack and it caused the top-of-rack switches to recycle hot air... the system temps of both switches were north of 100°C and to their credit (dell/force10 s4820Ts) they ran flawlessly and didn't degrade any traffic, sending appropriate notices to the reporting email.

Something as benign as that can take out an entire infra if unchecked.

I've seen racking fail in the past (usually someone loading a full 42U with high density disk systems putting the rack over weight capacity by a notable factor) and it is definitely a disaster situation. One datacenter manager recounted a story of a rack falling through the raised flooring back in the days when that was the standard (surprise - it kept running until noticed by a tech on a walkthrough).

Good story but comes across as back-patting a bit.


Last time this line of discussion came up someone confessed that they discovered they’d accidentally made an exhaust fan into an intake fan thereby blowing hot air across their cpu into their drives and nearly baking them.


Most rackmount switches are not really designed to be used in a datacenter rack, as they do not have front-to-back airflow (and even if they did, networking hardware tends to have all the connectors on the wrong side of the box). But almost everyone routinely does it anyway, which means you need to know that installing all the air baffles would lead to exactly this issue.


> Most rackmount switches are not really designed to be used in a datacenter rack, as they do not have front-to-back airflow

It's my understanding that switchable fans are standard on Cisco, Arista, and Mellanox switches (I have no idea about Juniper). And besides, front-to-back or back-to-front depends on whether you mount in the back or in the front (both are valid), AND even if you go the "wrong way", patch panels are easy to find and cheap (at the expense of 1U).


Only on high-end-ish switches that are truly designed for this application. And having switchable fans will not help you with a device that has left-to-right (or even weirder) airflow.


It's been too long since I've been into a server room.

It was not uncommon to see half-depth switches in the era prior to 'hot aisles' and I definitely recall them pulling air in directions that you might not like. They were never deep enough to siphon air from the front of the rack.

If you have a switch fully populated rail-to-rail with ethernet jacks, it's not entirely clear to me where you should route air in a hot/cold aisle setup.


What's wrong with back-patting? They did a good job.


There was an air of condescension, as if no one else had ever figured out how to run a data center before.


It didn't read that way at all to me, what wording leads you to that impression?


The most epic bug I've seen investigated at Google was a CPU bug in the CPUs that Intel made specifically for Google at the time. Only some chips had the bug. And only some cores on those chips had it, and to make matters worse it'd only manifest non-deterministically. I don't recall the full details, nor would I share them even if I did, but what struck me is that this mindset ("at Google one-in-a-million events happen all the time") and the software engineering mindset that goes with it (checksumming all over the place, overbuilding for fault tolerance and data integrity) were basically the only reason why this bug was identified and fixed. In a company with a lesser quality of engineering staff it would linger forever, silently corrupting customer data, and the company would be totally powerless to do anything about it. In fact they probably wouldn't even notice it until it's far too late. At Google, not only did they notice it right away, they were also able to track it all the way down to hardware (something that software engineers typically consider infallible) in a fairly short amount of time.


I wonder why it took as far as having the actual software running on the machine to fail and have user-facing consequences for them to notice that something was wrong. With all that bragging about how good they are, why didn't they have alerts that would let them know the temperature was higher than normal before it got to a level critical enough to affect software operation?


Because with tens or hundreds of thousands of racks it would be a full-time job to go through such alerts. Some are likely real (hi broken wheels) and some almost certainly aren't (hi someone training an ML model). And users don't care: as a user I don't care if a host or rack is hot. All I care about is that when someone (an SRE or automation) notices errors, the rack is drained and the errors go away. At somewhere small where an individual rack failing isn't okay it might make sense to monitor physical health more closely; where it doesn't matter, it's better to focus on whether the user is having a good time.

As someone else said, alert on symptoms (user-facing errors), not causes, because you're never going to be able to enumerate all the potential causes ahead of time and dealing with the alert onslaught from trying is unworkable.
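
To make "alert on symptoms" concrete, here is a minimal sketch of what a symptom-based paging check could look like, assuming made-up metric names, window, and error-budget threshold (illustrative only, not Google's actual tooling):

    # Minimal sketch of symptom-based alerting: page a human only when the
    # user-visible error ratio breaches the SLO, whatever the cause is.
    # Metric names, the 5-minute window, and the budget are assumptions.

    def error_ratio(errors_5m: float, requests_5m: float) -> float:
        """Fraction of requests in the last 5 minutes that failed."""
        return errors_5m / requests_5m if requests_5m else 0.0

    def should_page(errors_5m: float, requests_5m: float,
                    slo_error_budget: float = 0.001) -> bool:
        # Causes (temperature, throttling, broken wheels) never page directly;
        # they only matter once they show up as user-facing errors.
        return error_ratio(errors_5m, requests_5m) > slo_error_budget

    # Example: 5,000 errors out of 2,000,000 requests is 0.25% > 0.1% budget.
    assert should_page(5_000, 2_000_000)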


> Because with tens or hundreds of thousands of racks it would be a full-time job to go through such alerts.

Sorry, I'm merely a lowly sysadmin and not a godly Google SRE, but environment monitoring is a solved problem, humans need not apply.


> Sorry, I'm merely a lowly sysadmin and not a godly Google SRE

Don't really like how this is framed - plenty of Google SREs are born and bred sysadmins.


My comment was a bit flippant I agree, and I know Google is a huge company so there are a great many people I have never interacted with.

That said; the google approach _does_ appear to quite literally throw away institutional knowledge from high throughput/high uptime and automation focused systems engineering.

I don't discount that Google SREs are highly intelligent and often know systems internals to the same or a greater extent than high-performing sysadmins. But there does appear to be a desire to discredit anything seen as "traditional", and I think that can be harmful, like in this case.

Sometimes they know better, but there is a failing to ask the question: “what does it cost to only act on symptoms not causes”


> Sometimes they know better, but there is a failing to ask the question: “what does it cost to only act on symptoms not causes”

I imagine hardware breaking is the cost -- but as long as an engineer's time is worth more than the hardware, that tradeoff will be made.


Debugging and replacing is far more time consuming and costly, both in hardware money and engineer money.


He's talking about after you filter them with automation.

When you have enough machines, 10 of them are always running hot.


The "hot" we're talking here is beyond what the machines are specced for and suggests something is wrong (in this case the rack was tilted and it was messing with the cooling system). That should never happen under normal operation, otherwise it suggests a bigger problem (either the whole building is too hot, or there's a design flaw in the cooling system).


Lots of things that should "never" happen start to happen with frightening regularity once you have enough machines.

Few of them are worth paging someone.


Those are two extremes.

For me, when a machine reports as abnormal it will decommission itself and file a ticket to be looked at. We have enough extra capacity that it's OK to do this for 5% of compute before we have issues, and we have an alert if it gets close to that.

If you're doing anything other than that, then either your machines are abnormal but still serving, which is scary because they're now in an unknown state, or you're able to just throw out anything that seems weird, which might also hide issues with the environment.
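
As a rough sketch of that policy (all of the helper names here are hypothetical, and the 5% guard is just the figure from this comment):

    # Sketch of the "abnormal machines drain themselves" policy described
    # above. Every helper used here is a hypothetical placeholder.

    DRAIN_CAPACITY_LIMIT = 0.05  # never drain more than 5% of the fleet

    def handle_health_report(machine, fleet):
        if not machine.is_abnormal():
            return
        if fleet.drained_count() / fleet.size() >= DRAIN_CAPACITY_LIMIT:
            # Too much capacity is already out of service; alert a human
            # instead of silently eating more of the fleet.
            fleet.alert("drain capacity limit reached")
            return
        machine.drain()                              # stop scheduling work here
        fleet.file_ticket(machine, machine.abnormal_reason())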


Whether you page someone or not, a machine hitting thermal throttling repeatedly should be just as notable as the abnormal failure rate that started the investigation.


> When you have enough machines, 10 of them are always running hot.

Sure, but isn't there rack-level topology information that should warn when all of the elements in a rack are complaining? That should statistically be a pretty rare occurrence, except over short periods, say, < 5-10 hours.


It's Google; they have sufficient ML capacity to filter cases. It might require additional manpower in the short term, but it also lowers the likelihood of infrastructure failure long-term.

What might not make sense, resource-wise, to smaller companies does start to make sense at their scale.


Anomaly detection on timeseries data is surprisingly bad. The data tends to be bad in all the wrong ways -- seasonal, and not normally distributed.

And worse, you have no training data, so any ML model you train simply learns whatever badness is allowed to persist in production as normal.
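
For what it's worth, one common workaround for the seasonality part is to compare each point against the same point one cycle earlier instead of against a global mean; a toy sketch (purely illustrative, not what any particular monitoring system does):

    # Toy sketch: flag a reading as anomalous if it deviates too far from the
    # value one season (here, one week) earlier, rather than from a global
    # mean that seasonality would distort. The tolerance is arbitrary.

    WEEK = 7 * 24 * 60  # samples, assuming one sample per minute

    def seasonal_anomaly(series, idx, tolerance=0.5):
        """True if series[idx] differs from last week's value by more than 50%."""
        if idx < WEEK:
            return False                 # not enough history yet
        baseline = series[idx - WEEK]
        if baseline == 0:
            return series[idx] != 0
        return abs(series[idx] - baseline) / abs(baseline) > tolerance

It doesn't solve the "no labelled training data" problem, though, which is the harder half.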


I worked at a FAANG and every now and then someone proposed or tried to make an anomaly detection system. They never worked well. It’s an extremely difficult thing to get right. Better to just have good monitoring of whether your system is responsive/available/reliable and ways to quickly work out why it’s not.


> We consider these types of failures "within error budget" events. They are expected, accepted, and engineered into the design criteria of our systems. However, they still get tracked down to make sure they aren’t forgotten and accumulated into technical debt


Yes, I understand that application-level exceptions can sometimes be "budgeted for" and a certain amount of them is acceptable.

Hardware metrics on the other hand should IMO be monitored and alerts should be in place for when things go wrong, as the consequences of it could potentially be way more serious.


Things look very different when you have as much hardware as Google does.

(Disclosure: I work for Google)


When the CPU hits the thermal envelope regularly and throttles itself that seems like a serious enough problem. Sure, if it's a normal part of cost optimization, then it won't stand out.


I've never worked at a FAANG, but there is a school of thought that if an error/alert isn't impacting your SLO, you don't care.

As they said, it was within the error budget, which means that they could instead focus on delivering new features. When the errors go out of budget, they stop delivering features and start delivering fixes to bring it back in budget.

It's the kind of attitude you can have at that scale.


It's the kind of attitude you kinda have to have at that scale, otherwise it's very hard to balance your time between shipping features or fixing issues (you end up biasing towards one side or the other for the detriment of the overall product/team).


Right. High temperature and thermal throttling are things that can and should be monitored directly.

Spotting it by digging down from service-level monitoring instead isn't terrible, but it isn't something to be proud of either.


It’s just the org chart being reflected in their stack architecture. There is a group responsible for machine health. That group doesn’t run any services. There is a group that operates GFEs. They don’t own any machines.


Does that imply a SaaS-based revision to Conway's law?

(GFEs are a reverse proxy that terminates TCP for Google. https://landing.google.com/sre/sre-book/chapters/production-... )


maybe something to be proud of by the person who actually did the digging down.


> The SRE tracked the problem all the way from an external, front-end system down to the hardware that holds up the machines.

I agree, this does seem like an incredibly circuitous and inefficient route to diagnosing the issue, as opposed to the hardware team simply receiving an alert that a server was continuously in an overheated state, and then fixing it.


In my experience the breakfix team has no shortage of work to be done relative to staffing. But they have no insight into the relative priorities of things. That's where your customer metrics and in depth investigation helps.

It's entirely possible a server overheating ticket was in the queue so to speak, and the investigation ends up re-prioritizing an alarm that recovered before anyone was on site to fix it.


> an abnormally high number of errors

I wonder what the initial errors being logged were. Based on the post:

> These errors indicated CPU throttling

It sounds like the first tool is "show error counts".

It might have been "production is slower than normal, please investigate"

Your suggestion sounds like a good way to flag future issues "1 in 50 racks for X is running a temp", however I'm curious what the false positive rate would be. What if it isn't causing problems? Maybe you've spotted smoke.


My suggestion is that the on-site people should have alerts about hardware events. Unless it's a failing sensor, there shouldn't be any false positives - if a machine is exceeding its recommended operating temperature it's still not good and warrants an investigation even if it isn't yet causing problems (in fact the point of these alerts is to catch and fix these issues before they cause any user-facing issues).


That I like. From the photos it's also obvious the rack is tipping. I'm curious why someone isn't doing a regular walkthrough. There could also be sounds of some hardware failures.

A remote SRE shouldn't need to monitor the hardware health, an on-site person should have caught this sooner.

I wonder how the person responsible for fixing/replacing the wheels felt about the follow-up


Sure - the issue is we (the public reading this) don't know when this actually physically failed. It could be that the SRE picked up on this before a scheduled walkthrough (which I'm sure happens regularly).

Disclaimer: Google employee. No knowledge of this event beyond the blog post.


If BGP is busted, those errors were probably stuff like connection timeouts on services running inside borg/k8s.

But like, it's not impossible to measure temp on devices. It's entirely possible to design a temp monitor that tracks the physical grouping of equipment: datacenter -> aisle -> rack -> server

The real question is one of prioritization: if a rack's temp has risen but there's no customer error, is it an immediate problem?
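
A minimal sketch of what "tracks the physical grouping" might look like, assuming a simple (datacenter, aisle, rack, server) reading format and made-up thresholds:

    # Sketch: roll per-server temperature readings up the physical hierarchy
    # and flag racks where most machines run hot, which points at a shared
    # cause (cooling, airflow, a tilted rack...). Thresholds are assumptions.

    from collections import defaultdict

    HOT_C = 85.0          # per-server "too hot" threshold
    RACK_FRACTION = 0.75  # flag a rack if >= 75% of its servers are hot

    def hot_racks(readings):
        """readings: iterable of (datacenter, aisle, rack, server, temp_c)."""
        per_rack = defaultdict(list)
        for dc, aisle, rack, _server, temp_c in readings:
            per_rack[(dc, aisle, rack)].append(temp_c >= HOT_C)
        return [loc for loc, flags in per_rack.items()
                if sum(flags) / len(flags) >= RACK_FRACTION]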


> if a rack's temp has risen but there's no customer error, is it an immediate problem?

Here, the problem was on a machine running a google application, so they noticed. But this is a post on the google cloud blog. This just makes me think that Google isn’t monitoring the health of the hardware they provide to customers in the cloud. It is a change you have to make when you change the layer at which you are providing services to customers. If I’m using the google maps web site, I don’t care if they are monitoring cpu temperature if layers above insulate me from impact. If I’m spinning up a virtual machine, I will be directly impacted.


Netflix keynotes described how entire AWS AZs can and will go offline, and how to induce failures to exercise recovery paths, so why is the evaluation criterion for GCP 'my single point of failure VM cannot ever go down'?


Most companies are not Netflix and I’m not sure I understand why we are discussing AWS?

That post is more a statement on how errors which can be handled at the app layer can have catastrophic effects on lower-level components. You cannot assume end customers are running things at the scale or with the fault tolerance of Google, or Netflix.


Most companies are not Netflix but all cloud customers can learn from their public design discussions. The only reason I mentioned AWS is that Netflix is a high-profile AWS customer, and their lessons in cloud architecture apply pretty cleanly to GCP. You cannot assume an SLA of 100 percent, even if it works out that way on shorter time scales. It's really no different than running your own datacenter, so I don't know where this 'monitoring will prevent catastrophe' fight is coming from.

> You cannot assume end customers are running things at the scale or with the fault tolerance of Google, or Netflix.

Correct, but there's a gradient between 'we have 10 copies of the service in 10 different countries and use Akamai GTM in case of outage' and Dave's one-off-VM. One-off VMs are fine if you know what you're getting into, and I use that setup for my personal, lowstakes & zero revenue website. But if you are a paying cloud customer, it makes sense to pay attention to availability zones regardless of scale.

And sure, there might be a market somewhere for a more durable VM setup. At a past non-profit job we provided customers the illusion of a single HA VM using Ganeti (http://www.ganeti.org/). But it's not clear to me that the segment is viable -- customers at the low and top end don't need the HA.


Certainly, but you're making the assumption that Google does assume that.

It's completely possible to have different alerting setups for different binaries, and to consider CPU throttling not to be a page-worthy issue for GFEs, but to be page-worthy for GCP jobs.


> The real question is one of prioritization: if a rack's temp has risen but there's no customer error, is it an immediate problem?

A server CPU that's thermal throttling is about one step less serious than shutting off. While it's not urgent to deal with dead machines, a dying machine should be in the top priority bracket you see on a day-to-day basis.


What's wrong with looking at smoke? Maybe it pays off to put the smoky ones into a separate class, e.g. batch processing, so they are unlikely to impact frontend serving requests.


Probably because it had never happened before. And I bet this is in the "recommendations for the future", or whatever it's called, section of this postmortem.


They probably had throttling alerts back in the Sandy Bridge days (it was common for Linux machines to hit 100% CPU usage but not actually do anything back then because of some kernel bugs), so you'd think they'd have them monitored.


Overheating machines is very common. Even if you only own ten thousand machines it’s pretty much assured you’ve got one machine out there with a cracked heat sink or similar thermal issue.


But at google scale, you'd think they would have something for that. At a previous company I worked at, we had a guy whose entire job was to monitor temperatures throughout all of our data centers. He built a service that collected minute by minute temperature metrics on every server in every datacenter. Once collected it would create automated tickets for data center technicians to investigate when servers and racks got too hot. He even built this cool website where you could view real time maps of each rack and their temperature in every datacenter so that operations teams could visualize the high temperature areas in the data centers.


It's not that they don't know, it's that they don't care. Hot machines go into an attention queue and some tech eventually fixes it. Meanwhile, they serve. As I commented elsewhere, I think it's an effective incentive for software to grow more sophisticated. In fleets that are promptly fixed, there may be a tendency to assume that, for example, all backends are equally suited to receive an RPC. Having a lot of ambient brokenness can be helpful, as Netflix advocates with "chaos monkey".


Because the heat increase would have been pretty much immediate after the wheel failed. This wouldn't have been a gradual degradation.


> Rare, unexpected chains of events happen often. Some have visible impact, but most don't.


Waste heat production scales with system load.


Seems like they don't even have alerts to let them know when they lose emails in your gmail account: https://support.google.com/mail/thread/6187016?hl=en

Still, stuff like this is really hard. It's easy to think of what should have been done in hindsight.


This was a fun story. Could've been told in a more engaging way, but still a good read.

I can't imagine a better way the phrase "bottom of the Google stack" could have been used. That phrase can now be retired.


> In this event, an SRE on the traffic and load balancing team was alerted that some GFEs (Google front ends) in Google's edge network, which statelessly cache frequently accessed content, were producing an abnormally high number of errors

What does "statelessly cache" mean? Stateless means it has no state, and cache means it saves frequently requested operations. How can it save anything without state?


Presumably it caches copies of frequently-requested content on nodes closer to the end-user, rather than always fetching that content from an origin node which might be quite far away (especially on a P2P network). It's "stateless" because stateful requests are always sent directly to the origin node, rather than being intercepted and reproduced from the closer cache.


I don't get how these two can be true at the same time?

> Presumably it caches copies of frequently-requested content on nodes closer to the end-user, rather than always fetching that content from an origin node

> It's "stateless" because stateful requests are always sent directly to the origin node

What is a stateful request versus a stateless one? Every request has some sort of state.


My understanding is it's about application state - cookies, mutations, etc...

If the request is "give me the profile photo of the logged-in user" - the response depends on the stateful question of who is the logged-in user.

If the request is "give me the gmail logo" - the response is identical no matter who's making the request. These can be cached on an edge node without regard to application state.
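
Put differently, the edge only has to decide whether the response depends on who is asking. A toy sketch of that decision (the request shape and helper objects here are made up):

    # Toy sketch: an edge cache serves only requests whose response does not
    # depend on per-user state; everything else goes straight to the origin.

    def is_cacheable_at_edge(request) -> bool:
        if request.method != "GET":
            return False
        if "Authorization" in request.headers or request.cookies:
            return False                 # tied to a user/session: stateful
        return True

    def handle(request, cache, origin):
        if not is_cacheable_at_edge(request):
            return origin.fetch(request)         # stateful: bypass the cache
        response = cache.get(request.url)
        if response is None:
            response = origin.fetch(request)     # miss: fill from the origin
            cache.put(request.url, response)
        return response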


Okay, I understand. I have never heard that being called a stateless cache though, in my experience that is called a public cache. That is also what it is called in the http cache-control standard.

I get what you mean, but calling a cache "stateless" seems like a contradiction in terms to me.


You're talking about HTTP cache control headers, which is a use case that typically avoids storing stateful data in server-side caches. Caches are also used outside of HTTP however, and sometimes those caches can have state (such as expensive database queries). Taking down a cache with state can have a more severe impact on service performance or data consistency.

The blog post mentioned it this way to convey that the cache operation should be relatively simple and generally not return many errors, and that disabling the servers would be low risk (no data loss or severe performance impact).


The cache has state, but it's caching stateless requests.


Typically it means that immutable data is being looked up by hash key. The mapping between the hash key and the data never changes, so the cache is never stale. When there is new data, it's given a different hash key.

Whether or not data is in the cache can change, but if the data is there, the result is always the same.

An example of stateless data would be a video on YouTube or a particular version of a file in git.
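
A tiny sketch of that content-addressed pattern (illustrative only): because the key is derived from the immutable content, a cache hit can never be stale, only absent.

    # Sketch of a content-addressed cache: keys are hashes of immutable
    # content, so a hit is always correct and eviction is always safe.

    import hashlib

    class ContentCache:
        def __init__(self):
            self._store = {}

        @staticmethod
        def key_for(blob: bytes) -> str:
            return hashlib.sha256(blob).hexdigest()

        def put(self, blob: bytes) -> str:
            key = self.key_for(blob)
            self._store[key] = blob
            return key

        def get(self, key: str):
            # Either the exact bytes originally stored, or None.
            return self._store.get(key)

    cache = ContentCache()
    key = cache.put(b"a video segment or a git object")
    assert cache.get(key) == b"a video segment or a git object"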


If the data cannot change and is not gated by AuthZ/AuthN then it is usually called a public cache.

If the data cannot change but is controlled by AuthZ/AuthN then it is pretty clearly not stateless.

I get what you mean, but calling a cache "stateless" seems like a contradiction in terms to me.


Well it isn't "public", since it's only available if you're on Google's internal production network, and in that sense, nothing on the network is public.


It's poorly worded. Obviously the content storage itself is state by definition, but it means non-dynamic (or at least cacheable / "stateless") resources.


My guess is that “stateless” means no durable storage. There’s still state, but in-memory (or maybe transiently stored on disk?) and no guarantee about it being there, perhaps?


It's stateless in the sense that it isn't persisted to storage and isn't an issue if lost.


Any cache should have this property


This is great work, but I feel like the wrong thing triggered the alert.

The temperature spike should have been the first alert. The throttling should have been the second alert. The high error count should have been third.

If you're thermal throttling, you have many problems that could give all sorts of puzzling indications.


> The temperature spike should have been the first alert. The throttling should have been the second alert. The high error count should have been third.

This is completely backwards. What you are describing is cause-based alerting, which is strongly discouraged by SRE. SREs prefer symptom-based alerting (i.e., are users seeing errors) because you only get alerted when you know there is a problem that affects your business. If you only have cause-based alerts, you will get false alarms and un-actionable pages all of the time.

Imagine being an SRE who manages the GFEs in this situation. Now imagine getting a page telling you that the temperature in a rack where your job is running is too high. What are you supposed to do? Is the issue affecting users? If so, how many? Is it anomalous? Now go poke around the system and try to figure out what is wrong. What would be the first thing you check? Probably error rate/ratio, the thing that you actually care about. If that is your workflow, why not just alert on your error rate in the first place and figure out the rest as you go along?

Edit: typos


"The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.

I follow Google's symptom-based alerting philosophy in general, but will make an exception when there's a chance to catch something getting dangerously close to an unambiguous hard-failure limit (e.g. 90% quota utilization).
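
For example, a "will breach the limit in N minutes" check can be as simple as extrapolating the recent trend; a sketch with made-up numbers, not any real monitoring rule:

    # Sketch: estimate minutes until a temperature limit is breached by
    # linearly extrapolating recent samples. All numbers are illustrative.

    def minutes_to_breach(samples, limit_c, sample_interval_min=1.0):
        """samples: recent temperature readings in C, oldest first."""
        if len(samples) < 2:
            return None
        slope = (samples[-1] - samples[0]) / ((len(samples) - 1) * sample_interval_min)
        if slope <= 0:
            return None                  # not trending toward the limit
        return (limit_c - samples[-1]) / slope

    # 70 C rising to 74 C over 4 minutes, limit 90 C: about 16 minutes left.
    eta = minutes_to_breach([70, 71, 72, 73, 74], limit_c=90)
    assert eta is not None and 15 < eta < 17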


> "The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.

Do you really think that the folks in the Google datacenters are not monitoring rack temp? Implementing symptom-based alerting does not mean you should not be monitoring other system metrics. Their monitoring system probably filed a P1 or P2 ticket for someone to go take a look at it at some point. But should a person be paged at 2 A.M. to repair this? Absolutely not.


> "The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.

For who?

For an application, the job will just get rescheduled onto a different machine, and you're N+2 or whatever so no one will notice the restart.

For the datacenter, since (for most jobs) you can assume that no one will care, you don't necessarily monitor for single machines overheating, but for rack- or row-level issues. If a machine keeps overheating, eventually the "nothing runs on this machine for more than an hour without it restarting" symptom-based alerting kicks in and tells someone to look at the machine.


Hardware-based alerts shouldn't go to the SREs, but rather to the people running the datacenter, who are in charge of making sure that broken hardware is replaced.


Yes, and I am sure that they were monitoring this, but is a single rack having issues worth paging someone? Do you think they should immediately send someone out to investigate when a temperature anomaly is detected? Like I said elsewhere, this was likely added to a queue of lower-priority bugs.


Yeah, that makes sense. I fixated on the "SRE" part of your post, rather than the "page in the middle of the night" part. I'm working on being less pedantic, so thanks for reminding me that I've still got work to do. <3


> What you are describing is cause-based alerting, which is strongly discouraged by SRE. SREs prefer symptom-based alerting (i.e., are users seeing errors) because you only get alerted when you know there is a problem that affects your business.

The truth, of course, is somewhere in the middle. Symptom-based alerting is extremely tricky to get right in the first place. What is "within budget" for a service can be a total outage for a customer. A real example from the past: we paged GCS with "error rate within this bucket is high, snapshot restore timeouts", and the reply was "it is within our SLO".

Another situation is when problems in background jobs cause massive outages later. Again, real examples from the past: a zone running out of quota and, on another occasion, a multi-million-dollar bill due to GC not being monitored correctly.


> The truth, of course, is somewhere in the middle.

I agree. As I have stated in other comments, cause-based signals should not page; they should file a ticket, or be displayed alongside a symptom-based alert if there is a correlation. You should be monitoring as much as possible so you can observe your service in real time.

> Symptom-based alerting is extremely tricky to get right in the first place.

Depends on the service. If you are operating an HTTP API, a simple SLI would just be the error ratio of 500s to total requests. It's not perfect, but it is much better than alerting based on CPU percentage or RAM usage for a particular machine. Things can get more complex if your API is k8s-style or you are operating a data plane that is serving customer traffic.

> What is a “within budget” for a service, can be total outage for a customer.

Now you are getting into SLO territory, which is similar to, but separate from, symptom-based alerting. Yes, defining good SLOs is very hard and varies greatly by the type of service you are running.

> another situation is when problems in background jobs will cause massive outages later.

How do you know what signals to alert on? If you knew what causes/signals to alert on, then wouldn't you design your backend jobs not to do the things that would cause an outage? Hindsight is always 20/20. If you are worried that a cause-based signal from your backend _could_ be telling you that there _might_ be a problem in the future but your customers are not seeing errors, it is not business critical and you can just file a ticket and someone can look into it later.


A machine breaking is something you should respond to before it affects users. You don't have to react to high temps, but thermal throttling is just as serious as a dying hard drive.

Imagine if you didn't replace drives in your RAID until it hurt the user!


> Now imagine getting a page telling you that the temperature in a rack where your job is running is too high. What are you supposed to do?

Go and take a look at the rack? You'd have found and fixed the problem immediately in this case.


You are missing the point. The scenario I presented was from the SRE's perspective, not the datacenter workers'.


The SRE can ask the data centre worker to take a look - or better yet the data centre worker would be notified directly and the SRE wouldn’t need to be in the loop.


> The SRE can ask the data centre worker to take a look

This is exactly what the blog post described


Why would you waste time doing that when you don't even know there's a problem?


Why does anyone heed a warning? In order to catch an issue before they get an error.


But the point is that warnings are often not good predictors of problems. There's a good argument that "warning" on anything at all is an anti-pattern, because you're just attracting valuable attention towards something that may or may not be anything to worry about.

Where warnings are useful are when "something is actually now going wrong" - they provide a valuable context that can help an engineer figure out what the issue might be, which is exactly how it worked in this case.


Google's alerting philosophy biases toward paging people only for user-visible symptoms, not potential causes, but there are often still lower-priority ticket-based alerts for things like machines getting too hot.

For rationale, see:

- https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa...

- https://landing.google.com/sre/sre-book/chapters/monitoring-...


That way of prioritising things makes sense, but in this case it meant the on-call SRE and someone from the edge networking team wasted their time diagnosing the hardware problem.

It seems a shame that the blog post wasn't able to follow

« They immediately removed ("drained") the machines from serving, thus eliminating the errors that might result in a degraded state for customers »

with something like « and they were shown a notification that there was a low-priority automatic ticket showing a possible hardware problem on those machines, so they subscribed to that ticket and didn't waste any more time ».


> on-call SRE and someone from the edge networking team wasted their time diagnosing the hardware problem

I highly doubt that the edge oncaller who was paged debugged this issue down to the hardware. The post even said that they worked with other teams to figure this out. Once the SRE figured out it wasn't their problem, they likely moved on to something else.


VMware purchased a monitoring tool which was previously called Alive; I think it's part of vROps now. The tool modelled measured values and created channels for expected ranges. Nothing earth-shaking, but the models would adapt to regular deviations so (as an example) Friday night batch runs didn't trigger CPU alerts for 100% CPU, while going that high on a Tuesday might.

The tool also had the idea of noise built in so it would suggest that in general three monitors could be out of a channel and might not alert unless six were. Still we very rarely turned on alerting out of the tool because so little of the data was actionable.

It was still a huge improvement over the Tivoli system it replaced that had one size fits all alerts which generated tens of thousands of unactionable tickets monthly.

The most interesting thing it caught for me was when a system deviated on the low side of a channel and I discovered one of three DNS servers was no longer answering queries.

So Google seems to have the right of it (at least for their business) by measuring user experience and reacting to those metrics and leaving infrastructure metrics for post mortem.


Historically, google has chosen to leave these problems to the service layer. Their internal systems detect these events but don’t readily remediate them. I think this is great policy, it’s the reason why Google’s stack has excellent, sophisticated client-side load direction and error avoidance.


The most interesting part of this piece is how it strongly implies that the BGP announcement thing in question is single-threaded. The "overloaded" machine is showing utilization just below 1 CPU, while the normal ones are loafing along at < 10% of 1 CPU.


What's interesting about that? Making it parallel is tricky and unnecessary, so they didn't.


Only hours later did I realize the title of this was a pun.


So they didn't monitor temperature of all systems by default to catch cooling problems?

Sounds more embarrassing to me.

Those are fairly basic mistakes.


Would you share what you have in mind?

I (personally) would not page in the middle of the night for a temperature issue. The systems should be resilient enough to handle some cpu throttling. A non-transient issue like this would probably end up in a ticket queue.


Hindsight is 20/20. And there is at least some sort of survivorship bias.

In a huge system with lots of failover you won't notice all the contingencies they identified, planned for, and successfully mitigated. Then something breaks which you thought would be covered by another redundancy, or was covered until a recent change (a firmware change to the HVAC system), or simply wasn't planned for, etc.

And everyone points at the failure and says "that seems obvious", while the mountain of tests, monitoring, and redundancy goes unnoticed.


Well, to the point: at the very least the SRE post-mortem should have said "And we considered distributing monitoring software to all our datacenters for future alerting"

[Not to say that every company out there would do this, nor that this wasn't necessarily considered, but definitely questioning the merit/objectivity of a brag-piece that is trying to rebuild the waning google hype]


Nah, "monitor temperature" isn't a hindsight issue. Temperature is very near the top of the list of things to monitor, and everyone knows it.


It's cool to see them track down the errors like this, but I'd like to point out some weird things along the way:

* why do the racks have wheels at all? Doesn't seem like a standard build, and turns out to be risky

* there should be at least daily checks on the data center, including a visual inspection of cooling systems and the like. I don't know if daily visual inspection of racks is also a thing, but it should find something like this pretty quickly.

* monitoring temperatures in a data center is pretty essential, though I must admit I don't know whether rack-level temperature monitoring would have caught the overheating of CPUs in the rack.


> * why do the racks have wheels at all? Doesn't seem like a standard build, and turns out to be risky

They probably aren't "standard" in any kind of common sense. Google and the other big tech firms design their own hardware and likely have it assembled remotely and delivered as a complete rack. Wheels make positioning it easier. It's also not uncommon to need to move racks of machines around (say, during a cluster refit or to change capacity in different DC halls). Not having wheels is likely more of a pain than the potential gains of omitting them.

> * there should be at least daily checks on the data center, including a visual inspection of cooling systems and the likes. I don't know if daily visual inspection of racks is also a thing, but should find that pretty quickly.

Google has millions of machines and dozens of datacenters. Its techs are likely already 100% busy replacing HDDs and other routine stuff. Noticing a single rack failing (within error budget) isn't going to be a priority so it's unlikely someone is employed to do that. Cooling/power systems are definitely monitored on a more macro level.

> * monitoring temperatures in a data center is pretty essential, though I must admit I don't know enough if rack-level temperature monitoring would have caused overheating of CPUs in the rack.

There's likely DC-level temperature monitoring. Again, systems are designed to cope with some amount of failure so it's not important to monitor everything and alert someone every time anything changes. When dealing with millions of things you need to look at aggregations.


This is a fun read but should read as a description of a fairly common method of developing products and remediating "deviations". A lot of words are spent on describing what is essentially a root-cause-analysis (see GxP, ISO 9000, fishbone diagram). Hopefully, you thought "check", "check", "check", as you read it.

If you thought "couldn't Bob have shimmed it with a block of plywood?", you might want to read up on continuous improvement. Have Bob put the shim in to fix the problem right quick, then start up the continuous-improvement engine...


Why doesn't Google have a visualization system for temperature of the racks, with monitoring and alarms? It seems the problem went undetected because the SREs did a poor job to begin with .... No DC has more than a few tens of thousands of machines, which is easily handled by 3 borgmon or 100 monarch machines LOL ...


> All incidents should be novel.

Ah, how nice would it be if every company had infinite engineering capacity.


Do people do not visit their datacenters often enough to notice a tilted rack?


The big companies have millions of servers. At that scale, they have software teams monitoring every part of the datacenter, from temperatures on each server/rack down to where every drive is and how healthy it is. It is also highly automated, with systems that direct technicians on what/where/how to repair broken parts, so they don't have people just patrolling racks looking for problems. They also have special badges to access the data centers, so people generally cannot just visit; only a select few with business justification can.


nope. My org could miss this for weeks. We would need alerting to tell us. We run a small data center on premises.


Another way to tell the same story:

"Someone at Google bought cheap casters designed to hold up an office table, and put them in the bottom of a server rack that weighed half a tonne. They failed. Tens of thousands of dollars were spent replacing them all"


Way to take all the fun out of diagnosing a problem. THE most fun part of engineering IMO.


Why is everyone here so negative all the time?


It's hard to say, but I'm feeling somewhat negative about this article too.

A more neutral way to tell it might be: "Some of our servers were overheating. We took them out of service, then discovered they were overheating because of an issue where the rack hardware had failed and tilted the rack. We eliminated this problem from this and all other racks."

The way it's currently written seems to try to set Google apart, but I'm not sure why it wouldn't work this way at any organization. Maybe I have missed the point.


At least the story doesn't end with only "And then an RCA was written which is a good practice for any SRE team."


Wow, I was really NOT expecting that.


this is awesome. one of the by-products of the public cloud era is a loss of having to consider the physical side of things when considering operational art.



