
Nice post mortem.

That outage gives GCE at best four 9's of reliability for 2016.
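
For a sense of scale, the downtime budget implied by a given number of nines can be worked out directly. A minimal sketch in Python: the function name `downtime_budget_minutes` is made up for illustration, and it assumes a 366-day year since 2016 is a leap year.

```python
def downtime_budget_minutes(availability: float, days: int = 366) -> float:
    """Minutes of allowed downtime over `days` for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


# Three nines, three and a half nines, four nines.
for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {downtime_budget_minutes(target):.1f} minutes per year")
```

Under those assumptions, four 9's leaves roughly 53 minutes of downtime for the whole year.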




Based on the higher-level status page:

https://status.cloud.google.com/summary

It looks like GCE uptime is well below four 9's of reliability over a sliding one-year window.


Traynor was quoted in a Network World article last year saying they aim for three and a half nines (99.95%). But you need to read the incidents more carefully -- figuring out actual "uptime" is quite hard. Consider the longest-lasting incident:

  "On Tuesday 23 February 2016, for a duration of 
   10 hours and 6 minutes, 7.8% of Google Compute Engine
   projects had reduced quotas.  ...  Any resources that
   were already created were unaffected by this issue."

I'm not sure off the top of my head how I'd compute overall availability numbers from that one. One could try to determine and sum the effects on individual customers, but we can't from the information provided. It's certainly less overall downtime than counting it as a full 10-hour failure, though.
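
One rough way to put a number on it, purely as a sketch: weight the incident's duration by the fraction of projects affected. This assumes every affected project was impacted for the whole window and equally badly, which the incident report does not say. `weighted_downtime_minutes` is a hypothetical helper, not anything Google publishes.

```python
def weighted_downtime_minutes(duration_minutes: float, fraction_affected: float) -> float:
    """Scale an incident's duration by the share of projects it touched (an assumption)."""
    return duration_minutes * fraction_affected


incident_minutes = 10 * 60 + 6  # 10 hours and 6 minutes, from the quoted incident
fraction = 0.078                # 7.8% of projects, from the quoted incident

equivalent = weighted_downtime_minutes(incident_minutes, fraction)
print(f"{equivalent:.1f} equivalent minutes of full outage")
```

That comes to about 47 minutes, versus 606 if the whole incident is counted as a total outage.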


Agreed. It is difficult to tell. But if the bug prevents you from processing (because you can't save the existing results), then it's essentially downtime for new processing. There are also connectivity issues by region and DNS issues. It is difficult to get exact downtime when partial failures are involved.

That said, this is the second major asia-east1 downtime in 90 days:

https://status.cloud.google.com/incident/compute/16002


April's incident is unique: it was the only listed case of a service outage that impacted all of GCE.

The other incidents (as far as I can tell) were service disruptions at the AZ/regional level. Those disruptions don't impact the 9's, as GCE was still available in other regions.



