Google Cloud Is Down

boulos · on June 2, 2019

Disclosure: I work on Google Cloud (but disclaimer, I'm on vacation and so not much use to you!).

We're having what appears to be a serious networking outage. It's disrupting everything, including unfortunately the tooling we usually use to communicate across the company about outages.

There are backup plans, of course, but I wanted to at least come here to say: you're not crazy, nothing is lost (to those concerns downthread), but there is serious packet loss at the least. You'll have to wait for someone actually involved in the incident to say more.

boulos · on June 2, 2019

To clarify something: this outage doesn’t appear to be global, but it is hitting us particularly hard in parts of the US. So for the folks with working VMs in Mumbai, you’re not crazy. But for everyone with sadness in us-central1, the team is on it.

digaozao · on June 2, 2019

It seems global to me. This is really strange compared to AWS. I don't remember an outage there (other than s3) impacting instances or networking globally.

murat124 · on June 2, 2019

You obviously don't recall the early years of AWS. Half of internet would go down for hours.

djsumdog · on June 2, 2019

Back when S3 failures would take town Reddit, parts of Twitter .. Netflix survived because they had additional availability zones. I can remember the bigger names started moving more stuff to their own data centers.

AWS tries to lock people in to specific services now which makes it really difficult to migrate. It also takes a while before you get to the tipping point where hosting your own is more financially viable .. and then if you trying migrating, you're stuck using so many of their services you can't even do cost comparisons.

angstrom · on June 3, 2019

Netflix actually added the additional AZs because of a prior outage that did take them down.

"After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day."

https://www.networkworld.com/article/3178076/why-netflix-did...

aaronblohowiak · on June 3, 2019

We went multi-region as a result of the 2012 inc. source: I now manage the team responsible for performing regional evacuations (shifting traffic and scaling the savior regions).

mkl · on June 3, 2019

That sounds fascinating! How often does your team have to leap into action?

aaronblohowiak · on June 3, 2019

We don’t usually discuss the frequency of unplanned failovers, but I will tell you that we do a planned failover at least every two weeks. The team also uses traffic shaping to perform whole system load tests with production traffic, which happens quarterly.

justinator · on June 3, 2019

Do you do any chaos testing? Seems like it would slot right in, there.

Zobat · on June 3, 2019

I'd say yes. I heard about this tool just a week ago at a developer conference.

https://github.com/Netflix/chaosmonkey

a_t48 · on June 3, 2019

Netflix was a pioneer of chaos testing, right? https://en.m.wikipedia.org/wiki/Chaos_engineering

aaronblohowiak · on June 3, 2019

https://www.oreilly.com/library/view/chaos-engineering/97814... ;)

arainwater · on June 3, 2019

they have invented the term, so probably yes :)

azimuth11 · on June 3, 2019

I think some Google engineers published a free Meap book on service relatability and uptime guarantees. Seemingly counterintuitive, scheduling downtime, without other teams’ prior knowledge, encourages teams to handle outages properly and reduce single points of failure, among other things.

fnord123 · on June 3, 2019

Service Reliability Engineering is on OReilly press. It's a good book. Up there with ZeroMQ and Data Intensive Applications as maybe the best three books from OReilly in the past ten years.

fnord123 · on June 3, 2019

Derp, Site Reliability Engineering.

https://landing.google.com/sre/books/

sulam · on June 2, 2019

I think you’re misremembering about Twitter, which still doesn’t use AWS except for data analytics and cold storage last I heard (2 months ago).

ceejayoz · on June 3, 2019

Avatars were hosted on S3 for a long time, IIRC.

StreamBright · on June 2, 2019

I am not sure if a single S3 outage pushed any big names into their own "datacenter". S3 has still the world record of reliability that you cannot challenge with your inhouse solutions. You can prove it otherwise. I would love to hear a solution that has the same durability, avabiality and scalability as S3.

For the downvoters, please just link here the proof if you disagree.

Here are the S3 numbers: https://aws.amazon.com/s3/sla/

snicker7 · on June 2, 2019

It's not so much AWS vs. in-house. But AWS (or GCP/DO/etc.) vs. multi/hybrid solutions. The latter of which would presumably have lower downtime.

didibus · on June 3, 2019

I don't see why multi/hybrid would have lower downtime. All cloud providers as far as I know, though I know mostly of AWS, already have their services in multiple data-centers and their endpoints in multiple regions. So if you make yourself use more then one of their AZs and Region, you would be just as multi as with your own data center.

zambal · on June 3, 2019

Using a single cloud provider with a multiple region setup won't protect you from some issues in their networking infrastructure, as the subject of this thread supposedly shows.

Although I guess depending on how your own infrastructure is setup, even a multi cloud provider setup won't save you from a network outage like the current Google cloud one.

didibus · on June 4, 2019

Hum, I'm not an expert on Google cloud, but for AWS, regions are completely independent and run their own networking infrastructure. So if you really wanted to tolerate a region infrastructure failure, you could design your app to fail over to another region. There shouldn't be any single point of failure between the regions, at least as far as I know.

solidasparagus · on June 2, 2019

Why would you think that self-managed has lower downtime than AWS using multiple datacenters/regions?

KirinDave · on June 3, 2019

Actually, I imagine that if you could go multi-regional then your self-managed solution may be directly competitive in terms of uptime. The idea that in-house can't be multi-regional is a bit old fashioned in 2019.

StreamBright · on June 3, 2019

For several reasons, most notably: staff, build quality, standards, knowledge of building extremely reliable datacenters. Most of the people who are the most knowledgeable about datacenters also happen to be working for cloud vendors. On the top of that: software. Writing reliable software at scale is a challenge.

ummonk · on June 3, 2019

Multi/hybrid means you use both self managed and AWS datacenters.

dgoldstein0 · on June 3, 2019

Cannot challenge with your own inhouse solutions, you say?

Challenge Accepted... and defeated: https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastr...

but to be fair, storage is core to Dropbox's business... this is not true for most companies.

disclaimer: I work for Dropbox, though not on Magic Pocket.

qes · on June 3, 2019

> For the downvoters, please just link here the proof if you disagree.

> Here are the S3 numbers: https://aws.amazon.com/s3/sla/

99.9%

https://azure.microsoft.com/en-au/support/legal/sla/storage/...

99.99%

ti_ranger · on June 4, 2019

>> Here are the S3 numbers: https://aws.amazon.com/s3/sla/

> 99.9%

(single-region)

There doesn't seem to be an SLA on S3-cross-region-replication configurations, but I am not aware of a multi-region S3 (read) outage, ever.

> https://azure.microsoft.com/en-au/support/legal/sla/storage/....

> 99.99%

99.99% is for "Read Access-Geo Redundant Storage (RA-GRS)"

Their equivalent SLA is the same (99.9% for "Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) Accounts.").

StreamBright · on June 3, 2019

Azure is a cloud solution. The thread is about how a random datacenter with a random solution is better than S3.

ozymandias12 · on June 3, 2019

Wow, he’s comparing the storages SLA of the two biggest cloud services in the world. Pedantic behavior should hurt.

fusl · on June 2, 2019

> For the downvoters, please just link here the proof if you disagree.

https://wasabi.com/

snazz · on June 2, 2019

How can they possibly guarantee eleven nines? Considering I’ve never heard of this company and they offer such crazy-sounding improvements over the big three, it feels like there should be a catch.

ignoramous · on June 2, 2019

11 9s isn't uncommon. AWS S3 does 11 9s (upto 16 9s with cross region replication?) for data durability, too. AFAIK, AWS published papers about their use of formal methods to ascertain bugs from other parts of the system didn't creep in to affect durability/availability guarantees: https://blog.acolyer.org/2014/11/24/use-of-formal-methods-at...

This is a pretty neat and concise read on ObjectStorage in-use at BigTech, in case you're interested: https://maisonbisson.com/post/object-storage-prior-art-and-l...

nullwasamistake · on June 3, 2019

You have to be kidding me. 14 9's is already microseconds a year. Surely below anybody's error bar for whether a service is down or not.

16 9's and aws should easily last as long as the great pyramids without a second worth of outage.

What a joke

agwa · on June 3, 2019

The 16 9's are for durability, not availability. AWS is not saying S3 will never go down; they're saying it will rarely lose your data.

nullwasamistake · on June 3, 2019

This number is still total bullshit. They could lose a few kb and be above that for centuries

deanCommie · on June 3, 2019

It's not about losing a few kb here and there.

It's about losing entire data centers to massive natural disasters once in a century.

jefftk · on June 3, 2019

None of the big cloud providers have unrecoverably lost hosted data yet, despite sorting vast volumes, so this doesn't seem BS to me.

mentat · on June 3, 2019

AWS lost data in Australia a few years ago due to a power outage I believe.

anbop · on June 3, 2019

on EBS, not on S3. EBS has much lower durability guarantees

nullwasamistake · on June 3, 2019

Not losing any data yet doesn't give justification for such absurd numbers

joshuamorton · on June 3, 2019

Those numbers probably aren't as absurd as you think. 16 9s is, I think 10 bytes lost per exabyte-year of data storage.

There's perhaps the additional asterisk of "and we haven't suffered a catastrophic event that entirely puts us out of business". (Which is maybe only terrorist attacks). Because then you're talking about losing data only when cosmic-ray bitflips happen simultaneously in data centers on different continents, which I'd expect doesn't happen too often.

joshuamorton · on June 3, 2019

This is for data loss. 11 9s is like 1 byte lost per terabyte-year or something, which isn't an unreasonable number.

StreamBright · on June 3, 2019

This is why I linked the SLA page which you obviously have not read. There are different numbers for durability and availability.

johnmaguire · on June 2, 2019

For data durability? I believe some AWS offerings also have an SLA of eleven 9's of data durability.

sascha_sl · on June 3, 2019

11 9s of durability, barely two 9s of availability

I'm sure that's okay if you do bulk processing / time-independent analysis, but don't host production assets on wasabi.

StreamBright · on June 3, 2019

I was asking numbers of reliability, durability and availability for a service like S3. What does wasabi has to do with that?

ljm · on June 2, 2019

Always in Virginia, because US-east has always been cheaper.

mooreds · on June 2, 2019

I know a consultant who calls that region us-tirefire-1.

autotune · on June 2, 2019

I and some previous coworkers call it the YOLO region.

thruhiker · on June 2, 2019

The only regions that are more expensive than us-east-1 in the States are GovCloud and us-west-1 (Bay Area). Both us-west-2 (Oregon) and us-east-2 (Ohio) are priced the same as us-east-1.

angstrom · on June 3, 2019

I would probably go with US-EAST-2 just because it's isolated from anything except perhaps a freak Tornado and better situated on the eastern US. Latency to/from there should be near optimal for most eastern US/Canada population.

thruhiker · on June 3, 2019

One caveat with us-east-2 is that it appears to get new features after us-east-1 and us-west-2. You can view the service support by region here: https://aws.amazon.com/about-aws/global-infrastructure/regio....

angstrom · on June 6, 2019

Fair point. It depends on what the project is.

Scoundreller · on June 2, 2019

And for those of us in GST/HST/VAT land, hosting in USA saves us some tax expenditures.

AnssiH · on June 2, 2019

How?

At least in EU services bought from overseas are subject to reverse charge, i.e. self-assessment of VAT (Article 196 of https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02... ).

Though note that if you are an EU AWS customer, you are not buying from outside EU, you are buying from Amazon's EU branches regardless of AWS region. If Amazon has a local branch in your country, they charge you VAT as any local company does. Otherwise you buy from an Amazon branch in another EU country, and you again need to self-assess VAT (reverse charge) per Article 196.

Scoundreller · on June 2, 2019

My experience is with Canadian HST.

Since AWS built a DC in Canada, I’m paying HST on my Route53 expenses, but not on my S3 charges in non-Canadian DCs.

I’m not an HST registrant (small supplier, or if you’re just using services personally), so there’s nothing to self-assess.

Even if self-assessment was required, you get some deferral on paying (unless you have to remit at time of invoice?).

AnssiH · on June 3, 2019

Makes sense.

I believe it works differently in EU (i.e. US DCs taxed) as per Article 44 the place of supply of services is the customer's country if the customer has no establishment in the supplier's country.

paranoidrobot · on June 3, 2019

AWS is registered for Australian GST - they therefore charge GST on all(ish) services[0].

IBM/Softlayer, Rackspace, Google Cloud, Microsoft and I imagine everyone else large enough to count also does, too.

For Australian businesses, at least, being charged GST isn't a problem - they can claim it as an input and get a tax credit[1].

[0] https://aws.amazon.com/tax-help/australia/

[1] https://www.ato.gov.au/Business/GST/Claiming-GST-credits/

BillinghamJ · on June 2, 2019

You know, normally you still have to pay that tax - just through a reverse charge process

Scoundreller · on June 2, 2019

Not the case in Canada if you’re not an HST registrant (non-business or a small enough business where you’re exempt).

Even if you did have to self-assess, better to pay later than right away.

threeseed · on June 2, 2019

Mostly because those sites were never architected to work across multiple availability zones.

rincebrain · on June 2, 2019

Years ago, when I was playing with AWS in a course on building cloud-hosted services, it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage, so while technically all our VMs hosted out of other AZs were extant, all our attempts to manage our VMs via the web UI or API were timing or erroring out.

I understand this is long-since resolved (I haven't tried building a service on Amazon in a couple years, so this isn't personal experience), but centralized failure modes in decentralized systems can persist longer than you might expect.

(Work for Google, not on Cloud or anything related to this outage that I'm aware of, I have no knowledge other than reading the linked outage page.)

jonhohle · on June 2, 2019

> it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage

Maybe you mean region, because there is no way that AWS tools were ever hosted out of a single zone (of which there are 4 in us-east-1). In fact, as of a few years ago, the web interface wasn’t even a single tool, so it’s unlikely that there was a global outage for all the tools.

And if this was later than 2012, even more unlikely, since Amazon retail was running on EC2 among other services at that point. Any outage would be for a few hours, at most.

molesy · on June 3, 2019

Quoting https://docs.aws.amazon.com/general/latest/gr/rande.html

"Some services, such as IAM, do not support Regions; therefore, their endpoints do not include a Region."

There was a partial outage maybe a month and a half ago where our typical AWS Console links didn't work but another region did. My understanding is that if that outage were in us-east-1 then making changes to IAM roles wouldn't have worked.

scarface74 · on June 3, 2019

The original poster said that none of AWS services are in a single AZ, the quote you referenced says that IAMs do not support regions.

Your quote cd mean two things.

- that IAM services are hosted in one region (not one AZ)

And/Or

- that IAM is for the entire account not per region like other services (which is true)

chucky_z · on June 3, 2019

Just this year an issue in us-east-1 caused the console to fail pretty globally.

rincebrain · on June 3, 2019

Quite possibly, it has been a number of years at this point, and I didn't dig out the conversations about it for primary sourcing.

boulos · on June 2, 2019

Where are you based? If you’re in the US (or route through the US) and trying to reach our APIs (like storage.googleapis.com), you’ll be having a hard time. Perhaps even if the service you’re trying to reach is say a VM in Mumbai.

digaozao · on June 2, 2019

I am in Brazil, with servers in southamerica. Right now it seems back to normal.

nodesocket · on June 2, 2019

I have an instance in us-west-1 (Oregon) which is up, but an instance in us-west-2 (Los Angeles) which is down. Not sure if that means Oregon is unaffected though.

not_kurt_godel · on June 2, 2019

us-west-1 is Northern California (Bay area). us-west-2 is Oregon (Boardman).

nodesocket · on June 2, 2019

Incorrect. GCE us-west1 is the Dalles, Oregon and us-west2 is Los Angeles.

not_kurt_godel · on June 3, 2019

What I said is correct for AWS. In retrospect I guess the context was a bit ambiguous.

(I will note that I was technically more right in the most obnoxiously pedantic sense since the hyphenation style you used is unique to AWS - `us-west-1` is AWS-style while `us-west1` is GCE-style :P)

vast · on June 2, 2019

EUW doesn't seem to be affected.

EugeneOZ · on June 2, 2019

My instance in Belgium works fine

captn3m0 · on June 2, 2019

Some services are still impacted globally. Gmail over IMAP is unreachable for me. (Edit: gmail web is fine)

zaporozhets · on June 2, 2019

+1- imap gmail is down for me in Australia

hazeii · on June 2, 2019

Yes, same here in UK (for some hours now).

afiori · on June 2, 2019

Quick update from Germany, both youtube and gmail appear to work fine

ls612 · on June 2, 2019

I’m from the US and in Australia right now. Both me and my friends in the US are experiencing outages across google properties and Snapchat, so it’s pretty global.

the-rc · on June 2, 2019

Fiber cut? SDN bug that causes traffic to be misdirected? One or more core routers swallowing or corrupting packets?

falcon2_0 · on June 3, 2019

It seemed to be congestion in the North East US.

odiroot · on June 2, 2019

> including unfortunately the tooling we usually use to communicate across the company about outages.

There's some irony in that.

boulos · on June 2, 2019

Edit: and I agree!

I’m not in SRE so I don’t bother with all the backup modes (direct IRC channel, phone lines, “pagers” with backup numbers). I don’t think the networking SRE folks are as impacted in their direct communication, but they are (obviously) not able to get the word out as easily.

Still, it seems reasonable to me to use tooling for most outages that relies on “the network is fine overall”, to optimize for the common case.

Note: the status dashboard now correctly highlights (Edit: with a banner at the top) that multiple things are impacted because Networking. The Networking outage is the root cause.

marksomnian · on June 2, 2019

> the status dashboard now correctly highlights that multiple things are impacted because Networking.

this column of green checkmarks begs to differ: https://i.imgur.com/2TPD9e9.png

pm90 · on June 2, 2019

This is a person who's trying to help out while on vacation...can we try being more thankful, and not nitpick everything they say?

boulos · on June 2, 2019

Thanks! I’ll leave this here as evidence that I should rightfully reduce my days off by 1 :).

boulos · on June 2, 2019

The banner at the top. Sorry if that wasn’t clear.

seltzered_ · on June 2, 2019

While not exactly google cloud, G suite dashboard seems accurate: https://www.google.com/appsstatus#hl=en&v=status

TimothyBJacobs · on June 2, 2019

For me, at least, that was showing as all green for at least 30 minutes.

Twirrim · on June 2, 2019

AWS experienced a major outage a few years ago that couldn't be communicated to customers because it took out all the components central to update the status board. One of those obvious-in-hindsight situations.

Not long after that incident, they migrated it to something that couldn't be affected by any outage. I imagine Google will probably do the same thing after this :)

flurdy · on June 3, 2019

The status page is the kind of thing you expect to be hosted on a competitor network. It is not dogfooding but it is sensible.

Reminds me of when I was working with a telecoms company. It was a large multinational company and the second largest network in the country I was in at the time.

I was surprised when I noticed all the senior execs were carrying two phones, of which the second was a mobile number on the main competitor (ie the largest network). After a while, I realised that it made sense, as when the shit really hit the fan they could still be reached even when our network had a total outage.

techslave · on June 3, 2019

> Not long after that incident, they migrated it to something that couldn't be affected by any outage.

Like the black box on an airplane, if it has 100% uptime why don’t they build the whole thing out of that? ;)

captn3m0 · on June 2, 2019

Was just reading it, they made their status page multi-region.

k_bx · on June 2, 2019

Even more irony: Google+ shown as working fine: https://i.imgur.com/52ACuiY.png

chupasaurus · on June 2, 2019

G+ is alive and well for G Suite subscribers, not the general users.

ohazi · on June 2, 2019

> including unfortunately the tooling we usually use to communicate across the company about outages.

So memegen is down?

ChuckMcM · on June 2, 2019

I'm guessing this will be part of the next DiRT exercise :-) (DiRT being the disaster recovery exercises that Google runs internally to prepare for this sort of thing)

bufferoverflow · on June 2, 2019

Well, lots of revenue is lost, that's for sure.

SmokeGS · on June 2, 2019

>nothing is lost

except time

stanfordkid · on June 2, 2019

Can't use my Nest lock to let guests into my house. I'm pretty sure their infrastructure is hosted in Google Cloud. So yeah... definitely some stuff lost.

TeMPOraL · on June 2, 2019

You have my honest sympathy because of the difficulties you now suffer through, but it bears emphasizing: this is what you get when you replace what should be a physical product under your control with Internet-connected service running on third-party servers. IoT as seen on the consumer market is a Bad Idea.

gowld · on June 2, 2019

It's a trade-off of risks. Leaving a key under the may could lead to a security breach.

curiouscats · on June 2, 2019

I am pretty sure there are smart locks that don't rely on an active connection to the cloud. The lock downloads keys when it has a connection and a smartphone can download keys. This means they work even if no active internet connection at the time the person tries to open. If the connection was dead the entire time between creating the new key and the person trying to use the lock it still wouldn't work.

If there are not locks that work this way it sure seems like there should be. Using cloud services to enable cool features is great. But if those services are not designed from the beginning with fallback for when the internet/cloud isn't live that is something that is a weakness that often is unwise to leave in place imo.

andrewjshults · on June 3, 2019

FWIW - The Nest lock in question doesn't rely on an active internet connection to work. If it can't connect, it can still be unlocked using the sets of PINs you can setup for individual users (including setting start/end times and time of day that the codes are active). There's even a set of 9V battery terminals at the bottom in case you forget to change the batteries that power the lock.

This does mean you need to setup a code in advance of people showing up, but it's an under 30 second setup that I've found simpler than unlocking once someone shows up. The cameras dropping offline are a hot mess though, since those have no local storage option.

dltmrcd · on June 2, 2019

It may not be worth the complexity to give users the choice. If I were to issue keys to guests this way I would want my revocations to be immediately effective no matter what. Guest keys requiring a working network is a fine trade-off.

azernik · on June 2, 2019

You can have this without user intervention - have the lock download an expiration time with the list of allowed guest keys, or have the guest keys public-key signed with metadata like expiration time.

If the cloud is down, revocations aren't going to happen instantly anyway. (Although you might be able to hack up a local WiFi or Bluetooth fallback.)

TeMPOraL · on June 2, 2019

So can a compromise of a "smart" lock.

It's a fake trade-off, because you're choosing between lo-tech solution and bad engineering. IoT would work better if you made the "I" part stand for "Intranet", and kept the whole thing a product instead of a service. Alas, this wouldn't support user exploitation.

Pxtl · on June 3, 2019

Yeah, my dream device would be some standard app architecture that could run on consumer routers. You buy the router and it's your family file and print server, and also is the public portal to manage your IoT devices like cameras, locks, thermostats, and lights.

redler · on June 3, 2019

You can get a fair amount of this with a Synology box. Granted, a tool for the reasonably technically savvy and probably not grandma.

ericd · on June 3, 2019

I love my Synology, I wish they would expand more into being the controller of the various home IOT devices.

MarkyC4 · on June 3, 2019

I don't use the features, but I know my Qnap keeps touting IoT so they might be worth checking out as well.

It's also my Plex media server, file server, VPN, I run some containers on there. I used to use it as a print server but my new printer is wireless so I never bothered

swinglock · on June 2, 2019

Don't be ridiculous. Real alternatives would include P2P between your smart lock and your phone app or a locally hosted hub device which controls all home automation/IoT, instead of a cloud. If the Internet can still route a "unlock" message from your phone to your lock, why do you require a cloud for it to work?

Moru · on June 2, 2019

Or use one of the boxes with combination lock that you can screw onto your wall for holding a physical key. Some are even recommended by insurance companies.

debaserab2 · on June 2, 2019

At least you can isolate your security risk to something you have more control over than a random network outage.

rakden · on June 2, 2019

Any key commands they have already set up will still work. Nest is pretty good at having network failures fail to a working state. They might not be able to actively open the lock over the network is the only change.

alphabettsy · on June 2, 2019

One of the reasons why I personally wanted a smart-lock that had BLE support along with a keypad for backup in addition to HomeKit connectivity.

disillusioned · on June 2, 2019

Sure you can, but you'll need to give them your code or the master code. Unless you've enabled Privacy Mode, in which case... I don't know if even the master code would work.

znpy · on June 2, 2019

You should have foreseen this when you bought stuff that rely on "the cloud"

MrAureliusR · on June 2, 2019

Everyone talking about security and not replacing locks with smart locks seems to forget that you can just kick the fucking door down or jimmy a window open.

milesward · on June 2, 2019

Or just sawzall a hole in the side of the house...

justinclift · on June 2, 2019

After you've cut the power, just to be safe? ;)

jedikv · on June 3, 2019

Except kicking the door down is not particularly scalable or clandestine

jakeogh · on June 2, 2019

To bad we don't have google cars yet.

jsty · on June 2, 2019

"Cloud Automotive Collision Avoidance and Cloud Automotive Braking services are currently unavailable. Cloud Automotive Acceleration is currently accepting unauthenticated PUT requests. We apologise for any inconvenience caused."

ganeshkrishnan · on June 3, 2019

Our algorithms have detected unusual patterns and we have terminated your account as per clause 404 in Terms And Conditions. The vehicle will now stop and you are requested to exit.

sdan · on June 2, 2019

Phoenix Arizona residents think otherwise

DonHopkins · on June 3, 2019

They weren't wearing Batman t-shirts were they?

http://www.ktvu.com/news/mistaken-identity-nest-locks-out-ho...

viburnum · on June 2, 2019

I wonder if in the future products will advertise that they work independently (decoupling as a feature).

SmokeGS · on June 2, 2019

holy shit lmao. I'm sorry that sucks.

sdan · on June 2, 2019

and a nice Sunday afternoon

digaozao · on June 2, 2019

And lots of sales on my case

toomuchtodo · on June 2, 2019

And the illusion of superiority over non cloud offerings.

hinkley · on June 2, 2019

I keep trying to explain to people that our customers don’t care that there is someone to blame they just want their shit to work. There are advantages to having autonomy when things break.

There’s a fine line or at least some subtlety here though. This leads to some interesting conversations when people notice how hard I push back against NIH. You don’t have to be the author to understand and be able to fiddle with tool internals. In a pinch you can tinker with things you run yourself.

tzs · on June 2, 2019

> I keep trying to explain to people that our customers don’t care that there is someone to blame they just want their shit to work. There are advantages to having autonomy when things break.

There are also advantages to being part of the herd.

When you are hosted at some non-cloud data center, and they have a problem that takes them offline, your customers notice.

When you are hosted at a giant cloud provider, and they have a problem that takes them offline, your customers might not even notice because your business is just one of dozens of businesses and services they use that aren't working for them.

not_kurt_godel · on June 2, 2019

Of course customers don't care about the root cause. The point of the cloud isn't to have a convenient scapegoat to punt blame to when your business is affected. It's a calculated risk that uptime will be superior compared to running and maintaining your own infrastructure, thus allowing your business to offer an overall better customer experience. Even when big outages like this one are taken into account, it's often a pretty good bet to take.

pts_ · on June 3, 2019

What does NIH stand for?

gdy · on June 3, 2019

Not Invented Here

k__ · on June 2, 2019

How come?

PerfectElement · on June 2, 2019

The small bare metal hosting company I use for some projects hardly goes down, and when there is an issue, I can actually get a human being on the phone in 2 minutes. Plus, a bare metal server with tons of RAM costs less than a small VM on the big cloud providers.

oarsinsync · on June 2, 2019

> a bare metal server with tons of RAM costs less than a small VM on the big cloud providers

Who are you getting this steal of a deal from?

fredoliveira · on June 2, 2019

Hetzner is an example. Been using them for years and it's been a solid experience so far. OVH should be able to match them, and there's others, I'm sure.

etaioinshrdlu · on June 2, 2019

Hetzner is pretty excellent quality service overall. OVH is very low quality service, especially with the networking and admin pane.

nisa · on June 2, 2019

hetzner.de, online.net, ovh.com, netcup.de for the EU-market.

dijit · on June 3, 2019

Anywhere. Really.

Cloud costs roughly 4x than bare metal for sustained usage (of my workload). Even with the heavy discounts we get for being a large customer it’s still much more expensive. But I guess op-ex > cap-ex

oarsinsync · on June 4, 2019

Lots of responses, and I appreciate them, but I'm specifically looking for a bare metal server with "tons of RAM", that is at the same or lower price point as a google/microsoft/amazon "small" node.

I've never seen any of the providers listed offer "tons of ram" (unless we consider hundreds / low thousands of megabytes to be "tons") at that price point.

HighPlainsDrftr · on June 3, 2019

I've had pretty good luck with Green House Data's Colo Service and their Cloud offerings. A couple of RU's in the data center can host 1000's of VM's in multi-regions with great connectivity between them.

bmelton · on June 2, 2019

Care to name names? I've been looking for a small, cheap failover for a moderately low traffic app.

chucky_z · on June 2, 2019

In the US I use Hivelocity. If you want cheapest possible, Hetzner/OVH have deals you can get for _cheap._

avereveard · on June 3, 2019

I've a question that always stopped me going that route, what happens when a disk or other hardware fails on these servers? beyond data loss I mean, like physically what happens who carries out the repair how long does it takes

nisa · on June 3, 2019

For Hetzner you have to monitor your disks and run RAID-1. As soon as you get the first SMART-Failures you can file a ticket and either replace ASAP or shedule a time. This happened to me a few times in the past years it always has been just 15-30m delay after filing the ticket and at most 5 minutes downtime. You have to get your Linux stuff right through i.e. booting with a new disk.

If you don't like that you can order a KVM-VM with dedicated cores at similiar prices and the problem is not yours anymore.

chucky_z · on June 3, 2019

Most bare metal providers nowadays contact you just like AWS and say "hey your hardware is failing get a new box.". Unless it's something exotic it's usually not long for setup time, and in some cases just like a VM it's online in a minute or two.

avereveard · on June 3, 2019

thanks!

bmelton · on June 3, 2019

Thanks a million. Those prices look similar to what I've used in the past, it's just been a long time since I've gone shopping for small scale dedicated hosting.

Moru · on June 2, 2019

You weren't kidding, 1:10 ratio to what we pay for similar VPS. And guaranteed worldwide lowest price on one of them. Except we get free bandwidth with ours.

toomuchtodo · on June 2, 2019

There are some whole argue that the resiliency of cloud providers beats on prem or self hosted, and yet they’re down just as much or more (GCP, Azure, and AWS all the same). Don’t take my word for it; search HN for “$provider is down” and observe the frequency of occurrences.

You want velocity for your dev team? You get that. You want better uptime? Your expectations are gonna have a bad time. No need for rapid dev or bursty workloads? You’re lighting money on fire.

Disclaimer: I get paid to move clients to or from the cloud, everyone’s money is green. Opinion above is my own.

TeMPOraL · on June 2, 2019

Solutions based on third-party butts have essentially two modes: the usual, where everything is smooth, and the bad one, where nothing works and you're shit out of luck - you can't get to your data anymore, because it's in my butt, accessible only through that butt, and arguably not even your data.

With on-prem solutions, you can at least access the physical servers and get your data out to carry on with your day while the infrastructure gets fixed.

chupasaurus · on June 2, 2019

Any solution would be based on third parties, the robust solution is either to run your own country with fuel sources for electricity and army to defend the datacenters or rely on multiple independent infrastructures. I think the latter is less complex.

dijit · on June 3, 2019

This is a ridiculous statement. Surely you realise that there is a sliding scale.

You can run your own hardware and pull in multiple power lines without establishing your own country.

I’ve ran my own hardware, maybe people have genuinely forgotten what it’s like, and granted, it takes preparation and planning and it’s harder than clicking “go” in a dashboard. But it’s not the same as establishing a country and source your own fuel and feed an army. This is absurd.

HighPlainsDrftr · on June 3, 2019

Correct. Most CFO's I've run into as of late would rather spend $100 on a cloud vm than deal with capex, depreciation, and management of the infrastructure. Even though doing it yourself with the right people can go alot further.

chupasaurus · on June 3, 2019

The GP's statement is about relying on third parties, multiple power lines with generators you don't own on the other side falls under it.

Fun related fact: My first employee's main office was in former electonics factory in Moscow's downtown powered by 2 thermal power stations (and no other alternatives), which have exact same maintenance schedule.

Godel_unicode · on June 2, 2019

Assuming you have data that is tiny enough to fit anywhere other than the cluster you were using. Assuming you can afford to have a second instance with enough compute just sitting around. Assuming it's not the HDDs, RAID controller, SAN, etc which is causing the outage. Assuming it's not a fire/flood/earthquake in your datacenter causing the outage.

...etc.

killjoywashere · on June 2, 2019

Ah, yes, I will never forget running a site in New Orleans, and the disaster preparedness plan included "When a named storm enters or appears in the Gulf of Mexico, transfer all services to offsite hosting outside the Gulf Coast". We weren't allowed to use Heroku in steady state, but we could in an emergency. But then we figured out they were in St. Louis, so we had to have a separate plan for flooding in the Mississippi River Valley.

hinkley · on June 2, 2019

Took me a second.

I didn’t know the cloud-to-butt translator worked on comments too. I forgot that was even a thing.

trevyn · on June 3, 2019

Oh that’s weird, because it totally worked for me with “butts” as a euphemism for “people”, as in “butt-in-seat time” — relying on a third-party service is essentially relying on third party butts (i.e. people), and your data is only accessible through those people, whom you don’t control.

And then “your data is in my butt” was just a play on that.

TeMPOraL · on June 2, 2019

I keep forgetting that I have it on, my brain treats the two words as identical at this point. The translator has this property, which I also tend to forget about, that it will substitute words in your HN comment if you edit it.

But yeah, it's still a thing, and the message behind it isn't any less current.

crankylinuxuser · on June 2, 2019

There is a cloud I've developed that is secure and isn't a butt :P

https://hackaday.io/project/12985-multisite-homeofficehacker...

I made IoT using cheap (arduino, nrf24l01+, sensors/actuators) for local device telemetry, MQTT, node-red, and Tor for connecting clouds of endpoints that aren't local.

Long story short, its an IoT that is secure, consisting of a cloud of devices only you own.

Oh yeah, and GPL3 to boot.

countbackula · on June 2, 2019

And reputation. With this outage the global media socket is going to be in gCloud nine.

jussij · on June 3, 2019

and reputation.

foobarbazetc · on June 2, 2019

Seems to be the private network. The public network looks fine to us from all over the world?

nodesocket · on June 2, 2019

Not on my end. Public access in us-west2 (Los Angeles) is down for me.

foobarbazetc · on June 2, 2019

Hmmm... why is our monitoring network not showing that?

Edit: ah, looks like the LB is sending LA traffic to Oregon.

teamspirit · on June 2, 2019

Our Oregon VMs are up.

Double_a_92 · on June 3, 2019

> but there is serious packet loss at the least.

Can confirm with Gmail in Europe. Everything works but it's sluggish (i.e. no immediate reaction on button clicks).

gingabriska · on June 3, 2019

We are also hosted on GCP bit nothing is down for us. We are using 3 regions in US and 2 in EU.

breadandcrumbel · on June 3, 2019

What can be the reason for the outage? Can it be a cyber attack to your servers?

ikiris · on June 3, 2019

go/stopleaks :)

foota · on June 3, 2019

Hm, isn't releasing go links publicly also verboten? :)

123jay7 · on June 2, 2019

This happened to Amazon S3 as well once. The "X" image they use to indicate a service outage was served by... yup, S3, which was down obviously.

hinkley · on June 2, 2019

One of the projects I worked on was using data URIs for critical images, and I wouldn’t trust that particular team to babysit my goldfish.

Sounds like Google and Amazon are hiring way too many optimists. I kinda blame the war on QA for part of this, but damn that’s some Pollyanna bullshit.

dosy · on June 3, 2019

You're brave to jump on here when on holiday!

Shouldn't that outage system be aware when service heartbeats stop?

Could this be a solar flare?

Yrlec · on June 2, 2019

Now is a good time to point out that the SLA of Google Cloud Storage only covers HTTP 500 errors: https://cloud.google.com/storage/sla. So if the servers are not responding at all then it's not covered by the SLA. I've brought this to their attention and they basically responded that their network is never down.

crazygringo · on June 2, 2019

Ironically I can't read that page because, since it's Google-hosted, I'm getting an HTTP 500 error... but which means at least that service is SLA-covered...

Cloud services live and die by their reputation, so I'd be shocked if Google ever tried to get out of following an SLA contract based on a technicality like that. It would be business suicide, so it doesn't seem like something to be too worried about?

based2 · on June 2, 2019

https://www.reddit.com/r/sysadmin/comments/bw1gye/most_googl...

https://www.zdnet.com/article/some-internet-outages-predicte... 768k Day

_Marak_ · on June 3, 2019

This should be voted higher up.

According to https://twitter.com/bgp4_table, we have just exceeded 768k Border Gateway Protocol routing entries, which may be causing some routers to malfunction.

dreamer_soul · on June 3, 2019

Isn't it weird that it's happening now even though that number was surpassed nearly a month and half ago?

dyu · on June 3, 2019

Different locations see different counts because of aggregation/de-aggregation.

juanuys · on June 3, 2019

Will this affect more than just Google? I haven't seen any outages from other cloud providers.

namibj · on June 3, 2019

packet.net was hit. Specifically, also their San Jose DC. Internet only. It took less than an hour to recover. More than 20 minutes. I didn't ping it continuously, and I can say that the traceroute got stuck in Frankfuhrt (Where my ISP and their ISP first (as seen from me) meet).

I was actually surprised, as they tend to have excellent networking. Now I'm not nearly as distrusting as I was initially, knowing it was likely their ISP getting screwed by routing table overflow.

tntn · on June 2, 2019

There goes 3 nines for June and for Q2. I guess everyone gets a 10% discount for the month? https://cloud.google.com/compute/sla

OkGoDoIt · on June 3, 2019

Remember to request the credit!

From that linked page:

"Customer Must Request Financial Credit

In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Customer must also provide Google with server log files showing loss of external connectivity errors and the date and time those errors occurred. If Customer does not comply with these requirements, Customer will forfeit its right to receive a Financial Credit. If a dispute arises with respect to this SLA, Google will make a determination in good faith based on its system logs, monitoring reports, configuration records, and other available information, which Google will make available for auditing by Customer at Customer’s request."

gundmc · on June 2, 2019

A couple more hours and everyone will get 25% off for June.

twistedpair · on June 2, 2019

Does that apply to the rest of June?

Might be a good month to rebuild all your models ;)

quickthrower2 · on June 2, 2019

The vultures are circling.

CamelCaseName · on June 2, 2019

Ironically, the SLA page returns a 502 error.

londons_explore · on June 2, 2019

The discount seems way too small.

I would pay a premium for a cloud provider happy to give 100 percent discount for the month for 10 minutes downtime, and 100 percent discount for the year for an hour's downtime.