Yeah, per the status page, there's a temporary mitigation rolled out while the team tries to figure out more... so no more 404s, but load balancers will be locked down for a while as the investigation continues.
Parts of Google itself also seem to be down: if you tap the three-line menu at the top right of google.com in a mobile browser, it opens https://www.google.com/mobile/?newwindow=1, which also comes back 404 Not Found.
I'm assuming all these websites point at Google's web frontend servers, which for some reason can no longer map the Host header to the proper backend to proxy to.
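To illustrate what I mean (a purely hypothetical sketch, not Google's actual frontend code; the hostnames and backends are made up): a frontend that routes on the Host header will 404 every request for every site it fronts if its host-to-backend mapping is missing or wiped, even when the backends themselves are perfectly healthy.

    // Hypothetical sketch of Host-header routing in a shared frontend.
    package main

    import (
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    // hostMap plays the role of the routing config: Host header -> backend.
    // If this map is wiped or fails to load, every request falls through to 404.
    var hostMap = map[string]string{
        "www.example-a.com": "http://10.0.0.10:8080",
        "www.example-b.com": "http://10.0.0.20:8080",
    }

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            backend, ok := hostMap[r.Host]
            if !ok {
                // Unknown (or missing) mapping: the frontend answers 404 itself.
                http.NotFound(w, r)
                return
            }
            target, _ := url.Parse(backend)
            httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
        })
        http.ListenAndServe(":8080", nil)
    }

That would match the symptom: every hosted property 404ing at once, regardless of whether anything behind the frontend is actually broken.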
This is a recurring kind of outage for Google. They've wiped out GSLB configs before, and a year ago there was also that big outage where they blanked out the Gmail delivery configs and started rejecting all mail (even for gmail.com). Config safety is not their strong suit.
Working now, yes. Though deno.land's current IP still resolves (per nslookup) to something inside googleusercontent.com, so maybe Google has fixed their side. But it was definitely offline earlier today (cf. our GitHub CI failures and downforeveryoneorjustme).
Looks like perhaps an issue with Google's load balancers. We have a load balancer in front of Google Storage buckets and can access resources directly from the buckets, but we get a 404 when going through the load balancer.
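Roughly the kind of check we did, sketched out (the bucket name and the hostname behind the load balancer are made up here): fetch the same object directly from the bucket and through the load balancer, then compare status codes. A 200 direct but a 404 via the LB points at the load balancer layer rather than at the bucket.

    // Rough diagnostic sketch with hypothetical URLs.
    package main

    import (
        "fmt"
        "net/http"
    )

    func status(url string) int {
        resp, err := http.Get(url)
        if err != nil {
            return -1
        }
        defer resp.Body.Close()
        return resp.StatusCode
    }

    func main() {
        direct := status("https://storage.googleapis.com/my-bucket/asset.png") // straight to the bucket
        viaLB := status("https://assets.example.com/asset.png")                // same object through the LB
        fmt.Printf("direct bucket: %d, via load balancer: %d\n", direct, viaLB)
    }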
Non-engineer here - is there an easy way to get multi-provider redundancy around this? Can you have LBs on multiple clouds and use DNS to move around, or something? Or does your LB have to be at the same provider as the app? Sorry if this makes no sense. :o
Half the internet is down because of a global Google Cloud issue with their load balancers, including Spotify and Etsy, and the GCP status page is all green: https://status.cloud.google.com
If you ever wondered why GCP is a distant third in the enterprise cloud space, here is your answer.
Expecting an instant public post is a bit unrealistic. They had a post up just over 20 minutes after the incident started, which is not that crazy given that they needed time to triage all of the alarms, understand which component was actually breaking, and confirm some technically correct information about it, even if the actual internal incident response can run without the public post.
If the automation is working, the services will be up. When an incident is happening, it's because something is significantly broken, and automation won't properly understand what is and is not working.
For instance, lots of follow-on alarms might be firing for what are not actually issues with the things being monitored. As an example, I would imagine that datacenter temperatures and fan speeds dropped due to the incident, which might cause automation to suspect a facilities issue, but announcing a facilities issue would be misleading.
Or metrics around live instance counts might be tanking as autoscaling groups start downsizing. This would not be an issue with the autoscaling service, and automatically announcing an autoscaling outage would again be misleading.
In an incident, taking the available data and reaching a conclusion about what is broken and what is merely a downstream effect requires skilled manual effort and is error-prone.
The automation doesn't need to do that; it doesn't need to analyse the situation. It needs to communicate "Hey, our systems have seen this and have pinged humans, bear with us" rather than "nope, even though half the internet is down rn, it's all good baby".
Make the green tick a blue question mark or something. It doesn't even need to admit fault; it just needs to not be useless. My goal in visiting the page is to get a link I can send clients with "Updates will be posted here". Nothing more.
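Something this dumb would be enough, as a sketch of what I mean (thresholds and names are made up): automation never names a culprit, it only flips the page from "ok" to "investigating" once alert volume is clearly anomalous, and humans take it from there.

    // Minimal sketch of the "blue question mark" idea: automation only flips
    // the page to "investigating" on an alert storm; it never claims which
    // component is at fault.
    package main

    import "fmt"

    type Status string

    const (
        OK            Status = "ok"            // green tick
        Investigating Status = "investigating" // "we've seen this, humans are paged"
    )

    // evaluate never tries to name a failing component; it only signals that
    // something is anomalous enough that a human has been paged.
    func evaluate(firingAlerts, threshold int) Status {
        if firingAlerts >= threshold {
            return Investigating
        }
        return OK
    }

    func main() {
        fmt.Println(evaluate(3, 50))   // normal noise -> ok
        fmt.Println(evaluate(400, 50)) // alert storm -> investigating
    }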
Also, if you're hosting your monitoring system on the same system it's monitoring, you've completely missed the point. At least use a different region within your cloud provider; better would be a completely different provider. I'd even go as far as using different domains/TLDs to host the page if I were Google-sized.
I think that's tangential to my point. The concern in the post you replied to was about system interdependence making it hard for a monitoring system to separate cause and effect, even if that monitoring system is itself working properly.
...and there's a reason for that. Automations that update the status page are rarely acceptable, since the status page statuses have legal and financial implications. Therefore, the incident manager usually has to update it (or tell someone to update it). But, realistically, when you get paged, you first need to figure out what exactly is wrong and get at least a vague idea of why. Then you need to tell someone to update the page. Then it gets updated.
The status page will always lag the outage. It's not a conspiracy.
Status pages should be driven automatically, though. "Legal and financial implications" and "it's not a conspiracy" are poor excuses.
Now, I'm on Azure, but from the comments the situation seems similar. So, instead of an automatically updated status page that would help engineers do their jobs, we get a status page that isn't accurate, and customers have to pull teeth to get a service credit where/when one is due. And it seems like you can have your cake and eat it too here: while IANAL, a footnote in the SLA or on the status page saying "this is a machine estimate and not reflective of what goes into the SLA" should do it, no?
Ironically, it appears that IsItDownRightNow? is also down, although that could be because they're experiencing what is basically the equivalent of a DDoS.
You don't need legal discovery for that. Every "X as a service" contract you sign will explicitly say that SLAs aren't dependent on dashboards or ping tests but rather on a mostly subjective measure of "availability".
Dang doesn't see messages like that unless you use the footer Contact link, but I remember a comment from him a while back that I would summarize as "some site users think it's a good use of HN, other site users disagree and flag it, and we downweight/dedupe them sometimes, and/or when someone emails us via the Contact link". I just didn't want you to wait for a reply that'll never come unless you contact them.
Yes, confirmed it's because Netlify's origin servers use Google Load Balancers. In this blog post they claim "We can easily move the entire brains of our service between Google, Amazon, and Rackspace in around 10 minutes with no service interruptions." Let's see if that's true. https://www.netlify.com/blog/2018/05/14/how-netlify-migrated...
Sites that are down according to https://downdetector.com include Spotify, Discord, Snapchat, Etsy, Pokemon Go, Epic Games, Target, Paramount+, Evernote
I wondered why our alerts started going nuts. Seems like basically every global Google Cloud load balancer went down. Doesn't seem to affect single-region network load balancers.
Edit: All of ours are back up. Some other services still seem down though.
Funny thing is, when you google Home Depot or Paramount Plus you get ads served by Google as the first result. When you click on one, Google then shows you a 404 page. I wonder if they'll get a refund on their AdWords campaign.
One of my pet peeves with so many services! Their obnoxious pre-ads can play flawlessly (stealing your time/eyeballs and giving them benefit), and they can still fail to give you the content you exchanged your time/eyeballs to see. Worse, they can repeatedly fail and repeatedly drill the same ads into your brain.
There ought to be a law that essentially says: if ads are "paying" for content, there must be a flawless link between ads and content such that the system can tell whether the content is available (or detect after the fact that something was not delivered properly). And then, based on that, it is either required to ensure the ad never plays (since the content cannot be delivered), or the user must be compensated in some way (e.g. "we see you were forced to watch an ad but got nothing, so we are crediting $1 to your account").
I also wonder how many companies didn't want to admit they were using Google for their infrastructure. Downdetector shows AWS being affected; it'd be embarrassing if they were caught using Google Cloud Platform.
I seriously doubt that AWS or Facebook are using Google for infra. There's probably some other effect at play, like people using a Google service to connect to these things. I also don't see any effects on those services personally.
Seeing the same - I have projects in us-east1 that went offline first, then us-west1 went offline a few minutes after. Everything green on their status page and nothing in the dashboard - everything returns a 404 so I'm assuming a really high level LB just took a dump.
Affecting us. Busiest time of the year and now down 20 min. It's the Global Load Balancer, so god knows what bit of the global edge has been taken out.
For what it's worth, I don't think you should have been flagged or even downvoted just for being wrong. Corrected, yes, absolutely. Sinking to the bottom of the low-contrast lake was a bit much.
Flags being gigadownvotes on this site suuuuuucks now. How do people learn if nobody can correct them? Daft.
I used to have a vouch option but I guess I used it too much haha
This sure makes it easy to figure out who is hosted on Google: just check downdetector.com.