Looks to be a global outage across Americas, Europe, APAC, and Africa. Office 365 is still up for us, but colleagues are saying theirs is down as well.
Looks like all three major cloud platforms are experiencing problems right now. Could be internet related?
Everything is labeled severity 1. I've had three dozen of them in the last two months.
Only 2 of them were actually severity 1 from my point of view; most of the others affected parts of the infra I don't use, or datacenters where we don't have resources, so filtering relevant messages would be a good start.
(That's not to say the other sev 1s were mislabeled, but we keep our cloud service integrations to a minimum precisely to avoid these interruptions from failing dependencies.)
But what is really annoying is that when an actual sev 1 that requires machine reboots happens (we had 3 in two years, so not even that bad), there's no way to trigger the reboots ourselves ahead of time to perform a staggered reload.
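In theory something like Azure's IMDS Scheduled Events endpoint should give advance notice here; a minimal polling sketch in Python (assuming these forced reboots actually get surfaced there, which doesn't seem guaranteed):

    # Poll the instance metadata Scheduled Events endpoint for upcoming
    # reboots, so instances could be drained/reloaded in a staggered way.
    import time
    import requests

    IMDS = ("http://169.254.169.254/metadata/scheduledevents"
            "?api-version=2019-08-01")

    while True:
        doc = requests.get(IMDS, headers={"Metadata": "true"}).json()
        for event in doc.get("Events", []):
            if event.get("EventType") in ("Reboot", "Redeploy", "Freeze"):
                # NotBefore is the earliest the platform will act; the
                # window before it is when a staggered reload must happen.
                print(event["EventId"], event["EventType"],
                      event.get("NotBefore"))
        time.sleep(60)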
On a more serious note, it's cool to see how far the IBM cloud has come (terrible name, but oh well), but they still seem a few years behind the major players.
What are you trying to say exactly? And why should customers turn their attention to IBM?
"All US-based" really is not a selling point.
As a European, I wouldn't bet my company (if I had one ;)) on a provider whose engineers are all in a TZ 5-8 hours different from mine.
(you might want to put some sort of contact method in your profile in case people want to reach out to you)
If you want to promote your product, a more apropos, and therefore constructive, comment would detail why it did not suffer the same outage, or offer justified speculation about why its competitors did. Engage in the conversation rather than just listing features.
But, I see things like https://github.com/IBM-Cloud/terraform-provider-ibm exist, so it must be at least reasonably usable.
What's a reason why someone would consider this over the big three, besides sweetheart deals for big customers?
Seems like they have tons of global dependencies within their services, which cause these cascading failures rather often... Wasn't it only a few months ago that we were reading about a global outage that affected auth?
Though for whatever reason I cannot convince the higher-ups that the switch was a stupid idea. But hey! We got some credits to spend in Azure as compensation. (Too bad we had to pay our clients with real cash.)
I'm exaggerating, of course, but from my POV Azure looks subsidised.
Our interactions with the GCP sales team have repeatedly been with aggressive US-based sales agents, whereas Microsoft sent us to deal with (incompetent) sales agents out of India after promising us significant startup credits when we crossed paths with them at a conference near Seattle.
They had cooling issues at one of their data centers that took most of South Central offline for hours to days, depending on the services you were using.
Honestly, I can't think of a single reason why I would recommend them over anyone else at this point, no matter what your hosting, storage, or computing needs are. I can't see a single area where they are better than the competition.
We used to use AWS and had issues a lot. I've seen global GCS outages. Today we were affected by the Azure outage, but only for about 15 minutes.
In any case, I think you are right that apples-to-apples for most things they generally shake out the same in theory.
Azure is a freaking mess. If it hadn't been Microsoft we were selling to, we'd have never used it.
This isn’t “legacy” code that has migrated from COBOL to VB to C#; this is modern code, and to suffer this BS time and time again is unforgivable.
We SEs keep whining about legacy code but forget that it made it to legacy for a reason...
Our customer would bash us if they were unable to use our product for several minutes, but when Azure was down for a day there were no complaints to Microsoft. (They were hosting our product on Azure.)
I don't understand their love of Microsoft. I guess because they have Microsoft certificates like generals have medals.
The entire "heres free ELA credits for Azure, please please Mr Sr Director/CIO use Azure" seems to be working, but then they go and do stuff like this.
AWS AZs, on the other hand, are always geographically separated, and they even take the landscape into consideration to see if they need to be farther apart (earthquakes, etc.).
It’s staggering how much of a lead AWS has on the others when it comes to their AZs and global network infrastructure.
System.ComponentModel.Win32Exception: No such host is known
You could say certain failures only occur and cascade under Special Circumstances. :)
That said, I'm not sure this would have helped in this case. It seems like some or all of the problem was that internal Azure zones were failing to resolve, which no third-party DNS provider would be able to mitigate.
* Azure: 12 user-facing outages this year so far. source:
* Amazon: They don't make their status history public, it seems, but I see 5 easily searchable outages from last year. I would venture to bet their actual outage history is no better than Azure's or Google's.
* Google: 23 from this year so far source: https://status.cloud.google.com/summary
These numbers are from when I counted in March 2019, so in ~3 months their outages are already in double digits. That's TERRIBLE.
Obviously the above numbers are not outages that spanned _all_ regions; they are just a total count across all regions.
If I really need perfect availability (I probably don't), then I can leverage multiple regions and build a failover strategy.
I do not expect global outages which affect multiple regions at the same time. That's just amateur hour when it is a common occurrence.
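To sketch what I mean by a failover strategy: at minimum, clients can try regions in order and fall back on failure (Python; the endpoint URLs are hypothetical):

    # Client-side failover across regions: try the primary first, then
    # fall back to secondaries when a region is down or unreachable.
    import requests

    REGION_ENDPOINTS = [
        "https://api-eastus.example.com",         # primary (hypothetical)
        "https://api-westeurope.example.com",     # secondary
        "https://api-southeastasia.example.com",  # secondary
    ]

    def fetch_with_failover(path, timeout=2.0):
        last_error = None
        for base in REGION_ENDPOINTS:
            try:
                resp = requests.get(base + path, timeout=timeout)
                resp.raise_for_status()
                return resp
            except requests.RequestException as err:
                last_error = err  # this region failed; try the next one
        raise RuntimeError("all regions failed") from last_error

Of course, none of that helps when the failure is global, which is exactly the point.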
There are many ways to handle outages.
I'm just saying if you are averaging 4+ outages PER MONTH across all of your services, you are probably doing something wrong.
These providers are on schedule to pass 50 outages this year!
All the places I've ever worked, if we can count our customer-facing outages per year past the number of planets humans live on (1), we consider ourselves a total failure that year and rework our planning. We haven't had to rework our plans in ~5 years. Our longest outage in the past 5 years? 1 hour. Granted, I'm not at the scale of $CLOUD, but there is zero reason they can't get serious about uptime; they just don't bother. People are still flocking to $CLOUD in droves, despite their crappy uptimes.
It's not rocket science to make stuff work, you just have to get Boring(tm).
Where Boring here means using standard, boring, well-understood technology that everyone knows, and not $NEWHOTNESS. $NEWHOTNESS is guaranteed to break in new and exciting ways.
I'll be interested in the post-mortem from Azure on this one.
Well, that's interesting. We occasionally see getaddrinfo() calls fail, claiming domains that we know exist at the failure time (because the records are completely static) don't exist. (We don't have a reproducible case for this yet, and it's incredibly rare for any given VM/service. But across our fleet, it crops up fairly regularly.)
That said, the most common cause of an authoritative NXDOMAIN is adding/deleting records and querying them before propagation is complete. You may want to log/poll your rrset change status separately to correlate (see the sketch below).
The other is that, depending on the network, intermediate DNS tampering happens all the time. Qname, rname, and rtype all get modified. Responses and queries are duplicated, intercepted, and manipulated. There's some good research on this out of DNS-OARC and a dude out of Australia (IIRC).
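On the propagation point, a rough sketch of polling change status until it's in sync (assuming Route 53 and boto3; wait_for_change is just an illustrative helper):

    # Poll a Route 53 change until it reports INSYNC, so NXDOMAIN answers
    # seen before that point can be attributed to propagation races.
    import time
    import boto3  # assumes AWS credentials are configured

    route53 = boto3.client("route53")

    def wait_for_change(change_id, poll_seconds=10):
        while True:
            status = route53.get_change(Id=change_id)["ChangeInfo"]["Status"]
            if status == "INSYNC":  # PENDING until propagated everywhere
                return
            time.sleep(poll_seconds)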
That could be whatever resolvers you're hitting failing rather than an issue with Route 53 authoritative nameservers, though. The resolving DNS servers in EC2 are not actually part of Route 53, for example.
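One rough way to tell those apart is to query the default resolver and the zone's authoritative nameservers side by side; a sketch assuming dnspython (the record name and nameserver IP are placeholders):

    # If the default resolver says NXDOMAIN but the authoritative servers
    # answer, the problem is in the resolving path, not the zone itself.
    import dns.resolver  # pip install dnspython

    NAME = "static.example.com"  # placeholder: a record known to exist
    AUTH_NS = ["192.0.2.53"]     # placeholder: the zone's NS addresses

    def query(name, nameservers=None):
        r = dns.resolver.Resolver()
        if nameservers:
            r.nameservers = nameservers
        try:
            return [a.to_text() for a in r.resolve(name, "A")]
        except dns.resolver.NXDOMAIN:
            return "NXDOMAIN"

    print("default resolver:", query(NAME))
    print("authoritative   :", query(NAME, AUTH_NS))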
But the stuff that hits this problem most often is of a quality level where I wouldn't find that terribly surprising. AWS seems to "document" this as:
> The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use.
(I had to, see username!)
edit: seriously, -3? It was a joke.
There was the status page, but S3 being down in us-east-1 didn't affect S3 in ap-southeast-1, etc. The big DynamoDB outage a few years back was also limited to us-east-1.
Around the same time as Microsoft. So it might have been a regional internet outage.
Here is screenshot for posterity: https://i.imgur.com/dHnD6BO.png
Also, see the top comment on the post: https://news.ycombinator.com/item?id=19814181