Azure is experiencing DNS issues in all regions
181 points by HEHENE on May 2, 2019 | 85 comments

Looks to be a global outage across Americas, Europe, APAC, and Africa. Office 365 is still up for us, but colleagues are saying theirs is down as well.

GitLab is also currently reporting issues with GCE. I'm unable to push to their network right now.

Looks like all three major cloud platforms are experiencing problems right now. Could be internet related?

the sev1 messages in my inbox currently beg to differ. maybe there's no issue with the DNS at this very moment, but the platform is thoroughly fucked up

I am one of the senior engineers and would be glad to go over anything you want to talk about if you would like me to contact you.

well, talking would not even be cathartic, but if you want to look into the annoying things:

everything is labeled severity 1. I've had three dozen of them in the last two months.

only 2 of them were actually severity 1 from my point of view; most of the others impacted parts of the infra we don't use or datacenters where we don't have resources, so filtering relevant messages would be a good start.

(that's not to say the other sev 1s were mislabeled, but we keep our cloud services integration to a minimum precisely to avoid these interruptions from failing dependencies.)

but what is really annoying is that when one actual sev 1 that requires machine reboots happens (we had 3 in two years, so not even that bad), there's no way to manually anticipate the reboot ourselves to perform a staggered reload.
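and the staggered reload itself would be trivial to script if Azure exposed a pre-reboot hook; a sketch, with `reload_fn` and `healthy_fn` standing in for hypothetical per-host operations:

```python
import time

def staggered_reload(hosts, reload_fn, healthy_fn, timeout_s=300):
    """Reload hosts one at a time, waiting for each to report healthy
    before touching the next, so the fleet never loses quorum."""
    for host in hosts:
        reload_fn(host)
        deadline = time.monotonic() + timeout_s
        # block until this host is back before moving on
        while not healthy_fn(host):
            if time.monotonic() > deadline:
                raise TimeoutError(f"{host} did not come back in time")
            time.sleep(1)
```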

If there were problems, would anybody notice?

On a more serious note, it's cool to see how far the IBM cloud has come (terrible name, but oh well), but they still seem a few years behind the major players.

Huh? I did not have to look very long to find evidence [1] of rather sobering issues with IBM "Cloud". In fact, that article reported IBM as dead last in the 2018 Gartner's Magic Quadrant ranking of IaaS cloud offerings.

What are you trying to say exactly? And why should customers turn their attention to IBM?

[1] https://www.theregister.co.uk/2018/07/04/ibm_cloud_outage/

> All the IBM Cloud core engineering teams are based in the US which I strongly agree with. Have a nice day!

All US based really is not a selling point.

As a European, I wouldn't bet my company (if I had one ;)) on a provider whose engineers are all in a TZ 5-8 hours off from mine.

You didn't [make me upset]. I'm also looking forward to seeing what Red Hat is capable of with IBM's resources. Thanks for clarifying.

Competition is good and I didn't know IBM had a bare metal offering.

(you might want to put some sort of contact method in your profile in case people want to reach out to you)

Your comments come off as unvarnished self-promotion, which is generally frowned upon here.

That is fine, but entering the conversation with a comment that is self-promoting but otherwise unrelated to the conversation at hand is obnoxious, just as much so as someone doing the same for their own person at a party.

If you want to promote your product, more apropos and therefore constructive would be a comment detailing why it did not suffer the same outage, or justified speculation why its competitors did. Engage in the conversation rather than just naming a list of features.

I think it's good to hear someone talking about it. I've never used it - perhaps I have the impression that it's not even something I could just go sign up for with a credit card and start using as a non-business entity.

But, I see things like https://github.com/IBM-Cloud/terraform-provider-ibm exist, so it must be at least reasonably usable.

What's a reason why someone would consider this over the big three, besides sweetheart deals for big customers?

One of the cooler things to me is that you can order an actual physical server and it's entirely yours. It gets the same infrastructure and care we give everything else, but you own and control it completely. You can select the specific components you want, or pick a common build that we keep prebuilt for "fast provision". These prebuilt ones sit ready for booting via IPMI/BOOTP, so with a few clicks you can get a bare-metal server that's only yours in a few minutes. You can totally lock us out of it in every way if you want to.

Also, the portions of the network where this happens are not an "overlay network", which translates to extremely predictable latency and jitter characteristics, actually better than the others can provide. Game companies especially like this. The downside is decreased flexibility, because of course it's a physical server somewhere, so you're responsible for managing any container or VM you choose to run on it. We of course offer containers, VMs, and many other things as services as well.

Which is great and all, if done properly. Was this fixed properly? What’s the IBM equivalent to AWS’ Nitro or GCP’s Titan?

So you provide as evidence an article from July of last year?

Man... Azure seems to be an order of magnitude worse than AWS and GCP when it comes to reliability.

Seems like they have tons of global dependencies within their services which cause these cascading failures rather often... Seems like only a few months ago we were reading about a global outage that affected auth?

When it comes to outages, recent Azure downtimes alone broke our SLA for the whole year five times over.

Though for whatever reason I cannot convince the higher-ups that the switch was a stupid idea. But hey! We got some credits to spend in Azure as compensation. (Too bad we had to pay our clients with real cash.)

Sometimes I'm not sure how Azure makes any money with the amount of credits they give out.

I'm exaggerating, of course, but from my POV Azure looks subsidised.

Azure is subsidized by Microsoft so they retain customers and mindshare; they also seem to have a mix of talent working for them. In some areas, like reporting dashboards, Azure is head and shoulders better than AWS and GCP, but few if any of their large clients use these features (missing the primary value you'd get from using Azure).

Our interaction with the GCP sales team has repeatedly been aggressive US based sales agents, whereas Microsoft sent us to deal with (incompetent) sales agents out of India after promising us significant startup credits when we crossed paths with them at a conference near Seattle.

I think the regular prices for cloud services are super high. a1.xlarge has 4 vcpus and 8GB memory for $500 a year? They can afford to give a lot for free.

Yep [0].

They had cooling issues at one of their data centers that took most of South Central offline for hours to days, depending on the services you were using.

[0] https://news.ycombinator.com/item?id=17910916

Coincidentally, a neighbor of mine recently retired from consulting and told me, a couple of days ago, that he's still been getting a lot of calls from people wanting to move off Azure due to reliability issues. He said he's thinking of jumping back in because of the money he thinks he's now leaving on the table.

What does the consultant do to help them move? Sounds like a cushy job!

Of course it isn't! You're talking about moving server software from Azure to something else, which isn't likely going to be something from Microsoft, so you're talking Linux or BSD or UNIX: a completely different platform.

To me that sounds like fun, but maybe I’m weird. But I doubt the consultant does it. They probably draw some architecture diagrams and get a team to do it.

Azure has been magnitudes more expensive and magnitudes more unreliable for the one of our clients that demands we use it to host their stuff than either AWS, Digital Ocean, or Heroku.

Honestly, I can't think of a single reason why I would recommend them over anyone else at this point, no matter what your hosting or storage or computing needs were. Can't see a single area where they are better than the competition.

Magnitudes more expensive? Huh? On what product?

We used to use AWS and had issues a lot. I've seen global GCS outages. Today we were affected by the Azure outage, but only for about 15 minutes.

I should clarify to say, for my current use cases I have found I get charged for way more resources than I would expect to be and Azure's billing breakdowns are so useless and difficult to parse that I can't figure out where the extra charges are coming from :D But could be a PEBCAK error.

In any case, I think you are right that apples-to-apples for most things they generally shake out the same in theory.

I completely agree. I ran a 100+ node compute cluster a couple years back, and the uptime of our nodes was three nines at most. This wouldn't have been bad if they'd had any form of reliable storage or quick recovery, but at the time their S3 competitor was limited to 20 TB per bucket (because it was just NetApp hardware).

Azure is a freaking mess. If it hadn't been Microsoft we were selling to, we'd have never used it.

Azure does have the only HIPAA compliant media transcoding service (as far as I know). The API is terrible but it works.

For what it’s worth, AWS’s MediaConvert is HIPAA compliant. Have one client who's been using it with that in mind. https://aws.amazon.com/compliance/hipaa-eligible-services-re...

Might I ask the use case? Didn't even realize that HIPAA-compliant transcoding was a thing.

I am guessing that, more than compliance, it is just some corporate captive-market thing where there is only one "compliant" transcoder.

I’m at a complete loss how, in the 2010s, they can design, build, and implement a cloud provider from the ground up that has so many “global outages”.

This isn’t “legacy” code that has migrated from COBOL to VB to C#; this is modern code, and to suffer this BS time and time again is unforgivable.

That's actually exactly the reason why it's unstable: clean code from scratch that doesn't cover all the edge cases yet.

We SEs keep whining about legacy code but forget that it made it to legacy status for a reason...

This is not new. DNS queries for Outlook (Hotmail) servers failed to resolve from within Azure, so the application I was responsible for was unable to send emails via SMTP. And this issue has appeared a few times over the last few years.

Our customer would bash us if they were unable to use our product for several minutes, but when Azure was down for a day there were no complaints to Microsoft. (They were hosting our product on Azure.)

I don't understand their love of Microsoft. I guess because they collect Microsoft certificates the way generals collect medals.

I wonder if this DNS outage will cause data loss like the last one...

The entire "here's free ELA credits for Azure, please please Mr. Sr. Director/CIO, use Azure" approach seems to be working, but then they go and do stuff like this.

Yep, we've just had one of our environments alert us. Services totally unresponsive - even the Azure portal was unresponsive. Affecting services we run in the US and Europe.

Ok, many of these cloud platforms charge more if you have things duplicated across regions/zones, e.g. AWS has multi-AZ. Why pay for this? When AWS has a major issue, all zones/regions are fucked in some way.

That’s actually a point that AWS tried to hit on during their last re:Invent conference. When their competitors say multiple availability zones, read the fine print, and you’ll see that it is far from being as resilient as AWS. Sometimes other cloud providers have even located the two AZs in the same building....

AWS AZs on the other hand are always geographically separated, and they even take into consideration the landscape to see if they need to be further away (earthquakes etc).

It’s staggering how much of a lead AWS has on others when it comes to their AZs and global network infrastructure.

Because by and large, you don't see these outages in GCP/AWS. Both of those companies have invested a ton of energy into creating regional fault tolerance and you rarely if ever see global issues. AWS has seen a few in years past but it's an exception, not the norm. Azure seems to have these issues on a regular cadence.
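The client-side half of that regional fault tolerance is simple enough; a hedged sketch, where the endpoints are hypothetical callables standing in for per-region clients:

```python
def call_with_failover(endpoints, request):
    """Try each regional endpoint in order and return the first
    successful response; re-raise the last error if every region fails.
    `endpoints` is a list of callables standing in for region clients."""
    last_err = None
    for endpoint in endpoints:
        try:
            return endpoint(request)
        except ConnectionError as err:
            last_err = err  # region down; fall through to the next one
    raise last_err
```

Of course this only helps when regions fail independently, which is exactly the property a global DNS outage breaks.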

When has AWS had a "major issue" that affected multiple regions (not AZs)?

This would explain why my app kept throwing this exception when attempting to call an Azure SQL instance:

  System.ComponentModel.Win32Exception: No such host is known

Same here. Just keeping the uptime good on our stack is challenging enough. Now I have to add Azure's downtime to mine.

Yes, same here

Looks like the core problem today was a DNS zone delegation issue during a migration off of legacy DNS servers. I'm not sure how this type of issue can easily be segmented by region, given the way the DNS service is designed and the way zone replication fundamentally works.
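To illustrate why a delegation mistake is global by nature, here's a toy model (all names and servers are made up): pointing a zone's NS entry at a decommissioned server takes out every name beneath that zone at once, regardless of region.

```python
# Toy DNS delegation tree: values starting with "ns-" are delegations
# (NS records); anything else is treated as an address (A record).
SERVERS = {
    "root":   {"net": "ns-net"},
    "ns-net": {"cloudapp.net": "ns-legacy"},  # stale: should be ns-new
    "ns-new": {"myapp.cloudapp.net": "10.0.0.5"},
    # "ns-legacy" was decommissioned during the migration, so the
    # delegation above is "lame" and the whole subtree goes dark.
}

def resolve(name, server="root"):
    records = SERVERS.get(server)
    if records is None:
        return None  # delegated to a dead server: resolution fails
    target = next((v for suffix, v in records.items()
                   if name == suffix or name.endswith("." + suffix)), None)
    if target is None:
        return None
    if target.startswith("ns-"):  # NS record: follow the delegation
        return resolve(name, target)
    return target  # A record

# Every name under cloudapp.net fails until the delegation is repaired:
assert resolve("myapp.cloudapp.net") is None
SERVERS["ns-net"]["cloudapp.net"] = "ns-new"
assert resolve("myapp.cloudapp.net") == "10.0.0.5"
```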

Yeah, DNS is pretty much an unavoidable single point of failure. Gotta do the best you can to keep global touches to it as light as possible.

It looks like the majority of their outages list configuration changes as the cause. You’d think they’d have an RCA at some point aimed at making configuration changes safer.

Forgive my ignorance, but how do things like this happen at such a massive scale? I would understand maybe one big area shutting down because of some sort of internet network issue, but then there should be some redundancies in place, no? I don't get how it all just goes down across the globe; is there one master computer that has gone down, taking everything with it?

The saying goes, "complex systems fail in complex ways". Check out some of the cloud provider postmortems here for a few fascinating and detailed examples: https://github.com/danluu/post-mortems

You could say certain failures only occur and cascade under Special Circumstances. :)

One would think after that 2016 Dyn outage people would strongly consider having an alternate DNS provider....

First of all, yes, people should be using multiple DNS providers for services that require very high uptime. Amazon.com itself notably uses Dyn and UltraDNS as its nameservers.

That said, I'm not sure this would have helped in this case. It seems like some or all of the problem was internal Azure zones failing to resolve, which no third-party DNS provider would be able to mitigate.

Tons of people arguing one way or another here. Does anyone have actual statistics on the reliability of different cloud providers across different zones?

My company is working on a project to move a customer web portal to Azure. This scenario was brought up and dismissed even though when events like this happen many of our customers would certainly come to our web portal as part of assessing the impact. I don't expect this to impact the project though. The decisions have already been made.

All the big clouds regularly have outages.

In _all_ regions? No.

In _all_ regions, regularly? Probably not... Depends on how you define "regularly", but even one or two a year could be counted as regular by some definition.

* Azure: 12 user-facing outages this year so far. source: https://azure.microsoft.com/en-us/status/history/

* Amazon: They don't make their status history public it seems, but I see 5 easily searchable outages from last year. I would venture to bet their actual outage history is not any better than Azure or Google.

* Google: 23 from this year so far source: https://status.cloud.google.com/summary

These numbers are from when I counted in March 2019, so in ~3 months their outages were in double digits. That's TERRIBLE.

Obviously the above numbers are not across _all_ regions, they are just a total count across all regions.

Eh, just counting outages doesn't do anything imo. Most providers consider something like < 99% availability for > 5 minutes an outage. Outages like that are expected in any single region and product.

If I really need perfect availability (I probably don't), then I can leverage the multiple regions and build a failover strategy.

I do not expect global outages which affect multiple regions at the same time. That's just amateur hour when it is a common occurrence.

Of course there are solutions to outages. Multiple regions, failover, multiple cloud providers, or just don't put all your eggs in $CLOUD, and do some on-prem as well, etc, etc, etc.

There are many ways to handle outages.

I'm just saying if you are averaging 4+ outages PER MONTH across all of your services, you are probably doing something wrong. These providers are on schedule to pass 50 outages this year!

All the places I've ever worked, if we can count our customer facing outages per year past the number of planets humans live on(1) we consider ourselves a total failure that year and re-work our planning. We haven't had to re-work our plans in ~ 5 years. Our longest outage in the past 5 years? 1 hr long. Granted I'm not the scale of $CLOUD, but there is zero reason they can't get serious about uptime, they just don't bother. People are still flocking to $CLOUD in droves, despite their crappy uptimes.

It's not rocket science to make stuff work, you just have to get Boring(tm).

Where Boring here means using standard, boring, well understood technology that everyone knows and understands and not $NEWHOTNESS. $NEWHOTNESS is guaranteed to break in new and exciting ways.

It's DNS, so it is somewhat inherently global. Route53 isn't region specific either, so I could see an issue with that having a global effect, too.

DNS is also inherently distributed. This should make it resilient to all of the most common outage scenarios, and is likely why AWS offers a 100% uptime SLA for Route 53.

I'll be interested in the post-mortem from Azure on this one.

> likely why AWS offers a 100% uptime SLA for Route 53

Well, that's interesting. We occasionally see getaddrinfo() calls fail claiming domains that we know exist at the failure time (b/c the records are completely static) don't exist. (We've not got a reproducible case for this yet, and it's incredibly rare for any given VM/service. But across our fleet, it crops up fairly regularly.)

I used to work on Route 53 for a few years. I can't speak to your specific issue; too much depends on your clients, your networks, your resolvers. But turn on query logging at a minimum. You should get a timestamp, qname, and rtype to identify the NXDOMAIN.

That said, the most common cause of an authoritative NXDOMAIN is adding/deleting records and querying them before propagation is complete. You may want to log/poll your RRSet change status separately to correlate.

The other is that, depending on the network, intermediate DNS tampering happens all the time. Qnames, rnames, rtypes all get modified. Responses and queries are duplicated, intercepted, and manipulated. There's some good research on this out of DNS-OARC and a dude out of Australia (iirc).

> We occasionally see getaddrinfo() calls fail claiming domains that we know exist at the failure time (b/c the records are completely static) don't exist.

That could be whatever resolvers you're hitting failing rather than an issue with Route 53 authoritative nameservers, though. The resolving DNS servers in EC2 are not actually part of Route 53, for example.

I'd think that would correspond to EAI_AGAIN or EAI_FAIL, whereas I'm pretty sure we're getting an EAI_NONAME.

We’ve experienced the same thing. I’ve never been able to figure it out. If you ever do, please let me know! I’ll owe you a beer ;)

You may be hitting EC2 DNS rate limits.

I would expect EAI_FAIL or EAI_AGAIN, but I'm pretty sure we're getting EAI_NONAME.

But, the stuff that hits this problem the most often is of the quality level that I wouldn't find that terribly surprising. Seems AWS "documents" this as,

> The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use.

How specific.
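For anyone wanting to tell these failure modes apart in code, the errno on the resolver error distinguishes an authoritative "no such name" from transient resolver trouble; a minimal Python sketch using the stdlib socket module:

```python
import socket

def classify_lookup(host):
    """Resolve `host` and classify any failure: 'nxdomain' is an
    authoritative "no such name" (EAI_NONAME), while 'transient'
    (EAI_AGAIN / EAI_FAIL) is resolver trouble worth retrying."""
    try:
        socket.getaddrinfo(host, None)
        return "ok"
    except socket.gaierror as err:
        if err.errno == socket.EAI_NONAME:
            return "nxdomain"
        if err.errno in (socket.EAI_AGAIN, socket.EAI_FAIL):
            return "transient"
        return "other"
```

If rate limiting were the cause, you'd generally expect the transient codes rather than EAI_NONAME, which is part of why the EAI_NONAME reports above are puzzling.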

Do they typically provide a postmortem?

It's Microsoft. I'm sure they just rebooted it!

(I had to, see username!)

edit: seriously, -3? it was a joke.

Sure, but that's hypothetical, and I don't recall AWS having any such issue in recent history.

Amazon AWS had/has a dependency on one of their first datacenters last year for example

The us-east-1 outage of S3 in spring of 2017 comes to mind. The AWS status page went down and no one could figure out what was going on because the status page had a hard dependency on us-east-1 S3.

To my knowledge, there has never been a global outage for any AWS service, and I've been using them for a long time.

There was the status page, but S3 being down in us-east-1 didn't affect S3 in ap-southeast-1, etc. The big DynamoDB outage a few years back was also limited to us-east-1.

Could this be causing Slack’s downtime?

Slack uses AWS.

AWS was also down for a bit

What region/service? None of my customers called me complaining, so this seems a little unlikely :D

Around the same time as Microsoft. So it might have been a regional internet outage.

Here is a screenshot for posterity: https://i.imgur.com/dHnD6BO.png

Also, see the top comment on the post: https://news.ycombinator.com/item?id=19814181

I wonder if this is why www.cdc.gov is (was? appears to be back) offline.

As far as the status page was concerned, this didn't impact Azure Government.

Is it possible that they didn't need the extra properties of the "government edition" and ran on public Azure (e.g. because it was cheaper or less red tape)?

Then someone or something is mining-a-like do congestion.
