It's still pretty obnoxious, because they basically said "there's something suspicious going on, we won't tell you what, but you better fix it or we'll shut you down." Gee, thanks. After a week of daily emails they finally responded with a network trace... that showed we're doing some outbound HTTP calls. That's it - they redacted everything except for the ports and the first two octets of the destination IP addresses. So very helpful, and certainly looks suspicious... </sarc>
It's still possible there is something malicious going on, but they've been stunningly unhelpful in finding it. Mostly they've just asserted that there's a needle somewhere in our haystack, and we'd better go find it. They've been threatening to shut us down, but haven't actually done it.
(yes we do have AWS, too)
'Unable to Submit Request
We are unable to complete the incident submission process at this time. Please refer to this page for phone numbers to call for Azure support.'
Timing couldn't have been any better for me. Some Alanis material there:
Now I'll have to distribute between AWS and Azure, too.
That doesn't help with the awful timing though. Ouch. I just Buffer-retweeted your BizSpark tweet above, scheduled for tomorrow. Maybe a little bump now that you're back up and running will help ease the pain...
We're changing that now; we'll need to replicate across different cloud providers, too. We're changing a lot because of last night's outage.
Alternatively, can't you just have multiple A records to distribute your load across cloud platforms, and drop the one for whichever platform is having an outage?
Also, you can associate multiple addresses with a record. It's up to the client to retry on failure, but all browsers do (as far as I know)
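For illustration, roughly what a browser does under the hood, sketched in Python (the host name is a placeholder, not anyone's real setup):

    import socket

    # Minimal sketch of client-side failover across multiple A records.
    def connect_any(host, port, timeout=5):
        last_err = None
        # getaddrinfo returns one entry per address in the DNS answer
        for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)  # first address that answers wins
                return sock
            except OSError as err:
                sock.close()
                last_err = err          # this one is down; try the next
        raise last_err or OSError("no addresses for %s" % host)

    # conn = connect_any("www.example.com", 443)

Once the record for the dead platform is dropped (or its TTL expires), clients stop even trying it; until then, the retry loop papers over the outage.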
It's a thick layer of caches. Your browser, OS, router, ISP, and a bunch of intermediaries can cache the DNS. So even at 60s, you get good cache hits (the busier, the more true that is, of course)
Also, the update can always happen asynchronously. You and 9,999 other people ask your ISP for Facebook's IP. It serves all of you a slightly stale IP and asynchronously fetches a fresh one, turning 10,000 requests into 1 and sidestepping the thundering herd problem.
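Here's a toy sketch of that pattern in Python - `resolve_upstream` is a made-up stand-in for the real recursive lookup, not any particular resolver's API:

    import threading
    import time

    # Serve-stale cache: answer everyone instantly from the (possibly
    # stale) entry, and let exactly one background fetch refresh it.
    class StaleCache:
        def __init__(self, resolve_upstream, ttl=60):
            self.resolve_upstream = resolve_upstream
            self.ttl = ttl
            self.entries = {}        # name -> (value, fetched_at)
            self.refreshing = set()  # names with a refresh in flight
            self.lock = threading.Lock()

        def lookup(self, name):
            with self.lock:
                entry = self.entries.get(name)
                stale = entry is not None and time.time() - entry[1] > self.ttl
                if stale and name not in self.refreshing:
                    # one caller kicks off the refresh; the other 9,999
                    # keep getting the stale answer until it lands
                    self.refreshing.add(name)
                    threading.Thread(target=self._refresh, args=(name,),
                                     daemon=True).start()
            if entry is not None:
                return entry[0]  # slightly stale, but instant
            value = self.resolve_upstream(name)  # cold cache: block once
            with self.lock:
                self.entries[name] = (value, time.time())
            return value

        def _refresh(self, name):
            value = self.resolve_upstream(name)
            with self.lock:
                self.entries[name] = (value, time.time())
                self.refreshing.discard(name)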
DNS mostly uses UDP, which is more efficient for the server and harder to DoS (the server doesn't have to maintain state per request).
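You can see the single-datagram exchange for yourself with the third-party dnspython library (the resolver address here is just Google's public one):

    import dns.message
    import dns.query  # pip install dnspython

    # One stateless request/response pair: send a single UDP datagram,
    # read a single datagram back. No connection, no server-side state.
    query = dns.message.make_query("example.com", "A")
    response = dns.query.udp(query, "8.8.8.8", timeout=2)
    for rrset in response.answer:
        print(rrset)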
Finally, # of requests is usually (always?) a factor in the price of DNS services. So the cost is borne by the clients, not the service providers. And since DNS hosting is seemingly profitable, I assume they're more than happy to build up the infrastructure to deal with additional requests.
1. I was never notified of the outage. I noticed it myself when attempting to log into one of my VMs and then started looking for status updates. Sadly, the best status updates I got were here on Hacker News.
2. When my servers did come back up, at least one of my IP addresses had changed, which meant I had to update all of the relevant DNS entries (which, as everyone here no doubt knows, can take up to 48 hours to propagate). I was never notified of this change in any way.
External IP: http://msdn.microsoft.com/en-us/library/azure/dn690120.aspx
Internal IP: http://msdn.microsoft.com/en-us/library/azure/dn630228.aspx
Also this is a PITA if you use the @ (apex) entry in your DNS, since you can't point the zone apex at a CNAME.
Secondly, you are using an IP address and expecting it to be static? The recommended approach is to use a CNAME so you don't hit that issue. Alternatively, you can have up to 5 Reserved IPs per subscription and attach one to your VM with New-AzureReservedIP from PowerShell.
Edit : see http://azure.microsoft.com/blog/2014/05/14/reserved-ip-addre...
I think that's been largely dispelled.
I can't find the link right now unfortunately, but I remember a post looking into DNS propagation realities from either this year or last, and they found that the overwhelming majority of DNS servers they tried (99%+) respected the TTLs exactly as they should. *
My personal rule of thumb is, if it hasn't propagated within an hour, I need to look at it again because I messed up.
Tools like this are invaluable when you're paranoid about whether your new record has propagated.
* ugh. Does anyone know which post I'm talking about? My google-fu is failing me hard.
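In the same spirit as those tools, you can run a quick check yourself. A sketch with the third-party dnspython library (record name and resolver list are placeholders):

    import dns.resolver  # pip install dnspython

    # Ask several public resolvers for the same record and compare the
    # answers and remaining TTLs; when they all agree, you've propagated.
    RESOLVERS = {"Google": "8.8.8.8", "Google 2": "8.8.4.4",
                 "OpenDNS": "208.67.222.222"}

    for label, address in RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [address]
        answer = resolver.resolve("www.example.com", "A")
        ips = sorted(record.address for record in answer)
        print(label, "ttl=%d" % answer.rrset.ttl, ips)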
Google App Engine has had numerous outages like this, the only one I can find any public documentation for being a 6 hour outage in 2012: http://googleappengine.blogspot.co.uk/2012/10/about-todays-a... (and let's not forget the old-style Datastore corruption incident, where every App Engine user got to manually merge split-brain database tables after a messed up failover)
The App Engine team has a proactive policy about posting about downtime:
Since the team highlights basically anything that looks like it is impacting customers, the issues don't always warrant a stand-alone blog post, but you'll notice that generally speaking the last post in each thread is a full public post-mortem with diagnosis and remediation.
Let me know if there's more you think might be useful for you as a GAE customer. Thanks!
Azure isn't perfect, but their support is way more responsive, at least to us, even though we don't have any fancy support contract. Even though my GAE experience was 2-3 yrs ago, I have to say that Azure has way fewer issues.
Question - are AWS or GCE better at proactively messaging when there's an outage?
See https://groups.google.com/forum/#!forum/gce-operations and https://groups.google.com/forum/#!forum/google-appengine-dow....
And I found out about it by yelling at Heroku - they told me that Amazon was having issues before Amazon's status turned yellow.
Also, funny: if you try to zoom out in Chrome to see the whole thing, the row headers fall out of alignment.
Why would I want to 'X' out specific rows/columns in the table? It was so complicated to begin with, someone thought adding more complication through end-user customization was a good idea? I just noticed, you can even expand some of the rows...
Seriously, a status page should tell you either "It's up" or "What's down". It's not even showing history over time, this is just a snapshot. The text at the top directly contradicts the icons in the table, making the whole thing even more ridiculous.
The footnote at the bottom is the best, "The Australia Regions are available only to customers with billing addresses in Australia and New Zealand." Thanks for that useful nugget! /s
Mistakes happen, services go down; I can get over that. What matters is how it's dealt with. At the moment I would not want to be an Azure customer dealing with 9+ hours of downtime whilst MS are saying everything is great. At the very least change it to "Having some issues" or similar!
Looking forward to the post mortem.
EDIT2: Now the databases are down; this is costing us a lot of money.
EDIT: They just came back up again.
It would be great if anyone knows how to mitigate these in the future - what can I do to protect myself against this? (Except leave Azure)
Obviously there is a significant cost associated with engineering this level of cross-platform redundancy, which is why reliability is an important factor in making your platform choices. If you can tolerate some downtime, you can be more flexible; otherwise it will cost you one way or the other.
In any case, you should consider having a user notification site set up on a completely different service (or two), so that when things go wrong you can redirect everyone to that site to keep your customers informed. This is especially important when you have partial outages that could create inconsistencies in your database or application state if you were to continue to allow users to interact with it in a degraded state.
Our big site, hosted in Europe, is actually working, but our blogs and a news website are both down. We offer a paid service at $600 a year, and if the main site were down it would be very bad for our reputation.
Our DNS points to Azure on all these domains and things are hosted as "Azure Web Site" - how would notifications work if Azure itself is failing? Would I need to proxy the traffic through elsewhere?
Are there any services that solve this problem for me? I really don't mind paying a few dollars every month and not worry about this.
You may also want to google for DNS failover services, to help you automatically redirect traffic in more catastrophic failure cases. There are offerings from Google, AWS, and others.
The only manual step was delaying the switch back until our VMs were working fine and had all their resources. We did this by changing the Route 53 health check to one that always fails.
We also had to purge our crashed Mongo nodes because the journal was broken.
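For anyone wanting to copy the Route 53 trick: a sketch with boto3, using an inverted health check rather than swapping in an always-failing one - a slightly different route to the same effect (the health-check ID is a placeholder):

    import boto3  # pip install boto3

    route53 = boto3.client("route53")

    # With a primary/secondary failover record set, inverting the
    # primary's health check makes a passing check report "unhealthy",
    # pushing traffic to the secondary until you flip it back.
    def force_failover(health_check_id, fail=True):
        route53.update_health_check(
            HealthCheckId=health_check_id,  # placeholder ID
            Inverted=fail,                  # True = pretend the check fails
        )

    # force_failover("hc-id", fail=True)   # push traffic to the secondary
    # force_failover("hc-id", fail=False)  # switch back once VMs are healthy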
It really depends on how much risk you're willing to accept, and how much that is worth to you. It can be quantified via revenue lost, but reputation is much harder to put a number on.
This is not the first time this has happened in the last two months (after a relatively reliable year). The problem is I'm not sure any other hosting provider would do any better.
Be mad at the service provider if they don't live up to the number of nines they promised. Be mad at yourself if you expected more nines than they can deliver.
We have failover load balancing running between multiple datacenters, no issue here!
edit : 99.99%
Do Microsoft say this about Traffic Manager or are you suggesting you have to pay for extra services to get the advertised reliability figure?
Who was selling that to you? Because I'm pretty sure it wasn't Microsoft…
9 hours of downtime means they are down to at most 98.75% for this cycle.
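The arithmetic, for the curious, assuming a 30-day cycle:

    # 9 hours of downtime out of a 30-day (720-hour) cycle:
    print("%.2f%%" % (100 * (1 - 9 / 720)))  # -> 98.75%
    # For scale: 99.99% over the same month allows ~4.3 minutes down.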
(not that convenient to copy paste the OP link from a mobile device)
While you're obviously going to be unhappy with downtime, this is a genuine part of the calculation you should have made when you decided to outsource all your eggs into one basket.
How much of the users' data would be forever lost in such an event?
The other aspect is privacy - in theory, all of a user's data can be stored and accessed forever, e.g. 20 years from now, when the reincarnation of someone like Stalin comes to power.
Anyway, the point I'm trying to make is that we should design our services or apps with this in mind - the cloud can and will fail from time to time, maybe forever.
So, if possible, use the cloud as a 'bonus' feature, a means to back up data, and store the user's data offline, so that when the dark day comes the user still has his data.
Is having your stuff stored locally any more secure in that situation? If someone wants your data, they'll knock on your door and beat you and your family until you give it to them.
When the cloud is down, all we can do is twiddle our thumbs and hope it doesn't happen again. Or maybe we could send an angry letter to Microsoft, and hope somebody reads it.
It's great for convenience, and it's great for managing without certain skillsets that may be hard to obtain, and it's great for temporary capacity, but it's not cheap.
Unless your base-load cloud costs are more than the cost of full-time, experienced ops people ready at a moment's notice, you don't get close to any guarantee of uptime with non-managed hosting. The salary cost alone is substantial, let alone hardware spread across multiple locations. My firm pays at least seven figures a year for IT ops and doesn't come close to 99.9% uptime across everything.
If you're using your own servers, or even VPS, you do have control over infrastructure, and can plan for changes and mitigate problems quickly, and you can run for years without downtime if nothing is changing significantly. Depending on your staff, funding, etc that might be attractive or not. Each has its own advantages, and disadvantages.
Interesting to think about the potentially compounding failure modes these services are dealing with.
I'll have to look that incident up to check out their postmortem vs. what Microsoft ends up putting out.
Thanks again ~
I wonder how many customers Azure just lost due to this unexpected 2-day fiasco.
Amazon had a number of EBS fiascoes and survived just fine. I'd expect Azure to do the same.
It's obviously not going to destroy anyone's business, but there is a lot more competition than there used to be.
Seriously considering another layer above azure to mitigate this in the future. Very disappointing to see.
At least initially their status indicated they were handling the problem, but lately it's just been "All Good"; they said on Twitter that they resolved it, but it's not at 100% yet: http://azure.microsoft.com/en-us/status/
Put your servers in different regions, use Azure/Google, BlueMix/AWS, or even hybrid cloud, do something. Have a DR plan.
If the disaster strikes my region, I probably have better things to do than IT things (like running for my life :-).
But with the cloud, the disaster could be thousands of kilometers away and still affect me. That's the problem with the cloud: why should I stop working in my remote French town because there's a landslide in Ireland (or wherever they put the European cloud data centers)?
I'm not saying the cloud doesn't have its uses (especially as a redundant backup far, far away), but the all-cloud model has way more risks than people think... and vendors don't rush to explain that.
I'm one of those guys who think the future will be more and more harsh for Western civilization (think collapse of the Soviet Union). There will be less money for everything, infrastructure in particular; things will fail and you will have to deal with it locally, the DIY way.
Their error pages are less graceful than mine.
Haven't received any calls yet, but I don't think that will take long.
Disgusting management interface
Way to fuck up a mustard sandwich, Microsoftie.
We moved everything we had away from that Virus named Azure.
What does your client/customer think of you being on Azure? That you chose the crappy solution because your low-tech infrastructure still uses Windows, which does not carry a lot of tech cred.
20% of Azure VMs are Linux.
You are not well informed.
More likely they have _something_ which runs on Azure. Fortune 500s are, pretty much by definition, quite large - and probably have tons of departments and sub-departments. And at least one of those departments probably has the task of trying out new things, like Azure, by running something on it.
What surprises me is that nearly 20% of Fortune 500s _don't_ have something running on Azure.
(I wonder what percentage "run on" Amazon)
* Most major companies use more than one cloud provider
* "Use" is a very loose term here. It could mean anything from "the accounts team in some branch office uses S3 to back up their Sage data (or uses an online backup service that uses S3 in the back end)" to "they run their main product on our infrastructure".