Microsoft Azure data deleted because of DNS outage (sophos.com)
217 points by stonewhite 22 days ago | 84 comments

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport. I feel this should be rephrased for cloud computing at this point. The more people rely on cloud, the more these global fuck-ups are going to affect them. Makes me feel pretty good about that server rack in my garage that addresses most of my own (and my business's) compute needs.

If you already have a rack available with spare capacity, and your business is not expected to blow up overnight, and you can tolerate failures (and the turnaround time for you or your employees to fix stuff, order spare parts, etc.), sure, why not?

The capacity is there, might as well use it.

That said, if you didn't have said rack, I'm not so sure it would be worth it to even make a purchase order. Sure things outside of your control may break when you are using a cloud. But guess what, things outside your control will also break, on-prem. Particularly hardware, and network connectivity. There is no way your networking can be better than, say, GCP's own networking, or that you can deploy redundant workloads across availability zones (or even regions!) yourself.

By the time a purchase order for a new server can arrive, we can have a production-ready system running, with redundancy across availability zones, automatic failover, CDNs, backups, the works.

Basically, I don't care if someone knocks out power in my block, if someone cuts a network cable, or even if a machine goes up in flames.

One thing I would say is: even if you are very happy with your current setup, if you have some time to automate a similar setup on the cloud (keyword: automate), then I would suggest doing just that, and offload backups to the cloud too. Even if only as a business continuity thing.

My networking is good enough for my needs: 10GbE copper. Not 100GbE, but 100GbE is hard to actually fully utilize without jumping through major hoops.

My business is mainly deep learning R&D. Current cloud GPU, networking, and storage pricing gives me ulcers given my compute needs and the size of my datasets.

I do run my website in cloud, with redundancy and all that. I also use cloud storage for backup, and for K8S registry. If I was selling e.g. inference services, I'd be running them in cloud too (passing the costs onto the clients). Most of my local workloads could easily be shifted right across to any decent K8S provider.

But the fact is, my lone rack has been humming along with zero unscheduled downtime for 3 years now. I can count several global outages in each of the three major cloud services in this timeframe (most of them during US work hours, BTW), so I'm inarguably better off with the setup I have now than I would be if I moved it all to the cloud. Not to mention it already paid for itself several times over even though I burn through several hundred dollars a month in electricity.

If your needs are fairly static, then that's all the more reason to go on-prem if it's not too hard to set up initially. If you expect to need to upgrade systems on a regular basis because you need the extra CPU/disk/memory capacity, or switching infrastructure to support increased network utilization, that all gets very annoying very quickly.

Also, if you're sourcing components yourself or from a third party that isn't shipping specific configurations that have been tested hundreds or even hundreds of thousands of times in other installations (i.e. not Dell), then be prepared for the possibility that you'll run into weird errors from the specific combination of mainboard, CPU, memory, NIC, RAID, etc. used in a system, which can be extremely painful and time-consuming to diagnose for RMA. I mean, if you're lucky, it will exhibit itself as a crash that happens early and is easy to reproduce.

At this point, I see the premium you pay to large name providers like Dell/HP/etc as not just insurance that they will quickly accept and replace any possibly faulty hardware but also as the fee they charge for ironing out all the kinks of a platform so you don't have to worry about them for the most part.

It's totally possible to build your own hardware for cheap, but I no longer have the will to do so. To me, cloud services are like that, just to a lesser degree. I can handle the hardware locally (even if not custom built), but I'm finding less of a benefit to doing so as I get better at dynamically handling load (which has benefits in areas besides cost as well).

I'm not arguing this is the best solution for everyone. I'm just saying it's better for me, from both the reliability and cost standpoint. Frankly I couldn't care less what everybody else is using.

I have just checked the pricing of what my approximate (GPUs are different in my case) setup would cost to run on GCP, and it totals up to $40K a month. Not a typo, I did not add any extra zeros there. That's not including storage or networking costs, and _including_ the sustained use discount. That's nuts.

I understand, I just wanted to express how I think it's really specific to the task at hand (and vent a little about annoyances of the past). CPU- or GPU-intensive work might be a bad case, but it also depends on how actively you utilize the systems. For some of my stuff, I have what's likely close to an ideal case, where systems are only needed for about 5 continuous hours each day, Monday through Friday. I can turn them on at the start of work and off 5 hours later, and pay a fraction of the cost of a full-time deployment in the cloud, since other than disk I only pay for the compute used.
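The math above can be sketched in a few lines. The hourly rate here is a made-up placeholder, not any provider's real price; only the hours-used ratio matters:

```python
# Back-of-the-envelope savings from running compute only during work hours.
# HOURLY_RATE is a hypothetical placeholder, not a real cloud price.
HOURLY_RATE = 2.50           # hypothetical $/hour for one instance
HOURS_USED_PER_WEEK = 5 * 5  # 5 hours/day, Monday through Friday

always_on_weekly = HOURLY_RATE * 24 * 7
work_hours_weekly = HOURLY_RATE * HOURS_USED_PER_WEEK

print(f"Always-on:  ${always_on_weekly:.2f}/week")
print(f"Work hours: ${work_hours_weekly:.2f}/week")
print(f"Fraction of always-on cost: {work_hours_weekly / always_on_weekly:.1%}")
```

With these numbers, the 25-hours-a-week schedule costs roughly 15% of an always-on deployment, independent of the actual rate.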

> it totals up to $40K a month

Yeah, it sounds like your use case is bad for cloud (not that I'm an expert), especially if you're able to get it all in one rack. Cloud is a lot more useful if you have different layers that may need to grow at different speeds: for example, a front-end load-balancing set of systems, a group of web servers aggregating back-end services for page display, different service clusters, and a set of databases (read-only replicas, master-master replication, etc.). Being able to scale some systems vertically (just allocate more RAM and/or CPU) and others horizontally (just add more servers to a load-balanced set) can be extremely useful, especially when the unexpected hits and you find your load is suddenly multiple times the normal amount and you need to provision now.

This shows an utter ignorance of the market; you can get a dedicated server provisioned for you in minutes from most hosting companies.

Your comment applies to yourself as well. The vast majority of people don't need dedicated servers provisioned for them.

Another perhaps more relevant way to think about this is that security and availability are at odds with each other, and in this case a system designed for security made a secure choice.

The more secure a system is designed to be, the more likely it is to treat unusual conditions as an attack and possibly perform some destructive action to thwart the assumed attacker. Think of phones configured to delete all data after X incorrect password attempts, HSMs with anti-tamper switches, etc.
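That fail-secure trade-off can be sketched as a toy model (all names here are hypothetical, not any vendor's API): after too many unusual events, the system deliberately destroys its own data rather than assume a benign explanation.

```python
# Toy model of a fail-secure device: repeated unusual events are treated
# as an attack, and the response is deliberately destructive.
class SecureStore:
    MAX_ATTEMPTS = 10  # hypothetical limit, like a phone's passcode counter

    def __init__(self, pin: str, secret: str):
        self._pin = pin
        self._secret = secret
        self._failures = 0
        self.wiped = False

    def unlock(self, pin: str) -> str:
        if self.wiped:
            raise RuntimeError("data destroyed")
        if pin == self._pin:
            self._failures = 0
            return self._secret
        self._failures += 1
        if self._failures >= self.MAX_ATTEMPTS:
            # Destructive response: availability is sacrificed for security.
            self._secret = None
            self.wiped = True
        raise PermissionError("bad pin")
```

The point of the sketch is that the wipe is a feature, not a bug, in a security-first design; the failure mode appears when the "unusual event" detector misfires.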

I don’t quite get what security is being accomplished in this case. You can’t access the key store, so you nuke the encrypted database?

Disclosure: I work on Google Cloud.

I’ve always enjoyed this quote, but my problem with [the description of] this outage is the third-party dependency.

Packets can’t get from your cloud provider to downstream users of CenturyLink? That’s fair.

Your cloud provider can’t send packets to/from CenturyLink, so they nuke your database? I literally don’t understand.

Is the service described actually a third-party service that’s been white boxed? (I mean this in the most honest way possible. I do not understand the details, and I found the article surprising).

Yes, sounds like bad design to me. But the uber-problem here is with complexity. Google had its own share of screwups (most of which users never saw, but some of which they did), albeit not quite to the extent of destroying data at scale (and when GMail did destroy data at scale, it had tape backups, so it was never actually lost).

The root cause of nearly all of these screwups is that large, complex systems can't be fully understood or observed, and that a good chunk of knowledge about such systems is institutional, rather than explicit. So from time to time people _will_ make assumptions that don't match reality, and reality will punish them for it. Which is what, I strongly suspect, happened in this case.

I feel that while valid for you, this sentiment is highly out of touch with the reality of the needs, resources, and capabilities of most people who need these kinds of systems.

It reminds me of a friend who wonders why his parents don't just install Ubuntu because windows is so awful.

Because nothing could happen to your rack (or garage)?

Bad things can happen to it, but at least nobody futzes with it all the time. Most of e.g. Google's major outages (Cloud and Borg) are configuration error or human error. I run my own systems very conservatively. I use conservative technology choices, my kernels are years behind bleeding edge (although I do install security updates, of course), and I don't try to fix or "improve" what's not broken because my paycheck does not depend on that. And in the unlikely event something does get fucked up, I can fix it within less than an hour and I'm the only one affected. It's also extremely unlikely to have any kind of "global" outage.

Don't get me wrong, I use cloud (GCP, if you must know) too, and if my business grew massively, I'd probably use it more. But frankly I'm more satisfied with my own "on prem" solution. Single rack which basically pays for itself every 3 months or so in cloud costs, what's not to like?

I totally understand what you’re saying, but at what point does it become “putting money under your mattress” versus using a bank?

Yes a major outage that’s no fault of my own could take down my site - and that isn’t great. On the other hand, wouldn’t you trust them more than yourself? Just like you trust a bank more than yourself to keep your money safe? I’m sure the average major cloud service has far better uptime than the average on prem solution. At what point does it become hubris to spin up your own on prem solution?

>> wouldn’t you trust them more than yourself


>> versus using a bank

If my bank charged 25% of my balance every month, I'd very quickly choose the mattress. :-) But I do recognize that my needs are pretty uncommon.

What if you're on vacation when something happens, such as a power outage? What about a massive event like an earthquake/flood/wind storm/fire/etc?

Far less likely events than an SRE pushing a bad config.

I can’t say for sure, but I’m assuming they meant because it is (in comparison) not a distributed (or complex) system.

I just don't hear any good things about Azure. That is unfortunate, because I'd love AWS to have some competition.

> I just don't hear any good things about Azure. That is unfortunate, because I'd love AWS to have some competition.

Google Cloud is fair competition – provided they have the service you need. AWS and Azure both beat them in number of services. If Google has it, then it should behave as expected, and some are downright impressive (GKE and VM auto migration on GCE).

Azure is... infuriating. Inconsistent, unreliable APIs, surprising behavior everywhere (attach an internal load balancer, lose internet connectivity!?), lots of restrictions on which features can be used with which SKUs.

I see improvements and it is difficult to beat them in the enterprise, but speaking as an engineer, man Azure is infuriating.

> attach an internal load balancer, lose internet connectivity!?

Ah, yes, the "TCP and UDP egress work unless you define an _ingress_ load balancing rule for either protocol, at which point the other protocol breaks until you create a dummy rule for the other protocol that will trigger your security team to send you tickets every few months."

Disclosure: I work at Azure.

We're doing our best, but we're not going to suggest there's not more to do. Every major cloud provider has had issues at one point or another (I formerly worked at Amazon and Google), and I'll just say - we hear you, and we are fiercely committed to earning your trust.

I work with Azure daily, my company uses tons of IaaS on all cloud providers, very dynamically. I'm mostly concentrated on Azure.

First I want to say, thank you for your service, keep it up, there is a lot to be done, but I see progress.

Secondly, please get your teams together and start communicating. We encounter a lot of issues with things that should just work or should be much simpler. Sometimes we contact support and just get handed from team to team without anything actually getting resolved.

Third, please, oh, please, get your SDKs (especially the Python one) fixed. It looks like every new build breaks something, sometimes even the same version on multiple installations since there is a lot of variable versioning done under the hood...

Sometimes I get a feeling a lot of things are "leaking" towards the customer. Wanna change an instance type? You get an "instance not available in cluster" error, or something similarly undocumented. Wanna copy a snapshot between regions? Good luck with that, and hope you've got some retry logic and a hell of a timeout.

Keep pushin'! :)

I'm sure you are, and I wish you the best of luck. I'd rather spend money with Microsoft than Amazon, everything else being equal. One way that Azure could really improve on Amazon is docs/usability, for which AWS is terrible.

Do I understand your comment to mean that Amazon has terrible documentation around AWS? Because if that's the case, I'm truly curious what documentation you find terrible. It can be confusing at times for me, but from my experience, that's mostly due to the fact that there's so many different options for some things.

I just ran into an issue with the way they publish documentation as blog posts. It isn't hierarchical from what I can tell, and they don't go back and update old posts to point you at the latest document. I wasted a lot of time going step-by-step through one, only to find out it was deprecated.

It is a problem because a lot of the good documentation about overall usage or architecture is, for some reason, in this blog format, but I can see it just being written off because "it's just a blog".

They also leave a lot out of the true documentation, such as the Landing Zone stuff (in fact they hide this one entirely) and Application Discovery Service (they make no mention of how the agents work). There are plenty of other minutiae that I have had to contact our rep to get from the product team.

The most common or basic stuff has some great documentation. But once you stray into the "weird" it isn't uncommon to see 2 or 3 line documentation pages about something.

The deprecated recipes etc. problem is not limited to AWS. I’ve trained myself to observe dates on what I’m reading, and digging deeper if it’s more than 2 years old (or 1 year old for moving targets like newer frameworks/services) especially when learning something new.

For one, it is practically impossible to work out how much something is going to cost.

The CloudFormation docs could definitely use some help. Often the examples they provide for Resources definitions are poor in content. Especially on some lesser known Resource types.

We use Azure and love it. We've had more problems over the past few years from downstream services being broken by AWS (S3 outage, etc.) than our primary apps being broken by Azure.

> S3 outage

Wait what? S3 is 11 nines. That's like losing 1 file every 8 PB-years.

11 9s is for durability, not availability. I.e., S3 will almost never lose your file, but sometimes you won't be able to retrieve it immediately. There was a pretty significant S3 availability outage not too long ago, which is probably what the parent comment is referring to.
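The durability arithmetic is easy to sketch: eleven nines means an expected annual loss fraction of about 1e-11 of stored objects, which says nothing about whether you can reach them right now. The object count below is an illustrative assumption, not an S3 figure:

```python
# Expected object loss at 11 nines of *durability* (not availability).
durability = 0.99999999999             # 11 nines
annual_loss_fraction = 1 - durability  # ~1e-11 of objects per year

objects_stored = 10_000_000_000        # hypothetical: ten billion objects
expected_losses_per_year = objects_stored * annual_loss_fraction
print(f"Expected losses per year: {expected_losses_per_year:.2f}")
```

Even at ten billion objects, the expected loss is roughly a tenth of an object per year; an availability outage, by contrast, makes every object unreachable at once.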

Availability: in practice, a 5-hour outage in us-east in 2017: https://www.theregister.co.uk/2017/03/01/aws_s3_outage/

Azure's ILBs are still just bizarre to me. It's the first load balancer I've ever worked with where sometimes a member of the pool can't reach the load balancer. At the packet level given their implementation, this makes sense but tell me how many times you've ever had to write a stats endpoint where the stats nodes had to do workarounds to send their own stats?

Source: https://docs.microsoft.com/en-us/azure/load-balancer/load-ba...

Do you have any information on their implementation? I only noticed that something was amiss once I connected a load balancer and my instances suddenly lost internet access. And also due to the limitation on how many load balancers you can have.

It seems that once you add a load balancer, all traffic gets funneled through it, doesn't matter if it was addressed to it or not. Which is unlike any other load balancer I have ever seen.

Coming from other clouds, this was a shock.

The only thing comparable is AWS's NLB. Because that load balancer is so transparent, clients appear to be connecting directly, with the original source ip. Which caused issues when I wanted to deploy my own Elasticsearch and use an internal NLB for master discovery (whenever a request got routed to the same machine packets got discarded by the kernel). But you can just switch to another load balancer then.

Disclosure: Engineer at Microsoft

There is an excellent Network Academy video on YouTube [1] that covers some of the internal implementations of the layer-4 load balancers (in the context of HA for NVA's).

[1]: https://youtu.be/oYJBrWcOXfY

Try google cloud?

I'm trying it... starting to have second thoughts though. They make it look like it's all ready to go and then you start uncovering bits and pieces of things that seem to be half-done, half-legacy, half-dead at the same time. TBH it looks like they got spread too thin and can see how a good amount of services are solid while many are just there.

Which parts specifically? I have worked on GCP extensively. Things that annoyed me were App Engine Standard using Python 2.7, and AWS being more innovative in bringing in new offerings much faster than Google can. But Google recently announced Python 3.7 for App Engine Standard. Other than that, in terms of reliability, GCP is amazing.

> I'm trying it... starting to have second thoughts though. They make it look like it's all ready to go and then you start uncovering bits and pieces of things that seem to be half-done, half-legacy, half-dead at the same time. TBH it looks like they got spread too thin and can see how a good amount of services are solid while many are just there.

Interestingly, that's my experience on Azure. Seems like a lot of offerings were smashed together in order to try to form a coherent product.

Seems like a lot of offerings were smashed together in order to try to form a coherent product.

Seems like that would apply to many, many web services and products in the last decade.

True. But even the worst AWS offerings (*ahem* EKS *ahem*) are still integrated across the product line. You can expect easy integration with other AWS services. And indeed, what little it offers is integrated (AWS Auto Scaling groups, NLBs, IAM, etc.).

Azure's AKS is way better as a whole, but until recently, wasn't integrated with their own scalesets. How does that even happen?

To be fair to Azure, AKS is an example of how to roll out features. All the knobs are there (even more so than GKE; you can even set cluster IP ranges), it just works, and there are no silly surprises. At least, I haven't found surprising behavior yet.

Could you be more specific?


To be more specific... Disclaimer: I've built distributed infra on AWS, Azure and GCP as well as on premise. There's a reason I'm using GCP. I'm not stating that reason below; instead, these are a few of the reasons I don't like it. In summary, I believe they focus on too many things with too little depth. This is not quite unique to GCP, but it is rather new to GCP, IMO.

Take endpoints for example. You get a nice feature that documents your API based on the code-level docs. To update it you also get an API, but it only works if you do one manual sync first from the console (https://cloud.google.com/endpoints/docs/grpc/dev-portal-sync...)

Then you discover that your documentation starts disappearing and debug for a while just to figure out that the last refactoring got the total length of your service name, RPC name and parameters over 80 characters and your doc doesn't show up if that's the case.

Then you use the tracing capabilities, only to discover that the traces don't propagate across services. There's something that ESP (their nginx proxy) doesn't do. You take a look at it, try to build it, but discover it uses a Bazel version that is two years old.

Then you look at quotas/throttling. First you'll notice that the examples don't work. You just get errors (and apparently they also don't get fixed after sending feedback): https://cloud.google.com/endpoints/docs/grpc/quotas-configur.... You look at the example and notice it's copy-pasted and modified from some JSON, and the field names are incorrect.

Then you see that throttling only works with API keys, but they (and everyone else) don't recommend you use API keys, and instead recommend IAM service accounts. Except that a bunch of features won't work unless you use API keys. So you use API keys, and then discover you can't provision them, because there's no API; you have to, again, do it manually through the console. You talk to support, and they'll recommend you use IAM service accounts because they are much better, although you won't be able to use the API-key-specific features (https://cloud.google.com/endpoints/docs/grpc/when-why-api-ke...).

If you take a look at service accounts and IAM, you discover new things. I won't go into details here, but let's just say figuring out whether a policy should work in a particular case is more art than science.

Then in GKE you want to enable TLS on your gRPC service. It should be easy: https://cloud.google.com/endpoints/docs/grpc/enabling-ssl It's just that it doesn't work, or that different documents say both that it works and that it doesn't. Take a look at https://github.com/kubernetes/ingress-gce/issues/18 From certificate provisioning to a functional service, it took roughly 5 days involving tcpdump, looking at ESP source code, all the logs, raising bugs on GitHub, etc.

Then see Service Catalog (https://cloud.google.com/kubernetes-engine/docs/how-to/add-o...). You try to download from the link they provide and it throws an error. You debug, maybe talk to support, and they'll tell you that there was a patch and it fixes it. It just hasn't been released since April 2018...

Then one day your GKE pod won't deploy and you see no errors. It takes a while to realize that if the port name is over a certain number of characters it won't work.

There are many of these and I could keep going :)

This was my experience when I briefly considered their relational database solution.

There are other large software companies with competing cloud offerings, including Oracle[1], IBM[2], Alibaba[3] and more.

[1] https://cloud.oracle.com/home

[2] https://www.ibm.com/cloud/

[3] https://us.alibabacloud.com/

Why on earth would you want oracle cloud? Pay in collected souls?

Your soulgem has insufficient charge for this operation.

IMO HN is not the place for jokes, but that was really very very funny

I’m not sure it was a joke. I had to deal with Oracle 9i years ago and I still have the emotional scars of being buggered hard for some licenses for it. The only way we could circumvent a lot of cost, allowing us to keep humans on budget, was reusing an older oracle upgraded license but this was only licensed to work on a SPARCstation LX and the tool was Java and took 11 minutes to start and no one complained because they were so depressed and demotivated from the job that it hurt to do so.

I quit and slithered off to work with SQL Server 2000 and I’m sure my soul slipped back in one night.

None that can match the maturity or performance of AWS. I always hear really terrible things about Oracle and IBM.

edit: Mostly I've heard it just generally is not great, plus you have to deal with typical Oracle badness.

"IBM Cloud" is just the rebranded name of Softlayer (which was acquired.) They're decent for managed bare-metal hosting.

Yeah, it's the offerings on top that are impossible to use. I mean, once you set it up, it works like everybody else's, but finding how to access the managed services beyond VMs and hosting is exceptionally convoluted.

Do you have specifics to share about Oracle's cloud? That's the one you hear the least about publicly, I think. You can at least get some idea of the problems of AWS, GCE and Azure.

Oracle offer quite a compelling free trial.

I got a VM for 30 days with 16 gigabytes (!) of RAM and it didn't cost me a penny. In fact, the machine continued working for a few months after the trial ended, which I thought was very generous. They just blocked port 80 and 443 inbound.

They even have a high-touch sales process where a real person sends you emails and replies when you send them questions. Imagine that at AWS!

I was actually tempted to keep paying for the VM from Oracle, but the high-touch sales person literally could not tell me how much it was going to cost. She could point me towards the price calculator page, but the VM type I had been given in my free trial did not exist in their price calculator. I asked for just a simple number: how much is this going to cost me per month? She could not give me an answer. So I didn't buy it. I still have no idea whether it would have been a good deal or not.

They even have a high-touch sales process where a real person sends you emails and replies when you send them questions. Imagine that at AWS!

You haven't had the joy of having a business support plan from AWS. You can start a chat and get immediate help. They will do a screen share with you if necessary. I have had 100% success with them. This is everything from CloudFormation issues, S3 cross-account permissions with CodeBuild/CodePipeline, API Gateway/Route 53 setup issues, IAM permissions to read and write from S3 to Aurora/Redshift, a best-practices question with DynamoDB/DAX and Python, and a Route 53/security group/autoscaling group/load balancer issue.

All of these issues were because of my lack of knowledge on the platform.

> They even have a high-touch sales process where a real person sends you emails and replies when you send them questions. Imagine that at AWS!

They do that too once you are big enough.

It's not about being big, just being willing to spend the money on a business support plan: 10% of your monthly spend, with a minimum of $100/month.

I briefly tried to use Oracle Cloud for a small project. Its UI is close to the worst I have ever used: the whole product is inconsistent and confusing. Despite having multiple thousands of dollars free credit available, I quickly switched to GCP!

Well, I'm not sure that's true.

I only hear bad things about AWS and Google Cloud, and I hear nothing much about Azure.

That's because Azure is a poor service with mediocre support.

My anecdotal experience: I spent a couple of weeks (!) setting up our environment (Bitbucket, Django, Ubuntu, Dockerized) on Azure App Service and Azure Pipelines. Their documentation was incomplete, out-of-date and MS support staff struggled to help if you didn't have a Windows machine (their RDP software doesn't support Linux, Skype for Business doesn't support Linux and normal Skype for Linux doesn't support screen sharing).

Little things like trying to SSH into any machine so that you can execute commands on your docker container (for, say, database migrations or to check logs) is almost impossible. If it wasn't for the help of a lot of people on #docker in Freenode I would probably still be working on it.

I had to use Google Hangouts with a Microsoft support person's personal gmail account, while he was connected over VPN (since he was based in Shanghai), so I could show my issue. The support person was extremely pleasant to deal with and understanding, though, and he went above and beyond to help get my issue resolved even though it turned out to not be from his department.

However, after getting set up, I noticed I was getting 12 second (!) responses from an API I had written just to retrieve a logged-in user's first name, last name and email in JSON. This API resolves locally in 20ms - including layers of authentication.

This turned out to be a known issue when running a managed "Azure Database for PostgreSQL" service and was common on MS support forums.

After reaching out to Microsoft support for Azure Database for PostgreSQL, their response was this, copy-and-pasted:

> As you are currently using Basic Tier (2 vCores, 51200MB), the bad performance is expected.

> When comparing with the performance in your VM, the on-prem is supposed to be better than cloud even within the same hardware environment.

> Please give it a test in higher tier and configure it with a compatible settings compared with your VM. In the meanwhile, you can monitor the slow queries via Query Performance Insight to find out what queries were running at a long time when those API were called.

> Pricing tier information can be found at https://docs.microsoft.com/en-us/azure/postgresql/concepts-p... .

...they tried to upsell me on the higher tier database 3 times in that email chain, believing that this level of performance was acceptable for my database tier.

Of course the next tier up from the $60/month that I was on was $160/month, and since we only have maybe two concurrent users at most it didn't make sense to triple our costs just to avoid 12 second database calls.

I moved the entire service to AWS last week. The set up was painless and swift. Using equivalently priced services, the API now resolves in 50ms.

I don't think I'll ever go back. Not even for free.

I’m sorry you had that experience; your feedback is very helpful to us in pointing to where we can improve. Our storage layer which underlies the Basic tier has variable IOPS, and in this case it sounds like you were on the receiving end of a noisy neighbor. Our General Purpose and Memory Optimized tiers deliver more predictable IO and do not show the same behavior as the Basic tiers. That said we’re always working to improve all of our offerings, including Basic, and hope to provide a better experience for you in the future.

-Rachel, from Azure Database for PostgreSQL

I'm unsure why you're responding to this. There is no planet where 12 seconds latency is appropriate or a consequence of a "noisy neighbor" or not "predictable" or whatever you want to use to rationalise it.

I don't believe your reasons because I had two Azure Database for PostgreSQL instances in two different locations (yes, I moved it across as part of troubleshooting) yet experienced the same levels of performance.

The product that you charge ~$60 a month for is fundamentally broken. It is unusable for any application.

If you are genuinely sorry, I'm happy to give you my company's contact details so you can refund or credit the last few months of what we paid for your service. Otherwise it's this type of empty response and lack of responsibility (or desire to correct the problems, or even correct your marketing material to emphasize what kinds of latency to expect) that led me to instruct my management and ultimately lead the move over to AWS.

Sounds like they built a dead-man's switch and then broke the process through which the man and the switch communicate.

It was garbage collection. They were deleting data that couldn't be accessed (because no key existed in the system to decrypt it), but the DNS failure fooled the detection into thinking that a failed lookup meant "no key exists". Yikes.

To be clear: it was ultimately only a 5 minute loss (and the fact that the DNS outage was simultaneous probably meant there wasn't much data being stored anyway) because they had a regular snapshot facility. So defense in depth saved them.

Still, yikes. That's a pretty disastrous bug.
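The failure mode, as described, is easy to reproduce. A minimal sketch (all names hypothetical, not Microsoft's actual code) of a GC job that only drops a table when the vault gives an authoritative "no such key" answer, rather than on any failed lookup:

```python
import socket

class KeyNotFound(Exception):
    """Authoritative answer from the vault: no such key."""

class FakeVault:
    """Stand-in for a Key Vault client (hypothetical API)."""
    def __init__(self, keys, reachable=True):
        self.keys, self.reachable = set(keys), reachable
    def get_key(self, key_id):
        if not self.reachable:
            raise socket.gaierror("DNS resolution failed")
        if key_id not in self.keys:
            raise KeyNotFound(key_id)
        return key_id

def tables_to_drop(vault, tables):
    """Return tables whose keys are *confirmed* gone. On transient
    lookup failures, keep the table and retry next GC cycle."""
    doomed = []
    for table, key_id in tables:
        try:
            vault.get_key(key_id)
        except KeyNotFound:
            doomed.append(table)       # safe: vault said the key is gone
        except (socket.gaierror, OSError):
            continue                   # DNS outage != key deleted
    return doomed

tables = [("orders", "k1"), ("users", "k2")]
print(tables_to_drop(FakeVault({"k1"}), tables))                   # ['users']
print(tables_to_drop(FakeVault({"k1"}, reachable=False), tables))  # []
```

The reported bug amounts to collapsing the second `except` into the first, so a vault-wide DNS outage looked identical to every key having been deleted.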

I’m really curious as to whether this would be because of REST APIs. Conflating ‘I can’t find that endpoint’ with ‘the endpoint says that resource has been deleted’ via a 404.
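If the key lookup goes through a REST API, that conflation is exactly the trap: a bare 404 can mean "key deleted", "wrong route", or "a proxy couldn't find the backend". A hypothetical sketch of disambiguating via a machine-readable error code in the response body:

```python
def key_state(status_code, body):
    """Map a hypothetical Key Vault REST response to a key state.
    A 404 alone is ambiguous; only a machine-readable error code
    in the body is an authoritative "the key was deleted"."""
    if status_code == 200:
        return "exists"
    if status_code == 404 and body.get("error") == "KeyNotFound":
        return "deleted"        # authoritative, safe to GC
    return "unknown"            # bad route, proxy 404, 5xx: do NOT GC

assert key_state(200, {}) == "exists"
assert key_state(404, {"error": "KeyNotFound"}) == "deleted"
assert key_state(404, {"error": "RouteNotFound"}) == "unknown"
```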

Would be a pretty ineffective dead man's switch if a backup from 5 minutes ago is available.

It's presumably intended as a way to reclaim space now taken up by effectively-random garbage, rather than as a security measure.

The keys didn't get deleted, only the encrypted data. If the keys are gone, they are GONE; AFAIK you cannot restore them.

> The deletions were automated, triggered by a script that drops TDE database tables when their corresponding keys can no longer be accessed in the Key Vault, explained Microsoft in a letter reportedly sent to customers.

By what logic is this NOT a terrible idea?

It sounds like they do this for data security and compliance reasons, but it seems like sloppy engineering not to consider unreachability as a possible temporary error.

It's reasonable to delay deleting encrypted data (which can take a long time) and just delete its keys (which is very fast) upon a user request to delete the data. If you believe in encryption, then once you delete the only remaining copies of keys, the encrypted data is as good as deleted.

So that's why it's a great idea to implement data deletion as a two-phase sequence of synchronous key deletion, then asynchronous low-priority block scrubbing (or marking free for reclamation).

But not handling the case where your system is confused whether the keys are deleted (versus just temporarily unavailable) is less of a great idea.
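The two-phase scheme described above might look roughly like this (a toy in-memory sketch; names are illustrative, not any cloud provider's actual design):

```python
from dataclasses import dataclass, field

@dataclass
class CryptoShredder:
    """Hypothetical two-phase delete: synchronous key destruction
    makes the data unreadable immediately; reclaiming the encrypted
    blocks is deferred to a low-priority background pass."""
    keys: dict = field(default_factory=dict)    # key_id -> key material
    blocks: dict = field(default_factory=dict)  # key_id -> ciphertext
    scrub_queue: list = field(default_factory=list)

    def delete(self, key_id):
        del self.keys[key_id]             # phase 1: fast, synchronous
        self.scrub_queue.append(key_id)   # phase 2: enqueue reclamation

    def scrub_pass(self):
        while self.scrub_queue:
            self.blocks.pop(self.scrub_queue.pop(), None)

s = CryptoShredder(keys={"k": b"secret"}, blocks={"k": b"\x9f\x01"})
s.delete("k")
assert "k" not in s.keys    # cryptographically deleted right away
assert "k" in s.blocks      # ciphertext still occupies space
s.scrub_pass()
assert "k" not in s.blocks  # space reclaimed later
```

The danger is entirely in the trigger for phase 2: it must fire only on a confirmed key deletion, never on a failed lookup.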

Well, it's certainly one way to guarantee database consistency after a network partition…

Hey now, we have guaranteed eventual consistency!

...at the heat death of the universe

eventual elimination..

At least it needs better error handling: "not found" and "cannot resolve domain name" should be handled differently.
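In code, that means letting resolution failures raise instead of folding them into the "absent" case. A small illustrative sketch (the resolver is injectable here only so the example is self-contained; names are hypothetical):

```python
import socket

def lookup_key(host, fetch, resolve=socket.getaddrinfo):
    """Distinguish "cannot resolve the vault's name" (an outage,
    raised as an error) from an authoritative "key not found"
    (fetch returning None)."""
    try:
        resolve(host, 443)
    except socket.gaierror as e:
        # Infrastructure failure: surface it, never treat as "absent".
        raise RuntimeError(f"vault unreachable: {e}") from e
    return fetch()  # may legitimately return None: key truly absent
```

With a working resolver, `lookup_key("vault.example", lambda: None, resolve=lambda h, p: [])` returns `None` (key genuinely absent), while a resolver that raises `socket.gaierror` turns into a loud `RuntimeError` instead of a silent "no key".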

Perhaps that was exactly the response they got at that moment.

The whole "managed resources have a mandatory public dns and ip" idea is insane

Yeah they come with a firewall but still. Imagine competing with everyone else on a single namespace.

At least for S3 buckets it's justified, because those are meant to be accessible, but the databases?

If the "public dns" are arbitrarily-nested subdomains, and the "ips" are IPv6 addresses, I don't see what the problem is. We've got essentially-unbounded amounts of both of those.

> If the "public dns" are arbitrarily-nested subdomains,

Not the names. Those are the limited, conflicting resource, and they aren't arbitrarily nested:

"The server can contain only lowercase letters, numbers, and the hyphen (-) character" - "The domain name postgres.database.azure.com is appended to the server name you provide."

It also means I cannot just create parallel environments on their own subnet with the same scripts; I have to enter and manage configuration points for all the names, either as DNS aliases or straight up in the application connection strings. It also makes it unnecessarily harder to just migrate things around as needed to a failover node.

e.g. RDS's advantage was supposed to be zero-downtime upgrades too, and then they sent everyone a mail last week saying "we're going to have 5 minutes of downtime to upgrade your DB, whether it's multi-zone with replicas or not"?

You can't trust them with that; the cloud is just somebody else's computers. You need your own failover on top, and instead of changing one entry in the namespace, I have to redeploy all clients. Oh joy.

It appears that the SLA guaranteed uptime for Azure SQL Database is 99.9% or 99.99%, depending on tier. That equates to the following allowable downtime per month (which I think is what they base SLA fulfillment on):

99.9: 43m 49.7s

99.99: 4m 23.0s

Sounds like they need to cough up some money for their four 9s customers...
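Those budgets follow directly from the SLA fraction; for reference, a quick way to compute them (using the average Gregorian month length, which is presumably what the figures above assume):

```python
def allowed_downtime(sla, days=365.2425 / 12):
    """Monthly downtime budget for a given SLA uptime fraction,
    using the average Gregorian month length by default."""
    seconds = days * 24 * 3600 * (1 - sla)
    m, s = divmod(seconds, 60)
    return f"{int(m)}m {s:.1f}s"

print(allowed_downtime(0.999))   # 43m 49.7s
print(allowed_downtime(0.9999))  # 4m 23.0s
```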

As the article indicates, MSFT is offering 3 months of free service to affected customers.
