Microsoft had three staff at Australian data centre campus when Azure went out (itnews.com.au)
89 points by pophenat 7 months ago | hide | past | favorite | 49 comments

I guess people really have forgotten how to run datacenters.

The only people who should be shocked in this thread are the people who have been hoodwinked into thinking operations is so hard you need thousands of staff. I know AWS/GCP/Azure like to charge us as if we were hiring an army of sysadmins, but the truth is that day-to-day DC ops does not require so many people. Hardware failures are more rare than you think and you can work around them without panicking anyway.

You actually need thousands of people for operations at major cloud provider scale (and many more for development), but at some point it consolidates: you only need people for hands-on tasks at the satellite sites, and the rest sit at HQ.

Or rather at low wage locations like India :)

The management's way of thinking is: "Well, let's just pay for the peace of mind." Except that this famous peace of mind never comes, because the cloud gets more and more complex each year and it's hard to keep up. Heck, even Amazon can't keep up: for example, officially they deprecate bucket policies, but internally they still use them, for example in the CloudFormation templates for Control Tower. But now it's too late to go back, as most of the internet is running on the three major public clouds. You need a lot of determination and a good plan to free yourself from vendor lock-in. In larger orgs it's practically impossible.

I don't believe that S3 Bucket Policies are deprecated. They are powerful, effective, and consistent with almost everything else at AWS (Resource Policy). Perhaps you are thinking of ACLs?

Sorry, yes, I meant ACLs!

This peace of mind is also outsourcing responsibility. Having someone else to point to when shit hits the fan is very valuable for a manager.

In this case they can't even get blamed for their vendor choice because both AWS and Azure are now so big that they're in "nobody ever got fired for buying IBM" territory.

Even within a single AWS region you can land in completely different data centers. Perhaps it doesn’t require as many people as some think, but the large cloud businesses run at a larger scale than your mom-and-pop data center.

But what about the hundreds of local jobs they promise?

They're temporary initial construction jobs and a few low-skilled remote-hands staff running around.

I wouldn't call those low-skilled. But they're not paid like developers.

Three to five on-site staff to operate a mid-sized DC (10MW/~1K rack yield) is not unusual. This is assuming there are several others on-call.

I know very little on the datacenter operations side of things - I guess 3 people is not a lot, but what is normal? How many operations people are at say AWS US-East-1? I presume it doesn't scale with number of servers, that would not scale well. What is a 'normal' level? 10? 100? It can't be more than 100, can it?

I doubt any amount of staffing could address a lack of specialism in dealing with power or air-conditioning issues; both likely involve infrequent maintenance by external vendors. 20 people blowing up a vendor's phone doesn't fix a problem any faster than one.

Amazon goes to the extreme of putting its own custom firmware on switchgear because the choices the vendor makes in theirs don't align with Amazon's objectives.

I don’t think AWS is blowing up a vendors phone when something goes wrong in one of their facilities.

[1] https://www.datacenterknowledge.com/archives/2017/04/07/how-...

Amazon doesn't make their own AC units or generators, so it's still likely they would need external support for a case like this.

That's some magical thinking, thinking that you don't need hardware people because you wrote some software.

All the same gear is still there; its control functions are just ceded to Amazon panels, integrated to the point of even removing some PLC-like devices.

> I know very little on the datacenter operations side of things - I guess 3 people is not a lot, but what is normal?

Bear in mind that outside of the US and maybe one or two other locations ex-US, almost all of the magic cloud operates out of third-party datacentres, not their own.

They will have a small office on-site where 3–5 people sit, and those people are exclusively dedicated to the cloud equipment itself. The datacentre ops side is, by definition, handled by the third-party datacentre operator.

The guys on-site are clearly only there for "intelligent hands" purposes, as everything else will be done remotely from Silicon Valley or wherever.

us-east-1 is many, many datacenters across its Availability Zones and tens of billions of investment for Amazon Web Services.

Across all the datacenters the number of operations personnel likely exceeds 100. Think of the unit of scale as a datacenter, with an availability zone potentially containing 10+ of those.

[1] https://www.datacenterfrontier.com/cloud/article/11427911/aw...

0-2 staff in a typical DC is not unusual at all, with on-call staff usually within a 30-minute drive.

Larger DCs can and do have more staff on-site 24/7 and typically the amount of staff on-site at any given time is driven by SLAs.

I expect the DC in TFA to return to lower staff levels once they've worked on reducing their total "time to restart chiller" or reduced the amount of manual work involved in doing so.

Still we read how DCs are a job generator. E.g. "One hopes for hundreds of jobs for locals." https://cryptoquorum.com/oman-opens-cryptocurrency-mining-ce...

These usually mean temporary construction jobs. Politicians don’t like to point that out.

Yeah, that's counting construction jobs like the sibling said, or just unfounded optimism.

A datacenter with tight SLAs probably needs one 24/7 tech and one 24/7 security guard. That's achievable with 5 jobs for each, so 10 jobs total. Maybe a couple more techs if the ticket volume is high. Never going to be hundreds of permanent jobs.
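For what it's worth, the "5 jobs per 24/7 post" figure checks out as rough arithmetic: hours to cover per year divided by hours one full-timer actually works. (The 2,000-hour FTE figure below is an assumption, and real rosters round up for leave and sickness.)

```python
# Back-of-envelope check: how many full-time employees does one 24/7 post need?
HOURS_PER_YEAR = 24 * 365   # 8,760 hours of coverage required
HOURS_PER_FTE = 40 * 50     # assumed ~2,000 working hours/year per full-timer

ftes_per_post = HOURS_PER_YEAR / HOURS_PER_FTE  # minimum bodies per round-the-clock post
posts = 2                                       # one tech post + one security post
total_jobs = posts * 5                          # staffing each post with 5 people

print(ftes_per_post)  # 4.38
print(total_jobs)     # 10
```

So ~4.4 FTEs per post is the theoretical floor, and 5 per post is the practical roster, which is where the "10 jobs total" above comes from.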

Yeah, Northern Virginia trapped themselves with that thinking. They can't stop building datacenters, lest they lose tens of thousands of jobs in the region. And the datacenters know this: they can beat concessions out of politicians for it.

I've heard of the big-cos having to use bona fide robots for manual tasks in a data center, like replacing broken drives or swapping tapes. I think there is still a bunch of manual work to be done.

That said, I have no idea. When I worked (many many many many years ago) in a small DC that was perhaps the size of a 2-bed apartment, we had 4 guys scurrying about doing stuff (hands-on-keyboard, routing cables, replacing hardware, etc). This was way before Docker & Kubernetes et al - physical iron and all that. I would assume that in modern DC ops you could run a football-field-sized DC with fewer than 10 people due to automation. But if part of the actual infrastructure like power or cooling fails, you need to have the right skill-set in place. If the coolers had failed and couldn't just be turned off and on again, we would have been out of luck in my old DC days and would have needed to call someone in and just hope the servers didn't fry in the meantime. Sounds like a similar deal here.

Smaller operations usually need more manpower.

You don't need many on-site staff. 99% of tasks can be performed remotely. The only ones that can't involve physically moving equipment, which doesn't happen that often. And if you're doing a big build out you can bring in extra staff for the couple of weeks that takes.

Over 10k servers here, a couple dozen locations scattered around the globe. One full time operations person.

They go on site to geographically adjacent DCs and outside that just travel onsite for special projects.

Guessing affected customers had to spend time and effort on top of ongoing high cloud bills.

I've slept so much better since I began hosting, producing energy, and cooling on-prem.

The secret is hosting across failure boundaries so that a single outage like this does not impact you. Self-hosting is fine if you can afford the capex for two physically separate data centers (like really separate - like 100+ miles etc (or more!) to cope with natural disasters) and the staff to operate & maintain them 24/7. For many, this is not realistic.

For those that do need to use cloud, just make sure you are running your services in different failure zones.

> For those that do need to use cloud, just make sure you are running your services in different failure zones.

By which time you might as well just roll out your own kit in colocation or your own datacentres.

The cloud providers are nickel-and-dimers; they charge you for every tiny little thing.

Cloud might look cheap at cents-per-hour, but then you find you need X "services" to deliver your Service, so you're really talking about multiplied cents-per-hour (X cloud services times x cents-per-hour each).

And then running your services across failure zones will of course cost you more beyond the basic double-cost, because most cloud providers charge by the GB for cross-zone traffic. So if you're doing cross-zone replication, that's gonna cost you a pretty penny.

Meanwhile, in your own colo/DC, you have predictable costs. And you can get redundant connections between sites for a flat rate, not some stupid per GB fee.
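To illustrate the per-GB point with invented numbers (the rate and the replication volume below are assumptions for illustration, not any provider's actual rate card):

```python
# Rough sketch of how cross-zone fees stack on top of the basic
# "double everything" cost of running in two failure zones.
CROSS_AZ_USD_PER_GB = 0.02    # assumed combined per-GB charge for cross-zone traffic
replicated_gb_per_day = 500   # hypothetical database/object replication volume

monthly_transfer_cost = CROSS_AZ_USD_PER_GB * replicated_gb_per_day * 30
print(f"${monthly_transfer_cost:.2f}/month just for replication traffic")  # $300.00/month
```

That line item exists at all in the cloud model and simply doesn't in a flat-rate cross-connect between your own sites, which is the parent's point.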

> like 100+ miles etc (or more!) to cope with natural disasters

People talk about this often but this failure mode seems to never happen? When was the last time us-east-1 went down because of a natural calamity compared to some technical issue?

Not sure about us-east-1 specifically, but there are frequently fairly large natural disasters in the US - there are always hurricanes and stuff, there was that flooding in New York not so long ago, earthquakes in California in the 90s, wildfires, etc. And this is just in the US. Basically, don't put all your servers in NYC or all in SF; put half in NYC and half in SF, and a random hurricane/wildfire/flood/snowstorm won't take out both of your data centers.

Of course, then you have latency issues to think about, but that is often quite application-specific and potentially a good problem to have, if a slightly slow website or database is the biggest problem you face when the alternative would have been a total shutdown.

There are also occasional fires and stuff that take out a whole building (I think OVH had this in France recently?). Ensure that your failure zones are physically separate places, and not just logically-separate zones in the same physical building, or in a building that is next to the one on fire :)

> but there are frequently fairly large natural disasters in the US - there are always hurricanes and stuff, there was that flooding in New York not so long ago, earthquakes in California in the 90s, wildfires etc.

Right, but what kind of datacenter-related incidents did they cause? Did us-east-1 go down because of Hurricane Sandy? Did us-west-1 go down because of wildfires? I don't remember any datacenter outages caused by wide-area natural disasters, whereas I can remember plenty caused by BGP/DNS/config shenanigans.

> Did us-east-1 go down because of Hurricane Sandy?

Nope, but Sandy did a hell of a lot of damage to some key telecommunications infrastructure. Verizon lost multiple floors worth of equipment, cabling, and related infrastructure that served at least their customers across Manhattan.

Having geographical redundancy for mission critical workloads is a good investment if your business is making money. Networked computing is one of the few places we can actually “run away” from a physical source of problems. (Not forever, or universally, of course).

We’re based on the eastern seaboard. You bet we have failsafes in areas less susceptible to natural disaster.

> Did us-east-1 go down because of Hurricane Sandy?

No, but I was at a company with all the production services in Reston, VA during that storm, and we would have been pretty screwed if Sandy made landfall in the DC area instead of continuing north.

Sandy's flooding in NYC wasn't great for some of the datacenters there, I seem to recall some having trouble, but most were fine.

BGP and DNS are certainly much better at causing disruption, and especially global disruption though.

I remember Hurricane Katrina shutting down lots of online services, and directnic battling to stay online https://www.datacenterknowledge.com/archives/2007/11/05/prov...

Fully agree on this, plus (a very important plus) test that taking down an AZ doesn't bring the services in the good AZ down too. And test this frequently.

I would be very, very surprised if the companies mentioned, in particular banks, weren't running on multiple AZs, but I wouldn't be surprised if the scenario of losing an AZ was never tested.

What about data center colocation? When you simply rent the energy, cooling, etc, but the hardware is yours? Do you think it's a nice middle ground?

> Do you think it's a nice middle ground?

It is.

The cloud fanbois will tell you until they're blue in the face that it's not.

I fully accept that the cloud is great for bursty workloads where you're doing nothing and then suddenly half the planet needs your service for a couple of days. That is clear.

But if you've got a reasonably stable baseload running 24x7x365 and a few modest bursts here and there, then honestly people need to do the math, because if you look beyond the short-term figures, the cloud tends to work out much more expensive than colo over, for example, a three-year period.
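To make the three-year math concrete, here's a toy comparison. Every figure below is invented to show the shape of the calculation, not a real quote from any provider or colo facility:

```python
# Illustrative three-year total cost for a steady 24/7 baseload,
# cloud on-demand vs colo (amortised hardware + monthly fees).
MONTHS = 36

cloud_monthly = 4_000   # assumed monthly on-demand spend for the capacity
colo_capex = 60_000     # assumed up-front hardware purchase, lasting 3+ years
colo_monthly = 1_500    # assumed rack rental, power, bandwidth, remote hands

cloud_total = cloud_monthly * MONTHS
colo_total = colo_capex + colo_monthly * MONTHS

print(cloud_total)  # 144000
print(colo_total)   # 114000
```

With these made-up inputs colo comes out ~20% cheaper over three years, and the gap widens in year four once the hardware is paid off. Bursty workloads flip the result, which is the distinction drawn above.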

Most people don't need the scale the cloud gives. They think they do, but really most people will never grow to FAANG scale, as much as they may dream it!

I believe the real secret reason the cloud is so popular among developers (based on 10+ years of experience) is that cloud providers are so much nicer and faster to deal with than your company IT department.

Also on the price side: I'm not comparing the price of cloud vs colo, but the price of cloud vs what the company IT department charges my department for being allowed to use one of 'their' colo servers, and that is many times what a cloud server costs. (As a real-world example, the place I used to work internally invoiced $150/server/month for a virtual server that would cost me $20/server/month on AWS before any discounts.)

Cloud lives not by competing against smart people running their own servers, but against inefficient internal IT services, and there they have them beat both on price and quality.

I'm the ops guy on a small dev team, and I run a sort of hybrid setup for prod that does involve me working on hardware in a colo sometimes, though fairly rarely (I'd love to spend about half my time hauling servers around and cabling stuff so that I'm not stuck at a desk all day, but that's not the way it is).

The whole point of my job is to enable developers to deliver code that provides customers value. On that level I actually embrace the common "condescensions" (so-to-speak) that I'm tech support for developers or a YAML wrangler.

I actually had an experience recently where a developer asked to make some changes to our infrastructure. I pretty much developed our container orchestration system (based on Docker Swarm rather than Kubernetes - a choice our architect made that I've come to appreciate), so I walked him through how my IaC works, told him what he needed to change and then reviewed his pull request and applied the changes. I guess we're on a devops journey now if I want to put it in corpo-tech speak.

Anyway, I suppose a lot of IT departments/guys get lost in creating their "perfect" unassailable systems and forget that the big picture is that the job is to serve customers: most directly the developers or other internal employees, but ultimately the end customer who's handing you money to solve their problems.

Also the vast array of managed services: managed databases, message queues, infinite storage, data warehouses, caches, etc. Many of which are very complicated to host well yourself and operationalize (failovers, monitoring, backups, etc).

This idea that you can build a DC that competes on cost with rented cloud compute - it might be technically true, but it's mostly missing the point of why modern shops prefer the cloud.

> Many of which are very complicated to host well yourself and operationalize (failovers, monitoring, backups, etc)

Oh you are hilarious.

Time for your daily reminder that failovers, monitoring and backups DO NOT EXIST in the cloud UNLESS (a) you configure and manage them (b) deploy your services in multiple zones (and spend $$$$$ along the way).

Lots of people cannot do (a) properly and it is regularly demonstrated by AWS US-East-1 and others that not many people do (b) fully, or in many cases, don't do it at all.

So yeah, the cloud is still "complicated", it's just a different sort of complicated. And if you do failovers, monitoring and backups properly, the cloud is still "expensive", it's just a different sort of expensive.

Hey, maybe avoid the patronizing crap? I've been involved in running at-scale properties both in the cloud and not for 20 years or so now, so whilst I don't know everything, I do somewhat know what I'm talking about.

Making an RDS Postgres instance multi-AZ with automatic failover, and bulletproof backups to S3, is ticking a couple of boxes. Compared to building all of that yourself at the same level of uptime, it's not complexity of the same magnitude at all. And sure, it will cost you more for the redundant instances, but it's pretty easily worth it - I don't have to pay an ops team to babysit my databases. That's just Postgres - not even getting into things like Aurora, DynamoDB, Kinesis, SQS, Lambda - things that either don't have a self-hosted equivalent at all, or if they do, are way more complicated to run at scale than PG.

In some cases it's trading cloud costs for personnel costs - both opex. But in many others it's having access to services, datastores, etc. that I couldn't otherwise have as a dev.

I see. But the cloud is much more than just VMs. I don't know if I'd want to manage an equivalent to SQS on my own. Maybe I should try it out and see what happens :)

I wonder what a 4th, 5th or 10th person onsite could have done to speed up mitigation and recovery.

I straight-out think MS lied when they said this was an .au-only issue. We had a surge of rogue traffic from MS during this issue, and we are pretty much on the other side of the globe.

It’s Microsoft. I’m kinda surprised it’s not ChatGPT-REPL at this point.
