In my time at larger companies, DC power seems to be one of the weakest links in the reliability chain. Even planned maintenance often goes wrong ("well we started the generator test and the lights went out, that wasn't supposed to happen. Sorry your racks are dead").

Usually the root cause appears simple: a dead fan, a breaker set to the wrong threshold, an alarm that didn't trigger, an incorrect component picked during the design phase, or whatever else gets the blame. To a software guy, these all look like things good processes could mitigate.

Can any electrical engineers elaborate on why power networks fail (in my experience at least) so frequently? I guess failure modes (e.g. lightning strike) are hard to test, but surely an industry this old has techniques. Is it perhaps a cost issue?

It's really incredibly complicated, and difficult to test fully. The bits of Amazon's DC that failed seem like stuff normal testing should catch, but the DC power failures I've dealt with in the past always had some really precise sequence of events that caused some strange failure no one expected.

As an example, Equinix in Chicago failed back in like 2005. Everything went really well, except there was some kind of crossover cable between generators that helped them balance load that failed because of a nick in its insulation. This caused some wonky failure cycle between generators that seemed straight out of Murphy's playbook.

They started doing IR scans of those cables regularly as part of their disaster prep. It's crazy how much power is moving around in these data centers. In a lot of ways they're in thoroughly uncharted territory.

The even crazier thing is big industrial plants that use tens or hundreds of MW and have much lower margins than datacenter companies, so they run with dual grid feeds (HV, sometimes like 132 kV) and no onsite redundancy. As in, when the grids flicker, they lose $20 million of in-progress work.

I'd guess that's because "tens or hundreds of MW" of on-site backup power would be _ludicrously_ expensive to own and maintain, so accepting the risk of both grid feeds flickering at once and trashing the current batch works out cheaper. (Or maybe the power supply glitches are insurable, or there are contract penalty clauses with the power companies?)
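To make that tradeoff concrete, here's a rough sketch of the break-even arithmetic. The $20 million per-event loss is the figure quoted above; the outage frequency and the annual cost of backup are made-up placeholders, not real data, and the function names are my own.

```python
def expected_annual_outage_cost(years_between_events, loss_per_event):
    """Average yearly loss if a batch-killing flicker happens once every N years."""
    return loss_per_event / years_between_events


def backup_pays_off(years_between_events, loss_per_event, backup_annual_cost):
    """True if owning backup power costs less per year than the expected losses."""
    expected = expected_annual_outage_cost(years_between_events, loss_per_event)
    return backup_annual_cost < expected


LOSS_PER_EVENT = 20_000_000      # from the comment above
YEARS_BETWEEN_EVENTS = 10        # hypothetical: one double-feed flicker per decade
BACKUP_ANNUAL_COST = 5_000_000   # hypothetical: amortized capex plus maintenance

print(expected_annual_outage_cost(YEARS_BETWEEN_EVENTS, LOSS_PER_EVENT))
print(backup_pays_off(YEARS_BETWEEN_EVENTS, LOSS_PER_EVENT, BACKUP_ANNUAL_COST))
```

With these placeholder numbers the expected loss is well under the cost of backup, which would match the guess above. Insurance or penalty clauses would push the expected loss down further.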

I assume you mean "Datacenter (conditioned) Power", not literally Direct Current power.

In my experience (in ~30 datacenters worldwide, and reading about more), the actual -48 V Direct Current plant is usually ROCK SOLID in comparison to the AC plant. It's almost always overprovisioned and underutilized, at least in older facilities or those with telcos onsite (who, unlike crappy hosting customers, actually understand power).

My pro tip for datacenter reliability is to try to get as much of your core gear on the DC plant as possible: core routers, and maybe even some of your infrastructure servers like RRs, monitoring, OOB management, etc. Ideally split stuff between DC and AC such that if either goes down, you're still sort of ok, or at least can recover quickly. DC plus AC is even better than dual AC buses, since what starts out as dual AC can easily end up with a single point of failure later (like when they start running out of PDU space, power, or whatever), and dual AC buses are also more likely to share a nearby upstream connection.

DC stuff is WAY simpler to make reliable and redundant, just uses larger amounts of copper and other materials.
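One way to sanity-check the DC/AC split described above is to walk an inventory and ask which roles would have no surviving device if a single plant went down. This is only a sketch: the device names, roles, and the inventory format are hypothetical, and I'm reading "RRs" as route reflectors.

```python
def roles_lost_if_plant_fails(devices):
    """devices: list of (name, role, plant) tuples.

    Returns {plant: sorted list of roles with no surviving device}
    for the failure of each plant on its own.
    """
    plants = {plant for _, _, plant in devices}
    roles = {role for _, role, _ in devices}
    lost = {}
    for failed in sorted(plants):
        surviving = {role for _, role, plant in devices if plant != failed}
        lost[failed] = sorted(roles - surviving)
    return lost


# Hypothetical inventory: routers and RRs are split, OOB and monitoring are not.
inventory = [
    ("core-rtr-1", "core-router", "DC"),
    ("core-rtr-2", "core-router", "AC"),
    ("rr-1", "route-reflector", "DC"),
    ("rr-2", "route-reflector", "AC"),
    ("oob-1", "oob-management", "DC"),
    ("mon-1", "monitoring", "AC"),
]

print(roles_lost_if_plant_fails(inventory))
# {'AC': ['monitoring'], 'DC': ['oob-management']}
```

An empty list for every plant would mean each role survives the loss of either plant. It won't catch a shared upstream feed, which needs the actual electrical one-line diagram.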

Not an EE, but I've observed a few things about electrical infrastructure:

- The work is usually done by outside contractors, working off of specifications that may or may not make sense.

- Some aspects of testing have the potential to be dangerous to the people doing them. (i.e. if a network switch fails during testing, no big deal. If some types of electrical switch fail during testing, the tester is dead.) High voltage electricity is not a toy.

- IT and facilities staff usually don't talk much, and often don't understand each other when they do.

- There's no instrumentation. I get an alert when IT systems aren't configured right. Nothing from the other stuff.

- There is a wide variance in quality of electrical infrastructure that isn't obvious to someone who isn't skilled in that area. IT folks don't need to deal with computers built in 1970. Electricians deal with ancient stuff that may be completely borked all of the time.

Rackspace has a pretty detailed report[1] of their 2009 outages, which is surprisingly similar to the Amazon problems.

[1] http://broadcast.rackspace.com/downloads/pdfs/DFWIncidentRep...

Power failures caused by lightning strikes are relatively easy to test with platforms like RTDS [1] (I am not affiliated with RTDS).

I know that you can test your electrical protection systems in real time against almost every scenario you can imagine (thousands of them), for example: faults in your high voltage utility distribution system, breaker failures, coordination of the protection systems, loss of your backup generator power. I don't know their systems or their philosophies, but I would be interested to know why they don't parallel groups of generators in the backup system, so that when one generator fails, its load is rebalanced across the others (using well-known schemes to avoid cascading failures).

[1] http://www.rtds.com/applications/applications.html
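As a toy illustration of the paralleled-generator idea (not how any particular facility does it), the check below asks whether a group still carries the load after losing any one unit. It assumes the surviving units share load in proportion to their ratings, which is a simplification of my own; the ratings and the 80% loading limit are hypothetical.

```python
def loading_after_failure(load_kw, ratings_kw, failed_index):
    """Fraction of rating each surviving unit carries after one unit fails.

    Assumes load is shared in proportion to rating, so every survivor
    runs at the same fraction. Returns None if nothing is left online.
    """
    online_kw = sum(r for i, r in enumerate(ratings_kw) if i != failed_index)
    if online_kw == 0:
        return None
    return load_kw / online_kw


def survives_any_single_failure(load_kw, ratings_kw, max_loading=1.0):
    """True if losing any one generator keeps the rest at or below max_loading."""
    for i in range(len(ratings_kw)):
        loading = loading_after_failure(load_kw, ratings_kw, i)
        if loading is None or loading > max_loading:
            return False
    return True


ratings = [1000, 1000, 1000]  # hypothetical: three 1 MW units in parallel

print(loading_after_failure(1500, ratings, failed_index=0))            # 0.75
print(survives_any_single_failure(1500, ratings, max_loading=0.8))     # True
print(survives_any_single_failure(2500, ratings))                      # False
```

The last case is the cascade risk: if the survivors end up overloaded, they trip too, which is why real schemes pair paralleling with load shedding.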