Hacker Newsnew | comments | ask | jobs | submitlogin
evilneuro 283 days ago | link | parent

When two of your 2,000 servers die, your load balancers etc kick in and route around the problem.

When two of your two servers die, you ... um, well, you lose money and reputation. Quickly.



vidarh 283 days ago | link

If you buy $1 million servers, a whole lot of things needs to go bad in whole lots of ways that would likely take own large numbers of those 2,000 servers too. I'm not so sure I agree with the notion of going for those big servers myself, but having had mid range servers from a couple of the big-iron vendors in house, here's a few of the things you can expect once you tack a couple of extra digits on the server bill:

- Servers that phone home; sometimes the first you know of a potential problem is engineers at your door come to service your server.

- Hot swappable RAID'ed RAM.

- Hot swappable CPU's, with spares, and OS support for moving threads of CPU's that are showing risk factors for failure.

- Hot swappable storage where not just the disks are hot swappable, but whole disk bays, and even trays of hot swappable RAID controllers etc.

- Redundant fibre channel connections to those raid controllers from the rest of the system.

- Redundant network interfaces and power supplies (of course, even relatively entry level servers offers that these days).

In reality, once you go truly high end, you're talking about multiple racks full of kit that effectively does a lot of the redundancy we tend to try to engineer into software solutions either at the hardware level, or abstracted from you in software layers your application won't normally see (e.g. a typical high end IBM system will set aside a substantial percentage of CPU's as spares and/or for various offload and management purposes; IBM's "classic" "Shark" storage system used two highly redundant AIX servers as "just" storage controllers hidden behind SCSI or Fibre Channel interfaces, for example).

You don't get some server where a single component failure somewhere takes it down. Some of these vendors have decades of designing out single points of failure in their high end equipment.

Some of these systems have enough redundancy that you could probably fire a shotgun into a rack and still have decent odds that the server survives with "just" reduced capacity until your manufacturers engineers show up and asks you awkward questions about what you were up to.

In general you're better off looking at many of those systems as highly integrated clusters rather than individual servers, though some fairly high end systems actually offer "single system image" clustering (that is, your monster of a machine will still look like a single server from the application point of view even in the cases where the hardware looks more like a cluster, though it may have some unusual characteristics such as different access speeds to different parts of memory).

-----




Lists | RSS | Bookmarklet | Guidelines | FAQ | DMCA | News News | Feature Requests | Bugs | Y Combinator | Apply | Library

Search: