
Oh, it wasn't just a handful of servers by the time we finished the migration (we migrated a bit late IMO, so we already had a lot of traffic back then). And today, with a much larger infrastructure, hardware clusters specifically tailored to our customers' needs, etc., I'm pretty sure the same infrastructure on EC2 would cost more than 2x.

(Update) Re: failures - with ~50 servers we see a hardware issue (a dead disk in a RAID array or an ECC memory failure) about once a month or so. So far none of those failures has caused an outage (RAID and ECC RAM FTW).

I ran several dozen Dell blade enclosures fully maxed out - well over 300 server blades - and in 3 years I had two disk failures, neither of which was critical. Hardware is pretty reliable these days.

How do you monitor HW and network failures, and how do you notify SoftLayer? Does that 1-2 hour replacement time hold for every component of your server fleet?

1-2 hours is their provisioning time for new servers. For HW issues we use Nagios (it regularly checks RAID health and ECC memory health), and at the moment we just file a ticket with SL about the issue, showing them the output from our monitoring. They react within an hour, and the HW replacement is usually performed within a few hours after that (usually limited by how quickly we can move our load off the box so they can work on it).
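
For anyone wondering what a RAID health check of that kind can look like, here is a rough sketch of a Nagios-style plugin. It is not the check described above: it assumes Linux software RAID (md) and reads /proc/mdstat, while a hardware RAID controller would need the vendor's own CLI instead. The names check_mdstat and main are made up for the example. The only Nagios convention it relies on is the standard plugin contract: one line of status text plus exit code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```python
import re
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_mdstat(text):
    """Return (exit_code, message) for the contents of /proc/mdstat.

    Assumption: Linux software RAID (md). Hypothetical helper, not the
    check used by the poster above.
    """
    degraded = []
    arrays = 0
    current = None
    for line in text.splitlines():
        # Array header, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            arrays += 1
            continue
        # Status line, e.g. "104320 blocks [2/2] [UU]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(f"{current} [{status.group(3)}]")
            current = None
    if arrays == 0:
        return UNKNOWN, "RAID UNKNOWN - no md arrays found"
    if degraded:
        return CRITICAL, "RAID CRITICAL - degraded: " + ", ".join(degraded)
    return OK, f"RAID OK - {arrays} array(s) healthy"


def main(path="/proc/mdstat"):
    """Read the status file, print one status line, return the exit code."""
    try:
        with open(path) as f:
            code, message = check_mdstat(f.read())
    except OSError as exc:
        code, message = UNKNOWN, f"RAID UNKNOWN - {exc}"
    print(message)
    return code

# To run it as a plugin, end the script with: sys.exit(main())
```

The printed line is the sort of output you could paste straight into a ticket. An ECC check could follow the same pattern by reading the kernel's EDAC error counters, but that is left out here.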
