Ask HN: Best practices for server redundancy?
35 points by joelhaus on Jan 25, 2010 | 31 comments
The server for our e-commerce website went down during a recent holiday and we lost a significant number of transactions.

What redundancy procedures have you implemented to protect against these kinds of issues?

After doing a bit of research on this topic, I'm left with even more questions. DNS propagation would still leave our website down for up to a day or more for some users and we operate on a relatively tight budget (using a $50/mo. VPS).

Hoping someone can shed some light on this and that I'm just ignorant of the obvious solution.

First, if you aren't already, get a hosting company that offers IP-failover/shared IPs. Linode, Slicehost (among many) offer this. See: http://articles.slicehost.com/2008/10/28/ip-failover-high-av...

Basically, you have two VPSs (or servers) and if the first goes down, the second jumps in to take its place on the same IP address so that you don't have to change the DNS or anything.

Second, you can turn your TTLs down for your DNS so that when you make changes they happen faster.

Third, you can have an offsite VPS mirror with a different hosting company ready to roll.

That's usually how I deal with the issue of high availability. Have two boxes with one company with a shared-IP setup so that if one goes down, the second just takes over. Then have an off-site mirror with a different company and have my DNS TTLs set low enough that hopefully most people can get access in a couple hours.

The problem is that it all costs money. Rather than one server, you now have three. Costs have tripled to move you from, say, 99.5% reliability to 99.9%. So, you're paying a lot more for a very marginal improvement and the question that you have to ask is whether that tiny marginal improvement warrants paying triple. I think it's worth it - servers always seem to go down when I'm least available. So, that's my 2 cents.

If the 0.4 percentage point increase earns you more than it costs, then yes, it's worth it.

The other way to look at this: if the cost of bringing the site back up (e.g. the salary of a sysadmin who is always on call) is higher than the cost of the added availability, then again it's worth it.

So I agree with your final conclusion: it is worth it.
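
To put rough numbers on the 99.5% versus 99.9% comparison above, here is a minimal Python sketch. The availability figures and the tripled hosting bill come from the thread; the revenue-per-hour figure is purely hypothetical, and the function names are mine.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def redundancy_pays_off(before_pct: float, after_pct: float,
                        revenue_per_hour: float, extra_cost_per_year: float) -> bool:
    """True if the revenue saved by the extra uptime exceeds the added hosting cost."""
    hours_saved = downtime_hours(before_pct) - downtime_hours(after_pct)
    return hours_saved * revenue_per_hour > extra_cost_per_year


if __name__ == "__main__":
    # Assumption: going from one $50/mo VPS to three adds $100/mo, i.e. $1200/yr.
    # Assumption: the shop earns a hypothetical $50 per hour on average.
    print(downtime_hours(99.5), downtime_hours(99.9))
    print(redundancy_pays_off(99.5, 99.9, revenue_per_hour=50, extra_cost_per_year=1200))
```

At 99.5% you are down roughly 43.8 hours a year, and at 99.9% roughly 8.8 hours, so the extra servers buy back about 35 hours. Whether that is worth $1200 depends entirely on what an hour of sales is worth to you.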

Duly noted. The cost/benefit is clear and this seems like the optimal solution.

Thanks to everyone for your thoughtful responses; your comments cleared quite a few things up.

Your budget does not allow for building a redundant system. The only way to mitigate risk is to set up monitoring and automated backups.

There are a couple of options for making your front-end redundant: round-robin DNS, or a CDN.

In either case, it still leaves you with a single point of failure in the backend. Your best bet there is probably to get a second VPS through a completely different provider -- and figure that you'll never lose both at the same time -- and then set up replication between the two databases, along with a monitoring process to switch the front-end to the backup database in the event that the primary becomes unreachable.
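
The monitoring process described above could look something like the following. This is only a sketch under my own assumptions: the class name, the three-strikes threshold, and the plain TCP probe are all illustrative, and a real setup would also need to handle replication lag and failing back.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class DatabaseFailover:
    """Tracks which database the front-end should use.

    Switches to the backup after `threshold` consecutive failed checks of
    the primary, so a single dropped probe does not trigger a failover.
    It deliberately does not switch back on its own: once the backup has
    taken writes, failing back is a manual decision.
    """

    def __init__(self, primary: str, backup: str, threshold: int = 3):
        self.primary = primary
        self.backup = backup
        self.threshold = threshold
        self.failures = 0
        self.active = primary

    def record_check(self, primary_reachable: bool) -> str:
        """Feed in one health-check result; return the address to use."""
        if primary_reachable:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.active = self.backup
        return self.active
```

You would call `record_check(tcp_reachable(host, port))` on a timer and point the front-end at whatever address it returns.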

Round-robin DNS can play havoc with e-commerce software that was built on dumb assumptions about session handling. The usual symptom is carts vanishing and reappearing semi-randomly. If I'm going cheap, a webserver set up as a reverse proxy can manage a large number of sessions spread out over multiple application servers. It's not as good as dedicated network hardware, but it's better than round-robin DNS.

Ugh, I didn't know that about session handling.

As a tangent: guys, current ecommerce packages really, really suck. I've had a couple of ecommerce jobs recently, and for one client in particular, had to evaluate most of the field of both free and commercial offerings. They're almost uniformly horrible. Is there anybody willing to take a crack at doing it better?

I suggested round-robin DNS over a reverse proxy because reverse proxies (and similar services, like OpenBGPd) still have a single point of failure in the front-end. No matter what you do, it can still go down, and then all that back-end redundancy does squat.

The fix in most cases is having a shared session store (usually memcached) that the front-end instances share. It's not that painful to retrofit if you know the software.
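
Here is a toy illustration of why a shared store fixes the vanishing-cart problem. The `SharedSessionStore` class is an in-memory stand-in I made up for something like memcached; it is not a memcached client, and all names are hypothetical.

```python
import uuid


class SharedSessionStore:
    """In-memory stand-in for an external session store such as memcached."""

    def __init__(self):
        self._data = {}

    def get(self, session_id: str) -> dict:
        return self._data.get(session_id, {})

    def set(self, session_id: str, session: dict) -> None:
        self._data[session_id] = session


class FrontEnd:
    """One web server instance; it keeps no session state of its own."""

    def __init__(self, name: str, store: SharedSessionStore):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> None:
        session = self.store.get(session_id)
        session.setdefault("cart", []).append(item)
        self.store.set(session_id, session)

    def view_cart(self, session_id: str) -> list:
        return self.store.get(session_id).get("cart", [])


if __name__ == "__main__":
    store = SharedSessionStore()
    web1, web2 = FrontEnd("web1", store), FrontEnd("web2", store)
    sid = uuid.uuid4().hex
    web1.add_to_cart(sid, "book")   # first request lands on web1
    print(web2.view_cart(sid))      # next request lands on web2, cart is still there
```

Because neither front-end holds the cart itself, it no longer matters which one round-robin DNS sends the customer to.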

http://spreecommerce.com is the best I could find, I've recently been customizing it for a client.

Thanks for the link! Looks like I finally have an excuse to learn Ruby.

There are DNS failover providers that will host your DNS records and provide monitoring and automatic failover between IPs. This takes propagation time out of the equation.

This unfortunately doesn't resolve problems with bad caching behavior -- not all internet service providers, including the big ones, are observing TTL or sane caching.

I think you're still better off solving the problem without relying on TTL or DNS caching behavior that you don't have control over.

Could you please list a couple or recommend one out of experience? Thank you.

1. Monitoring; if you lost a significant number of sales, it was mostly because you didn't know your site was down.

Solution: Nagios running on a PC in your office. Or use one of the many external monitoring tools that will send you a page if your site is unavailable for more than 90 seconds.

2. Separation of concerns, the server that handles credit card transactions should be a different machine than the one that delivers static media. Databases shouldn't depend on webservers and vice versa.

3. Load balancers don't need to be fancy, you can distribute high impact loads across multiple backend servers using nothing but free software.

4. Eliminating single points of failure is a good idea, but it gets expensive quickly. Try to identify which bottlenecks are actually giving you grief before you charge into building complete multi-tier architectures.
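
For point 1, a home-grown check that pages you after 90 seconds of downtime might be sketched like this. The 90-second figure is from the comment above; the class, the 30-second polling interval, and the `send_page` callback are my own assumptions, and this is no substitute for Nagios or a hosted monitor.

```python
import http.client
import time
import urllib.request


def site_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


class OutageAlarm:
    """Fires once when the site has been down for longer than `grace` seconds."""

    def __init__(self, grace: float = 90.0):
        self.grace = grace
        self.down_since = None
        self.alerted = False

    def update(self, is_up: bool, now: float) -> bool:
        """Record one check; return True only when an alert should be sent."""
        if is_up:
            self.down_since = None
            self.alerted = False
            return False
        if self.down_since is None:
            self.down_since = now
        if not self.alerted and now - self.down_since > self.grace:
            self.alerted = True
            return True
        return False


def watch(url: str, send_page, interval: float = 30.0) -> None:
    """Poll forever; `send_page` is whatever callable delivers the alert."""
    alarm = OutageAlarm()
    while True:
        if alarm.update(site_is_up(url), time.monotonic()):
            send_page(f"{url} has been down for more than {alarm.grace:.0f} seconds")
        time.sleep(interval)
```

Run it from a machine that is not on the same host as the site, otherwise it goes down with the thing it is watching.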

All that aside, if $50 is the limit of your hosting budget you are in trouble. To run a relatively high traffic site you should have at minimum separate web and database servers, and your database server should be on dedicated hardware.

Set up pingdom.com to monitor for us. Will likely be shelling out for some extra servers as well.

Your advice is much appreciated!

This depends a lot on your software design. You have to shed a lot of assumptions before you can even begin to consider redundancy. If your software can only run on one computer, then nothing much matters.

It would be a worthwhile experiment to see what it'd take to get your site running on GAE or perhaps Heroku. In both cases, you are sufficiently constrained such that front-end scalability is trivial, and backend redundancy is managed for you.

Of course, many of us have done this ourselves, but these days I'd rather just get it in a box where I'm doing something fairly small (i.e. could even run on one computer).

The issue with Heroku and/or GAE is that they are single services. They have and will continue to have unplanned downtime. The issue here is mitigating a single service's downtime by having redundancy. Redundancy almost always involves duplication of resources. Essentially, you could switch to GAE or Heroku, but you haven't gained much, and possibly lost, since GAE apps can usually ONLY run on GAE.

Our service needs to be highly available, so our infrastructure reflects this. Basically, we have a hosting provider that has 3 geographically separated datacenters within the same IP subnet. We have a clustered loadbalancer solution, with one loadbalancer in each datacenter, that monitor each other. When one loadbalancer goes down, another one picks up the same IP address of the one that went down, and the whole thing fails over to another datacenter.

Of course, the load balancers have a lot more webservers behind this, but this is what we consider a pretty highly available solution for the front-end.

The problem is in the backend: how do you handle databases going down? Do you make a master/slave setup like the loadbalancers I just described? Or a multi-master that syncs periodically? It all depends upon your requirements. So in the end, there is no best practice, there's just a good practice for your problem.

You could use something like Rackspace cloud sites: http://www.rackspacecloud.com/cloud_hosting_products/sites

They claim to offer automatic scalability and server redundancy out of the box. I have no personal experience with this product though.

My personal experience with Rackspace is pretty horrible. I'd probably recommend running a site off of a mini-ITX server using bandwidth leeched from your neighbor's unsecured wifi connection before I'd recommend Rackspace.

The service may or may not pass a basic PCI scan. You should probably check whether it would pass any required scans before you go. If you're storing credit card data yourself, it's definitely not PCI compliant. OTOH, you're already not PCI compliant in that case.

I have this service but don't use it for anything serious. The primary win is that I don't have to maintain the servers, and the email service is very solid with a good webmail. Downsides are I can't make them switch to PHP 5.3, the filesystem is NFS, and there's no API for the control panel stuff. (Most hosts don't have an API for control panel actions, but it would be nice if I could automate creating sites.)

Rackspace made headlines recently for having multiple outages.

Any hosting company with an outage is going to get headlines these days.

My first startup was a hosting company -- my partner still runs it and I exited a few years ago. Having "been there and done that," I have a lot of respect for anyone that can stay in the hosting business -- when everything is running just fine, your customers are complaining about why things can't be cheaper. As soon as the network blips for even a second, everyone's talking about how crappy the network is, why they pay so much (even on $10/mo accounts), etc.

In other news: it's early, I can't sleep and I might be a little cranky. :)

No worries. FWIW, I've worked for an ISP, and now offer (really cheap) domain mail hosting services too. So, yeah.

That said: my point wasn't so much that Rackspace is a bad hosting company, but rather that the original question was looking for a solution that would offer really close to zero downtime for their customers, and given that, "put it all on Rackspace" isn't a viable answer.

DNS propagation doesn't have to be that long if you set the TTL for your records to be the lowest your nameserver will allow.

Your problem is severely limited by your budget given that at your level, redundancy will at least double your hosting bill. If your site is that important, I think you should spend a bit more on hosting.

DNS rarely takes a day or more to do anything. The only place I have to wait on it is when I change my name servers at my registrar. If you set your TTLs low, the greatest amount of time you're looking at is probably an hour. Another option is dynamic DNS load balancing.

Back when I was working for an ISP, we discovered that sbcglobal was ignoring the TTL for some domains.

90% of the ISPs (just pulling that out of my ass) will respect DNS TTL times. However, the other 10% may not, and these are likely to be the large providers (Comcast cable modem customers, etc.), which will often set a TTL in their caching servers of 24 hours minimum.

On top of that, most browsers also cache DNS information.
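
The reason a low TTL is not a guarantee can be shown in a couple of lines. This is a simplified model of my own, not a description of any particular resolver: it assumes a misbehaving cache simply enforces a floor on the TTL it is given.

```python
def effective_cache_seconds(record_ttl: int, resolver_min_ttl: int = 0) -> int:
    """How long a resolver keeps a record if it enforces its own minimum TTL."""
    return max(record_ttl, resolver_min_ttl)


if __name__ == "__main__":
    five_minutes = 300
    one_day = 24 * 60 * 60
    # A well-behaved resolver honours the 5-minute TTL.
    print(effective_cache_seconds(five_minutes))
    # A resolver with a 24-hour floor ignores it.
    print(effective_cache_seconds(five_minutes, resolver_min_ttl=one_day))
```

So for customers behind such a resolver, your failover time is set by their floor, not by your TTL.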

The best way, IMO, to get better than average availability on a budget is to put 2 haproxy servers at a colo provider that offers some geographic diversity, but can route IP traffic seamlessly between the two sites. These kinds of places will often be BYOS, and charge by rack space used. Some bill in 1U increments, some you may have to get a 1/3 or 1/2 cabinet minimum.

Use the IP of the haproxy server as your "public IP", then configure haproxy to direct traffic to 1 or more budget hosting providers.

haproxy will generally handle all the session management for something like an ecommerce site that requires a user's connection to go back to the same server each time.
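
To illustrate the idea of session stickiness (not how haproxy implements it internally, which I am not claiming to reproduce), here is a small Python sketch. New sessions are handed out round-robin, returning sessions go back to the same backend, and a session is reassigned only if its backend is marked down. All names are hypothetical.

```python
import itertools


class StickyBalancer:
    """Round-robin for new sessions, sticky for returning ones."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)
        self._assigned = {}  # session id -> backend

    def mark_down(self, backend: str) -> None:
        self.healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self.backends:
            self.healthy.add(backend)

    def route(self, session_id: str) -> str:
        """Return the backend that should serve this session."""
        backend = self._assigned.get(session_id)
        if backend in self.healthy:
            return backend
        if not self.healthy:
            raise RuntimeError("no healthy backends")
        # Walk the rotation until a healthy backend turns up; this ends
        # because at least one backend in the cycle is healthy.
        for candidate in self._cycle:
            if candidate in self.healthy:
                self._assigned[session_id] = candidate
                return candidate
```

Note that a customer who gets reassigned after a backend dies still loses their cart unless the session data lives somewhere both backends can reach.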

It's been a couple of years since I was heavily involved in this, but you're looking at about $300/mo minimum in colo/hosting fees to get into something that can offer some resiliency against the common outage.

If a couple of hours of lost business have a value of more than $500, then you probably need to upgrade your infrastructure.

If they run an API, some hosts (Dreamhost) ignore the TTL for ages. I discovered this when one of my API users emailed me after a DNS change asking what happened. In the end my user had to edit their hosts file to over-ride Dreamhost's obscene DNS cache.

Have you separated out your database? How much traffic can it handle?

Upgrade your hosting budget.

More machines add redundancy, which is what the OP is asking about.
