

Ask HN: Best practices for server redundancy? - joelhaus

The server for our e-commerce website went down during a recent holiday and we lost a significant number of transactions.

What redundancy procedures have you implemented to protect against these kinds of issues?

After doing a bit of research on this topic, I'm left with even more questions. DNS propagation would still leave our website down for up to a day or more for some users, and we operate on a relatively tight budget (using a $50/mo. VPS).

Hoping someone can shed some light on this and that I'm just ignorant of the obvious solution.
======
mdasen
First, if you aren't already, get a hosting company that offers IP-
failover/shared IPs. Linode and Slicehost (among many) offer this. See:
http://articles.slicehost.com/2008/10/28/ip-failover-high-availability-explained

Basically, you have two VPSs (or servers) and if the first goes down, the
second jumps in to take its place on the same IP address so that you don't
have to change the DNS or anything.
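If your provider gives you a floating IP but leaves the failover logic to you, one common free-software way to implement the same idea is VRRP via keepalived. This is only a sketch under that assumption -- the interface name, router ID, password, and floating IP below are placeholders, not values from the Slicehost article:

```
# /etc/keepalived/keepalived.conf on the primary VPS (sketch)
vrrp_instance VI_1 {
    state MASTER            # the backup box uses "state BACKUP"
    interface eth0          # placeholder interface name
    virtual_router_id 51
    priority 100            # backup uses a lower priority, e.g. 50
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cret    # placeholder
    }
    virtual_ipaddress {
        203.0.113.10        # the shared IP your DNS points at
    }
}
```

When the master stops sending VRRP advertisements, the backup promotes itself and claims 203.0.113.10, so clients never see a DNS change.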

Second, you can turn your TTLs down for your DNS so that when you make changes
they happen faster.
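For example, a BIND-style zone with a short TTL might look like the sketch below (the domain, IPs, and the 300-second value are placeholders, not recommendations from the thread):

```
; zone file sketch -- names and addresses are placeholders
$TTL 300                ; default TTL of 5 minutes instead of hours
@    IN  SOA ns1.example.com. admin.example.com. (
         2010010101 ; serial
         3600       ; refresh
         600        ; retry
         604800     ; expire
         300 )      ; negative-caching TTL
www  IN  A   203.0.113.10   ; swap to the backup's IP on failover
```

With a 300-second TTL, well-behaved resolvers should pick up a changed A record within about five minutes (though, as noted elsewhere in this thread, not every ISP honors TTLs).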

Third, you can have an offsite VPS mirror with a different hosting company
ready to roll.

That's usually how I deal with the issue of high availability. Have two boxes
with one company with a shared-IP setup so that if one goes down, the second
just takes over. Then have an off-site mirror with a different company and
have my DNS TTLs set low enough that hopefully most people can get access in a
couple hours.

The problem is that it all costs money. Rather than one server, you now have
three. Costs have tripled to move you from, say, 99.5% reliability to 99.9%.
So, you're paying a lot more for a very marginal improvement and the question
that you have to ask is whether that tiny marginal improvement warrants paying
triple. I think it's worth it - servers always seem to go down when I'm least
available. So, that's my 2 cents.

~~~
pierrefar
If the 0.4-percentage-point increase in availability will earn you more
than it costs, then yes, it's worth it.

The other way to look at this: if the cost of bringing the site back up
by hand (e.g. the salary of a sysadmin who is always on call) is higher
than the cost of the extra availability, then again it's worth it.

So I agree with your final conclusion: it is worth it.

------
Mc_Big_G
Your budget does not allow for building a redundant system. The only way to
mitigate risk is to set up monitoring and automated backups.

------
thaumaturgy
There are a couple of options for making your front-end redundant: round-robin
DNS, or a CDN.

In either case, it still leaves you with a single point of failure in the
backend. Your best bet there is probably to get a second VPS through a
completely different provider -- and figure that you'll never lose both at the
same time -- and then set up replication between the two databases, along with
a monitoring process to switch the front-end to the backup database in the
event that the primary becomes unreachable.
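A minimal sketch of that monitoring/switch-over logic in Python, assuming a plain TCP reachability check and made-up DSN strings (a real setup would also need database replication and fencing to avoid split-brain, which this does not address):

```python
import socket

def db_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_dsn(primary_up, backup_up,
               primary_dsn="db://primary", backup_dsn="db://backup"):
    """Pick which database the front-end should point at."""
    if primary_up:
        return primary_dsn
    if backup_up:
        return backup_dsn
    raise RuntimeError("both databases unreachable")
```

The front-end would run `choose_dsn(db_reachable(...), db_reachable(...))` on each health-check cycle and repoint its connection pool whenever the answer changes.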

~~~
andrewbadera
There are DNS failover providers that will host your DNS records and provide
monitoring and automatic failover between IPs. This takes propagation time out
of the equation.

~~~
thaumaturgy
This unfortunately doesn't resolve problems with bad caching behavior --
not all internet service providers observe TTLs or cache sanely,
including some of the big ones.

I think you're still better off solving the problem without relying on TTL or
DNS caching behavior that you don't have control over.

------
olefoo
1. Monitoring: if you lost a significant number of sales, it was most
likely because you didn't know your site was down.

Solution: Nagios running on a PC in your office. Or use one of the many
external monitoring tools that will send you a page if your site is
unavailable for more than 90 seconds.
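As a sketch of such a check (the function names and the 3-failure threshold are my own choices, not from any particular monitoring product), something like this could run from cron every minute:

```python
import urllib.request

def check_site(url, timeout=10):
    """Return (ok, detail); ok is False on any HTTP error or timeout.
    Note urllib raises HTTPError for 4xx/5xx, caught by the except."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return (200 <= resp.status < 400, "HTTP %d" % resp.status)
    except Exception as exc:
        return (False, str(exc))

def should_page(ok, consecutive_failures, threshold=3):
    """Only page after several failures in a row, so a single
    dropped packet doesn't wake anyone up (3 is a guessed default)."""
    return (not ok) and consecutive_failures >= threshold
```

Crucially, this has to run somewhere other than the server it watches, or the monitor dies along with the site.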

2. Separation of concerns: the server that handles credit card
transactions should be a different machine from the one that delivers
static media. Databases shouldn't depend on webservers, and vice versa.

3. Load balancers don't need to be fancy; you can distribute high-impact
loads across multiple backend servers using nothing but free software.
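One free-software way to do this is nginx as a reverse proxy in front of the app servers -- a sketch, with placeholder IPs:

```
# nginx load-balancing sketch (addresses are placeholders)
upstream backend {
    server 10.0.0.11;          # app server 1
    server 10.0.0.12;          # app server 2
    server 10.0.0.13 backup;   # only used if the others are down
}
server {
    listen 80;
    location / {
        proxy_pass http://backend;
    }
}
```

By default nginx round-robins across the non-backup servers and skips any that stop responding.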

4. Eliminating single points of failure is a good idea, but it gets
expensive quickly. Try to identify which bottlenecks are actually giving
you grief before you charge into building a complete multi-tier
architecture.

All that aside, if $50 is the limit of your hosting budget, you are in
trouble. To run a relatively high-traffic site you should have, at
minimum, separate web and database servers, and your database server
should be on dedicated hardware.

~~~
joelhaus
Set up pingdom.com to monitor for us. Will likely be shelling out for some
extra servers as well.

Your advice is much appreciated!

------
dlsspy
This depends a _lot_ on your software design. You have to shed a lot of
assumptions before you can even begin to consider redundancy. If your software
can only run on one computer, then nothing much matters.

It would be a worthwhile experiment to see what it'd take to get your site
running on GAE or perhaps heroku. In both cases, you are sufficiently
constrained such that front-end scalability is trivial, and backend redundancy
is managed for you.

Of course, many of us have done this ourselves, but these days I'd rather just
get it in a box where I'm doing something fairly small (i.e. could even run on
one computer).

~~~
oomkiller
The issue with Heroku and GAE is that each is a single service. They
have had, and will continue to have, unplanned downtime. The issue here
is mitigating a single service's downtime by having redundancy, and
redundancy almost always involves duplication of resources.
Essentially, you could switch to GAE or Heroku, but you wouldn't have
gained much, and you'd possibly have lost something, since GAE apps can
usually ONLY run on GAE.

------
stingraycharles
Our service needs to be highly available, so our infrastructure reflects this.
Basically, we have a hosting provider that has 3 geographically separated
datacenters within the same IP subnet. We have a clustered loadbalancer
solution, with one loadbalancer in each datacenter; they monitor each
other. When one loadbalancer goes down, another one takes over its IP
address, and the whole thing fails over to another datacenter.

Of course, the load balancers have a lot more webservers behind them,
but this is what we consider a pretty highly available solution for the
front-end.

The problem is in the backend: how do you handle databases going down?
Do you make a master/slave setup like the loadbalancers I just
described? Or a multi-master setup that syncs periodically? It all
depends on your requirements. So in the end, there is no best practice,
there's just a good practice for your problem.

------
ArtemD
You could use something like Rackspace cloud sites:
<http://www.rackspacecloud.com/cloud_hosting_products/sites>

They claim to offer automatic scalability and server redundancy out of
the box. I have no personal experience with this product, though.

~~~
thaumaturgy
Rackspace made headlines recently for having multiple outages.

~~~
paulsingh
_Any_ hosting company with an outage is going to get headlines these days.

My first startup was a hosting company -- my partner still runs it and I
exited a few years ago. Having "been there and done that," I have a lot of
respect for anyone that can stay in the hosting business -- when everything is
running just fine, your customers are complaining about why things can't be
cheaper. As soon as the network blips for even a second, everyone's talking
about how crappy the network is, why they pay so much (even on $10/mo
accounts), etc.

In other news: it's early, I can't sleep and I might be a little cranky. :)

~~~
thaumaturgy
No worries. FWIW, I've worked for an ISP, and now offer (really cheap) domain
mail hosting services too. So, yeah.

That said: my point wasn't so much that Rackspace is a bad hosting company,
but rather that the original question was looking for a solution that would
offer really close to zero downtime for their customers, and given that, "put
it all on Rackspace" isn't a viable answer.

------
bozmac
DNS propagation doesn't have to take that long if you set the TTL for
your records to the lowest your nameserver will allow.

Your options are severely limited by your budget, given that at your
level redundancy will at least double your hosting bill. If your site
is that important, I think you should spend a bit more on hosting.

------
oomkiller
DNS rarely takes a day or more to do anything. The only place I have to
wait on it is when I change my name servers at my registrar. If you set
your TTLs low, the longest you're looking at is probably an hour.
Another option is dynamic DNS load balancing.

~~~
thaumaturgy
Back when I was working for an ISP, we discovered that sbcglobal was ignoring
the TTL for some domains.

------
sfall
Have you separated out your database? How much traffic can it handle?

Upgrade your hosting budget.

~~~
akronim
Adding machines without duplicating them _reduces_ redundancy -- which
is what the OP is asking about.

