
Ask HN: Most cost-effective method of achieving physical redundancy - dabeeeenster
Hi,

We have a client currently hosted on a single server in a data centre in the UK. They are asking about getting a level of physical redundancy built into the hosting infrastructure. If a plane lands on the data centre, they want to be able to continue operating their website. They are not overly concerned about the service dropping out for a minute or two, but they want to avoid the hours spent offline whilst we restored from backups to another data centre and redirected the DNS.

I've done some investigation and found two relatively cheap solutions:

1. Get an additional failover server hosted in another data centre. Operate some level of rsync and MySQL replication to the failover server. Then run a service like http://dynect.com/ to do a rudimentary form of failover via DNS. This is not a 100% solution for everyone (mainly people behind badly configured, long-TTL DNS servers). The rsync/MySQL replication is also open to potential problems. This seems like one of the cheapest solutions.

2. Use a provider with a SAN-based virtualisation setup with a backup data centre. I know Hostway offer this, but it is not cheap! In the event of a failover we just fire up our VMware instance in the failover data centre and pick up on the shared SAN.

Neither of these is ideal: 1 is a bit too flaky for my liking, and 2 is a bit too expensive for my client. Is there a neater solution that I am missing?

The application is Java/Struts2/Spring/Hibernate running on Tomcat/Apache/MySQL on Linux.

Thanks!
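
PS - to make the sync half of option 1 concrete, the file side could be as simple as a cron'd rsync (paths and hostname here are placeholders):

```
# crontab sketch on the primary: push the webapp and static content
# to the failover box every 5 minutes (hostname/paths are placeholders)
*/5 * * * * rsync -az --delete /var/www/ deploy@failover.example.com:/var/www/
```

The MySQL side would be standard master-slave replication rather than rsync, since copying live InnoDB files around is not safe.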
======
jwilliams
Interesting question :)

1. The most cost-effective approach in terms of immediate capital outlay is
active-passive - which is pretty much the #1 you described. This is generally
also less effort... and is easy to test.

2. The most cost-effective long-term strategy is active-active, as you get
ongoing use of both sites - particularly if the client is happy with reduced
performance in a DR scenario... Even if this isn't the case, at least you get
a performance boost from the DR site.

This is usually more effort as you need things like replication to
work/perform... And you also need to test a whole new range of failure
scenarios.

This approach does have some other advantages though. For example, you can do
rolling deploys very easily (upgrade DR first, bring down Prod, upgrade Prod).

3. A common hybrid is to have two active sites on the Tomcat/Apache end and
active-passive for the MySQL. Depending on the DB load, this can be a
best-of-both-worlds scenario.

4. Some other solutions use a coherent cache - e.g. Tangosol, which works with
Hibernate/MySQL. As long as the latency between the two sites is low, this
should work. Tangosol is a commercial product, now owned by Oracle AFAIK, but
I don't think it's prohibitively expensive... I've seen this used a lot, and
it's probably a very simple and elegant solution, but I personally don't like
adding another moving part. A lot of people I've met swear by it though,
particularly Hibernate users.
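
For the active-passive MySQL leg in #1 or #3, the standard master-slave
replication setup is roughly this (server IDs and names here are placeholders):

```
# my.cnf on the active (master) site: give it an ID and enable the binlog
[mysqld]
server-id = 1
log-bin   = mysql-bin

# my.cnf on the passive (slave) site: distinct ID, refuse normal writes
[mysqld]
server-id = 2
read-only = 1
```

On the slave you then run `CHANGE MASTER TO MASTER_HOST='...', MASTER_USER='repl', ...`
followed by `START SLAVE`, and keep an eye on `Seconds_Behind_Master` in
`SHOW SLAVE STATUS` so you know how stale a failover would be.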

~~~
dabeeeenster
This response is exactly why I posted on HN! Love this site.

In terms of active-active, I guess I need two physically disparate load
balancers that heartbeat with each other? I'm not really a network engineer,
and I don't understand what happens to the packets when one data centre goes
down - how does traffic get routed around it?

Thanks for your time.

~~~
jwilliams
> In terms of active-active, I guess I need two physically disparate load
> balancers that heartbeat with each other?

Yeah - as brk indicates in his post, you can use something like haproxy to do
this. AFAIK you can get Apache to do this too...

One advantage might be that Apache can be configured to do "sticky sessions" -
which you'll need if you're not syncing sessions between sites (which in this
scenario doesn't seem necessary). I've never used haproxy, but it might do
this too.

In terms of your active-active setup, you can have two "legs" that are load
balanced at the front, or you can have load balancing between each tier. If
you're (a) not very experienced or (b) you don't have lots of bandwidth
between sites, I'd go with the former and just have load balancing at the
front end.
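
To make that concrete, a minimal haproxy front end with cookie-based sticky
sessions might look like this (names and addresses are placeholders - treat it
as a sketch, not a battle-tested config):

```
# haproxy.cfg sketch: one balancer in front of two active sites,
# with a cookie pinning each client to one site
listen webfarm
    bind *:80
    balance roundrobin
    option redispatch                     # re-route pinned clients if their site dies
    cookie SITE insert indirect nocache   # sticky sessions via a cookie
    server site-a 198.51.100.10:8080 cookie a check
    server site-b 203.0.113.10:8080 cookie b check
```

With `option redispatch`, clients pinned to a failed site get moved to the
surviving one instead of erroring out.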

------
brk
Get your own ASN ;)

You are on the right path with your approach #1. Set up a failover server
someplace and keep it synced at an interval that strikes the right balance
between too-often and too-stale.

You could opt to get a small dedicated machine from a larger-scale, true
multi-homed hosting company. Run haproxy on this "small" machine, which serves
primarily to redirect traffic to whichever server is currently deemed to be
the "active" one. Basically this is a low-end DIY Akamai kind of solution.
Your cost would be somewhere between #1 and #2. You would eliminate the DNS
lag and the issues you can't control (other servers and clients caching stale
DNS data that you can't refresh), and you would get most of the benefits of an
availability service like Akamai without the $15K USD monthly price tag.

There are other benefits to the haproxy solution as well: you can take either
machine offline for maintenance with zero (theoretical ;) ) downtime, and if
you ever get a traffic surge you can load balance between the two sites (which
isn't a bad idea to do all the time if your syncing is up to date enough -
send every 10th connection to the failover site, just to make sure it's
always working as expected).
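
The every-10th-connection idea is just a weight ratio in haproxy (addresses
here are placeholders):

```
# haproxy.cfg sketch: ~9 of every 10 connections go to the primary,
# ~1 of 10 to the failover so it is continuously exercised
listen webfarm
    bind *:80
    balance roundrobin
    server primary  198.51.100.10:80 weight 9 check
    server failover 203.0.113.10:80 weight 1 check
```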

~~~
dabeeeenster
What happens if the 'small' server fails?

~~~
brk
Depends on how risk-averse you are. You can generally get an SLA that would
allow you to have a hot backup of that server, or you take the risk that it
won't fail until you can afford the "better" solution.

Doing "high availability" coupled with "low budget" is ALWAYS going to involve
some compromises.

If you keep your DNS TTLs low, the worst-case scenario if the small server
fails is that you're doing exactly what you would have done anyway: making
manual or automated DNS changes.
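
The low-TTL part is just the zone records - e.g. a BIND-style sketch (name
and address are placeholders):

```
; 60-second TTL: a failover change to this record propagates within
; about a minute, for resolvers that actually honor TTLs
www.example.com.  60  IN  A  198.51.100.10
```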

~~~
mcargian
When I investigated a low-cost solution like this before, the issue was big
companies like AOL not honoring TTLs of less than 24 hours. Is this still the
case?

------
lsc
1. is the cheap way. As someone else said, you can use BGP instead of DNS for
failover to reduce downtime, but then it's no longer cheap.

Personally, this is my preferred solution. Keep as much as possible in the
MySQL cluster and have the rest built on a central dev server and pushed out
when there are changes.

A SAN is not cheap, and it's pretty easy to screw up the whole SAN. It's still
a single point of failure. (I ran prgmr.com on a SAN for the first few years,
and I switched away from SAN not because of cost, but because of reliability.
With a SAN, it's really easy for the new guy to accidentally trash
everything.)

So yeah, I'd do #1. If you can run an active-active VPS setup, that's best; if
you want to save money, run a smaller active-active site with EC2 images
standing by (remember to test weekly) - the problem with 'cold' backups is
that they are usually broken. Active-active means they are up all the time.

------
smoody
A quick question: when replicating MySQL across data centers, is there an easy
or perhaps built-in way to encrypt the replication data streams to prevent
prying eyes from potentially intercepting the data? If not, might that be a
problem if one is replicating credit card numbers, etc.?

~~~
nickh
In the past, I've used stunnel for encrypting server-to-server communications.
It's very easy to set up.
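
For MySQL replication specifically, the stunnel pairing might look roughly
like this (hosts and ports are placeholders): the slave talks to a local port,
and stunnel carries the traffic to the master over TLS.

```
; stunnel.conf on the slave side: replication traffic enters here
client = yes
[mysql-repl]
accept  = 127.0.0.1:3307
connect = master.example.com:3307

; stunnel.conf on the master side: decrypt and hand off to mysqld
cert = /etc/stunnel/stunnel.pem
[mysql-repl]
accept  = 3307
connect = 127.0.0.1:3306
```

Then point the slave at its own tunnel endpoint:
`CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3307, ...`.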

------
vaksel
Why not just auto-copy everything to AWS and run uptime checks against your
host? Then if the check fails, auto-forward the URL to the AWS location.

This way, if it hits the fan, the only problem the users will notice is the ~5
minutes of downtime between checks, before the site comes back up (slower, due
to running on AWS).

But then again, I don't really know hardware, so I could be wrong. From what I
understand it should be possible.
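
That check-then-switch loop is only a few lines of shell - here is a sketch
(the URL and IPs are hypothetical, and the actual DNS update is whatever API
your DNS provider exposes):

```shell
#!/bin/sh
# Poll the primary site and decide which IP the DNS record should point at.
PRIMARY_URL="http://www.example.com/health"   # hypothetical health-check URL
PRIMARY_IP="198.51.100.10"                    # placeholder primary address
STANDBY_IP="203.0.113.10"                     # placeholder AWS standby address

# True (exit 0) if the URL answers within 5 seconds with a 2xx status
is_up() {
    curl -sf --max-time 5 "$1" >/dev/null
}

# Pure decision: pass "up" or "down", get back the IP to publish
target_ip() {
    if [ "$1" = "up" ]; then echo "$PRIMARY_IP"; else echo "$STANDBY_IP"; fi
}

# From cron every 5 minutes you would do something like:
#   if is_up "$PRIMARY_URL"; then state=up; else state=down; fi
#   update_dns "$(target_ip "$state")"   # update_dns = provider-specific call
```

The `update_dns` step is the provider-specific part; everything else is stock
curl and cron.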

