
Why DNS-based Global Server Load Balancing does not work [2004] - IgorPartola
http://www.tenereillo.com/GSLBPageOfShame.htm
======
michaelcampbell
If memory serves, Netflix users discovered that Akamai was using this
strategy, and everyone coming in from Google's DNS or OpenDNS was getting put
into the same 'pool', overloading the pipes in that pool. The recommendation
was to use your ISP's DNS, at least for your Netflix devices.

Or have I misunderstood?

(Post should probably have had the date listed; just realized this article is
7 years old. That doesn't make it wrong or bad, mind you...)

~~~
davidu
This isn't true for Google. Akamai has all kinds of archaic network issues
that cause problems, but they aren't often OpenDNS-related. They certainly do
their best, and they know our network map well, as we send it to them whenever
it changes.

For Google -- if you use OpenDNS, you will always go to the Google datacenter
most appropriate for you (not for OpenDNS).

~~~
dsl
THIS. Akamai is using technology designed in 1998. I wish they would work with
OpenDNS to fix it.

~~~
foobarbazetc
Akamai's not doing anything wrong.

OpenDNS is a broken concept.

~~~
michaelcampbell
How so? DNS was never designed to be used as a geolocation vector, as far as I
have ever heard, so relying on it to be one does seem "wrong".

~~~
IgorPartola
OpenDNS defaults to resolving non-existent domains to search results, breaking
DNS. You can turn it off if you have a static IP.
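
A minimal sketch of checking for that behaviour, in Python: ask the configured
resolver for a name that should not exist. A standards-following resolver
returns NXDOMAIN (a lookup error); a hijacking one hands back the address of
its search page instead. The test name here is made up on the spot.

    import socket
    import uuid

    # A label that almost certainly does not exist anywhere.
    name = "nx-" + uuid.uuid4().hex + ".example.com"

    try:
        addr = socket.gethostbyname(name)
        print("NXDOMAIN was hijacked:", name, "->", addr)
    except socket.gaierror:
        print("NXDOMAIN returned as expected for", name)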

~~~
michaelcampbell
I see. That doesn't seem to me to be a "broken concept", however, but a rather
minor misfeature. I may have misread the poster to which I was responding, but
his assertion reads to me as saying that OpenDNS as a whole is fundamentally
wrong or broken, and that Akamai using DNS for something it was never intended
to do is "correct". Neither of those statements rings true.

------
aeden
It's worth pointing out that this article is a bit long in the tooth. Although
the responses are perhaps biased, it's worth looking at some other thoughts on
the subject:

[http://devcentral.f5.com/weblogs/macvittie/archive/2011/03/2...](http://devcentral.f5.com/weblogs/macvittie/archive/2011/03/21/the-skeleton-in-the-global-server-load-balancing-closet.aspx)

[http://dev.robertmao.com/2007/06/30/global-dns-load-balancin...](http://dev.robertmao.com/2007/06/30/global-dns-load-balancing-for-free/) (towards the end, and yes, the grammar is bad)

[http://blogs.sun.com/davew/entry/thoughts_on_global_server_l...](http://blogs.sun.com/davew/entry/thoughts_on_global_server_load)

------
xtacy
The main reason seems to be clients caching DNS lookups for longer than they
should, as indicated by the TTL field in the DNS response. If this behaviour
changed across all browsers, wouldn't that solve the problem?

Another "hack" to solve the caching problem would be to have multiple random
lookup records (say server-$random.hostname.com) all result in multiple
lookups that cannot be cached? The tradeoff here is latency vs availability.
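
A sketch of that hack, assuming a wildcard DNS record (*.hostname.com) exists
on the server side so that every generated name resolves to the load balancer;
the domain is a placeholder:

    import random

    def fresh_hostname(domain="hostname.com"):
        # A name no resolver or browser cache has seen before, so every
        # request pays a full, uncached DNS lookup (the latency cost).
        return "server-%08x.%s" % (random.getrandbits(32), domain)

    print(fresh_hostname())  # e.g. server-3f2a9c1e.hostname.com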

As mentioned in the article, triangulation and backup redirection would work
as long as "Site A" stays up and serving the requests labeled (1).

------
timdorr
What about using Anycast IPs? <http://en.wikipedia.org/wiki/Anycast>

~~~
jemfinch
Routing changes break the network connection. If the connection is buffered
and you can reconnect transparently, it might work, but for unbuffered
connections (as most web connections are) you'll see increased error rates.

~~~
davidu
This has been shown not to happen often in the real world.

[http://www.nanog.org/meetings/nanog37/presentations/matt.lev...](http://www.nanog.org/meetings/nanog37/presentations/matt.levine.pdf)

~~~
jemfinch
My experience indicates otherwise.

~~~
davidu
Who are you? We thought we knew most folks who run TCP anycast... Anyway, I've
run multi-day TCP streams without issue. Like any sort of anycast, there is a
lot of finesse involved, but if you are logical in your network architecture
you will have no issues. In fact, you can re-establish TCP connections across
datacenters if desired, using tools like pfsync.

------
IgorPartola
I think one solution to this would be to add a DNS record that says how long
the connection timeout should be for a given port on this IP before trying the
next one. As it is, there is no control over that, and each browser implements
it differently.

~~~
davidu
SRV records could do this, but most browsers can't do their own DNS. Only
Chrome can.
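
A sketch of what an SRV-aware client could do, using the dnspython library
(pip install dnspython); the service name is hypothetical, since no mainstream
browser actually queries SRV for HTTP:

    import dns.resolver

    answers = dns.resolver.resolve("_http._tcp.example.com", "SRV")

    # SRV records carry priority, weight, port, and target: lower priority
    # is tried first, so failover order (and, with extra record types,
    # per-target timeout policy) could live in DNS rather than being
    # hardcoded differently in every browser.
    for rr in sorted(answers, key=lambda r: (r.priority, -r.weight)):
        print(rr.priority, rr.weight, rr.port, rr.target)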

~~~
justincormack
What do you mean by this? Chrome doesn't support SRV for HTTP
<http://code.google.com./p/chromium/issues/detail?id=22423> but all browsers
do DNS... No mainstream ones support SRV for HTTP.

~~~
davidu
Chrome resolves its own DNS. Therefore it can support any DNS record type
(even if it doesn't have hooks for one today).

No other browser does. All other browsers rely on the system's stub resolver,
which means you can't even open a bug for them, because fixing the bug would
require the browser to speak DNS itself first.

------
yesbabyyes
Since the Amazon outage, I've been wondering about the best way to quickly
switch to another host. Having multiple A records is so obvious I can't
believe I had neither thought about it nor heard of it before (I've seen it
implemented, but never thought of it as a failover mechanism). I didn't know
the browser would switch to the next IP. Interesting!

That said, can you control which host the browser will try first? I.e. can I
have 1.1.1.1 as my main host and trust that all clients will connect to it as
long as it's up, only falling back to my backup 2.2.2.2 when 1.1.1.1 doesn't
respond?
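
A sketch of that failover behaviour in Python: resolve every address for the
host and try them in the order the resolver returned them, moving on after a
connect failure, which is roughly what browsers do. The host is a placeholder;
note that many resolvers rotate (round-robin) the A records they return, so
you generally cannot count on 1.1.1.1 being handed out first.

    import socket

    def connect_with_failover(host, port=80, timeout=3.0):
        # getaddrinfo returns every A (and AAAA) record for the name.
        for family, type_, proto, _, sockaddr in socket.getaddrinfo(
                host, port, proto=socket.IPPROTO_TCP):
            try:
                return socket.create_connection(sockaddr[:2], timeout=timeout)
            except OSError:
                continue  # this address failed; fall through to the next
        raise OSError("no reachable address for " + host)

    sock = connect_with_failover("example.com")
    print("connected to", sock.getpeername())
    sock.close()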

~~~
prakash
_Since the Amazon outage, I've been wondering about the best way to quickly
switch to another host._

This is exactly the problem we (Cedexis) solve.

Assume your site is www.website.com, which in general points to an A record or
a CNAME for a datacenter/cloud/CDN.

We add an intermediate hostname with a low TTL (20 seconds) on a global
anycast network, which can be scripted (write your load-balancing logic in
PHP) to hand out a CNAME (one of many) based on performance (RTT), load, cost,
or anything else you can think of.
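
A sketch (in Python rather than PHP) of the kind of decision such a scripted
layer makes, handing out whichever CNAME currently measures fastest; the
target names and RTT figures are hypothetical:

    def pick_cname(rtt_ms):
        # rtt_ms maps provider -> recent round-trip time in milliseconds;
        # a missing measurement is treated as unreachable.
        targets = {
            "site-a.cdn.example.net.": rtt_ms.get("cdn", float("inf")),
            "site-b.cloud.example.net.": rtt_ms.get("cloud", float("inf")),
        }
        # Load, cost, or error rates could be folded into the score here.
        return min(targets, key=targets.get)

    print(pick_cname({"cdn": 38.0, "cloud": 52.0}))  # -> site-a.cdn...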

Re.: AWS's recent outage, assuming you were running your apps in multiple
zones/regions/clouds, we would have noticed the latency and automatically
routed traffic away to a different zone/region/cloud.

We collect hundreds of millions of performance measurements daily:
[http://gigaom.com/cloud/heres-what-amazon-outage-looked-like...](http://gigaom.com/cloud/heres-what-amazon-outage-looked-like/)

Drop me an email and I am happy to explain more and set up folks from HN with
a free account: prakash at cedexis.com

~~~
yesbabyyes
Thanks Prakash, I'll definitely check out your offering and reach out when we
have the need! It sounds cool, even though I can't help but feel I would like
really tight control over that part - basically we would be dependent on your
service instead.

