
Let's Make DNS Outage Suck Less - kvz
http://kvz.io/blog/2013/03/27/poormans-way-to-decent-dns-failover/
======
trotsky
dnsmasq --interface=lo --all-servers --server=172.16.0.23 --server=8.8.8.8 --server=4.2.2.2

echo "nameserver 127.0.0.1" > /etc/resolv.conf

<http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html>

 _--all-servers : By default, when dnsmasq has more than one upstream server
available, it will send queries to just one server. Setting this flag forces
dnsmasq to send all queries to all available servers. The reply from the
server which answers first will be returned to the original requester._

dnsmasq also has built-in avoidance of unresponsive servers; --all-servers is
a bit more of a blunt instrument.
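
To keep this across restarts, the same flags can live in a dnsmasq config file, which takes the long option names without the leading "--". A rough sketch that prints such a file; render_dnsmasq_conf is a made-up helper name, and where the config file lives (often /etc/dnsmasq.conf) depends on your distro, so treat the path as an assumption:

```shell
# Print a dnsmasq config equivalent to the command line above.
# render_dnsmasq_conf is a hypothetical helper, not part of dnsmasq.
# Arguments: one or more upstream resolver addresses.
render_dnsmasq_conf() {
    echo "interface=lo"   # only listen on the loopback interface
    echo "all-servers"    # query every upstream, use the first reply
    for upstream in "$@"; do
        echo "server=$upstream"
    done
}

# Prints to stdout; redirect into your dnsmasq config once it looks right.
render_dnsmasq_conf 172.16.0.23 8.8.8.8 4.2.2.2
```

Printing instead of writing the file directly makes it easy to inspect the result before touching anything under /etc.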

~~~
sigil
The example you gave won't stick for most people:

    
    
        echo "nameserver 127.0.0.1" > /etc/resolv.conf
    

If you use dhcp3-client (chances are you do), add this line to
/etc/dhcp/dhclient.conf to try your local DNS cache first:

    
    
        prepend domain-name-servers 127.0.0.1;
    

Note that this will slow down stuff like captive portals at airports that want
to push you to their "accept terms" page first.

I use djb's dnscache instead of dnsmasq, but dnsmasq works fine too.

<http://cr.yp.to/djbdns/run-cache.html>
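
If you script this, it helps to make the change idempotent so repeated runs don't stack up duplicate lines. A minimal sketch, assuming the line is not already present with different indentation; ensure_prepend is a hypothetical helper name, and you pass it the path to your own dhclient.conf:

```shell
# Append the prepend line to a dhclient config file only if it is missing.
# ensure_prepend is a hypothetical helper; $1 is the path to dhclient.conf.
ensure_prepend() {
    conf="$1"
    line="prepend domain-name-servers 127.0.0.1;"
    # -F fixed string, -x match the whole line, -q no output
    if ! grep -Fxq "$line" "$conf" 2>/dev/null; then
        echo "$line" >> "$conf"
    fi
}

# Try it on a scratch file first; running it twice still leaves one line.
scratch="$(mktemp)"
ensure_prepend "$scratch"
ensure_prepend "$scratch"
cat "$scratch"
rm -f "$scratch"
```

Once you trust it, point it at the real file (as root). The change should take effect the next time dhclient rewrites /etc/resolv.conf, for example on lease renewal.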

------
3amOpsGuy
You know, sometimes the simpler solutions work best. I like this one.

I'm sitting here contrasting this against my normal approach* and the approach
here is:

    
    
        1) easier to explain to others
        2) self documenting
        3) just as effective as running a caching nameserver
    

* I have VM hosts configured in CM with a set of promises applied. One of them is to run a caching nameserver, and these hosts are the only ones allowed to do zone transfers. The VM instances running on top of them have a promise applied that makes them use their underlying dom0 for DNS queries.

------
datums
If you use the rotate option with timeout you'll quickly hop to the next
working resolver. <http://edwin.io/optimized-resolv-conf>
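
For reference, a sketch of what such a resolv.conf could look like. render_resolv_conf is a made-up helper name, the nameserver addresses are the ones from the article, and the timeout value is only a placeholder (the replies below argue about what it should be):

```shell
# Print a resolv.conf that uses the rotate and timeout options.
# render_resolv_conf is a hypothetical helper.
# Arguments: timeout in seconds, then one or more nameserver addresses.
render_resolv_conf() {
    timeout="$1"
    shift
    for ns in "$@"; do
        echo "nameserver $ns"
    done
    # rotate:    spread queries across the listed nameservers
    # timeout:N  seconds to wait on a nameserver before trying the next one
    echo "options rotate timeout:$timeout"
}

# Prints to stdout; review before writing it to /etc/resolv.conf.
render_resolv_conf 1 172.16.0.23 8.8.8.8 4.2.2.2
```

Keep in mind that glibc only uses the first three nameserver lines, and that a DHCP client may overwrite the file unless it is configured not to.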

Do you have data you can share regarding "Amazon EC2 resolving nameserver
(172.16.0.23) is unreachable too often" ?

~~~
dsr_
You don't need to do rotation, either -- suppose the first nameserver, when
up, is consistently faster than the others. Just setting a timeout of 1 is
appropriate, then.

~~~
kvz
A timeout of 1 can result in false positives. 3 seconds is recommended
according to the manpage. Should the occasional false positive be none of your
concern, then you still add at least a second to every request. If you make
many, that's many seconds.

------
jefe78
They're using DNS for DB communication. That's a major faux pas.

~~~
kvz
Using IPs for communication to Amazon RDS instances is inadvisable. You'd have
to use WAN addresses outside of EC2, and LAN addresses inside of EC2.
Addressing them by domain resolves this. Additionally I don't think Amazon
guarantees your box will be accessible on the same IP addresses after Multi-AZ
failovers.

~~~
jefe78
You're definitely doing it wrong.

