
Ask HN: Are we all OK with short DNS TTLs now? - mfincham
Conventional wisdom has long had it that short DNS TTLs are bad because
they increase query loads and any faults in DNS will cause availability
problems more quickly.

It seems popular now though for large sites (e.g. twitter.com, github.com,
ycombinator.com, stackoverflow.com, just to name a few) to use relatively
short DNS TTLs, between 1 and 5 minutes, presumably to make failover
easier.

Has popular opinion around short TTLs being "OK" changed? Are these sites
doing something special to make this viable?
======
wahern
Very large site operators use anycast routing[1], which is particularly
well suited to UDP DNS; they can easily spread load and minimize latency.
If you can leverage anycast directly, or indirectly through a DNS hosting
service, then low TTLs are not a problem.

Otherwise, it depends. Even with one DNS query per HTTP request, DNS will
only represent a fraction of your network load. It's difficult to make any
DNS server break a sweat; more likely the network link will saturate
first, causing lots of dropped packets. And it's trivial to advertise and
use multiple DNS servers, far more so than replicating a web application
stack. DNS was built for high availability almost from day 1. This is also
why you shouldn't worry too much about low TTLs exacerbating network
faults--there's no excuse for not using geographically dispersed
authoritative name servers.
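
To make the "multiple dispersed name servers" point concrete, here's a
minimal sketch using the third-party dnspython library that lists a
domain's advertised NS records and resolves each one; "example.com" is a
placeholder, not one of the sites from the thread:

    import dns.resolver  # third-party: pip install dnspython

    domain = "example.com"  # placeholder domain

    # Fetch the advertised authoritative name servers (NS records).
    for ns in dns.resolver.resolve(domain, "NS"):
        name = ns.target.to_text()
        # Resolve each server to its address(es); addresses clustered in
        # one prefix suggest the servers aren't actually dispersed.
        addrs = [a.to_text() for a in dns.resolver.resolve(name, "A")]
        print(name, addrs)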

For example, depending on the site I'll often host the domain on my own
primary name server so I can control records without fscking with a web GUI or
REST API, but the advertised authoritative name servers are EasyDNS servers
which behave as secondaries mirroring my primary.

The real issue isn't load but latency, which is a more complex problem. If
you're not using anycast, then your site is probably not big enough or
important enough for a few milliseconds of upfront latency on intermittent
page loads to matter. Also, many caching resolvers these days will
preemptively refresh records as TTLs expire, subject to usage patterns,
which means that with moderate, repeat traffic users may not experience
any additional latency at all. (Similarly, caching resolvers will often
remember failing servers and try them last, regardless of their ordering
in a response.)

As for how painful erroneous DNS changes are: low and high TTLs cut both
ways. If it really matters, you should be monitoring this stuff 24/7 (e.g.
with Pingdom), which means record errors should be quickly identified and
reported. If you're set up to respond quickly (which you should be for a
serious commercial operation), that augurs in favor of low TTLs.
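
For a sense of what that kind of record check might look like, here's a
hedged sketch using dnspython (the hostname and expected address set below
are hypothetical) that compares the A records a name currently serves
against a known-good set:

    import dns.resolver  # third-party: pip install dnspython

    EXPECTED = {"192.0.2.10", "192.0.2.11"}  # hypothetical known-good set

    def check_record(name):
        # Compare the currently served A records against the expected set.
        seen = {rr.to_text() for rr in dns.resolver.resolve(name, "A")}
        if seen != EXPECTED:
            # A real monitor (Pingdom etc.) would page someone here.
            print("ALERT: %s returned %s, expected %s"
                  % (name, seen, EXPECTED))
            return False
        return True

    check_record("www.example.com")  # hypothetical record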

[1] https://en.wikipedia.org/wiki/Anycast

------
bigiain
AWS Route53 defaults to 300 seconds (which is probably why so many
articles report that as the median TTL), and I've never encountered
problems leaving that at the default. I suspect the performance problems
are real if you're running your own DNS servers, but if you're
piggybacking on something like AWS I seriously doubt you'll see issues
(apart from monthly bills, if you suddenly go viral to the moon...)
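
As a sketch of what overriding that default looks like through Route53's
API via boto3 (the hosted zone ID, record name, and address below are all
placeholders):

    import boto3  # AWS SDK for Python

    route53 = boto3.client("route53")

    # UPSERT an A record with an explicit 60-second TTL instead of the
    # 300-second default the Route53 console suggests.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000000000",  # placeholder zone ID
        ChangeBatch={
            "Comment": "Lower TTL ahead of a planned failover",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",  # placeholder record
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "192.0.2.10"}],
                },
            }],
        },
    )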

"Back in the day", Internet Explorer was a problem with TTLs, from memory IE6
was when they stopped caching all dns lookups for 24hrs no matter what the til
was, and IE6 still coached for 4hrs. (This was a drama for me back in the
early 2000's when I was trying to do dns based load balancing...)

My opinion these days is don't try to go much below 1 minute if you want
other people's resolvers or software to honour your TTLs. I do see people
using 1-second TTLs occasionally though, so presumably if your application
doesn't mind too much that not everybody honours your TTL, it's still
worth doing for some people...
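
One way to check whether a given resolver honours your TTL is to query it
twice and watch the cached TTL count down; a resolver that clamps low TTLs
will report something larger than what you published. A minimal sketch
with dnspython (the resolver address and domain are placeholders):

    import time
    import dns.resolver  # third-party: pip install dnspython

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["8.8.8.8"]  # whichever resolver you're testing

    # The TTL in the second answer should be ~5 seconds lower if the
    # resolver is caching the record and counting down honestly.
    for _ in range(2):
        answer = resolver.resolve("www.example.com", "A")  # placeholder
        print("TTL as served by the cache:", answer.rrset.ttl)
        time.sleep(5)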

------
mfincham
Also in the mix: is there any point in having multiple A records for a busy
site now?

