
Stop using low DNS TTLs - fanf2
https://00f.net/2019/11/03/stop-using-low-dns-ttls/
======
teddyh
I operate authoritative name servers for almost 10,000 domains. Originally, I
used a default TTL of 2 days, as recommended by RIPE-203¹ (which is also
compatible with the recommendations of RFC 1912²), but this was not accepted
by users, who didn’t want to wait two days. Therefore, for all records except
SOA and NS records, I changed the default TTL to one hour, which I still use
as the default value unless a change is scheduled and/or planned, in which
case I lower it to 5 minutes. I do not want to lower it any more, as I’ve
heard rumors of buggy resolvers interpreting “too low” TTLs as bad, and
reverting to some very-high default TTL, and thereby wrecking my carefully
planned DNS changeover. I have, however, not seen any real numbers or good
references on what numbers are “too low”, and would like to hear from anyone
who might have some information on this.

1\.
[https://www.ripe.net/publications/docs/ripe-203](https://www.ripe.net/publications/docs/ripe-203)

2\.
[https://tools.ietf.org/html/rfc1912#page-4](https://tools.ietf.org/html/rfc1912#page-4)

~~~
lykr0n
Would you mind sharing the service?

~~~
teddyh
The service? What service? What do you mean?

~~~
lykr0n
You said you run authoritative servers- I assume you provide DNS hosting. I'm
curious which provider you run.

~~~
teddyh
I am hesitant to say; we only target the local area, and our home page isn’t
even available in English. Our main role is as a domain name _registrar_,
also providing, in increasingly tangential order, domain name strategy
planning, some trade mark strategy, DNS hosting, HTTP redirects, e-mail, and
web hosting. Our main value proposition is _support_: call us and talk to us
directly, or send an e-mail, and get an answer more or less immediately. We
only very reluctantly provide self-service control panels; we don’t mention
their availability unless people directly ask for it, and we generally
discourage their use, preferring that people simply tell us what they want done
in their DNS. Some people, including some very large companies, prefer this
arrangement, and if you are one of them, and you are part of our local market,
I’m sure you’ll be able to find us.

~~~
ianai
Probably way better to not say than to say. Little upside versus who knows
what downside.

------
jedberg
The irony of all of this is that those TTLs are almost meaningless as a server
operator anyway. Even if you set your TTL to 5 minutes, there are a whole lot
of clients that will ignore it.

When I made a DNS switch at reddit, even with a 5 minute TTL, it still took an
hour for 80% of the traffic to shift. After a week, only 95% had shifted.
After two weeks we still had 1% of traffic going to the old IP.

And after a month there was still some traffic at the old endpoint. At some
point I just shut off the old endpoint with active traffic (mostly scrapers
with hard coded IPs at that point as far as I could tell).

One of my friends who ran an ISP in Alaska told me that they would ignore all
TTLs and set them all to 7 days because they didn't have enough bandwidth to
make all the DNS queries to the lower 48.

So yeah, set your TTL to 40 hours. It won't matter anyway. In an emergency,
you'll need something other than DNS to rapidly shift your traffic (like a
routed IP where you can change the router configs).

~~~
jeremyw
(Weeping in agreement.)

The lesson I took away from hosting operations is that the implementation of
internet standards has a long tail of customization; it's part of the job to
accommodate them graciously. :)

~~~
hoseja
Do you mean gracefully?

~~~
the_duke
No reason one can't be both graceful and gracious.

------
pdkl95
> Why are DNS records set with such low TTLs?

The author seems to be missing one of the big problems with ridiculously low
TTLs: they let passive eavesdroppers discover a good approximation of your
browsing history. Passive logging of HTTP has (fortunately) been hindered as
most traffic has moved to HTTPS, but DNS is still plaintext.

Low TTLs mean a new DNS request happens approximately every time someone
clicks a link. Seeing which domain names someone is interacting with every 60s
(or less!) is enough to build a very detailed pattern-of-life[1]. Remember,
it's probably not just one domain name per click; the set of domain names
requested to fetch the js/css/images/etc. for each page can easily fingerprint
specific activities within a domain.

Yes, TTLs need to have some kind of saner minimum. Even more important is
moving to an encrypted protocol. Unfortunately DOH doesn't solve this
problem[2]; it just moves the passive eavesdropping problem to a different
upstream server (e.g. Cloudflare). The real solution is an encrypted protocol
that allows everyone to do the recursive resolution locally[3].

[1] [https://en.wikipedia.org/wiki/Pattern-of-life_analysis](https://en.wikipedia.org/wiki/Pattern-of-life_analysis)

[2]
[https://news.ycombinator.com/item?id=21110296](https://news.ycombinator.com/item?id=21110296)

[3]
[https://news.ycombinator.com/item?id=21348328](https://news.ycombinator.com/item?id=21348328)

~~~
mike_d
> The author seems to be missing one of the big problems with ridiculously
> low TTLs: they let passive eavesdroppers discover a good approximation of
> your browsing history.

I operate DNS for hundreds of thousands of domains. I've tried to reassemble
browsing history from DNS logs, and I can tell you it is damn near impossible.
You have DNS caches in the browser, the OS, broadband routers, and ISPs/public
resolvers to account for - and half of them don't respect TTLs anyways.

The reason people set low TTLs is they don't want to wait around for things to
expire when they want to make a change. DNS operators encourage low TTLs
because it appears broken to the user when they make a change and "it doesn't
work" for anywhere from a few hours to a few days.

~~~
brazzledazzle
The problem is that your ISP can log and mine your DNS requests, regardless of
the servers you use. They definitely do this and one can only assume they then
sell it after some sort of processing.

~~~
tyre
The comment you're replying to specifies caching at the browser, OS, and
router level. None of those three would show up as DNS requests at the ISP,
because the DNS is not being refreshed.

~~~
ses1984
Don't browsers and operating systems mostly respect ttls?

So if some things are cached, you won't get a complete picture, but the
picture you get might be enough.

~~~
spc476
I can't tell. I run Firefox at home, and set up my own DoH server (mainly
because I saw the writing on the wall, and if Mozilla/Google are going to
shove this down my throat, I want it shoved down on my terms, but I digress).
If I visit my blog (which has a DNS TTL of 86,400) I get a query for my domain
not only on every request, but even if I just hover over the link. It will
also do a query when I click on a link to news.ycombinator.com (with a TTL of
300) but not when I hover over a link. It's bizarre.

------
rgbrenner
I seem to remember a paper from a few years ago that (IIRC) tested this by
setting a very low TTL (like 60), changing the value, and seeing how long they
continued to receive requests for the old value... Most resolvers updated
within the TTL, but some took up to (I want to say) an hour. I'm probably
getting bits of this wrong, though.

I did find this paper: [https://labs.ripe.net/Members/giovane_moura/dns-ttl-violations-in-the-wild-with-ripe-atlas-2](https://labs.ripe.net/Members/giovane_moura/dns-ttl-violations-in-the-wild-with-ripe-atlas-2)

The violations in that paper that are important are those that have increased
the TTL. Reducing the TTL increases costs for the DNS provider, but isn't
important here. The slowest update was about 2 hours (with the TTL set to
333).

Of those that violated the TTL, we don't know what portion would function
correctly with a different TTL (increasing the TTL indicates they're already
not following the spec). So I wouldn't assume that increasing the TTL would
get them to abide by your requested TTL. They're following their own rules,
and those could be anything.

Considering how common low TTLs are... you're worrying about a DNS server
that's already potentially causing errors for major, well-known websites.

~~~
belorn
It is important to note that this study used active probes querying selected
recursive resolvers around the world.

From my own experience, when changing records and watching for the long tail
of clients to stop hitting the old addresses (by name), it is a really long
tail. An extreme example that lasted almost six months was a web spider that
simply refused to update its DNS records and continued to request websites
using the old addresses.

Is there a lot of custom-written code that does its own DNS caching? Yes.
Another example is internal DNS servers that shadow external DNS. There is a
lot of very old DNS software running year after year. Occasionally at work we
stumble onto servers that were very clearly handwritten a few decades ago by
people with only a vague idea of what the RFCs actually say. Those are not the
public resolvers of major ISPs, so the above study would not catch them.

Naturally, if you have a public resolver where people are constantly accessing
common sites with low TTLs, issues would crop up quickly and the support cost
would get the resolver fixed. If it's an internal resolver inside a company
where non-work sites are blocked, you might not notice until the company moves
to a new web hosting solution and suddenly no employee can access the new
site; an hour later they call the public DNS hosting provider, the provider
diagnoses the issue as internal to the customer's network, and finally,
several hours later, the faulty resolver gets fixed.

~~~
buzer
> An extreme example that lasted almost six months was a web spider that just
> refused to update their DNS records and continued to request websites using
> the old addresses.

It may have been a Java client that was not restarted. At least in older
versions of Java, the default was to cache results forever.

~~~
tetha
Yep, older Java versions had some ridiculous caching of both positive _and
negative_ DNS responses. That was a weird problem to troubleshoot. We ended up
writing our own caching, back in Java-7-ish. And the first version of our DNS
caching was broken and promptly triggered load alerts on 2 of our operations
team's DNS servers by issuing ... a lot of DNS queries very, very quickly :)

------
tzs
> Of course, a service can switch to a new cloud provider, a new server, a new
> network, requiring clients to use up-to-date DNS records. And having
> reasonably low TTLs helps make the transition friction-free. However, no one
> moving to a new infrastructure is going to expect clients to use the new DNS
> records within 1 minute, 5 minutes or 15 minutes. Setting a minimum TTL of
> 40 minutes instead of 5 minutes is not going to prevent users from accessing
> the service.

Note that you can still get the benefit of a low TTL during a planned switch
to a new cloud provider, server, or network even if you run with a high TTL
normally. You just have to lower it as you approach the switch.

For example, let's say you normally run with a TTL of 24 hours. 25 hours
before you are going to throw the switch on the provider change, change the
TTL to 1 hour. 61 minutes before the switch, change TTL to 1 minute.

~~~
devnulloverflow
Wouldn't you be canarying your switch over a period longer than 24 hours
anyway?

I can still imagine a benefit to short TTLs in the sense that you can maybe
roll out your canary in a more controlled way. But that's a lot more
complicated than the issue of quick switching.

~~~
tgsovlerkhgsel
If it's planned, yes.

If your cloud provider does an oopsie (e.g.
[https://news.ycombinator.com/item?id=20064169](https://news.ycombinator.com/item?id=20064169))
and takes down your entire infrastructure, or you have to move quickly for
some other reason, or you're recovering from a misconfiguration, the long TTL
can add 24 hours to your mitigation time.

If you're just playing around with your personal project/web site, you just
added a giant round of whack-a-cache to your "let's finally clean up my
personal server mess" evening.

~~~
ahje
As most people who have ever worked with web hosting can confirm, small-business
customers often have no idea what they're doing, and I've talked to _many_
people who switched providers after seeing an ad for cheap hosting, without
realising that a) they have to wait for the DNS changes to propagate, and b)
they have to actually move their web site from one provider to the other.

Subsequently, my previous employer lowered the default TTL, simply because it
got rid of all the bad Trustpilot ratings about customers being "prevented
from leaving", and started offering a "move my WordPress site for me" service
to profit from all the panicking newcomers who had no idea how to do trivial
things like importing/exporting a database and transferring files.

------
zamadatix
It would have been interesting to see actual delay rather than qualitative
results of the nature "<x>% wasn't in cache so this is horrible!". Admins and
users don't care whether something is in cache; they care what the impact on
operations and load time is.
[https://www.dnsperf.com/dns-speed-benchmark](https://www.dnsperf.com/dns-speed-benchmark)
says lookups for my personal domain take 20ms-40ms. Ironically, the same DNS
test for 00f.net takes 100ms-150ms.

99% of apps will gladly trade a 30ms increase in session start (assuming the
browser's prefetcher hasn't already beaten them to it) to not have to worry
about things taking an hour to change. Not all efficiency is about how
technically slick something is.

~~~
belorn
I just tested 00f.net and got numbers as low as 6ms. Latency is a question of
the network path between the client and the server: unless you use anycast,
you will get different latency depending on where in the world the client and
server reside, and if you do use anycast, it depends on how good the anycast
network's contracts and spread are.

~~~
zamadatix
Very true; it looks to be hosted out of Europe. The point about 00f.net
optimizing for ease of operation over milliseconds of performance only holds
doubly true with this information, though.

------
mc3
Hug of death by the looks of it. Maybe they need to quickly change their DNS
entries to point to a better server :-)

Edit: [https://outline.com/CrL5gf](https://outline.com/CrL5gf)

~~~
teddyh
Doesn’t work for me. This does:

[https://web.archive.org/web/20191103211525/https://00f.net/2019/11/03/stop-using-low-dns-ttls/](https://web.archive.org/web/20191103211525/https://00f.net/2019/11/03/stop-using-low-dns-ttls/)

------
Kudos
> I’m not including “for failover” in that list. In today’s architectures, DNS
> is not used for failover any more.

I mean, my company does this for certain failure scenarios involving our CDNs.
Can anyone tell me why we're idiots, or is this just hyperbole?

~~~
rodgerd
You aren't idiots if you're using it where there are no better alternatives -
it's preferable to use load balancers etc where available, but there are
places where it's very much "DNS or nothing".

~~~
bristolianthrw
How would I use a load balancer to fail traffic between, say, London and
Amsterdam with no fiber in place between them? Where would the load balancer
physically exist in that scenario and how would it fail to the other when
power is lost in one location? Would I make a third PoP to isolate it? What
would then be my redundancy story for that PoP? How would I relocate traffic
to my backup load balancer PoP number four?

Within a single network, sure, load balance all you want. That’s not the
scenario low TTLs go after.

~~~
teddyh
> _How would I use a load balancer to fail traffic between, say, London and
> Amsterdam with no fiber in place between them?_

What people use in those situations is Anycast.

Of course, DNS itself and e-mail don’t need this kind of redundancy, since the
NS (or MX) records themselves provide a list of failover servers. The
corresponding alternative for HTTP, SRV records, has been consistently
stonewalled by the standards writers for HTTP/2, QUIC, etc.

There is an interesting draft RFC which I am keeping an eye on, but I don’t
want to get my hopes up:

[https://tools.ietf.org/html/draft-nygren-dnsop-svcb-
httpssvc...](https://tools.ietf.org/html/draft-nygren-dnsop-svcb-httpssvc-00)

~~~
ti_ranger
> What people use in those situations is Anycast.

This requires that you burn a publicly-advertisable prefix for every unique
combination of services you would want to fail over.

E.g., if you wanted independent fail-over between your customer-facing
self-service portal and your webmail interface (each relying on specific state
that you can't replicate synchronously, and can't guarantee replicates
consistently with the other), you would need two /24s: one dedicated to
anycast for the webmail interface, one for the self-service portal, both
separate from any services which are active-active.

Whereas using DNS, you could use your other existing public /24s that you are
already using for your active-active services.

In the last days of IPv4, an extra two /24s just for this is quite an expense.

------
rini17
That can be an acceptable price for minimizing the impact of accidental DNS
misconfiguration, which has probably happened to every sysadmin.

Or is there a better way to quickly invalidate DNS caches in case of
emergency?

~~~
bristolianthrw
No, there isn’t. The specification as implemented requires no invalidation
mechanism, which means no such mechanism exists across _all_ caches, nor ever
will. The long tail kills you in such a failure scenario, and remember:
people who make kitchen appliances write DNS resolvers.

------
vitalysh
"The urban legend that DNS-based load balancing depends on TTLs (it doesn’t)"

So what's the solution? We are using AWS ALB/ELB, and the docs state that we
should have a low TTL, which makes sense: servers behind the LB scale up and
down. What is option B?

~~~
sandinmyjoints
In fact, if you use Route 53 with an alias to an ELB, the TTL is hard-coded at
60s; it is not even configurable. If it were, we'd follow the practice of
lowering it prior to changes and raising it again once things are stable, but
as it is, that's not an option. (Moving DNS off AWS would be a hard sell; not
because it's terribly hard, but because, as far as I'm concerned, there's not
really any value in doing it.)

------
jsizzle
I would maintain that if you are experiencing poor performance on a web site,
there are MUCH more fruitful places to look than DNS latency: third-party
objects, excessive page sizes, and lack of per-device optimization are just
the tip of the iceberg.

~~~
edoceo
For many apps I've worked on, the DB connection setup was always the slow part
(use PgBouncer). After that, it was the queries. DNS and gzipped CSS/JS were
chasing a red herring.

~~~
jsizzle
Yeah definitely. A poorly crafted SQL query can wreak havoc on performance,
especially at scale!

------
vitus
> Here’s another example of a low-TTL-CNAME+low-TTL-records situation,
> featuring a very popular name:

> $ drill detectportal.firefox.com @1.1.1.1

Is captive portal detection not a valid use case for low TTL? The entire point
is to detect DNS hijacking of a known domain, which takes longer when you
cache the DNS results...

~~~
zamadatix
Captive portal detection involves more than just checking for DNS hijacking.
The browser tries to load
[http://detectportal.firefox.com/success.txt](http://detectportal.firefox.com/success.txt)
and acts based on how that goes. Having a short TTL does not help.
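A rough sketch of that probe logic (the URL is the one from the comment; treating anything other than a 200 response with body "success" as a sign of an on-path portal is my assumption about the probe's contract, not Firefox's actual implementation):

```python
import urllib.request

PROBE_URL = "http://detectportal.firefox.com/success.txt"

def looks_captive(body: bytes, status: int) -> bool:
    """A response that isn't exactly the expected probe body suggests
    something on-path rewrote or redirected it, i.e. a captive portal."""
    return status != 200 or body.strip() != b"success"

def check(url: str = PROBE_URL) -> bool:
    # Plain HTTP on purpose: a portal can intercept it without TLS errors.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return looks_captive(resp.read(), resp.status)
```

Note that nothing here looks at the DNS answer itself, which is the point being made: the verdict comes from the HTTP round trip, so the record's TTL doesn't change the outcome.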

~~~
vitus
If your captive portal is implemented by intercepting your DNS queries, then
having a short TTL should ensure that the captive portal actually has a query
to intercept.

But sure, there are other implementation approaches (e.g. injecting HTTP
redirects), which I imagine is one reason why Firefox doesn't literally
inspect the DNS reply.

------
xfitm3
I run authoritative DNS for a very busy domain: 30B queries per month.
Originally we had 6-hour TTLs, but now I use 60s. We have had no problems.
Uptime and fast failover come before anything else.

------
justinsaccount
There was a DNS record, looked up primarily by large supercomputers, that had
a 0 TTL. It was used for stats via a UDP packet (because it was non-blocking;
never mind that the DNS query was blocking). This was set to 0 for "failover",
but it hadn't changed in years. I worked out that our systems alone had caused
billions of queries for this name.

After I complained I think they upped the ttl.. to 60.

~~~
zamadatix
Reminds me of a server pair at the last healthcare place I worked. Between the
two of them they'd generate something around 1,200 DNS lookups per second
(about 60% of the load on the DNS servers) for their own name. I think the
logic was: if the name stopped responding, then server A was primary. If the
name was responding, the server that owned the IP it resolved to was primary.
If the servers wanted to swap primary/secondary, they would issue a DDNS
request.

After about 8 years, we were restructuring our DNS infrastructure for
performance and I rate-limited those two to 10 or so queries per second each.
In that time there must have been 300 billion or so requests from those two
boxes alone.

~~~
justinsaccount
In my experience that sort of thing comes from the local hostname not being
present in /etc/hosts and (of course) no caching resolver being in use.

Some process on the system wants to connect to itself, which then causes a dns
lookup. Add a high transaction rate on top of that, and 1,200/second is easy.

The funniest one I remember seeing was thousands and thousands of lookups for
HTACCESS. It turned out Apache was running on top of a web root stored in AFS
and was not configured to stop looking for .htaccess files at the project
root, so it would try to open

    
    
      /afs/domain.com/app/.htaccess
      /afs/domain.com/.htaccess
      /afs/.htaccess
    

When it hit /afs/.htaccess, it would try to contact the AFS cell called
.htaccess, which would do a bunch of DNS lookups.

This would happen something like twice for every incoming web request.

------
vsviridov
Just anecdotally: from running Pi-hole and looking at the logs, I have some
sites being resolved 12K times over 11 days... that's over a thousand requests
a day.

~~~
jlgaddis
Running

    
    
      $ echo min-cache-ttl=300 | \
          sudo tee /etc/dnsmasq.d/99-min-cache-ttl.conf
    

will likely cut down on the number of forwarded queries by a large amount.
Adjust the value (in seconds) to your needs.

Don't forget to run

    
    
      $ sudo pihole restartdns
    

afterwards.

------
dorset
Can anyone explain why ping.ring.com needs to have such a low TTL?

    
    
      =-=-=-=-=
      $ drill ping.ring.com @1.1.1.1
      ;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 36008
      ;; flags: qr rd ra ; QUERY: 1, ANSWER: 2, AUTHORITY: 1, ADDITIONAL: 0
      ;; QUESTION SECTION:
      ;; ping.ring.com. IN A
       
      ;; ANSWER SECTION:
      ping.ring.com. 3 IN CNAME iperf.ring.com.
      iperf.ring.com. 3 IN CNAME ap-southeast-2-iperf.ring.com.
       
      ;; AUTHORITY SECTION:
      ring.com. 573 IN SOA ns-385.awsdns-48.com. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
      =-=-=-=-=
    

I've been trying to find out from Ring support for a few days, and while the
support layer has been trying to find out, not much information seems to be
coming back. To put this in perspective: in my house, with two Ring devices (a
doorbell and a chime), I am getting 10,000+ un-cached DNS requests a day,
which is easily 20x more than the second most requested domain.

~~~
Twirrim
That's actually a pretty long TTL by Amazon standards.

Amazon.com:

    
    
        ;; ANSWER SECTION:
        amazon.com.             60      IN      A       176.32.103.205
        amazon.com.             60      IN      A       176.32.98.166
        amazon.com.             60      IN      A       205.251.242.103
    

Or some AWS services

Glacier:

    
    
        ;; ANSWER SECTION:
        glacier.us-east-1.amazonaws.com. 60 IN  A       54.239.30.220
    

S3 has even shorter:

    
    
        ;; ANSWER SECTION:
        s3.ap-northeast-1.amazonaws.com. 5 IN   A       52.219.0.8
    

Or, say, DynamoDB:

    
    
        ;; ANSWER SECTION:
        dynamodb.us-east-1.amazonaws.com. 5 IN  A       52.94.2.72
    
    

The main reason to do so is to be nimble: to be able to react to incidents and
make changes as fast as you can, and to make certain deployment patterns
possible.

From time to time, you need to do something with customer-facing
infrastructure: remove the DNS entry, watch the traffic drain over the next
5-10 minutes, do what you need to do on the device, test, and then add it back
into DNS, at which point you can watch traffic return to normal levels and
verify everything is good.

~~~
33C57E63
You're looking at the SOA record (TTL 573) for ring.com, but he's asking about
the CNAME (TTL 3) for ping.ring.com.

------
megous
Thankfully this is one of those things that you don't need to respect. TTLs
are just suggested values in the end (the standard may disagree).

I just checked, and I actually have TTL forced to 1 day in dnscrypt-proxy. My
internet experience is fine. I guess I never noticed in the last 2 years or
so.
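For reference, dnscrypt-proxy exposes this as TTL clamping in its TOML configuration; a fragment like the following (values are examples matching the "forced to 1 day" described above, not the commenter's actual settings) pins every cached answer to one day:

```toml
# dnscrypt-proxy.toml (excerpt): clamp TTLs seen by the local cache
cache = true
cache_min_ttl = 86400   # floor: treat every record as valid for at least a day
cache_max_ttl = 86400   # ceiling: and no longer than a day
```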

------
gpm
Why does DNS cache expiration need to be in the critical path?

Instead of a browser doing

1\. Local DNS lookup (resulting in expired entry)

2\. DNS query

3\. DNS response

4\. HTTP request

why not do

1\. Local DNS lookup (resulting in expired entry)

2.1. DNS query

2.2. HTTP request

3\. DNS response

4\. If DNS response changed and HTTP request failed, HTTP request again

Maybe use two expiration lengths, one that results in flow 2 and a much longer
one that results in flow 1.

~~~
dsp
Ya, this is roughly what the FB apps do. DNS rarely blocks, and changes are
seen quickly.
seen quickly.

------
MayeulC
Well, in my case it makes sense, I think: I host my server at home and have a
dynamic IPv4 address. I don't know when it might change, so I just set the TTL
to something low.

Since the traffic is low, though, I can afford to check for an IP change every
~5 min, and although I set a TTL of ~15 min on most services, the main CNAME
(an OVH-provided dynamic-DNS service, TTL set by them) is set to 60s.

My IPv6 record was set to 1h, but I'll look into increasing it. My mobile
phone often pings my server, so I imagine a longer TTL could reduce its
battery usage.

------
jimnotgym
Please excuse any ignorant use of terminology; I am not a DNS expert like
others on here, but I can share some experience from the smaller-business
world.

A company I worked with a couple of years ago was using Dyn as their DNS
provider, and one day we got a notification that we had passed the usage
limits for our account. This seemed impossible considering our site was
getting a couple of hundred unique visitors a day. A few things came out of
the analytics.

1) A short TTL on an A record had been left over from a website migration
project. The majority of the requests were coming from our internal website
administrators. I moved it up to a couple of hours and this went away.

2) We were getting a huge number of AAAA-record hits. I think most modern
browsers/OSes try quad-A first??? We didn't have IPv6 configured, so the
negative resolution had a TTL set by the minimum field on the SOA record,
which was 1 second! Changing this to 60 caused a huge reduction in requests. I
suppose I should have set up IPv6, but I didn't.

3) When we sent out stuff to our mailing list, the SPF (or rather TXT) records
saw a peak that was off the chart. We had a pretty settled infrastructure, so
I moved that TTL to a day (I think, from memory) and it flattened the peak
somewhat.

4) There was a large peak in MX requests around 9am. I put this down to people
opening their email when they got to work and replying to us. I had to set the
TTL to a couple of days (of course) to smooth that one out.

I like to think it was worthwhile and improved things for users. I at least
had a nice warm glow that I had saved the internet from a bunch of junk
requests, and it just felt tidier.
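The 1-second negative TTL in point 2 comes from the last field of the SOA record: per RFC 2308, resolvers cache NXDOMAIN/NODATA answers for the lesser of the SOA record's own TTL and its "minimum" field. A hypothetical zone snippet (names and timer values invented for illustration):

```
; The final SOA field ("minimum") caps negative caching (RFC 2308).
example.com. 3600 IN SOA ns1.example.com. hostmaster.example.com. (
        2019110301 ; serial
        7200       ; refresh
        900        ; retry
        1209600    ; expire
        60 )       ; minimum: negative answers cached for 60s, not 1s
```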

------
skybrian
I'm wondering what it would take to make a DNS caching service with updates
based on reliable notifications rather than polling? After all, every
cellphone does it.

~~~
diegocg
This already exists (NOTIFY), but it's only used for master-slave setups (i.e.
a bunch of DNS servers serving the same authoritative zone, which want changes
transmitted to all slaves ASAP).

It would be interesting to (ab)use this mechanism in the way you suggest: a
recursive DNS server could ask to be NOTIFYed of changes in the zone it is
querying. It would, of course, add load to the server, and it would need
strict limits to avoid DoS, but it seems an interesting idea.

~~~
dsp
The big problem, to the extent there is one, is between the client and the
recursive server, not so much between the recursive and the authoritative.
Cost is highly amortized between recursive and authoritative for busy names.
------
psanford
The author says low TTLs are bad because of latency, but never attempts to
quantify how much latency we are actually talking about. It's hard to know how
outraged I'm supposed to be without actually seeing the numbers.

It seems that a lot of sites are OK with slightly higher latencies if it means
greater operational flexibility.

~~~
megous
Latency depends on many things, starting with DNS server location: accessing
my website from Australia will take 500ms for the DNS lookup (or twice as much
if I'm using CNAMEs). If this is not cached somewhere, that's 500ms every few
seconds with those sub-minute TTLs. If I'm on GPRS or similar, that adds
another few hundred ms to every useless DNS resolution, plus unpredictable
variability.

So there's no single latency to report.

------
AYBABTME
"It was DNS" is the root cause of enough postmortems to justify low TTL
values, in my opinion.

------
rocky1138
I run my own dnsmasq server on an old laptop and force really long TTL caching
regardless of what the records come back with. I even cache NXDOMAIN. It works
great, except once or twice a month I have to flush the cache because Slack
seems to not handle it well.

~~~
iforgotpassword
I'm doing the same but "only" three hours and it's working just fine. Not a
slack user though.

------
otterley
I’m honestly not sure what the author is complaining about. If the
infrastructure can handle it, the zone owner is willing to pay for the extra
traffic, and DNS cache operators are fine with it, then this seems like a call
for premature optimization.

------
zokier
The missing data in the article that has many graphs is how often did the
records truly change

------
lanstin
I have worked in a place using GTM to fail over from a bad data center to a
good data center. Maybe few minutes TTL. I worried about it but availability
is much higher this way, especially combined with a only change one data
center at a time.

~~~
hvindin
This.

I'm kind of surprised that I can't see any other comments talking about GTMs
(I assume you mean F5 Global Traffic Managers).

Where I am at the moment, GTMs are used everywhere, and everywhere the TTL is
set to 30s.

The only part of this that really annoys me is the global default
configuration: rather than serving up a subset of the list of IP addresses,
only a single IP address is returned when you resolve down to the A record.

When I've pressed the issue that, _at least_ on our internal GTMs, we should
return a bunch of IP addresses every time someone resolves the address, I've
been told that it would break load balancing... which blows my mind, because
who on earth is relying on DNS to load-balance traffic with a 30s TTL? I would
have thought that the normal thing to do, if you actually wanted load to
balance, would be to return a subset of the IP addresses, in a different order
and with a different subset each time. That way, resolvers that cache the
record can at least return multiple addresses to all the clients they serve,
as opposed to everyone using that resolver getting stuck to a single address
for 30 seconds...

But all of that being said, it would make perfect sense to me to just return
like 4 IP addresses publicly for every resolution and rotating setting the TTL
to like 30s so that clients could spend 30s iterating through the A Records
they have cached, then hit your resolver up again and get a different sites
addresses back if your site had gone down...
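
The rotation idea can be sketched in a few lines (the addresses and function name here are made up for illustration, not anything F5 actually does):

```python
# Sketch: answer each DNS query with a different rotating slice of the
# address pool, so cached answers still spread load across servers.
ADDRS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4",
         "192.0.2.5", "192.0.2.6", "192.0.2.7", "192.0.2.8"]

def rotated_subset(pool, size, counter):
    """Return `size` addresses starting at a rotating offset; successive
    responses (counter = 0, 1, 2, ...) hand out different slices."""
    start = (counter * size) % len(pool)
    doubled = pool + pool          # cheap wrap-around at the end of the pool
    return doubled[start:start + size]

# each successive query gets a different 4-address slice of the pool
first = rotated_subset(ADDRS, 4, 0)
second = rotated_subset(ADDRS, 4, 1)
```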

------
Neil44
Obvs. the author's server isn't seeing longer TTLs because there's no need
for clients to keep querying his server for them?

------
JeanMarcS
To avoid delays when migrating a website's IP, what I usually do is first
migrate onto an HAProxy (say 2 days before switching) so all ISP DNS caches
are updated, and on D-day I switch my backend to the new website/VM.

And then I change my DNS again to point at the new IP.

You have to tune things a bit to get the right client IP in your logs, but so
far it works.
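
The D-day step can be sketched as an HAProxy config (addresses and names here are illustrative): DNS already points at the proxy, so swapping the backend server and reloading switches traffic instantly, with no propagation wait.

```
# haproxy.cfg -- sketch of the two-step migration (addresses illustrative)
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind :80
    default_backend site

backend site
    # before D-day this points at the old server; on D-day swap the
    # address to the new VM and reload -- no DNS propagation involved
    server origin 203.0.113.10:80 check
```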

------
StreamBright
>> The urban legend that DNS-based load balancing depends on TTLs (it doesn’t
- since Netscape Navigator, clients pick a random IP from a RR set, and
transparently try another one if they can’t connect)

Unless you don't return an RR set, and what you return is based on
geolocation and data center health.

~~~
necovek
Um, how does this work with the "global" DNS services people tend to use more
and more? (e.g. Cloudflare's 1.1.1.1 or Google's 8.8.8.8/8.8.4.4)

Basically, your request is coming from them and wherever their servers are
(US, I guess, though they probably have several data centers) and they route
it to the final user.

I think using DNS-based geolocation sounds like a really bad idea: what am I
missing?

~~~
amalcon
The EDNS0 client-subnet extension exists for this exact reason.
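
For the curious, the option's wire format (per RFC 7871) is small enough to build by hand; this sketch encodes an IPv4 client-subnet hint exactly as a resolver would attach it to an outgoing query:

```python
import struct

def build_ecs_option(subnet: str, prefix_len: int) -> bytes:
    """Encode an EDNS0 Client Subnet option (RFC 7871 wire format).
    Family 1 = IPv4; the address is truncated to the prefix length so
    no more of the client's address than necessary is revealed."""
    addr_bytes = bytes(int(octet) for octet in subnet.split("."))
    keep = (prefix_len + 7) // 8          # whole bytes needed for the prefix
    # FAMILY (2 bytes), SOURCE PREFIX-LENGTH, SCOPE PREFIX-LENGTH, ADDRESS
    payload = struct.pack("!HBB", 1, prefix_len, 0) + addr_bytes[:keep]
    # OPTION-CODE 8 = edns-client-subnet, then OPTION-LENGTH
    return struct.pack("!HH", 8, len(payload)) + payload

opt = build_ecs_option("198.51.100.0", 24)   # a /24 hint, 3 address bytes
```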

~~~
necovek
Thanks. Unfortunately, it seems that only Google DNS and OpenDNS (Cisco,
iirc) include the data as of now. Older articles even mention that you had to
have your website (well, nameservers) whitelisted for them to forward the
client subnet as part of DNS queries; not sure if that's still the case.

Of course, caching gets more complicated and less useful with this.

------
minusf
DNS Caching: Running on Zero

[https://archive.nanog.org/meetings/nanog50/presentations/Tue...](https://archive.nanog.org/meetings/nanog50/presentations/Tuesday/NANOG50.Talk64.bhatti-
nanog50_dns.pdf)

------
crispyporkbites
Should DNS providers have a setting that increases TTLs over time
automatically? I.e. the longer I leave my DNS entry pointing at the same IP,
the longer my TTL gets?

Obviously it would be possible to opt out for situations where you genuinely
need a low TTL on a domain.

~~~
jcrites
This is definitely a feature I've also thought would be useful to have in DNS
providers.

I've worked on managing thousands of (sub)domains and the administrative
overhead of changing the TTLs for everything manually would be considerable.
I'd certainly like an automated way to say "These records should gradually
increase TTL up to <X> time over <Y> time" (e.g., gradually raise TTL to 2
days over 2 weeks if there are no changes).

There are downsides to high TTLs though: (1) you need to remember to
preemptively lower them ahead of any planned changes (if you want those
changes to take effect quickly), and (2) you can't change the records quickly
in an emergency. But, fortunately, lots of record types are ones that you
probably don't need to change in an emergency -- and for ones that you do, you
can use a low TTL.

Anyway, I'd personally like to see automated TTL management as a feature in
DNS software.
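
Such a policy could be as simple as a linear ramp; this is a hypothetical sketch (the function, the floor/ceiling values, and the two-weeks-to-two-days schedule are all made up to match the example above):

```python
from datetime import datetime, timedelta

def ramped_ttl(last_change, now, floor=300, ceiling=172800, ramp_days=14):
    """Hypothetical policy: serve a 5-minute TTL right after a record
    changes, then grow it linearly to 2 days over 2 weeks of stability.
    Any edit to the record resets `last_change` and restarts the ramp."""
    stable_secs = (now - last_change).total_seconds()
    frac = min(stable_secs / timedelta(days=ramp_days).total_seconds(), 1.0)
    return int(floor + frac * (ceiling - floor))

# freshly changed record -> floor; untouched for 2+ weeks -> ceiling
ttl_now = ramped_ttl(datetime(2024, 1, 1), datetime(2024, 1, 1))
ttl_later = ramped_ttl(datetime(2024, 1, 1), datetime(2024, 2, 1))
```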

------
lgats
For those running Cloudflare: proxied DNS records have an unchangeable TTL of
5 minutes.

------
necovek
This entire analysis is just plain wrong.

They've collected data on DNS queries "for a few hours". By definition,
clients that have a DNS answer cached (iow, most clients, since browsers and
the resolver calls in operating systems do that for you) will not issue DNS
requests for any records whose TTL has not yet expired.

So they've caught _all_ the TTLs shorter than whatever "a few hours" is
(well, all that were re-requested), but only those longer ones that happened
to expire during the experiment and were re-requested.

To run a proper experiment testing "short" vs "regular" TTLs (let's say 1-3
days), you need to collect data for days (e.g. at least 7, preferably at
least 30), and even that would miss most TTLs longer than 7/30 days.

Articles like this are bad because they can easily confuse even
knowledgeable people like the HN crowd.

------
musicale
I hate low DNS TTLs. They are a stupid way to do load balancing.

However they wouldn't be quite as bad if web pages didn't load useless crap
from 60 different domains.

------
highprofittrade
What if you accidentally make a bad change to your authoritative nameserver?
You want recovery to be as fast as possible, because it's a complete outage.

~~~
shinwn
Nothing precludes you from raising the TTL again after the change.
Traditionally, DNS admins progressively drop the TTL prior to a change to
reduce the time an RRset is in flux (so if your TTL is N, then N + 1 seconds
before the change you drop it to N/2, and again and again until it's at your
preferred window size), and cautious ones slowly ramp it back up to the
regular value afterwards.
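
The halving schedule described above can be sketched like this (a made-up helper for illustration; each step has to be published at least one old-TTL ahead of the next so caches have time to pick it up):

```python
def rampdown_schedule(current_ttl, target_ttl):
    """Halve the TTL step by step ahead of a planned change, stopping at
    the target window size. Returns the sequence of TTLs to publish."""
    steps = []
    ttl = current_ttl
    while ttl > target_ttl:
        ttl = max(ttl // 2, target_ttl)  # never drop below the target
        steps.append(ttl)
    return steps

# e.g. ramping a 1-day TTL down to a 5-minute window before a migration
schedule = rampdown_schedule(86400, 300)
```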

------
geogriffin
Am I missing something, or is the reason most of the queries observed have low
TTL because, well, they have a low TTL? IOW, the higher TTL responses would be
cached downstream and so you'd see them less often. If that is the case, the
distribution shown is not all that surprising.

~~~
necovek
It's weird how people are not understanding this; perhaps it's the way you
phrased it, or perhaps you didn't mention the core point from the article:
the experiment was only run "for a few hours". This means that many DNS
records (well, most) with a TTL greater than the experiment duration would
not show up in the data.

FWIW, I've learned in the past that while there are plenty of people who
claim to want communication to be as succinct as possible, the majority are
unable to follow when somebody is really terse (while still saying exactly
enough). I've learned to follow up such a terse statement with examples and
longer explanations for the majority that doesn't get it.

But maybe it's just that people don't expect mathematics-level precision on
the internet :)

------
IshKebab
Maybe DNS servers should support push updates rather than relying on polling.

~~~
fanf2
They do!

[https://tools.ietf.org/html/rfc1996](https://tools.ietf.org/html/rfc1996)

[https://tools.ietf.org/html/rfc1995](https://tools.ietf.org/html/rfc1995)

~~~
zamadatix
I don't think they meant to authoritative servers.

------
ryanthedev
Seems like DNS TTL was a big issue before HTTP 1.1.

Connections are cached and reused.

~~~
zamadatix
If you're talking about keep-alive, connections time out far before most of
the DNS TTLs mentioned in this article, and they don't persist after a
connection is closed anyway.

~~~
ryanthedev
That's not true. It all depends on the implementation...

Also, what are you talking about? If the connection is closed, how would it
be reused?

Connections should only be closed due to? Inactivity. If a connection is
closed, don't you think you would probably want to do another DNS request?

Also, if you're doing proper layer 4 load balancing using BGP, DNS is a moot
point...

Magic...

~~~
zamadatix
> Connections should only be closed due to? Inactivity.

Or if the user closes the browser, or if the server/proxy restarts. But yes,
mostly inactivity, on the order of a couple of minutes.

> If a connection is closed, don't you think you would probably want to do
> another DNS request?

That's the whole point of the DNS TTL, to say how long to go before doing
another lookup rather than doing it each time you reconnect.

> Also if your doing proper layer 4 load balancing using BGP, DNS is a moot
> point...

BGP load balancing operates on layer 3, and is irrelevant as you still need to
DNS lookup an anycast address. EDNS client subnet is better anyways.

~~~
ryanthedev
An anycast address doesn't change. I mean come on. And it actually operates on
layer 4. It uses layer 4 to actually work?

I hope you didn't pay for your education.

~~~
zamadatix
Anycast addresses change all the time. Ask Google, Microsoft, Amazon, Akamai,
Cloudflare and so on if you don't believe me. About the only anycast IPs that
don't change are public DNS resolvers, but that's also true of unicast
resolvers.

By that logic BGP is a layer 7 load balancer since it has an application
layer. BGP only exchanges layer 3 reachability information to update route
tables therefore you can only load balance layer 3 with it.

Personal attacks and other things in your comments are against the HN
guidelines. The goal is to talk about DNS/TTLs and their impact on performance
not insult each other.
[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

------
KirinDave
Cloud providers actually use low TTLs to route traffic globally away from
regional failures. You're not seeing that go away anytime soon, there aren't
other options.

------
wazoox
Huh? I've always used a 24-hour TTL for DNS. I reluctantly move it to 1 hour
for some tests, then quickly set it back to 24 hours. What are these people
thinking?

~~~
throwaway-9320
One use case where short TTLs make sense is running a service on a
residential network, where a power outage or router reboot can trigger an IP
address change. With a short TTL, if the IP address changes you won't be
offline for too long.

Yes, it is not exactly great, but at least it works well enough for self-
hosting services.

