
AWS engineer here, I was lead for Route 53.

We generally use 60-second TTLs, and TTLs as low as 10 seconds are very common. There's a lot of myth out there about upstream DNS resolvers not honoring low TTLs, but we find they're honored very reliably. We actually see faster convergence times with DNS failover than with BGP/IP Anycast. That's probably because DNS TTLs decrement concurrently on every resolver holding the record, while BGP advertisements have to propagate serially, network by network.

The way DNS failover works is that the health checks are integrated directly with the Route 53 name servers. In fact, every name server checks the latest health status every single time it gets a query. Those statuses are basically a bitset, being updated /all/ of the time. The system doesn't "care" or "know" how many health statuses change each time; it's not delta-based. That's made it very, very reliable over the years. We use it ourselves for everything.
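If you want to wire this up yourself, creating one of those health checks with boto3 looks roughly like this (the endpoint, path, and values are made up; the returned Id is what you attach to a record set via HealthCheckId):

    import boto3

    route53 = boto3.client('route53')

    # Hypothetical backend endpoint; RequestInterval=10 is the "fast" interval.
    resp = route53.create_health_check(
        CallerReference='www-primary-2018-01-15',
        HealthCheckConfig={
            'IPAddress': '203.0.113.10',
            'Port': 443,
            'Type': 'HTTPS',
            'ResourcePath': '/health',
            'RequestInterval': 10,
            'FailureThreshold': 3,
        },
    )
    health_check_id = resp['HealthCheck']['Id']  # referenced from the record set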

Of course the downside of low TTLs is more queries, and we charge by the query unless you ALIAS to an ELB, S3, or CloudFront (then the cost of the queries is on us).




_most_ of the traffic will move in response to DNS changes, but there's always a group of resolvers that keep your old IPs for an unreasonable amount of time. I've taken machines out of DNS rotations with short TTLs (I think 5 minutes, but maybe 1 hour) and had some amount of traffic on them for weeks. After a reasonable amount of time, too bad for them, but when I can work behind a 'real' load balancer it's nice to be able to actually turn off the traffic.


Interesting, thank you. So a potential mitigation strategy could look like this:

- Route 53 failover record
  * primary record: Google global load balancer IP
  * secondary record: Route 53 Geolocation set (really need that latency)
- Elastic Load Balancer record per region
  * routes to the mirror region's GCP IP address (ELB's application load balancer seems to be able to point to AWS external IPs)
  * optionally spin up mirror infrastructure in AWS
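In Route 53 API terms that might look roughly like this (zone ID, health check ID, IPs, and ELB names are all made up, and I'm guessing at latency records for the regional fan-out):

    import boto3

    route53 = boto3.client('route53')
    zone_id = 'Z1EXAMPLE'  # hypothetical hosted zone

    changes = [
        # PRIMARY: plain A record pointing at the GCP global load balancer IP
        {'Action': 'UPSERT', 'ResourceRecordSet': {
            'Name': 'www.example.com.', 'Type': 'A',
            'SetIdentifier': 'gcp-primary', 'Failover': 'PRIMARY',
            'TTL': 10, 'HealthCheckId': '11111111-2222-3333-4444-555555555555',
            'ResourceRecords': [{'Value': '203.0.113.20'}]}},
        # SECONDARY: alias to a latency-based set that fans out to regional ELBs
        {'Action': 'UPSERT', 'ResourceRecordSet': {
            'Name': 'www.example.com.', 'Type': 'A',
            'SetIdentifier': 'aws-secondary', 'Failover': 'SECONDARY',
            'AliasTarget': {'HostedZoneId': zone_id,
                            'DNSName': 'aws.example.com.',
                            'EvaluateTargetHealth': True}}},
        # One latency record per region, aliased to that region's ELB
        {'Action': 'UPSERT', 'ResourceRecordSet': {
            'Name': 'aws.example.com.', 'Type': 'A',
            'SetIdentifier': 'us-east-1', 'Region': 'us-east-1',
            'AliasTarget': {'HostedZoneId': 'ZELBEXAMPLE',  # the regional ELB's canonical hosted zone ID
                            'DNSName': 'mirror-elb.us-east-1.elb.amazonaws.com.',
                            'EvaluateTargetHealth': True}}},
    ]

    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={'Changes': changes},
    )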

Seems brittle. Does Azure support global load balancing with external IPs?

Does anyone have such (or similar) setup actually in production? How did it work today?


That would work, and Azure Traffic Manager does support external IPs. CDNs like Cloudflare and Fastly also have built-in load-balancing where they use their internal routing tables for faster propagation.


I haven't been able to make an ELB target be an external IP. What did you mean by "ELB's application load balancer seems to be able to point to AWS external IPs"?


https://aws.amazon.com/elasticloadbalancing/details/#details

IP addresses as Targets

You can load balance any application hosted in AWS or on-premises using IP addresses of the application backends as targets. This allows load balancing to an application backend hosted on any IP address and any interface on an instance. You can also use IP addresses as targets to load balance applications hosted in on-premises locations (over a Direct Connect or VPN connection), peered VPCs and EC2-Classic (using ClassicLink). The ability to load balance across AWS and on-prem resources helps you migrate-to-cloud, burst-to-cloud or failover-to-cloud.

Looks like you need an active VPN connection to access external IPs.


That feature requires you to use a private IP address, so if you have a VPN or Direct Connect to another location you could load balance across locations. In the case of the global load balancers those will be public addresses though.

"The IP addresses that you register must be from the subnets of the VPC for the target group, the RFC 1918 range (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16), and the RFC 6598 range (100.64.0.0/10). You cannot register publicly routable IP addresses."

[1] https://docs.aws.amazon.com/elasticloadbalancing/latest/netw...
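To make the constraint concrete, registering a private/on-prem IP as a target with boto3 looks roughly like this (VPC ID and address are made up):

    import boto3

    elbv2 = boto3.client('elbv2')

    # Hypothetical target group that takes IP addresses instead of instance IDs
    tg = elbv2.create_target_group(
        Name='onprem-backends',
        Protocol='HTTP',
        Port=80,
        VpcId='vpc-0abc1234',
        TargetType='ip',
    )
    tg_arn = tg['TargetGroups'][0]['TargetGroupArn']

    # The address must fall in RFC 1918 / RFC 6598 space reachable over VPN or
    # Direct Connect; a publicly routable IP here is rejected.
    elbv2.register_targets(
        TargetGroupArn=tg_arn,
        Targets=[{'Id': '10.20.30.40', 'Port': 80}],
    )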


> Of course the downside of low TTLs is more queries

I was diagnosing a networking issue with one of our service providers last Friday. For whatever indeterminate reason, DNS responses from R53 took upwards of 10-15 seconds to return. While I appreciate that the non-configurable 60-second TTL for ELB isn't plucked out of thin air, and that the actual issue seemed to be on the service provider's side, that limit seems far too low for medium/high-latency networks. I wish it were configurable.

What's worse, it looks like it's our site that's the issue, so we get the complaints and I have to dig through Wireshark logs.
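For what it's worth, one quick way to see where the time goes is to time the same lookup against your local resolver and a public one separately, e.g. with dnspython (hostname and resolver IP are made up):

    import time
    import dns.resolver

    def timed_lookup(name, nameserver):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [nameserver]
        start = time.perf_counter()
        answer = resolver.resolve(name, 'A', lifetime=30)
        elapsed = time.perf_counter() - start
        print(f'{nameserver}: {elapsed * 1000:.0f} ms, TTL {answer.rrset.ttl}')

    # Hypothetical ELB hostname; compare the provider's resolver with a public one.
    timed_lookup('my-elb-1234.us-east-1.elb.amazonaws.com', '192.0.2.53')
    timed_lookup('my-elb-1234.us-east-1.elb.amazonaws.com', '8.8.8.8')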


If you have a very high latency network, say a satellite link, make sure that your near-side resolver supports pre-fetching! Unbound is a good choice.
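If it helps, the relevant unbound.conf setting is just this (prefetching defaults to off):

    server:
        prefetch: yes    # refresh popular records before their TTL runs out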


I run unbound on my own workstations. It's so lightweight, you'd never even notice it, but it definitely makes browsing a little more snappy.


>There's a lot of myth out there about upstream DNS resolvers not honoring low TTLs, but we find that it's very reliable

I've done a few unplanned DNS failovers, and I agree with this. What can be real trouble, though, is if you're running a B2B app and your customers' corporate networks can be configured in any strange way. I've met real network admins who think they need high TTLs everywhere to protect themselves from root DNS DDoSes.


There really are locations where DNS resolvers don't honor TTL.

For example, the public wifi in the last Hackspace in Munich I visited did not honour my 10 second TTL.

But in my opinion there aren't enough of them to justify not using short TTLs. It's their problem after all if they don't honour websites' settings: Then they will see downtime when nobody else does.
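If you want to check a network you're on, you can watch whether the cached TTL actually counts down between queries; roughly, with dnspython (hostname is made up):

    import time
    import dns.resolver

    resolver = dns.resolver.Resolver()  # uses the network's configured resolver

    # Query twice, a few seconds apart. On an honest cache the second TTL is
    # smaller, and once it hits zero you should get a fresh answer.
    for _ in range(2):
        answer = resolver.resolve('www.example.com', 'A')
        print(answer.rrset.ttl, [r.address for r in answer])
        time.sleep(5)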


Do you mean it was cached for longer than 10 seconds? Was it Freifunk? It might be worth writing to them to ask what their caching setup is.


I've always thought TTL less than 60 seconds should be avoided, as some upstream DNS resolvers will ignore values less than 60 seconds and use a default long value. You are saying this is not true and a TTL of 10 seconds can safely be used?


I think it's safe, based on a lot of experiments. We use 5 seconds for S3 ...

    ;; ANSWER SECTION:
    s3.us-east-1.amazonaws.com. 5	IN	A	52.216.165.117
One of the biggest, highest-traffic systems on the internet!




