

Ask HN: Intermittent EC2 DNS failures at 1 minute past the hour - jik

For about the past 72 hours, our EC2 instances have been encountering intermittent DNS resolution failures when talking to the default AWS DNS server, 172.16.0.23.

These failures pretty much always occur at around one minute after the hour (i.e., between 12:01 and 12:02, between 1:01 and 1:02, etc.), and the form the failures take is that the AWS nameserver returns SERVFAIL.

I have attempted to isolate the problem to AWS's name servers, as opposed to the DNS servers that AWS is speaking to, by running code outside of AWS that looks up exactly the same DNS records at the same time. I have a script which runs out of cron at one minute after the hour and spends about 60 seconds repeatedly looking up host names that don't exist, reporting if the lookups return SERVFAIL instead of NXDOMAIN. The script returns occasional errors when run in AWS, but returns absolutely no errors when run outside of AWS.

The domain I'm looking up records in is hosted by Dyn through their DynECT service, but I'm not sure that's relevant, since I've confirmed that the errors only occur when the AWS nameserver is in the loop.

Amazon's DNS servers are notoriously unreliable, but we've never seen this particular failure mode before; the usual failure mode is that DNS simply doesn't work at all on a particular instance and we have to terminate and replace it. Certainly, we've never seen a failure mode where all the errors occur on an hourly cycle like this.

What I'm looking for from HN is:

1) Are you seeing similar behavior in your AWS deployments?

2) Would you be able to run a script similar to the one I'm running to find out if you can reproduce the issue?

If I can collect evidence that this is happening to lots of people rather than just us, I have a better chance of convincing Amazon to pay attention to it.

Thanks!
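
The NXDOMAIN/SERVFAIL distinction is visible in the header line of `dig` output, so a minimal check can just parse out the response code; NXDOMAIN means the resolver answered and the name simply doesn't exist, while SERVFAIL means the resolver itself failed. A sketch (the sample header line below is illustrative, not a real capture):

```shell
# Extract the DNS response code ("status") from dig's header line on stdin.
check_status() {
  grep -o 'status: [A-Z]*' | head -1 | awk '{print $2}'
}

# Example with a made-up dig header line:
echo ';; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49767' | check_status
```
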
======
namecast
I'm in for #2. And no to #1, but I'm not sure we'd notice if we did.

Here's a theory that you might be able to chase down with AWS EC2 support
folks:

Many, many EC2 instances are either scheduled to be created on the hour (e.g.
by cloudformation/knife ec2/whatever) or are running cron jobs that run
hourly;

EC2 provisioning tasks and cron jobs usually require connections to outside
servers - package installs, apt-get updates, sending logs to s3, etc. - and
that means looking up hostnames;

Lots of hostnames are being looked up on the hour as a result, and <resource
X> is being exhausted hourly when a flood of lookups goes to the DNS server.

Important caveat: resource X may not be the AWS internal DNS server itself! It
can be the port it's connected to being saturated, or a particular uplink on a
two port portchannel being flaky (and the flakiness is only evident when it's
under high load), or the elastic interface that is attached to the DNS server,
or any one of another dozen things.

Are you seeing this behavior across multiple AZs and regions, or just one?

(This is just a theory, mind you, but I've seen this same behavior when
managing other large DNS clusters, and it sounds like a good fit.)
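
If that theory holds, one client-side mitigation is to stop firing every instance's hourly jobs at exactly :00. A sketch of one way to do that, deriving a stable per-host minute offset for the crontab entry (the job path is hypothetical):

```shell
# Derive a stable 0-59 minute offset from this host's name, so each
# instance's "hourly" cron fires at a different minute and lookups
# don't pile up on the hour.
offset=$(( $(uname -n | cksum | cut -d' ' -f1) % 60 ))
echo "$offset * * * * /usr/local/bin/hourly-job"
```
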

------
spaceapesam
Can also confirm. We've been seeing them for over a week, more severely in the
last 2 days. It is not specific to AZs or accounts. We know from Amazon that
none of the AZs in our accounts overlap.

It doesn't seem to be the recursive resolver assigned by the VPC's DHCP
options set that's having problems: resolution from within the VPC via
8.8.8.8, say, still results in occasional SERVFAILs against zones Route53 is
authoritative for.

EDIT: some tcpdump confirmation

    
    
      17:01:27.988201 IP 10.0.0.2.domain > xxxx.54322: 61062 3/0/0 CNAME xxxx., A 10.x.x.x, A 10.x.x.x (151)
      17:01:28.278093 IP 10.0.0.2.domain > xxxx.53047: 49767 ServFail 0/0/0 (61)
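
To quantify how often this happens, the `ServFail` lines in a text capture like the one above (e.g. from `tcpdump -n port 53`) can simply be counted. A sketch, fed sample lines that mirror the capture (hostnames redacted):

```shell
# Count SERVFAIL responses in tcpdump's text output on stdin.
count_servfails() {
  grep -c 'ServFail'
}

printf '%s\n' \
  '17:01:27.988201 IP 10.0.0.2.domain > x.54322: 61062 3/0/0 CNAME x. (151)' \
  '17:01:28.278093 IP 10.0.0.2.domain > x.53047: 49767 ServFail 0/0/0 (61)' |
  count_servfails
```
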

------
jik
I received the following from AWS support at 9:33am US/Eastern today (8+ hours
ago): "... We have now been able to reproduce the behavior in tests similar to
your scripts to pinpoint where the UDP packets were disappearing, and
yesterday evening the team tested a fix that unfortunately had some unexpected
problems. I am hesitant to provide any time estimate since any software
development has risk, but I'm hopeful it will be fixed today...."

The last DNS blip we saw was less than an hour ago, so I don't think it's
fixed yet, but the day is not yet over...

------
jik
Here's the script I'm running from cron both inside AWS and outside it at one
minute past the hour (with our internal DNS domain replaced with example.com):

    
    
      #!/bin/bash
    
      tf=/tmp/out.$$
    
      for turn in $(seq 1 60); do
          start=$(date)
          name=$(uuidgen).example.com
          if ! host "$name" 2>&1 | tee $tf | grep -q -s -w NXDOMAIN
          then
              end=$(date)
              echo "host $name failed from $start to $end:"
              cat $tf
          fi
          sleep 1
      done
    
      rm -f $tf

------
cce_
Hi, here we are seeing this too, starting around September 27. Some of our
worker processes have been getting DNS exception notifications, always at 1
minute past the hour, and we had been scratching our heads about it. We'll open
a ticket with AWS too. Thanks for helping us find out we're not the only ones
seeing this!

------
jik
It looks like this was fixed on October 7.

------
hltbra
I've been experiencing the same issue since the reboot events, but only RDS
DNS errors:
[https://forums.aws.amazon.com/thread.jspa?messageID=573380](https://forums.aws.amazon.com/thread.jspa?messageID=573380)

------
jbarnard
I heard rumours on reddit that it was fixed; however, I'm still seeing the
issue. Lookups to my RDS instance from an EC2 instance will fail. The last few
failures were at 6:01am and 7:02am.

------
jeffbarr
I have asked the AWS DNS team to take a look at this thread!

~~~
spaceapesam
Any update Jeff? We're having trouble getting information via account managers
and requesting ticket updates. We're still seeing it right now 13:00 UTC.

------
Bobbickel
We isolated it to East 1a zone and have taken our webserver in that zone
offline until the issue is cleared up.

~~~
jik
We're seeing it in multiple AZ's, so I don't think it's isolated to just one.
And I don't know if you were aware of this (I learned it just recently), but
AZ's are labeled differently for different customers, i.e., your 1a may be
different from ours! Gotta love it.

------
mrdavid
We also see the same exact behavior. I opened a ticket with AWS earlier today
and just forwarded them this thread.

~~~
mrdavid
I received a response from AWS a short time ago regarding this issue. They are
able to reproduce the problem and are currently testing a fix. I will send a
follow up once we receive confirmation that the issue has been fixed.

------
davedash
Yup, a client of mine's Nagios keeps fritzing out at one minute past the hour
with DNS issues. Thanks.

------
helper
Yes, we're seeing similar failures at 1 minute past the hour.

------
jnankin
Yep, seeing this too across multiple machines!

------
dkuebric
We're seeing these as well.

~~~
mcorner
Yes, something straight from our logs:

    
    
      6 2014-10-02 14:01:28.394: 141002 14:01:28.394 pid=16680 jid=157622 REDACT ERROR fatal error during work: getaddrinfo: Name or service not known
      7 2014-10-02 14:01:28.394: 141002 14:01:28.394 pid=16680 jid=157622 REDACT WARN Caught SocketError: getaddrinfo: Name or service not known
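
Until the resolver is fixed, a blunt stopgap for transient getaddrinfo/SERVFAIL errors like these is to retry the failing operation a few times before giving up. A generic sketch (`retry` is not an existing tool, just a wrapper):

```shell
# Run a command up to N times, sleeping briefly between attempts,
# and fail only if every attempt fails.
retry() {
  tries=$1; shift
  i=1
  while [ "$i" -le "$tries" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

retry 3 true && echo "succeeded"
retry 2 false || echo "gave up"
```
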

------
jik
Still broken this morning.

