Akamai Edge DNS was down (akamai.com)
464 points by vhab 2 days ago | 217 comments





People don't believe me when I say how much DNS matters. So I wrote a song about it.

https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tun...


> People don't believe me when I say how much DNS matters.

That's weird to me. I have been working in sysadmin/DevOps for over a decade, but it did not take me very long to learn that DNS outages cause massive problems.


Right, but everybody has to learn that at some point. And I happen to be somebody who teaches such things. The importance of DNS is hard to overstate, but I go to great lengths to do exactly that, to make a point ;)

dns, DNS, dns, dns. The start of every process, dns.

Love this.


Thank you! I'm glad that landed where I wanted it to. It was a lot of fun to put together. I keep threatening to make a video. I need a collection of DNS memes so that I can just slideshow them.

Haha please do, a video would be great. Your song reminded me of the song: "Find the Longest Path" which you may get a kick out of:

https://www.youtube.com/watch?v=a3ww0gwEszo


Sounds amazing! Do you maybe have a direct link? Soundcloud doesn't want us privacy-conscious users browsing their website :(


I had a good laugh, thank you very much! :)

Should probably have a script play that on loudspeakers when monitoring detects problems /s


Brilliant. NOERROR for this.

This made my day, thanks!

And this, mine! Thanks!

Awesome! Thank you.

Thank you!! lol

lol, thanks for the laugh.

I just teared up

LOL! great comment thank you!

lol

Thank you!

https://downdetector.com/archive

So many sites down... and unfortunately not one of them is Twitter


Amazing that down detector manages to stay up during these kinds of outages. Noticed it has been a little slow, but they really have done a good job keeping it up even though large portions of the internet are down right now.

Who detects if Down Detector is down? Is there an isdowndetectordown.com site?

I guess the mother of all Network Downtime checker is HN.

It's parked by GoDaddy, but unfortunately their website is fubar by this outage if you try to click through to see how much they want for it :)

Sounds like when Fuckedcompany put itself on Fuckedcompany.

"I dunno. Coast Guard?"

It's interesting that they report an AWS outage but there don't seem to be any issues there. Looks like their methodology is a bit too reliant on those speculative tweets from the first 5 minutes of all these sites going down. https://downdetector.com/status/aws-amazon-web-services/

> So many websites are down, are AWS servers down or something?

> Amazon web services is down which is affecting a lot of company web sites and services. Not sure what is going on.

> Miss us? @aldotcom and a whole bunch of other folks have been knocked off the internet by what appears to be an AWS attack/system failure. We'll be back. ?


It’s just based on user reports, so this is people mischaracterizing it as an AWS outage.

Yep that's my point. I'm guessing that for a lot of sites they can verify if there's an outage pretty easily when they see a spike in reports, but for something like AWS unless they updated their status page (lol) or downdetector ran a bunch of stuff on there just to check with, I guess they don't have a good way to verify it.

Gotcha, yeah I guess I always just considered that out of scope for their service and that it’s just a report aggregator but I suppose you would expect it to be at least a little bit clever based on the “detector” name

cloudfront was down too

You got your wish, looks like Twitter's on the list now too.

Is there a way to tell your system to fall back to the last known IP address if DNS server isn't reachable?

Basically soft-invalidate your local DNS cache, but pull entries back from the cache graveyard if DNS is down.


You could run a local resolver like dnsmasq or Unbound that can “serve stale” on upstream failures, but that assumes the DNS failure is a client-facing resolver one.

From what I observed here, it was more internal DNS related: Newegg was serving an opaque “DNS failure” error page from Akamai’s front-end which is likely because their infra was failing to resolve names internally.



It should be possible to set your cache so it lives forever but still checks for a new IP at normal expiring time.
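That "cache graveyard" idea can be sketched in a few lines. All names here are hypothetical; a local resolver like Unbound implements the real thing via its serve-expired option:

```python
import time

class StaleCache:
    """Cache that keeps expired entries as a fallback ("serve stale")."""

    def __init__(self, resolver):
        self.resolver = resolver        # callable: name -> ip, may raise OSError
        self.cache = {}                 # name -> (ip, expires_at)

    def lookup(self, name, ttl=300):
        entry = self.cache.get(name)
        if entry and entry[1] > time.time():
            return entry[0]             # fresh: normal cache hit
        try:
            ip = self.resolver(name)
        except OSError:
            if entry:                   # upstream is down: serve the stale answer
                return entry[0]
            raise
        self.cache[name] = (ip, time.time() + ttl)
        return ip
```

Expired entries are never evicted, only superseded, so a resolver outage degrades you to slightly stale answers instead of no answers.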

> Unfortunately not one of them is Twitter

Please keep comments like this off HN


Just got booted out of Netflix on the PS4 because the console could no longer connect to Sony's license server. Netflix was working just fine by the way.

Was the app installed/running using a secondary PSN account by any chance? This shouldn't be happening on a primary account/console pair.

It should be my primary although I've often seen it revert back after setting it. I did try setting it as my primary again but you know.

Ah, that's what's going on. Happened to me as well; I just assumed that Sony was neglecting PS4 performance with its new system while bogging it down with bloatware.

Yup, I learned Hulu on Xbox One relies heavily on some Microsoft authentication during a recent Office 365 or Azure outage (not sure which).

You can see this on a lot of sites right now. You get the Akamai style error with something like:

  Reference: #11.453a2f17.1393u44848484.3aee33433
At the bottom of a very bland looking error page.

You could argue Akamai is the blandest of the CDN bunch; their UIs are atrocious.

Their APIs are (or were, last I suffered their use a few years ago) also terrible, e.g. a blanket policy of refusing to cache any resource in the presence of a "Vary" header, regardless of its value, and a failure to honor standard HTTP headers... thankfully there are many other options for CDN, which are SO MUCH BETTER.

Akamai is their own worst enemy most of the time. Their prices are the highest, they trail on features, their documentation is opaque, it takes an hour to propagate changes, etc. Only a few years ago you could only use SSL if you purchased their ridiculously expensive PCI-DSS plan - I thought they would defend that to their grave.

Better alternatives are Cloudflare, Fastly, AWS CloudFront.

Google Cloud CDN always seems to have very good latency but a very bare bones feature set and no edge compute I can identify. Support is always a huge red mark for Google anything.


Surely it depends what you vary on?

Vary: Accept-Encoding should be well supported, User-Agent less so, and for very good reasons (there's too much variation in UA strings)


https://learn.akamai.com/en-us/webhelp/adaptive-media-delive...

> AMD automatically strips these headers out of requests to support caching for faster delivery.

> I need the Vary HTTP headers: AMD can cache the associated object if the Vary HTTP header contains only "Accept-Encoding" and "Gzip" is present in the Content-Encoding header

(AMD in this case standing for Akamai Media Delivery)


It wasn't that simple — IIRC, for a while Vary meant “don't cache anything, ever, under any circumstances” unless you made some custom configuration changes. Over time they _added_ support for just “Vary: Accept-Encoding” (IIRC less than a decade ago), and that was fragile. They improved it over time, but it was painful for a number of years because there were various failure modes which meant things wouldn't be cached, or (IIRC) compression would be disabled for certain URLs sporadically if the first request for the object did not request transfer compression.

yeah, but only tech nerds see it, so it's okay. maybe it's a ploy to get the users to go to the real command set via CLI. make it so shitty nobody wants the UI, and goes back to the terminal. "if you're not a CLI ninja, then you shouldn't be using our product anyways!"

What's frustrating is that DNS is returning an address, instead of just failing, and so macos is caching that value (though it might be cloudflare doing that).

To empty the macOS DNS cache:

  sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder


Wildcard DNS should be a prosecutable crime, punishable by no less than 20 years of hard labor. (Edit: Probably should have made it clear that this was a joke)

Wildcard DNS helps me to handle multitenancy easily. What's wrong with it?

I don't see how wildcard DNS is related to this? Nor how it's bad?

Presumably you're referring to the practice of answering queries for nonexistent records with an A record belonging to an advertisement page? (instead of doing the right thing answering NXDOMAIN, presuming no records of another type also exist for the queried name.)

dnsmasq has a really useful feature for dealing with this: --bogus-nxdomain


When did congress members start posting to HN?

Akamai believes they have it fixed. We've seen our traffic return to normal. https://twitter.com/Akamai/status/1418251400660889603

hmmm does not appear fixed here in the Midwest

I wonder if this is why LastPass is down. It has completely locked me out of my vault. You'd think it'd continue to work offline in a case like this. :/

I switched to BitWarden and haven't looked back. You can use it on the phone and pc (browser). As well as a desktop client.

And with vaultwarden you can go self hosted with a very lightweight server written in rust.

Switched to vaultwarden at work for password management, only have minor gripes so can recommend.

Yeah, my path was LastPass -> Bitwarden -> 1Password.

Both Bitwarden and 1Password are great.


Then what was the impetus to switch off of Bitwarden?

Same path. It'll be very hard to move away from 1Password. App experience, sync, security features like key in addition to master password, family organizer-based recovery of an account, these are a few things that stand out.

Yeah, I use 1Password for every critical bit of information (SSN numbers, physical access codes) and a whole lot of less-critical stuff. I expect to be a customer for life.

Can you explain what family organizer-based recovery means? It sounds like dad or mom could recover a kids password?

That's about right for what it is, or at least how I think about it. There's no magic "unlock vault" button (by design), but an Organizer can kick off a workflow to reset a vault if need be. I have a few of the more tech-savvy family members set as organizers in my family in case something ever happens to me.


My favorite feature personally is the built-in 2FA support. Click and it logs into your account and copies the 2fa code to clipboard so just paste on next screen.

Multiple vaults too is nice but I know others have ways to limit exposure of passwords in similar manners.


Bitwarden offers this as well, but I don't really understand why you would want it. If someone compromises your password manager, 2FA is now worthless. Or am I misunderstanding how it works?

Your understanding is correct. 1Password requires a key in addition to the master password. And finally, 1Password can have 2FA for itself, which is stored on my Authy. These are reasons why I am comfortable storing my 2FA codes on it.

Bitwarden has 2FA support too, but does not have the unique key feature that 1Password has.


I prefer the browser addon for Bitwarden over 1Password. Try editing a site in 1Password: it forces you to log into the full site, whereas Bitwarden can do almost everything right there in the addon.

This is also possible with the 1Password X extension, however there's a lot of feature segmentation and unclear messaging between the Desktop app-based version and 1Password X so I don't blame you for using the old one.

It is? The last I remember, editing a field on 1Password X opens 1Password website instead, where the changes can be made.

When it comes to password managers, 1password is the one to beat. Much better experience in every regard.

Serious question, has anyone properly solved the issue of DNS as a single point of failure?

Depending on what point you draw the line of "single point of failure" you could use multiple providers for your dns.

GOV.UK for example uses both aws and gcp for DNS


So, NS entries pointing to both? But then take the example where your domain is in Route53 and AWS goes down. You can't configure the NS entries to avoid AWS DNS servers. Is the idea that child DNS servers detect the outage and cache the values in the name server(s) that remain up?

But then, the cached values from AWS take a while to clear, TTL never seems to be applied properly. It always feels like the worst case in such a scenario is you can point everyone at the right thing within 24 hours.


Configuring two NS entries is pretty standard, so surely most resolvers try one of the two, and if it's down try the other one? What else would be the point of having multiple nameservers? Then you just have to get two nameserver providers and make sure their settings stay synced, and point your domain to one nameserver from each.

Of course that requires the server to properly fail, i.e. stop responding to requests. That doesn't seem to be the case here


You set both services in your ns records. So every day they share the load for dns resolution. If one day one of them is down the client can/will use a different nameserver from your configuration.

Have them all hot and live rather than any sort of failover system. Keep everything in sync with OctoDNS or similar

https://github.com/octodns/octodns

DNS is fastest first* rather than main/failover. If AWS DNS was down your GCP DNS would have replied (if all is well) sooner than {timeout} so your visitor would still have a response

* Sort of. I think if the client doesn't get a reply from the server it picked randomly in 1s they move on to the next server, repeat until all fail
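That pick-one-then-move-on behavior can be sketched roughly like this (the `query` callable is a hypothetical stand-in for a real UDP DNS query):

```python
import random

def resolve_with_failover(name, servers, query, timeout=1.0):
    """Try nameservers in random order; move on when one times out or errors.

    `query` is a hypothetical callable (name, server, timeout) -> answer.
    """
    last_err = None
    for server in random.sample(servers, len(servers)):
        try:
            return query(name, server, timeout)
        except OSError as err:          # TimeoutError is a subclass of OSError
            last_err = err              # unreachable server: try the next one
    raise last_err if last_err else OSError("no nameservers configured")
```

As the comment above notes, this only helps when a broken server actually stops responding; a server that keeps answering, just with errors, defeats naive failover.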


I think if Route53 was down, your DNS provider wouldn't be able to go there. So it will go to the root, which will give the GCP one too. So your DNS provider might try that.

(I don't know if this is how it works, but I think that's how it's supposed to work)


You typically have four name servers for a domain, but they don’t all have to be hosted with the same company. Very handy when your DNS provider decides to brag they are unhackable and the hackers reply by immediately hacking them followed by DDoSing them to death.

gov.uk's traffic seems to be handled by Fastly, a well known CDN.

What I'm a bit surprised / unsure of is what happens when I run "dig ns gov.uk". The results are:

  gov.uk.     21559 IN  NS  ns1.surfnet.nl.
  gov.uk.     21559 IN  NS  auth50.ns.de.uu.net.
  gov.uk.     21559 IN  NS  ns3.ja.net.
  gov.uk.     21559 IN  NS  ns2.ja.net.
  gov.uk.     21559 IN  NS  ns0.ja.net.
  gov.uk.     21559 IN  NS  auth00.ns.de.uu.net.
  gov.uk.     21559 IN  NS  ns4.ja.net.
Who are ja.net, uu.net and surfnet.nl?

EDIT: I see that ja.net i.e. jisc.ac.uk "manages the second level domain .gov.uk" -- https://www.jisc.ac.uk/domain-registry . I imagine that uu.net and surfnet.nl are there for redundancy


  whois ja.net
    Domain Name: JA.NET
    Registry Domain ID: 499794_DOMAIN_NET-VRSN
    Registrar WHOIS Server: whois.demys.com
    Registrar URL: http://www.demys.com
"Demys is a leading provider of corporate domain name management and an ICANN accredited registrar"

  whois uu.net
    Domain Name: UU.NET
    Registry Domain ID: 5486163_DOMAIN_NET-VRSN
    Registrar WHOIS Server: whois.markmonitor.com

surfnet is just an ISP in the Netherlands

https://www.surf.nl/


Thanks

Is it possible to see if/where gov.uk is using GCP or AWS for its domain zones? From what I can see, that's not the case. Or am I looking in the wrong place?


I think you did the right query, maybe they're using it for different domain names?

Last time I tried setting NS to both Cloudflare and DigitalOcean at my domain registrar, Cloudflare sent me an email saying the configuration is invalid and asked me to revert. Am I doing something wrong?

No, you have done everything right, at least from the point of view of DNS. That you cannot use multiple nameservers is a limitation of Cloudflare (a limitation in the sense that Cloudflare can only offer their services in the Free and Pro plans if they have full control over all nameservers).

Thank you. I will look into alternative services on the thread then.

It is relatively easy to make DNS highly redundant: just put multiple DNS servers in data centers which are as independent as possible (different geo locations, different ISPs). You can also use different DNS software and different OSes (say BSD+Linux) to exclude correlated bugs. Root DNS servers AFAIK use different software for this reason.

Problems start when you want to make frequent changes easily and introduce complex software to manage DNS zones (and complexity usually comes with bugs).


And then there are Cloudflare and other Centralized Downtime Networks as another point of failure.

Loled at this.

The problem isn't DNS though, is it? The problem is that people don't necessarily use the redundancies on DNS?

The whole reason it takes a domain 24h to fully work with DNS is because it propagates the information to other DNS servers, thus making it not a centralized service.


DNS doesn't 'propagate' except in the very limited case of zone-transfer publication, which... nobody really relies on these days. Registrars tell you it takes 24 hours to propagate to stop you from complaining to them about your ISP's DNS caching policy. The reality is: recursing DNS servers have caches, they respect TTLs, and for the most part that means that DNS changes should fully wash through within an hour for most changes (less if you keep your TTLs shorter).

That differs per TLD though. In .nl updates are usually fully processed within the hour (they update the zone file twice per hour)

More accurately there are distributed caches, which expire on a simple timer basis, as opposed to updates being pushed immediately.

Relatively short TTLs are ubiquitous these days though.


It's an interesting question, as it's always been solved on the server side. All of the current problem is client side. That is, client resolvers that aren't using diverse providers, and only do things like round-robin with long timeouts.

Anycast for the DNS IPs deals with most of the problems of clients not failing over elegantly when their primary DNS server is broken.

From a client (DNS recursor) point of view there is no primary server. There are just multiple NS records, which are equal. If one of them is down it can introduce resolving delays, but they are usually small, at least if something like Unbound or BIND is used. Unbound, e.g., maintains an infra-cache where it tracks RTT and errors for each server and avoids servers which are down.


https://handshake.org is the only project I've seen that actually solves the issue with a decentralized root zone file.

https://namebase.io is a "registrar" for it.


Why does this need to have the whole NFT / crypto / auction angle?

https://learn.namebase.io/starting-from-zero/how-to-get-a-na...

This is so convoluted it actually makes the whole thing a non-starter


Decentralized control of a centralized finite resource (domain names) requires consensus. For example, Joe Smith and Joe Blow both want joe.com.

You want a protocol that gives consistent "global" state without any centralized / trusted users - blockchain/bitcoin is one of the only technical solutions to provide that.

I agree that it's a garbage solution in practice, but that's why it's got cryptoshit bundled in.

A potential different solution to DNS monopoly, if that is a problem that needs solving, is multiple name-resolution providers that have differing records on what name points where. (The tradeoff is that an owner may need to register their name with multiple different providers).


Agreed. Blockchain is a convoluted solution, but it’s a solution for distributed consensus, if one feels that’s required. But in general I would argue the current root system has served us well and is open and free.

The world you describe, effectively with multiple roots, is coming. Russia have a switch (they’ve even tested it), to anycast out the root DNS IPs within the country, and block them externally. In theory this doesn’t make another “internet” (if IP space is still globally routable,) but in practice it does. Don’t be surprised if other countries follow suit (should they fail to leverage control of current infra via ITU or something.)


You can still hardcode IP addresses. Not sure most people realize DNS isn't actually needed, you know, except for convenience and all that.

The "Host:" header in http[s] pretty much killed that. Half the internet would be a Cloudflare error page if we moved back to ip addresses :)

Add the name/IP to your local hosts file. It all works great then. Until the server changes IPs, anyways.

I did this with a website I liked which had let the domain expire. It worked for quite some time, until the VPS/whatever expired too. Good thing the Internet Archive is a thing.


Meh. Without DNS, or something similar, there really is no internet.

Obviously you are technically correct.


The internet gets along quite fine without DNS. Packets route from network to network. DNS is an application-layer protocol. People often confuse the web with the internet. We use phone numbers for phone calls. It's conceivable with IPv6 you could nail up your IP address and use a QR code to make the addresses accessible. In a hundred years will DNS still be necessary? I don't think so.

It’s one of the most successful, global, distributed databases of all time.

What’s the single point of failure?


Absolutely amazing how many billion $+ companies are single homed for DNS.

I wonder how much they spend on multi-AZ redundant architectures...


So here's a weird question: Supposing companies multi-home for DNS, or whatever other essential service, via multiple service providers.

Whatever multi-home means, why can't there just be one service provider that does that? And are we sure that these service providers aren't already doing that as best we might hope for? (For instance, Amazon already has multiple zones, etc.)

I suppose the one thing this can't protect against is some sort of political (broadly defined) threat related to the company itself.


> Whatever multi-home means, why can't there just be one service provider that does that?

Many of these outages are due to pushing broken artifacts or configuration to production.

A single provider can pretty easily offer geographic or network topological redundancy, but administrative and/or technological independence is pretty hard to achieve in a single company.


I mean, I guess what I'm saying is that in theory a single provider could purposely keep two different departments that manage their own artifacts independently.

Records have to be kept in sync.

If one dept deletes a record and the other doesn’t, how do you decide who’s right?

You could add a third dept that gives them both orders, but now that third dept is a single point of failure.


If I were a customer of two different companies for the sake of redundancy, wouldn't I have that same challenge? I could be my own point of failure.

Though, I suppose if I'm responsible for it, I fix it faster for myself.


In this particular case, the Akamai clients did not push broken artifacts, so sounds like at least this particular instance would be avoided.

I believe EasyDNS can automatically push DNS settings to Route53 to host DNS in AWS. Doesn't protect you from fat-fingering a change, but you should be resilient to either EasyDNS or Route53 going down.

https://kb.easydns.com/knowledge/easyroute53/


Using multiple providers for mostly static DNS is easy: pick one as primary and AXFR to the other, with notifications and whatever. Or it's not too hard to keep a zone file in source control and sync it to the providers.

Using multiple providers for fancy DNS, like only providing IPs that pass healthchecks or geotargetting users to datacenters gets pretty hard, because the different providers have similar capabilities, but no uniform interface, so you've either got to do it manually, or you have to build out your own abstraction that is probably limiting.

If possible, insourcing DNS makes the most sense to me, because if you can't keep your service online, it's not the worst if your DNS is offline; and if you can keep your service online, you probably won't mess up your DNS too badly.
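The "zone file in source control, sync to providers" approach boils down to a diff step. A toy sketch, with a hypothetical dict-based zone format (a real tool like OctoDNS does this per record type, per provider):

```python
def zone_diff(desired, actual):
    """Compute create/update/delete actions to make `actual` match `desired`.

    Both are dicts mapping record name -> value, a toy stand-in for a
    real zone file and a provider's current records.
    """
    actions = []
    for name, value in desired.items():
        if name not in actual:
            actions.append(("create", name, value))
        elif actual[name] != value:
            actions.append(("update", name, value))
    for name in actual:
        if name not in desired:                 # present upstream, gone from source control
            actions.append(("delete", name, actual[name]))
    return actions
```

Run the same diff against each provider and the providers stay in lockstep with the checked-in zone, which is what makes the multi-provider setup manageable.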


So much this. Keeping feature by feature parity is the tricky part.

Might be survivorship bias. Multi-AZ arch protects against all other failures, so the only one that remains visible to the outside world is DNS.

Problem is, if you're on Akamai's CDN, only Akamai knows where the local caches are. You need to be on their DNS only.

Most CDNs offer huge incentives for sending them more traffic; a lot of the time you end up in a contract obligated to handle X requests and Y gigabytes of traffic per month. But personally I believe you should never have a single provider for anything - particularly when it’s acceptable for a company to cut you off with no warning or recourse.

Lastpass is down, so if you use lastpass the effect is significantly compounded.

Do they not cache everything locally? I'd have thought a password manager/secure data store would work offline.

They do.

It still works in offline mode. You can’t update passwords, but you can retrieve them.

To enable offline mode, I had to turn on airplane mode on my phone before logging in.

So many sites being reported as down, but change your DNS to something else (e.g. Google 8.8.8.8 and 8.8.4.4) and, after flushing your DNS cache, the sites are available. I was unable to get to ups.com or newegg.com (why yes, I am expecting a new toy), but after switching DNS and flushing DNS cache, I was able to get to both.

Specifically, 1.1.1.1 provided bad addresses (as opposed to no addresses), and removing 1.1.1.1 fixed my problem. By then it had returned a bunch of bad addresses and I had to flush my DNS cache.


Could you give an example of what you mean by a "bad address" in this context?

This is from the time of incident:

  Server:   1.1.1.1
  Address:  1.1.1.1#53

  Non-authoritative answer:
  Name:     newegg.com
  Address:  23.35.185.6

vs

  Server:   8.8.8.8
  Address:  8.8.8.8#53

  Non-authoritative answer:
  Name:     newegg.com
  Address:  104.80.92.252

104.80.92.252 is newegg.com

23.35.185.6 is a server that provides an error message.

So 1.1.1.1 lied. The proper response would be to reply "I don't recognize that domain". Instead it said, "yeah, I know that, its here..."

Newegg was not down, and when I got macos to forget what it had cached from 1.1.1.1 I was able to use newegg.com fine.


Yep, all our Edge DNS zones as well as DSD edgekeys are just returning SERVFAILs. Many big German websites are down right now.

Several unrelated websites I was trying to visit are down. I figured I would find the answer on HN : )

Same haha

I am surprised financial institutions don't have any regulation for redundancy. The one that stuck out to me is the Navy Federal Credit Union website being down. I have not had any issues logging into mobile though for some of the reported sites.

this is prime shit Hacker News says right here. Wait until you learn banks close on Sunday. Or have maintenance windows for their website, ATM, etc.

Commercial banks are held to a different operational resiliency standard than financial infrastructure.

(a component of my consulting work is reporting to financial regulators for institutions)


> financial institutions don't have any regulation for redundancy

As CTO of a bank, I wasn’t aware of this. So either we wasted a ton of money and time constantly upgrading redundancy and business continuity technologies to satisfy our regulators… or this statement could be mistaken.


I'm not sure how easy it would be to regulate. But yeah. I've got a few short term trades in my brokerage account, and outages really throw a wrench into those.

The way to regulate it is like anything else: if they fail to meet QoS uptimes, they get fined 6-8 figures for every minute of loss.

CapitalOne has a broken login which is pretty surprising to me.

All major Canadian banks were down.

Why would Google and Amazon be on the downdetector list or experiencing issues? Don't they have their own DNS / nameservers separate from Akamai?

Because the way downdetector works is it basically counts how many people are searching for/visiting "<site> down", and if it's much higher than typical it flags the site as down.

So if everyone searched "is google down" and visited the link on downdetector that was returned in the search, that would add to the downdetector count for that site.

Downdetector doesn't actually know if the site is up or down.


I found this hard to believe, but it's correct.

> Downdetector only reports an issue if a significant number of users are impacted. To that end, Downdetector calculates a baseline volume of typical problem reports for each service monitored, based on the average number of reports for that given time of day over the last year. Downdetector’s incident detection system compares the current number of problem reports to this baseline and only reports an issue if the current volume significantly exceeds the typical volume of reports.

https://www.speedtest.net/insights/blog/how-downdetector-wor...
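That baseline comparison can be sketched as follows. The multiplier and floor here are made-up illustration values, not Downdetector's actual parameters:

```python
def is_outage(current_reports, historical, threshold=3.0):
    """Flag an outage when the current report volume far exceeds the baseline.

    `historical` is a list of report counts for the same time of day over
    past days; `threshold` is a hypothetical multiplier. A small absolute
    floor avoids flagging tiny services on a handful of reports.
    """
    if not historical:
        return False
    baseline = sum(historical) / len(historical)
    return current_reports > max(baseline * threshold, 10)
```

Which also explains the false positives: a surge of people merely *asking* whether AWS is down looks identical to AWS actually being down.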


What’s hard to believe? Downdetectors well known for being almost, but not quite, useless.

Probably reported Google as “down” because a whole bunch of people use the word “Google” when they mean “internet”.


A more proper name might be PeopleThinkItsDownDetector.com

Not nearly as SEO friendly

So how do they reset status? The number of queries going down signifies return to normal status?

Some CEO calls another CEO and makes a deal?

Yep

Was just browsing a website where the first page of a query worked, but visiting page 2 of the results was returning a DNS error. Was curious how and why only part of the site was down, but it looks like this was the problem as now the whole site is down.

aren't short DNS TTLs great?

Is this a serious argument for long TTLs? Always wondered why they exist… How interesting.

Yes it is. The longer the TTL the longer you stay independent from third parties. It's what makes the internet stable.

Long TTL makes you independent from DNS third parties, in that your name is still known by clients if DNS is down.

Short TTL makes you independent from hosting third parties, in that you can quickly change which hosting provider your domain name points to.

You can't win this one by only changing your TTL. The best solution is to use short TTLs and multiple nameservers on different providers.


The good parts of centralisation

Possibly related .. Verizon peering issues / ASN701 at Equinix NY2 in Secaucus NJ

What role does Akamai Edge DNS play in normal internet traffic? DNS responses usually get cached, as far as I understand correctly. And it is usually possible to change your DNS server to e.g. Google's and circumvent the outage. Does Akamai Edge DNS play a role on the server side?

If you use a CDN to front your traffic, you need the CNAME for www (or whatever) to be pointing at their DNS infrastructure, so they can return whichever closest POP is going to serve your traffic.

e.g. dig @1.1.1.1 www.nvidia.com +trace

... various things from the root ...

  www.nvidia.com.  7200  IN  CNAME  www.nvidia.com.edgekey.net.
  ;; Received 83 bytes from 208.94.148.13#53(ns5.dnsmadeeasy.com) in 35 ms

So the main DNS is fine, but it'll never get an A record because the last link in the chain is toast -- edgekey being Akamai in this case, but all CDNs do this so they can route traffic. Normally, this is a good thing, since they can shift traffic within 30 seconds on their side. Unfortunately, it also means it would take nvidia two hours to point away from Akamai.


Looks like this: the affected subdomains are CNAMEd to the Akamai CDN, and the nameservers for those are/were down.

So for example:

The apex domain for nvidia resolved fine..

dig @1.1.1.1 nvidia.com => status: NOERROR, Nameservers are ns6.dnsmadeeasy.com

But the website didn't. dig @1.1.1.1 www.nvidia.com => status: SERVFAIL

The lookup for www.nvidia.com resolved to an Akamai CNAME, and Akamai's nameservers were the ones having a problem..

dig @1.1.1.1 www.nvidia.com NS => CNAME e33907.a.akamaiedge.net.


The trend these days is DNS TTLs of 60 - 300 seconds, to allow "Cloud agility" or something, so sites are exposed to a much larger risk when authoritative nameservers go down.

You say that like it's a bad idea.

Services like Akamai use short TTLs for their edge services for a variety of reasons, not least because if one of their edge servers goes offline (for planned or unplanned reasons) it lets them sub in a new one and have it receive traffic immediately, rather than have a bunch of clients continue trying to talk to a dead node. So sure, you can increase those TTLs to trade 'what if the DNS server goes down?' risk with 'what if the edge server goes down?' risk...

But keeping the edge servers up and running is probably a lot harder - they need to scale more to handle traffic load, they have to actually handle client data, TLS termination, much more complex configuration.... so if I'm placing bets on which of those things is more likely to die on me, it's the edge node, not the DNS server.
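A rough back-of-the-envelope for that bet (all numbers here are made up for illustration): with DNS-based failover, the TTL bounds how long clients keep hammering a dead edge node after the CDN swaps it out.

```python
# Sketch: the window during which clients still hit a dead edge node
# is roughly the time to detect the failure plus the cached TTL.

def max_stale_traffic_window(ttl_seconds: int, detection_seconds: int) -> int:
    # Clients who cached the dead edge's IP just before it died keep
    # using it until detection completes and their cached record expires.
    return detection_seconds + ttl_seconds

print(max_stale_traffic_window(ttl_seconds=30, detection_seconds=15))    # short TTL: ~45s of errors
print(max_stale_traffic_window(ttl_seconds=7200, detection_seconds=15))  # long TTL: ~2h of errors
```

That asymmetry is why CDNs accept the DNS-outage risk in exchange for fast edge failover.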


> What role does Akamai Edge DNS play in normal internet traffic?

Clearly a big one.


Posted this in the thread about the travel websites being down, but it seems Fidelity is entirely impossible to sign in to / trade right now.

Is this related to:

Multiple websites including DraftKings, Airbnb, FedEx, Delta and others appear to be experiencing issues.

https://www.bloomberg.com/news/articles/2021-07-22/multiple-...


Figured this out almost 30 minutes before they bothered to update their status page.

This is affecting Steam as well

It is impacting a lot of things: https://downdetector.com/

Well it's been an hour now since I first noticed the effects and their service status still has no useful information or ETA for a fix. It's just an "emerging issue".

The affected sites that I use are now working. Check again.

Strange thing about the duration of this outage... From logs I have, it seems to have lasted exactly one hour, from 15:38 to 16:38. Their Twitter account also said "disruption lasted up to an hour", though they incorrectly said it started at 15:46 (did it take 8 minutes for their monitoring to notice?).

That makes me think that whatever the fix was, it had to wait for some one-hour cache to expire before it took effect. I'm very interested to find out what the cache issue was, more so than what the original bug was.


I love seeing these issues reverberate around the internet.

This time i think /r/sysadmin pegged the issue first, great sub.


I'm in the middle of a migration from Akamai to Cloudfront, time to take a break I guess

All your data are belong to us

Not only that, their support telephone line (in Sweden) was down as well

App Store on MacOS is down!

And people wonder why I try to avoid depending on online anything...

Cyberpolygon already? Thought we had at least a month or two

Shh, normies are not ready for that.

It’s just a completely random DNS outage, nothing more.


Many bank systems are disrupted by this in the Netherlands

My UK bank (HBOS) seemed to have 'online banking unavailable' though their site was up. No doubt related.

Many banks in the Netherlands are affected by this.

Any idea on cause? Ddos or hardware failure?

Widespread issues like this on major CDNs tend to be configuration errors.

Cloudflare seems to be struggling too. Not sure if they have some dependency on Akamai or if this portends something much worse

So that's why the NHS website is down

Back up now by the looks of it

This is affecting apple as well

https://www.apple.com/go/


For some reason that url doesn't work for me, but https://www.apple.com/ and https://www.apple.com/nl/ do.

That fails with a 404 for me, which is probably not related to DNS at all?

archive.org seems to indicate there was never anything there...


Oops, someone unplugged the DNS machine

Looks like it is fixed now!

This is apparently why I can't book my COVID vaccine appointment...

Yes, was trying to do the same. Getting this 2nd jab has been a nightmare. Places listed as walk-in having Moderna don't, and they ran out of it when I went to get my scheduled jab. Ringing 119 just ends up in a dead line, then this outage. Fun.

I ran DNS servers, among other things, in the late 90s with better uptime than these "multi-DC/AZ/geo redundant" services everyone uses these days.

With all due respect, having also run auth DNS servers in the 90s, and seen the inside of Akamai’s CDN/DNS setup more recently, it isn’t remotely at the same level of scale or sophistication.

"Scale and sophistication" scale relatively with time. Those servers we ran were relatively at the same level of scale and sophistication for their time. The only differentiator here is uptime, which has gotten worse as time has gone on. Five 9s used to be the standard. Three 9s seems to be the new standard.

I thought DNS was supposed to be resilient

DNS is designed to be fault tolerant. Such a design, however, is often not leveraged correctly; the implementation of DNS can be and frequently is subject to SPOFs.
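The fault tolerance vs. SPOF point can be shown with a toy query loop (the nameserver names below are hypothetical): resolvers will try the next NS record when one fails, but if every NS record points at the same provider, that provider is still a single point of failure.

```python
# Toy sketch: retrying across NS records only helps if the
# nameservers don't all share the same failure domain.

def query(nameservers, down_providers):
    for ns, provider in nameservers:
        if provider not in down_providers:
            return f"answer via {ns}"
    return "SERVFAIL"  # every listed nameserver was unreachable

single = [("a1.akam.net", "akamai"), ("a2.akam.net", "akamai")]
mixed  = [("a1.akam.net", "akamai"), ("ns1.example-dns.net", "other")]

print(query(single, down_providers={"akamai"}))  # SERVFAIL: shared SPOF
print(query(mixed,  down_providers={"akamai"}))  # survives the Akamai outage
```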

Probably Akamai needs to use Kubernetes.

EDIT: So HN can't even take a joke after this? [0]

[0] https://news.ycombinator.com/item?id=27893482


Probably caused by Kubernetes

That's even worse if true, despite HNers creating a storm in a teacup about DDoSing the blog of a service not using K8s, when having a blog is not their main service. [0]

Either way, the joke is now on the HNers in that thread.

[0] https://news.ycombinator.com/item?id=27893482


come on, this is funny. HN needs to lighten-up.

Sheesh, So yesterday! :)

I'm sick and tired of these types of services (I'm looking at you too Cloudflare) going down and taking otherwise healthy websites down with them.

Most websites using Akamai aren't gonna be "otherwise healthy" without the CDN handling most of the load.

It was fastly last time.

True but cloudflare have been guilty of downtime too.

There aren't many sites that aren't, including "otherwise healthy websites" hosted without a CDN.

I think this is a factually true statement if your business uses any computers. ;)

Cloudflare hasn't had an outage in a long time. And when they do, they are upfront about it and post a detailed post-mortem.

https://www.interactivebrokers.co.uk/ , a trading platform, is also down :(

How am I going to sell my AMC stock...


You don't, you hold the dumb, over priced stock as a reminder for future, better informed investing.


