
DNS Outage at DigitalOcean - finne
https://status.digitalocean.com/
======
tonylemesmer
People hating on DO: "I'm losing thousands every hour". Well, then you should have
had some failover in place if it's that valuable.

[1][https://twitter.com/rodrigoespinosa/status/71303563702097100...](https://twitter.com/rodrigoespinosa/status/713035637020971009)

~~~
crisopolis
I've been reading all the comments on Twitter too... like "err mai gawd I'm
switching to AWS because of this" when the failure was yours for not having a
secondary DNS provider, and I highly doubt you'd switch.

Then another... "Today's @digitalocean DNS #outage is a reminder to not trust
your entire business to one provider. Spread the love around!"

If your company is e-commerce and makes money by being 99.99% available, it's
your own fault for having no fail-over.

another... ".@digitalocean that's two hours without DNS now...my company's
websites could be losing thousands of £ in e-commerce! Please, an update!"

~~~
codegeek
Times like this make you realize the difference between good clients and bad
clients. Yes, they have a right to be upset, but claims like "could be losing
thousands of dollars" are mostly exaggerated due to their frustration.

~~~
crisopolis
heh yeah, I just laugh at all the tweets saying they're losing billions of
dollars every minute their site/app is unavailable. All I can think is... if
you're the next Amazon.com I'm pretty sure you'd have some type of disaster
plan in place should something like this happen.

------
chronid
DNS is hard. Very hard.

It may seem trivial when it works (hint: it's not), but some of the biggest
fuck ups I've seen in my professional life were caused by strange DNS things
happening or DNS servers going kaboom.

I feel the pain of the DO engineers trying to mitigate this issue. I really
do.

~~~
johansch
BS. DNS is a trivial thing to scale, compared to most other web-scale efforts.

Things break when people don't use 20 year old best practices. There is no
defense against inexperience and ignorance.

~~~
takeda
I took the OP's comment as "it's hard to understand DNS and biggest fuck ups
happen because people think they understand DNS when they actually don't".

The problem with DNS is that it can work even when it is configured
incorrectly. This makes people who have no idea what they are doing think that they
actually understand it. The strange issues with DNS only happen with strange
configurations. When you follow best practices everything is predictable.

~~~
johansch
All right. This I can agree with.

------
tyingq
One thing hosting providers could do better would be to split the risk a
little by not handing the same dns server name to every client that chooses to
have the hosting provider supply dns services.

The reason this might have some upside is that DDoS attacks against a specific
DNS server are often intended to target _one_ specific customer of a hosting
provider. The attacker doesn't care about the side effects...just the original
target.

Say, for example "controversialblog.com" is hosted on DO, and uses DO dns
servers. The person attacking "controversialblog.com" looks up the NS records
for the domain, and attacks that DNS server. The fact that it's one hostname
that serves all of DO is of little interest to the attacker.

So, if DO would come up with say, 10 separate hostnames they could hand out,
then this sort of thing would take down 10% of their customers instead of
100%.
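
A minimal sketch of that idea, assuming a made-up pool of ten nameserver hostnames (the `nsN.example-host.net` names are hypothetical, not DO's) and a stable hash of the customer's domain to pick one:

```python
import hashlib

# Hypothetical pool of nameserver hostnames, purely for illustration.
NS_POOL = [f"ns{i}.example-host.net" for i in range(10)]


def assign_nameserver(domain: str, pool=NS_POOL) -> str:
    """Deterministically map a customer domain to one hostname in the pool."""
    # Hash the lowercased domain so the same customer always gets the same
    # hostname, regardless of how the domain was typed.
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


if __name__ == "__main__":
    print(assign_nameserver("controversialblog.com"))
```

This only limits the blast radius if each hostname actually resolves to separate infrastructure. Ten names pointing at the same servers would go down together.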

------
traviswingo
Yeah this is pretty unfortunate. We have some big investor meetings today and
this unfortunately took our marketing site offline. Hopefully they resolve
this soon - it's the first time we've ever experienced an issue with their
service.

We really need fail-overs in place...small team problems.

~~~
defenestration
We feel the pain as well, as our platform is unreachable. I'm now using
another DNS server and changed the nameserver in the domain record. However, the
DNS propagation is taking some time. What are you doing at the moment as
fail-over?

~~~
traviswingo
We just switched it over to Route53 and set up some fail-overs there. Took us
5 mins and we're back online.

Looks like DO is still offline so it seems to have been a good call...

~~~
scurvy
You're back online from your perspective. What about all the name servers that
have your SOA cached still looking at DO? You're still down for them.

~~~
traviswingo
Tough shit, lol. We now have reduced TTL times for future occurrences, but
there's nothing we can do for those users who are still experiencing an
outage.

~~~
scurvy
Minimum TTL for SOA records on .com is a day. Lowering your A record TTL isn't
buying you much in case of DNS hosting failure. Helpful if your site (not dns)
needs to move providers though.

------
sashk
My rule: each provider should do a single thing:

\- Hosting provider - host sites

\- vps/cloud provider - provide VMs

\- domain registrar - domain related stuff, but not DNS

\- dns provider - host dns

\- second dns provider - host dns in case first dns provider fails

So many DNS outages recently and all my projects are up.

~~~
copperx
Does Amazon's Route 53 count as a DNS provider, or do you treat it as a hosting
provider?

~~~
sashk
For me - neither.

But if I were tied into Amazon's cloud infrastructure, I would have to use
many of their features, going against my rules above.

~~~
ludbb
How do you apply your rules considering what's available today? Which services
are you using? It sounds like it would be a big headache to orchestrate the
automation among all these different providers.

------
nlivingstone
Have multiple VMs @ Digital Ocean (TOR1), we use Cloudflare for DNS... All
sites have remained available and are successfully fulfilling requests.

~~~
cleaver
Every site where I was using external DNS stayed up.

------
fredophile
I don't have anything more important than a small personal website but now I'm
curious. If you set up a system to handle your main DNS provider failing, how
do you test it? Is there a good reference where I can find some best practices
on this?

~~~
mrideout
Here's my testing recommendation:

1\. Pick some subset of your DNS records to monitor, or all of them if you
want to be extra thorough. If you are picking a subset, then I'd pick whatever
records are most critical to your business.

2\. Set up monitoring that queries each of your authoritative name servers for
each of the records that you identified in the previous step. The monitoring
should notify you if any of the name servers are unresponsive, or return a
different response than what's expected.
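
As a rough sketch of the two steps above (not how any particular monitoring service does it), the check can be a loop over name servers and records. The `query` callable is an assumption: a stand-in for whatever sends a query directly to one name server, for example a thin wrapper around a DNS library. All names and addresses here are placeholders.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

# (record name, record type) -> the set of answers you expect to see.
Expected = Dict[Tuple[str, str], FrozenSet[str]]


def check_nameservers(
    nameservers: Iterable[str],
    expected: Expected,
    query: Callable[[str, str, str], Iterable[str]],
) -> List[str]:
    """Ask every name server for every monitored record; return the problems."""
    problems = []
    for ns in nameservers:
        for (name, rtype), want in expected.items():
            try:
                got = frozenset(query(ns, name, rtype))
            except Exception as exc:  # timeout, refused connection, etc.
                problems.append(f"{ns}: no answer for {name} {rtype} ({exc})")
                continue
            if got != want:
                problems.append(
                    f"{ns}: {name} {rtype} returned {sorted(got)}, "
                    f"expected {sorted(want)}"
                )
    return problems
```

An empty list means every name server answered with the expected data. Anything else is what you would alert on.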

If you'd like to dig into the details of DNS, then O'Reilly's "DNS and BIND"
is highly recommended, even if you're not using BIND.

There are a number of quality hosting providers out there. A rule of thumb
that I use is this: If a DNS hosting provider doesn't eat their own dog food,
don't trust them to handle your DNS. Digital Ocean doesn't use their own name
servers for their main website's domain. Neither does Amazon.

Shameless plug: I created a DNS monitoring service that can be used for
monitoring each of your name servers:
[https://www.dnscheck.co/](https://www.dnscheck.co/)

~~~
tonyarkles
The flipside to the dog food point: if DigitalOcean did use their own
nameservers for their main site, then we wouldn't have been able to see the
status page.

~~~
mrideout
Good point!

------
Rezo
Their status page at
[https://status.digitalocean.com](https://status.digitalocean.com) is also now
giving an intermittent "500 Internal Server Error" nginx error, probably from
the load. That's why you should use a service like
[https://www.statuspage.io](https://www.statuspage.io) for your important
stuff, even though creating a status page is a fun side-project for a dev
team.

~~~
crisopolis
So what you're saying is that instead of running their own Status Page on
their own infrastructure that's reachable, they should outsource it to
statuspage.io and pay another company to do it?

~~~
Rezo
Yes, that's pretty standard. Availability monitoring and status reporting
should be external and separate from your own infrastructure, otherwise
neither may be available when you need it the most.

~~~
dsr_
And don't use statuspage.io if your host is AWS, because theirs is too.

~~~
Rezo
They do have geo-region (not just AZ) redundancy and failover, which puts them
quite a bit above most home-grown company status pages in my experience. But
yes, if the problem was for example in R53 it would indeed be better to have
a solution without that AWS dependency.

------
karlgrz
This is the first DNS outage I've experienced with them in 3+ years, then
again I host everything in their NY regions.

~~~
crisopolis
I've never experienced an outage of any kind with DO, so also first time. I
also host all my droplets in the NYC regions.

------
crisopolis
DigitalOcean uses CloudFlare for DNS -
[https://www.cloudflare.com/case-studies-digital-ocean/](https://www.cloudflare.com/case-studies-digital-ocean/)

~~~
jtokoph
This statement can be misleading. If you read the article, they don't use
CloudFlare's DNS servers per se. They use CloudFlare's DNS proxy which acts as
a DNS firewall between the DigitalOcean DNS servers and the world.

------
josh_carterPDX
I think the most annoying aspect of this outage is their updates. Three
updates and they all say the same thing with no meaningful information as to
what's causing this. Likely they may not have much information, but you'd
think there would be something more than what they've been posting for the
past hour. Good times!

------
coreyp_1
Does anyone know of a good strategy for DNS failover?

~~~
mattzito
Well, there's a couple of strategies:

\- IP-diverse nameservers

\- TLD-diverse nameservers

\- BGP anycast

IP-diverse nameservers requires that you expect that your DNS servers will go
down rather than start returning bad results - I highly recommend having some
sort of mechanism to hard-terminate access to those machines.

TLD-diverse nameservers is just an extra strategy for reducing the risk that
an upstream TLD issue will blow up your spot.

And then BGP anycast is the expensive, complicated piece of this - it requires
a high level of technical sophistication, lots of moving parts, and the
QA/validation piece of it is tricky.

When I built an anycast DNS system, we ended up resorting to tricks like
having the DNS servers publish routes to the router for redistribution, so
that a down or unresponsive server automatically withdrew the routes. Then you
do things like TXT records for your zone that respond with which POP you're
hitting in some sort of hashed/obfuscated fashion.
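
The comment doesn't say how the hashing was done. One way to sketch the "which POP am I hitting" TXT value, assuming a keyed HMAC over the POP name (the POP names and secret below are invented for illustration):

```python
import hashlib
import hmac
from typing import Iterable, Optional


def pop_token(pop_name: str, secret: bytes) -> str:
    """Opaque value a POP could publish in a TXT record to identify itself."""
    mac = hmac.new(secret, pop_name.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:12]


def identify_pop(token: str, pop_names: Iterable[str], secret: bytes) -> Optional[str]:
    """Operator side: map a token seen in a TXT answer back to a POP name."""
    for pop in pop_names:
        if hmac.compare_digest(pop_token(pop, secret), token):
            return pop
    return None


if __name__ == "__main__":
    secret = b"example-only-secret"           # hypothetical shared secret
    token = pop_token("pop-ams-1", secret)    # what the TXT record would hold
    print(identify_pop(token, ["pop-nyc-1", "pop-ams-1"], secret))
```

Outsiders see only an opaque string, while anyone holding the secret can tell which site answered the query.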

It's hard and complicated, and unnecessary for most folks. Better to outsource
to Route 53 or someone similar.

~~~
blumentopf
\- Implementation-diverse nameservers

Use multiple implementations, e.g. NSD/BIND for authoritative servers and
Unbound/BIND for resolvers, to mitigate against implementation-specific bugs
and vulnerabilities.

------
jamescun
I would be interested in the post-mortem from this. While DigitalOcean operate
their own DNS, it is only made publicly available through CloudFlare's DNS
proxying service.

------
NewHatMatt
From @DOStatus a minute ago:

"Our engineering team has identified the issue, and are working to resolve
connectivity issues to our DNS servers....
[http://do.co/status](http://do.co/status)"

[https://twitter.com/DOStatus/status/713043871559655424](https://twitter.com/DOStatus/status/713043871559655424)

------
nodesocket
Recommend AWS Route53 very highly. Route53 also allows you to buy domain names
and do lots of fancy fail-over, geolocation, and CNAME alias at the apex
magic.

------
samgranieri
A few years ago Slicehost had a DNS outage and the webscrapers I had running
were falling over because they couldn't resolve DNS. I had to SSH into 8 boxes
and update resolv.conf to add google DNS and openDNS as a backup. (Yes, I
should've had centralized config management with chef or puppet or ansible)

~~~
crisopolis
That's crazy... I think by default DO droplets use Google DNS for resolving.

------
doublerebel
No offense to anyone here, but what is DO's SLA? Last time I looked, they did
not have one.

DO is cheap for a _reason_. And that's the same reason I don't host with them,
I can get SLA-backed infrastructure for a reasonable price and would have no
excuse to my customers or cofounders.

~~~
bpicolo
Looks like they do have one:
[https://www.digitalocean.com/help/policy/](https://www.digitalocean.com/help/policy/)

------
xir78
We have seamless DNS "failover" by running dnsmasq with the all-servers option
on all our servers. It causes dnsmasq to query all upstream servers at once, so
if any go down it's transparent to our apps. Works perfectly on our 1500 ec2
instances.
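
The underlying idea (ask every upstream at once, take the first reply) can be sketched outside dnsmasq too. This is only an illustration, not how dnsmasq is implemented. Each entry in `resolvers` is assumed to be a callable that takes a name and either returns an answer or raises.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def first_answer(resolvers, name, timeout=2.0):
    """Query all resolvers concurrently and return the first successful answer."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(resolvers)))
    futures = [pool.submit(resolve, name) for resolve in resolvers]
    try:
        for future in as_completed(futures, timeout=timeout):
            try:
                return future.result()
            except Exception:
                continue  # this resolver failed; wait for the next one
    except FuturesTimeout:
        pass  # nobody answered within the deadline
    finally:
        # Do not block on slow resolvers that are still running.
        pool.shutdown(wait=False)
    raise LookupError(f"no resolver answered for {name}")
```

A dead or slow upstream only costs an extra in-flight query, and the caller never waits on it as long as one other upstream responds.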

------
r1ch
I thought their DNS was supposed to be rock solid since they use Cloudflare
Virtual DNS. Oh well, lesson learned. Back to running my own DNS servers on
each droplet, if the DNS is down the droplet is likely down regardless :).

------
showerst
Feeling the pain here too. What DNS providers do others use and like? Route53?

~~~
dboreham
Bind, running on VMs. Not hard.

~~~
SteveNuts
You run your own authoritative DNS servers?

~~~
z92
I ran my non-authoritative DNS server [bind] on a droplet for about a year.
But the server crashed every few months. Why? Never figured out. A restart
always fixed it.

Later shifted to DO's DNS servers.

Now that that one is down too, just shifted back to domain registrar's DNS.

Everything is working now.

------
yakshaving_jgt
This is the second time recently their AMS region has gone down, which is
where I host my email. What a pain.

------
satyajeet23
That awkward moment when it shows the status page

------
grej
This is causing huge huge pain for us, Digital Ocean.

------
colinbartlett
If you want to get alerted when it comes back up, or you wish you had been
alerted when it went down, check out my project:
[https://StatusGator.com](https://StatusGator.com).

StatusGator monitors status pages and sends notifications via email, Slack,
and others. You can get alerted to status changes inside Slack and you can ask
it the status of a service with a /statuscheck command.

~~~
camikazeg
A bit of feedback: you should have a link back to your dashboard on every
page. That seems like the most important page to me as a user, but if I am
changing my notification or account settings, there is no way back to that
page.

~~~
colinbartlett
Great feedback, thank you! Added that.

------
pmalynin
Yeah, tried to access our site and it was down. Really was expecting more out
of Digital Ocean than to fuck up such an integral part of their
infrastructure. In the future we'll be transitioning away from their DNS
solution because this is unacceptable.

~~~
tehbeard
I hope your clients/users are as understanding and civil as you are.

In the meantime, I'm going to wait for the post-mortem before deciding if I should
continue using them for dns. Looking back over the status history, 1-2
incidents a year isn't that bad for my needs, but might be too much for you,
which is fine (since I'm only hosting a couple of small side projects with
them).

~~~
pmalynin
The problem is, for an early stage startup incidents like this are deadly.
Especially since we just applied to a bunch of accelerators.

~~~
crisopolis
The resolution is, for any app/startup/business everything is a risk, and if
you didn't take the edge-case of "What happens if my primary DNS nameserver
goes down for my domain?" into account, is all you can do blame DO?

If your app goes down do you have failover for that? Or do you blame your
devops team?

~~~
grej
Any small business has to manage which risks it accepts. OP got the Digital
Ocean service to try and mitigate this risk to a degree. Beyond that, it
becomes a question of whether to accept risk in other aspects of the product or
the risk that your primary DNS provider will fail.

The reality is, you have limited development and financial resources so you
simply can’t do everything. Sure, in a vacuum, or in a larger enterprise, we’d
love to manage every risk. But when we're just starting out that’s not
realistic, and we do have a right to be upset at Digital Ocean's service going
down while at the same time realizing that yes, ideally we would / should have
had redundancy in place already.

Your question on blaming the devops team is exactly the mindset of someone who
has a lot more resources than a brand new startup. In a brand new startup
there IS NO devops team. If you're lucky, there is one person who does the
devops work part time, balanced with a bunch of other development work he/she
also does.

