
Cloudflare Dashboard and API Service Issues - easywiththe
https://www.cloudflarestatus.com/incidents/g7nd3k80rxdb
======
nikisweeting
All our sites using Argo Tunnels go down whenever the API goes down, but the
dashboard still claims Argo Tunnels are unaffected :/

At least we'll get a good blog post out of it in a few days.

~~~
inapis
Does Argo help much with performance? How much of a difference are we talking
about here?

~~~
buildbuildbuild
Argo Tunnels are most useful for security: you can serve HTTP from a server
that allows no inbound connections. Or, you can use it to reliably serve a
public site from behind NAT.

Argo routing’s ability to improve page load times depends on how international
your user base is, and how poor your origin’s transit quality is. It’s great
at improving the latency and reliability of a cheap host.

------
rkwasny
Update from CEO
[https://twitter.com/eastdakota/status/1250479501726760960](https://twitter.com/eastdakota/status/1250479501726760960)

~~~
zymhan
How on earth can rented "remote hands" at a datacenter take down the ENTIRE
Cloudflare management layer?

~~~
MichaelApproved
and for over 3 hours? What could they do accidentally that would cause an
outage for this long?

The tweet says they're failing over to their backup facility. I would've
expected that failover to happen much faster.

Seems like they have two issues going on. First, the remote hands could take
down the datacenter. Second, their failover is taking this long to come
online.

I also wonder how much Covid impacted the process, if at all.

I'm looking forward to the details after they get things back online.

Good luck CF engineers!

~~~
eb0la
Three hours for a _physical_ problem like this is very little time. Once you
discover your equipment is not there, you still have to put it back in its
place, power it up, reconnect all cabling, and check networking.

From this kind of mess to a fully functional infrastructure you need at least
12h-16h to function minimally. Probably takes 2-3 days to have that node
working as before.

When I was on call, I always joked with my colleagues that the worst incident
you could have in a datacenter was someone swapping two cables.

------
redm
I wish cloudflarestatus.com (powered by StatusPage) offered a subscription
(like [https://status.box.com/](https://status.box.com/), also on StatusPage)
so you could get a pro-active notice about outages.

I had to debug customer issues to find out that this was down.

Even if they don't want to offer this to the general public (free customers),
they should have another notice mechanism for enterprise customers.

~~~
pmccarren
StatusPage has native Subscriber Notifications [0]. Cloudflare must not have
it set up on their end.

[0] [https://status.io/features](https://status.io/features)

~~~
burlesona
Wrong URL, Statuspage is at [https://statuspage.io](https://statuspage.io) :)

------
js4ever
It's triggering the regular cloudflare error when trying to access their
dashboard.

Error 522: If you're the owner of this website: Contact your hosting provider
letting them know your web server is not completing requests. An Error 522
means that the request was able to connect to your web server, but that the
request didn't finish. The most likely cause is that something on your server
is hogging resources.

I can't wait to see the postmortem. I wonder if it's a DDoS, a
network/hardware issue, or a deployment error.

~~~
pul
They've added the cause on their status page now: > [...] a disruption that
occurred during a maintenance.

[https://www.cloudflarestatus.com/incidents/g7nd3k80rxdb](https://www.cloudflarestatus.com/incidents/g7nd3k80rxdb)

------
robinhouston
This has broken publishing in our app, which purges the file from the
Cloudflare cache when something is republished. We’re ignoring errors from the
Cloudflare API, but that isn’t enough in this case because it isn’t returning
an error – it’s just hanging till the request times out.

We’re pushing an emergency config change to skip the cache invalidation, which
will stop it timing out but means republished projects won’t update (because
the old version will still be cached).

Godspeed to the Cloudflare engineers who are presumably scrambling to fix
this!
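
For anyone hitting the same hang: a minimal sketch of a purge call with an explicit timeout, so a stalled API degrades into a skipped purge instead of blocking the publish. The `purge_cache` helper, its parameters, and the `{"files": [...]}` payload shape are assumptions for illustration; check the endpoint URL and body against Cloudflare's API docs before relying on them.

```python
import json
import urllib.request


def purge_cache(url, files, token, timeout_s=3.0, opener=urllib.request.urlopen):
    """Ask the CDN to purge `files`; return True only if it acknowledged.

    `url`, `token`, and the payload shape are hypothetical placeholders.
    `opener` is injectable so the function can be exercised without a network.
    """
    body = json.dumps({"files": files}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    try:
        # The timeout applies to each blocking socket operation, so a
        # hanging API fails quickly instead of stalling the publish.
        with opener(req, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

The caller can then publish regardless and log or retry when the function returns `False`, which treats a timeout the same way as an explicit API error.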

~~~
darkerside
Do you not have some kind of task runner you can offload this to? That seems
like it would be of general benefit.

~~~
robinhouston
The trouble with returning to the user before the cache has been purged is
that they think it hasn’t worked, because they still see the old cached
version.

~~~
darkerside
I'm probably missing something but it sounds like cache invalidation happens
on republishing, not on user request for content.

~~~
robinhouston
You’re absolutely right. But user requests for content are made directly to
Cloudflare, not through the app, and it’s common for users to click directly
through to the published content after republishing.

~~~
darkerside
Ah, that makes sense. You could store latest task status per content in your
database and have a JavaScript component that polls and shows a message if
cache invalidation is incomplete. But for how often Cloudflare goes down, it's
probably not worth it.
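
Roughly what I have in mind, as a sketch only: purges run on a background worker, the latest status per content item is recorded, and the front end polls `status()` to decide whether to show a "cache still updating" message. `PurgeTracker` and `purge_fn` are made-up names, and an in-memory dict stands in for the database.

```python
import queue
import threading

PENDING, DONE, FAILED = "pending", "done", "failed"


class PurgeTracker:
    """Runs cache purges off the request path and tracks their latest status."""

    def __init__(self, purge_fn):
        self._purge_fn = purge_fn      # hypothetical: content_id -> bool
        self._status = {}              # stand-in for a database table
        self._lock = threading.Lock()
        self._jobs = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def republish(self, content_id):
        """Mark the purge as pending and return to the user immediately."""
        with self._lock:
            self._status[content_id] = PENDING
        self._jobs.put(content_id)

    def status(self, content_id):
        """What a polling endpoint would return; None if never republished."""
        with self._lock:
            return self._status.get(content_id)

    def wait_idle(self):
        """Block until all queued purges have been attempted."""
        self._jobs.join()

    def _worker(self):
        while True:
            content_id = self._jobs.get()
            try:
                ok = self._purge_fn(content_id)
            except Exception:
                ok = False
            with self._lock:
                self._status[content_id] = DONE if ok else FAILED
            self._jobs.task_done()
```

A real version would want retries and would need to handle two republishes of the same item racing each other, which this sketch ignores.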

------
arn
Can't clear our cache via API or manually, so our cached HTML pages are stuck
until the natural expiry happens, which is set somewhat long. Not great for
a blog / news site. For example, if we publish a story, our front page won't
reflect it.

------
jgrahamc
Coming back now.

------
iampims
Postmortem thread:
[https://news.ycombinator.com/item?id=22884833](https://news.ycombinator.com/item?id=22884833)

------
verroq
By degraded performance it means down completely.

~~~
mmm_grayons
Yeah, dashboard still appears to be completely down.

------
mattashii
> "Cloudflare is continuing to investigate a network connectivity issue in the
> data center which serves API and dashboard functions."

This implies that CF hosts its API and dashboard all in one DC, which I find
an _interesting_ observation. One would expect a company like CF to host its
critical infrastructure in a redundant fashion.

~~~
tyingq
It's certainly not ideal. But it's not unusual to spend a lot more time on
making the runtime very redundant, but less time/money on dashboards and
configuration change underpinnings. Doesn't work well in this case since it
kills invalidating cache items for customers.

The comments seem to imply that having a redundant way to refresh the page
cache, even if it were global/domain versus page, would be an okay backup for
many.

~~~
mattashii
I agree that first priority would be data integrity (which would be the
runtime). But a large part of the CF experience of a CF customer would be the
availability of their management APIs/dashboards, and that would be another
part to optimize for.

I'm really surprised that they hosted all those non-vital but still quite
critical services in just one DC, or somehow had one DC as a single point of
failure. Network issues happen "regularly enough" to want to protect against
that, or at least have mitigations available.

~~~
bostik
To be fair, you have these latent single points of failure even in the most
resilient distributed systems.

Such as S3. The bucket names are globally unique, which means that their
source of truth is in a single location. (Virginia, IIRC.)

Now... a small thought exercise. If I wanted to take down a Cloudflare
datacenter and I had access to a few suitably careless remote hands, I'd take
out the power supplies to the core routers, and while the external network is
out of commission, power down the racks where they have their PXE servers.
That should keep anything within the DC from being able to recover on its
own.

------
bithavoc
They pushed a change to fix the DNS propagation 45 minutes ago[0]. Edge servers
continue to proxy but no new records are being served.

[0][https://www.cloudflarestatus.com/incidents/57shkf1841kh](https://www.cloudflarestatus.com/incidents/57shkf1841kh)

~~~
gazelleeatslion
Noticed this yesterday around 5pm EST.

Edit: Also noticed that when generating API keys, the dropdown wouldn't list
all my accounts for setting permissions. Just assumed it was all related or
something.

Either way, overall super insanely reliable product/service and could not live
without it.

------
gramakri
All DNS APIs are failing :(

~~~
zymhan
As are our CircleCI pipelines that call out to Cloudflare.

Time to get some fresh air!

------
partiallypro
While they are fixing this, could they please roll out a feature to allow me
to assign users to only specific domains? Biggest complaint about Cloudflare,
heck even GoDaddy lets you do that at no cost.

~~~
judge2020
You can, but only once you upgrade to Enterprise - the delegated dashboard
per-zone access functionality is part of their business model.

See:
[https://www.sec.gov/Archives/edgar/data/1477333/000119312519...](https://www.sec.gov/Archives/edgar/data/1477333/000119312519222176/g735023g16s02.jpg)
(RBAC)

~~~
partiallypro
I said at no cost. Other registrars/DNS providers provide this for free.

