Hacker News
Ask HN: Cloudflare Workers are down?
157 points by cristaloleg on Oct 30, 2023 | 79 comments
Getting too many 500s on a dozen services right now.

UPD: https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc




This should be resolved. We’re still investigating the underlying root cause, and intend to share a write-up once we have that in hand.

This is not the way we wanted anyone to start their week.

(I am the PM lead for Cloudflare Workers: Databases & Storage)




Just noted on HN and the incident has already been upgraded. A much faster "response" than most other companies :-)

All the best to the people fixing!


Works for me: https://blog.sapico.me/

Seems to have lasted 30 minutes, according to the status page.

Fix is fast. Curious what it was.


3:55 PM Eastern: Our entire website hosted on Cloudflare Pages is returning 500. I also cannot log in to the dashboard (it just spins).

EDIT 4:10 PM Eastern: Now I can log in to the dashboard, but the "Workers and Pages" menu is returning errors and I have no access. Website still down :(

EDIT at 4:23 PM Eastern: RESOLVED. Website (Cloudflare Pages) is back up now for me.

Looks like they took about 25 mins to resolve.


Our prod app and staging just completely died. Bad day for somebody at Cloudflare


Our main Marketing website that brings revenue is down. No Sympathy from me. It has been 20 mins now. Losing money as I type this.

EDIT: I panicked a little. As a dev, I should have been more sympathetic.


It’s only reasonable to be angry but do try to remember that the people fixing this are people like you who showed up at work to build something and are instead dealing with a fire. Ask their bosses about how they got in that situation but be nice to them, they’re having an even worse day than you are.


Fair enough. They resolved it now and I was in a bit of panic considering our revenue depends on the website. As a developer though, I should have been more sympathetic.


I'm curious, I definitely get the panic. How much does a 30 minute outage cost you, vs how much would it cost to build a solution with some kind of standby that you could fail over to in scenarios like this?

It could be worth it, but if you do the math and it seems like it's not worth it, it could perhaps give you some equanimity the next time it happens?
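The comparison asked about here can be sketched as simple arithmetic. This is a rough sketch only: the function name and every figure in the example call are hypothetical placeholders, not numbers from this thread, and it ignores non-monetary costs such as reputation.

```python
def standby_break_even(revenue_per_hour: float,
                       outage_minutes_per_year: float,
                       standby_cost_per_year: float) -> dict:
    """Compare expected yearly outage loss with the yearly cost of a standby."""
    # Expected loss: revenue rate multiplied by expected hours of downtime.
    expected_loss = revenue_per_hour * outage_minutes_per_year / 60.0
    return {
        "expected_loss_per_year": expected_loss,
        "standby_cost_per_year": standby_cost_per_year,
        "standby_pays_off": expected_loss > standby_cost_per_year,
    }


# Hypothetical inputs: 2000/hour revenue, 30 minutes of outage per year,
# and a standby that costs 6000/year to build and run.
result = standby_break_even(revenue_per_hour=2000.0,
                            outage_minutes_per_year=30.0,
                            standby_cost_per_year=6000.0)
print(result)
```

With inputs like these the standby costs more than the outage it prevents, which is the "equanimity" case; a higher revenue rate or more expected downtime flips the result.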


Honestly, the impact may be less monetary and more reputational, but as a B2B SaaS provider, I agree that it may not have been as dramatic as I made it sound for the 25 mins of downtime. It is just that we never had a downtime this long, EVER, in 8+ years of business, so I hit the panic button fast. 25 mins seemed like 25 hours if you ask me :).


3.55pm eastern. My websites work

4.10pm eastern, still working

4.23 eastern. Yep you guessed it

Half an hour means they’ve lost their five nines for this year based on this outage alone.


Us too: prod and staging are down, and the dashboard's API requests are failing (500).


And just 30 minutes ago we were about to flip the switch on a months long migration to Cloudflare Pages for our new website, I guess some things weren't meant to be :')


Omg. What timing. I feel your pain. We recently migrated to Cloudflare Pages and I was happy with the speed and everything, and now this :(. I never had downtime when I self-hosted on my DigitalOcean droplet. Damn. Reconsidering going back to old-school nginx static site hosting.


Those might have had downtime, but never reported


Well then you haven't used DO that long, I get regular emails about X or Y server needing to go down for maint.


I've used Digital Ocean (and many other hosting providers) for as long as most of them have existed. Most of my servers have been running nearly uninterrupted for many years. Yes, there will be a reboot or move every so often but the uptime is incredibly high.

The idea that a single server can beat the reliability of a massively distributed system is counter-intuitive, and yet it's usually the case.

The average distributed system is a house of cards that can come tumbling down if any one of a number of pieces fails. The average static server is a rock of stability, with very few failure modes.


Yep our current marketing site is NextJS hosted on Hetzner fronted by Cloudflare, fortunately that's still up and never has any problems.

We've moved to next-on-pages for our new marketing site and I've spent the whole day on finishing touches ready for switch over at 20:00 UTC, and now this :((


Not sure why you're being downvoted


Did you ever reboot for patches or was it load balanced?


Heck even shared hosting for $3/mo works just fine


For any terraform users that may be using code like this:

data "cloudflare_ip_ranges" "cloudflare_ipv4_list" {}

This is coming back with an empty list on some fields and causing havoc in terraform.


It is shocking to me how bad to non-existent the error handling is in most Terraform providers. It leads to some remarkably arcane and esoteric error messages.


Terraform error handling as a whole is nuts anyway. Like, I recently tried to delete an ACM cert that was still in use in a CloudFront distribution. It didn't work, but it took 20 minutes for Terraform to recognize that, yes, there's an API error. It shouldn't have gotten that far, given that the API call errors out immediately when tried over the CLI or Web Console, but instead of erroring out, Terraform retried for 20 minutes until it hit some sort of timeout.

To make it worse, you can't even kill Terraform safely because while it does register your Ctrl+C, it won't interrupt an ongoing process, and if you force kill it you run the very serious risk of corrupting your state file.

Seriously, I'm looking for OpenTofu to light some fire under the ass of HashiCorp. I don't know where all the VC money went, but for what's supposed to be the gold standard of IaC solutions, it's sometimes bloody ridiculous.

(Not to mention it's written in Go of all things which means there's virtually zero tooling and documentation to debug it or to develop anything for... especially when compared to the state of the art in Java, NodeJS or PHP)


This is usually down to the provider implementation, which switching the core won't fix. The provider controls HTTP calls and errors against the relevant service API.

Here are the retries in the provider code https://github.com/hashicorp/terraform-provider-aws/blob/mai...

It's hard coded to "certificateCrossServicePropagationTimeout" which is 20 minutes here https://github.com/hashicorp/terraform-provider-aws/blob/mai...
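To illustrate the behavior being discussed (retry until a fixed timeout versus failing fast), here is a minimal Python sketch. It is not the provider's code: `retry_until` and `NonRetryableError` are hypothetical names, and deciding which API errors count as non-retryable is an assumption left to the caller.

```python
import time


class NonRetryableError(Exception):
    """Hypothetical marker for errors that will not succeed on retry."""


def retry_until(operation, timeout_s: float, interval_s: float = 1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call operation() until it succeeds or timeout_s elapses.

    Errors marked NonRetryableError are raised immediately instead of
    being retried for the whole timeout window.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return operation()
        except NonRetryableError:
            raise  # fail fast: waiting will not change the outcome
        except Exception:
            if clock() >= deadline:
                raise  # out of time: surface the last error
            sleep(interval_s)


# Example: an operation that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"


print(retry_until(flaky, timeout_s=5, interval_s=0, sleep=lambda s: None))
```

The design point is the separate `except NonRetryableError` branch: without it, an error that can never succeed is retried until the full timeout expires.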


Sure, but Terraform Core doesn't provide any way of getting user feedback in case unexpected situations happen, or aborting while saving the current state, both of which would save me serious amounts of time.


Being able to actually interrupt/cancel would be nice. You can get more feedback by adjusting TF_LOG env var. Logging levels have been getting improvements for a while (it used to just be TRACE that spammed everything)


> You can get more feedback by adjusting TF_LOG env var.

Yep, but that's often useless after the fact. In the PHP world, for example, there's Symfony/Monolog's `fingers_crossed` logger [1]... it keeps logs below the normal threshold in memory, but if there is a single event of a given severity or worse, it dumps out all the logs it has ingested so far for this request.

A real lifesaver that one is.

[1] https://symfony.com/doc/current/logging.html
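For readers outside PHP, a similar buffer-then-dump pattern can be sketched with Python's standard-library `logging.handlers.MemoryHandler`. This is an analogy, not Monolog's implementation: the logger name and capacity are arbitrary, and unlike the description above, `MemoryHandler` also flushes when its buffer fills up.

```python
import io
import logging
from logging.handlers import MemoryHandler


def build_logger(stream, capacity: int = 1000) -> logging.Logger:
    """Return a logger that holds records in memory until an ERROR arrives."""
    logger = logging.getLogger("fingers_crossed_demo")  # arbitrary name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()

    target = logging.StreamHandler(stream)
    target.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    # Records are buffered; the whole buffer is written to `target` once a
    # record at ERROR or above is seen (or the buffer reaches `capacity`).
    buffered = MemoryHandler(capacity=capacity, flushLevel=logging.ERROR,
                             target=target, flushOnClose=False)
    logger.addHandler(buffered)
    return logger


out = io.StringIO()
log = build_logger(out)
log.debug("resolving provider")
log.info("calling API")
quiet_output = out.getvalue()   # nothing written yet
log.error("API returned 500")
loud_output = out.getvalue()    # debug, info and error lines all appear
print(loud_output, end="")
```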


I used to add TF_LOG_PATH to my shell profile so all TF runs log to disk with a daily cron to truncate the file.


There was no error message, that was the really unsettling part.


It's shocking how desirable a skill it is in DevOps job roles, given its clear deficiencies.


In the time between making this post and now, it's come back. I really wish that would've returned an error and not an empty list; that almost caused a disaster in my automation.
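One defensive option for automation that consumes such a list is to treat an empty result as an error before acting on it. The sketch below is a generic Python guard, not a Terraform or Cloudflare API: `validate_ip_ranges` and `EmptyRangeError` are hypothetical names, and the sample CIDRs are documentation ranges rather than real Cloudflare addresses.

```python
import ipaddress


class EmptyRangeError(ValueError):
    """Raised when an upstream source returns no IP ranges at all."""


def validate_ip_ranges(cidrs):
    """Reject an empty list and parse each entry as a CIDR block."""
    if not cidrs:
        # An empty allow-list is far more likely an upstream failure
        # than a genuine "no ranges" answer, so refuse to continue.
        raise EmptyRangeError("IP range list is empty; refusing to apply")
    # ip_network raises ValueError on malformed input, which also stops the run.
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


# Documentation-only example ranges (RFC 5737), standing in for real data.
networks = validate_ip_ranges(["192.0.2.0/24", "198.51.100.0/24"])
print(networks)
```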


Anyone remember big iron and servers with uptimes of 5 or 7 nines?

I mean: it used to be a thing. Now we have the cloud.


Seven nines is about 3 seconds of downtime per year. That was never a thing.
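The arithmetic behind "nines" is easy to check. A small sketch, assuming a 365.25-day year (the function name is just illustrative):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


for n in (3, 5, 7):
    print(f"{n} nines: {downtime_budget_seconds(n):.2f} seconds per year")
```

Five nines works out to roughly 5.3 minutes per year and seven nines to roughly 3.2 seconds, so a single 30-minute outage exceeds the five-nines budget several times over.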


Auth0 seems to be down as well


Yeah, can confirm this (for those looking at their status pages which claim otherwise).


Complete Pages outage for me. I have several sites hosted on Cloudflare Pages and I can't access any of them, they're all returning 500's.


Yep, same for me :(


Fun day to release a blog post[0] about cloudflare page functions, on a site hosted on cloudflare pages.

[0] https://interbolt.org/blog/split-it-and-forget-it


Apparently Auth0 as well. Possibly related.


Most likely related, I see a `cf-ray` header in the 500 response.
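The check described here can be sketched as a small helper that looks for the header case-insensitively. The function name and the header values in the example are made up for illustration; note that a `cf-ray` header only shows the response passed through Cloudflare, not that Cloudflare caused the error.

```python
def served_via_cloudflare(headers) -> bool:
    """Return True if the response headers include a cf-ray header."""
    # HTTP header names are case-insensitive, so normalize before checking.
    normalized = {name.lower(): value for name, value in headers.items()}
    return "cf-ray" in normalized


# Hypothetical headers from a failing response.
example_headers = {"Content-Type": "text/html", "CF-RAY": "example-ray-id"}
print(served_via_cloudflare(example_headers))
```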


It's probably bad that I noticed this just due to a large percentage of my regular online-habits suddenly breaking. I liked the old internet where websites just broke one at a time.


That was before ddos became common and cheap to execute.


This is preventing new logins to ChatGPT.


Error 1101: Worker threw exception. Interestingly, they're fronting their Auth0 tenant with CF.


Auth0 itself is (edit: was) down, https://auth0.com/


I can't log in to my domain dashboard either. Maybe that is a downstream effect of Workers being offline?


Yes. The workaround is to disable your Workers. That got my site back up and running.

EDIT 15:07 MDT: People are reporting that Workers are back up. Mine isn't in my site's critical path. So I'm going to leave the Worker disabled (un-routed) until tonight.


Are you saying that if you disable Workers, your Cloudflare Pages sites will come back up?


Ah, I don't know about Cloudflare Pages. I think they use Workers underneath. So unfortunately, there's no fix yet. Sorry.


Ah well. I cannot access the "Workers and Pages" menu. It returns an error.


My sites have started coming back up now. Their site has also just started working again (previously got a 500): https://pages.cloudflare.com/


I won't be the first or last to say these three things:

The internet was meant to stop reliance on single sources (in case of nuclear war)

The size of a house of cards increases the number of failure points

Marketers lie


> The internet was meant to stop reliance on single sources

You have all the technical means. Your home server possibly won't be reachable, yes.

The global connectivity as-is is really, really, really fault tolerant.


Cloudflare Pages aren't working on a few of my sites too


It is funny that the company that laughed at Okta for a breach just a few days ago, and whose core competency is availability, is now experiencing an outage.


You should probably indicate what you meant by "laughed at Okta". Do you have a link?


https://blog.cloudflare.com/introducing-har-sanitizer-secure...

Maybe "laugh" is not accurate, idk. But their post kinda looked like 'here, we built this tool that should have been made by okta'.


Ongoing DDoS attacks are targeting sites that raise funds for Gaza relief efforts: https://twitter.com/arblauvelt/status/1719027920054702363

I wonder if there's a connection.


Not related.

(I am the PM lead for Workers databases & storage)


I really appreciate how you've shown up quickly and given direct answers. It's an admirable level of comms for a company so large.


Is there a postmortem coming? Would you be able to tell us what happened at a high level?


See my comment here: https://news.ycombinator.com/item?id=38075877

(We’ll share more when we can)


That's sad, hopefully something comes along that can brighten their day.


Wonder if this is related to the mini NPM outage I was experiencing earlier:

https://status.npmjs.org/incidents/zdznxkrp22py


No way to confirm, but I think so, just because NPM threw this error at me:

     KV GET failed: 401 Unauthorized
where KV could refer to the CF KV in workers


All of them, or just those in some data centers?


Looks like it's working now as of 1:23 PST


Damn, yeah, I've been noticing it for the last few minutes or more.


Looks to be back up now for my sites


Yes, down since about 3:55 ET.


The first error was at [hour]:54:23 for me, in case it helps


Yeah, my system alerted at 19:56 UTC


Back up for me as of 16:20 ET


npmjs is also in a bad spot:

https://status.npmjs.org/


Ask HN:


Thanks, fixed.



