It’s only reasonable to be angry, but do try to remember that the people fixing this are people like you who showed up at work to build something and are instead dealing with a fire. Ask their bosses how they got into that situation, but be nice to the people doing the fixing; they’re having an even worse day than you are.
Fair enough. They've resolved it now, and I was in a bit of a panic considering our revenue depends on the website. As a developer, though, I should have been more sympathetic.
I'm curious, and I definitely get the panic. How much does a 30-minute outage cost you, and how much would it cost to build some kind of standby you could fail over to in scenarios like this?
It could be worth it; but if you do the math and it seems like it isn't, that could perhaps give you some equanimity the next time it happens?
Honestly, the impact is probably less monetary and more reputational, but as a B2B SaaS provider I agree it may not have been as dramatic as I made it sound for 25 minutes of downtime. It's just that we had never had downtime this long, EVER, in 8+ years of business, so I hit the panic button fast. 25 minutes felt like 25 hours, if you ask me :).
And just 30 minutes ago we were about to flip the switch on a months-long migration to Cloudflare Pages for our new website. I guess some things weren't meant to be :')
Omg, what timing. I feel your pain. We recently migrated to Cloudflare Pages, and I was happy with the speed and everything, and now this :(. Never had downtime when I self-hosted on my DigitalOcean droplet. Damn. Reconsidering going back to old-school nginx static site hosting.
I've used DigitalOcean (and many other hosting providers) for as long as most of them have existed. Most of my servers have been running nearly uninterrupted for many years. Yes, there will be a reboot or a move every so often, but the uptime is incredibly high.
The idea that a single server can beat the reliability of a massively distributed system is counterintuitive, and yet it's usually the case.
The average distributed system is a house of cards that can come tumbling down if any one of a number of pieces fails. The average static server is a rock of stability, with very few failure modes.
Yep, our current marketing site is NextJS hosted on Hetzner, fronted by Cloudflare; fortunately that's still up and never has any problems.
We've moved to next-on-pages for our new marketing site, and I've spent the whole day on finishing touches, ready for the switchover at 20:00 UTC, and now this :((
It is shocking to me how bad (or non-existent) error handling is in most Terraform providers. It leads to some remarkably arcane and esoteric error messages.
Terraform error handling as a whole is nuts anyway. Like, I recently tried to delete an ACM cert that was still in use by a CloudFront distribution. It didn't work, but it took 20 minutes for Terraform to recognize that, yes, there's an API error. It shouldn't have gotten that far, given that the same call errors out immediately via the CLI or the web console, but instead of surfacing that, Terraform retried for 20 minutes until it hit some sort of timeout.
To make it worse, you can't even kill Terraform safely: while it does register your Ctrl+C, it won't interrupt an ongoing operation, and if you force-kill it you run a very serious risk of corrupting your state file.
Seriously, I'm hoping for OpenTofu to light a fire under HashiCorp's ass. I don't know where all the VC money went, but for what's supposed to be the gold standard of IaC solutions, it's sometimes bloody ridiculous.
(Not to mention it's written in Go, of all things, which means there's virtually zero tooling and documentation to debug it or to develop anything for it... especially compared to the state of the art in Java, NodeJS, or PHP.)
This is usually down to the provider implementation, which switching the core won't help with. The provider controls the HTTP calls and error handling against the relevant service API.
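As a rough sketch of what that means in practice (the resource, endpoint, and client wiring here are all made up, but the `diag` pattern is how the v2 plugin SDK surfaces errors): whether a failed API call becomes a readable diagnostic right away or disappears into a retry loop is entirely the provider author's decision.

```go
package provider

import (
	"context"
	"fmt"
	"net/http"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// resourceWidgetDelete illustrates the pattern: the provider makes the HTTP
// call and decides what the user sees when it fails. Terraform core never
// inspects the response itself.
func resourceWidgetDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	client := meta.(*http.Client)

	req, err := http.NewRequestWithContext(ctx, http.MethodDelete,
		"https://api.example.com/widgets/"+d.Id(), nil) // hypothetical API
	if err != nil {
		return diag.FromErr(err)
	}

	resp, err := client.Do(req)
	if err != nil {
		return diag.FromErr(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusConflict {
		// Surface the "still in use" case immediately instead of retrying
		// until some generic timeout fires.
		return diag.Errorf("widget %s is still in use (HTTP %d)", d.Id(), resp.StatusCode)
	}
	if resp.StatusCode >= 400 {
		return diag.FromErr(fmt.Errorf("deleting widget %s: HTTP %d", d.Id(), resp.StatusCode))
	}

	d.SetId("")
	return nil
}
```

Terraform core just renders whatever diagnostics the provider returns, so if the provider swallows the conflict and keeps retrying, that's exactly what you get.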
Sure, but Terraform Core doesn't provide any way of giving the user feedback when unexpected situations happen, or of aborting while saving the current state, both of which would save me serious amounts of time.
Being able to actually interrupt/cancel would be nice. You can get more feedback by adjusting TF_LOG env var. Logging levels have been getting improvements for a while (it used to be just TRACE, which spammed everything).
> You can get more feedback by adjusting TF_LOG env var.
Yep, but that's often useless after the fact. In the PHP world, for example, there's Symfony/Monolog's `fingers_crossed` logger [1]: it keeps logs below the normal threshold in memory, but if there's a single event at a given severity or worse, it dumps out all the logs it has ingested so far for that request.
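The same idea is easy to sketch in Go, which would fit Terraform's codebase; this is a toy version I just typed out, not a real library, and the level and function names are invented:

```go
package main

import "log"

// Level is a simple severity scale.
type Level int

const (
	Debug Level = iota
	Info
	Warning
	Error
)

// fingersCrossed buffers low-severity records in memory and only writes
// them out once a record at or above the trigger level shows up.
type fingersCrossed struct {
	trigger   Level
	buffer    []string
	triggered bool
}

func (f *fingersCrossed) Log(level Level, msg string) {
	if f.triggered {
		log.Print(msg)
		return
	}
	f.buffer = append(f.buffer, msg)
	if level >= f.trigger {
		// Something bad happened: dump everything held back so far.
		f.triggered = true
		for _, m := range f.buffer {
			log.Print(m)
		}
		f.buffer = nil
	}
}

func main() {
	l := &fingersCrossed{trigger: Error}
	l.Log(Debug, "resolving certificate")      // held in memory
	l.Log(Info, "issuing DELETE call")         // held in memory
	l.Log(Error, "API returned ResourceInUse") // dumps all three lines
}
```

The point is the same as Monolog's handler: quiet runs stay quiet, but the moment something at or above the trigger level happens, you get the full back-story that led up to it.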
In the time between when I made this post and now, it's come back. I really wish it would've returned an error and not an empty list; that almost caused a disaster in my automation.
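For what it's worth, the guard I'm adding to that automation is roughly the sketch below; the names and structure are hypothetical, not the actual script, but it shows the idea of treating a suddenly empty listing as an error rather than as a diff to apply:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSuspiciouslyEmpty flags a listing that is probably an outage artifact
// rather than a real "everything was deleted" state.
var ErrSuspiciouslyEmpty = errors.New("listing returned zero records; refusing to reconcile")

// reconcile compares the desired records against what the API reports and
// returns the set to delete. If the API suddenly reports nothing while we
// expect records to exist, bail out instead of planning a mass delete.
func reconcile(desired, actual []string) ([]string, error) {
	if len(actual) == 0 && len(desired) > 0 {
		return nil, ErrSuspiciouslyEmpty
	}
	want := make(map[string]bool, len(desired))
	for _, r := range desired {
		want[r] = true
	}
	var toDelete []string
	for _, r := range actual {
		if !want[r] {
			toDelete = append(toDelete, r)
		}
	}
	return toDelete, nil
}

func main() {
	// During the outage the list call came back empty; this now fails loudly
	// instead of queueing deletions for every record we manage.
	if _, err := reconcile([]string{"a.example.com", "b.example.com"}, nil); err != nil {
		fmt.Println("aborting:", err)
	}
}
```

It costs one extra condition, and it would have turned today's near-miss into a loud abort instead of a queue full of deletions.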
It's probably bad that I only noticed this because a large percentage of my regular online habits suddenly broke. I liked the old internet, where websites broke one at a time.
Yes, the workaround is to disable your Workers. That got my site back up and running.
EDIT 15:07 MDT: People are reporting that Workers are back up. Mine isn't in my site's critical path, so I'm going to leave the Worker disabled (un-routed) until tonight.
It is funny that the company that laughed at Okta for a breach just a few days ago, and whose core competency is availability, is now experiencing an outage.
This is not the way we wanted anyone to start their week.
(I am the PM lead for Cloudflare Workers: Databases & Storage)