In case you're wondering, we use Cloudflare to keep Render's network up during DDoS attacks. Both Render and our customers are often targeted. We've already started building a product that lets customers bypass Cloudflare altogether, and I expect we'll see more demand for it after today's incident.
We really like Render but are running into issues with Cloudflare blocking requests that are incorrectly flagged as malicious (our service passes code blocks over HTTP, similar to Replit).
Not to mention our site was down for way too long this evening…
We’d consider staying if we can bypass Cloudflare altogether. Render has been stable otherwise.
Great idea for a simple fix. We’ve worked around the issue already, but it’d be nice if we could just trust our own authenticated content without the extra hoops.
They don't spend many resources. Most participants in DDoS attacks are innocently recruited victims: either victims of their own ignorance or of developers' lack of care for secure defaults. In other words, some software product is deployed where it should not be, and then the people (and/or AIs) who want to run these attacks exploit standard protocol behavior.
A Raspberry Pi can generate enough traffic to overload an otherwise unprotected service. It doesn't cost much, if anything, to launch a brute-force attack.
There have been posts on here about malicious browser extensions, infected IoT devices, and malware in mobile apps that give someone the means to launch an utterly brutal attack. Imagine I had a service that could handle 10k rps. Now imagine 600k Android devices from all across the world each sending one request per second [0].
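To put the hypothetical numbers above in perspective, using only the figures from the comment (600k devices, one request per second each, against a 10k rps service):

```latex
\text{offered load} = 600{,}000\ \text{devices} \times 1\ \tfrac{\text{req}}{\text{s}} = 600{,}000\ \tfrac{\text{req}}{\text{s}},
\qquad
\frac{600{,}000\ \text{req/s}}{10{,}000\ \text{req/s}} = 60
```

That is 60 times the service's capacity, so even assuming it sheds the excess load gracefully instead of falling over, at most 1/60 (about 1.7%) of requests could be served.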
This is typically done via hacked bot farms that cost the attacker nothing beyond the fun of rolling out standardized, scripted attacks on poorly configured servers.
Why they do it... well:
Competition suppression
Vindictive nastiness
Fun
Just because you can (the world is your sandbox)
Other reasons that might not occur to you but are very real for the attacker...
I'm going to share a probably controversial opinion: I can't stand an outage title like "Websites and APIs on Render are unavailable due to Cloudflare network errors". It's passing blame. I run an app or two on Render. I don't pay Cloudflare; I pay Render. Take responsibility for the infrastructure decisions that you make for your customers; don't pass blame to your infrastructure providers.
We take full responsibility for the infrastructure choices that led to this outage. As the peer comment said, it's helpful to overshare in these situations.
We know developers don't actually care who's at fault and will move off of Render if we're down, period. Even before the incident, we'd started working on a project to eliminate the SPOF with Cloudflare, and now it's only a matter of time before we ship it.
I get that, and the update is much appreciated. I don't mean to insinuate that this was the intention behind the choice of language; it's just the sentiment that the language conveys, and that's why I'm not a fan of it.
The stance I take is this: it's a fine line between oversharing and passing blame in outages like this. I'm happy that a line like that, coming from Render, was just oversharing (I love your product!), but it's easy to see how the same line from a less admirable company could read as "Nah man, it's not on us, we didn't do anything wrong." A critical difference is the follow-up: if Cloudflare was the cause, how are we working toward avoiding that cause in the future? That leads nicely to where pointing at Cloudflare (or any upstream provider) generally feels more agreeable: the retro.
To be clear: I have no intention of leaving Render, even if y'all weren't planning to alleviate this SPOF. I fully grok the difficult engineering required to nuke SPOFs like Cloudflare or AWS, and a bit of downtime here and there is a price I'm fine with paying.
Sure, but there's a time and a place. Outages involve high tension and the fog of war, and you said it yourself: you're already ready to blame them for not having backup generators in this hypothetical example. The midst of an outage is not the time to start casting blame on people, organizations, processes, providers, whatever. Outages are the time to fix; retros are the time to blame (within productive reason, of course).
If you ran it outside of Render, would you be using a CDN service or building your own?
The bigger issue you're alluding to is supply-chain reliability in SaaS products: when AWS goes down, multiple other (seemingly unrelated) services go down. But saying it's the downstream service's fault is pointless, because if you did it yourself you'd be using the same upstream provider and dealing with their outage yourself.
In that example, Slack, as a bigger customer of AWS, would have a much bigger say, and a more direct line to AWS engineers, than you would.
Right, and I think there's an interesting transitive correlation here. As a customer of Render, while Render was down because of Cloudflare, is it appropriate for me to post on our outage page "Service interruption due to issues at Render"? "Service interruption due to issues at Cloudflare"? What does Cloudflare post on their page? (Well, they may actually post "due to a busted AC unit in our Seattle data center", which, you know, means at that point we've hit bedrock, so maybe that's valuable, but...)
It's turtles all the way down. In the midst of an outage I totally empathize with the off-the-cuff thinking that oversharing is better than undersharing, but after the fog of war clears you can retro even language like that and come to a different conclusion. What value do my customers, even highly technical ones, gain by knowing it's Render's fault that MyCoolService was down? Are they going to open support tickets with Render? I'd bet Render very reasonably wouldn't appreciate that, and my customers aren't going to have a better line to Render's support than I do.
Resolved now, but an hour of downtime really shows you why you pay for bigger cloud providers with an SLA and customer support. Honestly, I wish we could have turned Cloudflare off for the duration of the issue rather than having our API down...
Maybe it's time to consider multiple CDN providers as an abstraction, the way you consider AWS/GCP as an abstraction.
That is actually a selling point too. A while back, after an outage, I looked at both Fastly and Cloudflare to replace the budget CDN at my previous job, but found both had had more, and more serious, downtime in the 24 months before that. So I just wrote a script to quickly switch between providers, but I'd rather not have to deal with it at all :-)
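The commenter's script isn't shown, so the following is only a minimal Python sketch of the selection logic such a switch might use, assuming a priority-ordered list of providers and a health check you supply. The `CdnProvider` type, the `pick_provider` function, and the hostnames are hypothetical, and the actual DNS or routing update is provider-specific and deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CdnProvider:
    """Hypothetical CDN entry: a label plus the hostname traffic should be routed to."""
    name: str
    target: str


def pick_provider(
    providers: Sequence[CdnProvider],
    is_healthy: Callable[[CdnProvider], bool],
    current: Optional[CdnProvider] = None,
) -> Optional[CdnProvider]:
    """Stay on the current provider while it is healthy (fewer switches);
    otherwise fail over to the first healthy provider in priority order.
    Returns None if no provider is healthy."""
    if current is not None and is_healthy(current):
        return current
    for provider in providers:
        if is_healthy(provider):
            return provider
    return None


if __name__ == "__main__":
    # Hypothetical providers in priority order; hostnames are placeholders.
    providers = [
        CdnProvider("primary", "www.example.com.primary-cdn.invalid"),
        CdnProvider("secondary", "www.example.com.secondary-cdn.invalid"),
    ]
    # Simulate an outage at the primary; a real check might probe an HTTP endpoint.
    down = {"primary"}
    chosen = pick_provider(providers, lambda p: p.name not in down, current=providers[0])
    print(chosen.name if chosen else "no healthy provider")  # prints "secondary"
```

Keeping the current provider while it stays healthy means a recovered primary doesn't immediately pull traffic back; that trades always using the preferred provider for fewer DNS flips.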
Not really, Render is a managed stack, just like Heroku. If you want full control over the stack you can obviously do that. You pick Render so these choices are managed for you.
I still find the trade-off worth it compared to rolling out our own infrastructure (no, k8s isn’t “easy”). And it looks like they are already working on a more robust solution to this particular issue.
Some don't like them on ideological grounds (centralization of a sizable portion of the internet); some don't like how Cloudflare can make your browsing experience miserable if you use Tor or turn on Firefox's anti-fingerprinting features (and Cloudflare is a major part of the reason these features are off by default).
I don't like Cloudflare skirting the responsibility of a hosting provider by claiming to be a neutral third party similar to an ISP instead of a company paid by their customers to distribute their content on the web.
Also, their not taking a stance on hosting despicable stuff like KF, which has literally bullied people to their deaths.
Unrelated, but the choice of background color (white) and title color (white) in the "hero" section of the website is poor on mobile. I assume the site UI wasn't tested on mobile? https://i.imgur.com/M1n6SYv.jpg
Everything is back up. We're waiting for Cloudflare's RCA and will follow up with additional Render context right after.
------------
(Render CEO) While Cloudflare investigates the issue on their end, we're also working on ways to bypass Cloudflare.
Really sorry about this, folks. We'll keep https://status.render.com updated and will post an RCA once things calm down.
Cloudflare have declared an incident at https://www.cloudflarestatus.com/incidents/2xffnv666yd7.