CloudFlare outage (System Status, back online now)

Nyr · on May 2, 2012

CloudFlare promotes their service as a highly redundant CDN but the truth is that this isn't the first outage and they fail at very simple things.

I was using them until one day they had routing problems with their DNS servers in some parts of the world (about 20% IIRC). This shouldn't have been an issue, except that all the name servers they provide were routed to the same network, which made all my services unavailable for more than an hour. Yes, they have anycast and all those cool things, but if they fail to provide real redundancy for DNS, I can't be their customer anymore.
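A rough way to sanity-check this for your own domain is to see whether the name server addresses fall into the same network block. The sketch below groups addresses by /24 prefix. The addresses are hypothetical (taken from the documentation ranges), and a shared /24 is only a crude proxy for "same network": with anycast, different prefixes can still depend on the same routing, so treat a clean result as a hint rather than proof.

```python
import ipaddress
from collections import defaultdict


def group_by_prefix(addresses, prefix_len=24):
    """Group IPv4 addresses by their enclosing network of the given prefix length."""
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False accepts a host address and returns its enclosing network.
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[network].append(addr)
    return dict(groups)


# Hypothetical name server addresses (documentation ranges, not real servers).
nameservers = ["192.0.2.10", "192.0.2.53", "198.51.100.7"]

groups = group_by_prefix(nameservers)

# Any prefix holding more than one name server is a candidate shared failure domain.
shared = {net: addrs for net, addrs in groups.items() if len(addrs) > 1}
for net, addrs in shared.items():
    print(f"{net} contains {len(addrs)} name servers: {', '.join(addrs)}")
```

In practice you would feed in the addresses your registrar or `dig NS` reports instead of the made-up list.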

true_religion · on May 2, 2012

I think CloudFlare isn't bad for a small-to-midsize site.

They're new in the industry and still learning.

If you have a large mature site, then it's likely your uptime will be better than theirs and you'll lose serious money for every minute of downtime. In that case, don't use CloudFlare because they're still learning from their mistakes.

benatkin · on May 2, 2012

> They're new in the industry and still learning.

They aren't that new, and they would have learned faster if they had better priorities. Not only do they seem to prioritize the script minification and code insertion over performance, they seem to prioritize popularity over customer service. It's easy to find reviews of CloudFlare where someone running a small-to-midsize site gave them a fair shot and was disappointed.

jgrahamc · on May 2, 2012

Perhaps change the title to "CloudFlare was down briefly" because it's not down any more. We were aware immediately that a problem was occurring and fixed it. To quote the people who watch the CloudFlare network night and day: "While tuning Asia performance, an improper router config was pushed, causing our upstream provider to misroute. Pinpointed issue and fixed."

Lots of red faces around the office and people apologizing.

qeorge · on May 2, 2012

This explains a lot, thanks. We noticed that the traceroute for one of our domains was running through Asia (usually DFW) and found it odd. It also changed shortly when the sites came back online.

Glad to know it was just a mistake though, as opposed to an attack or failure. Much less worried about recurrence.

vinayan3 · on May 3, 2012

Thanks. My site went down while we were trying to get some new users on today. We held on and things seem to be OK now!

wahnfrieden · on May 2, 2012

Updated, thanks.

larrys · on May 3, 2012

When I first considered cloudflare, the thing that kept me away (and this was quite some time ago) was the "quality" of the customers. Looking at the domains they hosted, I rarely saw anything but small and spammy sites. I also noticed that they had constant churn. They would bring on customers, but many customers were also leaving, in numbers, on every day I checked. This doesn't appear to have changed.

For the link below, simply change the date in the URL to any day and you will see the domains that are added and transferred out of cloudflare on a constant basis:

http://www.dailychanges.com/cloudflare.com/2012-05-02/
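If you want to step through several days, the URL pattern above can be generated rather than edited by hand. This is a minimal sketch that only builds the URL string. The function name is made up for illustration, and it assumes the site keeps using the `provider/YYYY-MM-DD/` layout seen in the link.

```python
from datetime import date


def dailychanges_url(day, provider="cloudflare.com"):
    """Build the dailychanges.com URL for a provider on a given day."""
    # date.isoformat() yields YYYY-MM-DD, matching the format in the link above.
    return f"http://www.dailychanges.com/{provider}/{day.isoformat()}/"


url = dailychanges_url(date(2012, 5, 2))
print(url)
```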

Added: To me, given what cloudflare is, it doesn't make sense that these domains are transferred out as frequently as they are, other than for service-related reasons. In fact, we had suggested CF to a customer; they lasted about a week on the service and had issues.

eli · on May 3, 2012

What are you comparing them to? Based on your logic they don't seem any worse than any other DNS provider I plugged in. E.g. http://www.dailychanges.com/dnsmadeeasy.com/2012-05-02/

Udo · on May 2, 2012

To their credit, they turned this issue around very quickly. Overall I have to say that Cloudflare is an excellent service.

This event does remind me of the predictable and somewhat obvious question though: do the benefits of a service like Cloudflare outweigh the downsides that inevitably come along with introducing another single point of failure to a website?

jyap · on May 2, 2012

Well, theoretically Cloudflare is designed with a decentralized approach, which means it is not a single point of failure.

I've noticed that large scale issues like this are usually down to botched router configurations or related networking changes.

Udo · on May 2, 2012

> it is not a single point of failure.

Obviously today Cloudflare did become a single point of failure for many sites, so I'm not quite sure I understand your point. I also don't believe you can design any service with 100% uptime. Things will go wrong.

jyap · on May 3, 2012

The operative word here is "theoretically". Theoretically the concept of CDNs (a broad term to describe the main aspect of Cloudflare's service) gives you greater replication and redundancy of data. Now this all depends on their overall design (not all CDNs are created equal). You can eliminate SPOFs through replication and redundancy.

It's like saying I have 2 cars I can use to get to work. Then you say "But what if both break down?" Uh, 2 cars breaking down is not considered a single point of failure.

Take the example of serving a single image on the internet.

It starts with DNS. You can have multiple DNS servers for your domain (no SPOF for DNS lookups). You have multiple web servers in different countries. Your web servers point to CDNs to serve the image. Your CDN has multiple DNS servers for their domain. They have multiple servers in different countries to serve up your image.

Tell me where the 100% uptime of the single image fails in that scenario.
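The redundancy argument can be put in numbers. The sketch below computes the chance that at least one of several replicas is up, under the assumption that replicas fail independently. The 99% figure is hypothetical. The result gets very close to 100% but never reaches it, and the independence assumption is exactly what a change pushed to every replica at once would violate.

```python
from math import prod


def parallel_availability(availabilities):
    """Probability that at least one replica is up, assuming independent failures."""
    # All replicas must be down at the same time for the service to be down.
    all_down = prod(1 - a for a in availabilities)
    return 1 - all_down


# Hypothetical: three replicas, each up 99% of the time.
three_replicas = parallel_availability([0.99, 0.99, 0.99])
print(f"{three_replicas:.6f}")
```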

To answer your original question: if value derived (can be high, say, for a news site) > risk involved (can be low depending on provider), then that is when the benefits outweigh the downsides. In most cases, a CDN is meant to give you better uptime as well as provide benefits such as geographically delivered content and the ability to serve your content to more people (e.g. videos and other large media content).

Udo · on May 3, 2012

> Then you say "But what if both break down?" Uh, 2 cars breaking down is not considered a single point of failure.

I'm afraid you might have spectacularly misunderstood my point.

The way Cloudflare operates is more closely related to a scenario where either one of the two cars failing brings down the site: that would be either the webserver or the CF infrastructure. Mathematically, the combined downtime of both must be greater than that of either one alone.
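To make the arithmetic concrete, here is a small sketch of availability for components in series, where every component must be up for the site to be up. The uptime figures are hypothetical and the calculation assumes the two fail independently. Under that assumption the combined availability is the product of the two, which can never exceed the weaker one.

```python
def series_availability(availabilities):
    """Availability of a chain where every component must be up (independent failures)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical uptimes: origin web server and the proxy/CDN in front of it.
origin = 0.999
proxy = 0.9995

combined = series_availability([origin, proxy])
print(f"origin alone: {origin:.4%}  proxy alone: {proxy:.4%}  both required: {combined:.4%}")
```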

> Tell me where the 100% uptime of the single image fails in that scenario.

There is no question that a CDN is generally designed to add robustness and speed to content delivery. But as you said, not all CDNs are created equal. I say this as a (satisfied) Cloudflare user myself:

In its default configuration, a basic Cloudflare plan has (almost) no settings. It's not like you make a choice e.g. to host only images there. Using the standard CF plan comes with basically two states your site can be in: either CF is on or it's off. When it's on, the entire traffic of that site goes through Cloudflare; they become your site's front-facing servers. There are some huge advantages to this, for example they block a lot of malicious traffic that way.

Coming back to your image example: individual components of the service may be designed for redundancy, but there is still a lot of stuff that can (and does) go wrong with global repercussions, if only for the simple reason that the CDN service as a whole must be centrally controlled.

If your webserver is up but CF is down, your site is down. If CF is up but your webserver is down, your site is down (actually it becomes a mirror of some static content for a few minutes before it goes offline completely). This is what I meant by each one of those services being a single point of failure; there is really no way of getting around that fact.

I remember the same discussion about CDN-hosted JavaScript libraries. People argued that linking to a 3rd-party server made their site more robust simply because CDNs normally have a higher uptime than a standard web hosting server. This was of course completely beside the point, because (again) either one of the two failing meant the site would break. That's why it has become customary to have a local fallback for CDN-hosted JS libraries now.

larrys · on May 3, 2012

"In its default configuration, a basic Cloudflare plan has (almost) no settings. It's not like you make a choice e.g. to host only images there. "

Correct, but you could use a completely separate domain for all the images and only enable cloudflare for that domain. (Not a cloudflare customer, I'm just pointing this strategy out.)

jyap · on May 3, 2012

Your original argument was based on "a service like Cloudflare". I was discussing services like Cloudflare, not Cloudflare specifically.

My point is that a properly designed CDN (as most of the common ones are) should not suffer a widespread outage where 80+% of its service is down.

codexon · on May 2, 2012

They seem to have random unexplained outages every other week that aren't even mentioned on Twitter.

jaytaylor · on May 2, 2012

Twitter search is nice for staying up to date on this: https://twitter.com/#!/search/cloudflare

eli · on May 2, 2012

Seems to be back now.

https://twitter.com/#!/CloudFlareSys/status/1977854413600235... blames an upstream network issue. But I dunno, cloudflare.com was giving me a 502 error from nginx, which indicates a cloudflare-backend problem.

jjoe · on May 2, 2012

Just thought I'd expand on this part of your comment: "cloudflare.com was giving me a 502 error from nginx, which indicates a cloudflare-backend problem"

Everything is a backend (e.g. your server). Your CF-hosted website is a backend to their front end (Nginx). Except the "backend" here is located in a remote network where your actual server is hosted. So it could very well be a routing issue between their network (Nginx nodes) and your server. Hence the 502 (backend unreachable).
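As an illustration of the front end/backend relationship, here is a toy sketch of how a reverse proxy might turn an upstream failure into a status code for the client. It is not Nginx's or CloudFlare's actual logic. The function and the failing origin are hypothetical, and the mapping (connection failure to 502, timeout to 504) follows the general HTTP meaning of those codes.

```python
from http import HTTPStatus


def proxy_status(fetch_from_origin):
    """Return the status a reverse proxy would typically send after trying the origin."""
    try:
        fetch_from_origin()
    except TimeoutError:
        # The origin did not answer in time.
        return HTTPStatus.GATEWAY_TIMEOUT
    except ConnectionError:
        # The origin could not be reached at all (refused, reset, no route).
        return HTTPStatus.BAD_GATEWAY
    return HTTPStatus.OK


def unreachable_origin():
    """Hypothetical origin that the proxy cannot connect to."""
    raise ConnectionRefusedError("no route to origin")


status = proxy_status(unreachable_origin)
print(int(status), status.phrase)
```

The point is that the proxy itself is up and answering; the error code describes the hop behind it.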

Regards

micro-ram · on May 2, 2012

Yes, but CloudFlare is supposed to show a cached copy of my site when my server (i.e. backend) is unreachable.
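The behaviour being described is usually called serving stale content on error. A minimal sketch of the idea follows, assuming an in-memory dict as the cache and a caller-supplied fetch function. All names here are hypothetical and this is not how CloudFlare implements it.

```python
def serve(path, fetch_from_origin, cache):
    """Serve from the origin, falling back to the last cached copy if the origin fails."""
    try:
        body = fetch_from_origin(path)
    except OSError:
        # Origin unreachable: serve the last good copy if we have one.
        if path in cache:
            return 200, cache[path], "stale"
        return 502, b"", "miss"
    # Origin answered: refresh the cache and serve the fresh copy.
    cache[path] = body
    return 200, body, "fresh"


def origin_up(path):
    return b"hello"


def origin_down(path):
    raise ConnectionError("origin unreachable")


cache = {}
first = serve("/", origin_up, cache)         # fills the cache
second = serve("/", origin_down, cache)      # falls back to the cached copy
third = serve("/other", origin_down, cache)  # never cached, so nothing to fall back to
```

Note that the fallback only helps for pages that were cached before the failure, and only if the proxy can still be reached by visitors.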

wahnfrieden · on May 2, 2012

Their status page is currently 500ing, but they have a Twitter account for it too:

https://twitter.com/#!/cloudflaresys

"Investigating upstream network issues in EU." -- even though the outage appears to be global...

on May 2, 2012

[deleted]

Udo · on May 2, 2012

Same here, glad I don't have anything critical hosted there. They have just come back online again though.

wahnfrieden · on May 2, 2012

"Expanding investigation for upstream network issues affecting all locations."

"Network issue should be clearing up. Continuing to monitor."

Seems to be back up now.

pwenzel · on May 2, 2012

My Pingdom monitor started making notes a few minutes ago. Having the same downtime problems on one of my sites.