My mistake. An early draft (with incorrect information) was included in the article. Our network guys corrected me. Here's what it's being updated to read:
===
Finally, while 10Gbps Ethernet can run across standard Cat6/6a twisted-pair cable, we elected to use SFP+ connectors. We chose this for the flexibility to switch between optical (fiber) and copper connections. Some network card and switch vendors lock down their equipment to support only their own proprietary SFP+ modules, for which they charge a significant premium. We spent significant time testing combinations of SFP+ vendors before finding FiberStore, an SFP+ manufacturer from which we could directly source modules at a reasonable price that worked in the network gear we wanted to use.
For anyone interested in copper vs. fiber, take some time to read Chapter 1 of Nathan Farrington's dissertation: nathanfarrington.com/papers/dissertation.pdf
As 10Gbps, and even 40G/100G, gain traction, you start running into cable length issues; the chapter covers attenuation and maximum-length figures for AWG 7 copper wire.
I love how a business can focus on doing one thing well, recruit some awesome talent, and end up doing some great R&D and pushing the envelope in their specific field. It's great to see companies like this succeeding.
So DDoS could be...
1) an L3 packet you would drop, but still saturating your uplink (e.g. DNS amp)
2) a request for some static asset (img, html, octet...)
3) a request for a dynamic page (hitting your app farm)
CloudFlare provides something like an automatically populated CDN that includes defense against #1 and #2, running on distributed data centers and servers with 10GbE and SSDs. They said 23 data centers (locations), but didn't mention how many of these servers they run.
Apparently they are able to run HTTP sessions over IP anycast without any issue (I had read claims that you could only do UDP), so that's pretty cool. BTW - I wish it were easier to set up anycast on your own... it seems like a major investment at the moment.
It's interesting that they don't need more CPU power -- I would have expected more CPU to be deployed, since having minimal CPU headroom could itself provide an attack vector.
Another thing I'm curious about is how they distribute load between those cute servers they built. Do they segregate specific customers/domains onto specific IPs and then route those IPs to specific boxes in each data center? Or does everything come in "equal" and then get divided up round-robin/least-load by some massive load balancers? Basically, I wonder how much of the load balancing they try to do "client-side" or "client-based" versus how much they do strictly on the back-end, and what devices they are using for it.
As far as CPU goes, there aren't a whole lot of cycles dedicated to decoding and replying to network packets, and serving static content is incredibly trivial. The interrupts from high volumes of network traffic (usually small packets) are arguably the biggest impact on the CPU, which is why you get network cards that offload L3 processing much more efficiently than your CPU can. Network appliance vendors rely on them to help do things like transparently filter traffic on 40Gb/s interfaces in real time, which would probably be impossible with a normal CPU.
They mentioned in a previous post that they mainly rely on L3 routing to load balance. And I don't remember if they specified this, but they hinted in the post that all customer traffic can be served by any frontend box; it just pulls the customer data from their main storage and caches it on frontends as needed, as any good proxy does.
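A minimal sketch of that pull-through behaviour, in Python (the class and names here are hypothetical, not CloudFlare's actual code):

    import time

    class PullThroughCache:
        # Any frontend can serve any customer: on a miss, fetch from the
        # central store / origin and keep a local copy for next time.
        def __init__(self, fetch_from_origin, ttl=300):
            self.fetch_from_origin = fetch_from_origin  # callable(key) -> bytes
            self.ttl = ttl
            self.local = {}                             # key -> (expires_at, body)

        def get(self, key):
            hit = self.local.get(key)
            if hit and hit[0] > time.time():
                return hit[1]                           # served from this frontend's cache
            body = self.fetch_from_origin(key)          # miss: pull from origin/central storage
            self.local[key] = (time.time() + self.ttl, body)
            return body

    cache = PullThroughCache(lambda key: b"<html>...</html>")
    cache.get("example.com/index.html")                 # miss: pulls from origin
    cache.get("example.com/index.html")                 # hit: served locally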
The fact that high-frequency trading is big and specialized enough to merit hardware vendors catering specifically to their niche shouldn't have been surprising to me, but totally was.
In general, though, Cloudflare's posts continue to be totally fascinating.
I hope I'm not the only one missing the point of HFT.
Yes, you can make a lot of money buying and selling stocks, but what is HFT actually contributing to the economy other than paying HR costs for its employees?
Proponents argue HFT provides liquidity and increased market efficiency.
Market efficiency relies on extreme competition between investors who are highly intelligent and able to make quick decisions. When you can capture much of that decision-making in algorithms, what better way to deliver competition than HFT?
Opponents argue markets were efficient enough pre-HFT and that any benefits delivered by HFT are more 'academic' than practical. Opponents also argue that the level of uncertainty ("inexperience") with HFT brings more risks than argued benefits, as well as all of the "usual" criticisms of 'trading' vs. 'investing'.
As you may be able to tell, I myself am sitting on the fence on this issue, albeit with both feet pointing towards the opponents' side.
You might misunderstand what the purpose of HFT is - it's not about making a small margin on a huge volume of trades, it's all about order fulfillment (liquidity).
I'll try and illustrate with a contrived example. Say I want to buy 1000 shares at no more than $1 each. By using HFT, a trading firm can put together the order 'package' by combining many smaller trades at varying prices such that the average price comes out at the lowest (or target) price.
The competitive edge for a trading firm comes from being able to consistently fulfill orders, meaning they get more orders / customers.
Does that kind of explain the utility of HFT? Yes, it allows a trading firm to make more money, but the way they do that is by providing a better service - not by simply executing a huge volume of trades and making fractions of a cent from each one.
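To put toy numbers on that contrived example (a sketch only - real execution engines are far more involved):

    def average_fill_price(fills):
        # fills = [(shares, price), ...]: many smaller trades at varying prices
        shares = sum(q for q, _ in fills)
        cost = sum(q * p for q, p in fills)
        return shares, cost / shares

    # Buy 1000 shares at no more than $1.00 average by combining partial fills:
    fills = [(400, 0.97), (350, 1.00), (250, 1.02)]
    shares, avg = average_fill_price(fills)
    print(shares, round(avg, 3))   # 1000 shares at an average of $0.993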
Either way, it doesn't seem to be about actually funding companies. Maybe stock trades can help influence companies to change strategies or leaders, but I don't see the point of doing it in a forum from which the companies will never see the investment.
I don't think I understand the stock market as being anything more than a gambling game for people with a ton of money. Not sure what the IPOs of Facebook, Groupon, or Zynga did for anyone other than top execs who were already making a ton of cash per year, or the traders who bought and sold options.
Has the liquidity argument been proven in practice? I'm not really close to trading, so the only time I hear about HFT is when it wreaks havoc in the market because of maladjusted algorithms.
Theoretically, it increases liquidity - the chances of a trade executing, and it keeps the trading price of the traded stocks close to their real value, whatever that is.
In practice, it's probably fairly neutral, so long as no-one is playing silly b@#$%^rs, front-running other people's trades or similar.
>> we are on our fourth generation (G4) of servers. Our first generation (G1) servers were stock Dell PowerEdge servers. We deployed these in early 2010
Wow, 4 generations of servers in 3 years. Talk about iterating quickly.
Was Cloudflare bootstrapped or did they start with a huge investment? 23 datacenters full of equipment sounds like a lot to me.
More than that -- and different amounts in different locations (e.g., London has more servers than Toronto) -- but correct that nowhere are we filling whole buildings with gear.
If CloudFlare is using kernel version 3.3 or higher, they should look into using fq_codel as their queuing discipline instead of pfifo_fast to decrease latency under load. I suspect the 16MB buffer in the NIC isn't doing them any favors.
I asked one of our kernel guys. Here's his response:
"We are indeed looking at Codel. We were actually working on backporting BQL+Codel to the 2.6.x kernel but the Google guys finally got the network stack under control enough for us to deploy >3.3. The 16MB of buffers hasn't hurt us much yet, and may in the long run save us from switches that have too shallow a buffer for the high contention ratios we run on the switch." -LinuXY
Buffer bloat is a real thing; it has to do with how the queuing algorithms work - I'm not the best at explaining it. I think it's mainly an issue in home routers / cable modems, but it can't hurt to do some Active Queue Management everywhere!
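Back-of-the-envelope on why a big NIC buffer can turn into latency (assuming a fully backed-up 16MB buffer draining at 10Gbps line rate; purely illustrative):

    BUFFER_BYTES = 16 * 1024 * 1024     # the 16MB on-card buffer mentioned above
    LINE_RATE_BPS = 10 * 10**9          # 10Gbps

    drain_ms = BUFFER_BYTES * 8 / LINE_RATE_BPS * 1000
    print("worst-case added queueing delay: ~%.1f ms" % drain_ms)   # ~13.4 ms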
"Adding more cores to a CPU did help mitigate this and we tested some of the high core count AMD CPUs, but ultimately decided against going that direction."
You tested them, but you did not mention why you ultimately decided against them. Was there something specifically worse about the AMD CPUs, or is Intel giving you a discount for keeping your servers all Intel (i.e., NICs, SSDs, etc.)?
> While top clockspeed was not our priority, our product roadmap includes more CPU-heavy features. These include image optimization (e.g., Mirage and Polish), high volumes of SSL/TLS connections, and extremely fast pattern expression matching (e.g., PCRE tests for our WAF). These CPU-heavy operations can, in most cases, take advantage of special vector processing instruction sets on post-Westmere Intel chips. This made Intel's newest generation Sandybridge chipset attractive.
and
> We were willing to sacrifice a bit of clockspeed and spend a bit more on chips to save power. We tend to put our equipment in data centers that have high network density. These facilities, however, are usually older and don't always have the highest power capacity. We settled on our G4 servers having two Intel Xeon 2630L CPUs (a low power chip in the Sandybridge family) running at 2.0GHz. This gives us 12 physical cores (and 24 virtual cores with hyperthreading) per server. The power savings per chip (60 watts vs. 95 watts) is sufficient to allow us at least one more server per rack than we'd be able to get if we went with the non-low power version.
So a combination of additional instructions and power savings.
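If you want to see whether a given box exposes those instruction sets, the CPU flags are visible in /proc/cpuinfo on Linux; a quick sketch (the three flags checked are just examples):

    def cpu_flags(path="/proc/cpuinfo"):
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    flags = cpu_flags()
    for feature in ("aes", "pclmulqdq", "avx"):   # AES-NI (SSL), carry-less multiply, vector math
        print(feature, "present" if feature in flags else "absent")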
"Specifically, we saw a 50% performance benefit addressing disks directly rather than going through the G3 hardware RAID."
Wow, 50%! Is this because RAID controller performance hasn't kept up with the move from spinning disks to SSDs, or have RAID controllers always had that much overhead?
No, it's just that their workload is not well served by any RAID level: many small files, of which it's okay to completely lose a disk's worth, that are accessed very randomly and unevenly.
They actually only wanted load balancing, and I'm sure their purpose-built solution does a better job of staying balanced while avoiding the increased risk from striping or the performance loss from mirroring or parity (I'm curious what level(s) they were using). Cutting out the RAID layer when they didn't need it also saves them a trip through the controller, which matters more these days when compared to SSD "seek" times.
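A toy sketch of that "balance objects across independent disks, no RAID" idea (the layout and paths are hypothetical, not CloudFlare's actual scheme):

    import hashlib

    SSDS = ["/cache/ssd0", "/cache/ssd1", "/cache/ssd2", "/cache/ssd3"]  # independent filesystems

    def ssd_for(key, ssds=SSDS):
        # Spread cache objects evenly across disks. Since it's all cache,
        # losing one disk just means its objects get re-fetched from origin;
        # there's no array to rebuild and no controller in the I/O path.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return ssds[h % len(ssds)]

    print(ssd_for("example.com/assets/logo.png"))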
I absolutely love CloudFlare! I save money by paying ~$25 a month for their service, because my bandwidth has been cut by over 65% since I signed up (the lightning-fast loading, caching, security features and more are just icing on the cake). Instead of having to deal with optimizing my site for speed, they do it all at the click of one button. It's the best service I have ever signed up for, and I love it!
If CloudFlare ever offered optimized hosting (with PHP + MySQL), I would sign up in a snap and move all my websites there.
Had the opposite experience - no change in loading speeds when our ~600K visitor / month site switched to CloudFlare in March of this year, but we soon started experiencing long website downtimes where our server was running fine but CloudFlare was not serving our site, despite paying $200 a month for "always online", which the site clearly was not. All IPs were whitelisted with our host and everything else CloudFlare recommended; all CloudFlare would do was examine the site after it had come back online (following hours of downtime) and say, "Everything looks fine to us!"
We finally left, after months of this, and have had no problems with downtime since. I really wanted CloudFlare to work - I was really excited about it when I signed us up. But at least for a bigger site with heavier traffic that relies on being up as much as possible, I can't say I'd recommend it until it straightens out its downtime issues (especially when paying for "always online").
I keep hearing sporadic reports like this. The big thing with CloudFlare is that you are putting them as the first hop in reaching your site, so they are an additional point of failure. Of course, it's not a zero-sum game; they could also end up increasing your uptime overall, and in many cases I believe that's what happens.
Particularly for relatively low volume sites which have a short burst in traffic on occasion, CloudFlare can keep those sites running during the peaks.
I think the most important thing is transparency and correct expectations. If they set clear expectations, and they are transparent about how well they are meeting them, then it just comes down to delivery.
I found their status dashboard here: https://www.cloudflare.com/system-status. Unfortunately it doesn't show much long-term historical performance; it would be nice to see 30 or even 180 days of history to really evaluate them.
regal, did you find that when you had downtime on your site it was reported in their status dashboard, and that the dashboard was an accurate depiction of the service they provided? I think the worst-case scenario is getting hit with unreported downtime, because that brings up all sorts of questions.
> I think the most important thing is transparency and correct expectations. If they set clear expectations, and they are transparent about how well they are meeting them, then it just comes down to delivery.
Agreed. So long as a customer knows what he/she's signing up for, and gets that, everything's fine. I might have misread what the "99.99% uptime guarantee" was supposed to be for and gotten too excited about it / taken it too seriously when I first signed up, or maybe this is for something else that's too complicated for a part-time tech guy like me to understand.
When I'd log in when the site was down, half the time CloudFlare would have the green arrow next to the site with a "Site Online" type indicator; other times it'd have the brown dot-dot-dot "Site Offline" indicator. I'd confirm numbers on this, but apparently the service doesn't save this data, or makes it unavailable to you once the account is terminated. Pingdom Tools would report the site as down, and when visiting the site, it wouldn't load, or would take 10+ seconds to load. There would also frequently be a "This website is offline; no cached version is available" page from CloudFlare when trying to load the site, even on the homepage, despite the guarantee to supposedly be saving and serving cached copies of the site in the event of downtime (and despite that being what I thought we were mainly paying for) - sometimes those cached copies would show up too, though more often there'd just be that error page.
I opted to use CloudFlare almost as soon as they launched and they indeed had some uptime issues in the beginning. But I stuck with them and I haven't had any issues for a long while now even at 180M page views / month.
There was one time where I had just upgraded a server at a host and load averages spiked. The host's techs were clueless, as was the server admin I hired to diagnose the issue. They surmised that it was a DDoS attack and told me to turn off CloudFlare because it might be a cause (what?). Well, I'm not a server admin, so what do I know, and I asked CloudFlare about it. They said there were no problems on their end. After a month of stressful, intermittent downtime, I decided to just switch to SoftLayer, and lo and behold, the issues went away. It turned out that one of the SSDs in the RAID array was dying, but the techs at the other host just never bothered to look at it.
If you ever do figure out the issues, try giving them another shot. Their features are excellent, and I hate to say it, but my applications are now so dependent on them that important parts will break if I even try to move to a competitor (and there are none).
We chose SolarFlare+Onload over Intel with pf_ring+DNA or DPDK mainly because of the fully featured TCP stack. While it may make sense for us to develop our own in the future, it does not currently. Additionally, the SolarFlare cards gave us the benefit of 16MB buffers, which could allow us to go with switches that have shallow buffers (cheaper). There's also processing done on an FPGA on the card itself which allows us to drop packets on the card before they ever reach the machine, which is /really/ a boon under DDoS. SolarFlare has been a great partner in their willingness to work on our (non-standard) use case, which is something that is hard to find when dealing with larger vendors.
(I fully understand designing for the event - but the emphasis on it in the post makes it seem that you're under constant threat. I am assuming it is your customers that are actually being DDoS'd and Cloudflare just needs to be built up to stand against DDoS in this case??)
It is usually our customers who are attacked, but that hits our network so we need to be able to mitigate it. Last week we saw 163 "significant" attacks (which is a fairly typical week). A "significant" attack is one that generally exceeds 10Gbps, 5M PPS, or finds another way to affect other customers to the point that our ops team is alerted.
Cool writeup. Anyone recommend using the cloudflare service for commercial sites with sensitive data?
Seems like the system is robust, but I was looking for information on their policies regarding access-log retention and couldn't find much information online. Seems like they got a subpoena in the Barrett Brown case, and not sure how that all worked out.
Our business depends on the ability to process a byte of information as inexpensively as possible. Fighting for the lowest possible hardware pricing is part of that. While agreements keep us from disclosing pricing details, I can say that we work extremely hard to ensure we're getting the best price from the vendors we choose to work with.
It surprises me that the SYN attacks are being mitigated on the machines themselves; I was under the impression that this is typically done with hardware firewalls that offload the TCP handshake (thus filtering out spoofed SYN packets and other connections that the remote machine doesn't actually intend to establish).
It does seem like doing it on the target machine will reduce latency a bit, though, since the hardware TCP offloaders usually repeat the TCP handshake (this time to the actual server) after confirming that it's valid.
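One common stateless way to filter spoofed SYNs without a middlebox is SYN cookies - the listening host encodes state into the initial sequence number instead of allocating anything for the half-open connection. A rough Python sketch of the idea (field layout and secret are made up; this isn't CloudFlare's or the kernel's exact scheme):

    import hashlib, time

    SECRET = b"per-host-secret"

    def syn_cookie(src_ip, src_port, dst_ip, dst_port, mss_idx):
        # Only a client that really received our SYN-ACK can echo back ISN+1,
        # so spoofed SYNs cost us nothing but the reply.
        t = (int(time.time()) >> 6) & 0x1F                       # coarse timestamp, 5 bits
        msg = ("%s:%d>%s:%d:%d" % (src_ip, src_port, dst_ip, dst_port, t)).encode()
        h = int.from_bytes(hashlib.sha1(SECRET + msg).digest()[:3], "big")  # 24-bit MAC
        return (t << 27) | (mss_idx << 24) | h                   # pack into a 32-bit ISN

    print(hex(syn_cookie("203.0.113.7", 44321, "198.51.100.1", 80, mss_idx=2)))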
A high-end "hardware firewall" is now an x86 server (usually previous-generation) running some fairly expensive proprietary software. For a company like CloudFlare that has good scale and security as a core competency, I think doing it themselves makes sense.
Bypassing the kernel made me curious: why not develop the whole thing in kernel space? It has the huge downside that any crash can cause a kernel panic, but then you no longer have to worry about the performance costs of virtual memory, page misses, swapping, context switching, etc. And depending on the hardware, after a crash it can reboot in less than 10 seconds.
Unless you literally had a single thread of execution running on each core, you would still have to worry about all those things you mentioned. Presumably, Cloudflare's software is too complex to make that feasible. So while it makes sense to bypass the kernel for the biggest bottleneck in their system, processing IP traffic, it wouldn't make sense to give up the convenience the kernel provides for things like scheduling, disk drivers, etc. Plus, security would obviously be a concern. Running complex internet-facing software in ring 0 on the bare metal all the time is like riding a motorcycle naked. Sure, you might be able to go a bit faster, but if things go wrong, there is literally no protection.
That hasn't been our experience. While we've optimized our file system to minimize wear, we do an extremely high volume of reads and writes on our SSDs. We have many SSDs (previous generations) that have been running full steam for 3 years in production. We've been pleasantly surprised with the number of write cycles they can endure without failure.
How do you measure writes in this case? Do you use SSD write caching? Is your filesystem caching the writes? Would love to see some stats/graphs to show real-world metrics of disk resilience.
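For what it's worth, one way to pull host-write volume on Linux is from the drive's SMART counters; a rough sketch (attribute names and units vary by SSD vendor, so treat this as illustrative):

    import subprocess

    def total_bytes_written(dev="/dev/sda", sector_bytes=512):
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "Total_LBAs_Written" in line:         # name/units differ between vendors
                return int(line.split()[-1]) * sector_bytes
        return None

    print(total_bytes_written())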
Care to share the brand/variety? So far I've killed Intel, Kingston, Micron, OCZ and a couple offbrands. Wasn't terribly surprised by the offbrands, a little miffed at the Intel though...
This sort of scaling seems to be logarithmic. 16MB is 5 orders of magnitude greater than 512KB, however the logarithmic difference is 3.46. The same increase in 5 orders of magnitude beyond 16MB is 10.39 (at 524GB) which is considerably more impressive.
That's some fancy math and it probably is completely off, but another explanation is NIC performance doesn't follow exponential scaling for unexplained reasons.
The article suggests interrupt load is the major issue, although it doesn't really say enough to tell. A 12.5ms buffer means potentially just 80 wakeups/sec. They don't mention whether they use multiple hardware queues or if all interrupts hit a single core.
I'd also be interested to know if polling mode was tested with any of the cards, and why it didn't work out.
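Rough numbers on why the interrupt rate, rather than raw per-packet CPU work, is the worry (illustrative only, ignoring Ethernet framing overhead):

    LINE_RATE_BPS = 10 * 10**9          # 10Gbps
    BUFFER_BYTES  = 16 * 1024 * 1024    # 16MB card buffer (~12-13ms at line rate)
    SMALL_PACKET  = 64                  # bytes, worst-case flood traffic

    per_buffer_fill = LINE_RATE_BPS / (BUFFER_BYTES * 8)   # ~75 wakeups/sec
    per_packet      = LINE_RATE_BPS / (SMALL_PACKET * 8)   # ~19.5M interrupts/sec
    print("one wakeup per full buffer: ~%.0f/sec" % per_buffer_fill)
    print("one interrupt per 64B packet: ~%.1fM/sec" % (per_packet / 1e6))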
Seeing as there are CloudFlare employees monitoring this discussion, what would you say about CloudFlare's relatively poor showing in the Cedexis country reports (http://www.cedexis.com/country-reports/)? E.g., average response time for the US is 179ms vs. 96ms for CloudFront.
That's the beauty of specialisation and economies of scale. They can justify, and afford, the extra sticker price and coordination cost to set it all up. And we customers win from that improvement.
Definitely several things we can learn from here. We have to do something similar, though we're still at a much smaller scale. That said, the wall'o'scaling is looming large, and we're finding that even our initial steps toward building our own hardware are paying dividends.
I'm quite interested in the disk I/O lessons you've learned, especially when dealing with large amounts of RAM. We have to store large indexed data stores (NoSQL, usually Redis) for persistent, extremely high-speed access. There's a lot to learn here from what you did with SSDs to back that up, especially the lack of RAID.
Interesting read. Your scale and capacity concerns are problems I'm hoping to have. We're still at an entry-level stage where we're purchasing refurbished gear and slowly scaling horizontally across facilities. No need to have someone build equipment just for us yet, but your SSD and processor selections are pretty interesting, and we'll probably build similar Supermicro boxes based on your experiences. Thanks again for the write-up. Happy CloudFlare Pro customer here.
Why do people still try to DDOS CloudFlare protected websites? The Cyberbunker-Spamhaus incident showed they can survive 300 Gbit/s, I thought after that everyone would just back off.
Sweet write-up. I'm curious how CloudFlare's needs are similar to or different from other companies' (i.e. Facebook, Google, Twitter, etc.). It seems to me there would already be quite a lot of iteration, and therefore experience/knowledge about what the best setup is? Though a large portion of it seems tied to new hardware coming out.
I work for a company building servers for Facebook. They really did try to follow the Open Compute (http://www.opencompute.org/) architecture but ... um, it didn't work out particularly well. They turned to us for tweaking, Quanta for server manufacturing, and us again for rack integration and large scale testing.
<edit> We are the OEM (i.e., we design and build) for large-scale storage arrays for Amazon & Netflix, too, but not compute servers.
To clarify the pronouns, "they" here does not refer to CloudFlare. It may refer to Facebook, but I hope not as it would certainly be a violation of a NDA.
After reading the article I took a look at their website. I even watched the 4 minute cartoon they made, it was funny, but ultimately didn't really tell me what they actually do. I guess I'm not their target audience.
Cloudflare is a CDN with extra bells and whistles.
A CDN is a way to off-load the bulk of the requests to your webserver by moving the content as close as possible to your end-users, reducing the number of hops required to get to the content, which in turn increases end-user satisfaction with your product due to decreased page load time.
The theory is that if a user gets a snappy service they are more willing to spend their money, and so e-commerce sites and sites that tend to monetize their users in some way find benefits in using services like these.
I hope that explains it adequately. To label CloudFlare a mere CDN is a disservice to them, but for explanation purposes it might as well be one; I'm sure someone from CloudFlare can give a much better explanation of why their offering is not just an ordinary CDN but goes much further than that.
I find it nasty how some vendors provide proprietary SFP+ connectors. I wouldn't deal with such types. They should make it an official standard and end the extortion.
Realistically you can escape the shakedown by buying "vendor compatible" transceivers. If they officially supported random transceivers they'd just make up the lost profit by increasing support prices.
No, that is the release you need to use to remove the PS from the chassis. You can see the two latches it moves on the right side of the case, a cm or two from the back.
(edit to include links)
http://www.intel.com/content/www/us/en/network-adapters/conv...
https://en.wikipedia.org/wiki/10-gigabit_Ethernet#10GBASE-T