
A Tour Inside CloudFlare's Latest Generation Servers - jgrahamc
http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-generation-servers
======
danielsiders
"10Gbps Ethernet doesn't run across standard Cat5/6 cable. Instead, you need
what is known as an SFP."

(edit to include links)

[http://www.intel.com/content/www/us/en/network-adapters/conv...](http://www.intel.com/content/www/us/en/network-adapters/converged-network-adapters/ethernet-x540.html)

[https://en.wikipedia.org/wiki/10-gigabit_Ethernet#10GBASE-T](https://en.wikipedia.org/wiki/10-gigabit_Ethernet#10GBASE-T)

~~~
eastdakota
My mistake. Early draft (with incorrect information) included in the article.
Our network guys corrected me. Here's what it's being updated to read:

===

Finally, while 10Gbps Ethernet can run across standard Cat5/6 cable, we
elected to use SFP+ connectors. We chose this to have the flexibility between
optical (fiber) and copper connections. Some network card and switch vendors
lock down their equipment to only support proprietary SFP+s, which they charge
a significant premium for. We spent significant time testing a combination of
SFP+ vendors before finding FiberStore, an SFP+ manufacturer from which we
could directly source SFP+s at a reasonable price that worked in the network
gear we wanted to use.

~~~
brooksbp
For anyone interested in copper vs fiber, take some time to read Chapter 1 of
Nathan Farrington's dissertation:
nathanfarrington.com/papers/dissertation.pdf

As 10Gbps, and even 40G/100G, gain traction, you start running into cable
length issues:

Attenuation and maximum length of AWG 7 copper wire:

    Bit Rate (Gb/s)   dB/m    Loss/m   Max length (m)
      1               0.721   15.3%    106
     10               2.28    40.8%     33.5
     40               4.56    65.0%     16
    100               7.21    81.0%     10.6
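
A quick sketch of that arithmetic: the per-meter loss follows from the dB/m
figure, and the listed maximum lengths imply a total link budget of roughly
76 dB (that budget is inferred from the table, not stated here):

    # Reproducing the copper-attenuation table above. The ~76 dB total link
    # budget is an assumption inferred from the listed lengths.
    rates_db_per_m = {1: 0.721, 10: 2.28, 40: 4.56, 100: 7.21}  # Gb/s -> dB/m
    LINK_BUDGET_DB = 76.4

    for rate, db_per_m in rates_db_per_m.items():
        loss_per_m = 1 - 10 ** (-db_per_m / 10)   # fraction of power lost per meter
        max_len = LINK_BUDGET_DB / db_per_m       # meters before the budget is spent
        print(f"{rate:>3} Gb/s: {loss_per_m:.1%} loss/m, max ~{max_len:.1f} m")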

------
zaroth
I love how a business can focus on doing one thing well, recruit some awesome
talent, and end up doing some great R&D and pushing the envelope in their
specific field. It's great to see companies like this succeeding.

So DDoS could be...

    
    
      1) an L3 packet you would drop, but still saturating your uplink (e.g. DNS amp)
      2) a request for some static asset (img, html, octet...)
      3) a request for a dynamic page (hitting your app farm)
    

CloudFlare provides something like an automatically populated CDN which
includes defense against #1 and #2, and they're using distributed data
centers, and servers with 10GE and SSDs to run their network. They said 23
data centers (locations), but didn't mention how many of these servers they
run.

Apparently they are able to run HTTP sessions over IP ANYCAST without any
issue (I read claims you could only do UDP), so that's pretty cool. BTW - I
wish it were easier to set up ANYCAST on your own... it seems like a major
investment at the moment.

It's interesting they don't need more CPU power -- I would have expected more
CPU would be deployed, since having minimal CPU could provide an attack
vector.

Another thing I'm curious about is how they are distributing load between
those cute servers they built? Do they segregate specific customers/domains
onto specific IPs and then route those IPs to specific boxes in each data
center? Or does everything come in "equal" and then get round-robin/least-load
divided up by some massive load balancers? Basically, I wonder how much of the
load balancing do they try to do "client-side" or "client-based" versus how
much do they do strictly on the back-end, and what devices they are using for
it?

~~~
eastdakota
The best way to get answers to your questions:

[https://www.cloudflare.com/join-our-team](https://www.cloudflare.com/join-our-team)

------
apendleton
The fact that high-frequency trading is big and specialized enough to merit
hardware vendors catering specifically to their niche shouldn't have been
surprising to me, but totally was.

In general, though, Cloudflare's posts continue to be totally fascinating.

~~~
revelation
Lots and lots of investment for high frequency trading. For example, a new
undersea cable to cut 6ms between London and New York [1].

1:
[http://www.telegraph.co.uk/technology/news/8753784/The-300m-...](http://www.telegraph.co.uk/technology/news/8753784/The-300m-cable-that-will-save-traders-milliseconds.html)

~~~
mscarborough
I hope I'm not the only one missing the point of HFT.

Yes, you can make a lot of money buying and selling stocks, but what is HFT
actually contributing to the economy other than paying HR costs for its
employees?

~~~
sturadnidge
You might misunderstand what the purpose of HFT is - it's not about making a
small margin on a huge volume of trades, it's all about order fulfillment
(liquidity).

I'll try and illustrate with a contrived example. Say I want to buy 1000
shares at no more than $1 each. By using HFT, a trading firm can put together
the order 'package' by combining many smaller trades at varying prices such
that the average price comes out at the lowest (or target) price.

The competitive edge for a trading firm comes from being able to consistently
fulfill orders, meaning they get more orders / customers.

Does that kind of explain the utility of HFT? Yes, it allows a trading firm to
make more money, but the way they do that is by providing a better service -
not by simply executing a huge volume of trades and making fractions of a cent
from each one.
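
To make the averaging concrete, here is a toy sketch of assembling such a
package; the fills, prices, and greedy approach are made up for illustration,
not how any real desk routes orders:

    # Combine smaller fills at varying prices so the volume-weighted average
    # price stays at or under the target. Purely illustrative.
    def fill_order(target_qty, limit_price, available_fills):
        """available_fills: list of (price, qty) offers."""
        taken, cost, qty = [], 0.0, 0
        for price, offer_qty in sorted(available_fills):
            if qty >= target_qty:
                break
            take = min(offer_qty, target_qty - qty)
            # Only take this fill if the running average stays within the limit.
            if (cost + price * take) / (qty + take) <= limit_price:
                taken.append((price, take))
                cost += price * take
                qty += take
        return taken, (cost / qty if qty else 0.0), qty

    fills = [(0.98, 400), (0.99, 300), (1.02, 500)]
    package, avg_price, filled = fill_order(1000, 1.00, fills)
    print(package, round(avg_price, 4), filled)   # 1000 shares at ~$0.995 average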

~~~
mscarborough
Yes, I will take this into account, thank you.

Either way, it doesn't seem to be about actually funding companies. Maybe
stock trades can help influence companies to change strategies or leaders, but
I don't see the point of doing it in a forum from which the companies will
never see the investment.

I don't think I understand the stock market as being anything more than a
gambling game for people with a ton of money. Not sure what the IPOs of
Facebook, Groupon, or Zynga did for anyone other than top execs who were
already making a ton of cash per year, or the traders who bought and sold
options.

------
bluedino
>> we are on our fourth generation (G4) of servers. Our first generation (G1)
servers were stock Dell PowerEdge servers. We deployed these in early 2010

Wow, 4 generations of servers in 3 years. Talk about iterating quickly.

Was Cloudflare bootstrapped or did they start with a huge investment? 23
datacenters full of equipment sounds like a lot to me.

~~~
wmf
I suspect they have 23 cages, not full datacenters.

~~~
eastdakota
More than that -- and different amounts in different locations (e.g., London
has more servers than Toronto) -- but correct that nowhere are we filling
whole buildings with gear.

------
virtuallynathan
If CloudFlare is using Kernel version 3.3 or higher, they should look into
using fq_codel as their scheduler instead of pfifo_fast to decrease latency
under load. I suspect the 16MB buffer in the NIC isn't doing them any favors.
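
For anyone wanting to try this, a minimal sketch of swapping the root qdisc
for fq_codel via iproute2's tc; the interface name "eth0" is an assumption,
and the command needs root plus a kernel/iproute2 that ship fq_codel:

    # Replace the default pfifo_fast root qdisc with fq_codel on one interface.
    import subprocess

    def enable_fq_codel(iface="eth0"):
        subprocess.run(["tc", "qdisc", "replace", "dev", iface, "root", "fq_codel"],
                       check=True)
        subprocess.run(["tc", "qdisc", "show", "dev", iface], check=True)  # verify

    if __name__ == "__main__":
        enable_fq_codel()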

~~~
eastdakota
I asked one of our kernel guys. Here's his response:

"We are indeed looking at Codel. We were actually working on backporting
BQL+Codel to the 2.6.x kernel but the Google guys finally got the network
stack under control enough for us to deploy >3.3. The 16MB of buffers hasn't
hurt us much yet, and may in the long run save us from switches that have too
shallow a buffer for the high contention ratios we run on the switch." -LinuXY

~~~
virtuallynathan
Interesting, thanks for the response!

------
tinco
"Adding more cores to a CPU did help mitigate this and we tested some of the
high core count AMD CPUs, but ultimately decided against going that
direction."

You tested it, but you did not mention why you ultimately decided against
them. Was there something specifically worse about the AMD CPUs, or is Intel
giving you a discount for keeping your servers all Intel (i.e. NICs, SSDs,
etc.)?

~~~
rtkwe
> While top clockspeed was not our priority, our product roadmap includes more
> CPU-heavy features. These include image optimization (e.g., Mirage and
> Polish), high volumes of SSL/TLS connections, and extremely fast pattern
> expression matching (e.g., PCRE tests for our WAF). These CPU-heavy
> operations can, in most cases, take advantage of special vector processing
> instruction sets on post-Westmere Intel chips. This made Intel's newest
> generation Sandybridge chipset attractive.

and

> We were willing to sacrifice a bit of clockspeed and spend a bit more on
> chips to save power. We tend to put our equipment in data centers that have
> high network density. These facilities, however, are usually older and don't
> always have the highest power capacity. We settled on our G4 servers having
> two Intel Xeon 2630L CPUs (a low power chip in the Sandybridge family)
> running at 2.0GHz. This gives us 12 physical cores (and 24 virtual cores
> with hyperthreading) per server. The power savings per chip (60 watts vs. 95
> watts) is sufficient to allow us at least one more server per rack than we'd
> be able to get if we went with the non-low power version.

So a combination of additional instructions and power savings.
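
Rough servers-per-rack arithmetic behind the power argument; only the 60W vs.
95W TDP figures come from the post, while the rack budget and per-server
non-CPU draw below are made-up assumptions:

    # Back-of-envelope: how many servers fit in a rack power budget with two
    # low-power CPUs vs. two standard ones. Figures other than the TDPs are
    # illustrative assumptions.
    RACK_BUDGET_W = 5000   # assumed usable power per rack
    OTHER_DRAW_W = 180     # assumed per-server draw beyond the two CPUs

    for name, cpu_tdp in [("Xeon 2630L (low power)", 60), ("standard part", 95)]:
        per_server = 2 * cpu_tdp + OTHER_DRAW_W
        print(f"{name}: {per_server} W/server -> "
              f"{RACK_BUDGET_W // per_server} servers/rack")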

------
Element_
"Specifically, we saw a 50% performance benefit addressing disks directly
rather than going through the G3 hardware RAID."

Wow, 50%! Is this because RAID controller performance hasn't kept up with the
evolution from spinning disks to SSDs, or have RAID controllers always had
that much overhead?

~~~
lotyrin
No, it's just that their workload is not very well served by any RAID level:
many small files, of which it's okay to completely lose a disk's worth, that
are accessed very randomly and unevenly.

They actually only wanted load balancing, and I'm sure their purpose-built
solution does a better job of staying balanced while avoiding the increased
risk from striping or the performance loss from mirroring or parity (I'm
curious what level(s) they were using). Cutting out the RAID layer when they
didn't need it also saves a trip through the controller, which matters more
these days when compared to SSD "seek" times.
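
A hedged sketch of that "balance without RAID" idea: hash each cache object to
one of several independent disks, so losing a disk only loses that disk's
share of re-fetchable content. This is a generic illustration, not
CloudFlare's actual placement scheme:

    # Spread cache objects across independent disks without striping or parity.
    import hashlib

    DISKS = [f"/cache/ssd{i}" for i in range(6)]   # assumed mount points

    def disk_for(key: str) -> str:
        h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
        return DISKS[h % len(DISKS)]

    print(disk_for("example.com/static/logo.png"))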

~~~
eastdakota
Yup.

------
jeremydavid
I absolutely love CloudFlare! I _save_ money by paying ~$25 a month for their
service, because my bandwidth has been cut by over 65% since I signed up (the
lightning fast loading, caching, security features and more are just icing on
the cake). Instead of having to deal with optimizing my site for speed, they do
it all at the click of one button. It's the best service I have ever signed up
for, and I love it!

If CloudFlare ever offered optimized hosting (with PHP + MySQL), I would sign
up in a snap and move all my websites there.

~~~
regal
Had the opposite experience - no change in loading speeds when our ~600K
visitor / month site switched to CloudFlare in March of this year, but we soon
started experiencing long website downtimes where our server was running fine
but CloudFlare was not serving our site, despite paying $200 a month for
"always online", which the site clearly was not. All IPs were whitelisted with
our host and everything else CloudFlare recommended; all CloudFlare would do
was examine the site after it had come back online (following hours of
downtime) and say, "Everything looks fine to us!"

We finally left, after months of this, and have had no problems with downtime
since. I really wanted CloudFlare to work - I was really excited about it when
I signed us up. But at least for a bigger site with heavier traffic that
relies on being up as much as possible, I can't say I'd recommend it until it
straightens out its downtime issues (especially when paying for "always
online").

~~~
zaroth
I keep hearing sporadic reports like this. The big thing with CloudFlare is
that you are putting them as the first hop in reaching your site, so they are
an additional point of failure. Of course, it's not a zero-sum game: they
could also end up increasing your uptime overall, and in many cases I believe
they do.

Particularly for relatively low volume sites which have a short burst in
traffic on occasion, CloudFlare can keep those sites running during the peaks.

I think the most important thing is transparency and correct expectations. If
they set clear expectations, and they are transparent about how well they are
meeting them, then it just comes down to delivery.

I found their status dashboard here: [https://www.cloudflare.com/system-status](https://www.cloudflare.com/system-status).
Unfortunately it doesn't show much long-term historical performance; it would
be nice to see 30 or even 180 days of performance history to really evaluate
them.

regal, did you find that when you had downtime on your site that it was
reported in their status dashboard, and that was an accurate depiction of the
service they provided? I think the worst-case scenario is getting hit with
unreported downtime, because that brings up all sorts of questions.

~~~
regal
_I think the most important thing is transparency and correct expectations. If
they set clear expectations, and they are transparent about how well they are
meeting them, then it just comes down to delivery._

Agreed. So long as a customer knows what he/she's signing up for, and gets
that, everything's fine. I might have misread what the "99.99% uptime
guarantee" was supposed to be for and gotten too excited about it / taken it
too seriously when I first signed up, or maybe this is for something else
that's too complicated for a part-time tech guy like me to understand.

When I'd log in when the site was down, half the time CloudFlare would have
the green arrow next to the site with a "Site Online" type indicator; other
times it'd have the brown dot-dot-dot "Site Offline" indicator. I'd confirm
numbers on this but apparently the service doesn't save this or makes it no
longer available to you on account termination. Pingdom Tools would report the
site as down, and when visiting the site, it wouldn't load, or would take 10+
seconds to load. There would also frequently be a "This website is offline; no
cached version is available" page from CloudFlare when trying to load the
site, even on the homepage, despite the guarantee to supposedly be saving and
serving cached copies of the site in the event of downtime (and despite that
being what I thought we were paying for, mainly) - sometimes those cached
copies would show up too; though more often, there'd just be this page:

[http://image2.romantika.name/2013/01/cloudflare-website-offl...](http://image2.romantika.name/2013/01/cloudflare-website-offline.png)

------
MichaelGG
Any reason you went with Solarflare/OpenOnload versus standard Intel NICs with
DNA[1]?

1:
[http://www.ntop.org/products/pf_ring/dna/](http://www.ntop.org/products/pf_ring/dna/)

~~~
LinuXY
We chose SolarFlare+Onload over Intel with pf_ring+DNA or DPDK mainly because
of the fully featured TCP stack. While it may make sense for us to develop our
own in the future, it does not currently. Additionally, the SolarFlare cards
gave us the benefit of 16MB buffers, which could allow us to go with switches
that have shallow buffers (cheaper). There's also processing done on an FPGA
on the card itself which allows us to drop packets on the card before they
reach the machine altogether, which is /really/ a boon under DDoS. SolarFlare
has been a great partner in their willingness to work on our (non-standard)
use case, which is something that is hard to find when dealing with larger
vendors.

~~~
samstave
How often are you experiencing DDoS attacks?

(I fully understand designing for the event - but the emphasis on it in the
post makes it seem that you're under constant threat. I am assuming it is your
customers that are actually being DDoS'd and Cloudflare just needs to be built
up to stand against DDoS in this case??)

~~~
eastdakota
It is usually our customers who are attacked, but that hits our network so we
need to be able to mitigate it. Last week we saw 163 "significant" attacks
(which is a fairly typical week). A "significant" attack is one that generally
exceeds 10Gbps, 5M PPS, or finds another way to affect other customers to the
point that our ops team is alerted.

------
23david
Cool writeup. Anyone recommend using the cloudflare service for commercial
sites with sensitive data?

Seems like the system is robust, but I was looking for information on their
policies regarding access-log retention and couldn't find much information
online. Seems like they got a subpoena in the Barrett Brown case, and not sure
how that all worked out.

~~~
xxdesmus
There's a blog post on what we log: [http://blog.cloudflare.com/what-cloudflare-logs](http://blog.cloudflare.com/what-cloudflare-logs)

------
chatmasta
Nice writeup. I'm interested in seeing some pricing data, but I understand if
you're trying to keep that under wraps.

~~~
eastdakota
Our business depends on the ability to process a byte of information as
inexpensively as possible. Fighting for the lowest possible hardware pricing
is part of that. While agreements keep us from disclosing pricing details, I
can say that we work extremely hard to ensure we're getting the best price
from the vendors we choose to work with.

------
kbuck
It surprises me that the SYN attacks are being mitigated on the machines
themselves; I was under the impression that this is typically done with
hardware firewalls that offload the TCP handshake (thus filtering out spoofed
SYN packets and other connections that the remote machine doesn't intend on
actually establishing).

It does seem like doing it on the target machine will reduce latency a bit,
though, since the hardware TCP offloaders usually repeat the TCP handshake
(this time to the actual server) after confirming that it's valid.
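
One standard way to filter spoofed SYNs on the host itself, without keeping
per-connection state, is SYN cookies: the SYN-ACK's initial sequence number is
a MAC over the 4-tuple plus a coarse timestamp, and connection state is only
created when the final ACK echoes it back. A simplified sketch (real
implementations also pack MSS bits into the ISN), and not a claim about what
CloudFlare actually does:

    # Stateless SYN-cookie style handshake check (illustrative only).
    import hashlib, hmac, time

    SECRET = b"rotate-me-regularly"

    def syn_cookie(src_ip, src_port, dst_ip, dst_port, counter=None):
        counter = counter if counter is not None else int(time.time()) // 64
        msg = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}|{counter}".encode()
        mac = hmac.new(SECRET, msg, hashlib.sha256).digest()
        return int.from_bytes(mac[:4], "big")   # use as the ISN in the SYN-ACK

    def ack_is_valid(src_ip, src_port, dst_ip, dst_port, ack_number):
        # The peer's ACK must be our cookie + 1; accept current or previous window.
        now = int(time.time()) // 64
        for c in (now, now - 1):
            expected = (syn_cookie(src_ip, src_port, dst_ip, dst_port, c) + 1) & 0xFFFFFFFF
            if ack_number == expected:
                return True
        return False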

~~~
wmf
A high-end "hardware firewall" is now an x86 server (usually previous-
generation) running some fairly expensive proprietary software. For a company
like CloudFlare that has good scale and security as a core competency, I think
doing it themselves makes sense.

------
Nican
Bypassing the kernel makes me curious: why not develop the whole thing in
kernel space? It does bring the huge problem that any crash can cause a kernel
panic, but you no longer have to worry about performance overhead from virtual
memory, page misses, swapping, context switching, etc. And depending on the
hardware, after a crash it can reboot in less than 10 seconds.

Am I overlooking something?

~~~
zhemao
Unless you literally had a single thread of execution running on each core,
you would still have to worry about all those things you mentioned.
Presumably, Cloudflare's software is too complex to make that feasible. So
while it makes sense to bypass the kernel for the biggest bottleneck in their
system, processing IP traffic, it wouldn't make sense to give up the
convenience the kernel provides for things like scheduling, disk drivers, etc.
Plus, security would obviously be a concern. Running complex internet-facing
software in ring 0 on the bare metal all the time is like riding a motorcycle
naked. Sure, you might be able to go a bit faster, but if things go wrong,
there is literally no protection.

------
bifrost
> "SSDs give us three advantages. First, they tend to fail gradually over time
> rather than catastrophically"

Uh, I've _never_ had one fail gradually...

> "Intel reports that the 520-series drives have a mean time between failure
> (MTBF) of 1,200,000 hours (about 137 years)."

Yes, but they have a maximum write cycle, you can blow through the average
consumer drive in a month and a half of concerted writing.
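
Back-of-envelope endurance math behind that claim; the capacity, P/E rating,
and write rate here are illustrative assumptions, not Intel 520 specs:

    # How long concerted writing takes to exhaust a consumer SSD's rated
    # program/erase cycles (ignoring write amplification). All figures assumed.
    capacity_gb = 240
    pe_cycles = 3000              # typical consumer MLC rating of the era
    sustained_write_mb_s = 200    # near the drive's sequential write ceiling

    total_writable_tb = capacity_gb * pe_cycles / 1000
    days = total_writable_tb * 1e6 / (sustained_write_mb_s * 86400)
    print(f"~{total_writable_tb:.0f} TB of writes, ~{days:.0f} days of flat-out writing")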

~~~
eastdakota
That hasn't been our experience. While we've optimized our file system to
minimize wear, we do an extremely high volume of reads and writes on our SSDs.
We have many SSDs (previous generations) that have been running full steam for
3 years in production. We've been pleasantly surprised with the number of
write cycles they can endure without failure.

~~~
peterwwillis
How do you measure writes in this case? Do you use SSD write caching? Is your
filesystem caching the writes? Would love to see some stats/graphs to show
real-world metrics of disk resilience.

~~~
eastdakota
We'll do a post at some point on file system benchmarks and what we do to get
the most performance and life out of our SSDs.

------
Everlag
That is some beautiful hardware porn.

16MB vs 512KB is only larger? Not gigantic or even extremely large in
comparison to the extremely tiny cache? Oh dear.

~~~
ihsw
This sort of scaling seems to be logarithmic. 16MB is five doublings (32x)
greater than 512KB, a natural-log difference of only about 3.46. Something
genuinely gigantic would be more like 524GB, which sits about 10.4 natural-log
units above 16MB and would be considerably more impressive.

That's some fancy math and it probably is completely off, but another
explanation is that NIC performance doesn't follow exponential scaling, for
unexplained reasons.
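
Checking those ratios (natural logs, binary-prefix sizes):

    import math

    KB = 1024
    sizes = {"512KB": 512 * KB, "16MB": 16 * KB ** 2, "524GB": 524 * KB ** 3}

    print(sizes["16MB"] / sizes["512KB"])              # 32.0
    print(math.log(sizes["16MB"] / sizes["512KB"]))    # ~3.47
    print(math.log(sizes["524GB"] / sizes["16MB"]))    # ~10.4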

~~~
hosay123
The article suggests interrupt load is the major issue, although it doesn't
really say enough to inform us. A 12.5ms buffer means potentially just 80
wakeups/sec. They don't mention whether they use multiple hardware queues or
if all interrupts hit a single core.

I'd also be interested to know whether polling mode was tested with any of the
cards, and why it didn't work out.
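
Rough arithmetic behind those figures: a 16MB buffer drained at 10Gbps line
rate holds on the order of 12-13ms of traffic, so roughly 80 wakeups/sec if
the NIC were only serviced when the buffer filled:

    # 16 MB NIC buffer at 10 Gbps line rate (decimal units, back-of-envelope).
    buffer_bits = 16e6 * 8
    line_rate_bps = 10e9

    fill_time_s = buffer_bits / line_rate_bps
    print(f"{fill_time_s * 1000:.1f} ms to fill, ~{1 / fill_time_s:.0f} wakeups/sec")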

~~~
eastdakota
We use multiple hardware queues. Some details in this old blog post:

[http://blog.cloudflare.com/how-the-cloudflare-team-got-into-...](http://blog.cloudflare.com/how-the-cloudflare-team-got-into-bondage-its)

We're continuing to experiment with polling-based network queues.

~~~
hosay123
Awesome, thanks!

------
asb
Seeing as there are CloudFlare employees monitoring this discussion, what
would you say about CloudFlare's relatively poor showing in the Cedexis
country reports [http://www.cedexis.com/country-reports/](http://www.cedexis.com/country-reports/)
(e.g. average response time for the US is 179ms vs 96ms for CloudFront)?

~~~
eastdakota
I'd be curious to see per-region breakdowns. There are definitely regions we
need to expand into. Most notably, Latin America. Stay tuned...

~~~
asb
Well, take a look at the European countries, for instance. They give a similar
granularity to looking at US states.

~~~
chewxy
Nobody cares about CDNs in Australia :(. The CDN companies with PoP in/near
Australia/NZ are ridiculously expensive.

~~~
josephb
There are plenty of options in AU that are priced similarly to their presence
in other countries.

Cloudfront, Cloudflare, Rackspace via Akamai etc

------
cliveowen
Wow, this was very interesting. It's rare to catch a glimpse behind the
scenes, especially when it comes to HW.

~~~
hga
Particularly cool to see hardware optimized for dealing with DDoS attacks,
which is not one of the more common pain points I'm familiar with.

~~~
jacques_chester
That's the beauty of specialisation and economies of scale. They can justify,
and afford, the extra sticker price and coordination cost to set it all up.
And we customers win from that improvement.

------
kposehn
Fascinating.

Definitely several things we can learn from here. We have to do something
similar, though we're still at a much smaller scale. That said, the
wall'o'scaling is looming large and we're finding that even the initial steps
of building our own hardware are paying dividends.

I'm quite interested in the disk I/O lessons you've learned, especially when
dealing with large amounts of RAM. We have to store large indexed data stores
(NoSQL, usually Redis) for persistent, extremely high-speed access. There's a
lot to learn here from what you did with SSDs to back that up, especially the
lack of RAID.

------
ksec
Has G4 actually been deployed?

And it seems some Gen4 will get the Ivy Bridge 8-core Xeon E5, 256GB of RAM
and Intel DC S3700 SSDs?

And why only 10Gbps per server? Surely you could fit one more Solarflare in.

Other than that, I really hope CloudFlare can expand beyond the current PoPs.

------
jetsnoc
Interesting read. Your scale and capacity concerns are a problem I'm hoping to
have. We're still at an entry-level stage where we're purchasing refurbished
gear and slowly scaling horizontally across facilities. No need to have
someone build equipment just for us yet, but your SSD and processor selections
are pretty interesting and we'll probably build similar Supermicro boxes based
on your experiences. Thanks again for the write-up. Happy CloudFlare Pro
customer here.

------
ancarda
Why do people still try to DDoS CloudFlare-protected websites? The
Cyberbunker-Spamhaus incident showed they can survive 300 Gbit/s; I thought
after that everyone would just back off.

------
chiph
(not a hardware guy, so...) Do the riser cards in the servers cause any
bottlenecks or problems? I see another connector and think "point of failure".

------
sarunas
I don't believe the 100us saving from kernel bypass such as OpenOnload is
truly accurate; it's more like 5us.

[https://support.solarflare.com/index.php?option=com_cognidox...](https://support.solarflare.com/index.php?option=com_cognidox&file=SF-105547-CD-5_Low_Latency_Quickstart_Guide.pdf&task=download&format=raw&Itemid=11)
(requires login)

------
samstave
How much are these per node?

~~~
ashmud
From comments: damoncloudflare: "We would not disclose that."

------
aconz2
Sweet write-up. I'm curious how CloudFlare's needs are similar to or different
from those of other companies (i.e. Facebook, Google, Twitter, etc.). It seems
to me that there would already be quite a lot of iteration, and therefore
experience/knowledge, about what the best set-up would be, though a large
portion of it seems to hinge on new hardware coming out.

~~~
eitally
I work for a company building servers for Facebook. They really did try to
follow the Open Compute
([http://www.opencompute.org/](http://www.opencompute.org/)) architecture but
... um, it didn't work out particularly well. They turned to us for tweaking,
Quanta for server manufacturing, and us again for rack integration and large
scale testing.

<edit> We are the OEM (e.g. design and build) for large scale storage arrays
for Amazon & Netflix, too, but not compute servers.

~~~
eastdakota
To clarify the pronouns, "they" here does not refer to CloudFlare. It may
refer to Facebook, but I hope not, as it would certainly be a violation of an
NDA.

------
braum
After reading the article I took a look at their website. I even watched the 4
minute cartoon they made, it was funny, but ultimately didn't really tell me
what they actually do. I guess I'm not their target audience.

~~~
jacquesm
Cloudflare is a CDN with extra bells and whistles.

A CDN is a way to off-load the bulk of the requests to your webserver by
moving the content as close as possible to your end-users, thus reducing the
number of hops required to get to the content, which in turn increases end-
user satisfaction with your product due to a decrease in page load time.

The theory is that if a user gets a snappy service they are more willing to
spend their money, and so e-commerce sites and sites that tend to monetize
their users in some way find benefits in using services like these.

I hope that explains it adequately. To label CloudFlare a mere CDN is a
disservice to them, but for explanation purposes it might as well be; I'm sure
someone from CloudFlare can give a much better explanation of just why their
offering is not just an ordinary CDN but goes much further than that.
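
In code, the core idea is just a cache in front of the origin; a minimal
sketch, where the origin URL and the in-memory cache are illustrative
assumptions (a real edge node caches to disk/SSD and honors HTTP caching
headers):

    # An edge node answers from local cache when it can and only goes back to
    # the origin on a miss, so most requests never reach your webserver.
    import urllib.request

    ORIGIN = "http://origin.example.com"   # hypothetical origin server
    cache = {}                             # path -> body

    def serve(path: str) -> bytes:
        if path in cache:                  # hit: the origin never sees this request
            return cache[path]
        with urllib.request.urlopen(ORIGIN + path) as resp:   # miss: fetch once
            body = resp.read()
        cache[path] = body
        return body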

------
znowi
I find it nasty how some vendors provide proprietary SFP+ connectors. I
wouldn't deal with such types. They should make it an official standard and
end the extortion.

~~~
wmf
Realistically you can escape the shakedown by buying "vendor compatible"
transceivers. If they officially supported random transceivers they'd just
make up the lost profit by increasing support prices.

------
Keyneston
What are you using for out of band management?

------
gojomo
Is that turquoise thing sticking out of the power supply an easily-replaceable
fuse?

~~~
anderiv
No, that is the release you need to use to remove the PS from the chassis. You
can see the two latches it moves on the right side of the case, a cm or two
from the back.

------
gtirloni
Do I smell (Open)Solaris?

