
Load Balancing without Load Balancers - jgrahamc
http://blog.cloudflare.com/cloudflares-architecture-eliminating-single-p
======
mnutt
_If a particular server starts to become overloaded, and it appears there is
sufficient capacity elsewhere, then just some of the BGP routes can be
withdrawn to take some traffic away from the overloaded server_

I'd be interested to hear about the mechanism for determining if there is
sufficient capacity elsewhere, and how you avoid a cascading failure.

~~~
eggnet
I'm interested in finding out what happens to TCP sessions that were
already established when routes are withdrawn.

~~~
mnutt
This is the only thing I've found on this:

<http://news.ycombinator.com/item?id=2484047>

------
Titanous
I'd be interested to see a blog post that details what "adjusting the way we
handle [TCP] protocol negotiation itself [for Anycast]" entails.

------
sophacles
Something that isn't quite clicking for me from this article, and my other
knowledge of CloudFlare...

So you have Nd DNS IPs, Nc Cache IPs, Np Proxy IPs and so on, plus some
failovers. It seems to me you can only have Sum(Nsubscript) + M servers in any
given PoP. Which is all good, but I presume that the load on proxy and cache
servers would be such that you'd need quite a few instances of each. Further,
given the nature of CloudFlare's services, it would seem that some CDNed sites
would be heavier than others.

So how do you assign various sites to IPs? Is this via some dynamic DNS magic?
Is there a lot of communication between proxy and cache instances at each PoP
(DHT or similar)?

Basically, what I'm saying is, using BGP to do most of the load balancing is
awesome, but it seems there has to be more to it than that, otherwise you'd
experience a lot of flapping between servers handling heavy sites.

That or I'm missing something, but what?

~~~
junkilo
You anycast the DNS servers and anycast the results the DNS servers serve up.
This is kind of confusing, but all it really amounts to is announcing your
(same) IP block at every datacenter and relying on (maybe selling) BGP AS path
selection as the way to get traffic to the closest datacenter.
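
A rough sketch of that idea in Python, just to make it concrete. It assumes
path selection comes down to AS path length alone, which is a simplification:
real BGP checks other attributes (local preference and so on) before path
length. The datacenter names and AS numbers are made up (the ASNs are from the
range reserved for documentation).

```python
def best_route(routes):
    """Return the route with the fewest AS hops (ties go to the first listed)."""
    if not routes:
        raise ValueError("no routes to choose from")
    return min(routes, key=lambda r: len(r["as_path"]))


# The same anycast prefix is announced from two datacenters; from this
# (hypothetical) client's network the San Jose announcement has fewer hops.
routes = [
    {"datacenter": "san-jose", "as_path": [64496, 64500]},
    {"datacenter": "sydney", "as_path": [64497, 64498, 64499, 64500]},
]

if __name__ == "__main__":
    print(best_route(routes)["datacenter"])
```

If the San Jose announcement is withdrawn, the same function falls back to
Sydney, which is the failover behaviour anycast gives you for free.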

------
23david
Why not also use software load balancers? I don't see the advantage of going
to so much effort to avoid software load balancing. It's neat that you guys
got it to work, but I would think that a hybrid system would have more
functionality and could better handle degraded performance situations.

~~~
hhw
Exactly, as it's pretty much standard with other CDNs, and has been for a
very long time. Without any application level intelligence, you would end up
with duplicate copies of content, as every node in every city ends up caching
every bit of content. You're then limited to the amount of cache each node
has, resulting in much less available cache overall, more content expiring
sooner, and even more requests back to the origin server. If you use a tiered
approach with some intermediate nodes between the end cachers and the origin
server, you could mitigate this somewhat, but it would still be quite
suboptimal. The more standard method is to use a 2 layer approach, with the
front end first layer intelligently hashing the full URL across the pool of
the back end second layer. The trickier part then is if a single object
requires more than one back end node: you need to hash the same content to
more than one node. This could be done by just having multiple pools of
servers on the back end, if the scale requires it.
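
A minimal sketch of the front-end hashing step in Python. It uses rendezvous
(highest-random-weight) hashing, which is one way to do this and not
necessarily what any particular CDN runs. The node names and the `replicas`
parameter are made up for illustration; `replicas` stands in for the "more
than one back end node" case.

```python
import hashlib
from typing import List


def _score(node: str, url: str) -> int:
    """Deterministic pseudo-random weight for a (node, URL) pair."""
    digest = hashlib.sha256(f"{node}|{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_backends(url: str, nodes: List[str], replicas: int = 1) -> List[str]:
    """Pick the back-end cache node(s) responsible for the full URL."""
    if not nodes:
        raise ValueError("no back-end nodes available")
    # Rank every node by its weight for this URL and keep the top ones.
    ranked = sorted(nodes, key=lambda n: _score(n, url), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    pool = ["cache-1", "cache-2", "cache-3", "cache-4"]  # hypothetical nodes
    print(pick_backends("http://example.com/logo.png", pool))
    print(pick_backends("http://example.com/video.mp4", pool, replicas=2))
```

Every front end computes the same answer for a given URL, so each object is
cached on one back-end node (or `replicas` nodes) instead of on all of them,
and losing a node only remaps the URLs that node was holding.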

------
jcr
> 2\. Router: at the edge of each of our PoPs is a router. This router
> announces the paths packets take to CloudFlare's network from the rest of
> the Internet.

Umm, just curious, but I thought you used a set of redundant routers at each
POP?

Or is the single router used highly redundant on its own?

~~~
msumpter
Typically a PoP could be defined as either an individual provider being
brought into a facility or the location of a group of providers entering the
facility (meet me room). In this instance I think they mean an individual
provider (for example Level3) bringing a feed into a data center. You would
then get multiple feeds from distinct providers. Each feed would require a
router for termination, and this router is typically using BGP. Since you keep
multiple providers, having redundant routers on each feed isn't always
necessary. But there are cases where I've seen a single feed being served by
redundant routers using something like VRRP. After that you would take feeds
from the border router into your inner/core switch fabric to be distributed
throughout your network. It just depends on the level of redundancy you want
at each layer.

~~~
jcr
Thank you. I remember VRRP being a bit of a mess from inception due to the
Cisco patent claims against the original IETF draft (too similar to their
patented HSRP). Are there still a lot of compatibility problems across
vendors? At home I used to use CARP (Common Address Redundancy Protocol) on
OpenBSD, but I haven't done it for a while, and I know it's improved a lot
since then (e.g. pfsync(4), ifstated(8), ...).

------
jacobian
What, exactly, does the picture of a woman in a swimsuit have to do with load
balancing?

~~~
coat
She's...balancing?

------
rdl
One thing I've wondered is how homogeneous your POPs are -- how big is the
multiple (in router size/number and server count) between, say, the San Jose
POP and the Sydney POP? I assume you'd want to scale based on expected
traffic, which depends on which POP is nearest to each user, but that changes
based on factors outside your control, and could change fairly fast. Plus DoS
might come from areas which don't see a lot of regular traffic.

~~~
eastdakota
Yes, our PoPs are different sizes depending on the load. San Jose and Los
Angeles, for example, are larger than Seattle. We've designed the system to
scale relatively linearly by adding additional servers. As a PoP gets more
traffic we can handle it in two ways: 1) adding more equipment to the PoP; or
2) adding another PoP to offload a portion of the traffic.

------
spydum
One thing I always wondered is how session persistence is maintained for
things like webapps when trying to host services through anycast. The only
idea I can dream up is cookie based and involves each site having internal
proxy mappings to each PoP in the event a different PoP becomes the favored
route (paths change all the time, and routers on the internet don't care that
you had an established connection).

~~~
toast0
CloudFlare doesn't do sessions, they're just* a proxy cache to their
customers' origins. * A nice proxy cache; and they do have some features to
muck about with the content on the way through if you desire, but I don't see
any application aware routing options.

What you're asking about isn't really an anycast problem either; you can have
the same issue with any load balancing setup. If you need the client
to come back to the same server, you need the client to bring you back
something that says which server to route to, for example a cookie or a
hostname, and you need something in your stack that handles that and routes it
(could be DNS entries, a hardware or software load balancer, application
logic, etc). Avoiding sessions is better, of course, if you can (or if you can
have the client keep the state information; possibly encrypted and signed).
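
For the "client keeps the state" option, here is a minimal sketch in Python
using only the standard library. It signs the state with HMAC so any server
can verify it without shared session storage. It does not encrypt anything, so
the client can still read the payload. `SECRET_KEY`, `encode_state` and
`decode_state` are hypothetical names for illustration.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; shared by all servers


def encode_state(state: dict, key: bytes = SECRET_KEY) -> str:
    """Serialize state and append an HMAC-SHA256 signature."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + sig


def decode_state(cookie: str, key: bytes = SECRET_KEY):
    """Return the state dict, or None if the cookie is malformed or tampered."""
    payload, sep, sig = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison; the payload is only decoded after it verifies.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("ascii")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


if __name__ == "__main__":
    cookie = encode_state({"user": "alice", "cart": [1, 2]})
    print(decode_state(cookie))
```

Any server (or PoP) holding the key can validate the cookie, so it no longer
matters which one the next request lands on.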

------
junkilo
Great writeup! If you offer a global service you are expected to use anycast
on the WAN. BGP AS path selection is just as effective whether you use unicast
or anycast.

------
jamieb
+1 for remembering Las Vegas (and Seattle) go first.

~~~
eastdakota
:-)

------
TOGoS
This article would be more enjoyable without the extraneous clip art.

