Interesting that Linux kernel performance (IPVS) is acceptable at L4 vs something like DPDK. I guess you just overcome the limitation by increasing the number of L4 instances load balanced by ECMP.
Fun to see DSR in use.
Also interesting to see that all the inherent problems with geolocation via GSLB (the DNS client IP is not the same as the real client IP) don't wind up being a big problem apparently. This seems to be a growing concern in my experience: users aren't located where their ISP DNS servers are located.
It's mostly because the point of DPDK and similar is to go around a lot of the processing in the kernel, and IPVS does exactly that. I'm surprised IPVS isn't more popular, it's built into the kernel and extremely fast.
HTTP proxy type load balancers are slugs in comparison
Scaling app servers to nearly unlimited size is easy to explain but really hard in practice. It basically amounts to this:
1) Balance requests using DNS anycast so you can spread load before it hits your servers
2) Set up "Head End" machines with pipes as large as possible (40Gbps?) and load balance at the lowest layer you can. Balance at the IP level using IPVS and direct server return (there's a rough sketch after this list). A single reasonable machine can handle a 40Gbps pipe. I guess you could set up a bunch of these, but I doubt many people are over 40Gbps. Oh, and don't use cloud services for these. The virtualization overhead is high on the network plane, and even with SR-IOV you don't get access to all hardware NIC queues. Also, I don't know of any cloud provider that's compatible with direct server return, since they typically virtualize your "private cloud" at layer 3, whereas IPVS actually touches layer 2 a little. Do yourself a favor and get yourself a few colos for your load balancers.
3) Set up a ton of HTTP-proxy type load balancers. This includes Nginx, Varnish, HAProxy, etc. One of these machines can probably handle 1-5 Gbps of traffic, so expect 20 or so behind each layer 3 balancer. These NEED to be hardened substantially, because most attacks will be layer 4 and up once an adversary realizes they can't just flood you out (due to the powerful IPVS balancers above). SYN cookies are extremely important here since you're dealing with TCP... just try to set everything up to avoid storing TCP state at all costs. This also means no NAT. You might want to keep these in the colo with your L3 load balancers.
4) Now for your app servers. Depending on whether you're using a dog-slow language or not, you'll want between 3 and 300 app servers behind each HTTP proxy. You don't really need to harden these as much, since the traffic is lower and any traffic that reaches them is clean HTTP. Go ahead and throw these on the cloud if you want.
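For concreteness, here's a minimal sketch of the IPVS/DSR piece of step 2, with placeholder addresses, assuming ipvsadm is installed and this runs as root on the head-end box (it just shells out to ipvsadm):

    #!/usr/bin/env python3
    # Sketch of an IPVS direct-routing (DSR) virtual service for step 2.
    # The VIP and real-server IPs are placeholders; requires root + ipvsadm.
    import subprocess

    VIP = "203.0.113.10"                                    # virtual IP announced via ECMP/anycast
    REAL_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # the HTTP-proxy tier from step 3

    def sh(*args):
        subprocess.run(args, check=True)

    # Create the virtual TCP service on the VIP with the weighted-least-connection scheduler.
    sh("ipvsadm", "-A", "-t", f"{VIP}:80", "-s", "wlc")

    # Add each real server in gatewaying mode (-g), i.e. direct server return:
    # the balancer only rewrites the L2 destination, and replies bypass it entirely.
    for rs in REAL_SERVERS:
        sh("ipvsadm", "-a", "-t", f"{VIP}:80", "-r", f"{rs}:80", "-g")

    # For step 3's hardening, SYN cookies on the proxy tier are one sysctl away:
    # sh("sysctl", "-w", "net.ipv4.tcp_syncookies=1")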
>"'Im surprised IPVS isn't more popular, it's built into the kernel and extremely fast."
I feel it actually is popular at places that do tens of gigabits of traffic and up, usually in combination with a routing daemon - Bird, Quagga, etc. I have worked in a couple of shops now that utilized a similar architecture. I also read recently about a Google LB that leveraged IPVS, and now this, of course.
What if you're not dealing with millions of connections but instead only a few thousand from whitelisted IPs, and you need to optimise for high availability & latency? Could it be done with just anycast -> IPVS layer -> app servers?
The ECMP/anycast just gets you beyond the limit of a single pair of IPVS boxes, which are kept in sync with keepalived/VRRP for HA.
But a pair of boxes with IPVS + keepalived + iptables should be able to handle a few thousand connections no problem. Your concern would then likely be the bandwidth going through the box. But if your clients pull rather than push, using direct server return should get you past the bandwidth limitations of a single box.
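The real-server side of DSR is the part people usually trip over: each box behind the pair has to own the VIP without ever ARPing for it. A minimal sketch, with the VIP as a placeholder, run as root on each real server:

    #!/usr/bin/env python3
    # Real-server side of direct server return: put the VIP on loopback so the
    # box accepts traffic addressed to it, and suppress ARP so only the IPVS
    # balancer answers for the VIP on the wire. VIP is a placeholder.
    import subprocess

    VIP = "203.0.113.10"

    def sh(*args):
        subprocess.run(args, check=True)

    # Bind the VIP to loopback as a /32 so it is never routed out of the box.
    sh("ip", "addr", "add", f"{VIP}/32", "dev", "lo")

    # Standard LVS-DR ARP tuning so the real server never advertises the VIP.
    sh("sysctl", "-w", "net.ipv4.conf.all.arp_ignore=1")
    sh("sysctl", "-w", "net.ipv4.conf.all.arp_announce=2")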
Yeah it works pretty much the same. If your clients aren't geographically dispersed replace anycast with DNS round robin or use both like most huge sites do.
Also, there are three layers :) DNS -> IPVS -> HTTP proxy -> app servers.
You could ditch the HTTP proxy layer if your app servers are extremely fast, like Netty/Go/Grizzly.
Out-of-kernel offloads (DPDK, Solarflare's OpenOnload, and Mellanox's VMA) are good for two primary use cases:
* Reducing context switches at exceptionally high packet rates
* Massively reducing latency with tricks like busy polling (which the kernel's native stack is gaining)
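For the busy-polling bit, the per-socket knob is a one-liner; rough sketch below (SO_BUSY_POLL is Linux-only and isn't exported by Python's socket module, so the raw value 46 from asm-generic/socket.h is used):

    import socket

    SO_BUSY_POLL = 46  # Linux-only; not a named constant in Python's socket module

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to busy-poll the device queue for up to ~50 microseconds
    # on blocking receives for this socket, trading CPU for lower latency.
    # (net.core.busy_poll / busy_read are the system-wide equivalents.)
    sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, 50)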
LVS is pretty much the undisputed king for serious business load balancing. I've heard (anecdotally) that Uber uses gorb[1] and Google has released Seesaw, which are both fancy wrappers on top of LVS for load balancing.
Source: Almost 10 years optimizing Linux and hardware for low latency in a trading firm.
Mellanox and Solarflare certainly have carved out a nice market for themselves. They are not cheap though, which I guess is why they are mostly found in trading shops, since latency likely equates to money being left on the table.
High frequency trading is Solarflare's original market (and Mellanox has their InfiniBand market), but both of them are becoming more and more common in the commodity server market as well. Particularly since Intel dropped the ball on 40G (and REALLY dropped the ball on 25/100G), and Broadcom is out of the adapter market, there is a void that other vendors are filling.
When you're spending $30,000 on a server, it doesn't really matter if you spend $1200 on a network card. Those CPU cycles and storage bytes have to go somewhere to make money.
Not yet. They launched 40G a while ago, but some issues have kept them from the same kind of dominance they had with 10G.
25G is supposedly coming soon-ish, but 100G is still 1-2 years away. It's going to be hard to compete with vendors who shipped their first products in 2015/2016.
Meanwhile they have a 100G OmniPath adapter, but who cares?
"I've read claims that Google Public DNS can slow down certain multimedia applications or websites. Are these true?
...
To help reduce the distance between DNS servers and users, Google Public DNS has deployed its servers all over the world. In particular, users in Europe should be directed to CDN content servers in Europe, users in Asia should be directed to CDN servers in Asia, and users in the eastern, central and western U.S. should be directed to CDN servers in those respective regions. We have also published this information to help CDNs provide good DNS results for multimedia users.
In addition, Google Public DNS engineers have proposed a technical solution called EDNS Client Subnet. This proposal allows resolvers to pass in part of the client's IP address (the first 24/64 bits or less for IPv4/IPv6 respectively) as the source IP in the DNS message, so that name servers can return optimized results based on the user's location rather than that of the resolver. To date, we have deployed an implementation of the proposal for many large CDNs (including Akamai) and Google properties. The majority of geo-sensitive domain names are already covered."
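For what it's worth, here's roughly what an ECS query looks like from the resolver side, sketched with the third-party dnspython library (the domain, client subnet, and resolver address are all placeholders):

    # Sketch of an EDNS Client Subnet query using dnspython; the name,
    # subnet, and resolver below are placeholders.
    import dns.edns
    import dns.message
    import dns.query

    # Pass only the first 24 bits of the (hypothetical) client's IPv4 address.
    ecs = dns.edns.ECSOption("203.0.113.0", srclen=24)
    query = dns.message.make_query("example.com", "A", use_edns=0, options=[ecs])

    # Ask a resolver that honours ECS; the answer should be geo-targeted to the
    # client's subnet rather than to the resolver's own address.
    response = dns.query.udp(query, "8.8.8.8", timeout=2)
    for rrset in response.answer:
        print(rrset)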