Optimizing web servers for high throughput and low latency (dropbox.com)
987 points by nuriaion on Sept 6, 2017 | 84 comments



I thoroughly appreciate how the majority of the article doesn't even go into the nginx config, whereas most of the internet would just discuss the results of the search query "optimized nginx config". Much [for me] to learn, and much appreciated, Alexey.


One of my more popular blog posts is exactly about how to "optimise" nginx for high traffic loads and the gist of it is pretty much "you can't really do much, but here's some minor stuff". I consider that a good thing, though: a web server shouldn't really require config tweaking to perform well.
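
For reference, the "minor stuff" usually boils down to something like this (values illustrative only, not recommendations from my post; measure before keeping any of them):

    worker_processes auto;          # one worker per core
    events {
        worker_connections 4096;    # raise this together with `ulimit -n`
    }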


Definitely! Sane defaults and config readability are pretty much the main reasons I recommend nginx.


I recall there being a good presentation by WhatsApp about optimizing not just the web server, but the OS (FreeBSD, in their case) to solve the C100K/C1M problem. It seemed that there was a lot more low-hanging fruit on the OS side than on the webserver side.


>"One of my more popular blog posts is exactly about how to "optimise" nginx for high traffic loads and the gist of it is pretty much "you can't really do much, but here's some minor stuff"."

Could you provide a link for that blog post? I am not seeing it.



I spent some time looking at this recently. I'm focussed on throughput, not high TPS. The quality of blog posts is absolutely atrocious: there are a lot of posts that amount to "set this setting to this value" with no real explanation of why, or of what the setting actually does. Some of what I've seen is actually downright dangerous.


Great write-up. And even if you use standard instances there is plenty to optimize. Kudos to Dropbox, Netflix, Cloudflare, and everyone else who demonstrates this level of transparency.

And just for reference, AWS does provide enhanced networking capabilities on VPC:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-...


Fab tour de force! Great!

(Though I've been optimising my tiny low-traffic site on a solar-powered RPi 2 to apparently outperform a major CDN, at least on POHTTP...)


I am curious how you are measuring that you 'outperform' a major CDN? Average response time from your local connection? From an arbitrary IP somewhere in the world? An aggregated average response time from many places around the world?


Have a look at these:

http://www.earth.org.uk/note-on-site-technicals-2.html

http://www.earth.org.uk/note-on-carbon-cost-of-CDN.html

Basically measuring time to first byte (TTFB) (and indeed visually complete, etc.) as a reasonable correlate of perceived performance, for a simple site over HTTP from my RPi vs via Cloudflare fully cached HTTP and HTTPS. The most direct comparison is of a separate fossil site, but other measuring tools show (generally) that my RPi HTTP beats or matches CDN HTTP, which beats CDN HTTPS (with HTTP/2 etc.) for a UK visitor.

Tests are made from WebPageTest and StatusCake in the main, both from their test points in the UK in various data centers. These sites of mine are UK focussed, not global, so the UK test points are representative, I believe, and should be advantageous to the CDN, since it is terminated closer and faster to the client than I can be from my kitchen cupboard!
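
For a quick TTFB spot-check outside those services, curl's write-out timers work too (the URL here is just an example):

    curl -o /dev/null -s \
        -w 'TTFB: %{time_starttransfer}s  total: %{time_total}s\n' \
        http://www.earth.org.uk/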


not surprising imho - you probably have more 'dedicated hardware' (in an abstract sense) per request than a busy CDN.. if the site software is all cached in ram and cores are free, you're pretty much running 'wide open' on each request..


Absolutely.

Though still, the CDN folks have faster machines not running in power-saving mode, and lower latency into the core UK Internet peering points / LINX, and my RPi is doing other things too, such as running a substantial Java-based server.


Very interesting! What steps have you taken to optimize?



DPDK?

One fabless SoC maker called Baikal claims that it can saturate a 10Gb/s link with an 8-core ARM chip running under 5W (PHY power not counted), as long as users stick to userspace network drivers and hardware offloading.


DPDK is typically benchmarked in packets per second rather than bitrate, the reason being that 64-byte packets at 10Gbps are much harder than 512-byte-plus ones. Also, it depends on what the work is: you can do several hundred Gbps on a moderate Xeon, provided you're just dropping all the packets as they come in.
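
Back-of-the-envelope (counting the 20 bytes of preamble and inter-frame gap each Ethernet frame costs on the wire):

    LINK_BPS = 10e9  # 10 Gb/s

    def line_rate_pps(frame_bytes):
        # each frame costs an extra 20 bytes of preamble + inter-frame gap
        return LINK_BPS / ((frame_bytes + 20) * 8)

    print(f"64B frames:  {line_rate_pps(64) / 1e6:.2f} Mpps")   # ~14.88 Mpps
    print(f"512B frames: {line_rate_pps(512) / 1e6:.2f} Mpps")  # ~2.35 Mpps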


This sounds unbelievable


This is a wonderful article. It also is indicative of why people who really understand systems will always be so employable - it's just so hard to make things run well.


This is a great article and I've bookmarked it for future reference.

I observe, though, that if you are tuning a system to this level of detail you likely have a number of web servers behind a load balancer. To be complete, the discussion should include optimization of interactions with the load balancer, e.g. where to terminate https, etc.


> To be complete, the discussion should include optimization of interactions with the load balancer, e.g. where to terminate https, etc.

Not necessarily. There are other ways of spreading traffic: DNS round robin, for example, using the DNS delegation trick, or even client-side load balancing (which is perfectly feasible if you control the client, like your own app).

The article itself mentions that what they're talking about is the Dropbox Edge network, an nginx proxy tier, which sounds like load balancers to me.


Using round robin DNS for load balancing is almost never a good idea.

What is "the DNS delegation trick"?


> should include optimization of interactions with the load balancer, e.g. where to terminate https, etc.

I'm curious - is there a case where you want to terminate HTTPS on the end device instead of (only) on the load balancer?


Ignoring any privacy implications, terminating HTTPS on the load balancer means the load balancer will use more CPU and memory than if it were just a TCP-terminating load balancer or working at layer 2. In a lot of architectures the load balancer may not have as nice a failover story as the hosts behind it; minimizing state on the load balancer could make state syncing possible for a quicker failover. And if you're running load-balancing appliances, those tend to be expensive, so you probably want to let them do only what they have to, so you don't need to buy more of them.
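
As a sketch, a plain layer-4 passthrough (here via nginx's stream module; the addresses are made up) keeps all the TLS state on the backends:

    stream {
        upstream backends {
            server 10.0.0.11:443;
            server 10.0.0.12:443;
        }
        server {
            listen 443;
            proxy_pass backends;  # raw TCP; no TLS termination on this box
        }
    }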


Thanks!


I have a related question that most would probably consider relevant, but which this article (quite rightly) doesn't answer, as it's not relevant for Dropbox.

Let's say I want to prepare a server to respond quickly to HTTP requests, from all over the world.

How do I optimize where I put it?

Generally there are three ways I can tackle this:

1. I configure/install my own server somewhere

2. Or rent a preconfigured dedicated server I can only do so much with

3. I rent Xen/KVM on a hopefully not-overcrowded/oversold host

Obviously the 1st is the most expensive (I must own my own hardware; failures mean a trip to the DC or smart hands), the 2nd will remove some flexibility, and the 3rd will impose the most restrictions but be the cheapest.

For reference, knowing how to pick a good network (#1) would be interesting to learn about. I've always been curious about that, although I don't exactly have anything to rack right now. Are there any physical locations in the world that will offer the lowest latency to the highest number of users? Do some providers have connections to better backbones? Etc.

#2 is not impossible - https://cc.delimiter.com/cart/dedicated-servers/&step=0 currently lists an HP SL170s with dual L5360s, 24GB RAM, 2TB of disk, and 20TB of bandwidth @ 1Gbit for $50/mo. It's cool to know this kind of thing exists. But I don't know how good Delimiter's network(s) is/are (this is in Atlanta, FWIW).

#3 is what I'm the most interested in at this point, although this option does present the biggest challenge. Overselling is a tricky proposition.

Hosting seems to be typically sold on the basis of how fast `dd` finishes (which is an atrocious and utterly wrong benchmark - most tests dd /dev/zero into a disk file, which goes through the page cache). Not many people seem to set up a tuned web server and then run ab or httperf against it from a remote host with known-excellent networking. That's incredibly sad!
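
To illustrate (standard GNU dd flags; still a crude benchmark either way):

    # the usual test - mostly measures the page cache, not the disk
    dd if=/dev/zero of=testfile bs=1M count=1024

    # slightly more honest variants
    dd if=/dev/zero of=testfile bs=1M count=1024 oflag=direct    # bypass the page cache
    dd if=/dev/zero of=testfile bs=1M count=1024 conv=fdatasync  # include the flush in the timing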

Handling gaming or voice traffic is probably a good proxy for the target I'd like to hit - I don't want to do precisely that, but if my server's latency is good enough to handle it, I'd be very happy.


Wouldn't dd'ing /dev/zero create an FS hole in your target file on many file systems? A similar (but maybe not isomorphic) situation is one of the exercises in the Michael Kerrisk book, and I'd bet ZFS does something similar, but suddenly we've assumed a lot of context.


Good question. The 0th-step answer is that the people using `dd` are generally going to be using all OS defaults, including for the filesystem.

For what it's worth, I know that dd'ing /dev/zero makes df show a smaller value. AFAIK, df (for ext4 in my case) isn't reading a logical abstraction of capacity as arbitrarily decided upon by a bunch of layers, but is straightforwardly reporting the free blocks on disk.
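
It's easy to check with GNU dd, for what it's worth - conv=sparse is what makes dd seek over zero blocks (leaving holes) instead of writing them:

    dd if=/dev/zero of=full.img bs=1M count=100                # allocates ~100M of real blocks
    dd if=/dev/zero of=sparse.img bs=1M count=100 conv=sparse  # leaves a hole instead
    du -h full.img sparse.img                                  # second file takes (almost) no space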


Depends in part on why you want it there. For many answers, the answer is a CDN. Even if you have entirely dynamic and incompressible content, the bandwidth-delay product's role in limiting TCP throughput means two connections with half the delay each are a huge improvement.
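
Rough sketch of why: a single TCP connection tops out at window/RTT, so halving the RTT doubles the ceiling.

    def tcp_ceiling_mbps(window_bytes, rtt_ms):
        # single-connection throughput is capped at window / RTT
        return window_bytes * 8 / (rtt_ms / 1000) / 1e6

    print(tcp_ceiling_mbps(65536, 200))  # ~2.6 Mb/s with a 64 KiB window at 200 ms
    print(tcp_ceiling_mbps(65536, 100))  # ~5.2 Mb/s at 100 ms, i.e. double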


"If you’ve read this far you probably want to work on solving these and other interesting problems! You’re in luck: Dropbox is looking for experienced SWEs, SREs, and Managers."

Following the link shows "Open Positions" with, well, nothing to follow. Not only did they optimize their servers for throughput, but HR as well!


"people are still copy-pasting their old sysctls.conf that they’ve used to tune 2.6.18/2.6.32 kernels."

This.


.. or 32 bit systems.


My mind kind of explodes reading that article.

So many bells & whistles and I don't even know where to begin.


Wow. This is the kind of article I'm expecting to see when I google "nginx best practices 2018". I am so far behind; maybe 20% of my usual setup includes those recommendations. Thank you, Dropbox.

If someone can point me to a thorough article like this on the lua module, I will thank her/him forever.


Maybe you will find some stuff here? https://openresty.org/en/presentations.html


I'm surprised there was no mention of tcp_max_syn_backlog and netdev_max_backlog.

When I've previously tuned a server I have used both of those to my advantage... Another comment on here talked about this ignoring an existing load balancer, so maybe those sysctls are more appropriate on an LB?
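
For reference, the kind of sysctls I mean (values illustrative only):

    net.ipv4.tcp_max_syn_backlog = 4096   # half-open connections awaiting the final ACK
    net.core.netdev_max_backlog = 2000    # packets queued off the NIC before the stack runs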


Those settings are irrelevant in this scenario. The point is to maximize the throughput from a single server without backing up the queue.


Does the physical location of a server matter for high-throughput use cases? If a client is downloading large files using all its available bandwith, is the download time noticeably better if the server is close to the client?


Physical location matters a lot for uploads, mostly because the client TCP stack is way less sophisticated.

For downloads, high RTT can be mitigated by a congestion control that ignores constant packet-loss rates (which are common on high-RTT paths). Other tricks you can try: fq+pacing and newer kernels with more sophisticated recovery heuristics.
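
For example, on a 4.9+ kernel (BBR being one such loss-tolerant congestion control):

    net.core.default_qdisc = fq              # fair queueing + pacing
    net.ipv4.tcp_congestion_control = bbr    # doesn't treat steady packet loss as congestion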


I was actually looking into this the other day - latency from Japan to Virginia is over 200ms, and then another 32ms to grab the actual HTML of the page. I reduced my SSL certificate key size from 4096 bits and it's definitely helped (latency with SSL was crazy, over 500ms). Even so, I need to think at some point about a datacenter-distributed code base :-/


One congestion-control parameter that doesn't get its fair share of articles is initcwnd. And for HTTP traffic over high-bandwidth, high-latency links it has a much larger impact than the choice of the cc algorithm.

See https://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum...
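
For the curious, it's set per route with iproute2 (the gateway and device here are placeholders):

    ip route change default via 192.168.1.1 dev eth0 initcwnd 20 initrwnd 20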


I've mentioned it very briefly in the Network Stack part, around the "upgrade your kernel" warning:

> and I’m not even talking about IW10 (which is so 2010)[1]

[1] https://developers.google.com/speed/protocols/Increasing_TCP...


Initcwnd is set to 10 in just about every distro and has been for some time now. Initcwnd only comes into play at the beginning of a new connection; it isn't as important for LFNs as TCP buffer sizes are. But at any rate, this post is concerned with edge networks, so there shouldn't be any LFNs in play at all.
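
The buffer limits in question, for reference (illustrative values):

    net.ipv4.tcp_rmem = 4096 87380 67108864   # min / default / max receive buffer, bytes
    net.ipv4.tcp_wmem = 4096 65536 67108864   # min / default / max send buffer, bytes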


> You should keep your firmware up-to-date to avoid painful and lengthy troubleshooting sessions. Try to stay recent with CPU Microcode, Motherboard, NICs, and SSDs firmwares.

I wonder if this is good advice. I would have said the opposite: do not mess around with any of that stuff unless there's a security advisory or a problem points to a specific piece of hardware. It's not like updating this stuff is without risk.


> It's not like updating this stuff is without risk.

Same goes for kernel, libc, all other libraries, and pretty much anything else. This is not an argument for not upgrading them though.

Also, sadly, almost all firmware updates fix a "problem [that] points to a specific piece of hardware" that you use.


> Same goes for kernel, libc, all other libraries, and pretty much anything else. This is not an argument for not upgrading them though.

If you have a budget to responsibly keep all that stuff up to date in your project, and test to make sure you haven't broken anything, more power to you. It's a total waste most of the time, but why not? Especially if you're a consultant and getting paid by the hour.

Firmware is different. It's not always possible to back out a change and you risk bricking hardware every time you apply an update, especially in the realm of PC-based servers. If something like a NIC or motherboard is performing as expected, updating the firmware for no reason is generally a stupid thing to do.

Edit: oh, you're the author. Please stop telling people to do this, or at least explain the risk. At a minimum, if the vendor provides a defect list with each update, it's not necessary to blindly apply updates. Take the changes if they're needed. It's possible you're accustomed to working on high-grade hardware that exhibits fewer of these firmware related blowups, but that doesn't mean it does not happen to people...


Would be cool with a story like "we did these adjustments so each server could handle 10% more requests", etc. This blog post seems to only cover software; you can also gain a lot of performance from hardware modding. Someone said optimizing is the root of all evil... so first identify bottlenecks in real workloads, no micro-benchmarking!


That was: "_premature_ optimization is the root of all evil".


> If you are going with AMD, EPYC has quite impressive performance.

Does this imply that Dropbox has started testing out EPYC metal?


This is really great. I love this kind of article! I learned a few things (like about the non-CUBIC TCP congestion algorithms). Probably orthogonal to TFA, but I'd be interested to know how they solve zero-downtime nginx deployments.
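
(The generic answer for that last part, at least: config changes are just `nginx -s reload`, and swapping the binary itself is nginx's documented signal dance. Pid-file paths vary by distro:)

    kill -USR2 $(cat /run/nginx.pid)          # start a new master running the new binary
    kill -WINCH $(cat /run/nginx.pid.oldbin)  # gracefully drain the old workers
    kill -QUIT $(cat /run/nginx.pid.oldbin)   # retire the old master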


No mention of VPP - does it not apply to applications? Or routing/switching?

https://wiki.fd.io/view/VPP/What_is_VPP%3F


VPP has amazing performance, but it's mostly limited to switching, routing, NAT, and VPN at this point; it doesn't have a (finished) TCP stack, and porting a web server like nginx is a ways off. It's also questionable who needs to serve 500 Gbps of web traffic from a single server.


So how does Linux compare now with FreeBSD in terms of throughput and latency? I remember like 10 years ago Linux had issues with throughput, which is why Netflix went with FreeBSD. Are they similar now?


So, this is an honest question. What kind of performance do Linux based CDNs get out of a single box?

At Netflix, we can serve over 90Gb/s of 100% TLS-encrypted traffic using a single-socket E5-2697A and Mellanox (or Chelsio) 100GbE NICs using software crypto (Intel ISA-L). This is distributed across tens of thousands of connections, and all is done in-kernel, using our "ssl sendfile". E.g., no DPDK, no crypto accelerators.

I'm working on a tech blog about the changes we've needed to make to the FreeBSD kernel to get this kind of performance (and working on getting some of them in shape to upstream).


It doesn't hurt to get a performance boost from your processor's crypto instructions, assuming you optimized your cipher lists to prefer crypto with a modern hw implementation (AES128-NI is 179% faster than RC4).

But is this traffic ongoing connections, new connections, a mix? They have different penalties, and result in different numbers: 90Gbps of ongoing connections might be, like, 100,000hps, but 90Gbps of new connections during primetime might only net you 50,000hps. And are you using Google's UDP TLS stuff?

Google also hacked on the kernel a lot to improve their performance, I don't know if any of that's upstream currently though. Maybe Cloudflare can answer you, as they seem to support the most HTTPS wizardry of the big CDNs.


Yes, as I said, we're using ISA-L. This is Intel's library with thin wrappers around hand-tuned assembly routines for crypto.

The traffic is mostly long-ish lived connections. Eg, the duration of a TV show or movie. So there is some churn, but not a lot.

This is all TCP. By "UDP TLS", I assume you mean QUIC?


That would be the part of Netflix not running on AWS. :-)

So the library feed can serve tens of thousands of streams, an aggregate 100Gbps, on a single node. And then... how many nodes to support the front-end UI operations to get to that point?

It's funny how amazingly efficient we can be moving encrypted bits, but to support the APIs for login, browsing titles, updating account info, and setting up a stream; I'm going to guess ~100 of those nodes for every one of your stream-tanks?


> So how does Linux compare now with FreeBSD in terms of throughput and latency? I remember like 10 years ago Linux had issues with throughput, which is why Netflix went with FreeBSD. Are they similar now?

I believe the Open Connect team at Netflix chose FreeBSD because a lot of them used to work at Yahoo and had lots of FreeBSD experience, not so much because of a performance difference between the two. As for now, the two network stacks are pretty equal when it comes to performance; some workloads are better on FreeBSD, some are better on Linux.


Can you go into more detail about what workloads are better suited for Linux and FreeBSD?


Yeah, no problem. I'm having a hard time finding all the articles/benchmarks I remember reading on this subject, but this paper has some useful info[1]. Basically, FreeBSD is good for throughput: in the paper I linked you can see FreeBSD achieving higher throughput than Linux, but also using more CPU to do it. Linux generally has lower latency than FreeBSD, which makes sense given that Linux is used extensively by high-frequency trading firms. However, there are still HFT firms using FreeBSD.

I'll edit this post as I find the other articles, videos and benchmarks about this subject.

Edit: I don't really care for Phoronix benchmarks, but here are some showing Linux winning some and FreeBSD winning others[2].

[1]: http://conferences.sigcomm.org/hotnets/2013/papers/hotnets-f...

[2]: http://www.phoronix.com/scan.php?page=article&item=netperf-b...


Lots to dig into here, thanks!


I also like this talk by George Neville-Neil, a FreeBSD networking engineer; start at 57:45[1]. That part of the talk is about differences in how the two network stacks are implemented.

[1]: https://www.youtube.com/watch?v=5mv_oKFzACM


Linux is probably faster and has always been for years now. Benchmark a recent kernel and prove me wrong.


> Linux is probably faster and has always been for years now. Benchmark a recent kernel and prove me wrong.

How about you post the benchmarks you used to form your answer, otherwise your post is nothing but speculation and probably just favoritism.


What good is benchmarking a kernel going to do when performance in this context has many more moving parts?


Interesting article. Well, to be honest, some of the concepts were totally new to me, and I found it interesting to learn from this article. And thanks for the other links too.


Great write-up for traditional/kernel tuning! I guess I'm naively waiting for DPDK-based userspace solutions to appear.


Just goes to show how much one should know in our field to make the machine work well for you. For somebody who can understand the article, the stuff is mostly known, but if you don't know it, the article is pretty dense.

It would be nice if someone made a Docker image with all the tuning set (except the hardware).

It would have been nicer if the author had shown what the end result of this optimization looks like, with numbers, compared against a standard run-of-the-mill nginx setup.


>It would be nice if someone made a Docker image with all the tuning set (except the hardware)

Have you not read the article?

>In this post we’ll be discussing lots of ways to tune web servers and proxies. Please do not cargo-cult them. For the sake of the scientific method, apply them one-by-one, measure their effect, and decide whether they are indeed useful in your environment.


I don't think they meant for production, just testing and toying with it.


> It would be nice if someone made a Docker image with all the tuning set (except the hardware)

This degree of optimization is something that really depends on a specific use case and the precise configuration of the entire hardware+software stack, and not a general-purpose best practices list that can be put into a container.

The author says as much at the top in the disclaimer:

> Please do not cargo-cult them. For the sake of the scientific method, apply them one-by-one, measure their effect, and decide whether they are indeed useful in your environment.

Better to take away the profiling and optimization methodologies from the article rather than specific settings.


The entire idea of tuning is that you tune it to your particular workload.


Great post! Contents of it aside, I very much like the disclaimer:

> In this post we’ll be discussing lots of ways to tune web servers and proxies. Please do not cargo-cult them. For the sake of the scientific method, apply them one-by-one, measure their effect, and decide whether they are indeed useful in your environment.

Far too often I see people apply ideas from posts they've read or talks they've seen without stopping to think whether it makes sense in the context they're applying it to. Always think about context, and measure to make sure it actually works!


Excellent!


Very interesting


You can skip all of that nonsense and run FreeBSD.


When I was at Yandex, we were a FreeBSD-CURRENT shop (FreeBSD 9 at that time). That said, FreeBSD had exactly the same[0] issues at the low/mid levels, just not all of them had a good solution[1][2].

[0] https://wiki.freebsd.org/201305DevSummit/NetworkReceivePerfo...

[1] https://wiki.freebsd.org/NetworkPerformanceTuning

[2] https://wiki.freebsd.org/TransportProtocols


While I love FreeBSD, you definitely have to take the time to tune it if your application requires it.


I wish they could make a MacOSX app that doesn't use almost 100% of one core all the time.


Sorry you're downvoted, but what you mention is a real concern.


Not to this topic.


Great article overall - but it starts off by saying 'do not cargo-cult this' and then proceeds to prescribe many 'mandates' without giving any rationale behind them.


I've tried to:

1) give a bcc/perf example to check the need for tuning and verify its effect.

2) give code/docs/paper references as embedded links.

3) give a generic monitoring guideline at the start of a "chapter".

Seems like I've (at least partially) failed. I'll do better next time.





