Hacker News
Nginx optimization: Understanding sendfile, tcp_nodelay and tcp_nopush (t37.net)
143 points by dirtyaura on Feb 13, 2015 | 29 comments

> To avoid network congestion, the TCP stack implements a mechanism that waits for the data up to 0.2 seconds so it won’t send a packet that would be too small. This mechanism is ensured by Nagle’s algorithm, and 200ms is the value of the UNIX implementation.

Sigh. If you're doing bulk file transfers, you never hit that problem. If you're sending enough data to fill up outgoing buffers, there's no delay. If you send all the data and close the TCP connection, there's no delay after the last packet. If you do send, reply, send, reply, there's no delay. If you do bulk sends, there's no delay. If you do send, send, reply, there's a delay.

The real problem is ACK delays. The 200ms "ACK delay" timer is a bad idea that someone at Berkeley stuck into BSD around 1985 because they didn't really understand the problem. A delayed ACK is a bet that there will be a reply from the application level within 200ms. TCP continues to use delayed ACKs even if it's losing that bet every time.
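A toy timing model of the send, send, reply case may make the interaction easier to see. This is only a sketch: the function name, the symmetric latency, the zero processing time, and the server that answers only after receiving both writes are all assumptions made for illustration, not a description of any real TCP stack.

```python
def time_to_reply_ms(nagle_on: bool, rtt_ms: float, ack_delay_ms: float = 200.0) -> float:
    """Toy model: the client does two small writes, then waits for one reply.

    Assumptions (illustrative only): one-way latency is rtt_ms / 2 in both
    directions, there is no loss or processing time, and the server replies
    only once it has received both writes.
    """
    one_way = rtt_ms / 2
    # The first write finds nothing in flight, so it leaves at t = 0.
    first_arrives = one_way
    if nagle_on:
        # The second small write is held until the first one is ACKed.
        # The server has no reply to piggyback the ACK on, so the ACK
        # waits for the delayed-ACK timer before heading back.
        ack_arrives = first_arrives + ack_delay_ms + one_way
        second_arrives = ack_arrives + one_way
    else:
        # With Nagle off, the second write leaves immediately as well.
        second_arrives = one_way
    # The server replies as soon as it has both writes.
    return second_arrives + one_way
```

Under these assumptions, a 20 ms round trip gives 240 ms with Nagle on and a 200 ms ACK delay, 40 ms with Nagle on and `ack_delay_ms=0`, and 20 ms with Nagle off.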

If I'd still been working on networking at the time, that never would have happened. But I was off doing stuff for a startup called Autodesk.

John Nagle

Are you John Nagle, or was that a quote?

He is. Check his Animats blog. And it's kind of awesome that here we have a bare-bones internet forum, in which we can have uninformed discussion about Nagle's algorithm only to be enlightened by Mr. Nagle. Hooray for HN and John :)

Indeed! Thanks for confirming :)

Yes, it's me. I did my networking work at Ford Aerospace in the early 1980s. But I left in 1986. It still bothers me that the Nagle algorithm (which I called tinygram prevention) and delayed ACKs interact so badly.

That fixed 200ms ACK delay timer was a horrible mistake. Why 200ms? Human reaction time. That idea was borrowed from X.25 interface devices, where it was called an "accumulation timer". The Berkeley guys were trying to reduce Telnet overhead, because they had thousands of students using time-sharing systems from remote dumb terminals run through Telnet gateways. So they put in a quick fix specific to that problem. That's the only short fixed timer in TCP; everything else is geared to some computed measure such as round trip time.

Today, I'd just turn off ACK delay. ACKs are tiny and don't use much bandwidth, nobody uses Telnet any more, and most traffic is much heavier in one direction than the other. The case in which ACK delay helps is rare today. An RPC system making many short query/response calls might benefit from it; that's about it. A smarter algorithm in TCP might turn on ACK delay if it notices it's sending a lot of ACKs which could have been piggybacked on the next packet, but having it on all the time is no longer a good thing.

If you turn off the Nagle algorithm and then rapidly send single bytes to a socket, each byte will go out as a separate packet. This can increase traffic by an order of magnitude or two, with throughput declining accordingly. If you turn off delayed ACKs, traffic in the less-busy direction may go up slightly. That's why it's better to turn off delayed ACKs, if that's available.

One of the few legit cases for turning off the Nagle algorithm is for a FPS game running over the net. There, one-way latency matters; getting your shots and moves to the server before the other players affects gameplay. For almost everything else, it's round-trip time that matters, not one-way.
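For anyone wondering what these switches look like from application code, here is a minimal sketch using Python's standard `socket` module. The helper name `tune_latency` and its defaults are made up for illustration. `TCP_NODELAY` is widely available; `TCP_QUICKACK` is Linux-specific, so the code looks it up defensively and skips it elsewhere.

```python
import socket


def tune_latency(sock: socket.socket, nodelay: bool = False, quickack: bool = True) -> dict:
    """Apply per-socket latency options and report what was applied.

    Hypothetical helper: the defaults follow the advice above (leave the
    Nagle algorithm on, ask for immediate ACKs where the platform allows).
    """
    applied = {}
    # TCP_NODELAY = 1 turns the Nagle algorithm off for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)
    applied["TCP_NODELAY"] = bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    # TCP_QUICKACK only exists on Linux, so it may be missing from the module.
    opt = getattr(socket, "TCP_QUICKACK", None)
    if quickack and opt is not None:
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, 1)
            applied["TCP_QUICKACK"] = True
        except OSError:
            applied["TCP_QUICKACK"] = False
    return applied
```

An FPS client would call `tune_latency(sock, nodelay=True)`. Note that, per the Linux tcp(7) man page, `TCP_QUICKACK` is not permanent, so code that relies on it typically sets it again after reads.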

Toward the end of the post the author does explain why NODELAY applies - nginx apparently does send, send, reply to write a header and then sendfile.

Are you suggesting using TCP_QUICKACK more often? Can you give examples of when it is desirable and when it is not?

One note about sendfile - if you are using VirtualBox and serving files out of a shared folder with nginx (or, I assume, Apache) you'll need to disable it or you won't see any updates you make to files once the first version is sent.

This VirtualBox bug was the reason that I started reading more about sendfile and encountered this article about nginx optimizations. The bug actually causes really weird behavior, in which extra bytes are added at the end of the file.

It doesn't really add extra bytes - it "remembers" the original size of the file. So if you update the file to be larger, you only get the first N bytes. If you update it to be smaller, you get the full file + zero bytes at the end to fill out the remainder. If you can't turn off sendfile for some reason, the workaround is to delete the file every time before saving it.

Thanks, enlightening. Is this due to some implementation difficulty with sendfile when using a shared file system between the box and the host machine?

I JUST ran into this issue this weekend! Glad to know I'm not going insane.

Do you have any links specifically discussing the issue?


Google for 'virtualbox sendfile' and there's a lot of discussion.

Worth noting, and this bit me recently, Go's http file handler also uses sendfile beneath the covers.

Yes, this StackOverflow answer solved the mystery for me http://stackoverflow.com/questions/12719859/syntaxerror-unex...

I then found out that it is actually mentioned in the Vagrant docs also: https://docs.vagrantup.com/v2/synced-folders/virtualbox.html

Haha omg this issue. I came across this about a year ago, and it took forever to diagnose the problem. Who would think to look for a specific declaration in the web server config? I thought for the longest time it was a problem with my host<->guest mount. Glad to see other people suffering similarly :p

Not only that, but using an nginx container on OSX with boot2docker (VirtualBox based) was actually just causing all of the static files being served to be truncated. I spent a solid day tracking that down.

This occurred with VMware Fusion 6 as well.

I always cringe a bit inside, when reading articles like this.

nodelay and cork are different, indeed, but opposites? They both try to achieve the same effect, put more data in before sending a packet.

> [...] This mechanism is ensured by Nagle’s algorithm, and 200ms [...]

Absolutely not. Nagle's algorithm does not have any delay or timer built in. It simply holds back non-full packets when there is data in flight (not acked). The second half of the problem is delayed acks, but this is not mentioned in the article; instead it goes on saying

> [...] but Nagle is not relevant to the modern Internet [...]

which is indeed popular belief, but a very superficial analysis that holds no water if you study it further.
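To make the "no timer" point concrete, the rule described above (hold back non-full packets while data is in flight) can be sketched as a pure decision function. This is a simplified illustration, not kernel code: the function and parameter names are invented here, and it ignores window limits, PSH handling and the Minshall variant.

```python
def nagle_may_send(queued_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Return True if a segment may go out now under a simplified Nagle rule.

    Note that there is no clock or timeout anywhere in this decision.
    """
    if queued_bytes <= 0:
        return False  # nothing to send
    if queued_bytes >= mss:
        return True   # a full segment always goes out
    # A partial segment goes out only when nothing is in flight.
    return unacked_bytes == 0
```

Any waiting comes from how long the peer takes to ACK the data in flight, which is where delayed ACKs enter the picture.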

The feeling I always get from articles like this is they border on "technical religion". It sounds correct, it is technical, it isn't even false, but it doesn't paint a clear picture, instead it mystifies things further.

The problems nginx had:

1. nagle and http keepalive don't play nice together, the last bit of data might be artificially delayed, especially when delayed acks come into play. nodelay seems needed here. (It is not though, see that Minshall bit.)

2. how to send headers and use sendfile for the body, and fill the first packet with more than just the headers? nopush (tcp_cork) is a solution.
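The second point can be sketched roughly as follows. This is not nginx's code, just an illustration of the cork, write headers, sendfile, uncork sequence; the function name is hypothetical, and `TCP_CORK` is Linux-only, so it is looked up defensively and skipped on other platforms.

```python
import socket


def send_header_then_file(conn: socket.socket, header: bytes, path: str) -> int:
    """Send header bytes followed by a file body, corking the socket if possible."""
    cork = getattr(socket, "TCP_CORK", None)  # Linux only
    if cork is not None:
        # While corked, the kernel holds back partial frames, so the
        # headers can share a packet with the start of the body.
        conn.setsockopt(socket.IPPROTO_TCP, cork, 1)
    try:
        conn.sendall(header)
        with open(path, "rb") as body:
            # socket.sendfile() uses os.sendfile() where the platform has it
            # and falls back to plain send() otherwise.
            sent = conn.sendfile(body)
    finally:
        if cork is not None:
            # Clearing the option flushes any queued partial frame.
            conn.setsockopt(socket.IPPROTO_TCP, cork, 0)
    return len(header) + sent
```

The socket is assumed to be a connected, blocking TCP socket; `socket.sendfile()` does not support non-blocking sockets.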

The "modern" way to do things is to use the splice syscall and related calls instead of sendfile, which can deal with copying between sockets and appending headers more directly.

Igor said in a talk I went to that Nginx 1.x was written for FreeBSD first, while 2.0 will be written for Linux first so perhaps some of these things may change (hence "nopush" in the config file, the freebsd term).

The interesting question is: splice() has been around for so long, so why hasn't anybody built a production-ready server on top of it yet?

Is splice supported in nginx already? How do you configure it?

No I don't think so, not at present.

Does anyone else use and/or have thoughts on some of the boilerplate nginx configuration projects?



Is there any point to optimisations like sendfile anymore when increasingly sites are being served over TLS?

Our nginx configurations use:

    sendfile on;
    tcp_nopush on;
    tcp_nodelay off;

That is, tcp_nodelay is off. Is the author suggesting we turn it on?

Yes. As far as I understood, "tcp_nodelay on" is more reasonable for the modern web; the whole delay business of TCP was reasonable for terminals.

As the latter part of the article describes, "tcp_nodelay on" is at odds with "tcp_nopush on" because they are mutually exclusive, but nginx has special behavior: with "sendfile on", it uses tcp_nopush for everything but the last packet, then turns nopush off and enables nodelay to avoid the 0.2 second delay.

Yes, since with it on, the functionality stays off until the final (and only partially full) packet. Rather than being delayed by 200ms, that packet is sent out instantly with tcp_nodelay.

sendfile is also not necessarily zero-copy: https://svnweb.freebsd.org/base?view=revision&revision=25560... https://git.kernel.org/cgit/linux/kernel/git/stable/linux-st... (sendfile calls splice; the splice manpage also states that it might copy instead of move pages.)

That said, it equates to a single kernel-to-kernel copy instead of a kernel-to-userspace plus a userspace-to-kernel copy as in the read/write case.
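The two copy paths being compared look roughly like this from userspace. It is a sketch with made-up function names; whether the kernel actually moves pages or copies them inside `sendfile` is not visible from here, as noted above.

```python
import os
import socket


def copy_read_write(in_fd: int, out: socket.socket, chunk: int = 65536) -> int:
    """Copy via a userspace buffer: kernel -> user -> kernel."""
    total = 0
    while True:
        buf = os.read(in_fd, chunk)   # kernel-to-userspace copy
        if not buf:
            return total
        out.sendall(buf)              # userspace-to-kernel copy
        total += len(buf)


def copy_sendfile(in_fd: int, out: socket.socket, chunk: int = 65536) -> int:
    """Copy inside the kernel with sendfile(2); no userspace buffer involved."""
    offset = 0
    while True:
        sent = os.sendfile(out.fileno(), in_fd, offset, chunk)
        if sent == 0:                 # end of file reached
            return offset
        offset += sent
```

Both functions assume a blocking socket and return the number of bytes handed to the kernel; `os.sendfile` is not available on every platform.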

Nagle Angle and Naggle ?
