
That is an awesome post. If you are thinking about writing distributed systems code you need to understand this behavior of TCP/IP completely. Back when 100 nodes was a "big" network these sorts of problems were so rare that you could be forgiven for not taking them into account, but these days with virtualization you can have a 500 node network all on the same freakin' host! And the law of large numbers says "rare things happen often, when you have a lot of things."

Fun anecdote: at Blekko we had people who tried to scrape the search engine by fetching all 300 pages of results. They would do that with some script or code, and it was clear they weren't human because they would ask for each page one right after the other. We sent them to a process that Greg wrote on a machine that did most of the TCP handshake and then went away. As a result the scraper's script would hang forever. We saw output that suggested some of these things sat there for months waiting for results that would never come.




> If you are thinking about writing distributed systems code you need to understand this behavior of TCP/IP completely.

Not necessarily. If you have a good RPC layer, it will abstract away most of these details. For example, gRPC automatically does periodic health checking so you never see a connection hang forever. (But you do see some CPU and network cost associated with open channels!) gRPC (or at least gRPC's predecessor, Stubby) has various other quirks which are more critical to understand if you build your system on top of it. I expect the same is true of other RPC layers. Some examples below:

* a boolean controlling "initial reachability" of a channel before a TCP connection is established (when load-balancing between several servers, you generally want it to be false so your RPC goes to a known-working server)

* situations in which existing channels are working but establishing new ones does not

* the notion of "fail-fast" when a channel is unreachable

* lame-ducking when a server is planning to shut down

* deadlines being automatically adjusted for expected round-trip transit time

* load-balanced channels: various algorithms, subsetting

* quirky but extremely useful built-in debugging webpages

It's valuable to understand all the layers underneath when abstractions leak, but the abstractions work most of the time. I'd say it's more essential to understand the abstraction you're using than ones two or three layers down. People build (or at least contribute to) wildly successful distributed systems without a complete understanding of TCP or IP or Ethernet every day.


Hmm, I realize you can't say, but when I was there "Stubby hangs forever ..." was in a painfully large number of bug reports. So gRPC has fixed all that, has it? Great to know.

In my experience an abstraction is only as strong as an organization's ability to hold its invariant assumptions, well, invariant. And what I took away from that experience was that knowing how an abstraction was implemented allowed me to see those invariance violations long before my peers started to ask, "well, maybe this library isn't working like I expect it to."


> Hmm, I realize you can't say, but when I was there "Stubby hangs forever ..." was in a painfully large number of bug reports. So gRPC has fixed all that, has it? Great to know.

Huh. I don't see bug reports like that. "This RPC with no deadline hangs forever", definitely, but I wouldn't call that a Stubby problem. I'd call it a buggy server (one that leaks its reply callback, has a deadlock in some thread pool it's using for request handling, etc.) and a client that isn't properly using deadline propagation.


> For example, gRPC automatically does periodic health checking so you never see a connection hang forever.

So, a simple setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE) call?


Application-level keepalives can be handy in other regards, e.g. you can assert that your application's event loop isn't blocked on an infinite loop, deadlock, or blocking call in user-level code, and you can pass operational statistics back to your (presumably proprietary) monitoring system.
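
A rough sketch of what I mean, with made-up timing values (the heartbeat is sent from the same loop that handles real traffic, so each PING also proves the loop is alive):

    #include <poll.h>
    #include <time.h>
    #include <unistd.h>

    /* Hypothetical sketch: the heartbeat is written from the same event loop
       that services real traffic, so every PING also proves the loop isn't
       wedged on a deadlock or a blocking call. Timing values are arbitrary. */
    void heartbeat_loop(int fd) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        time_t last_seen = time(NULL);
        char buf[4096];

        for (;;) {
            if (poll(&pfd, 1, 5000) > 0 && (pfd.revents & POLLIN)) {
                if (read(fd, buf, sizeof(buf)) <= 0) break;   /* peer closed or error */
                last_seen = time(NULL);                       /* any traffic counts as alive */
            }
            if (time(NULL) - last_seen > 15) break;           /* ~3 missed beats: give up */
            if (write(fd, "PING\n", 5) != 5) break;           /* our beat; peer replies in-band */
        }
        close(fd);
    }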


I found in practice that this is usually not enough, and every distributed system I designed or worked on ended up with heartbeats at some point. It could be OS peculiarities, or it could be the inability to tweak the keepalive timings.

Sometimes the heartbeats are sent by the server, sometimes by the client, it depends. But they always end up in the application layer protocol.


... because the TCP keepalive is a minimum of 2 hours, per RFC. Which is far too long, so everyone adds one at the application level.


The minimum default is 2 hours, but applications can configure this to much smaller intervals, like 10 seconds.


I feel like putting heartbeats themselves into the application level is a layering violation. They go in the session or presentation layer. WebSockets does it right, with its own heartbeat frame type that gets handled in userland but outside of your app's code.
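
(For what it's worth, the frame itself is tiny: per RFC 6455, a server-to-client ping with an empty payload is just two bytes on the wire. A rough C sketch:)

    #include <unistd.h>

    /* RFC 6455: 0x89 = FIN bit + ping opcode (0x9); 0x00 = unmasked, zero-length
       payload. Clients must mask their frames, so this exact form is
       server-to-client only; the peer answers with a pong frame (opcode 0xA). */
    static ssize_t send_ws_ping(int fd) {
        static const unsigned char ping[2] = { 0x89, 0x00 };
        return write(fd, ping, sizeof(ping));
    }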


On Linux you can set tcp_keepalive_intvl and tcp_keepalive_probes to make this much shorter, but those are global to all sockets, so app keepalives are better for finer control, among other things mentioned.


There are 3 socket options (TCP_KEEPCNT/TCP_KEEPIDLE/TCP_KEEPINTVL) that let you control this per socket too; it's not just global.
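
Something along these lines on Linux (the values here are just illustrative):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Sketch: enable keepalive on one connected socket and tighten its timers.
       With these (illustrative) values the kernel probes after 10s of idle,
       re-probes every 5s, and resets the connection after 3 failed probes. */
    static int enable_keepalive(int fd) {
        int on = 1, idle = 10, intvl = 5, cnt = 3;
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0) return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0) return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) return -1;
        return 0;
    }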


I guess the HTTP/2 PING messages.


A connection hanging for two hours is awfully close to hanging forever.


> Back when a 100 nodes was a "big" network these sorts of problems were so rare that you could be forgiven for not taking them into account

Not really. Writing reliable network programs always had to take this into account. Take the scenario (not mentioned in TFA) where a firewall somewhere in the network path suddenly starts blocking traffic in one direction. Or a patch cable being pulled from a switch (or from your computer, which has a different result). These scenarios were always real and resulted in comparable connection error states.

Do agree it's a good post though, explains it rather nicely.


Completely agree. I've run into a lot of these problems on a single node.


This practice is a type of tarpitting, and is fairly common and achievable in several ways. I've seen it implemented with several-second delays between bytes, but you have to weigh such active techniques against the cost of keeping the connection and its resources set up.
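
E.g. a rough sketch of the slow-drip variety (arbitrary timings; the tradeoff is that you hold an open fd and a bit of kernel state per connection you string along):

    #include <string.h>
    #include <unistd.h>

    /* Hypothetical slow-drip tarpit: dribble out a plausible-looking response
       one byte at a time, several seconds apart, so the client keeps waiting.
       Costs one open fd (plus some kernel state) per connection we hold. */
    static void tarpit(int client_fd) {
        const char *bait = "HTTP/1.1 200 OK\r\n";
        for (size_t i = 0; i < strlen(bait); i++) {
            if (write(client_fd, &bait[i], 1) != 1) break;  /* client gave up */
            sleep(10);                                      /* several-second delay */
        }
        close(client_fd);
    }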


Or you can re-write enough of the TCP stack in user land to do all of the handshaking and then go away. No connections, no resources wasted, infinite numbers of clients waiting forever and ever in their read() call.


Oh, sure, I actually liked your method and wish I had thought of it. I meant active techniques like mine.


So there's a way to hang the connection that won't be caught by a `timeout` setting in most HTTP clients? Or the bot writers were just too lazy to add timeouts in their bots?


I suspect the bot writers were too lazy to add timeouts.
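
Even a single socket option on their side would have bounded the hang, e.g. (a sketch, with an arbitrary 30-second value):

    #include <sys/socket.h>
    #include <sys/time.h>

    /* Bound every blocking recv()/read() on this socket to 30 seconds; after
       that the call fails with EAGAIN/EWOULDBLOCK instead of waiting forever. */
    static int set_read_timeout(int fd) {
        struct timeval tv = { .tv_sec = 30, .tv_usec = 0 };
        return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    }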


He had no way to know if they were actually still waiting months later. Just because the peer has not sent a RST doesn't mean the remote end still has a process or machine in existence.


As one data point, Python's feedparser has no timeout. I've seen it sit for days when the remote end disappeared.



