Fun anecdote: at Blekko we had people who tried to scrape the search engine by fetching all 300 pages of results. They would do that with some script or code, and it was clear they weren't human because they would ask for each page right after the other. We sent them to a process that Greg wrote on a machine that did most of the TCP handshake and then went away. As a result, the scrapers' scripts would hang forever. We saw output that suggested some of these things sat there for months waiting for results that would never come.
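The failure mode on the scraper's side is just a blocking read with no timeout. A stdlib Python sketch of that side of the trap (a simplification: the server here accepts the connection and goes silent, rather than stalling the handshake itself as described above):

```python
import socket
import threading

# A "tarpit" stand-in (hypothetical illustration, not Blekko's actual setup):
# accept the TCP connection, then never send a byte.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def accept_and_stall():
    conn, _ = server.accept()
    threading.Event().wait()  # hold the connection open forever, sending nothing

threading.Thread(target=accept_and_stall, daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
client.settimeout(1.0)  # without this, recv() blocks indefinitely -- the scrapers' bug
try:
    client.recv(4096)
    result = "got data"
except socket.timeout:
    result = "timed out"
print(result)
```

A script that never sets a timeout (the default for a blocking socket) has no way out of that `recv()`, which is exactly the months-long hang described above.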
Not necessarily. If you have a good RPC layer, it will abstract away most of these details. For example, gRPC automatically does periodic health checking so you never see a connection hang forever. (But you do see some CPU and network cost associated with open channels!) gRPC (or at least gRPC's predecessor, Stubby) has various other quirks which are more critical to understand if you build your system on top of it. I expect the same is true of other RPC layers. Some examples below:
* a boolean controlling "initial reachability" of a channel before a TCP connection is established (when load-balancing between several servers, you generally want it to be false so your RPC goes to a known-working server)
* situations in which existing channels are working but establishing new ones does not
* the notion of "fail-fast" when a channel is unreachable
* lame-ducking when a server is planning to shut down
* deadlines being automatically adjusted for expected round-trip transit time
* load-balanced channels: various algorithms, subsetting
* quirky but extremely useful built-in debugging webpages
It's valuable to understand all the layers underneath when abstractions leak, but the abstractions work most of the time. I'd say it's more essential to understand the abstraction you're using than the ones two or three layers down. People build (or at least contribute to) wildly successful distributed systems every day without a complete understanding of TCP or IP or Ethernet.
In my experience an abstraction is only as strong as an organization's ability to hold its invariant assumptions, well, invariant. And what I took away from that experience was that knowing how an abstraction was implemented let me see those invariant violations well before my peers started asking, "well, maybe this library isn't working like I expect it to."
Huh. I don't see bug reports like that. "This RPC with no deadline hangs forever" definitely but I wouldn't call that a Stubby problem. I'd call it a buggy server (one that leaks its reply callback, has a deadlock in some thread pool it's using for request handling, etc.) and a client that isn't properly using deadline propagation.
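For reference, deadline propagation just means each hop forwards how much of the caller's time budget remains, rather than starting a fresh timeout, so a stalled backend fails the whole chain promptly. A hedged stdlib sketch of the pattern (not Stubby's or gRPC's actual API):

```python
import time

def call_with_deadline(deadline, hops):
    """Simulate an RPC chain: each hop consumes some time and forwards the
    *remaining* budget to the next hop instead of a fresh per-hop timeout."""
    for cost in hops:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "DEADLINE_EXCEEDED"
        time.sleep(min(cost, remaining))  # pretend this hop does `cost` seconds of work
    return "OK"

# A 0.5 s overall budget survives three fast hops...
ok = call_with_deadline(time.monotonic() + 0.5, [0.05, 0.05, 0.05])
# ...but one slow hop exhausts the shared budget for everything downstream.
slow = call_with_deadline(time.monotonic() + 0.2, [0.3, 0.05])
print(ok, slow)
```

A client that omits the deadline entirely is the "hangs forever" bug report: there is no budget to exhaust, so a leaked reply callback on the server blocks the caller indefinitely.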
So, a simple setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one)) call?
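Roughly, though SO_KEEPALIVE alone uses kernel defaults (on Linux, typically two hours of idle time before the first probe), so in practice you also tune the probe timing. A sketch using Python's socket module; the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT knobs are Linux-specific:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-only tuning; without it, the idle time before the first probe
# defaults to net.ipv4.tcp_keepalive_time (usually 7200 seconds).
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before reset

keepalive_on = bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
print("keepalive enabled:", keepalive_on)
```

Even tuned, TCP keepalive only detects a dead peer; it can't tell you the peer process is alive but wedged, which is one reason the heartbeats below end up in the application layer.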
Sometimes the heartbeats are sent by the server, sometimes by the client; it depends. But they always end up in the application-layer protocol.
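The minimal shape of such an application-layer heartbeat, as a stdlib sketch over a loopback socket (the PING/PONG message names are made up for illustration):

```python
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def answer_pings():
    conn, _ = server.accept()
    while True:
        msg = conn.recv(16)
        if not msg:
            return
        if msg == b"PING\n":
            conn.sendall(b"PONG\n")  # an answer proves the peer (not just the socket) is alive

threading.Thread(target=answer_pings, daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
client.settimeout(2.0)  # a missed heartbeat surfaces as a timeout, not a silent hang

client.sendall(b"PING\n")
reply = client.recv(16)
print("peer alive:", reply == b"PONG\n")
```

The point of putting this in the application protocol is that a reply requires the remote *process* to execute code, which is a stronger liveness signal than anything the kernel's TCP keepalive can give you.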
Not really. Writing reliable network programs has always had to take this into account. Take the scenario (not mentioned in TFA) where a firewall somewhere in the network path suddenly starts blocking traffic in one direction. Or where a patch cable is pulled from a switch (or from your computer, which has a different result). These scenarios were always real and resulted in comparable connection error states.
Do agree it's a good post though, explains it rather nicely.