What developers should know about TCP (robertovitillo.com)
359 points by todsacerdoti on May 15, 2020 | 154 comments



What they mostly should know: TCP provides a bidirectional stream of bytes on the application level. It does NOT provide a stream of packets.

That means whatever you pass to a send() call is not necessarily the same amount of data the receiver will observe in a single read() call. You might get more or fewer bytes, since the transport layer is free to buffer and to fragment data.

I have seen the assumption that TCP has packet boundaries at the application level made too often, typically in Stack Overflow questions like: "I don't receive all data. Is my OS/library broken?"


> What they mostly should know: TCP provides a bidirectional stream of bytes on the application level. It does NOT provide a stream of packets.

> That means whatever you pass to a send() call is not necessarily the same amount of data the receiver will observe in a single read() call.

Yes, this. For god's sake, listen to them.

I had to fight a coworker on this. I had quickly created some client code just to validate that the server was working. Due to some quirk, all the messages were arriving in full in every read call. He told me to ship it.

I said no! "I need to check if there's more data and if so add a loop to read again" "But it is working, release it". That went on for a while, to no avail. Wouldn't look at documentation either.

Eventually he had to leave for the day, and I took the time to implement it correctly.

I started including basic TCP questions on interviews. Not many people even get past the TCP handshake (if they even know about that).


The problem here was not a lack of knowledge of a particular subject. The problem is that this person was unwilling to learn about a thing they thought they knew.


That's correct.


I feel like I can Google all the syn/ack/packetsniffing bits when those come up during troubleshooting (Why is this disconnecting? What do you mean the gateway sends out rst packets when there's no activity for 5 minutes?!?). Seems a bit harsh to start with those. The guarantees about the protocol are the important part, the rest seems kind of superfluous unless there's an unusual problem or you're pushing the limits of a network.


I've personally been on many outage troubleshooting calls that cost hundreds of thousands of dollars precisely because engineers believed they could ignore the underlying details and corner cases of TCP. My favorite common mistake is that developers assume they can open a single connection with no retries and that connection will be reliable as long as the handshake succeeds.


Stupid question: why would you be writing code that works at the level of TCP? Don't you usually want to use the OS's (or some popular library's) TCP software stack?


It seems to me GP is talking about using TCP, not implementing it.


Right, I mean I thought that the TCP protocol implementation itself handled that issue, and your calls to such a library abstracted away from that.


No, it does not, by design. When you use any standard TCP implementation the abstraction provided at both ends is just a stream of bytes. The guarantee TCP makes is that the bytes received at one end will be in the same order as the bytes that were sent at the other end.

If you want to use TCP to implement some (higher-level) message-based protocol, you need to parse those out for yourself.


The original question didn't refer to such issues of parsing, just checking for whether the full message was received, which, I would think, would be handled by the TCP library read() (or whatever) call. It sounded like the OP (outworlder) was delving into lower-level TCP details that should have been abstracted away -- my thinking is that the TCP caller shouldn't have to concern itself with details about checking whether the full message has been received, at least not as a separate step. That is, it just wouldn't return anything until the full message is received, or would include some data structure that indicates it's not complete. Does that make sense?

Edit: On second thought, I guess OP meant that all of the results were coming back "complete" which doesn't obviate the issue of needing to do a check that handles the "not done" case.


> the TCP caller shouldn't have to concern itself with details about checking whether the full message has been received

The caller absolutely must concern itself with this crucial “detail”. If you do otherwise, your code is broken, full stop.

You can implement a higher-level protocol on top which handles this kind of thing internally and presents a higher-level interface (e.g. not passing any partial data along to its caller until a full message has been received), but if you are just working with TCP directly, what you get is just a stream of bytes. The guarantee you get is that the bytes will be in order and without any gaps.

If you e.g. send UTF-8 encoded text, you must be prepared on the read side to have your stream of bytes cut off arbitrarily in the middle of a character.
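
A minimal Python sketch of the UTF-8 point above. The chunk boundary is made up for illustration; on a real connection it could fall anywhere. The standard library's incremental decoder holds back an incomplete character until the remaining bytes arrive, whereas decoding each chunk on its own would fail:

```python
import codecs

def decode_chunks(chunks):
    """Decode UTF-8 text that arrives in arbitrarily split byte chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for chunk in chunks:
        # An incomplete trailing character is buffered inside the decoder
        # instead of raising UnicodeDecodeError.
        pieces.append(decoder.decode(chunk))
    # final=True raises if the stream ended in the middle of a character.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)

data = "naïve café".encode("utf-8")
# Hypothetical split in the middle of the two-byte "ï".
chunks = [data[:3], data[3:]]
print(decode_chunks(chunks))
```

Calling `data[:3].decode("utf-8")` directly would raise `UnicodeDecodeError`, which is the failure mode a naive per-read decode runs into.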


Let function A call TCP read() and pass back a struct that includes the bytes it's received and a flag that indicates whether it's read to the end of the message.

Let function B call TCP read() and never return anything until it's received all the bytes of the message.

Both of those seem (IMHO) like functions you could have in a TCP library. Neither seems (IMHO) like a higher level protocol.


The reason that this function doesn't exist is that even if the sender sent a "message" in a single API call, there is no guarantee that the networking code will send it in one IP packet (or one layer 2 frame). We don't want to couple the message size with the lower network protocol, the type of network equipment and so on, and also want to allow merging of messages for more efficient use of the network (If we have a large window size it allows better throughput).

So the job of separating the stream into messages is left to the application layer. (unlike for example in UDP, but then you have to worry about dropped messages)


What is that responding to? Of course TCP can split a message across multiple packets of the lower level protocol; that has no bearing on the concepts TCP works in and whether it can indicate end-of-message.

If your point is just that TCP doesn't have a concept of a "message" (a bytestream with a clear beginning and ending), then that's fair, but, as I said elsewhere [1] the original comment took for granted that TCP does have a well defined notion of "you've reached the end of the message", or at least, "there is no further data to receive". No one seemed to have a problem with that there, and I was just working off that assumption.

As before, I haven't checked whether this is true (I can't quickly verify from descriptions of TCP).

And, interestingly enough, there's this comment [2], which says that what I described does exist, but isn't the default. So ... I'm at least not getting a consistent answer to my question, and people who think they know what they're talking about are inconsistent with each other.

[1] https://news.ycombinator.com/item?id=23200293

[2] https://news.ycombinator.com/item?id=23200906


> the original comment took for granted that TCP does have a well defined notion of "you've reached the end of the message"

No, the original comment was complaining about a coworker who didn’t understand (and refused to listen when told otherwise) that there is no such notion in TCP. It was a response to another comment complaining about people on the internet (e.g. Stack Overflow) too often making the same mistake.

You’re more or less playing the part of that coworker here. It’s unclear why.


> Both of those seem (IMHO) like functions you could have in a TCP library. Neither seems (IMHO) like a higher level protocol.

It's responding to the last part. It is a higher level of protocol. In the traditional TCP/IP model, it's in the application level. There are many libraries with an API like you asked, they are just in a higher level.

(and MSG_WAITALL is a partial solution, applicable only if you know in advance the exact size of the message you are about to receive)


There is no "message" in TCP. So no - you can not have this in a library.


If it makes you feel better, calling read() on a TCP socket is basically the same as calling read() on a file in a file system. In both cases, you can always end up reading less than you expected, and you always must check how much you actually read. In practice, this means that calls to read() for both should always be in a loop.
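
The "always read in a loop" advice can be sketched in Python as follows. `recv_exact` is a hypothetical helper name, not part of the socket module, and it assumes a blocking socket and that the caller already knows how many bytes to expect:

```python
import socket

def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket, looping over short reads."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # recv() returning b"" means the peer closed the connection.
            raise ConnectionError(f"peer closed after {len(buf)} of {n} bytes")
        buf += chunk
    return bytes(buf)
```

How the caller learns `n` (a fixed record size, a length header, and so on) is a decision for the application protocol, not something TCP supplies.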


Those are perfect examples of a higher level protocol, since there is no way you can do it only with TCP.


That might be nice. But that's not how it works.


> which, I would think, would be handled by the TCP library read()

You would be wrong.


Thanks for the explanation, but I don't see why a library utility function wouldn't do that.


But if the goal is to process or retrieve a string of bytes, how do you know when you've got them all? That is the root of the problem: tcp isn't built to exchange messages, it's just the stream transport layer. If you want messages you have to encode and decide that yourself.


Sorry, the rest of the conversation seemed to be assuming that "whether you have received the full message" is well-defined at the level of TCP, as suggested by the original comment where I joined[1]:

>I had to fight a coworker on this. I had quickly created some client code just to validate that the server was working. Due to some quirk, all the messages were arriving in full in every read call. He told me to ship it.

>I said no! "I need to check if there's more data and if so add a loop to read again" "But it is working, release it". That went on for a while, to no avail. Wouldn't look at documentation either.

I was just going along with that.

I didn't check up on TCP further to verify whether this was actually true; if you are saying that's not a well-defined concept, you might want to reply to that comment to say so.

[1] https://news.ycombinator.com/item?id=23195230


Those quotes don't claim messages are received in full, you are misinterpreting them.


Of course, but they do indicate that "there is still more of the message to receive" is a well-defined concept (or at least "there is more data to receive" is), which is all my point requires.

(Edit: also, it would help if you said what the correct interpretation would be, since it has the phrase "messages were arriving in full in every read call".)


>but they do indicate that "there is still more of the message to receive"

Not at the TCP level they don't.

They just give you some bytes. It's up to you to decide whether you have the full message or not and if you want to try to read more. The TCP read functions just give you the data they have. There is no concept of one write at the sender end translating to a complete message at the other end. It's just a stream of bytes.


That poster's coworker was implementing a message-oriented protocol and testing the client and server on the same machine. When running the software in this configuration, the coworker observed that each read returned an entire message, even though this was not going to be true in other configurations where the client and server are different machines or the messages are larger.


Most TCP libraries simply do not come with a function that does "read, but only give me 0 bytes or n bytes, and if you get some other number please hang on to the leftovers until next time". I guess I could follow your advice, but the first step of doing that would be to write such a function again.


They do... with an option called MSG_WAITALL.

That naturally leads to the question of why it's not the default, which can be answered by understanding the history of TCP and computer networking; and more interestingly, how might things have been different if MSG_WAITALL was the default from the beginning.
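
For reference, a rough Python sketch of what using MSG_WAITALL looks like. It assumes a blocking socket on a platform where `socket.MSG_WAITALL` is defined, and it uses `socket.socketpair()` as a stand-in for a real TCP connection. `recv_waitall` and `slow_sender` are hypothetical names for this example:

```python
import socket
import threading
import time

def recv_waitall(sock, n):
    """Ask the OS to block until n bytes have arrived (blocking sockets only)."""
    data = sock.recv(n, socket.MSG_WAITALL)
    # Even with MSG_WAITALL the call can return fewer bytes, e.g. if the
    # peer closes the connection or a signal interrupts the call.
    if len(data) < n:
        raise ConnectionError(f"got {len(data)} of {n} bytes")
    return data

def slow_sender(sock):
    # Two separate sends with a pause in between.
    sock.sendall(b"hello")
    time.sleep(0.05)
    sock.sendall(b"world")

a, b = socket.socketpair()
t = threading.Thread(target=slow_sender, args=(a,))
t.start()
print(recv_waitall(b, 10))
t.join()
a.close()
b.close()
```

This still requires knowing the message size up front, so it covers fixed-size records or a read of a length header, not arbitrary messages.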


I don't think this gets you nonblocking all-or-nothing recv.


If you're using a library, then sure. But if you're just reading from a raw TCP socket, it's just a stream of bytes. It's up to your application to parse those bytes (e.g. into a http request).

The OS will buffer bytes received from TCP packets for you until you read from the socket again to drain the buffer. Your application needs to determine how to semantically chop those bytes up into the protocol it's expecting (e.g. http request).

My low-level networking chops are a little rusty so please correct my understanding if I'm off-base somewhere.


"But it is working, release it".

Famous last words :-)


Sounds like how most software handles security until it’s audited.


Unless you are writing TCP drivers I don't see how knowing the exact TCP handshake is useful for software development.


> the rest of the conversation seemed to be assuming that "whether you have received the full message" is well-defined at the level of TCP

I don't see anything in the comment you linked that implies that.

Sorry you got confused, but there's no need for anyone to go reply to that comment to say so.


In the fall of 2016 I had a lengthy email exchange with an industrial automation vendor who didn't understand this issue. I even mailed them a short Python proof-of-concept snippet that slept a few milliseconds between the write() calls and in response got back my code "fixed" with the sleep removed.

In between the emails I googled a bit and found the changelogs for the RTOS they were using. Turned out that it was a bug in the upstream HTTP server. This also meant that the platform they were using had all the security holes from those five-plus years. The bug was later silently fixed when they acquired a newer release from upstream.

Currently I'm having a similar issue with the very same vendor. This time they don't understand why client-side authentication means no authentication at all and why passwords must not be stored in plain text in the database that can be remotely backed up from the device.


Why don't you tell us the vendor's name? It seems like the responsible thing to do.


Even after the bug gets fixed, it'll probably take years for all the embedded devices in the public internet to get patched, so no.


But in the meantime, won't the vendor keep adding more broken devices to the public internet, making the problem worse?

The longer it takes for this problem to become public, won't more harm be caused when it does become public?


We're just waiting for the free market to kick in.


> This time they don't understand why client-side authentication means no authentication at all

I've seen this... with an intern! I can't imagine dealing with a whole team like that.


How do you not kill these people? How do you put up with it? How do vendors like this survive?


Just like in nature, they survive because they are good enough, and don't experience enough competition to be eliminated by selection.


Depending on what kind of vendor we're talking about, it might be that such aspects aren't even part of what makes them competitive. The average user is not going to know about these types of issues, and so they're not even going to consider such issues when evaluating the vendor.


full disclosure could put some selective pressure on them.


I would assume the “free market” here is that these companies will over-extend themselves so much that they will no longer be able to hide the bugs from the malicious parties and their devices will start getting hacked en masse.

I would assume, however, that there is no law forcing minimal security so you can class A them, can you?


Just about every industrial automation vendor is like this in my experience. They never upgrade because they don't want to break anything.


In inverse order:

Because nobody gives a shit about quality unless it hits their paycheck.

"Onnnngggg. They pay me hourly. Onnnngggg."

They cut lots of checks.


Yeah. Fun problem for beginners, because 1) your incorrect code may work for a while when reads/writes are small or it's only run on a local network or such, 2) you might design a broken protocol if you don't understand fragmentation, etc., which will tend to be harder than (say) an isolated client bug to fix, 3) the implementation-dependent nature of fragmentation can make it look like you hit a language/library/OS issue, 4) your language/library may or may not offer tools to help a beginner to implement a delimited or framed wire format properly (ideally with things like record-size limits and timeouts).

Not sure it says anything you haven't, but a StackOverflow answer on fragmentation (framed by asker as Go not behaving like C) is one of the more-read ones I've written: https://stackoverflow.com/questions/26999615/go-tcp-read-is-...


A version of Microsoft Exchange had a bug in its SMTP implementation that was tickled when lines crossed packet boundaries. (EDIT: The issue was more likely a bug in Exchange's TLS record processing, breaking when a logical line crossed TLS records.) My async SMTP library used a simple fifo for buffering outbound data which didn't realign the write pointer to 0 except when it was completely drained, so when reading slices (iovec's) from the fifo for write-out it would occasionally call write/send with an incomplete line (i.e. part of a line that wrapped around from the end of the fifo buffer array to the front) even if the application had only written full lines. (At the time it didn't support writev/sendmsg, though I'm not sure it would have helped as the TLS record layer might still have been prone to splitting logical lines across packets.) There was no bug here on my end--everything would be sent correctly--but you can't tell the customer that he can't send e-mail to some third-party because that third-party is using a broken version of Exchange.

The first quick fix was to unconditionally realign the fifo contents after every write (the fifo had a realign method), but that ran into a computational complexity problem when you had lots of small lines (e.g. the application caller dumped a huge message into the buffer and then flushed it out in one go) and a high-latency connection that resulted in many short writes; you were constantly memmove'ing the megabytes of remaining contents in the buffer for every tiny write you did. So then I ended up having to add a new interface to the fifo that returned a slice up to a limit but always ending with a specified delimiter (e.g. "\n") if the delimiter was within the maximum chunk size.

Of course, none of these fixes would have completely remedied the issue as lower layers (the TLS stack, the kernel TCP stack) could have still potentially split logical lines, and I'm sure did on occasion. But it at least seemed to put us on equal footing with everybody else in terms of how often it happened, which is really the best anybody could have done. Complaints did die down.


This probably bites lots of newbies, since when you're just sending traffic over localhost, the send()s and read()s tend to line up.


I have often wished for an "unhelpful testing environment" of sorts, to deal with these things before they get out of hand. It would feature a compiler that had creatively different interpretations of undefined behaviors, randomly compile against glibc and musl, have a base OS lovingly crafted from Ubuntu, but with most coreutils replaced with busybox and/or BSD versions. And, now, I suppose, it would have a customized network stack (kernel module?) that would randomly reorder/drop/duplicate packets, randomly reselect MTU on every boot, or maybe just randomly fragments things regardless of MTU. Ideally it would come with a FAQ of "my program broke on X; what did I do wrong?".

The idea being that if your software is actually written to relevant standards, and actually handles things properly outside the golden path, then it should still work fine. If, however, you accidentally did something implementation-defined, or that only worked by coincidence, this system will break it.


There are tools that intentionally insert failures into the network streams of applications. A few of them are described here: https://medium.com/@docler/network-issues-simulation-how-to-...

The other linking/OS problems can probably be automated with some simple integration tests and a bunch of different docker containers to compile the code in. Should be possible to squeeze it into a CI/CD flow somewhere with some clever tricks.


I created such an environment for my unit-tests: Wrapping TCP sockets in a stream which only accepts 1 byte at a time in both directions and returns EAGAIN on every second read provides an easy way to make sure the code on top of the socket does perform all the correct retries.

That will most likely not help newcomers who write their code directly against the OS socket. But once you get a better understanding of the topic and start adding tests to your codebase it's rather easy to add.
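
A rough sketch of such a test wrapper in Python, covering only the read direction. `DribbleSocket` and `read_line` are hypothetical names; the wrapper is an in-memory stand-in rather than a real socket, and `BlockingIOError` is how Python surfaces EAGAIN:

```python
class DribbleSocket:
    """Test double: hands out 1 byte per recv() and raises BlockingIOError
    on every second call."""

    def __init__(self, data):
        self._data = data
        self._pos = 0
        self._calls = 0

    def recv(self, bufsize):
        self._calls += 1
        if self._calls % 2 == 0:
            raise BlockingIOError("simulated EAGAIN")
        chunk = self._data[self._pos:self._pos + 1]
        self._pos += len(chunk)
        return chunk  # b"" once the data is exhausted, like a closed peer


def read_line(sock):
    """Code under test: read up to and including a newline, retrying on EAGAIN."""
    buf = bytearray()
    while not buf.endswith(b"\n"):
        try:
            chunk = sock.recv(4096)
        except BlockingIOError:
            continue  # a real event loop would wait for readability here
        if not chunk:
            raise ConnectionError("connection closed mid-line")
        buf += chunk
    return bytes(buf)


sock = DribbleSocket(b"PING\nPONG\n")
print(read_line(sock), read_line(sock))
```

Code that assumes one recv() call equals one message fails immediately against a wrapper like this, which is the point of the exercise.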


I've done something similar of forcing the sends to be a single byte at a time. That's usually enough to find the obvious issues in parsing data.


One way to stop falling into this trap is by knowing what happens behind the send syscall: the application is not sending bytes down the wire, it just fills a buffer in the OS. Once in the buffer there is no boundary between bytes from different send calls. Same thing for receiving, in reverse.
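
A small Python illustration of that buffer model. It uses `socket.socketpair()` as a stand-in for a TCP connection (on Unix this is a local stream socket, not TCP, but the byte-stream semantics are the same), and the message content is made up:

```python
import socket

a, b = socket.socketpair()  # connected pair of stream sockets

# Three separate send calls on one end...
for part in (b"GET ", b"/index.html", b"\r\n"):
    a.sendall(part)
a.close()

# ...arrive as one undifferentiated byte sequence on the other end.
received = []
while True:
    chunk = b.recv(4096)
    if not chunk:
        break
    received.append(chunk)
b.close()

print("recv() calls that returned data:", len(received))
print("bytes:", b"".join(received))
```

The number of recv() calls that return data is up to the OS and need not match the three sends; only the concatenated bytes are guaranteed.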


For me, at least in this decade, it would have been better if I didn't know that. I put off learning websockets longer than I should have because I don't find packet boundaries fun to deal with, and my interest in websockets was mainly for fun. Then when I finally picked websockets up I was pleasantly surprised that message framing is built in.


> stream of bytes

I've always wondered: What's the best/de facto way to delimit this back into packets at the application level on the receiving end?

I would think the obvious approach would be to insert some magic word into the stream so that you can re-sync.

Or is this not an issue since you know that once you're connected, you'll never drop a single byte, therefore, the only way to get out of sync would be a program error?


You will never drop a single byte.

If you need some packet-oriented messaging, you could use something like http://jsonlines.org/ (i.e. JSON messages separated by newline characters), or https://github.com/protocolbuffers/protobuf if it's more performance-critical.
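
As a sketch of the newline-delimited JSON approach in Python: `JsonLineReader` is a hypothetical helper that buffers whatever bytes each read returns and only parses complete lines. It relies on the fact that `json.dumps` without `indent` never emits a raw newline, so the delimiter cannot appear inside a message:

```python
import json

class JsonLineReader:
    """Accumulates raw bytes and returns one parsed object per complete line."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data):
        self._buf += data
        messages = []
        while True:
            newline = self._buf.find(b"\n")
            if newline == -1:
                break  # incomplete line: keep it for the next feed()
            line = bytes(self._buf[:newline])
            del self._buf[:newline + 1]
            if line.strip():
                messages.append(json.loads(line))
        return messages

reader = JsonLineReader()
print(reader.feed(b'{"a": 1}\n{"b"'))  # second message is still incomplete
print(reader.feed(b': 2}\n'))
```

A production version would also want a cap on how large the buffer may grow before a newline shows up.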


Protobuf isn't self delimiting so you still have to have some extra packet wrapper around it to say the length.

I like zeromq to get to a packet based system.


The standard way is to include explicit information on the length of the message that is following.

For example if the message is x bytes long then you first send 'x' then you send the x bytes of the message.

Or your messages have a defined header that contains the length of the message payload.
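
A minimal Python sketch of length-prefixed framing. The 4-byte big-endian header is an assumption chosen for illustration (any fixed-width length field works), and `encode_frame` and `decode_frames` are hypothetical names:

```python
import struct

HEADER = struct.Struct("!I")  # 4-byte unsigned length, network byte order

def encode_frame(payload):
    return HEADER.pack(len(payload)) + payload

def decode_frames(buffer):
    """Split complete frames off the front of buffer.

    Returns (list_of_payloads, leftover_bytes)."""
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append(bytes(buffer[offset + HEADER.size:end]))
        offset = end
    return frames, bytes(buffer[offset:])

stream = encode_frame(b"hi") + encode_frame(b"world")
# Pretend the first read stopped partway through the second frame.
frames, rest = decode_frames(stream[:9])
print(frames, rest)
```

The receiver keeps `rest`, appends whatever the next read returns, and calls `decode_frames` again. A real receiver should also reject lengths above some sane maximum before buffering.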


It will never get out of sync, because TCP guarantees that the bytes will be delivered in the same order they were sent.

The best approach is typically put a length in front of every message. The good things about that approach are:

1. The receiver can allocate buffer that is exactly the size it needs to fit the message.

2. The receiver can check whether the message is too long before seeing the entire message.

The only disadvantage is that you have to know the length of all messages in advance.


Definitely be sure to check the length though. Imagine a mistaken client trying to send HTTP, but of course the first four bytes "HTTP" when interpreted as a 32-bit integer, whichever endian, is an absurdly large buffer.
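
To make the "HTTP" example concrete, here is a small Python sketch. `MAX_MESSAGE_SIZE` and `parse_length` are hypothetical, and the 1 MiB cap is an arbitrary value picked for illustration:

```python
import struct

MAX_MESSAGE_SIZE = 1 << 20  # hypothetical 1 MiB cap; pick one that fits your protocol

def parse_length(header):
    """Parse a 4-byte big-endian length prefix and reject absurd values."""
    (length,) = struct.unpack("!I", header)
    if length > MAX_MESSAGE_SIZE:
        raise ValueError(f"declared length {length} exceeds {MAX_MESSAGE_SIZE}")
    return length

# What the four bytes "HTTP" look like when read as a 32-bit length:
print(int.from_bytes(b"HTTP", "big"))     # 1213486160
print(int.from_bytes(b"HTTP", "little"))  # 1347703880
```

Either byte order yields a "length" above one gigabyte, so checking the value before allocating a buffer is what keeps a confused client from exhausting memory.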


I mean, at this point you're effectively defining a new network protocol (or, you will be shortly once you implement ways to work around all the other issues you're going to run into). I'd go all-in from the start and start every packet with a magic string/byte sequence of your own, a length, and probably a version code just to make it extensible.

Or see if there's an existing protocol you can abuse for what you want. If it's transactional, you get a pretty big ecosystem of battle-tested clients/servers/proxies/etc if you use HTTP.


A 16-bit length (64k max message size) is usually sufficient, or even 24-bit (16M max) if you really feel the need, but 32 bits is far more than should be needed for parsing messages in memory; it would be fine for a streaming application, however (in which case a 64-bit length wouldn't be a bad idea either.)


Good advice. I was actually referring to this: https://rachelbythebay.com/w/2016/02/21/malloc/ I read this article a long time ago, and yet I made a similar mistake in my own code.



A preamble of chunk length plus 1 bit for an end-of-message indicator. If you only do chunk length, you will eventually find you can't stream but want to.

or just use http.
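
One possible shape for the chunk-length-plus-end-bit preamble, sketched in Python. The layout (top bit of a 32-bit big-endian header as the end-of-message flag, low 31 bits as the chunk length) is an assumption made for this example, as are the names `encode_chunk` and `reassemble`:

```python
import struct

END_FLAG = 0x80000000     # top bit of the 32-bit header marks the last chunk
LENGTH_MASK = 0x7FFFFFFF  # low 31 bits carry the chunk length

def encode_chunk(data, last):
    if len(data) > LENGTH_MASK:
        raise ValueError("chunk too large for a 31-bit length")
    header = len(data) | (END_FLAG if last else 0)
    return struct.pack("!I", header) + data

def reassemble(stream):
    """Rebuild complete messages from a byte string made of whole chunks."""
    messages, current, offset = [], bytearray(), 0
    while offset < len(stream):
        (header,) = struct.unpack_from("!I", stream, offset)
        length = header & LENGTH_MASK
        offset += 4
        current += stream[offset:offset + length]
        offset += length
        if header & END_FLAG:
            messages.append(bytes(current))
            current = bytearray()
    return messages

wire = encode_chunk(b"abc", False) + encode_chunk(b"def", True)
print(reassemble(wire))
```

With this layout the sender can start transmitting a message before it knows the total length, which a single up-front length prefix does not allow.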


I often tell people to assume the TCP stack's buffering is arbitrary & capricious and will do the most inconvenient thing for your code. That can mean either a) dribbling data in one byte at a time per recv() call, or b) buffering multiple megabytes and returning it all in a single recv() call.


If you do want that, then SCTP will provide it.


if you turn off Nagle's algorithm, it gets closer to this though


No, it has nothing to do with that.


I suppose I should say something.

The thing to turn off is delayed ACKs. See "TCP_QUICKACK". Delayed ACKs were a feature which is only useful for things like Telnet, where the payload in each packet is one character when the user is typing. The fixed timer for delayed ACKs is for keyboard typing speeds, and for networks so slow that human typing could congest them. There's a reasonably good explanation here.[1]

As others said above, TCP is not a message protocol. It's a stream protocol. If you're sending messages over a stream, you need something that's reading data from the stream, and when it has a full message, it sends that off to be processed. There is no set of TCP options which will reliably cause one write at the sending end to result in one read at the receiving end. If there were, it would be inefficient for small messages and would fail for large ones.

[1] https://www.extrahop.com/company/blog/2016/tcp-nodelay-nagle...


Your terminology is a little off. TCP does not provide anything for the application layer as it is transport layer. The application layer rides on top of that. Examples of transport protocols are TCP and UDP while application protocols are things like http, ssh, irc, and all those things your applications use.

The network layer on which the transport layer rides is packet switched. The TCP uses segments with each segment having its own header and sequence numbers. Streams are just a series of segments populating across a single established handshake without a prior defined termination segment.


I didn't mean to talk about OSI terminology. It was more about: [user-space] applications which use the TCP/IP stack do not observe packet boundaries, whereas the Kernel certainly does. Obviously this is a bit ambiguous, and you can even get packet boundaries in user-space by running a TCP stack there. But for most TCP/IP usages it holds true.


> It was more about: [user-space] applications which use the TCP/IP stack do not observe packet boundaries

That is still a bit imprecise. Userland applications won't directly see TCP as they are just looking at an application protocol. Typically it's the OS that packages and unpacks the application protocol data into a TCP segment, so of course the userland application won't see it since it's not managing that part of the communication.

https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...

There are some exceptions where some application platforms allow developers to write custom TCP protocols, such as Node.js, but these exceptions generally apply to network services and don't commonly apply to the end user application experience.

https://nodejs.org/dist/latest-v14.x/docs/api/net.html#net_n...


When I first started working with computer networks, I just thought of TCP/IP as "low-level stuff" and I focused instead on the higher level stuff. After I kept running into incomprehensible errors seemingly over and over again, I finally broke down and picked up a copy of Richard Stevens' "TCP/IP Illustrated". Hands down, the best investment in time I've ever made. If you deal with distributed systems (hint, you do), you need to understand how they actually work.


A lot of benefit that I add in my day job is bridging the gap between high level folks (OS people) and what's-actually-happening-on-the-wire.

While so, so, so much of this is rarely the network, knowing how to look under the covers and see what's actually hitting the wire (versus what the API call asked for) leads to far, far faster resolution of problems.

It's frustrating to me that so many people see this as a mystery of "knowing networking" when it's really just basic protocol analysis.


> I focused instead on the higher level stuff.

That's fine. But every developer should have a basic understanding of networking. But that can also be dangerous.

I still have people in the company who swear you can't have more than 65k incoming connections to a machine, because "that's how many ports there are". Don't get me started on all the misconceptions on TCP_TW_REUSE and TCP_TW_RECYCLE. Lengthy discussions because apparently "TIME_WAIT is bad and uses up ports!" (see also, 65k). For context, these are servers, with multiple clients, from different source IPs.
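
The reason the "65k incoming connections" claim does not hold can be checked with a few lines of Python. This sketch assumes loopback networking is available and uses five connections only to keep it short. Every accepted connection shares the server's single listening port and is distinguished by the client's address and port:

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(16)
server_port = server.getsockname()[1]

clients, accepted = [], []
for _ in range(5):
    c = socket.create_connection(("127.0.0.1", server_port))
    conn, _addr = server.accept()
    clients.append(c)
    accepted.append(conn)

# Every accepted connection uses the same local (server) port...
local_ports = {conn.getsockname()[1] for conn in accepted}
# ...and is told apart by the remote (client) address and port.
remote_endpoints = {conn.getpeername() for conn in accepted}
print(local_ports, len(remote_endpoints))

for s in clients + accepted + [server]:
    s.close()
```

Incoming connections therefore do not consume server-side ports; the port count limits outgoing connections from one source address to one destination address and port.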


Great I thought I'll take a look. Three volumes each over 1000 pages? Any other suggestions? Did you mean all 3 books?


Hehe - I did end up reading all 3, and enjoyed them all, but I'd say I got 90% of the value from volume 1. Volume 2 walks through the BSD implementation of TCP/IP, which is fascinating, but way more detail than you'd ever need to know, and volume 3 goes off into some esoteric topics that seemed promising at the time but mostly ended up being abandoned (along with a brief discussion of HTTP as it was around the 90's).

If you're going to read it, though, find a used copy of the original Stevens' first edition, not that terrible desecrated second edition.


What is wrong with the second edition?


It was rewritten by a different author (the original author, Richard Stevens, died in a car accident in the late 90's). I guess the new guy tried his best, but he just doesn't have the writing skill that Stevens had.


Just read the first one- it reads more like a novel than a textbook IMHO, though I may be biased- I have always been fascinated by networks and when I was coming of age this was the "high tech" of the time- I used to read RFCs for fun (I highly recommend this as well if you want to dig a little deeper- Jon Postel's are great reads).

This is one of the best written textbooks, if not the best, I have ever read.


No. They are huge because they are highly detailed and nitpicky. I suggest the following for understanding Networking and TCP/IP;

* An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network by S.Keshav

* TCP/IP Illustrated Vol -I by Richard Stevens (any edition will do).

* Effective TCP/IP programming by Jon Snader.


Depressing, isn't it? There are so many books that I probably should read about almost any number of topics that, in my work, I "touch on".


I have books that I purchased decades ago like this that are still languishing on my bookshelf.


Lol I'll probably buy it and put on my bookshelf with all the others.


Not OP but the first volume is the one that's cited frequently and the only one of the series I believe to have a second edition.


Is this still a recommended book? I was looking for a good TCP/IP reference book, but many seemed rather old. Of course, I imagine protocols like that don't get modified too much.


Well, it's definitely out of date: the first edition predates even IPv6 (and the second edition is awful, don't buy it). Still, the way it's laid out is so well done that once you understand how TCP/IP worked in the mid-90's, you'll easily be able to work out the evolution of it since on your own. It's a shame there's no better up-to-date book, but Stevens was one-of-a-kind. The Comer book isn't bad (but it's not really good, either), and the Kurose & Ross book is less not bad (and more not good), but even though both are more modern, I'd still recommend TCP/IP Illustrated to really understand what's going on in the network stack.


I just finished Kurose’ “Computer Networking: A Top-Down Approach” and I’d recommend it.


Something worth knowing: TCP checksums should not be relied upon. If you aren't using TLS, your application needs to do its own checksums. A 16 bit 1's complement sum over a packet is not sufficient, especially given modern switch ASIC design. They take really fast signals off the fiber and turn them into slow moving, parallel signals within the ASIC. Often these signal buses are nice even numbers of bytes wide, 204 in this example. When there is something flawed in the chip in the slower moving internal pipeline, it will hit the same bit position every 204 bytes. If it is borked enough to flip two bits within the packet, those flips will be in the same bit position a multiple of 204 bytes away, meaning the same bit position in the 1's complement checksum. If one flips up, and the other flips down, it passes! In my case it ended up corrupting data in a BGP session's TCP stream, executing the world's most confusing route hijack in our network.
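
A minimal sketch of that failure mode, assuming a hypothetical 512-byte payload and covering only the 16-bit one's complement sum (a real TCP checksum also covers the header and a pseudo-header). Two flips of the same bit, 204 bytes apart, one up and one down, leave the sum unchanged, while a CRC32 added by the application notices:

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


STRIDE = 204  # bus width in bytes, taken from the comment above

# Hypothetical payload: byte 10 has bit 3 clear, byte 10 + STRIDE has bit 3 set.
payload = bytearray(512)
payload[10 + STRIDE] = 0x08

corrupted = bytearray(payload)
corrupted[10] ^= 0x08           # bit flips up (0 -> 1)
corrupted[10 + STRIDE] ^= 0x08  # same bit, 204 bytes later, flips down (1 -> 0)

same_checksum = internet_checksum(bytes(payload)) == internet_checksum(bytes(corrupted))
crc_differs = zlib.crc32(bytes(payload)) != zlib.crc32(bytes(corrupted))
```

Here `same_checksum` and `crc_differs` are both true: the data changed, the 16-bit sum did not, and the CRC caught it.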


One of our favorite troubleshooting stories was "3% of TLS connections fail on this particular frontend IP address. HTTP works."

Turned out our cloud provider's networking gear had a bug that disabled ECC and there was a bit flip happening. Convincing the provider's support that we had found faulty hardware in their datacenter was an interesting journey.


I signed up to thank you for this insight. It's the kind of thing you would eventually find out on your own too but having read it somewhere in words saves time when something wonky happens somewhere and the data supports the conclusion. There are people who would swear by the chips/ hardware not being an issue (which is usually correct but not always).

I wonder, why would somebody use 204 bytes -> 1632 bits, why not less (why not more for e.g. jumbo frames). Is there some data sheet / source that you would recommend?


Ah... Network hardware troubleshooting. Truly, it separates the frustration tolerant from the frustration intolerant. If not because of the challenge of having to take into account things most programmers treat as invisible, then because it usually involves at least one conversation with another network provider to check their stuff, which generally leads to the most passive-aggressive dodging of blame until you get that one operator whose one goal in life is to keep the Internet running correctly.


A very important point every developer should know: a successful write(2) syscall does not guarantee that the data was received by the remote application. TCP is described as a protocol that guarantees packet delivery, and this is often misleading.

A write(2) syscall that returns without an error means only that the data has been placed in an OS kernel buffer. The kernel will then try to send it to the remote host. If a couple of packets are lost, that's not a problem: the kernel will retry a few times. But if power is lost shortly after the write, the data may never hit the wire. There is also the possibility that the network link stays broken for a long time. The OS will retry, but only for a limited time, and then it will give up. The remote host can also crash at any time before the remote application actually reads the data.

So if you need reliable delivery, you need acknowledgements at the application protocol level, despite the fact that TCP already has acknowledgements.
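
One way this could look, as a rough sketch: the length-prefixed framing, the one-byte `ACK` value, and the helper names are all made up for illustration, and a local socket pair stands in for a real TCP connection. The point is that the sender treats a message as delivered only once the remote application has answered, not when the send call returns.

```python
import socket
import struct
import threading

ACK = b"\x06"  # hypothetical one-byte application-level acknowledgement


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because TCP is a byte stream."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def send_with_ack(sock: socket.socket, payload: bytes, timeout: float = 5.0) -> None:
    """Send a length-prefixed message and block until the peer confirms it."""
    sock.settimeout(timeout)
    sock.sendall(struct.pack("!I", len(payload)) + payload)
    # A timeout exception here means delivery is unknown, not that it failed.
    if recv_exact(sock, 1) != ACK:
        raise ConnectionError("unexpected reply instead of ACK")


def receive_and_ack(sock: socket.socket) -> bytes:
    """Read one message, process it, and only then acknowledge it."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    payload = recv_exact(sock, length)
    # ... persist or process the payload here, before acknowledging ...
    sock.sendall(ACK)
    return payload


# Demo over a local socket pair standing in for a real TCP connection.
client, server = socket.socketpair()
received = []
worker = threading.Thread(target=lambda: received.append(receive_and_ack(server)))
worker.start()
send_with_ack(client, b"hello")
worker.join()
client.close()
server.close()
```

If the acknowledgement never arrives, the sender cannot tell whether the message was processed, so a real protocol would also need retries and some form of de-duplication on the receiving side.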


When I used to do networking tech support for some networking equipment, the guys who sat next to me supported the load balancer product.

I swear a high percentage of their calls were questions about how the load balancer wasn't working and sending all the traffic to one server and then after some investigation we discover all traffic is in fact directed to that lone server... because the client code has the IP of that server hard coded. A tedious discussion would then ensue about how that is not how to do it.

The next week? Same angry call...

Partly that is what inspired my decision to change careers. "Man if these developers can't figure out basic networking, maybe I could be a developer...?"


I have the same tedious discussion every time a box is replaced. It goes along the lines of "no, changing a box will not make your application behave differently, fix your damn thing".


Just enough information to be dangerous? Article attributes behaviors of loss-based congestion control schemes like Reno and Cubic to TCP itself. In practice, the congestion control scheme is not really part of the protocol (there is, for example, BBR). There's also ECN, showing that loss is not the only way to discover congestion.


the RTT discussion was a little misleading. it's true that slow start rates are entirely dependent on RTT...but eventually the sawtooth should reach the same steady state.

there is work that shows that higher RTT connections do statistically suffer a smaller fair share, but that's a subtler if related issue. actually, I really wish the author had shown the sawtooth.


At some point, with increasing bandwidth and increasing RTT, you end up with your effective bandwidth capped by receive windows and/or send buffers. Cross country high def video might not be quite enough to hit that, but intercontinental high def video would be.

Being closer means faster initial 'slow start', but also faster 'slow start' on congestion, which is why you get a bigger share.


sure. but that's really just a window being under the bandwidth delay product. the discussion makes it seem like you suffer an outright linear performance hit


> But what about large files, such as videos? Surely there is a latency penalty for receiving the first byte, but shouldn’t it be smooth sailing after that?

So the article's (unstated) conclusion seems to be that, as long as there isn't network congestion, it is smooth sailing after that.

But that congestion reduces bandwidth. But of course, that applies just as much to a national backbone as to last-mile.

So I'm curious: where does most packet loss occur? Is it last-mile, at your ISP, or along major backbones? Because that has major implications as to whether caching video content closer to users actually results in higher-quality video (e.g. supporting 1080p instead of 720p) or not.


> where does most packet loss occur

Here's an interesting paper from SIGCOMM (it won best paper at the conference in 2018, FWIW) that attempts to figure out what links are congested without direct access to ISP networks: https://www.caida.org/publications/papers/2018/inferring_per...


I've been debugging packet loss issues lately and they all occurred in the datacenter. Backbones and network exchanges already move so much traffic that things like everyone working remotely only increase traffic by a few percent, and they have a lot of spare capacity in order to handle spikes, such as when a new game is released and everyone downloads it at the same time.

So yes it would really help to have more decentralisation. Like putting the content closer to the user.


Why doesn't the congestion control part of TCP prevent buffer bloat[1]? Is it because ISP throttling of the internet connection doesn't touch the TCP packets themselves?

I recently started doing off-site backups, which requires my entire internet uplink to be used for uploading said backups for about a week at a time. The internet basically becomes unusable because all the packets end up in a buffer on the router and latency spikes to 5000ms.

[1] https://www.bufferbloat.net/projects/bloat/wiki/What_can_I_d...


Your router is holding ("buffering") packets in the hopes that they can be sent soon. Your measurements indicate that the router is "bloated", holding about five seconds (5,000 ms) worth of data.

This gives the sending TCP algorithm the wrong impression. It's waiting to hear about a dropped packet to indicate that there's congestion. When your router holds on to those packets (instead of dropping them), the TCP algorithm doesn't get any feedback, so it keeps shoveling data into the connection.

This leads to the bad state you're seeing. And that's where the advice on "What can I do about Bufferbloat?" comes in.

There's no benefit to having more than one packet buffered by the router. (Hanging on to more than one packet per connection only causes the latency/lag you're seeing.)

There are routers that actually check the time the router has held packets. If packets have been queued for "too long", the router discards them immediately, giving the vital feedback to the sending TCP. Those routers use the technique known as SQM (Smart Queue Management) and the fq_codel, cake, PIE algorithms to keep the queues within the router short - typically less than 5 msec.
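
To make the "check how long a packet has been queued" idea concrete, here is a deliberately simplified sketch. It is not CoDel, PIE, or CAKE (the real algorithms only drop when delay stays high over an interval and pace their drops); the class name and the 5 ms threshold are just illustrative assumptions.

```python
from collections import deque


class SojournLimitedQueue:
    """FIFO that drops packets which waited longer than max_sojourn seconds."""

    def __init__(self, max_sojourn: float = 0.005):
        self.max_sojourn = max_sojourn
        self._queue = deque()  # entries are (enqueue_time, packet)
        self.dropped = 0

    def enqueue(self, packet, now: float) -> None:
        self._queue.append((now, packet))

    def dequeue(self, now: float):
        """Return the next packet that is still fresh, or None if there is none."""
        while self._queue:
            enqueued_at, packet = self._queue.popleft()
            if now - enqueued_at <= self.max_sojourn:
                return packet
            self.dropped += 1  # stale: drop it so the sender sees loss
        return None


q = SojournLimitedQueue(max_sojourn=0.005)
q.enqueue("old", now=0.000)
q.enqueue("fresh", now=0.008)
result = q.dequeue(now=0.010)  # "old" waited 10 ms and is dropped
```

The drop is the feedback: a loss-based sender backs off as soon as it notices the missing packet, instead of continuing to fill the queue.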

To solve your problem, investigate getting a router that implements one of those SQM algorithms. They're listed on the "What can I do..." page. I am a fan of OpenWrt (use it at home), but have installed a bunch of IQrouters and Ubiquiti devices for friends.


TCP has its own buffers too. In a media application I had to use UDP because I could know how deep the local transmit buffering went. TCP just swallows the packets and maybe sends them, maybe buffers them. Adding to the problem.


If there is a huge FIFO queue on your router, the rate-finding algorithms associated with TCP will be forced to conclude that the RTT to your site is enormous. They may try to open the window to compensate, but here's a fun fact: most operating system default settings are insufficient to utilize very high bandwidth-delay products. If you want to send a 1gbps flow across an 80ms distance on Linux, you'll need to change some parameters with sysctl before it will work. If your apparent RTT is 5000ms, the flow you can get will be reduced in proportion.

In any case, the solution to bufferbloat is queue discipline, not congestion control.


Up to what speeds/latencies are the default sysctl parameters alright? Is there any easy way to know whether you are getting hit by this? Nowadays many people are getting 1 Gbps links at home!

What do you mean by queue discipline?


You know, the worst part is that Linux sets the maximum receive window size at boot time depending on how much memory the system contains, ensuring that it's never quite right. On this machine, with 32GB of main memory, it defaults to 6291456 bytes.
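
As a back-of-the-envelope check on the numbers in this subthread, a sketch that assumes the sender can have at most one full window in flight per round trip and that the whole 6291456 bytes is usable as window (in practice the advertised window is somewhat smaller than the buffer):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_s


needed = bdp_bytes(1e9, 0.080)                 # 1 Gbps across 80 ms
capped = window_limited_bps(6_291_456, 0.080)  # the 6291456-byte default above
bloated = window_limited_bps(6_291_456, 5.0)   # same window, 5000 ms apparent RTT
```

That works out to about 10 MB needed in flight for 1 Gbps at 80 ms, a ceiling of roughly 629 Mbit/s with the default window, and roughly 10 Mbit/s if the apparent RTT balloons to 5 seconds.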


I see, thanks!

What about the queue discipline?


If you face a choice of what frame to put on the wire at any moment, the queue discipline makes that choice. The easiest policy is to simply send the oldest frame, but this is also the worst policy.


Ah, so the eviction/priority algorithm. Thanks!


Most of the common TCP congestion control algorithms (Reno, Cubic) are loss-based: they try to send more and more data until the link no longer can buffer all of the packets, and drops some of them. Naturally, this approach requires the buffer to fill up, causing the latency to spike.
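
A toy illustration of that loss-based behaviour, assuming a Reno-like additive-increase/multiplicative-decrease rule and nothing else (no slow start, no fast recovery, no CUBIC curve). The `capacity` parameter is a stand-in for "packets the link plus its buffer can hold":

```python
def aimd_trace(capacity: int, rounds: int, start: int = 1) -> list:
    """Congestion window per RTT: +1 each round, halved when a loss occurs.

    capacity is the number of packets the path plus its buffer can hold;
    exceeding it stands in for a dropped packet.
    """
    cwnd = start
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity:
            cwnd = max(1, cwnd // 2)  # multiplicative decrease after loss
        else:
            cwnd += 1                 # additive increase per round trip
    return trace


trace = aimd_trace(capacity=8, rounds=12, start=6)
# trace == [6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5]
```

The window only backs off after it has overshot the capacity, which is exactly the moment the buffer is full. The larger the buffer, the longer the queue (and the latency) grows before that happens.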

There are algorithms that try to use increased delay as a signal that the link is full. This approach has multiple problems, one of which is that delay can be really noisy on wireless networks; another is that if you have a loss-based and a delay-based connection sharing the same link, the delay-based one will get much less than a fair share of its bandwidth. People have been trying to make an algorithm that both coexists with Reno/CUBIC and does not induce bufferbloat for the last 25 years or so, and there's been some progress, but none of it has reached the point where it could be used as a default congestion control for all operating systems.

The problem of "I have files to transfer in background, but I want my connection to yield to more important traffic" can actually be solved using a special congestion control algorithm called LEDBAT [1]; it's used by Apple for things like software updates, and BitTorrent uses it too. Unfortunately, I think only Apple implements it in its TCP stack, so anyone who wants to do that would have to roll their own thing using UDP.

[1] https://en.wikipedia.org/wiki/LEDBAT


> Why doesn't the congestion control part of TCP prevent buffer bloat[1]? Is it because ISP throttling of the internet connection doesn't touch the TCP packets themselves?

Most of the congestion control algorithms use packet loss as the only indicator of congestion. In a network with oversized buffers, congestion will result in delay and not packet loss. If the delay gets large enough, receive and congestion windows will restrict the effective bandwidth, but the latency at that point is terrible.

There are some alternate congestion control algorithms which do use latency as a signal, but they aren't universally available, and may not be a good fit for all flows.

For your backup use case, probably the simplest thing is to reduce the send buffers for the backup sender process. Although allowing packets to drop instead of queue at your router/modem would really be best, often that's difficult to achieve.
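
If the backup tool is something you can modify, shrinking the send buffer is a single socket option. A minimal sketch, where the 64 KiB figure is only a placeholder to tune against your uplink:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
requested = 64 * 1024  # hypothetical size; tune to your uplink
# Set it before connect() so it applies from the start of the connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
# The kernel may round or adjust the value (Linux, for example, doubles it).
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
sock.close()
```

Because the send buffer bounds how much unacknowledged data the connection can have outstanding, a small one also limits how much of it can pile up in the router's queue.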


A major reason explicit congestion notification is not used is firewalls that block anything that isn't bog standard TCP or UDP. Some even ban odd combinations of flags. There are enough of these to make ECN useless.


A router that is willing to buffer 5 seconds worth of packets probably wasn't going to mark for congestion and drop either.

Note also, Apple is using MP-TCP and ECN in iOS, and the world didn't stop. It might not work everywhere, and I don't praise Apple lightly, but there's a pretty clear path to using things like this. Send a syn with it enabled, wait a bit, and send one with it disabled. Keep track of networks where it doesn't work and stop trying it there. If you have leverage, yell at people to not do dumb things, otherwise, let them figure out why expensive things work better on their competitors' networks. You can't rely on being able to use these things, but you can use them for progressive enhancement.


This is a fundamental problem on the internet. RAM is so cheap that every device has buffers that are too big and don’t allow for proper TCP back pressure. Eric Raymond gave a talk on this a few years ago. He was going to distribute a lot of small embedded devices around the world to measure this to try to address it. I’m curious what happened to that effort.


Big buffers that can be filled fast trick congestion control algorithms into thinking your wire is really fast. The point of the buffer is to be transparent to the transmitting ends, so they see the packets going out at lightning speed and assume it's because they're actually going that fast, and not just piled into a buffer that fast.


> Why doesn't the congestion control part of TCP prevent buffer bloat[1]?

It can. Enable BBR + fq/fq_codel on the box in question and CAKE on your router.


The article just explains some details but not the main concept behind TCP i.e. TCP is a connection and stream based protocol of bytes. All it takes is to explain the idea of Counting bytes with a Sliding Window and almost everybody will "get" TCP. Never talk about packets.


The RFC is useful as well.

https://tools.ietf.org/html/rfc793

TCP state machine diagrams can be useful too.


The single biggest TCP issue I have had to debug and fix numerous times is connection reuse not being done properly, leading to TCP port exhaustion and seemingly random delays that cause timeout failures in higher-level protocols, usually HTTP. This one single issue has taken down multi-billion dollar production systems.

So, I hope people learn to check their http client/server implementations to have proper connection handling. Client should have a thoughtfully sized bounded connection pool with reasonably large idle timeout. It shouldn't close the connection after every application request (say, http request). There shouldn't be sockets in TIME_WAIT state accumulating at the client end.
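
For what "bounded pool, reuse instead of close" can mean in code, here is a rough sketch. `BoundedPool` and `fake_connect` are invented for illustration; a real client would plug in something like `socket.create_connection` as the factory and would also need idle timeouts and handling for connections that have gone stale.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most max_size connections and reuses them after release."""

    def __init__(self, factory, max_size: int):
        self._factory = factory         # callable that opens a new connection
        self._idle = queue.LifoQueue()  # most recently used connection first
        self._slots = queue.Queue()     # one token per connection we may create
        for _ in range(max_size):
            self._slots.put(None)

    def _acquire(self, timeout: float):
        try:
            return self._idle.get_nowait()  # reuse an idle connection if any
        except queue.Empty:
            pass
        try:
            self._slots.get_nowait()        # otherwise claim a slot, if under the cap
        except queue.Empty:
            # At the cap: wait for a release (raises queue.Empty on timeout).
            return self._idle.get(timeout=timeout)
        try:
            return self._factory()
        except Exception:
            self._slots.put(None)           # give the slot back if opening failed
            raise

    @contextmanager
    def connection(self, timeout: float = 5.0):
        conn = self._acquire(timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)            # keep it open for the next request


opened = []


def fake_connect():
    """Stand-in for a real connect call; records how many were opened."""
    opened.append(object())
    return opened[-1]


pool = BoundedPool(fake_connect, max_size=2)
for _ in range(10):
    with pool.connection() as conn:
        pass  # send one request on conn
```

Ten sequential requests here open exactly one connection, so nothing accumulates in TIME_WAIT on the client side.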

Server should accept thoughtfully limited number of connections per client. Server should never close the connection except when it is shutting down.

There should be tcp keepalive messages to keep the connection alive with intermediate hop stateful firewalls (connection tracking table entries in firewalls expire when the connection is idle for too long) and to detect stale connections and re-establish them.

All of these things can be verified by analyzing a packet capture. You can get a manageably sized pcap file by filtering on client/server ip/port-range pairs for at least 330 seconds.

Knowing the tools to understand and debug TCP issues is an essential skill. The socket statistics command (ss) and wireshark/tshark with Lua scripting are super useful. Knowing higher-level application protocols like TLS and HTTP is essential too.


The biggest issue with TCP is that it can randomly freeze and you have to restart it in pretty much any network. You CAN NOT rely on socket closing on any side, you have to maintain connection by yourself.

I am super puzzled why something like websockets is not solving this problem. A simple heartbeat could solve it, but no one implements it.


You can use keepalives at the protocol (TCP) level.


In 99% of cases you don't have an api to do so.


That's not the protocol's fault. Besides, the OS usually does this, not your program.

SO_KEEPALIVE is available on all relevant OS.


Default keepalive time of two hours is also quite long, in some situations, where you would expect to get the notification just a bit faster.
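
For reference, a sketch of enabling keepalive and shortening the timers per socket. The timer values are arbitrary examples, and the `TCP_KEEPIDLE` / `TCP_KEEPINTVL` / `TCP_KEEPCNT` options are not exposed on every platform, hence the guard:

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive and, where the platform allows, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are not available everywhere, so guard them.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With these example values a dead peer would be noticed after roughly 60 + 5 × 10 seconds of silence rather than after two hours.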


Yes, default values are not sane for most connections.


Good luck accessing this from a browser.


Browsers can do this.


Aha, show me how i can access low level TCP stuff from JS. Browsers could do this, but they would not.


There is no need for that in JS. Browser (and any other program) can open connections with SO_KEEPALIVE.


I bet they could not. It would ruin the whole internet for little gain, since you still need application-level pings anyway because the issue could be a broken server, not something at the TCP level.


https://chromium.googlesource.com/chromium/src/+/master/net/...

I don't see how this breaks the whole Internet. Yes the server might not behave, but that's not a TCP problem.


i bet you missed the times when uTorrent turned on uTP and almost everywhere in the world ISP hardware just failed because of x100 traffic.


Websockets certainly has simple heartbeats. It has Ping and Pong frames defined in the spec: https://html.spec.whatwg.org/multipage/web-sockets.html#ping...

The libraries I use tend not to enable it by default, but they are generally implemented.


Yes, they are just not implemented anywhere.

Even MDN is not mentioning it: https://developer.mozilla.org/en-US/docs/Web/API/WebSocket


MDN doesn't mention it because MDN is focused on web developers writing code for browsers.

Pings are sent from the server to the browser. Browsers are supposed to automatically send Pongs back to the server. So a client implementation has no reason to know about Pings or Pongs—it's handled by the server or the browser.


No one cares about server side pings. On server it is not an issue at all - feel free to hold dead connections for a while.

Networking issues are always on customer side, you have to detect it from the app otherwise it will literally freeze.


I loved this area when I was at university. At the end of the Computer Networking course I built a project based on https://www.isi.edu/nsnam/ns/

It was really fun, especially because it lets you better understand all the networking layers.

I ran some tests on network topology to minimize lost TCP packets as much as possible under different traffic patterns.


This reminds me of an issue I had to debug over a decade ago. Our product had its own protocol written atop TCP, but its handshake was written in a way such that it was much slower than it should have been due to delays caused by the Nagle algorithm.

Turning on TCP_NODELAY was a quick-n-dirty fix, but the real fix was to rewrite the handshake to be more compatible with the inner workings of TCP.
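
Both fixes in miniature, as a sketch only. The handshake fields are invented, and "coalesce the small writes into one send" is just one common way to be friendlier to Nagle; the actual rewrite of that product may well have looked different.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# The quick fix: disable Nagle's algorithm so small writes go out immediately.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay_on = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()


def build_handshake(parts):
    """The structural fix: join small fields so they leave in one send call."""
    return b"".join(parts)


# Hypothetical handshake fields; one sendall(message) replaces three small writes.
message = build_handshake([b"HELLO ", b"v1 ", b"client-42\n"])
```

Several small writes in a row, followed by waiting for a reply, is the pattern that interacts badly with Nagle plus delayed ACKs. A single larger write avoids it without changing any socket options.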


Back in the day I remember learning about TCP and other systems subjects through Beej Guides! [1]. Is there other material recommended to review network programming? I know about the Stevens' books but they look so bulky and dry...

1: https://beej.us/guide/


EDIT: Sure wish I could delete this post.

Wait, this isn't TCP, this is protocol level above TCP, right? TCP doesn't shape traffic by itself through rate limiting and congestion analysis, does it? I thought the layer above it used TCP to send/receive the buffer size, and that has nothing to do with TCP.

Am I wrong?


TCP definitely does congestion control itself: https://en.wikipedia.org/wiki/TCP_congestion_control


You are wrong! Obviously the application layer on top of TCP could be the bottleneck, but TCP itself has mechanisms to ensure traffic is flowing as fast and as smoothly as possible. Look up "TCP Flow control" and "TCP Congestion Control"


So the video example here, does it indicate that UDP is more suited for video transmission?

Does http3 fix this?


Unrelated: does anybody know the tool used to make those diagrams ?


It looks like excalidraw.


Yeah, it's so good and it's open-source !



