
What developers should know about TCP - todsacerdoti
https://robertovitillo.com/what-every-developer-should-know-about-tcp/
======
Matthias247
What they mostly should know: TCP provides a bidirectional stream of bytes on
the application level. It does NOT provide a stream of packets.

That means whatever you pass to a send() call is not necessarily the same
amount of data the receiver will observe in a single read() call. You might
receive more or fewer bytes, since the transport layer is free to buffer and
to fragment data.

I have seen the assumption that TCP preserves packet boundaries at the
application level made far too often - typically in Stack Overflow questions
like: "I don't receive all the data. Is my OS/library broken?"
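A quick Python sketch of the consequence: the reader has to loop, and any message boundaries must be imposed by the application (the 4-byte length prefix below is just a common convention used for illustration; TCP itself provides none):

```python
import socket
import struct

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Loop until exactly n bytes arrive; a single recv() may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Application-level framing: 4-byte big-endian length, then the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock: socket.socket) -> bytes:
    (n,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, n)

# Demo over a local socket pair: two sends, two well-framed reads,
# regardless of how the kernel chunks the byte stream in between.
a, b = socket.socketpair()
send_msg(a, b"hello")
send_msg(a, b"world")
assert recv_msg(b) == b"hello"
assert recv_msg(b) == b"world"
```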

~~~
outworlder
> What they mostly should know: TCP provides a bidirectional stream of bytes
> on the application level. It does NOT provide a stream of packets.

> That means whatever you pass to a send() call is not necessarily the same
> amount of data the receiver will observe in a single read() call.

Yes, this. For god's sake, listen to them.

I had to fight a coworker on this. I had quickly created some client code just
to validate that the server was working. Due to some quirk, all the messages
were arriving in full in every read call. He told me to ship it.

I said no! "I need to check if there's more data and if so add a loop to read
again" "But it is working, release it". That went on for a while, to no avail.
Wouldn't look at documentation either.

Eventually he had to leave for the day, and I took the time to implement it
correctly.

I started including basic TCP questions on interviews. Not many people even
get past the TCP handshake (if they even know about that).

~~~
SilasX
Stupid question: why would you be writing code that works at the level of TCP?
Don't you usually want to use the OS's (or some popular library's) TCP
software stack?

~~~
jfkebwjsbx
It seems to me GP is talking about using TCP, not implementing it.

~~~
SilasX
Right, I mean I thought that the TCP protocol implementation itself handled
that issue, and your calls to such a library abstracted away from that.

~~~
jacobolus
No, it does not, by design. When you use any standard TCP implementation the
abstraction provided at both ends is just a stream of bytes. The guarantee TCP
makes is that the bytes received at one end will be in the same order as the
bytes that were sent at the other end.

If you want to use TCP to implement some (higher-level) message-based
protocol, you need to parse those out for yourself.

~~~
SilasX
The original question didn't refer to such issues of parsing, just checking
for whether the full message was received, which, I would think, would be
handled by the TCP library read() (or whatever) call. It sounded like the OP
(outworlder) was delving into lower-level TCP details that should have been
abstracted away -- my thinking is that the TCP caller shouldn't have to
concern itself with details about checking whether the full message has been
received, at least not as a separate step. That is, it just wouldn't return
anything until the full message is received, or would include some data
structure that indicates it's not complete. Does that make sense?

Edit: On second thought, I guess OP meant that all of the results were coming
back "complete" which doesn't obviate the issue of needing to do a check that
handles the "not done" case.

~~~
jacobolus
> _the TCP caller shouldn't have to concern itself with details about
> checking whether the full message has been received_

The caller absolutely must concern itself with this crucial “detail”. If you
do otherwise, your code is broken, full stop.

You can implement a higher-level protocol on top which handles this kind of
thing internally and presents a higher-level interface (e.g. not passing any
partial data along to _its_ caller until a full message has been received),
but if you are just working with TCP directly, what you get is just a stream
of bytes. The guarantee you get is that the bytes will be in order and without
any gaps.

If you e.g. send UTF-8 encoded text, you must be prepared on the read side to
have your stream of bytes cut off arbitrarily in the middle of a character.
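For instance, in Python the stdlib's incremental decoder exists precisely for this case (a sketch, assuming UTF-8 text arriving over a socket):

```python
import codecs

# An incremental decoder buffers a trailing partial character instead of
# raising, which is exactly what a TCP reader needs.
dec = codecs.getincrementaldecoder("utf-8")()

payload = "héllo".encode("utf-8")          # 'é' encodes as two bytes
first, second = payload[:2], payload[2:]   # simulate a read() cut inside 'é'

assert dec.decode(first) == "h"            # the lone lead byte is held back
assert dec.decode(second) == "éllo"        # completed once the rest arrives
```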

~~~
SilasX
Let function A call TCP read() and pass back a struct that includes the bytes
it's received and a flag that indicates whether it's read to the end of the
message.

Let function B call TCP read() and never return anything until it's received
all the bytes of the message.

Both of those seem (IMHO) like functions you could have in a TCP library.
Neither seems (IMHO) like a higher level protocol.

~~~
yuribro
The reason this function doesn't exist is that even if the sender sent a
"message" in a single API call, there is no guarantee that the networking code
will send it in one IP packet (or one layer-2 frame). We don't want to couple
the message size to the lower network protocol, the type of network equipment,
and so on, and we also want to allow merging of messages for more efficient
use of the network (a large window size allows better throughput).

So the job of separating the stream into messages is left to the application
layer. (unlike for example in UDP, but then you have to worry about dropped
messages)

~~~
SilasX
What is that responding to? Of course TCP can split a message across multiple
packets of the lower level protocol; that has no bearing on the concepts TCP
works in and whether it can indicate end-of-message.

If your point is just that TCP doesn't have a concept of a "message" (a
bytestream with a clear beginning and ending), then that's fair, but, as I
said elsewhere [1] the original comment took for granted that TCP does have a
well defined notion of "you've reached the end of the message", or at least,
"there is no further data to receive". No one seemed to have a problem with
that there, and I was just working off that assumption.

As before, I haven't checked whether this is true (I can't quickly verify from
descriptions of TCP).

And, interestingly enough, there's this comment [2], which says that what I
described _does_ exist, but isn't the default. So ... I'm at least not getting
a consistent answer to my question, and people who think they know what
they're talking about are inconsistent with each other.

[1]
[https://news.ycombinator.com/item?id=23200293](https://news.ycombinator.com/item?id=23200293)

[2]
[https://news.ycombinator.com/item?id=23200906](https://news.ycombinator.com/item?id=23200906)

~~~
jacobolus
> _the original comment took for granted that TCP does have a well defined
> notion of "you've reached the end of the message"_

No, the original comment was complaining about a coworker who didn’t
understand (and refused to listen when told otherwise) that there is no such
notion in TCP. It was a response to another comment complaining about people
on the internet (e.g. Stack Overflow) too often making the same mistake.

You’re more or less playing the part of that coworker here. It’s unclear why.

------
commandlinefan
When I first started working with computer networks, I just thought of TCP/IP
as "low-level stuff" and I focused instead on the higher level stuff. After I
kept running into incomprehensible errors seemingly over and over again, I
finally broke down and picked up a copy of Richard Stevens' "TCP/IP
Illustrated". Hands down, the best investment of time I've ever made. If you
deal with distributed systems (hint, you do), you _need_ to understand how
they actually work.

~~~
rb808
Great I thought I'll take a look. Three volumes each over 1000 pages? Any
other suggestions? Did you mean all 3 books?

~~~
commandlinefan
Hehe - I did end up reading all 3, and enjoyed them all, but I'd say I got 90%
of the value from volume 1. Volume 2 walks through the BSD implementation of
TCP/IP, which is fascinating, but way more detail than you'd ever need to
know, and volume 3 goes off into some esoteric topics that seemed promising at
the time but mostly ended up being abandoned (along with a brief discussion of
HTTP as it was around the 90's).

If you're going to read it, though, find a used copy of the original Stevens'
first edition, not that terrible desecrated second edition.

~~~
Bootvis
What is wrong with the second edition?

~~~
commandlinefan
It was rewritten by a different author (the original author, Richard Stevens,
died in a car accident in the late 90's). I guess the new guy tried his best,
but he just doesn't have the writing skill that Stevens had.

------
mitchs
Something worth knowing: TCP checksums should not be relied upon. If you
aren't using TLS, your application needs to do its own checksums. A 16 bit 1's
complement sum over a packet is not sufficient, especially given modern switch
ASIC design. They take really fast signals off the fiber and turn them into
slow moving, parallel signals within the ASIC. Often these signal buses are
nice even numbers of bytes wide, 204 in this example. When there is something
flawed in the chip in the slower moving internal pipeline, it will hit the
same bit position every 204 bytes. If it is borked enough to flip two bits
within the packet, those flips will be in the same bit position a multiple of
204 bytes away, meaning the same bit position in the 1's complement checksum.
If one flips up, and the other flips down, it passes! In my case it ended up
corrupting data in a BGP session's TCP stream, executing the world's most
confusing route hijack in our network.
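The arithmetic behind that failure is easy to demonstrate. A sketch of the 16-bit one's complement checksum (the scheme TCP uses) with two offsetting bit flips 204 bytes apart, as in the scenario above:

```python
def ones_complement_sum(data: bytes) -> int:
    # Sum 16-bit big-endian words with end-around carry.
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    return total

def checksum(data: bytes) -> int:
    return ~ones_complement_sum(data) & 0xFFFF

good = bytearray(408)
good[204] = 0x80        # bit 7 set at offset 204

bad = bytearray(good)
bad[0] ^= 0x80          # one bit flips up at offset 0...
bad[204] ^= 0x80        # ...and the same bit position flips down 204 bytes later

assert bytes(bad) != bytes(good)
assert checksum(bytes(good)) == checksum(bytes(bad))  # the corruption passes
```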

~~~
dharmab
One of our favorite troubleshooting stories was "3% of TLS connections fail on
this particular frontend IP address. HTTP works."

Turned out our cloud provider's networking gear had a bug that disabled ECC
and there was a bit flip happening. Convincing the provider's support that we
had found faulty hardware in their datacenter was an interesting journey.

------
citrin_ru
A very important point every developer should know: a successful write(2)
syscall does not guarantee that the data was received by the remote
application. TCP is often described as a protocol that guarantees packet
delivery, and this is misleading.

A write(2) syscall returning without an error means that the data has been
placed in an OS kernel buffer. The kernel will then try to send it to the
remote host. If a couple of packets are lost, that's not a problem - the
kernel will retransmit. But if power is lost shortly after the write, the
data may never hit the wire. There is also the possibility that the network
link stays broken for a long time: the OS will retry, but only for a limited
time, and then give up. The remote host can also crash at any time before the
remote application actually reads the data.

So if you need reliable delivery, you need acknowledgements at the
application protocol level, despite the fact that TCP already has
acknowledgements.
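A minimal sketch of such an application-level acknowledgement in Python (the b"ACK" token and the 4-byte length prefix are invented here for illustration; they are not part of any standard):

```python
import socket
import struct
import threading

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def send_with_ack(sock: socket.socket, payload: bytes, timeout: float = 2.0) -> bool:
    # sendall() returning only means the kernel buffered the bytes;
    # delivery is confirmed only by the peer's explicit reply.
    sock.sendall(struct.pack("!I", len(payload)) + payload)
    sock.settimeout(timeout)
    try:
        return recv_exact(sock, 3) == b"ACK"
    except (socket.timeout, ConnectionError):
        return False

def peer(sock: socket.socket) -> None:
    # The receiver acknowledges only after reading the whole message.
    (n,) = struct.unpack("!I", recv_exact(sock, 4))
    recv_exact(sock, n)
    sock.sendall(b"ACK")

a, b = socket.socketpair()
t = threading.Thread(target=peer, args=(b,))
t.start()
delivered = send_with_ack(a, b"important record")
t.join()
assert delivered
```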

------
duxup
When I used to do networking tech support for some networking equipment, the
guys who sat next to me supported the load balancer product.

I swear a high percentage of their calls were questions about how the load
balancer wasn't working and sending all the traffic to one server and then
after some investigation we discover all traffic is in fact directed to that
lone server... because the client code has the IP of that server hard coded. A
tedious discussion would then ensue about how that is not how to do it.

The next week? Same angry call...

Partly that is what inspired my decision to change careers. "Man if these
developers can't figure out basic networking, maybe I could be a
developer...?"

~~~
joana035
I have the same tedious discussion every time a box is replaced. It goes
along the lines of "no, changing a box will not make your application behave
differently, fix your damn thing".

------
jeffbee
Just enough information to be dangerous? Article attributes behaviors of loss-
based congestion control schemes like Reno and Cubic to TCP itself. In
practice, the congestion control scheme is not really part of the protocol
(there is, for example, BBR). There's also ECN, showing that loss is not the
only way to discover congestion.

~~~
convolvatron
The RTT discussion was a little misleading. It's true that slow start rates
are entirely dependent on RTT... but eventually the sawtooth should reach the
same steady state.

There is work showing that higher-RTT connections do statistically get a
smaller fair share, but that's a subtler, if related, issue. Actually, I
really wish the author had shown the sawtooth.

~~~
toast0
At some point, with increasing bandwidth and increasing RTT, you end up with
your effective bandwidth capped by receive windows and/or send buffers. Cross
country high def video might not be quite enough to hit that, but
intercontinental high def video would be.

Being closer means faster initial 'slow start', but also faster 'slow start'
on congestion, which is why you get a bigger share.

~~~
convolvatron
Sure, but that's really just a window being under the bandwidth-delay
product. The discussion makes it seem like you suffer an outright linear
performance hit.

------
crazygringo
> _But what about large files, such as videos? Surely there is a latency
> penalty for receiving the first byte, but shouldn’t it be smooth sailing
> after that?_

So the article's (unstated) conclusion seems to be that, as long as there
isn't network congestion, it _is_ smooth sailing after that.

But congestion reduces bandwidth, and of course that applies just as much to
a national backbone as to the last mile.

So I'm curious: where _does_ most packet loss occur? Is it last-mile, at your
ISP, or along major backbones? Because that has major implications as to
whether caching video content closer to users actually results in higher-
quality video (e.g. supporting 1080p instead of 720p) or not.

~~~
boryas
> where does most packet loss occur

Here's an interesting paper from SIGCOMM (it won best paper at the conference
in 2018, FWIW) that attempts to figure out what links are congested without
direct access to ISP networks:
[https://www.caida.org/publications/papers/2018/inferring_per...](https://www.caida.org/publications/papers/2018/inferring_persistent_interdomain_congestion/inferring_persistent_interdomain_congestion.pdf)

------
bsamuels
Why doesn't the congestion control part of TCP prevent buffer bloat[1]? Is it
because ISP throttling of the internet connection doesn't touch the TCP
packets themselves?

I recently started doing off-site backups, which requires my entire internet
uplink to be used for uploading said backups for about a week at a time. The
internet basically becomes unusable because all the packets end up in a buffer
on the router and latency spikes to 5000ms.

[1]
[https://www.bufferbloat.net/projects/bloat/wiki/What_can_I_d...](https://www.bufferbloat.net/projects/bloat/wiki/What_can_I_do_about_Bufferbloat/)

~~~
richbhanover
Your router is holding ("buffering") packets in the hopes that they can be
sent soon. Your measurements indicate that the router is "bloated", holding
about five seconds (5,000 ms) worth of data.

This gives the sending TCP algorithm the wrong impression. It's waiting to
hear about a dropped packet to indicate that there's congestion. When your
router holds on to those packets (instead of dropping them), the TCP algorithm
doesn't get any feedback, so it keeps shoveling data into the connection.

This leads to the bad state you're seeing. And that's where the advice on
"What can I do about Bufferbloat?" comes in.

There's no benefit to having more than one packet buffered by the router.
(Hanging on to more than one packet per connection only causes the latency/lag
you're seeing.)

There are routers that actually check the time the router has held packets. If
packets have been queued for "too long", the router discards them immediately,
giving the vital feedback to the sending TCP. Those routers use the technique
known as SQM (Smart Queue Management) and the fq_codel, cake, PIE algorithms
to keep the queues within the router short - typically less than 5 msec.

To solve your problem, investigate getting a router that implements one of
those SQM algorithms. They're listed on the "What can I do..." page. I am a
fan of OpenWrt (use it at home), but have installed a bunch of IQrouters and
Ubiquiti devices for friends.

~~~
JoeAltmaier
TCP has its own buffers too. In a media application I had to use UDP because I
could know how deep the local transmit buffering went. TCP just swallows the
packets and maybe sends them, maybe buffers them. Adding to the problem.

------
rramadass
The article just explains some details but not the main concept behind TCP,
i.e. _TCP is a connection-oriented, stream-based protocol of bytes_. All it
takes is to explain the idea of _counting bytes with a sliding window_ and
almost everybody will "get" TCP. Never talk about _packets_.
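A toy illustration of that mental model - counting bytes with a sliding window (purely schematic bookkeeping, not a protocol implementation):

```python
class SlidingWindow:
    """Byte-counting view of TCP: a window of unacknowledged bytes in flight."""

    def __init__(self, size: int) -> None:
        self.base = 0        # sequence number of the oldest unacked byte
        self.next_seq = 0    # sequence number of the next byte to send
        self.size = size     # advertised window

    def can_send(self, n: int) -> bool:
        return self.next_seq + n - self.base <= self.size

    def send(self, n: int) -> None:
        assert self.can_send(n)
        self.next_seq += n

    def ack(self, upto: int) -> None:
        # A cumulative ACK slides the window forward.
        self.base = max(self.base, upto)

w = SlidingWindow(10)
w.send(8)
assert not w.can_send(5)   # 13 bytes in flight would exceed the window
w.ack(8)                   # receiver acknowledged everything so far
assert w.can_send(10)      # the window slid forward
```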

------
29athrowaway
The RFC is useful as well.

[https://tools.ietf.org/html/rfc793](https://tools.ietf.org/html/rfc793)

TCP state machine diagrams can be useful too.

------
vinay_ys
The single biggest TCP issue I have had to debug and fix numerous times is
not doing connection reuse properly, leading to TCP port exhaustion and
causing seemingly random delays and timeout failures at higher-level
protocols, usually HTTP. This one issue has taken down multi-billion-dollar
production systems.

So, I hope people learn to check their http client/server implementations to
have proper connection handling. Client should have a thoughtfully sized
bounded connection pool with reasonably large idle timeout. It shouldn't close
the connection after every application request (say, http request). There
shouldn't be sockets in TIME_WAIT state accumulating at the client end.
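A minimal sketch of what a bounded client-side pool looks like (purely illustrative; production HTTP clients ship this built in, with idle timeouts and health checks on top):

```python
import queue
import socket

class ConnectionPool:
    """Bounded pool: at most max_size sockets to one host, reused across
    requests instead of opened and closed per request (which is what
    accumulates TIME_WAIT sockets on the client)."""

    def __init__(self, host: str, port: int, max_size: int = 4) -> None:
        self.host, self.port = host, port
        self.idle = queue.LifoQueue(max_size)
        for _ in range(max_size):
            self.idle.put(None)   # slot tokens; sockets are created lazily

    def acquire(self, timeout: float = 5.0) -> socket.socket:
        conn = self.idle.get(timeout=timeout)  # blocks when the pool is exhausted
        if conn is None:
            conn = socket.create_connection((self.host, self.port))
        return conn

    def release(self, conn: socket.socket) -> None:
        self.idle.put(conn)       # keep the connection around for reuse

# Demo against a local listener.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen()
pool = ConnectionPool("127.0.0.1", srv.getsockname()[1], max_size=2)
c1 = pool.acquire()
pool.release(c1)
assert pool.acquire() is c1       # the idle socket was reused, not reopened
```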

Server should accept thoughtfully limited number of connections per client.
Server should never close the connection except when it is shutting down.

There should be tcp keepalive messages to keep the connection alive with
intermediate hop stateful firewalls (connection tracking table entries in
firewalls expire when the connection is idle for too long) and to detect stale
connections and re-establish them.

All of these things can be verified by analyzing a packet capture. You can
get a manageably sized pcap file by filtering on client/server ip/port-range
pairs for at least 330 seconds.

Knowing the tools to understand/debug TCP issues is an essential skill: the
socket statistics command ss, and wireshark/tshark with Lua scripting, are
super useful. Knowing higher-level application protocols like TLS and HTTP is
essential too.

------
ex3ndr
The biggest issue with TCP is that it can randomly freeze, and then you have
to restart the connection, in pretty much any network. You CAN NOT rely on
the socket closing on either side; you have to monitor the connection
yourself.

I am super puzzled why something like WebSockets doesn't solve this problem.
A simple heartbeat would do it, but no one implements it.

~~~
gsich
You can use keepalives at the protocol (TCP) level.

~~~
ex3ndr
In 99% of cases you don't have an API to do so.

~~~
gsich
That's not the protocol's fault. Besides, the OS usually does this, not your
program.

SO_KEEPALIVE is available on all relevant OS.
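For example, in Python (SO_KEEPALIVE itself is portable; the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT knobs for tuning the two-hour default are Linux-specific and spelled differently on other OSes):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning; without it the first probe waits ~2 hours.
if hasattr(socket, "TCP_KEEPIDLE"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle secs before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before reset

assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```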

~~~
Sami_Lehtinen
The default keepalive time of two hours is also quite long for situations
where you would expect to get the notification a bit faster.

~~~
gsich
Yes, default values are not sane for most connections.

------
resca79
I loved this area when I was at university. At the end of the Computer
Networking course I built a project based on
[https://www.isi.edu/nsnam/ns/](https://www.isi.edu/nsnam/ns/)

It was really fun, especially because it lets you understand all the
networking layers better.

I ran some tests on network topology to minimize lost TCP packets as much as
possible, given different kinds of network traffic.

------
dblohm7
This reminds me of an issue I had to debug over a decade ago. Our product had
its own protocol written atop TCP, but its handshake was written in a way such
that it was much slower than it should have been due to delays caused by the
Nagle algorithm.

Turning on TCP_NODELAY was a quick-n-dirty fix, but the real fix was to
rewrite the handshake to be more compatible with the inner workings of TCP.
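The quick-n-dirty version of that fix, sketched in Python (disabling Nagle per socket; as noted above, reworking the handshake so it doesn't depend on this is the better fix):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: small writes are sent immediately instead of
# being coalesced while waiting for ACKs of outstanding data.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```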

------
emmanueloga_
Back in the day I remember learning about TCP and other systems subjects
through Beej's Guides! [1]. Is there other material you'd recommend for
learning network programming? I know about the Stevens books, but they look
so bulky and dry...

1: [https://beej.us/guide/](https://beej.us/guide/)

------
freefriedrice
EDIT: Sure wish I could delete this post.

Wait, this isn't TCP, this is protocol level above TCP, right? TCP doesn't
shape traffic by itself through rate limiting and congestion analysis, does
it? I thought the layer above it used TCP to send/receive the buffer size, and
that has nothing to do with TCP.

Am I wrong?

~~~
scott_s
TCP definitely does congestion control itself:
[https://en.wikipedia.org/wiki/TCP_congestion_control](https://en.wikipedia.org/wiki/TCP_congestion_control)

------
Jonnax
So the video example here, does it indicate that UDP is more suited for video
transmission?

Does http3 fix this?

------
agnivade
Unrelated: does anybody know the tool used to make those diagrams?

~~~
abhorrence
It looks like excalidraw.

~~~
agnivade
Yeah, it's so good and it's open-source!

