Hacker News
Why TCP Timers Don’t Work Well (1986) [pdf] (cuny.edu)
41 points by tjalfi 20 days ago | 13 comments



That paper would need to be put into a current context.

Latencies have changed drastically since 1986. While the basics of TCP haven't changed, some details have. So unless you are very deep into TCP (I am not), reading this is pretty pointless, because you can't tell which parts describe historical mistakes and which describe current issues. Both are useful, but not without that classification.

That said, every time I use Wireshark anywhere other than on a small LAN, I see horrible amounts of retransmissions. So probably some issues haven't been solved since 1986.


This DOES need to be put into context. It is very, VERY old, so old that it even predates all modern congestion control algorithms. The timer problems were solved by Van Jacobson in 1988, who replaced the fixed beta parameter with a variance estimator, which solved the convergence problems. The slow connection start problem was resolved with the slow-start algorithm (which ramps the window up exponentially). The congestion problem was solved with exponential backoff and linear ramp-up after slow start, which ended the "retransmit storms" that had plagued the ARPANET since the beginning. You can read about those improvements in [0], straight from Van Jacobson himself.
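The variance-based timer fix is simple enough to sketch. Below is a toy Python version of the Jacobson/Karels RTT estimator (the gains and the 4x variance term follow RFC 6298, which later standardized the scheme; the function name and the min-RTO parameter are just illustrative):

```python
# Sketch of the Jacobson/Karels retransmission-timeout estimator:
# exponentially weighted estimates of the round-trip time and its
# variance, with RTO = SRTT + 4 * RTTVAR (constants per RFC 6298).

def make_rto_estimator(alpha=1/8, beta=1/4, min_rto=1.0):
    """Return an update function: feed it RTT samples, get back the RTO."""
    state = {"srtt": None, "rttvar": None}

    def update(rtt_sample):
        if state["srtt"] is None:
            # First measurement: RFC 6298 initialization.
            state["srtt"] = rtt_sample
            state["rttvar"] = rtt_sample / 2
        else:
            # Smooth the variance first (using the old SRTT), then the RTT.
            state["rttvar"] = (1 - beta) * state["rttvar"] \
                + beta * abs(state["srtt"] - rtt_sample)
            state["srtt"] = (1 - alpha) * state["srtt"] + alpha * rtt_sample
        # Timeout: smoothed RTT plus four times the variance estimate.
        return max(min_rto, state["srtt"] + 4 * state["rttvar"])

    return update
```

Because the variance term inflates the timeout when RTT samples are noisy, the timer converges instead of firing spuriously on jittery paths, which is exactly the failure mode the 1986 paper describes.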

His scheme became known as TCP Reno and was the basis for almost every later congestion control algorithm (CUBIC, Vegas, Chicago, Compound), until BBR (co-authored by Van Jacobson), which operates on different principles [1].

[0]: https://ee.lbl.gov/tcp.html
[1]: https://research.google/pubs/pub45646/


If you're seeing retransmissions on a wired LAN, then you should replace cables and network equipment until they go away.

A basic assumption of TCP is that retransmission only happens due to congestion. It fundamentally assumes a perfect channel, which is why wireless connectivity doesn't play well with it. There have been many attempts to fix this, but they require changes to both the AP and the client, so I don't think anyone has really bothered to implement them in common hardware yet.


> There have been many attempts to fix this but they require changes in both the AP and the client, so I don't think anyone has really bothered to implement in common hardware yet.

A large part of the wifi stack is in software, and these things have in fact been implemented already, at least on Linux; it is just that deployment is lacking.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


>"A basic assumption of TCP is that retransmission only happens due to congestion."

This is not a basic assumption of TCP. TCP's inability to differentiate between loss due to congestion and loss due to noisy or lossy channels comes from the assumption that the link-layer protocol operates independently of the higher-layer protocols (IP, TCP, etc.).


> It fundamentally assumes a perfect channel, which is why wireless connectivity doesn't play well with it

Wireless generally handles the retransmissions on a lower layer (802.11 and cellular data connections both do it), so TCP doesn't see lost packets.

(Of course, with bad wireless connections you get complete cut-outs, so that even link-layer retransmissions won't work.)

If there were a lot of room for improvement in TCP here, other protocols would have made those improvements, and modern TCP would probably have incorporated them in turn.


TCP assumes that error correction of the physical channel is handled elsewhere, right? I guess that's just stating your point a different way. TCP doesn't know whether it runs over a wire or a radio link, and it doesn't care. But that doesn't mean no error correction is happening, just that it's handled outside of TCP.

> A basic assumption of TCP is that retransmission only happens due to congestion.

From the perspective of a single computer, what's the difference between an error caused by congestion and an error caused by a noisy channel? Can you tell the difference? Isn't a transmission error just a transmission error?


The difference is in what you choose to do after you see an error. If you assume that packet loss can only mean congestion, you back off rather than retransmit, because your retransmission is likely to make the problem worse, not better.

In the case of wireless comms, though, that may just increase lag, as the packet may have been lost for purely physical reasons, and a swift retransmit could be the best option.


No, TCP assumes packets can show up out of order. If they couldn't, you could detect a missing packet much more quickly: the moment you receive a packet out of sequence.

This really comes down not to TCP but to IP and the protocols below it. TCP is for when a packet gets lost somewhere along the line, not for transmission errors over a wireless channel.


Traditionally in TCP the receiver ACKs the last sequence number before any missing packets. When it receives packets out of order it responds by sending duplicate ACKs. This causes the sender to retransmit the packets after the ACK’s sequence number. If there’s no loss, just reordering, then this retransmission is a waste.
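As a rough sketch of that cumulative-ACK behavior (toy Python; segment indices stand in for TCP's byte sequence numbers, and the function name is just illustrative):

```python
# Toy model of cumulative ACKs: the receiver always acknowledges the
# next segment it expects, so out-of-order arrivals produce duplicate
# ACKs even when nothing was actually lost.

def cumulative_acks(arrivals):
    """Given the order segments arrive in, return the ACK sent for each."""
    expected = 0
    buffered = set()
    acks = []
    for seg in arrivals:
        if seg == expected:
            expected += 1
            # Out-of-order segments already buffered now fill the gap.
            while expected in buffered:
                buffered.remove(expected)
                expected += 1
        elif seg > expected:
            buffered.add(seg)  # Hold out-of-order data; can't ACK it yet.
        acks.append(expected)  # Cumulative: ACK the next expected segment.
    return acks
```

For arrivals `[0, 2, 3, 1]` this yields `[1, 1, 1, 4]`: three identical ACKs of segment 1, which in real TCP would trigger a fast retransmit of a segment that was merely reordered, not lost.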

There is a SACK (selective acknowledgement) TCP extension which allows a receiver to say in more detail which packets have been received, to handle reordering more efficiently. But even so, network engineers try hard to avoid anything that can cause reordering: for instance, ECMP (equal cost multipath) balances traffic load across multiple links using a hash of the TCP quadruple to ensure that each flow follows a consistent path.


From my understanding, it assumes the channel is perfect because that's the problem of the underlying layer in the OSI model. TCP does not know what it is transmitted over and therefore should not care about error correction on the transmission at all.


TCP puts a checksum in every packet, which proves it does not assume a perfect channel.
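That checksum is the standard Internet checksum from RFC 1071, a ones'-complement sum of 16-bit words (TCP computes it over the header, payload, and a pseudo-header; the sketch below covers just a raw byte string):

```python
# RFC 1071 Internet checksum: sum the data as 16-bit big-endian words
# in ones'-complement arithmetic, then return the complement of the sum.

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"  # Pad odd-length data with a zero byte.
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # Fold carry back in.
    return ~total & 0xFFFF
```

A receiver verifies by summing the data together with the transmitted checksum; the result is zero when nothing was corrupted.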


This is one of the many causes of the DoS attacks we experienced in the late 90s and early 2000s. I worked as an engineer for Cisco then, and we could only resolve many of the problems with timers. Waiting for an event that never came caused numerous issues. We continue to learn.



