
Uncovering performance regressions in the TCP SACKs vulnerability fixes - rxin
https://databricks.com/blog/2019/09/16/adventures-in-the-tcp-stack-performance-regressions-vulnerability-fixes.html
======
djhworld
This is a fun write-up to read, because I experienced exactly the same problem
in a similar system dealing with lots of writes to S3: sporadic 15-minute
timeouts for no apparent reason, even though the files were only a few
megabytes in size.

It led me on a similar journey of diving deep into the stack, right down to
diffing the kernel tree to work out what had changed between kernel versions.
In the end I came to the same conclusion, and the problem has only recently
been patched in CentOS 6/RHEL 6:
[https://access.redhat.com/errata/RHSA-2019:2736](https://access.redhat.com/errata/RHSA-2019:2736)

Interestingly, after identifying this problem, I also noticed similar
behaviour on AWS Lambda shortly after June 20th, with the TCPWqueueTooBig
counter spiking and causing Lambda timeouts. It took a few rounds through AWS
support (and our account managers) to get them to look at it, but they
eventually fixed it.
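
For anyone wanting to check the same counter on their own hosts, here is a
minimal Python sketch that reads it from /proc/net/netstat. It assumes a Linux
kernel that carries the SACK fixes (older kernels simply won't have the
counter) and that the counter sits in the TcpExt table, spelled
TCPWqueueTooBig. The lookup is case-insensitive in case the spelling differs.
The function names are made up for illustration.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {prefix: {counter: value}}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    tables = {}
    # Rows come in pairs: a header row of counter names, then a row of
    # values, both starting with the same "Prefix:" token.
    for header, values in zip(rows[0::2], rows[1::2]):
        if header[0] != values[0]:
            raise ValueError(f"mismatched rows: {header[0]} vs {values[0]}")
        prefix = header[0].rstrip(":")
        tables[prefix] = dict(zip(header[1:], map(int, values[1:])))
    return tables


def find_counter(tables, name, prefix="TcpExt"):
    """Case-insensitive lookup; returns None if the counter is absent."""
    for key, value in tables.get(prefix, {}).items():
        if key.lower() == name.lower():
            return value
    return None


def read_tcp_counter(name, path="/proc/net/netstat"):
    """Read one TcpExt counter from the live system (Linux only)."""
    with open(path) as f:
        return find_counter(parse_netstat(f.read()), name)
```

Calling read_tcp_counter("TCPWqueueTooBig") twice, a few seconds apart, and
comparing the two values is a rough way to see whether the counter is climbing
while a transfer is stalled.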

I think the common thread between this post and my experience is that we are
both using a Java/JVM-based stack. When trying to reproduce the bug for Amazon
I could only reproduce it with a simple Java example, whereas my attempt with
Golang seemed to run fine, and I'm not really sure why that was.

Maybe I'll write a similar blog post about my findings. At least I learnt a
lot from it!

~~~
ccstevens
Chris from Databricks here.

Glad you enjoyed the write-up and glad to hear we aren’t alone.

We also had difficulty creating a repro outside of Spark (JVM). I tried with
Python sockets without any luck. That said, hitting the issue requires the
right mix of dropped packets, socket buffer sizes and MSS. I don’t think there
is anything special about the JVM influencing those variables. Now that I know
more, maybe I can craft a minimal repro in another language.

A datapoint I didn’t mention in the post is that we had a significantly higher
repro rate when talking to S3 through a VPC endpoint. The only difference I
could see was that the VPC endpoint connections had an MSS of 1412, while the
MSS was slightly higher (1436 IIRC) on non-VPC connections. Yet to draw
conclusions from that.
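
To compare MSS and socket buffer sizes across connections like the ones
described above, a small Python sketch along these lines may help. It only
reads back what the kernel reports for a connected socket through standard
getsockopt options. Note that the TCP_MAXSEG value is the kernel's current
effective MSS, which can differ from the MSS advertised in the SYN (for
example when TCP timestamps are in use). The function names are illustrative.

```python
import socket


def describe_tcp_socket(sock):
    """Return the MSS and buffer sizes the kernel reports for a TCP socket."""
    return {
        "mss": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG),
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }


def probe(host, port, timeout=5.0):
    """Open a TCP connection, report its parameters, then close it."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return describe_tcp_socket(sock)
```

Running probe() against the same service from inside and outside a VPC
endpoint, with whatever host and port apply in your setup, would show whether
the two paths end up with different MSS values.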

