This is the way I block such things on my own VMs (not at work) using iptables:
iptables -t raw -I PREROUTING -i eth0 -p tcp -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -m tcpmss ! --mss 640:65535 -j DROP
Here it is in action:
iptables -L -n -v -t raw | grep mss
84719 3392K DROP tcp -- eth0 * 0.0.0.0/0 0.0.0.0/0 tcp flags:0x17/0x02 tcpmss match !640:65535
My settings may be a little aggressive and may block some old PPTP/PPPoE users. Perhaps 520 would be a safer low end. As a funny side note, this also blocks hping3's default settings (ping floods), as it doesn't set an MSS. It also blocks a slew of really poorly coded scanners.
For everything else at work, we are behind a layer 7 load balancer that is not vulnerable.
You may also find it useful to block fragmented packets. I've done this for years and never had an issue:
iptables -t raw -I PREROUTING -i eth0 -f -j DROP
If you have the PoC, then feel free to first verify you can browse to https://tinyvpn.org/ then send the small MSS packets to that domain, then see if you can still browse to it. I don't care if the server reboots or crashes. Just please don't DDoS it, as the provider will complain to me.
To see the counters increase, here is a quick and silly cron job that will show you the MSS DROPs in the last minute, that I will disable after a couple days: [1]
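I won't paste the whole thing here, but the idea boils down to something like this sketch (the actual script is at [1]; all paths and file names here are just illustrative):

  #!/bin/sh
  # Report how many packets the tcpmss DROP rule caught since the last run.
  # Assumes a single tcpmss rule in the raw PREROUTING chain, like the one above.
  STATE=/var/tmp/mss_drop_count
  NOW=$(iptables -L PREROUTING -n -v -x -t raw | awk '/tcpmss/ {print $1}')
  PREV=$(cat "$STATE" 2>/dev/null || echo 0)
  echo "$NOW" > "$STATE"
  echo "$(date): $((NOW - PREV)) MSS DROPs in the last minute"

Dropped into /etc/cron.d as something like:

  * * * * * root /usr/local/bin/mss-drop-delta.sh >> /var/log/mss-drops.log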
The iptables command listed in the Mitigation section of the RedHat article sets only SYN in the tcp-flags mask:
iptables -I INPUT -p tcp --tcp-flags SYN SYN -m tcpmss --mss 1:500 -j DROP
If I interpret the man page correctly, the above is broader because it does not care about the presence or absence of other flags, whereas your rule explicitly requires the other listed flags to be unset. In fact, it seems like the above might be broad enough to match incoming SYN-ACK response packets that are the result of outgoing connections.
Am I understanding this correctly, and if so, do you have a thought about why they suggest this?
Theirs is just broader. Both should mitigate the attack, but I would follow theirs instead of my example. They are certainly a better authority on the subject. If something goes wrong, it's much better to say "Followed vendor suggestions" than "followed a random HN poster". :-) That said, I would still use the raw table vs. the filter table's INPUT.
FYI Debian Security Team recommends setting new sysctl value net.ipv4.tcp_min_snd_mss to 536, even though they (Debian) are preserving the default kernel value of 48 for compatibility.
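For anyone who wants to follow that recommendation, it's a one-line sysctl on a patched kernel (the knob only exists once you're patched; the sysctl.d file name below is just illustrative):

  # apply immediately
  sysctl -w net.ipv4.tcp_min_snd_mss=536
  # persist across reboots
  echo 'net.ipv4.tcp_min_snd_mss = 536' > /etc/sysctl.d/99-tcp-min-snd-mss.conf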
It's worth noting that lwIP, the network stack used by most microcontroller based IoT devices, has a very low MSS. It's configurable, but typically defaults to 512.
How would you do that for the INPUT table (not a VM)?
Just:
> iptables -t raw -I INPUT -i eth0 -p tcp -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -m tcpmss ! --mss 640:65535 -j DROP
??
Here [1] gives an example of ... are you just inverting/negating the DROP rule?
>iptables -A INPUT -p tcp -m tcpmss --mss 1:500 -j DROP
The raw table does not contain an INPUT chain. For the raw table you would have to use PREROUTING. If you are using the default table (filter), then you can use INPUT.
So for the raw table, it would be
iptables -t raw -I PREROUTING -i eth0 -p tcp -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -m tcpmss --mss 1:500 -j DROP
For the default (filter) table
iptables -t filter -I INPUT -i eth0 -p tcp -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -m tcpmss --mss 1:500 -j DROP
Generally speaking, if you know you are going to drop everything that matches a pattern or address, it is useful to put that in the raw table, so that malicious traffic can't spike your CPU load as easily. Every packet to the filter table will incur potentially CPU expensive conntrack table lookups. As your conntrack table gets bigger, this gets more expensive.
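If you're curious how big your conntrack table actually is, you can eyeball it like this (paths can vary slightly between kernel versions):

  # current entries vs. the configured ceiling
  cat /proc/sys/net/netfilter/nf_conntrack_count
  cat /proc/sys/net/netfilter/nf_conntrack_max
  # or, with the conntrack tool installed
  conntrack -C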
The reason I use the opposite method is that we know the normal range we want, and drop everything outside it. Programs can also set super high values, or not set MSS at all (which is not the same as 0).
I explicitly set the interface, so that we don't match interfaces such as lo, tun, tap, vhost, veth, etc... because you never know what weird behavior some program depends on. In my example, eth0 is directly on the internet. In your systems, that might be bond0.
It's worth noting that use of -A instead of -I in your example from [1] likely makes this rule ineffective, since it will be appended to the end of the INPUT chain. This has already been reported as an issue[2].
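Easy to verify for yourself; the rule needs to sit above any ACCEPT rules to matter:

  # show rule positions in the INPUT chain (-I inserts at the top, -A appends to the bottom)
  iptables -L INPUT -n --line-numbers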
This example should work for most use cases, but don't do this unless you know the implications for sure. Dropping bad inbound options is easy, but outbound can get more complicated. I am just showing this in case anyone asks and I am asleep. :-) Talk to your network admins and ask what is the highest MSS/MTU your VPNs and 3rd party networks support.
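For the curious, the outbound example I have in mind is just a mirror of my inbound rule (a sketch only; in the raw table the outbound chain is OUTPUT, and -o replaces -i):

  iptables -t raw -I OUTPUT -o eth0 -p tcp -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -m tcpmss ! --mss 640:65535 -j DROP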
This may not even help, as the packet has already been generated and we are too late in the process. I just figure someone might ask. There are probably use cases where this may help (proxies, edge firewalls, hypervisors, Docker hosts, maybe).
Or just log and drop the connections, or send yourself (your app) a tcp-reset.
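In iptables terms, that could look roughly like this (a sketch; note that REJECT lives in the filter table, not raw):

  # log first, then drop (rule order matters, hence the explicit positions)
  iptables -I INPUT 1 -i eth0 -p tcp -m tcpmss --mss 1:500 -j LOG --log-prefix "MSS-DROP: "
  iptables -I INPUT 2 -i eth0 -p tcp -m tcpmss --mss 1:500 -j DROP
  # or, instead of the silent DROP, answer with a TCP reset
  iptables -I INPUT 2 -i eth0 -p tcp -m tcpmss --mss 1:500 -j REJECT --reject-with tcp-reset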
I'd have to dig into the source code of that module. It makes sense to not raise it artificially higher than the requesting host. I just didn't expect them to have that logic. IPTables usually blindly does whatever bad idea I give it. :-)
Increasing the value would cause problems. If the remote end says its MSS is 512 octets, either it doesn't have enough memory to receive larger packets, or its link MTU is small enough that larger packets will always get dropped.
Good point. I don't have a place to test IPv6. You would also have to create ip6tables rules to test that. If you have a test server on IPv6, please apply the rule and link it here. Well, test first, then link here. :-)
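For reference, the v6 equivalent of my rule would presumably look like this (untested, hence the request; the 1220 floor comes from the IPv6 minimum quoted below):

  ip6tables -t raw -I PREROUTING -i eth0 -p tcp -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -m tcpmss ! --mss 1220:65535 -j DROP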
"To avoid fragmentation in the IP layer, a host must specify the maximum segment size as equal to the largest IP datagram that the host can handle minus the IP and TCP header sizes.[2] Therefore, IPv4 hosts are required to be able to handle an MSS of 536 octets (= 576[3] - 20 - 20) and IPv6 hosts are required to be able to handle an MSS of 1220 octets (= 1280[4] - 40 - 20)."
So it seems like one needs to clamp tcpv4 to at least 536, and tcpv6 to at least 1220. (tcpv6 is common shorthand for TCP over IPv6, similar to udpv6 and icmpv6)
That's a great idea. That would certainly help if someone unwittingly had some malware that enabled this behavior on their host, but you still wanted them to reach you.
FYI if your instances are behind an Application Load Balancer or Classic Load Balancer then they are protected, but NOT if they are behind a Network Load Balancer.
A patched kernel is available for Amazon Linux 1 and 2, so you won't have to disable SACK. You can run "sudo yum update kernel" to get it, but of course you have to reboot. Updated AMIs are also available.
Thanks for pointing this out. Applying this buys us time until we can properly patch all our systems. In our case this was easy to roll out in a jiffy.
I do wonder though, can anyone guess what kind of impact one might see with TCP SACK disabled? We don't have large amounts of traffic and serve mostly websites. Maybe mobile phone traffic might be a bit worse off if the connection is bad?
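For context, the mitigation being discussed is just a sysctl toggle (this is the workaround the vendor advisories describe):

  # disable selective acknowledgements system-wide
  sysctl -w net.ipv4.tcp_sack=0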
Disclaimer: I worked on the initial Red Hat article linked above.
On my personal AWS instance, over the last few days less than half a percent of the traffic has hit the firewall rule that logs the error.
Most of that traffic seemed to come from China; it was possibly port probing/portscans, or really old hardware accessing my server.
I would say that the iptables rule is a 'better' solution than disabling SACK, as you may find you use significantly more CPU/bandwidth dealing with retransmits when not using selective acknowledgements.
I was involved with this one for another cloud provider.
I have a personal Digital Ocean (not my employer) instance that is frequently being probed for stuff (primarily Russian and Chinese IPs). Same old, same old.
I've been running with the rule for around a week just logging & dropping small MSS packets out of curiosity, but hardly seen anything worth writing home about. I was somewhat surprised. I'm curious to see how long it takes for that rule to go nuts (my shellshock rule still triggers from time to time, that had a definite curve of action)
A small MSS often means IoT devices, which may have only a kilobyte or so of RAM and so often use an MSS below 256 bytes. They won't be rendering a webpage, but they are totally capable of doing REST API requests.
More and more are moving away from $0.25 microcontrollers and up to $5 SoCs running Linux, so the problem is going away gradually...
You are probably seeing scanners. Most of them probably have the same source port. There are some really poorly coded scanners that set minimal TCP options so they can scan super fast. It seems they don't care about the RFCs when writing those tools. I bet if you set the logging options in iptables to log IP options, you will see very similar options used across most of them. My theory is that they are compensating for the transcontinental latency.
So far on 3 VMs where I've checked (all are public facing; one is a fairly high-traffic MX, another is a webhost), netstat -s informed me that SACK is barely used.
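If anyone else wants to check their own boxes, the counters are visible with:

  netstat -s | grep -i sack
  # or, with iproute2
  nstat -az | grep -i sack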
I'm guessing an MX sees mostly server to server traffic, so I kind of expect that; however for services used by consumers around the world it might be a very bad idea to disable SACK.
The bigger impact will be for users far away, with increased risk of packet loss and higher latency.
It's easy enough to drop packets with a very low MSS, and unless you've got specific needs (someone mentioned IoT), there's no reason not to drop packets with MSS < 536 or so. I believe Windows' smallest MTU (MSS + IP and TCP headers) is 576 bytes, for example.
The original link includes links to the patches. Fascinating how the SACK MSS problem seems to be a relatively simple situation nobody realized could occur.
You'd have to dig pretty deep to realize that the kernel structure is limited to just 17 entries, and then do the math with minimum packet sizes vs. header sizes.
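If I have the advisory's arithmetic right (numbers from memory, so treat this as a hedged sketch):

  # 17 fragments per SKB * 32 KB per fragment = 544 KB queued in one SKB
  # with the 48-byte MSS floor, only ~8 bytes of payload fit per segment
  # 544 KB / 8 B = 69,632 segments, which overflows the 16-bit
  # tcp_gso_segs counter (max 65,535) and trips a BUG_ON -> kernel panic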
Multiple vulnerabilities in the SACK functionality in (1) tcp_input.c and (2) tcp_usrreq.c OpenBSD 3.5 and 3.6 allow remote attackers to cause a denial of service (memory exhaustion or system crash).
No, not "any system". Besides needing SACK enabled (which is by default) you also need segment offloading and non-shite networking hardware that will respect and preserve stupid MSS fields in packets.
and/or disable segmentation offloading:
~$ ethtool -K eth? tso off
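You can check whether it's on in the first place with the lowercase -k:

  ~$ ethtool -k eth? | grep -i segmentation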
TCP and Checksum offloading still aren't super standard on customer grade NICs or virtual machines. I'd assume less than half of the internet's linux hosts are actually at risk.
> TCP and Checksum offloading still aren't super standard on customer grade NICs or virtual machines.
I thought VMware shipped that at least decade ago — is there some specific sub-feature you had in mind? Similarly, at least Apple's consumer hardware had checksum offloading back in the early 2000s and segmentation support shipped in 10.6 (2009) so it seems like it should be relatively mainstream since they tended to use commodity NIC hardware.
No doubt checksumming support's been around for a while; ASIC MD5 is dirt cheap. Yes, VMware shipped with it about a decade ago in ESXi, but that was dependent on host NIC support. OSX may ship with the driver support, but I'm having trouble finding hardware specs to verify hardware support. I said "not super standard" and "customer grade"; I didn't say it wasn't supported at all.
As to the specific subset: TCP Segmentation Offload, as was mentioned in the article.
Yes, I know. I was asking for clarification on the off chance that you were describing something which didn’t ship a decade ago. I first used TSO on servers in the early 2000s and by 2010 even the consumer-grade hardware I was seeing had it.
"When Segmentation offload is on and SACK mechanism is also enabled, due to packet loss and selective retransmission of some packets, SKB could end up holding multiple packets, counted by ‘tcp_gso_segs’."
Segmentation offload in Linux is dependent on checksum offloads, per here:
Sorry, this is probably a bit simplistic, but I am curious: How likely is this to affect embedded devices? E.g., hardware firewalls, routers, IoT devices that all use a Linux kernel?
If you are not exposing any TCP ports, or reaching out directly from those devices to a malicious host, then it's very unlikely. Either way, it's best to check the vendor's site or open a ticket with them, if that is an option.
An alternate option would be to put the device behind another firewall or load balancer or proxy that you know is not vulnerable.
I guess a question I have is then: can I hose you during a TLS handshake? If I can forge DNS, then I can DoS, right? Which makes BGP a prime target right now?
Well, anything that gets a vulnerable device to talk to a malicious device would be an issue. Probably best to check with each vendor and see what their story is if you can't filter the traffic between your devices and potentially malicious devices.
They are equally affected. Linux is Linux; it makes no difference what box it runs in. What is usually a mitigating factor is that embedded devices usually have a very different configuration compared to non-embedded devices (built with minimal options, not a lot of services running on them, etc.).
My favorite of that era was simply the working-as-designed simplicity of sneaking the Hayes modem hangup sequence into various protocols: actual Hayes modems used +++ with a time-delay to send commands such as ATH0 (hangup) but everyone else skipped that time-delay in an attempt to avoid the patent so you could disconnect any modem-connected system if you could figure out how to get it to echo "+++ATH0". Some IP stacks (e.g. Windows 95) would simply send the received ICMP payload as the response so a simple `ping -p …` would do it but people found ways to cause similar problems with sendmail, FTP, etc.
Pop into some random channel, send "/ctcp #channel ping +++ATH0", and wait patiently... a moment or two later you would be rewarded with a flood of "signoff" messages as the users' TCP sessions to the IRC server timed out (by responding to the CTCP, they had, in effect, told their modems to hang up).
The goal, of course, was to get the highest "body count" possible from a single CTCP message.
Smurf attacks, the "ping of death", AOHell, the latest sendmail and wu-ftpd holes of the week, open proxies... the Internet was a very entertaining place for a bored teenager from the midwest back then.
Ah, yeah. Takes me back to my college years. I was a sophomore at the time and was running Win2k Server release candidates. Had a new freshman brag about having WinME, which was on the 9x kernel. Went back to my room in the dorms and alternately sent a ping of death and a ping flood. The ping of death would crash him, while the ping flood was a DoS: his computer would hang trying to handle all of the traffic. It rendered the network unusable on my end while I was doing it, but the PC was otherwise fine (i.e. I could play offline games). Proved my point; he was humbled, stopped bragging, and I left him alone after my little demonstration.
("CVE-2019-11479: Excess Resource Consumption Due to Low MSS Values (all Linux versions)", "CVE-2019-11478: SACK Slowness (Linux < 4.15) or Excess Resource Usage (all Linux versions).")
I know you're being glib, but are 2.6.x kernels vulnerable? The big corps tend to define "all Linux" as all Linux that they support and that isn't end of life. As far as I read, early 3.x kernels on the Ubuntu side are not affected. Like version 12 and before.
So there's plenty of Linux being used out there that's probably not affected.
If you click the Diagnose tab, there's a script that will check your kernel versions and relevant TCP settings. https://access.redhat.com/sites/default/files/cve-2019-11477... If you're not running RedHat, the kernel detection might be too strict, but at least there's some example code for you to check your own settings.
Looks like the issue was fixed upstream a month ago. Might have been nice to know earlier? Is this how long it takes for the distros to lurch into action?
Apparently HN has a time limit on editing; I no longer see an Edit option on my comment. To add to yours (thx!), here are some additional vendors who are now live:
I am finding that HN has limited formatting options - if I don't indent by two spaces (aka code mode) it's a run-on mess, which then requires a large amount of ugly line breaks making the post 5x as large and unreadable. HN formatting instructions are all of 3 sentences long. https://news.ycombinator.com/formatdoc
Normally code mode is entirely unusable for text on mobile phones, as it prevents wrapping beyond the phone truncation point at somewhere around 20-40 characters. Good idea using it for a list of headings and links here.
Nope, it's just the random dupe that ended up getting the upvotes for whatever reasons. This happens constantly and the one that gets the upvotes has more to do with chance than any other factor.
s/chance/timing/. Social media scoring can be fickle indeed, but there are entire industries devoted to optimizing and reverse-engineering the hotness algos of various traffic-drivers. This case is probably coincidental, but it's naive to ascribe high scoring merely to luck. It looks SEJeff posted around 12pm PDT, maybe on his way out to lunch. This post was two hours later, as everyone got back from lunch. :)
[1] - https://tinyvpn.org/up/mss/