Abusing Linux's firewall: the hack that allowed us to build Spectrum (cloudflare.com)
241 points by jgrahamc 11 months ago | 33 comments

I built an HTTP service that listens on all 65535 TCP ports and tells you which port you connected to (very useful to diagnose which outbound ports are firewalled by ISPs or by Wifi networks):


The folks at Cloudflare have done it with an iptables TPROXY rule (which requires the socket to have the IP_TRANSPARENT option) which is how I did it too. But there is another way to do this in Linux: you can use an iptables REDIRECT rule, and the userspace program can obtain the original destination port by doing a getsockopt() call to read SO_ORIGINAL_DST.

Edit: oh I see now the blog post does mention the REDIRECT & SO_ORIGINAL_DST option, but criticizes its performance... which makes sense given its dependence on conntrack.
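For anyone who wants to try the REDIRECT route, here is a minimal sketch of the getsockopt() call in Python. It assumes IPv4 TCP on Linux and a socket accepted behind an iptables REDIRECT rule. The constant value 80 is taken from <linux/netfilter_ipv4.h> because Python's socket module does not export it, and the helper names are made up for illustration.

```python
import socket
import struct

# SO_ORIGINAL_DST is defined in <linux/netfilter_ipv4.h>; Python's socket
# module does not export it, so the value is spelled out here.
SO_ORIGINAL_DST = 80


def parse_sockaddr_in(raw):
    """Decode a 16-byte struct sockaddr_in into an (ip, port) tuple."""
    # Bytes 0-1: address family (host byte order, ignored here).
    # Bytes 2-3: port, network byte order. Bytes 4-7: IPv4 address.
    port = struct.unpack_from("!H", raw, 2)[0]
    ip = socket.inet_ntoa(raw[4:8])
    return ip, port


def original_dst(conn):
    """Ask conntrack for the pre-REDIRECT destination of an accepted connection.

    Only meaningful on Linux for a connection that matched a REDIRECT/DNAT rule;
    otherwise the call raises OSError.
    """
    raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```

Calling `original_dst(conn)` on a connection accepted behind a REDIRECT rule should give back the address and port the client actually dialed. The lookup goes through conntrack, which is the dependence mentioned above.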

There is a typo in Cloudflare's blog post: s/SO_TRANSPARENT/IP_TRANSPARENT/

Indeed! Conceptually REDIRECT is _very_ similar to TPROXY. The subtle difference is that REDIRECT seems to rewrite the destination-host-and-port while TPROXY keeps it the same, only does the routing earlier.

In practical terms, to recover the original target with REDIRECT you have to use the obscure SO_ORIGINAL_DST, while with TPROXY getsockname() will just work.

By this token TPROXY is a bit easier to use. This is for TCP. UDP is a bit harder.
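A rough sketch of the TPROXY side in Python, for comparison. With TPROXY the destination is not rewritten, so the accepted socket's local address, which is what getsockname() returns, is the address the client dialed (getpeername() gives the client's own address). The function names and the 127.0.0.1 bind address are assumptions for illustration. IP_TRANSPARENT needs CAP_NET_ADMIN, and a matching iptables TPROXY rule still has to be set up separately.

```python
import socket

# IP_TRANSPARENT comes from <linux/in.h> (value 19); fall back to the raw
# value in case this Python build does not export the constant.
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)


def make_listener(port, transparent=True):
    """Create a TCP listener on 127.0.0.1.

    With transparent=True the socket gets IP_TRANSPARENT, which is what
    lets it accept connections steered to it by a TPROXY rule.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if transparent:
        # Requires CAP_NET_ADMIN (typically root); raises PermissionError otherwise.
        s.setsockopt(socket.SOL_IP, IP_TRANSPARENT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s


def dialed_address(conn):
    """Return the (ip, port) the client connected to, as seen by the accepted socket."""
    return conn.getsockname()
```

After `conn, peer = listener.accept()`, `dialed_address(conn)` is the original destination and `peer` is the client. No extra socket option is needed to recover the port.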

neat. but why do you need a service at all to detect blocking? you can use timing to also do it easily without the need for any server component at all.

perhaps this works poorly for firewalls near to the service but you declared the problem to be one close to the client. AIUI

When an ISP blocks certain ports by dropping the SYN packet, the client sees a time out. There is nothing to "time" that can prove it's the ISP dropping it.

yes there is. When you don't get a RST back at the expected time (say *2), you know SYN was dropped. Are you arguing that it could be packet loss? You address that by taking multiple samples, and by comparing against loss to ports that you get ACK back from.

«you know SYN was dropped»

But you don't know who dropped it: the ISP or the remote server. In order to show it's the network between the client and server dropping it, you need a server that behaves in a known way, hence open.zorinaq.com. I used to work in the InfoSec industry, running port scans from various locations, and open.zorinaq.com was incredibly useful to ensure there was no random firewall preventing us from finding certain open ports. That was the primary motivation for building the service.
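To make the distinction in this sub-thread concrete, here is a small client-side sketch in Python that sorts a single connect attempt into the three outcomes being discussed: handshake completed, RST received, or silence. The function name, labels and default timeout are made up for illustration. As argued above, a "filtered" result alone does not say who dropped the packet; that only becomes clear when the target is known to accept on every port.

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify one TCP connect attempt as 'open', 'closed' or 'filtered'."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"        # three-way handshake completed
    except ConnectionRefusedError:
        return "closed"      # an RST came back
    except socket.timeout:
        return "filtered"    # nothing came back in time: SYN or reply was dropped
    except OSError:
        return "filtered"    # e.g. ICMP unreachable; lumped in with drops in this sketch
    finally:
        s.close()
```

Taking several samples per port, and comparing against ports that do answer, helps separate ordinary packet loss from a systematic drop.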

Author here. TPROXY module is pretty special, it really would have been hard to handle any inbound port without it. I guess it shows that there are benefits in keeping firewall and network stack code tied close.

Excellent article all the way around. I may assign this to my networking students further into the quarter. It provides a nice alternative use case that should help them think more critically about the relationship between applications, the socket API, and the rest of the network stack. Thanks!

This is great, thanks for sharing. I'm curious about the downstream proxy process (i.e. ::1234) and how you scale it and balance load across multiple instances of the process. You can't really use iptables to load balance your processes as either the DNAT or REDIRECT mechanism will modify the destination address, right?

Ex. # TPROXY directs all traffic to :1234, and these rules load balance to 4 different processes

iptables -t nat -I OUTPUT -p tcp -o lo --dport 1234 -m state --state NEW -m statistic --mode nth --every 4 --packet 0 -j DNAT --to-destination

iptables -t nat -I OUTPUT -p tcp -o lo --dport 1234 -m state --state NEW -m statistic --mode nth --every 4 --packet 1 -j DNAT --to-destination

iptables -t nat -I OUTPUT -p tcp -o lo --dport 1234 -m state --state NEW -m statistic --mode nth --every 4 --packet 2 -j DNAT --to-destination

iptables -t nat -I OUTPUT -p tcp -o lo --dport 1234 -m state --state NEW -m statistic --mode nth --every 4 --packet 3 -j DNAT --to-destination

We have a single Accept queue for all the ports. For TCP it doesn't create any problems - the new connection rate is rarely significant.

For the accept-queue load balancing see these blog posts:



Wow, these are some great resources. Thanks for sharing! I have a call with one of your colleagues in 5 minutes ;)

I have a vpn for myself and some friends that accepts connections on all ports. I set it up over a decade ago on OpenBSD with a simple pf redirect. I have never had any problems with it, but it obviously doesn't see nearly as much traffic as Cloudflare.

Does OpenBSD handle this differently than Linux, or am I doing this wrong?

The trick is that they want the application to be able to see what the original destination IP and port were. I'm not sure if a pf redirect preserves that information.

It does -- or appears to, at least, for my instances of (OpenBSD's) spamd.

Not sure the headline is accurate: surely these kernel mechanisms were invented specifically _to_ allow this functionality? Therefore there is no abuse. More like “we found a mostly-forgotten netfilter feature designed to do the thing we’re trying to do, so we used it”.

Not exactly. TPROXY is designed for transparent proxying but, because of the way the mechanism works, it can also be used to approximate binding to all TCP ports. The latter use case is a bit different.

but they are binding all TCP ports to implement a transparent proxy, no?

Transparent _reverse_ proxy, which very likely was not the originally intended use case.

TPROXY is totally amazing. We used it to modify nginx to create a transparent SMTP proxy that scales. Using TPROXY, we can pretend to be millions of ISP subscriber IPs at once in a single process.

In the near future, I'll need to do something likely very similar to what you did (albeit, probably on a smaller scale). Are there any technical details about this that you can share or perhaps just some pointers to relevant and/or helpful documentation?

(N.B.: I won't even be starting on this for probably a month or two so I haven't even begun to look into it. If there is documentation easily/readily available via a Google search (i.e., I'll find 'em as soon as I Google for 'em) then just ignore my request. Thanks!)

There is a TPROXY mailing list where you can easily get questions answered by the community if not the original author of the patch.

This Python example sets up a transparent HTTP proxy which will show you the basic socket stuff you need to get going.


Discussion for the Spectrum product: https://news.ycombinator.com/item?id=16820631

What happens to this once NFTables takes over? I'm still using iptables in production, but I'm wary since my understanding is that iptables is sort of deprecated in favor of NFTables

nftables doesn't support TPROXY.

Right, TPROXY is an iptables module (which implies that without someone to port it (assuming porting is even possible due to architectural differences), it isn't going to work on NFTables).

To clarify my original question, what will cloudflare do if/when iptables finally goes away? Has thought been put into it? Will they implement their own type of TPROXY? Will they continue to support iptables themselves? There are quite a few paths, and I'm interested in which one they deem most optimal because I respect their opinions a lot.

actually, TPROXY is very very lightly coupled with iptables. In fact, you can directly use TPROXY without iptables.

here's a 50 line kernel module that uses TPROXY to do the same thing without touching iptables.


looking at the nftables code, I think the only reason nftables doesn't support TPROXY is that no one wrote some of the config parsing / serialization stuff.

Sounds like cloudflare might want to start trying to submit some nftables TPROXY support now, so it's there in the vanilla kernel when they end up needing it. :)

I'd expect someone to eventually submit such a patch, though I don't know how urgent this issue is. Iptables isn't going anywhere anytime soon, so Cloudflare can continue to use this method on the edge nodes.

What's the problem with SO_ORIGINAL_DST? Could you please explain a bit why the code is not encouraging? The author of TPROXY also mentioned somewhere else that SO_ORIGINAL_DST is racy, but I'm not a kernel developer and don't understand why. Thanks!

Digging deeper, I found more explanation on StackOverflow https://stackoverflow.com/a/5814636/184061 (seems to be written by tproxy author Balazs Scheidler judging by the username).

I hope this eventually becomes available to everyone (even if in a limited fashion).

Being able to set up a Gitlab/Gitea server behind Cloudflare without having to hack around the SSH port limitation would be fun.

> For completeness, there is also a sysctl net.ipv6.ip_nonlocal_bind, but we don't recommend touching it.

Any ideas for an explanation of this recommendation?

> Well, we can't ever know what the world looks like through another species' eyes.

