Upstreaming multipath TCP (lwn.net)
167 points by pabs3 on Sept 27, 2019 | hide | past | favorite | 67 comments

We had that in the early days of TCP. Then Berkeley broke it.

Originally, an IP address identified a destination host, not an interface card. Didn't matter over what device a packet arrived. The BSD people, though, tied IP addresses to the network interface, so it mattered over which device a packet arrived. This simplified outbound routing in BSD, but broke multipath TCP.

There are good reasons for layer 3 addresses to be network-interface specific. In fact, that is how layer 3 is supposed to work. After all, each interface is potentially connected to a different network. Hence, routing to those interfaces is different, and layer 3 is about routing.

The mistake instead is that TCP is not fully layer 4. It is entangled with layer 3. Specifically, a TCP socket is defined by an (IP, port) pair, where the IP is layer 3. As such, it is fundamentally impossible to persist a TCP connection across multiple IPs.

There is no reason for this. If instead a socket were identified as (uuid, port), then after I change my IP address, I can continue receiving packets sent to (uuid, port). And the other side will still recognize packets from me, because only my IP changed, not the connection uuid.

You'd maybe need to add some spoofing defenses, but we need those in current TCP too.
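As a toy illustration of the (uuid, port) idea: a sketch (all names here are hypothetical, not any real protocol's API) of demultiplexing by connection ID instead of by the peer's address, so a changed source IP still maps to the same connection state:

```python
import uuid

# Toy demultiplexer: connections are looked up by (connection id, port)
# rather than by the peer's (ip, port), so a roaming peer keeps its state.
connections = {}

def open_connection(port):
    cid = uuid.uuid4()
    connections[(cid, port)] = {"state": "ESTABLISHED", "last_ip": None}
    return cid

def on_packet(cid, port, src_ip, payload):
    conn = connections.get((cid, port))
    if conn is None:
        return None  # unknown connection id: drop
    conn["last_ip"] = src_ip  # peer roamed? just remember the new address
    return conn

cid = open_connection(443)
on_packet(cid, 443, "192.0.2.1", b"hello")
conn = on_packet(cid, 443, "198.51.100.7", b"still me")  # IP changed, same cid
assert conn["state"] == "ESTABLISHED"
```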

You might want to be careful lecturing John Nagle on TCP ;)

(intended in the most lighthearted way possible!)

Watching people try to teach John Nagle how to do networking is half the reason why I like reading the HN forums :-).

I'm not one to defer to authority too easily, but it's my experience that, when someone with enough experience says something that sounds out of this world, it's a good idea to think about it for a bit.

There are arguments for unique IP addresses for interfaces. I'm just amused at the addition of yet another layer of address indirection.

Certainly, I'm convinced that neither you, nor the fine folks at BSD were/are clueless about these things. I think it's important to know why these decisions were made, and what trade-offs were involved, so that I don't get dogmatic about them.

I had no idea. This kind of stuff is why I like HN. I have no idea who the real 'celebrities' are. It feels like most of the discussion is on a pretty even ground.

SCTP, another Layer 4 protocol, was designed with multi-homing in mind:

* https://en.wikipedia.org/wiki/Stream_Control_Transmission_Pr...

I think a lot of the effort put into QUIC, HTTP/2,3,4,whatever, etc, could have been avoided if SCTP had been more broadly adopted.

It is a shame.

Do note that SCTP is heavily used in telecom. Every network element in that world needs to be redundant, along with the multiple routes to get to each of those redundant elements. SCTP helps hide some of that from the application layers.
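For what it's worth, Linux can speak SCTP natively when the `sctp` kernel module is available; a hedged sketch that simply falls back to `None` where it isn't:

```python
import socket

def sctp_socket():
    """Attempt an SCTP one-to-one style socket (IPPROTO_SCTP = 132).

    The sctp kernel module may be absent or the protocol unsupported,
    in which case we return None instead of raising.
    """
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                             socket.IPPROTO_SCTP)
    except OSError:
        return None

s = sctp_socket()
if s is not None:
    s.close()
```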


Yes, I think the SS7 architecture informed some of the design criteria when it came to SCTP. Just about everything in the telco world has (at least) an A- and a B-side for HA.

I believe the issue with SCTP is/was broken middleboxes, though, and not an aversion to SCTP as a protocol, no?

P.s. this rant/idea was taken from this excellent blog post: https://apenwarr.ca/log/20170810

> Specifically, a TCP socket is defined by an (IP, port) pair

It's actually the quadruple (local IP, local port, remote IP, remote port) in the steady state.
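This is easy to see with an ordinary loopback connection; `getsockname()` and `getpeername()` expose the two halves of the quadruple from each end:

```python
import socket

# A TCP connection in the steady state is identified by the 4-tuple:
# (local ip, local port, remote ip, remote port).
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

local = cli.getsockname()   # client's (local ip, local port)
remote = cli.getpeername()  # client's (remote ip, remote port)

# The server's view of the connection is the same tuple, mirrored.
assert remote == conn.getsockname()
assert local == conn.getpeername()

for s in (cli, conn, srv):
    s.close()
```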

Does that mean `IP` by itself is a layer 3 term, but (ip, port) is a layer 4 term?

Example: layer 3 knows nothing about ports?

> Example: layer 3 knows nothing about ports?


Totally right, I misremembered the concept. That makes some of the explanations much easier too.

Any architecture that needs two different identifiers for the same thing is very likely in need of some revision.

But a connection (TCP) and a reachable/routable entity are not the same thing.

Just like a phone number isn't the same thing as an SMS thread.

Sorry, my response was rather cryptic. I was more referring to the interface having both an L2 and L3 address. From an academic perspective, I agree with OP that assigning the L3 address to the interface probably causes more problems than it solves. MPTCP provides a workaround at L4 to what is essentially an L3 problem. But it would take a huge architectural change to go back to assigning addresses to L3 hosts.

Just to provide my 2 cents on the issue:

There are a lot of things at play... Rekhter's law ("Addressing can follow topology or topology can follow addressing. Choose one.") doesn't combine well with the fact that it's not desirable to have end-hosts participate in routing updates. When these two are taken together, they have implications on scalability of any routing solution, especially in the cases of end-host mobility and multihoming.

When "topology follows addressing" (as in TCP/IP), given the constraint of not wanting to advertise the end-host address (as someone mentioned, no ISP will accept your /32 on BGP), assigning the end-host a different address per attached network is the simplest solution. This indeed boils down to assigning L3 addresses to the interface. The TCP connection problem can be handled by tunnels (as in Mobile IP or LISP), but tunnels are rather expensive since the tunnel endpoint in the network has to maintain a lot of state.

In the case of "addressing follows topology", changes in topology, given the constraint of not advertising end-hosts via routing, require renumbering the addresses of the network elements.

So, when taken as a pure L3 problem, it boils down to choosing which problem to solve: constantly tracking the end-hosts, or constantly renumbering the network. The latter, when used in combination with recursive network layering, shows quite some promise. It requires fewer addresses, but I know from experience that it's not the easiest idea to sell.

MPTCP provides a pragmatic workaround on top of L4 that doesn't require tunnels. It's a shame that it's not completely transparent to the applications (it requires explicit code changes to enable). I agree with a lot of the concerns about getting this upstream in the kernel. I tried to play around with it a couple of times, but the distribution as a full kernel was a bit of a roadblock. It would be easier if it were distributed as a kernel patch. And it seems perfectly doable to implement it in user space, which might further speed up adoption.

Anyway, take care :)

Ah, that is an interesting point. I think I see what you mean.

In my mind, L2 addresses are not for routing across organizations, and L3 addresses are. So the L2 address identifies an interface, and a L3 address identifies a routable entity. The weird thing is that there is an almost one-to-one mapping between these.

It might make sense for the same host on e.g. a WiFi and Ethernet interface off the same organization to have the same L3 address. After all, the organization responsible for routing to that host can know those interfaces belong to the same host.

However, once you get into multiple interfaces at disparate organizations, things change. Take, for example, a phone with LTE from some provider and WiFi from some home ISP. There are two separate organizations responsible for routing to those interfaces. Hence, the decisions needed to route to those interfaces differ. This makes routing based on the same address a lot harder.

I think my argument boils down to "topology follows addressing" being highly beneficial in our federated world of routing on the internet. It allows every autonomous network to handle internal routing however they want.

You are basically adding an address to TCP (the UUID) that names the host... Moreover, what happens if the connection fails during setup? A lot of edge cases here.

I am also removing the remote ip address from a socket. Hence, if you change your ip, you can just send from your new ip, include the same connection uuid, and I will still receive. You'd also need to know how to send data back. This could be done in many ways.

Biggest thing is to deal with really easy hijacks where you tell a server that your victim's data should instead be sent to you. This is harder with the current TCP.
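One common way to defend such a migration (roughly the shape of QUIC-style path validation; the framing here is hypothetical, not any real wire format) is to require a keyed MAC over the claimed new address, using a secret agreed at connection setup:

```python
import hashlib
import hmac
import os

# Secret negotiated during the (hypothetical) connection handshake;
# an off-path attacker who never saw the handshake can't produce tags.
key = os.urandom(32)

def sign_migration(conn_id: bytes, new_ip: str) -> bytes:
    """Tag a 'my address changed to new_ip' announcement."""
    return hmac.new(key, conn_id + new_ip.encode(), hashlib.sha256).digest()

def verify_migration(conn_id: bytes, new_ip: str, tag: bytes) -> bool:
    """Accept the migration only if the tag checks out."""
    return hmac.compare_digest(sign_migration(conn_id, new_ip), tag)

cid = os.urandom(8)
tag = sign_migration(cid, "198.51.100.7")
assert verify_migration(cid, "198.51.100.7", tag)
assert not verify_migration(cid, "203.0.113.9", tag)  # forged address fails
```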

You're essentially describing QUIC.

Hmm, I don't quite follow (but I know you were there!). I know there was disagreement about the right way to do multihomed hosts on the Internet; RFC 1122 gets at this with its discussion of the "strong ES model" vs. "weak ES model," and RFC 6418/6419 has a survey of current-ish behavior. By now, most implementations seem to use the "weak ES model" by default or have it configurable. This allows all interfaces to have the same IP address if you want.

But that doesn't solve the problems that MPTCP solves, i.e.: (a) break-before-make failover of a TCP connection across different network paths between two hosts, and (b) combining the resources of multiple end-to-end network paths in one connection.

Because even if a host uses the same IP address for different interfaces (and therefore a TCP connection can survive failure of one interface), it's not like that IP address is going to be individually globally routable. There's no way for some router in the middle of the Internet to know that you just walked out of range of the coffeeshop Wi-Fi and are now only reachable via a commercial LTE ISP, and would like to have incoming datagrams start arriving via the LTE interface (and ISP) instead. They won't let everybody's laptop be its own one-host AS and do a BGP announcement every time it loses a Wi-Fi interface, and even if they did, it would take too long to propagate to be useful. MPTCP (and the mobility in QUIC and Mosh) solves the problem by keeping the network ignorant of the roaming and letting the connection failover to a different network path by having the end hosts address each other at different IP addresses. Similar story, I think, for aggregating network paths between two individually multihomed hosts.

So QUIC provides Mosh-like semi-arbitrary data transfers? This would be great for stuff like an IRC bouncer, because while Mosh exhibits excellent resilience, it prevents efficient rich-client functionality like instant auto-complete (Mosh triggers a prediction delay in that case) and window-switching.

I can add an IP as a loopback just fine - my routers have that all the time.

The problem is then how does the network know where my IP address is. When a packet aims at my loopback, how does it work its way to me?

The answer is the same as interface addresses - we use routing protocols for that, mainly today that's OSPF (in local networks) and BGP (globally).

Convergence takes a relatively long time for BGP, and BGP (at least at a global scale) is limited to /24s. ISPs won't accept your /32 advert.

Even if they did, traffic still only goes in one direction from a given host. On top of that, theoretically if every BGP peer ran BFD you might see failovers in a few seconds, but that's not the real world.

> The problem is then how does the network know where my IP address is. When a packet aims at my loopback, how does it work its way to me?

Mobile IP:

> The basic idea behind the separation is that the Internet architecture combines two functions, routing locators (where a client is attached to the network) and identifiers (who the client is) in one number space: the IP address. LISP supports the separation of the IPv4 and IPv6 address space following a network-based map-and-encapsulate scheme (RFC 1955). In LISP, both identifiers and locators can be IP addresses or arbitrary elements like a set of GPS coordinates or a MAC address.[2]

* https://en.wikipedia.org/wiki/Locator/Identifier_Separation_...

Very complex solutions that don't scale.

And we still need to assign at least one /32 to every mobile device.

IPv6 could give a /64 to each device. That's fine; however, you're still left with a routing table of up to 2^64 entries. Currently the BGP routing table is up to 2^24 entries, but it's really less than 1 million entries.

BGP is not the answer to multi path devices for many many reasons. Tackling it at higher levels (OSI 4-8) is the solution.

No, we didn't in practice, because you also need return routability.

Modern day example: If you are connected over WiFi and 3G over two different ISPs, the packet for your (probably RFC1918, statefully NATted by your home gateway) address of the WiFi interface has absolutely zero chance of arriving over 3G (which probably has a different RFC1918 address, statefully NATted in the CGN in the mobile infra). And vice versa. So strong host model vs weak host model is irrelevant in this context.

MPTCP works absolutely fine in this scenario.

MIPv6 reportedly works to roam (assuming no NAT66 in the path), but can not use two paths at once.

that's neat, but isn't really the same thing. a mptcp socket has multiple subflow sockets, potentially routed over different paths, which each have their own loss statistics, so mptcp will dynamically balance over the links that work the best.

Right. Early thinking was to do that in the routing algorithm, before the Internet became so hub and spoke.

This is how Linux still works by default, right? Behaviour that is quite surprising, at least it was to me when I discovered that it, in conjunction with many tools (e.g., docker) conveniently enabling IP forwarding on all interfaces, unwittingly transforms multi-homed hosts into routers...

I don't think this is correct; it should be disabled by default (net.ipv4.ip_forward=0). At least it is in the major distros I'm familiar with. What distro are you using?

Distros disable it by default, but it gets silently enabled when you install things like Docker (and maybe libvirt) and so on.

At least Docker (as it was pointed out by a sibling reply to my comment) also sets the FORWARD chain policy to DROP.

I believe Docker stopped doing that a couple of years back. Or it sets the default on the forward chain to drop.

You're right, something is setting the FORWARD chain policy to DROP and I guess it's docker.
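The sysctl in question is easy to inspect from procfs; a small Linux-only sketch (returns None on systems without the file):

```python
from pathlib import Path

def ip_forward_enabled():
    """Read the same knob Docker and friends flip:
    net.ipv4.ip_forward, exposed at /proc/sys/net/ipv4/ip_forward.
    Returns True/False on Linux, None elsewhere."""
    p = Path("/proc/sys/net/ipv4/ip_forward")
    if not p.exists():
        return None
    return p.read_text().strip() == "1"

print(ip_forward_enabled())
```

(The FORWARD chain policy itself is only visible via `iptables -S FORWARD` or nftables, which need root, so it's omitted here.)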

>"Originally, an IP address identified a destination host, not an interface card. Didn't matter over what device a packet arrived."

How would an upstream L2/L3 device be aware of this and be able to handle it? We have things like LACP now, but that requires more physical ports and chipsets that understand the protocol. In the early days of TCP a router was just another UNIX box. But even something like LACP requires multiple links to the same upstream device; how would this have worked if the box was connected to, say, two different upstream switches or routers? Today we have things like Juniper's MLAG and Cisco's vPC, but those are proprietary and very expensive solutions.

Fully agree. This is one of the biggest mistakes that was made in the early days of TCP/IP, with far reaching consequences (such as breaking multihoming, but also for instance increasing routing table sizes).

In Ouroboros (our recursive network implementation: https://ouroboros.rocks/), we only identify the interface once (MAC address). Addresses are contained within each recursive network layer, and identify a specific process at each layer (similar to how an IP address should actually identify the host).

Based on what I have read so far Berkeley is responsible for everything that is bad with the world, such as signals for example.

signals might have gotten more complicated in BSD, but the concept certainly existed earlier: https://github.com/dspinellis/unix-history-repo/blob/Researc...

source? or further details?

e.g. either historical discussion or example of other IPv4 system which implemented things as you describe

Pre-existence of IMPs doesn't apply, I don't think, since we are talking IPv4.

Cyclades [0] for instance also named the host, as a historical example.

[0] https://en.wikipedia.org/wiki/CYCLADES

A shame this has to be done within the kernel. There was never much reason to implement TCP handling in ring zero.

I've never seen anyone explain how keep-alive is supposed to work in user-mode, and this worries me in HTTP/3. If your program blocks on something in user-mode, should the connection really die? Doing it in the kernel avoids that. e.g. I know I want to be able to stop my program and continue it in a debugger without breaking network connections and I would be surprised if other people didn't care for this.

If you're sitting in the debugger for N seconds, there's a good chance the other side of the connection is going to give up based on in-band protocol anyway. If you want this to work, you really need a separate process local proxy that satisfies the liveness criteria of the other side, but lets you single step through your side. Of course, debugging the local proxy is an exercise left to the reader; you may need to study at the printf school of debugging.

A separate thread? I assume that's all the kernel would be doing anyway, no ring zero magic involved.

A debugger stops every thread though?

Fair point, I was more addressing the blocking-call concern. There are a number of ways you could get your debugger of choice to not halt a particular thread (this would admittedly require some extra work), but won't a typical browser have an application-level timeout that you'd hit anyway.

Browser? I'm not talking about browsers, I'm talking about every other kind of application that tries to use a userland protocol, because that seems to be the direction network libraries are aiming to go. I have no idea how to tell most (all?) debuggers not to halt particular threads, but it sounds painful. And my point is, there are consequences like this that no one seems to care about at all, but that I would hope would've been addressed before seeing people jump on the ship.

TCP keep alive has a minimum of 2 hours or so. Debugging can live with that.

And a debugger ruins everything: if the other side sends something larger than the local receive buffer, it will usually disconnect after a while, as it will sense no one on the other end.

All the things that a debugger can "ruin" should just be parametric: increase buffer sizes / keep-alive times when debugging.

Also, there are more options than "kernel" and "each process for themselves" - you could have a "network/TCP daemon". QNX successfully does that for disk drivers and file system drivers - and likely network too. So does Minix.

It's just that historically, unix/linux/NT don't.

> TCP keep alive has a minimum of 2 hours or so. Debugging can live with that.

If the "minimum" was 2 hours then e.g. MSDN wouldn't be recommending that it be configured to 5 minutes would it? https://docs.microsoft.com/en-us/previous-versions/windows/i...
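Indeed, on Linux the per-socket keepalive knobs can be set well below the two-hour system default (net.ipv4.tcp_keepalive_time); the `TCP_KEEP*` option names below are Linux-specific:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# idle seconds before the first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)
# seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
# failed probes before the connection is dropped
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)

idle = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
s.close()
```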

> And debugger ruins everything; If the other side sends something larger than the local receive buffer it will usually disconnect after a while. as it will sense no one on the other end.

"Everything"? Well when data isn't being sent then the connection could live on as long as it's kept alive by the kernel, right? Whereas with a userland implementation even that possibility becomes difficult.

You seem really intent on making unfounded blanket claims to rebut my point... but I feel like there's some validity to the point I'm making? It'd be more helpful to see if you can instead find parts of it that might have some truth to them.

> It's just that historically, unix/linux/NT don't.

Yeah, hence why this approach seems problematic...

Use Mozilla's rr. Record the offender, then debug offline.

Yes, I have not heard any good arguments for why TCP should be in a kernel, other than, it's convenient, and it's always been done that way, and your apps get to share it. Like using the kernel as a shared library...

You could put BitTorrent in the kernel. It makes about as much sense architecturally, but isn't too widely used.

POSIX supports sharing filedescriptors between processes.

For example, you can have a process that reads a few bytes from a TCP socket and then passes the socket to another process.
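That hand-off is done with SCM_RIGHTS ancillary data over a Unix-domain socket; Python 3.9+ wraps the dance in `socket.send_fds`/`recv_fds`. A sketch within one process, using a socketpair for brevity:

```python
import os
import socket

# A Unix-domain socket pair standing in for the channel between
# two cooperating processes.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Something to pass: the read end of a pipe with data waiting in it.
r, w = os.pipe()
os.write(w, b"hello")

# send_fds/recv_fds wrap the SCM_RIGHTS ancillary-data mechanism.
socket.send_fds(parent, [b"fd coming"], [r])
msg, fds, flags, addr = socket.recv_fds(child, 1024, 1)

# The received descriptor is a fresh fd referring to the same pipe.
assert os.read(fds[0], 5) == b"hello"

for fd in (r, w, fds[0]):
    os.close(fd)
parent.close()
child.close()
```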

Unix tries to model all I/O, including networking, as operations on files. Realistically it is only possible to get this right if the kernel is involved.

Of course it is quite possible to come up with different models. But Unix seems to be uniquely powerful in its ability to create complex systems from lots of small processes.

The kernel being involved does not imply that all code has to be in the kernel. The FUSE file system interface is a well known way to run filesystem code as user processes. Likewise, there are ways to run device drivers in user space.

The disadvantage is that the extra context switches cause performance loss. So this approach is used for protocols that are rarely used and do not warrant a full kernel implementation.

taken to the logical extreme, why do anything in the kernel for that matter?

I think a good rule of thumb should be that a kernel should be responsible for making hardware devices safe to use among multiple processes, and little else.

Some amount of TCP handling needs to be in the kernel to handle arbitration of the shared resource (namely, connection tuples). Once you've already handled IP defragmentation [1] and looking at the structures to get port numbers, and associate that with a userspace file descriptor/process, you've already got all your cachelines primed, and you may as well finish processing the packet, before handing it to userspace.

[1] boo hiss; I wish the spec were simply to truncate overlong packets at the MTU and indicate that with a flag; the peers could then figure out what to do when a packet arrived that was shorter than its original length. Handling it in-band would mean it was more likely to arrive. Instead we can fragment it, which is icky, because defragmentation sucks; or we can drop it and send an out-of-band message to the sender, but that message may not make it (and often doesn't). TCP could very easily adapt to "I sent 1480 bytes of payload, but my peer is only acking 1472 each time, maybe I should send 1472" --- much easier and quicker than "I keep sending packets and they don't get acked, maybe I should try sending smaller packets" 15 seconds later.

Maybe the kernel implementation of TCP could have been more lightweight like UDP.

Handle socket and port allocation, and then forward all packets to the relevant process unchanged.

The application software is then responsible for reconstructing packets into a reliable stream.

No, it should be implemented in hardware instead.

(This is partly a joke and partly a description of what actually happens with TOE "TCP Offload Engine" network cards.)

DJB's CurveCP / MinimalT framework is probably a much better solution to the problems MPTCP is trying to solve; it also adds security and increases the address space. But... MPTCP it is; at least something is gaining wide use.

> Baerts said, so there will be no user-space access to subflows for now.

So one has to tcpdump to observe the subflows?

> But, naturally, there are users who want their unmodified binary programs to start using MPTCP once it's available. There is a working, if inelegant, solution to this problem. A new control-group hook allows the installation of a BPF program that runs when a program calls socket(); it can change the requested protocol to IPPROTO_MPTCP and the calling application will be none the wiser.
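Opting in from an application is a one-line change to the protocol argument of socket(); a hedged sketch that falls back to plain TCP where the kernel lacks MPTCP (pre-5.6 or disabled):

```python
import socket

# IPPROTO_MPTCP is 262 on Linux; the named constant landed in Python 3.10,
# so fall back to the raw number on older interpreters.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

def mptcp_socket():
    """Open an MPTCP socket if the kernel supports it, else plain TCP."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                             IPPROTO_MPTCP)
    except OSError:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM)

s = mptcp_socket()
s.close()
```

The BPF hook quoted above does essentially this substitution on behalf of unmodified binaries.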


Does anyone know the state of MPTCP on iOS? I heard they were testing it long ago on iOS. And then we haven't heard anything about it.

We implemented it for Siri in iOS 7. Exposed APIs for third party developers in iOS 11.

APIs are available for developers, and MPTCP is also now used for Apple Music
