Hacker News new | past | comments | ask | show | jobs | submit login
The lost Van Jacobson paper that could save the Internet (root.org)
280 points by wglb on Dec 30, 2011 | hide | past | web | favorite | 48 comments

Interestingly enough Van Jacobson has continued in this line of research, but the future of Internet networking looks radically different than what was proposed in these early days.

A pivotal recognition is that the Internet nowadays is dominated by content, not by endpoints. The modern CDN is really a very complex kludge in order to deal with what we now believe the Internet's primary purpose to be: content delivery. The Internet was originally designed in a TCP/IP point-to-point structure, with the most important aspects being the communication of two parties. Now the most important part is that a given party (or really many parties) receive a specific piece of content, regardless of where that content comes from. In this light, it is wise to redesign the Internet's infrastructure against standard TCP/IP and instead build knowledge of content and caching layers directly into the transport protocols.

The preliminary paper which deals with many of these issues is Jacobson's Networking Named Content[1]. The general idea is that one requests content instead of communication with a specific party. ISPs and other intermediate layers cache content blocks (because its in their own financial best interests by saving money in peering costs and pure networking hardware costs) and your request for content is delivered via the nearest available hot cache. Many other papers have been written with their own tweaks, but this is the general idea for what the next generation of the Internet should look like (from the distributed systems researcher perspective). In my Distributed Systems in Challenging Environments class one of the things my partners and I were working on was building a prototype of CCN (the protocol described in Jacobson's paper) on top of the current TCP/IP layer to examine its real-world characteristics and swarm behavior. We never completely finished the prototype for Contemplating Content Centric Protocols, but it was an interesting project nonetheless.

[1] http://conferences.sigcomm.org/co-next/2009/papers/Jacobson....

This is such a great thread.

I've been hoping to see something tangible from VJ on this for awhile now. He was doing content-based networking even back when he was at Cisco.

To an extent I think recognized by a lot of networking researchers, including IIRC YC's Robert Morris at MIT, the future of the Internet is probably not IPv6 and a network consisting of operating systems addressed by scalar integers.

Instead, IP-type protocols are going to assume the same role that Ethernet and 802.11 (and ATM) play today, as a substrate for connectivity on top of which real applications will be built. The wide area discovery and "session" (OSI style) management roles played by IP and TCP will be played by overlay networks. We have primitive overlays now (DHTs, BitTorrent, Skype) and proprietary ones (CDNs like Akamai, whatever the cable content providers are using to push VOD to the edge), but eventually we'll get a general-purpose open one and the game will be on.

I think this follows straightforwardly from Reed's end-to-end argument. Intelligence belongs to the edges, not the core of the network; intelligence at the core is necessarily a lowest-common-denominator affair, and is hard to scale. That's a key reason why we don't have multicast today. Routers are so bad at scale that tier 1 NSPs filtered BGP announcements by prefix length, because they couldn't handle /32-granular routing for even the tiny subset of machines that actually wanted it (begging the question of how we ever expect IPv6 to do the things the typical HN reader hopes they'll do, like provide them with their own portable /16-equivalents). Multicast effectively asked those same overtaxed routers to address web pages, individually.

That's obviously not going to work when the service model involves random people demanding that AT&T and Level3 add routing table entries for their podcasts. But it scales just fine when the thousands of machines interested in those podcasts share a protocol and an infrastructure for arranging a fan-out overlay.

So many unsolved problems here, from "what is the most reasonable kernel of overlay assembly, group management, and routing to deploy to end-systems" to "how do we avoid congestion collapse in arbitrary group messaging protocols" (a study of multicast reliability algorithms --- at least, from the late 90s when I last did this stuff --- will probably horrify you). But the great thing about it is, nobody has to ask permission to make this stuff work; we can evolve to the real next Internet without getting Comcast's permission, or for that matter (more perniciously) the IETF's.

Yes, there are a huge number of advances to be made in this area. IP multicast was certainly the first draft in making this kind of networking possible but the fundamental problem is the incompatibility with the rest of the way the Internet is run.

The problem we were trying to address in the graduate class is that Jacobson's paper simply assumes that the Internet suddenly "switches," like a lightbulb, to this new method of routing. I think everyone agrees that something that looks like CCN will become the future, but we were interested in the road to get there. We were trying to examine the problems associated with porting to this kind of network without preexisting infrastructure in place.

One thing that you may find interesting is that the network tends to look more like a very layered Bittorrent network in current IP infrastructure, so it may end up that the most effective research ground for an intermediate protocol would be in adapting Bittorrent. The second area that I think is really interesting is secure and (perhaps more importantly) authenticated communication in this protocol. There's an entire other paper cited at the bottom of the CCN paper that I linked to that details the cryptographic authentication used in CCN but there are a number of things that I think either need more detail to be worked through or put me on the edge a little.

We got funded for this idea, by Sony, during the first bubble. Before BitTorrent, when FEC-based multicast file transfer was either FLID/DL (an IETF standard that never went anywhere) or the startup by the guy who invented Tornado codes.

We had a centralized tree-structured directory for discovery, and then (at first) deployable nodes running group messaging and running a weighted link-state routing protocol (more or less cribbed from OSPF), then later a small kernel message forwarding scheme with a programmable control plane so we could build arbitrary routing protocols in Tcl.

Our initial application was chat (we overengineered, like, a little) and we pivoted to streaming video.

We died in part because we hadn't the slightest clue what we were doing, and in part because the VCs replaced our management with a team that decided our best bet was to take our superior technology and go head-to-head with Akamai with our own streaming video CDN.

We'd have been better off just open-sourcing.

Anyhow: I'm obviously biased, but the way I think this is going to happen is, some open source project to build arbitrary overlays is going to catch on (the overlay will be a means-to-an-end for some killer app people actually care about; my bet is, telepresence).

Very interesting. The main difference in our work is that we really latched on to VJ's idea of breadcrumbs and caching blocks. When I alluded to Bittorrent I wasn't really referring to FEC-based distribution. The problem with FEC is as you said somewhere else -- it's not very efficient for small files and it currently isn't very location aware (aside from Ono, my professor's research lab's Bittorrent plugin) so you can end up going halfway around the world for relevant pieces of the file.

Instead, we were more interested with the robust swarm implementations that were already present in some Bittorrent clients, like fairly efficient DHTs. It is also an already existing protocol with millions of users so we could piggyback on Bittorrent clients with plugins to measure the real performance all over the world.

The key difference is the caching. Jacobson proposes that the caching would happen in the ISPs, but there's no reason peers couldn't take up the slack, only different monetary incentives. The protocol here becomes less like OSPF and more like distributed k-minimum spanning tree. If you can find the "supernodes" then you can implement caching on their front and prevent unnecessary communication.

Of course, if you think anything like me your first thought is, "do we actually know if data requests are reliably concentrated in a specific area such that local caching will actually provide much of a benefit?" My partners thought the same thing, which is why they ended up doing a Bittorrent traffic analysis to attempt to plot an AS distribution curve for data requests[1]. They found that there was huge redundancy in intra-AS requests, so there's at least potential here.

Of course there are still a number of problems to be worked out -- if the supernodes don't have an easy incentive to provide caching to other people in their AS, how do we develop such an incentive? But the overall protocol seems very interesting. Most notably it would not work like FEC and small files would probably benefit even more from the caching (which is why you'll also note that files are split into blocks or datagrams or whatever you wish to call them in the VJ protocol).

[1] http://acm.eecs.northwestern.edu/site_media/cccp.pdf

I really don't deserve to be on this paper -- Zach and John really ran this show.

libswift.org is a nice place to start. It compacts and simplifies the tcp/bittorrent stack into 4k lines of cpp. I'm working through an educational port at the moment and will blog about it as I go.

I'm not as optimistic as 'Locke1689 is that BT is the best primitive for a general-purpose overlay, as FEC is less useful for small filesizes and discrete messages.

On the other hand, congestion control is easier to do for FEC-based protocols than it is for generic group messaging.

FEC = Forward Error Correction? Could you expand on that? I'm not familiar with this whole area at all.

Yes. Simplified: you take a message and break it into n blocks such that any unique subset of k many of those blocks reconstitutes the message, even those k is less than n. Think RAID parity (though it's more complicated than that).

One thing that makes FEC attractive in wide-scale group messaging is that you have knobs to turn, in terms of how many simultaneous blocks you attempt to download to reconstitute the message.

As you can imagine, this is an idea that works really well for ISOs and DVDRips, but is clunky for emails.

Ah, ok. Swift uses merkle hash trees for error detection which scales down quite nicely. Packets which don't match the hash are just dropped. They seem to be trying to avoid long request-response chains - most messages just cause a state transition and don't require a direct response.

I've only just started working on understanding swift so I can't speak for its suitability for small content. It is much simpler than bittorrent at least.

The linked Gettys article is fantastic too:


VJ is at PARC now.

The old VJ LBL papers are still collected at ICSI/ICIR (Vern Paxson is still there!):


There used to be some attention paid to deliberate evasion of congestion avoidance (Stefan Savage did a paper on it), the idea being that a malicious user could monopolize a link by ignoring congestion signals (interview question: when you upload a file via Wi-Fi over your DSL connection, how does your computer know how fast to send packets?). Maybe some of the reason that turned out not to matter was, we were in a temporary period of not needing congestion avoidance.

Finally, as I understand it, at least some of the fundamental ideas behind VJ's "hydrodynamic" congestion avoidance scheme are derived from Raj Jain's CUTE, which was done for DEC in '85.

Yes, that blog post by Jim is what led me to investigate and write about this.

The Stefan Savage paper you refer to wasn't so much about ignoring congestion signals as deliberately faking reception of data that hadn't actually been received. With things like duplicate acks or optimistic acking, you could cause the sender to open their window size to whatever you want.


Back in 1999, I was working on integrating these ideas into a router product. It would be software in your home gateway that would keep a cache of the last observed window size for each pair of hosts for a few seconds.

If you started another TCP connection to the same host within that short interval, the router would spoof both sides to increase the window size to the previous value / 2 and then let additive increase take over from there. This bypasses slow start but the current available bandwidth is likely to be very close to the estimate within that short range and hosts didn't cache window size, even when making multiple connections in a short period (e.g., HTTP 1.0)

I archived this paper and a bunch more here. Prepare to lose an afternoon if the history of TCP/IP interests you.


Interesting. I've been looking into the Stephen Savage paper faking-acks part as well for a while now, especially in the context of hosted services in a shared infrastructure (cloud!). My hunch is that it can cause a lot of problems, but I am yet to verify. :)

Am interested in your opinion on using Bayesian or Fuzzy approach to this system control problem.

It's all really nice but because of the patent minefield adoption of newer algorithms will be very slow. IMHO.

For internet routing, theory and reality might be very distant. RED in particular has a terrible record of making things worse on live systems. But that was a few years ago and perhaps newer variants are better.

Also there's a huge problem on routing algorithms due to hundreds of patents covering very trivial uses. For example, there is an awful patent covering the very important case of not dropping tiny ACK packets. For congestion control, it's more useful to drop big TCP data packets to make the endpoints adjust. And the key networking gear players are full on supporting those crazy software patents.

So my prediction is things will stay the same and nothing meaningful will happen for 10 years until a few key patents expire. Nobody will risk exposure to a multimillion IP lawsuit. And anyway, we the users will just pay the bills either way so telcos don't care.

You can thank Big Telcos, Cisco and many others.

Of course big players with a huge patent war-chest (i.e. Google/Amazon/Microsoft) will advance their WANs.

It's worth remembering that there are cheaters on the internet. I have been one.

I built and sold a product that cheated on TCP in order to minimize negative customer experiences during periods of congestion.

It was a one line change in the linux networking code, and suddenly their crappy networks were no longer a problem!

(Just so you don't think me completely evil…

All traffic was across their own networks.

Their networking folks considered 5% packet loss to be acceptable.

The product's purpose was to aggregate and tunnel connections. Multiple TCP inside TCP gets unfairly punished by the congestion algorithms, which is why most VPNs don't use TCP, but sometimes you find yourself under the thumb of administrative rules and you have to do it.

I simply capped our back off time to something like 10 seconds. Short enough that no one thinks the device broke because "the network was stuck, but now I can ping but your device is hung so I'll power cycle it". If your network can't take 1 packet per 10 seconds on behalf of a few dozen aggregated connections, it isn't me that broke your network.)

I had been thinking of:


(And Nate's right, I drastically oversimplified my summary of it). This is "cheating TCP" by Stefan Savage from '99. A great, great paper.

VPN's don't use TCP because having a reliable protocol with retransmission inside another reliable protocol with retransmission can cause lots of extra traffic when a packet is dropped and the data gets retransmitted multiple times, once from each of the protocols.

Ah, the joys of SSL VPNs it seems.

These days, doing proper packet scheduling nowadays is probably a better alternative to AQM than approximated solutions such as RED and many many variants that appeared in academic papers.

RED was designed at a time when implementing packet schedulers was expensive (theory was not completely there yet, firewalls were not widespread, etc.). Now there are fast packet scheduler implementations available. Our own QFQ [1] is distributed with the official FreeBSD and Linux sources, and runs reasonably fast even on an OpenWRT device (remember that AQM needs to act at the bottleneck [2], which is mostly your DSL router for upstream traffic).

You could really try it out easily, and if you are willing to set up a few dummynet pipes [3] (FreeBSD/Linux/Windows) you don't even need to act on the DSL router.


[1] http://info.iet.unipi.it/~luigi/qfq/ but the code is already part of recent FreeBSD and Linux distributions (edited: and even on OSX you'll find an older but similar scheduler as part of ipfw+dummynet), you just need to load the relevant modules.

[2] there were, long ago, some proposals/products which did ack pacing on the reverse path to do cong.control on the forward path. I think it is tricky to make them work.

[3] http://info.iet.unipi.it/~luigi/dummynet/ . You can set two pipes with a bandwdith slightly below that of the DSL link, so you can control the bottlenecks. How can you know what is the bottleneck bandwidth is a different story. You could look at the negotiated speed on the modem, often available through the web pages, but probably you'll have other bandwidth caps elsewhere. You could try to run some of the BW probing apps to some nearby host, hoping that the bottleneck is your local DSL, and dynamically reconfigure the pipes accordingly. I have not tried this yet.

Isn't packet scheduling experimentation also one of the reasons for the Click Modular Router project at MIT PDOS?

The Click codebase is one of my favorite pieces of C++ code ever.

Shaping the link at the source works as long as you're the only computer along the bottleneck link. Once there are varying number of other sources competing for the bottleneck, determining the "rate" is hard (which is what congestion control algorithms do!)

Incidentally, regarding the bottom of Nate's post: if you're at all interested in code analysis, you really should reach out to him about a role on his team. I know a little about what they're building, and it is possibly the coolest business idea any HN participant is working on.

VJ talks about something called "Evolutionary Pressures" at the end of his Google Talk a few years ago. Watch the video if you haven't already.

SOPA if it passes may create such pressure. It's main target is DNS.

CDN's like Akamai rely on universal use of one DNS, the one that SOPA aims to regulate, to accomplish their kludge.

Food for thought.

I prefer "Chicago" to any of the other alternatives.

End-to-end can be realised. Overlays work for small groups. Small groups can share with each other. Networks connecting to other networks. It's been done. ;)

Evolutionary pressures may help.

I stupidly downvoted you (SOPA! GAH!) but you made a good point, so, sorry.

Generally one of the reasons I don't freak out about laws regulating the Internet is that the Internet as we know it today is rickety anyways.

In the last 10 years we've seen the monumental shift (predicted in the late '90s but widely scoffed at) from ad hoc network protocols and native client/server implementations to a common web platform. Nerds still recoil a little at this, thinking about the native/ad-hoc stuff they still use (like Skype), but if you look at the industry as a whole, particularly in the line-of-business software development that actually dominates the field, it's all web now. TCP ports are mostly irrelevant. If you were going to start a new app today, would it be a native implementation of a custom protocol? Probably not!

One of the things that got us to this state was in fact regulatory: default firewall configurations that don't allow anything but web out, and disallow arbitrary inbound connections.

Over the next 10-15 years, I'm hoping we'll get similar nudges away from first-class addressing only for machines (which are less and less useful as an abstraction for completing tasks using technology) and towards first-class addressing for subjects, interests, content, &c. This is an insight lots of people have had, from David Cheriton & TIBCO in the '90s through the RON work at MIT through VJ's work at PARC & so on.

I wrote off Lessig for a bunch of years after reading _Code_, but I think he fundamentally has it right in the sense that implementors have as powerful a say in how things are going to be regulated as legislators do. America had the Constitutional Convention after the Articles stopped scaling; the Internet will have little blips of architectural reconsideration when the impedance between the technology and what people want to legitimately do with the technology gets too high.

(I'm trying to make a technical point here, not a political one; I support copyright, and am ambivalent about SOPA.)

With the widespread use of anycast, does "first class addressing for machines" even matter any more?

In situations where anycast is used, how do you even know what machine a given address to connecting to?

RON was a step in the right direction, imo. With small overlays, MAC addressing comes into play and it becomes a little easier to know what machines (not simply what "addresses") you are really connected to.

Yes, because conceptually you're still talking to a computer (it just happens to be a globally load balanced computer). It's still the unicast service model, and it's still fundamentally about building a channel between two operating systems.

Imagine if you dialled a full telephone number, including applicable country code and local regional code, and depending on where you were dialling from, you reached a different telephone in a different geographical location.

As long as it's an answering machine and the message is the same at every telephone reached by this number, it does not matter.

But as soon as you want to reach a live person, and not simply "content", then what?

Is end-to-end about accessing "content" or is it about communicating with someone operating another computer?

I wouldn't want to noodle on this too much. I take your point. Anycast abstracts away some of the notion that you're talking to a specific computer. But the unicast service model is inherently about computers talking to each other. Many, maybe most, of the most important Internet applications aren't about 1-1 conversations, or if they are, they're 1-1 conversations in special cases of systems that also work 1-many.

Importance is a subjective concept.

One opinion is that a very important Internet application will inevitably be 1-1.

Who did the FCC just hire as their new CTO? What is happening to POTS?

1-to-many systems, hacked to give an illusion of 1-1 conversations, e.g. smtp middlemen, social networking's http servers or twitter-like broadcast sms, are what we settle for today, but, imo, this is a limitation not a desired goal.

The video in question, great stuff: http://www.youtube.com/watch?v=gqGEMQveoqg

> packets are queued for a long time during congestion instead of being dropped quickly. This misleads TCP congestion control and leads to even more congestion.

The last statement is true only for TCP/Reno. Reno here refers to a congestion detection algorithm, and it is that of looking for packet loss to recognize the congestion. This is the simplest and oldest approach, and it has been suceeded by TCP/Vegas, which looks at delivery delays instead, or how it is called a congestion avoidance algorithm. Basically the idea is that if the route gets congested, there will be buffeting happening at some hop and that would cause the RTT along the route to climb up.

This being said, I don't know the split between Reno and Vegas adoption in the real world. However I think newer Windows versions are on Vegas.

This critique of Vegas by VJ is pretty famous (I love that you still have to FTP to ftp.ee.lbl.gov to get it):


I've got a copy of that on my mirror, including the postcript diagram:


In summary, RTT variance as a congestion signal is actually quite bad. It's normal to have variation even without congestion, so utilization would be worse if you backed off every time RTT increased.

From my completely unscientific tinkering with Vegas idea few years ago I remember that there was an easily detectable pattern in RTT change during the congestion. I was testing using UDP between my home and a colo server and I saw RTT climb up to several times its original level in a matter of two round trips.

> It's normal to have variation even without congestion

Certainly, but the question is if congestion-based variation is substantially different from an ambient variation. I am guessing that it is. Reacting to just a RTT increase is pretty dumb. The congestion marker clearly needs to be a bit more advanced than that.

Vegas (and uTP and LEDBAT) tends to get outcompeted by Reno variants, so people generally only use it when they're trying to be "nice".

Linux these days defaults to CUBIC anyway.

I've experienced congestion at my parents house. They live in a rural area and have a connection over a WiMax wireless link. I did a little poking around and the bottleneck is not the wireless link but some link further upstream (e.g. maybe between the ISP and the wireless tower).

I suspect what is happening is that the uptake of video services like YouTube and Netflix has resulted in link capacity being exceeded. The result is not pretty. The link acts like it is dead for about a minute at a time. It periodically starts working for a few seconds, maybe up to a minute, but then dies again. I suspect this is the oscillation effect at work.

Buried at the bottom of the writeup is a chilling tale of the effects of proprietary programs and formats: "The lost paper was never published. [...] [Van] lost the draft and the FrameMaker software for editing it."

If it were not for a draft that was discovered, this paper would have been lost forever.

There are plenty of MIF-file (Frame's XML vocabulary) translators out there, so it needn't have been lost as long as the source file was around.

OpenBSD has supported RED for a long time as an option in the packet filter configuration (http://www.openbsd.org/faq/pf/queueing.html#red). I don't know how closely it follows the original paper, although they do cite it as a reference.

I used RED on an OpenBSD firewall years ago at a small ISP that had too small of a pipe for the number of sites they were hosting out of their closet. It didn't seem to make a remarkable difference, but I wasn't responsible enough at the time to do careful metrics before-and-after.

Instead of the first excess packet being dropped, it gets queued somewhere.

So to update our detection scheme, we need routers to get feedback about the >queueing<.

This is what AQM schemes like RED/ECN do!

I can't actually find a link to the new paper, the "Red in a new light" paper. Just the citeseer link. Is it online?

Right, the citeseer link is to the old copy of the paper. The new version has not been finished yet so I doubt you'll find it anywhere. It's actually likely that it will never be finished and instead will be a jumping off point for new research.

"The lost paper was never published. One roadblock was that the diagram of a toilet offended a reviewer."

Google+ existed as early as 1988?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact