
Reimagining the future of routers - erentz
https://medium.com/@HannesGredler/the-end-of-the-router-e4d769aea60f
======
hueving
"What modern IP routers do is exactly this. Every forwarding entry has to be
fast. Analysis of real backbone traffic data, says it does not have to be. For
all practical purposes, today, forwarding tables are oversized by a factor of
10x today."

What he is suggesting will drastically impact the latency of packets as well
as the throughput. Following his analogy of cache hierarchies in a regular
computer, some prefix lookups are going to be 'downgraded' to the main memory
and will take maybe 100x the time.

If a router is forwarding at 10Gbps it has ~51 nanoseconds per packet in the
worst case (64-byte packets). So this is the overhead you can probably expect
at each router as your traffic traverses the net. I'm 17 hops from my old
university's network. That's just under a microsecond of total lookup time.
Not bad, thanks to TCAM or other specialized lookup chips.

If all of these routers adopted this, my lowly traffic would have to read from
main memory on each route lookup because I would never be in the top tier for
bandwidth usage even if I was connected all day long. A full routing table is
>600k routes now[1], so each lookup may have to reference main memory
several times as it walks a trie. Let's assume about 10 times to be generous
which (at 100ns a reference) comes to about 1 microsecond. This is just for 1
packet.
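
As a sanity check on the arithmetic above, here's a quick back-of-the-envelope
script (assumed values: 64-byte minimum-size packets, ~100ns per DRAM
reference, ~10 references per trie walk, 17 hops):

```python
# Back-of-the-envelope check of the numbers above. Assumed values: 64-byte
# minimum-size packets, ~100 ns per DRAM reference, ~10 references per trie
# walk, 17 hops.
LINK_BPS = 10e9                      # 10 Gbps line rate
PKT_BITS = 64 * 8                    # minimum-size packet

budget_ns = PKT_BITS / LINK_BPS * 1e9
print(f"per-packet budget at 10G: {budget_ns:.1f} ns")          # ~51.2 ns

DRAM_REF_NS = 100                    # rough DDR access latency
TRIE_REFS = 10                       # generous guess per lookup
lookup_ns = DRAM_REF_NS * TRIE_REFS  # ~1 microsecond per packet
print(f"DRAM trie walk: {lookup_ns} ns, "
      f"~{lookup_ns / budget_ns:.0f}x over budget")

HOPS = 17
print(f"added lookup latency across the path: {HOPS * lookup_ns / 1e3:.0f} us")
```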

As you start to pile on the thousands of other small connections (we're
talking about provider network routers) that would be put into this steerage
class, there is going to be contention and queuing that could easily push it
into 1ms-10ms depending on bursts.

So if my whole path adopted these routers, I could experience jitter of ~170ms
or worse. Gross.

This is essentially turning into a crappy QoS system where the biggest
bandwidth hogs get the good service and everything else gets garbage.

Thumbs down from me.

1\. [http://www.cidr-report.org/as2.0/](http://www.cidr-report.org/as2.0/)

~~~
hannesgredler
your reasoning is correct (the path latency gets higher), it's just that you
got the final numbers wrong ;-)

on modern software forwarding core (fd.io/VPP and DPDK) you can forward with a
sub 100us latency. so your total latency ends up being roughly 1.7ms "slower".

have a look at this:
[https://www.youtube.com/watch?v=T66BTHnENY8](https://www.youtube.com/watch?v=T66BTHnENY8)

~~~
Hikikomori
Performance within one NUMA node seems to be great. What can you get between
NUMA nodes? That would be a common forwarding path for any router with enough
interfaces.

~~~
feld
If you want good results you don't build a server with more than one CPU
socket. NUMA is too expensive for high performance networking.

~~~
Hikikomori
So you're basically stuck with the number of cores of a single socket per 1U,
which seems like a waste of space and possibly power when 2 cores are utilized
per 10GE port (in this video). That doesn't seem to scale well now that 100GE
is getting more common, so 2 100GE ports per 1U? Depending on where this
device would be used, the recent 32-port 100GE 1U switches seem more
interesting: small FIB, but with the correct protocol support it could fit a
lot of use cases, especially with something like sir[1].

1\. [https://github.com/dbarrosop/sir](https://github.com/dbarrosop/sir)

------
hueving
"Every added feature will make future feature additions harder. If you just
make the number of supported software features large enough, you can
extrapolate, that at some point, this will become unmaintainable. An external
measure of such a condition is how hard it is to get functionality into a
particular main line release. If you are already using software which can
never get “de-featured”, I have bad news — you are doomed to spend your life
in the “eternal bug hell.” Availability goes down, operational cost goes up,
and your vendor cannot possibly fix it. Time to change vendors is the only way
out."

Only if you assume a terrible code base. It's very easy to build routing
protocols as modules because they just maintain pokey old routing information
bases that live in main memory and don't have to react on the nanosecond
scale.

I've worked on modules for OSPF on a vendor router and if the customer isn't
using OSPF, that daemon and its code are never even executed. No "eternal bug
hell".

This whole blog is basically just pitching major feature-gaps as a feature to
prepare us for some MVP I expect to see from him in the coming months that
only supports BGP and ethernet or something like that.

~~~
hannesgredler
good guess :-) - our MLP (minimum lovable product) is BGP and IS-IS along with
a VPP-based software forwarding module.

~~~
sargun
Since when have you guys been leveraging VPP? How do you find it? I tried to
write some code for it, and I found it kinda difficult in comparison to Click
/ Snabb, but I realize that they're two totally different systems.

~~~
hannesgredler
we started integrating it in March. Arguably the compile chain is a bit
heavyweight, but the code is very structured, easy to extend, and has a clean
architecture. Dave Barach and the VPP crew were very quick to answer the
questions we had.

BTW Snabb is cool, but VPP is more feature complete.

------
makomk
The obvious reason not to encode assumptions about the statistical
distribution of internet traffic in your router hardware is that if the
assumptions ever fail (say, because some new P2P service takes off, or video
viewing becomes less centralized) your routers will fall over. He's
essentially proposing to build centralization and the end of the peer-to-peer
internet into routers at the hardware level. Not only that, but anyone who can
generate traffic that breaks that assumption can launch a denial-of-service
attack against your routers.

~~~
hannesgredler
historic traffic patterns clearly show rising inequality / a power-law
distribution. so unless that multi-year (decade-long) trend reverses,
forwarding lookup hierarchies are going to work even better.

~~~
hueving
That's for bandwidth. You're still screwing low volume connections.

Consider 1 million people watching netflix from the perspective of a transit
provider. If you're just looking at bandwidth you can obviously prioritize
lookups to netflix servers. But then you have 1 million streams to different
client IP addresses throughout the Internet. Each on its own will be a small
fraction of the bandwidth, so are you going to punish them all? Not much gain
from lookup hierarchies there.
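
To illustrate the tension, here's a small sketch with a made-up heavy-tailed
popularity distribution (not real backbone data): the aggregate hit rate of a
cache holding only the top 10% of observed prefixes can look great even while
every long-tail destination eats the slow path.

```python
import random
from collections import Counter

# Illustrative sketch, not real backbone data: destination popularity drawn
# from a heavy-tailed (Pareto-ish) distribution, folded into a 600K-prefix
# table. How much traffic does a cache holding the top 10% of the observed
# prefixes capture?
random.seed(0)
TABLE = 600_000
draws = [int(random.paretovariate(0.8)) % TABLE for _ in range(200_000)]

counts = Counter(draws)
by_popularity = sorted(counts.values(), reverse=True)
cache_size = len(counts) // 10                 # top 10% of observed prefixes
hit = sum(by_popularity[:cache_size]) / len(draws)
print(f"aggregate hit rate with a 10% cache: {hit:.1%}")
# ...yet the other 90% of destinations -- the long tail described above --
# all take the slow path.
```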

~~~
hannesgredler
netflix caches are serving millions of subscribers using thousands to tens of
thousands of prefixes, well below 100K. their caches are highly regionalized
so it's no practical problem.

------
ghshephard
The title is clickbait, of course, (L3 routing is obviously not going away)
but the essay is insightful.

Given that he mentioned Amazon, I'm surprised to see that there wasn't more in
this essay regarding Amazon's (and Google's) efforts to build their own
routers.
Also, a number of networking companies have started off by discarding all the
legacy networking functions, and starting afresh (Juniper). It would be
interesting to review the field and see who else is doing this, particularly
in the last 5 years, and what their success has been.

Also - surprised that SDN only gets a brief mention in the conclusion - I
thought, reading the essay, that's the direction he was going, and then it was
over.

~~~
dang
What would be a better (i.e. accurate and neutral) title?

~~~
ghshephard
From the conclusion:

 _The router, and the dynamic control-plane, as basic forwarding paradigm of
the Internet, remains undisputed. However, it gets challenged using new
concepts like SDN and NFV, which promise much faster network adoption,
automated control, reduced time-to-revenue, which all are good business
solutions. In order for router designs to be competitive to those challenges,
requires to re-imagine how router hardware and software get engineered._

So, "Re-imagining the future of routers"

~~~
dang
Good idea. Thanks!

(All: suggesting a good title is the best way to complain about a bad one.)

------
hueving
"I am proposing to fundamentally rethink the router, adopt modern software
architecture and paradigms, and urge the industry to catch up after 10 years
of stagnation."

I think the author may have been living in a hole. This is not a new idea.
Datacenter routing in many of the big companies is already being done with SDN
ala OpenFlow or some other custom protocol.

Right now you can buy a whitebox 'switch' with 40gbps interfaces and load
various operating systems that enable different management styles (e.g.
OpenFlow control like Google [http://opennetsummit.org/archives/apr12/hoelzle-
tue-openflow...](http://opennetsummit.org/archives/apr12/hoelzle-tue-
openflow.pdf)).

The router has already been 're-thought', it's currently just hiding under the
term 'whitebox datacenter switch'.

~~~
hannesgredler
where theory hits reality is when those whiteboxes and their routing stacks
encounter the 600K routes coming from 40 different exits. still no viable
solution out there today - quagga, bird, anything?

~~~
bogomipz
The IPv4 table today is about 610K routes. There shouldn't be a problem
fitting the RIB, associated BGP communities and AS-path info in 32 GB of DDR4
RAM.

There are plenty of boards from the usual folks - Dell, HP and Supermicro -
that will hold far more RAM than that. So why is there no viable option? Also,
this is a hardware concern. How does the choice of open source solution -
Vyatta, Quagga or Bird - matter?

What is the issue?

~~~
wmf
The issue is that Tomahawk has far less than 610K entries of TCAM and a dozen
different teams are exploring various types of RIB caching to accommodate
that.
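
For a rough idea of what such RIB caching looks like, here's a toy LRU sketch
(prefixes, next-hop names, and sizes are invented, and a real FIB would do
longest-prefix match rather than exact-key lookup):

```python
from collections import OrderedDict

# Toy sketch of the RIB-caching idea: keep hot prefixes in a small,
# "TCAM-like" fast path and fall back to the full (slow) table on a miss.
class RouteCache:
    def __init__(self, full_table, capacity):
        self.full = full_table      # prefix -> next hop (slow path)
        self.cap = capacity
        self.hot = OrderedDict()    # LRU-ordered fast path

    def lookup(self, prefix):
        if prefix in self.hot:
            self.hot.move_to_end(prefix)    # refresh recency on a hit
            return self.hot[prefix]
        next_hop = self.full[prefix]        # slow-path (DRAM) lookup
        self.hot[prefix] = next_hop
        if len(self.hot) > self.cap:
            self.hot.popitem(last=False)    # evict the coldest entry
        return next_hop

rib = {"10.0.0.0/8": "ethA", "192.0.2.0/24": "ethB", "198.51.100.0/24": "ethC"}
cache = RouteCache(rib, capacity=2)
print(cache.lookup("10.0.0.0/8"))   # slow-path miss, now cached -> ethA
print(cache.lookup("10.0.0.0/8"))   # fast-path hit -> ethA
```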

~~~
scurvy
Tomahawk is old and busted. Jericho is the new hotness.

Full tables in FIB.

[https://www.broadcom.com/press/release.php?id=s902223](https://www.broadcom.com/press/release.php?id=s902223)

~~~
wmf
Shh! You're ruining it.

------
moonshinefe
I can appreciate there are better systems out there now, but yeah, it won't be
the end of the router any time soon in my opinion. If the transition from IPv4
-> IPv6 is any indication it's going to be a very, very long time before any
of these new technologies gain traction, if at all.

Remember, IPv6 is a necessity in the future if we want to continue to allocate
addresses without running out, while better routing methods are simply an
upgrade. So I'm not sure there will be as strong a push either.

~~~
Sarki
Also think about sub networks: your entry point being in IPv6 while the rest
remains in IPv4.

Even though the use of hostnames is convenient (and encouraged in IPv6), it's
already a headache for knowledgeable people to set up, so imagine non-tech
folks (your father or your grandma).

Unless we come up with a simple and easy process for this we're still miles
away from a full IPv6 world...

~~~
soneil
Honestly, it's probably easier for your grandparents. It'll just show up one
day, it'll just work, and they'll never know it exists. (This isn't just
hypothetical - it's already happening.)

------
al2o3cr
"Lack of micro-services architecture renders technical debt possible."

[citation needed]

Also, the statement seems designed to encourage the reader to accept the
converse: that somehow a microservice architecture will render technical debt
_im_ possible. High-grade bovine excrement, that...

~~~
pconner
Yup. I like microservices as an architecture paradigm, but they are a tool,
and can be used well or misused just like any other tool. And sometimes they
are not the right tool for a job.

------
NKCSS
This is so true for many software projects.

The premise is asking to remove a piece of functionality from software:

"Not possible? — Reason it is not possible is because things have been
constructed as a monolithic system, mostly by just compiling a new feature.
Most often, a given feature is intimately linked to the underlying
infrastructure (like an in-memory database, or some event queue processor),
and, removing it out of the code base, may get to an effort as large as
originally developing the feature. In most cases there is no dedication on how
to clean things up later. Every added feature will make future feature
additions harder. If you just make the number of supported software features
large enough, you can extrapolate, that at some point, this will become
unmaintainable. An external measure of such a condition is how hard it is to
get functionality into a particular main line release. If you are already using
software which can never get “de-featured”, I have bad news — you are doomed
to spend your life in the “eternal bug hell.” Availability goes down,
operational cost goes up, and your vendor cannot possibly fix it. Time to
change vendors is the only way out."

------
walrus01
I was not expecting to see a photo of a 12-year-old kid holding a Juniper T640
FPC with PICs in it. How often do you let a 12-year-old hold something worth
possibly $50,000?

------
mcguire
From About the author:

" _Therefore i co-founded rtbrick.com where those Hyper-scale design
principles are followed, to build the next generation distributed routing and
forwarding platform with unbounded scale on your choice of open hardware._ "

And the comments:

" _We are building a routing /system stack which both runs on vanilla ubuntu
14.04 as well as open network linux. The nice thing about our system is that
it does not make any locality assumptions. — You can run the BGP control-plane
distributed over several compute nodes and the IS-IS control running on
different nodes. Yet the whole thing acts as a coherent system and can drive a
set of bare-metal switches (e.g. A Dell Z9100)._"

------
chaz6
It seems every year there is a new layer of complexity added to networks. If
we kept it simple and put the intelligence in the application layer where it
belongs, routers would be more efficient than they are.

------
jewel
Is it possible to simplify the routing table based on the local topology? In
other words, if I have a core router that has 100 local peers, can I take the
full routing table and find multiple entries that have the same common prefix
and the same next hop and combine them, reducing the number of entries that
need to be kept in memory?

I imagine if this provided potential gains that it'd already be a known
technique, but I can't seem to find any information about it one way or the
other.

~~~
amazon_not
It's called route aggregation or summarization. The problem is that the
routing table is not static, but changes constantly. You can't just aggregate
once, you'll have to deaggregate and update the routing table with each route
update. You also have to be careful that you don't lose information when you
aggregate routes, otherwise you'll end up with routing problems.
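
For a concrete feel for the aggregation step, here's a small sketch using
Python's stdlib `ipaddress` module (routes and next-hop names are invented):

```python
import ipaddress
from itertools import groupby

# Sketch of the aggregation step described above: merge adjacent prefixes
# that share a next hop.
routes = [
    ("203.0.113.0/25", "peer1"),
    ("203.0.113.128/25", "peer1"),  # adjacent to the above, same next hop
    ("198.51.100.0/24", "peer2"),
]

aggregated = []
for next_hop, group in groupby(sorted(routes, key=lambda r: r[1]),
                               key=lambda r: r[1]):
    nets = [ipaddress.ip_network(prefix) for prefix, _ in group]
    # collapse_addresses merges adjacent/overlapping networks
    for net in ipaddress.collapse_addresses(nets):
        aggregated.append((str(net), next_hop))

print(aggregated)
# [('203.0.113.0/24', 'peer1'), ('198.51.100.0/24', 'peer2')]
```

As the comment above notes, a real implementation has to redo this work on
every route update and must never aggregate away reachability information.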

------
bogomipz
The author states:

"Yet, most routers still support a 100ms+ buffer depth for 100GB/s circuits.
Just do the math. You need 1.25 GB DDR4 RAM for each 100GB/s port in a given
router."

What is the math? It's not clear at all how he arrived at that calculation.
That seems like quite an important detail to omit in your first supporting
paragraph. Just saying "just do the math" when it's not clear what that math
is seems a bit ridiculous.

~~~
cataflam
100 Gigabit/s * 100 ms = 10 Gigabit = 1.25 Gigabyte ?
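
Spelled out in code (a trivial check of the same figures):

```python
# The same arithmetic in code: 100 ms of buffer at a 100 Gbit/s line rate.
rate_bps = 100e9                # 100 Gbit/s
buffer_ms = 100                 # buffer depth assumed in the article
buffer_bytes = rate_bps * buffer_ms / 1000 / 8
print(f"{buffer_bytes / 1e9:.2f} GB of buffer per port")   # 1.25 GB
```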

~~~
hannesgredler
thanks, i have taken the liberty to C&P your explanation into the blog.

------
Animats
It's an ad for the guy's startup.

And why is he capitalizing like it's 1390 AD?

~~~
oofabz
Based on his name I'm guessing his first language is German. He's accustomed
to capitalizing all nouns instead of only proper nouns as we do in English.

------
scurvy
While the author has a pointed view of the world and how to solve all of
these "problems", his answer is basically an Arista switch/router based on the
Jericho chipset. Full routes, wirespeed, big FIB, basic software.

But yeah, it's still a router. It's not a "carrier grade" big expensive
router, in the words of Dave Temkin, but it's still a router and will probably
smoke the market for lower-end Juniper MXs and Cisco ASRs.

------
caseymarquis
Genuinely curious, is there an alternative to spanning tree? He'd mentioned
not implementing it, but that feature is a life saver the 1% of the time that
you need it.

~~~
pas
Manual configuration, or using iBGP (so you basically do L3 switching); see
Project Calico, or see what Facebook does (IP-IP encapsulation where the outer
header is basically their switched L2 fabric).

~~~
hueving
>so you do L3 switching basically

Ugh, this term bothers me. Just call it routing! It hails from a day when
routers sucked so much wind that the marketing team at Cisco had to invent a
new term for the stuff that did it fast.

~~~
pas
I usually call it routing, but I hoped OP would recognize the term.

------
the8472
How much more hardware would be needed if we turned on IP-multicast (at least
the source-scoped kind) for everyone?

------
bgilroy26
Related:

It's time to build your own router

[https://news.ycombinator.com/item?id=10936132](https://news.ycombinator.com/item?id=10936132)

------
digi_owl
I must admit this is one dense article.

