
How Fastly coded their own routing layer for scaling CDN - kickdaddy
https://www.fastly.com/blog/building-and-scaling-fastly-network-part-1-fighting-fib
======
amazon_not
TL;DR: Fastly needs full routing tables on all CDN nodes in order to determine
the best transit path to push content through. To save money, they used a
programmable Arista switch instead of a traditional router. Their solution is
to reflect BGP routes via the switch to the nodes and fake direct connectivity
between the nodes and the transit providers, so that nodes can push content
directly to whichever transit provider they determine is best on a per-packet
basis.
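If I've read it right, each node ends up doing an ordinary longest-prefix match
against a full FIB to pick a transit next-hop per packet. A toy sketch of that
lookup in Python (prefixes and provider names are made up):

```python
import ipaddress

# Toy FIB: prefix -> transit provider (hypothetical entries)
fib = {
    ipaddress.ip_network("203.0.113.0/24"): "transit-A",
    ipaddress.ip_network("203.0.0.0/16"): "transit-B",
    ipaddress.ip_network("0.0.0.0/0"): "transit-C",  # default route
}

def lookup(dst: str) -> str:
    """Longest-prefix match: the most specific covering prefix wins."""
    addr = ipaddress.ip_address(dst)
    best = max(
        (net for net in fib if addr in net),
        key=lambda net: net.prefixlen,
    )
    return fib[best]

print(lookup("203.0.113.7"))   # most specific /24 -> transit-A
print(lookup("203.0.5.9"))     # covered by the /16 -> transit-B
print(lookup("198.51.100.1"))  # only the default matches -> transit-C
```

With a full table this lookup can pick a different provider for every
destination prefix, which is the whole point of pushing the FIB down to the
hosts.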

Please correct me if I'm wrong.

Maybe I'm just obtuse, but I found the blog post confusing about what it is
really about, and long-winded: it takes a very long time to come to the point.
There is a severe lack of context at the beginning; it would have benefited
tremendously from stating up front what they are trying to do and why.

~~~
scurvy
I'm also _really_ confused as to why they insisted on taking full tables at
their edge if they also built their own route optimization technology. It
seems like it would be much simpler and faster if they just took a default
route from each provider and then did route probing on their top 30% prefixes
in each PoP. You don't need full edge routes for that.

You need a single route reflector (or two for redundancy) to tell you what
routes are in the table. Then probe the busiest prefixes (as determined by
flow data) via each provider and take the best route. Localpref the best route
higher in the route reflector and send it to the Linux hosts. No need for a
huge FIB on the edge. This is basically what Noction IRP/Internap FCP
platforms already do.
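A sketch of that probe-and-prefer loop (provider names, prefixes, and the
probe function are all hypothetical stand-ins, not how Noction/Internap
actually implement it):

```python
PROVIDERS = ["transit-A", "transit-B"]

def probe_rtt_ms(prefix: str, provider: str) -> float:
    """Stand-in for an active measurement (e.g. TCP handshake RTT)
    toward the prefix via one provider. Fake but deterministic here."""
    return (sum(map(ord, prefix + provider)) % 60) + 5.0

def preferred_routes(busy_prefixes):
    """For each busy prefix, probe every provider, keep the lowest-RTT
    one, and mark it to be localpref'd up in the route reflector."""
    routes = {}
    for prefix in busy_prefixes:
        best = min(PROVIDERS, key=lambda p: probe_rtt_ms(prefix, p))
        routes[prefix] = {"next_hop": best, "local_pref": 200}
    return routes

for prefix, route in preferred_routes(["198.51.100.0/24", "203.0.113.0/24"]).items():
    print(prefix, route)
```

The winning routes then get injected back into the reflector with the raised
localpref, and the Linux hosts just follow whatever iBGP hands them.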

This reminds me of the scene in Primer where they ask if they're doing it for
a reason or just showing off.

FWIW, the Netflix networks just take a default route from a pair of transit
networks in each PoP. At their scale, they found it better to negotiate better
worldwide transit deals than optimize networking at the PoP level. Different
game than Fastly, but another data point.

~~~
amazon_not
> It seems like it would be much simpler and faster if they just took a
> default route from each provider and then did route probing on their top 30%
> prefixes in each PoP.

Probing just the top 30% of prefixes is likely not enough for their use case.
Probing is also an after-the-fact approach. Fastly needs a solution that will
give the best possible path from the first packet.

> FWIW, the Netflix networks just take a default route from a pair of transit
> networks in each PoP. At their scale, they found it better to negotiate
> better worldwide transit deals than optimize networking at the PoP level.
> Different game than Fastly, but another data point.

Different game indeed. Netflix just cares about throughput; Fastly also needs
low latency, hence the need for more optimization.

~~~
bogomipz
"Probing just the top 30% of prefixes is likely not enough for their use case.
Probing is also an after-the-fact approach. Fastly needs a solution that will
give the best possible path from the first packet."

That's not how BGP works. There's no such thing as getting the best path from
the first packet in terms of latency. There's nothing in BGP that tells you
about the path in terms of latency or performance. The only "best" that BGP
gives you is AS-path length between you and the destination. It tells you
nothing about congestion or packet loss upstream. The only way to get that is
out-of-band probing, like taking latency measurements and pref'ing routes.
There is no approach beyond "after the fact".
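To make that concrete, here is a heavily simplified sketch of the BGP decision
process: highest localpref wins, then shortest AS path, and nowhere is there a
latency term (all values hypothetical):

```python
def bgp_best(paths):
    """Pick the 'best' path the way BGP does (heavily simplified):
    highest LOCAL_PREF, then shortest AS path. No latency anywhere."""
    return min(paths, key=lambda p: (-p["local_pref"], len(p["as_path"])))

paths = [
    {"via": "transit-A", "local_pref": 100, "as_path": [64500, 64510]},
    # Shorter AS path, but it could just as well be congested and slow:
    {"via": "transit-B", "local_pref": 100, "as_path": [64501]},
]
print(bgp_best(paths)["via"])  # transit-B wins on path length alone
```

Nothing in that comparison knows whether transit-B is actually the faster way
to reach the destination.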

There is nothing in this article that talks about optimization of routes.

~~~
amazon_not
> That's not how BGP works.

Of course not. I only stated what Fastly needs; I did not make a statement
about how BGP works.

By having full BGP feeds from all transit providers, the nodes can use the AS
paths as part of the heuristics to determine the best possible path (with the
available information) from the first packet. Obviously the heuristics will
also use after-the-fact probe information to tune the model.
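One way such a heuristic could look (purely illustrative; the scoring function
and the 20 ms-per-hop prior are made up, not Fastly's actual model):

```python
def path_score(as_path_len, probed_rtt_ms=None):
    """Lower is better. With no measurement yet, fall back to an
    AS-path-length prior (a rough proxy); once a probe result
    exists, trust the measurement instead."""
    if probed_rtt_ms is None:
        return as_path_len * 20.0  # crude made-up prior: ~20 ms per AS hop
    return probed_rtt_ms

candidates = {
    "transit-A": path_score(as_path_len=3),                   # unprobed: prior only
    "transit-B": path_score(as_path_len=1, probed_rtt_ms=75), # probed: measured RTT
}
best = min(candidates, key=candidates.get)
print(best)  # transit-A: the measurement beats the shorter-path prior here
```

The point is just that AS paths give you *something* to decide with on the
first packet, and probes refine it afterwards.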

The point is that the solution proposed by the grandparent is not better than
the one described in the article.

> There is nothing in this article that talks about optimization of routes.

No, they talk about route selection.

~~~
scurvy
What you're talking about works OK in a single colo or datacenter model, but
it doesn't map to the PoP model. It's the PoP model that's driving them to do
this. In their PoPs they've got local peers and exchanges that will handle the
bulk of their traffic. The leftover stuff is only local and probably minor.
They can definitely find that with flow data and optimize across a pair of
default routes (basically building their own routing table). It's a pretty
common practice when all you deploy is PoPs. Why do you care about a full BGP
feed when you're only handling 4-10% of the Internet in a given PoP?
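Picking the busiest prefixes out of flow data is a few lines; a sketch over
made-up flow records:

```python
from collections import Counter

# Hypothetical flow records: (destination /24 prefix, bytes transferred)
flows = [
    ("203.0.113.0/24", 9_000_000),
    ("198.51.100.0/24", 4_000_000),
    ("203.0.113.0/24", 6_000_000),
    ("192.0.2.0/24", 500_000),
]

bytes_per_prefix = Counter()
for prefix, nbytes in flows:
    bytes_per_prefix[prefix] += nbytes

# Probe only the busiest prefixes; everything else rides the default route.
top = [prefix for prefix, _ in bytes_per_prefix.most_common(2)]
print(top)  # ['203.0.113.0/24', '198.51.100.0/24']
```

Everything outside that top set just follows the default, which is the whole
argument for not carrying a full FIB at the PoP.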

AS path length doesn't tell you anything about a route's performance. It's a
hint and only a hint.

~~~
amazon_not
> What you're talking about works OK in a single colo or datacenter model, but
> it doesn't map to the PoP model. It's the PoP model that's driving them to
> do this

Could you please explain how the PoP model in this case differs from the
colo/DC model?

Outside the US, Fastly's PoPs cover multiple countries or whole continents.

------
jjoe
Interesting approach. So Fastly is offloading the full routing table to their
carriers' router(s). That's because routers that can hold full BGP tables are
expensive to purchase and maintain. But to retain some form of control,
they're terminating eBGP at the switch and using iBGP to disseminate (inject)
the providers' routes (next hops).
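A toy model of that split, assuming the switch leaves the eBGP next-hop
untouched when reflecting over iBGP (all addresses, prefixes, and provider
names are made up):

```python
# Routes the switch learned over eBGP from each transit provider
ebgp_learned = [
    {"prefix": "203.0.113.0/24", "next_hop": "192.0.2.1", "peer": "transit-A"},
    {"prefix": "198.51.100.0/24", "next_hop": "192.0.2.9", "peer": "transit-B"},
]

def reflect_to_hosts(routes):
    """iBGP update toward the Linux hosts: the next-hop is left as the
    provider's address, which is what creates the 'fake' direct
    adjacency between the hosts and each transit provider."""
    return [{"prefix": r["prefix"], "next_hop": r["next_hop"]} for r in routes]

for update in reflect_to_hosts(ebgp_learned):
    print(update)
```

The switch never needs to hold the full FIB in hardware; it just forwards at
L2 toward whichever provider next-hop the host already picked.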

I feel (I don't have direct experience with this setup) like they're just
offloading some compute power (and therefore cost) to the hosts, so the cost
is automatically spread out across their relatively massive fleet of edge
nodes. A line item showing a router plus support at $100,000 looks bad in
expenditures versus a server with integrated routing at $2,500.

I'm curious how this impacts Varnish, considering how bus-expensive table
lookups can be during odd route changes/flaps (storms). %sys must go through
the roof as a result.

~~~
scurvy
I didn't see him in the room (though that doesn't mean he wasn't there), but
it sounds like Artur was paying attention to Dave Temkin's talk at NANOG (from
Netflix).

Slides:
https://www.nanog.org/sites/default/files/wednesday.general.temkin.panel.pdf

Video:
[https://www.youtube.com/watch?v=-05xWeYGn4A](https://www.youtube.com/watch?v=-05xWeYGn4A)

Titled: "Help! My Big Expensive Router Is Really Expensive!"

Netflix goes even cheaper/simpler and just uses default routes to a pair of
transit providers. It may come as a shock to most of you, but no, Netflix is
not 100% in AWS (compute, yes; network, oh hell no).

~~~
quicksilver03
Considering how high AWS's network transfer prices are, I'm not surprised at
all.

------
francoisLabonte
Note that Spotify also published their own work, likewise on Arista hardware:

https://labs.spotify.com/2016/01/26/sdn-internet-router-part-1/

Podcast where David Barroso talks about it:

http://blog.ipspace.net/2015/01/sdn-router-spotify-on-software-gone-wild.html

Arista blog post:

https://eos.arista.com/spotifys-sdn-internet-router/

Disclaimer: I work at Arista.

~~~
NetStrikeForce
Didn't they move to Google Cloud?

I was very surprised at the time to see their move shortly after reading
David's articles.

~~~
brazzledazzle
I think they just moved event data processing.

------
ChuckMcM
Great read; love the "hey, what does it need?" approach rather than the "how
is this done?" approach. Tut Systems had bought one of the first "hotel
internet" companies back in the '90s, which used a similar approach by
subverting the ARP protocol: when you connected, anything you tried to ARP for
would respond "Yup, that's me! Send me your packets," and you would end up at
the "give us your credit card" signup.

The nice thing is that at this level networking is really simple. And if you
can get access to the internals of switches to craft behaviors at that level,
it is a pretty good way to go.

~~~
snowy
You mean proxy ARP?

~~~
ChuckMcM
Pretty much, but more like proxy ARP on steroids, sort of proxy DNS, proxy
ARP, proxy everything.

~~~
NetStrikeForce
Proxy ARP is a thing, and it is exactly what you've described :)

And if by proxy DNS you mean you'll subvert ARP to reach a DNS server...
that's proxy ARP. DNS is a few layers above :)

Definition:

Proxy ARP is the technique in which one host, usually a router, answers ARP
requests intended for another machine. By "faking" its identity, the router
accepts responsibility for routing packets to the "real" destination.
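That definition is small enough to model in a few lines; a toy proxy-ARP
responder (the MAC and IP addresses are made up):

```python
ROUTER_MAC = "02:00:00:00:00:01"

# Hosts this router is willing to answer ARP for (hypothetical set)
PROXIED_IPS = {"10.0.0.5", "10.0.0.6"}

def handle_arp_request(target_ip: str):
    """Answer a who-has request with our own MAC if we proxy that IP.
    The asker then sends its packets to us, and we forward them on
    to the real destination."""
    if target_ip in PROXIED_IPS:
        return {"is-at": ROUTER_MAC, "for": target_ip}
    return None  # stay silent; not an address we proxy

print(handle_arp_request("10.0.0.5"))  # we "fake" being 10.0.0.5
print(handle_arp_request("10.9.9.9"))  # None: not proxied
```

The hotel-signup trick above is the same idea with `PROXIED_IPS` replaced by
"everything".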

~~~
dmourati
And it is a super dirty network hack. Unfortunately, we can't put that genie
back in the bottle.

------
d33
That's pretty impressive! As a side note, a writeup on BGP security:
https://security.stackexchange.com/questions/56069/what-security-mechanisms-are-used-in-bgp-and-why-do-they-fail

------
jssjr
This is really great work. Do you have any plans to open source some (or all)
of the code behind Silverton?

------
mikecb
If you like this, you'll like the ONS youtube channel.[1] In particular, the
keynotes by Vahdat are pretty amazing.

[1]
[https://www.youtube.com/channel/UCHo2uqQqpmE_Cg5b4qiUpUg](https://www.youtube.com/channel/UCHo2uqQqpmE_Cg5b4qiUpUg)

------
Twirrim
Amazon, Google, etc.: all these companies are building their own custom
network devices, and so much of it comes back to "it does too much, most of
which we don't need" and "we want to do our own thing at that layer".

As James Hamilton noted at AWS re:Invent, not only is there the overhead and
development expense of these unneeded components, but the sheer complexity of
the software running on them inevitably leads to bugs and unexpected
behaviour. By simplifying the device to do just the few things you actually
need it to do, you end up with something more performant and more reliable.

I wonder if the entrenched network appliance providers will wake up?

------
erentz
They'll want to be looking at MPLS and EPE techniques now that there's support
for them on their Arista platforms. This L2 technique is arcane, and it's
going to be painful to scale and to apply generally to other areas.

~~~
bogomipz
Huh? What would MPLS do? They don't operate a backbone; they operate only at
the edge, like all CDNs. There's nothing arcane about this, just basic
networking. Where is the scalability issue?

------
aram26
[https://www.youtube.com/watch?v=TLbzvbfWmfY](https://www.youtube.com/watch?v=TLbzvbfWmfY)

------
pyvpx
I hope Arista pushes other vendors to open up their hardware and provide more
APIs

~~~
pbarry25
That'd be great if they could. We started using Arista 5 or 6 years ago at a
small startup I was working for, because Arista was the _only_ vendor we could
find who was comfortable giving us that much access to the inner workings of
their high-speed switches. Really enjoyed working with their gear and, when
the occasional question came up, with their engineers (top-notch folks).

------
davidu
This is awesomely clever. Not surprised to see from the Fastly team.

------
al_fountain24
extremely cool stuff

