How Netflix uses eBPF flow logs at scale for network insight (netflixtechblog.com)
310 points by el_duderino on June 8, 2021 | 34 comments



Facebook has a similar system, unsurprisingly. An agent runs on every host to sample 1 in N packets, accumulating counts by source/dest host/cluster/service/container/port and so on before sending the aggregate data to Scuba [0] for ad-hoc analysis. This tool was really useful: in a matter of seconds you could see traffic types and volumes broken down by almost any dimension. Did service X see a huge jump in traffic last week? From where? Which container or service? How much bandwidth did your compression changes save? And so on. It also had some really neat stuff to identify whether or not a flow was TLSed, which was crucial for working out what still needed to be encrypted in light of the Snowden revelations.
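
Roughly, the agent side could look like the following hypothetical Python sketch (not Facebook's actual code; the pkt fields, SAMPLE_RATE, and send_to_scuba are made-up names): sample 1 in N packets, bump a counter keyed by the dimensions you care about, and periodically flush the aggregates instead of shipping raw packets.

    import random, time
    from collections import Counter

    SAMPLE_RATE = 1000        # sample 1 in N packets (illustrative value)
    FLUSH_INTERVAL = 60       # seconds between flushes to the log pipeline

    counts = Counter()
    last_flush = time.time()

    def on_packet(pkt):
        # pkt is assumed to carry host/service/port metadata and a byte length
        if random.randrange(SAMPLE_RATE) != 0:
            return
        key = (pkt.src_host, pkt.dst_host, pkt.src_service,
               pkt.dst_service, pkt.dst_port)
        counts[key] += pkt.length * SAMPLE_RATE   # scale the sample back up to an estimate

    def maybe_flush(send_to_scuba):
        global last_flush
        if time.time() - last_flush >= FLUSH_INTERVAL:
            send_to_scuba(dict(counts))           # ship aggregates, not packets
            counts.clear()
            last_flush = time.time()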

TCP retransmits were also sampled in detail. Being able to see at a glance if e.g. every host in a rack is the source of (or target for) retransmits made troubleshooting much faster.

These systems were really awesome and a good example of what could be built with relatively little effort (the teams involved were small) when you already have great components for reliable log transmission and flexible data analysis.

[0] https://research.fb.com/publications/scuba-diving-into-data-...


> TCP retransmits were also sampled in detail.

Hosts usually collect SNMP metrics, which include TCP retransmits and more. Do you know what SNMP was lacking compared to eBPF? All I can think of is that eBPF gives you more dimensions.


Doing it with eBPF gives you a hook for each retransmit, which makes it possible to know the exact connection, process, and network interface that hit the retransmit and allows you to measure things like "100s of hosts are hitting 1000s of retransmits to 10.0.0.56:443", which 10.0.0.56's netstat metrics may not clearly indicate. It gets more interesting if you break things down by VM hosts, racks, rows, data centers, etc.

If you go deeper with the eBPF tracing, you can also determine which code path the retransmit occurred on, which may or may not be interesting.
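
For the curious, a minimal sketch of that hook using the BCC Python bindings: attach a kprobe to tcp_retransmit_skb and emit one event per retransmit with the connection details. This is purely illustrative (it is not the system described in the article or above), but the BPF_PERF_OUTPUT / attach_kprobe pieces are standard BCC; an agent would aggregate these events rather than print them.

    from bcc import BPF
    from socket import inet_ntop, ntohs, AF_INET
    from struct import pack

    prog = r"""
    #include <uapi/linux/ptrace.h>
    #include <net/sock.h>

    struct event_t {
        u32 pid;
        u32 saddr;
        u32 daddr;
        u16 dport;
    };
    BPF_PERF_OUTPUT(events);

    int trace_retransmit(struct pt_regs *ctx, struct sock *sk)
    {
        struct event_t ev = {};
        ev.pid   = bpf_get_current_pid_tgid() >> 32;
        ev.saddr = sk->__sk_common.skc_rcv_saddr;
        ev.daddr = sk->__sk_common.skc_daddr;
        ev.dport = sk->__sk_common.skc_dport;   /* network byte order */
        events.perf_submit(ctx, &ev, sizeof(ev));
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

    def handle(cpu, data, size):
        # One event per retransmit: here we just print, a real agent would aggregate
        ev = b["events"].event(data)
        print("pid %d retransmit -> %s:%d" % (
            ev.pid, inet_ntop(AF_INET, pack("I", ev.daddr)), ntohs(ev.dport)))

    b["events"].open_perf_buffer(handle)
    while True:
        b.perf_buffer_poll()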


Correct me if I'm wrong, but the SNMP retransmit counters are just that: a count of retransmits the host sent. Raw retransmit counts are often just a vague indication that something's up: the host is retransmitting, either because there's loss on the path or the receiver is overloaded. But given that a host can talk over many paths to many other hosts, a raw count isn't specific enough to be useful.

The system Facebook built (which predated eBPF and used a custom ftrace event) produced, effectively, tuples of `(src_ip, src_port, dst_ip, dst_port, src_container, dst_container, ...)` and aggregated them over all hosts. This allowed counting retransmits by, say, receiving host. If there's one host that has a bad cable and is receiving retransmits from 1000 clients, we may not see that signal in simple counters on the clients; for them it's just a tiny bump in the overall retransmit rate. But if we aggregate by receiving host, the bad guy will stand out like a sore thumb. The same goes for all the hosts in a rack, or all hosts reachable over a given router interface, or whatever else you want. One of my common workflows when facing a bump in general errors (e.g. timeouts to the cache layer) was to quickly try grouping retransmits by a few dimensions to see if one particular combination of hosts stood out as sending or receiving more retransmits.

tl;dr: the SNMP data is one-dimensional. FB's system allowed aggregating and querying by many dimensions. This is really useful when there are thousands of machines talking to each other over many network paths.
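
To make the grouping idea concrete, here's a hypothetical Python sketch (record field names like "dst_host" and "dst_rack" are made up): given sampled retransmit records tagged with many dimensions, you aggregate by whichever dimension you're suspicious of and look for outliers.

    from collections import Counter

    def top_offenders(records, dimension, n=10):
        # records: iterable of dicts, one per sampled retransmit,
        # each carrying src/dst host, rack, port, service, etc.
        counts = Counter(r[dimension] for r in records)
        return counts.most_common(n)

    # Group the same records a few different ways while debugging, e.g.:
    #   top_offenders(records, "dst_host")  -> one receiver with a bad cable
    #   top_offenders(records, "dst_rack")  -> a rack behind a flaky uplink
    #   top_offenders(records, "src_host")  -> one noisy sender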


you're comparing apples and pumpkins.

SNMP is a query mechanism, while eBPF is the sampling mechanism.


Aha! I was too removed from the infrastructure then. All I knew was that SNMP metrics showed up in our telemetry system, and we got a set of standard metrics to look into. Our platform team took care of having a sampling agent in place. I didn't know the protocol was about querying rather than sampling.


eBPF can give detailed stats and TCP state information on a per-connection basis (flow), which is much more powerful than the aggregated TCP stats you can grab with SNMP.


Thanks for all the answers. I learned so much!


eBPF, similarly to DTrace, allows you to view the internals of your OS. With SNMP you have a fixed number of metrics that you can view, while with eBPF you can create new ones. You could implement an SNMP daemon that uses eBPF to get the data, and perhaps that will happen in the future, if it hasn't already.


Datadog does something similar with the network performance monitoring option in their agent, which also uses eBPF. As a network engineer I've wanted tools like this for many years (say goodbye to tcpdump, in most cases). The data it produces is incredibly useful in a cloud environment, where you typically do not get insights like this from the native toolset.

https://www.datadoghq.com/blog/network-performance-monitorin...


The history is that Datadog hired the Boundary people, who added this feature about 2 years ago.

(Boundary was a small SF startup that created a network flow product. Netflix evaluated it for use on Cassandra clusters. BMC acquired Boundary in 2015.)

https://www.crunchbase.com/organization/boundary


We had a poor man's version of this in 2006 at Backcountry.com. We ran OpenBSD firewalls on the edge and used pfflowd to translate pfsync messages (the protocol used to synchronize two PF firewalls in an HA configuration) into NetFlow datagrams, which we could then monitor in real time with top(1)-like tools. It was awesome. I miss having that kind of visibility.


NetFlow! My first "big data" project was collecting NetFlow data from hundreds of routers in an internet backbone and starting to learn where traffic came from and whether it was worth setting up a peering agreement with them.

It stopped working the day Google activated their proxies on mobile networks. I saw Google traffic increase 1-2% week-over-week for several months. I quit the job when it was >30% of the backbone traffic, so I don't know what happened in the end...


> It stopped working the day Google activated their proxies on mobile networks.

What is this referring to? You've piqued my interest.


When you opened a web page using Chrome on your mobile phone, the request could be sent to Google directly.

They had a lot of caching proxies and it made browsing faster.

For us the problem was that everybody was using that cache instead of going directly to the end host. It looked like Google traffic was growing _a_lot_ week over week, and we weren't able to tell where people were actually going in order to optimize the network.


Half of this makes no sense to me whatsoever, but it's fascinating nonetheless! Seems like a huge challenge. And if anyone would care to explain to me what "capacity posture" means in the following sentence:

"Without having network visibility, it’s difficult to improve our reliability, security and capacity posture."

I'd be one happy camper :)


Network bandwidth utilization doesn't degrade gracefully. When network saturation reaches 80% or more, performance can degrade drastically and usage can fall off a cliff. It's important to monitor network bandwidth utilization and raise alarms at lower thresholds so that you can add bandwidth capacity in real time, divert traffic, or rate limit to slow downloads. That relates to the reliability of the service and real-time bandwidth capacity adjustment. For long-term planning, it's important to collect statistics on peak/average/median network usage for capital expenditure purposes: spending money on buying more bandwidth, adding switches, routers, and servers, or building more data centers. That deals with long-term capacity planning.
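
As a toy illustration of the "alarm well below saturation" point (the thresholds and function names here are made up, not anyone's real policy):

    # Alert well before link saturation, since behaviour degrades sharply
    # past roughly 80% utilization. Threshold values are illustrative only.
    WARN_UTILIZATION = 0.60      # start planning: shift traffic, order capacity
    CRIT_UTILIZATION = 0.80      # act now: divert traffic or rate limit

    def check_link(bits_per_sec, link_capacity_bps, alert):
        utilization = bits_per_sec / link_capacity_bps
        if utilization >= CRIT_UTILIZATION:
            alert("critical", utilization)
        elif utilization >= WARN_UTILIZATION:
            alert("warning", utilization)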


I'm also guessing, but I think it's something like their plan for how much capacity (i.e. network bandwidth) to have available at particular times and places, and their strategy for updating the plan.


I'd agree with your translation...

Estimating bandwidth requirements is hard. Many use an educated guess plus over-provisioning.


The three words "reliability, security [and] capacity" are modifiers, in list form, of the noun "posture" that follows. Posture here simply means "ability, position(ing), readiness, attributes, quality", etc.

From Webster:

> Posture (n.) - state or condition at a given time especially with respect to capability in particular circumstances

To be clear, you could re-write it more verbosely:

> Without having network visibility, it’s difficult to improve our reliability posture, security posture, and capacity posture.

While not exactly the same meaning (the original sentence groups them together as a single, nuclear idea), maybe it helps parse the grammar a bit better.


I've heard "security posture" before, which relates to how your organisation is currently set up to handle InfoSec events and activities.

https://www.balbix.com/insights/what-is-cyber-security-postu...

Extrapolating from that, I would say that capacity posture is planning around expected capacity. As an example, perhaps they say a given data centre has to be able to handle twice the peak bandwidth.


The article is a good start but could use a lot of editing.

I'd look at the eBPF books and articles by Gregg.


I don't know if they're allowed to post images because of the information that might be contained within, but this post is very hard to follow without some sort of visualization of what they generate with all this.


The canonical answer to this question is ntop.


Is your question along the lines of what anyone would do with netflow data at all? Check out COTS netflow products like Kentik to see what they do.

https://www.kentik.com/product/core/


Netflix's traffic patterns are definitely unlike those of almost every other large network, but I do wonder how much they're missing by sampling TCP only.

TFA mentions transport of (aggregated) flow data back to base over a choice of protocols, including UDP, which makes sense -- you don't want your monitoring data affecting your business data when you get close to redlining. (You'd expect you'd have enough forensic data leading up to that point to make some informed decisions.)

QUIC runs over UDP, and I can imagine that growing rapidly for most corporate & public-facing networks.


Of course Brendan Gregg, the god of eBPF, tracing, and all things profiling, has his finger in the pie.


I've got his Systems Performance book, and it's really fantastic. This guy is incredible. I really need to pick up his BPF one.

He's got a good blog too [1]!

[1] http://www.brendangregg.com/blog/


The dude is my tech hero, just awe inspiring.


Wondering if this is on their Linux servers only, or does it also work on FreeBSD on their edge appliances?


For FreeBSD they probably use DTrace to get that.


It uses eBPF, so just the Linux boxes.


I got a weird sense of déjà vu reading this, searched around, and realized the first half is copied and pasted from another year-old blog post. Last year they were "at Hyper Scale" and this year their flow logs are only "at scale", so I guess they're shrinking.

https://netflixtechblog.com/hyper-scale-vpc-flow-logs-enrich...


The older article is about VPC flow logs; in the new one they are using eBPF on instances/containers to gather flow information.



