TCP retransmits were also sampled in detail. Being able to see at a glance if e.g. every host in a rack is the source of (or target for) retransmits made troubleshooting much faster.
These systems were really awesome and a good example of what could be built with relatively little effort (the teams involved were small) when you already have great components for reliable log transmission and flexible data analysis.
Hosts usually collect SNMP metrics, which include TCP retransmits and more. Do you know what SNMP was lacking compared to eBPF? All I can think of is that eBPF gives you more dimensions.
If you go deeper with the eBPF tracing, you can also determine which code path the retransmit occurred on, which may or may not be interesting.
The system Facebook built (it predated eBPF and used a custom ftrace event) produced, effectively, tuples of `(src_ip, src_port, dst_ip, dst_port, src_container, dst_container, ...)` and aggregated them over all hosts. This allowed counting retransmits by, say, receiving host. If there's one host with a bad cable that is receiving retransmits from 1000 clients, we may not see that signal in simple counters on the clients; for them it's just a tiny bump in the overall retransmit rate. But if we aggregate by receiving host, the bad guy will stand out like a sore thumb. Same thing for all the hosts in a rack, or all hosts reachable over a given router interface, or whatever else you want. One of my common workflows when facing a bump in general errors (e.g. timeouts to the cache layer) was to quickly try grouping retransmits by a few dimensions to see if one particular combination of hosts stood out as sending or receiving more retransmits.
tl;dr: the SNMP data is one-dimensional. FB's system allowed aggregating and querying by many dimensions. This is really useful when there are thousands of machines talking to each other over many network paths.
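The grouping idea above can be sketched in a few lines of Python. This is an illustration, not Facebook's actual schema or code: the field names, sample data, and the `count_by` helper are all made up for the example.

```python
# Sketch of multi-dimensional retransmit aggregation: count events
# grouped by any combination of flow-tuple fields.
# Schema and data are illustrative, not any real system's.
from collections import Counter
from typing import NamedTuple


class Retransmit(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    src_container: str
    dst_container: str


def count_by(events, *dims):
    """Group retransmit events by the named tuple fields."""
    return Counter(tuple(getattr(e, d) for d in dims) for e in events)


# 1000 clients, each retransmitting once to the same receiving host:
events = [
    Retransmit(f"10.0.{i // 256}.{i % 256}", 40000 + i,
               "10.1.0.7", 11211, f"client-{i}", "cache-7")
    for i in range(1000)
]

# Per-client counters barely register (each client shows a count of 1)...
print(count_by(events, "src_ip").most_common(1))
# ...but grouping by receiving host makes the bad machine stand out:
print(count_by(events, "dst_ip").most_common(1))  # ('10.1.0.7',): 1000
```

The point is that the raw data keeps all dimensions, so the grouping key is chosen at query time (by dst host, by rack, by router interface), rather than being baked into a one-dimensional counter at collection time.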
SNMP is a query mechanism; eBPF is the sampling mechanism.
(Boundary was a small SF startup that created a network flow product. Netflix evaluated it for use on Cassandra clusters. BMC acquired Boundary in 2015.)
It stopped working the day Google activated their proxies on mobile networks. I saw Google traffic increase 1-2% week-over-week for several months. I quit the job when it was >30% of the backbone traffic, so I don't know what happened in the end...
What is this referring to? You've piqued my interest.
They had a lot of caching proxies and it made browsing faster.
For us the problem was that everybody was using that cache instead of going directly to the end host. It looked like Google traffic was growing _a_lot_ week over week, and we weren't able to tell where people were actually going in order to optimize the network.
"Without having network visibility, it’s difficult to improve our reliability, security and capacity posture."
I'd be one happy camper :)
Estimating bandwidth requirements is hard. Many use an educated guess plus over-provisioning.
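As a back-of-envelope illustration of the educated-guess-plus-headroom approach (all numbers here are made up, not from any real network):

```python
# Hypothetical capacity estimate: observed peak, an assumed growth rate,
# and an over-provisioning factor. Every number below is illustrative.
peak_gbps = 40.0        # observed peak utilization
annual_growth = 1.35    # assumed 35% year-over-year traffic growth
headroom = 2.0          # over-provision factor (e.g. survive a failover doubling load)

provisioned = peak_gbps * annual_growth * headroom
print(f"provision for {provisioned:.0f} Gbps")  # prints "provision for 108 Gbps"
```

Flow-level visibility is what turns the "educated guess" inputs (which links actually peak, and how fast they grow) into measurements.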
> Posture (n.) - state or condition at a given time especially with respect to capability in particular circumstances
To be clear, you could re-write it more verbosely:
> Without having network visibility, it’s difficult to improve our reliability posture, security posture, and capacity posture.
While not exactly the same meaning (the original sentence groups them together as a single, atomic idea), maybe it helps parse the grammar a bit better.
Extrapolating that, I would say that capacity posture is planning around expected capacity. As an example, perhaps they say a given data centre has to handle twice the amount of bandwidth of the peak.
I'd look at the eBPF books and articles by Gregg.
TFA mentions transport of (aggregated) flow data back to base over a choice of protocols, including UDP, which makes sense -- you don't want your monitoring data affecting your business data when you get close to redlining. (You'd expect you'd have enough forensic data leading up to that point to make some informed decisions.)
QUIC runs over UDP, and I can imagine that growing rapidly for most corporate & public-facing networks.
He's got a good blog too!