Updated with article about BPF: https://lwn.net/Articles/747551/
Really good talk by the Cilium folks that explains these concepts: https://m.youtube.com/watch?v=ilKlmTDdFgk
BPF is hot today. Tomorrow it will be sooooo openflow.
If I understand this correctly, eBPF is a fairly general-purpose bytecode format that can be executed inside the kernel. Programs are verified for safety before they're loaded, and there really is a JIT compiler running in kernel space, so it's pretty fast. It was originally used for packet filtering, but it's now used at various decision points in networking, and in tracing as well.
But it could potentially go much further. Anywhere the kernel currently gets configured with data-like configuration could be replaced or augmented with an eBPF program, right? For example, instead of setting an ACL on a directory, you could set an eBPF program which would run for each attempted access and decide whether to allow it, as well as logging or doing other things. eBPF programs could guard the interfaces between a container and its host, allowing more flexible isolation. An eBPF program could respond to every system call a process makes, allowing behaviour like OpenBSD's pledge, only much more sophisticated.
With the right access control model (implemented in eBPF!), normal userland programs could install eBPF programs for resources they control (sockets, files, etc), potentially shifting a significant fraction of their processing into kernel mode, improving performance, reducing system call overhead, and allowing safe access to kernel facilities that are currently inaccessible. Imagine implementing a garbage collector in userspace, but being able to configure your slice of the virtual memory system in the kernel using an eBPF program.
I don't know if this will happen. But a pervasively eBPF world would be very different, and very interesting. We'll have all sorts of fun. We'll get tools we never imagined. And we'll get pwned by black hats harder than ever before.
That is actually one of the oldest and most widespread uses of BPF :) https://www.kernel.org/doc/Documentation/prctl/seccomp_filte...
If you're interested in this notion you might be interested in:
And, to an extent, Taos:
KubeCon Copenhagen just wrapped up, and I'm still working through watching all the talks, but here's a video on eBPF applied to tracing:
RIP to people who were working on nftables.
It seems that BPF is taking the nftables API approach, which seemed to be the core value-add (a reimagined iptables API), and actually delivering on the performance benefits as well.
I hold people who work on the kernel in pretty high regard, and I expect them to be pragmatic about it (the whole "strong opinions, loosely held" thing). If BPF doesn't introduce too many possible security vulnerabilities (that's about the only issue with it I can see), it might represent the best of both worlds: a new API plus improved performance.
It seems all this could just be done with traditional networking tech: microservice endpoints having real IP addresses, and normal application-level auth and load-balancing methods being used when conversing with internal services.
In a Calico setup, every container (or VM - they use the term "workload" to abstract over the two) has an IP address, and talks to other containers in the usual way. AIUI, Calico does a couple of things to make that work at Cloud Scale (tm): it pushes firewall rules into the kernel, to make sure each container can only communicate with the other containers it's supposed to, and it propagates routes around, so each host knows where to send packets destined for containers on other hosts.
The routing bit is important because Calico is designed to run as a flat IP space on top of a non-flat ethernet space, i.e. one where the access switches are connected by routers rather than by more switches. That's useful because scaling a flat ethernet network up to a huge size is apparently hard (network engineers start telling horror stories about Spanning Tree Protocol etc.).
Calico still has more moving parts than i'm personally comfortable with, but it seems broadly sensible.
Different orgs work at different scales and in different styles; some orgs are producing monoliths, others are producing "fat services" or "microservices" (without going too much into what that might mean).
Some orgs have template repos or base libraries (big difference!) that they use to produce services. Others just have standards and you can do it how you like but plz conform to the standard (have /healthz, use statsd or export for prometheus, etc etc).
Also, how do all the things auth to each other? Do you TLS all the things or do you have api keys and secrets n-way between all the things? Does stuff trust each other based on IP? Etc.
There are lots of "illities", but particularly various kinds of monitoring, metrics, circuit breakers, access control and so on that you can either bake in to each service independently or implement via shared code somehow or other.
Notice that the above generally implies some degree of language homogenisation (usually a sane thing to have when you take into account other illities like artifact repos, dependency analysis, coding style guides, static analysis tooling etc - adopting a new language is not "easy" at scale) or else you are rewriting all these things a lot.
Anyway, one option in this whole rainbow of possibilities is that you pull some of this out of the service itself and push it into a network-layer wrapper somehow.
And that is how you end up with a service mesh...
Broadly speaking, my estimation is that if your company doesn't have multiple buildings with lots of people who have never met each other, you probably don't need a service mesh. And maybe not even then.
At this point the solutions seem to be either introducing a service mesh or copy-pasting the shared code everywhere.
The issue with microservices at scale, vis-a-vis traditional networking, is linear scaling of routing tables, very short-lived IP connections, and long update times for massive service maps.
I think networking is complex in general, but the promise of this kind of integration (from my limited understanding) is a clear reflection of services at the routing level, instead of a hodgepodge of ports and IP addresses that provide little context or meaning. Having lost a few half-days to learning iptables nuances in our Kubernetes cluster, I can see the benefit of an integrated stack with less impedance between architecture and routing.
Generally we don't eliminate complexity, we just move it from one place to another. I imagine this is a case where a more complex service implementation could provide a simpler user experience.
From an ops point of view, this creates a nearly opaque wall you get to run up against when troubleshooting issues; a completely unnecessary wall.
Edit: Just came across this https://cilium.io/blog/istio/ :)
There is a native integration with Istio; you can read about it, with tested steps, at the following URL: http://docs.cilium.io/en/doc-1.0/gettingstarted/istio/
Last week was KubeCon Europe; Thomas Graf, founder of the project, presented some new improvements made to the Envoy and TLS integration. Here are the slides:
There are a lot of fast userspace networking projects that bypass the kernel precisely to be faster. Which approach is better is up for debate but the kernel is definitely not faster in all cases.
More concretely, most hardware devices are relatively easy to interface with: for example, you may simply set up a region of DMA memory, poke the hardware device with the address of this memory, and write into it, then read results back out. A NIC is a good example of such a model. This can be done with any block of memory, except normally the kernel is the only thing that can talk to the NIC (to tell it where to write to/read from).
So the main thing you need to do is pass control of the hardware to a userspace process. For the NIC/DMA example, the easiest way is to just allocate some memory, make sure it's non-swappable, and then get its physical address. You then just need a small driver to connect userspace with the hardware -- it must give you a way to tell the hardware where to read/write. Maybe it exposes a sysfs-based file with normal unix permissions (a common method). Writing an address into this file is equivalent to telling the hardware to "read here, and write there". Now you can write to the memory you allocated (in userspace) to control the NIC.
At this point, the kernel is more-or-less out of the loop completely. Of course, this is the easy part, since now you must write the rest of the hardware driver. :)
There are plenty of projects out there to help get the kernel out of the critical path of a networking application. DPDK and PF_RING are the two I hear about most but here is a blog from cloudflare outlining a number of others: https://blog.cloudflare.com/kernel-bypass/
Kernel code doesn't run faster than userspace code; the CPU executes both the same way. The wins come from avoiding mode switches and data copies, not from the code itself executing faster.