Rust at Cloudflare
Rust Perf with lolbench
edit: context about the event
> Why Rust (for CloudFlare)
> - Safe (we had a bug once...)
Funny because according to Algolia, the bug in question is the 7th-most-upvoted HN post of all time, clocking in at about 1k comments.
Great to see Rust gaining industry adoption.
edit: to clarify, I like that Cloudflare can look back at a bug in their C code, chuckle about it, and then start to move on to something safer. This is the bug in question: https://news.ycombinator.com/item?id=13718752
It has given us a very keen awareness of just how bad such bugs can be, and hence "we had a bug once" might be considered the soft way of saying "no more effing C". Of course we'll still have C and C++ as we're heavily invested, but if there are safer alternatives that we can use those will definitely be considered first.
Sure, logic errors can always happen, but moving away from C would already take a whole class of memory corruption issues off the table.
Unless you are asserting that Mesa/Cedar, NEWP, Oberon, Oberon-07, Active Oberon, Modula-2, Modula-2+, Modula-3, Object Pascal, Concurrent Pascal, Component Pascal, Basic, D, Ada, SPARK, Sing#, Midori, BLISS, PL/I, PL/S, PL/8, PL/M, Swift, and HPC# are all as unsafe as C.
The goal is "customers should be able to create filters that target traffic passing through our system and then do things" so this is definitely a thing we wish to give to customers rather than an internal toy.
Looks like Linux added in-kernel TLS termination; sounds like even layer-7 inspection could happen entirely in kernel space as well...
At the moment we have a simple(ish) implementation running at the edge purely within Nginx, and a project underway to see how it behaves and to gather metrics. That environment gives us a good place in which to control testing, and we can easily compare it to other parts of our code where we already do more trivial request matching (our Page Rules feature).
It'll be a couple of months before we're satisfied we know enough to say whether we'll keep it simple or will seek to make it more specialised to our environment. We haven't yet determined how far we're going to go with this... could it replace our WAF? Is it cheap enough for the DDoS layer? If we do go down those paths then it's obvious that yes we'd move the filtering to other places.
Do you have in mind some success criteria for the choice of rust that you can evaluate maybe in 6 months?
[I'm not looking for scientific rigor here, just a "this is what we expected from rust" and you can look later and see if those expectations matched reality.]
We chose Rust not just for the expected speed and safety but also because we needed to create a shared object that could provide the API (written in Go) with exactly the same parsing and matching engine as our edge (initially Nginx for web traffic written in C and Lua).
The key was to produce consistent behaviour in the way we work with filter expressions, such that there is no difference in behaviour that can be leveraged by malicious users later. I.e., if a customer used a filter to create a security rule and that filter behaved even slightly differently later, that would be a security incident in the making and we would have failed our customer.
Rust stood out for being a safer language than C (we had that bug) that could produce a shared object we can use in our API (unlike Lua, which does not make this easy), and it didn't come with garbage collection.
We already have some other small bits of Rust in part of our dev pipeline so were comfortable selecting it, but this is the first time we would be shipping Rust globally and running it at the edge.
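To make the "one shared object, two consumers" idea above concrete, here is a minimal sketch of how a Rust matching engine can be built as a `cdylib` and exposed through a C ABI, so the same `.so` can be loaded both by a Go API (via cgo) and by a C/Lua edge. All names here are illustrative assumptions, not Cloudflare's actual API, and the substring check stands in for the real parse-and-match engine.

```rust
// Hypothetical sketch: one matching engine behind a C ABI, linkable from
// both Go (cgo) and C/Lua. Build with crate-type = ["cdylib"] in Cargo.toml.
use std::ffi::CStr;
use std::os::raw::c_char;

/// Returns 1 if `input` matches `filter`, 0 if not, -1 on invalid input.
/// The function name and the containment check are placeholders for the
/// real parsing/matching engine.
#[no_mangle]
pub extern "C" fn filter_matches(filter: *const c_char, input: *const c_char) -> i32 {
    if filter.is_null() || input.is_null() {
        return -1;
    }
    // SAFETY: the caller promises both pointers are valid NUL-terminated strings.
    let (filter, input) = unsafe {
        match (CStr::from_ptr(filter).to_str(), CStr::from_ptr(input).to_str()) {
            (Ok(f), Ok(i)) => (f, i),
            _ => return -1,
        }
    };
    // Stand-in for the real engine: plain substring containment.
    if input.contains(filter) { 1 } else { 0 }
}
```

On the Go side this would be declared through cgo as `int filter_matches(const char*, const char*)`, which is what guarantees both consumers run byte-for-byte identical matching logic.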
Our main expectation and hope is performance.
The matching engine that applies the filters is on the hot path for handling requests: it runs early in the request lifecycle, and all traffic on a zone (customer domain) needs to be evaluated to see whether it matches any existing filter. So the numbers we are looking to gather relate to the time it takes to execute expressions similar to what we have already, as well as more complex expressions, and what this does to CPU load. Those two things will dictate how this new project affects throughput.
The hope is that a more powerful matching engine that is fast and doesn't increase CPU load will allow us to remove code from our system whilst providing customers with fine-grained control of all features. Today a lot of features implement their own matching on paths, headers, etc., and these are not always efficient and are implemented inconsistently.
Performance is therefore what we are focused on measuring and improving, and we hope that if the numbers are good Rust will provide us the opportunity to remove other code and increase throughput whilst not giving us fear that such a change has opened us up to other more fundamental risks.
As a daily systems language Rust is not quite there (for me) yet.
The filter expressions are based on the Wireshark Display Filters https://www.wireshark.org/docs/wsug_html_chunked/ChWorkBuild... and we support everything except the slice operator.
Rust handles the parsing, validation, AST creation, etc. That AST can then be applied to a trait table similar to the Wireshark implementation but without the necessity of a pcap step.
I hope that the filter becomes an invariant form of filter against traffic, and that once we've got the AST we can apply that filter in different places. Initially just within a Rust matching engine at the edge, but if you have columns on a DB, why not ask for a SQL expression derived from the filter expression and then filter a ClickHouse store using the same filter? Likewise, as per your suggestion, if we can take some of the expressions that aren't L7, why can't we have those run in the network card, etc.?
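The "one AST, many targets" idea above can be sketched as a small expression tree with two backends: direct evaluation against an in-memory request (the edge engine) and lowering to a SQL predicate (for something like a ClickHouse query). The field names, operators, and SQL shape here are illustrative assumptions, not Cloudflare's actual schema or engine.

```rust
// Hypothetical sketch: one filter AST, two backends — in-memory matching
// and lowering to a SQL WHERE clause. Not Cloudflare's real implementation.
use std::collections::HashMap;

enum Expr {
    Eq(String, String),          // field == "value"
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Backend 1: evaluate the filter against a request's fields at the edge.
    fn matches(&self, req: &HashMap<&str, &str>) -> bool {
        match self {
            Expr::Eq(field, value) => req.get(field.as_str()) == Some(&value.as_str()),
            Expr::And(a, b) => a.matches(req) && b.matches(req),
            Expr::Or(a, b) => a.matches(req) || b.matches(req),
        }
    }

    /// Backend 2: derive a SQL predicate from the same AST.
    /// (A real version would need proper quoting/escaping.)
    fn to_sql(&self) -> String {
        match self {
            Expr::Eq(field, value) => format!("{} = '{}'", field, value),
            Expr::And(a, b) => format!("({} AND {})", a.to_sql(), b.to_sql()),
            Expr::Or(a, b) => format!("({} OR {})", a.to_sql(), b.to_sql()),
        }
    }
}
```

Because both backends walk the same AST, a filter expression written once behaves identically whether it gates live traffic or selects rows from a log store, which is the consistency property discussed earlier in the thread.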
Right now... just CPU as it is early days. But eventually we can look at all places we match traffic and consider that a contender for the same filter to be applied there.
anyway, another example of soft recruiting done right by cloudflare!
The second slide is about when Cloudflare chooses Rust as a language.
Right, but this is talking about GLOBAL capacity, not a single datacenter. The CDN I work for has over 49 Tbps, and we wouldn't claim to be doing 10% of all HTTP traffic.
Plus, capacity is always going to be greater than actual throughput, both for reliability reasons and traffic patterns (i.e. you need enough capacity for your peak traffic in a datacenter, not the average)
I really doubt the 10% claim.
I wasn't taking those numbers from nowhere. Read this:
I think people are misinterpreting what they claim. It's not about 10% of the internet's HTTP throughput, it's about 10% of all HTTP requests.
Here, the metric probably means "unique domain names" used on the internet.