Rust at Cloudflare

steven_pack · on May 15, 2018

Here are the videos and slides... (from the May 2018 Bay Area Rust Meetup, which was hosted at Cloudflare)

Rust at Cloudflare Video: https://watch.cloudflarestream.com/4d5d6da3c6217c24f4e44564e... Slides: https://docs.google.com/presentation/d/1ERVTXZbYBMZf-9Zk3YsW...

TrustDNS Video: https://watch.cloudflarestream.com/e14e0d2335ffb94ae505289f5... Slides: https://drive.google.com/drive/folders/1gQn9Uuj34TxS4cfUoW1N...

Rust Perf with lolbench Video: https://watch.cloudflarestream.com/5774ee39218ed516521adb74c... Slides: https://docs.google.com/presentation/d/1BEI7zXhEiCwEd93-UUpW...

edit: context about the event edit2: formatting

manigandham · on May 15, 2018

Thanks for the info. One thing: I know you guys have your own Stream video service but YouTube is still much nicer for viewing...

anp · on May 15, 2018

The official (I think?) youtube channel has them: https://www.youtube.com/channel/UCaYhcUwRBNscFNUKTjgPFiA.

steveklabnik · on May 16, 2018

It is.

steven_pack · on May 16, 2018

My pleasure! Hey, what do you like more about YouTube for watching the videos?

manigandham · on May 17, 2018

I have a 300 Mbps connection in Los Angeles and still had constant buffering on your video links. YT's player is also more responsive, has speed controls, and I can add it to my other playlists, etc.

kornish · on May 15, 2018

I like the quote from Slide 10:

> Why Rust (for CloudFlare)

> ...

> - Safe (we had a bug once...)

> ...

Funny because according to Algolia, the bug in question is the 7th-most-upvoted HN post of all time, clocking in at about 1k comments.

Great to see Rust gaining industry adoption.

edit: to clarify, I like that Cloudflare can look back at a bug in their C code, chuckle about it, and then start to move on to something safer. This is the bug in question: https://news.ycombinator.com/item?id=13718752

jb1991 · on May 15, 2018

Could you provide a link to it?

atombender · on May 15, 2018

I suspect it's this bug: https://blog.cloudflare.com/incident-report-on-memory-leak-c...

HN: https://news.ycombinator.com/item?id=13718752

jb1991 · on May 15, 2018

I can't tell if the OP is suggesting the bug they had was due to Rust, or that they adopted Rust as a safe language because they once had a bug?

atombender · on May 15, 2018

The latter. I think it's a sheepish admission that the bug (caused by unsafe C code) is a reason to prefer Rust's safety, which should help them prevent another one like it.

stavros · on May 16, 2018

It is, the person presenting says so in the video.

serioussecurity · on May 16, 2018

The Cloudflare bug was a memory safety issue in C code, and preventing that is basically the whole point of Rust.

MikkoFinell · on May 16, 2018

Hah, could you imagine, someone saying they found a bug in a piece of Rust code? I haven't read the HN terms of service, but I'm guessing you'd get banned pretty quick for something as egregious as that.

sligor · on May 15, 2018

The blog post says that this bug was inside code written in C and Ragel (a parser generator). Rust does not seem to be involved in this bug.

buro9 · on May 16, 2018

Rust was not the cause of the bug. It was a C bug and was exposed during the process of deleting the code in question (adding the new code to replace it actually exposed the old bug).

It has given us a very keen awareness of just how bad such bugs can be, and hence "we had a bug once" might be considered the soft way of saying "no more effing C". Of course we'll still have C and C++ as we're heavily invested, but if there are safer alternatives that we can use those will definitely be considered first.

thurston · on May 19, 2018

What rarely gets discussed in this case is that old, working code was modified in a critical way in order to accommodate new code when that didn't need to be done at all. It was actually a failure in the software development process.

tptacek · on May 15, 2018

That's the point; Rust operates in the places they needed C before, and C is unsafe.

pjmlp · on May 16, 2018

There is a huge difference between being able to track down unsafe statements and a language where 100% of the source code is unsafe, given the 200+ cases of UB, implicit conversions and lack of data integrity validations.

Sure, logic errors can always happen, but moving away from C would already take a portion of memory corruption issues off the table.

Unless you are asserting that Mesa/Cedar, NEWP, Oberon, Oberon-07, Active Oberon, Modula-2, Modula-2+, Modula-3, Object Pascal, Concurrent Pascal, Component Pascal, Basic, D, Ada, SPARK, Sing#, Midori, BLISS, PL/I, PL/S, PL/8, PL/M, Swift, HPC# are all as unsafe as C.

masklinn · on May 16, 2018

The slide is a bit unclear; the whole deck is terse and clearly not intended to be consumed independently from the presentation. So "Safe (we had a bug once…)" should be interpreted as "safe (we kind-of had a not-very-small bug in our C code once)", not "Safe (we've only had one Rust bug)".

steven_pack · on May 16, 2018

Indeed. It was very much just prepared as some talking points for the folks in the room. Makes more sense with the video. Note to self: Stuff about Rust always makes it to HN. :)

enitihas · on May 15, 2018

I think this one https://news.ycombinator.com/item?id=13718752

buro9 · on May 15, 2018

I'm the Engineering Manager @ Cloudflare for the "Wireshark but at the Edge" thing. Happy to answer any questions, though I'll be clear... this isn't something you can play with yet and we're in early days with this feature.

The goal is "customers should be able to create filters that target traffic passing through our system and then do things" so this is definitely a thing we wish to give to customers rather than an internal toy.

Rapzid · on May 15, 2018

I wouldn't mind hearing some details on perf. I would imagine a lot of filtering can happen at the true edge via BPF, kernel mods, or otherwise zerocopy mechanisms.

Looks like Linux added in-kernel TLS termination; sounds like even layer 7 inspection could all happen in kernel space as well...

buro9 · on May 15, 2018

Ah, for this I'll defer until we have more data.

At the moment we have a simple(ish) implementation running at the edge purely within Nginx and a project underway to see how it behaves and to gather metrics. That environment gives us a good place in which to control testing, and we can easily compare it to other parts of our code where we already do more trivial request matching (our Page Rules feature).

It'll be a couple of months before we're satisfied we know enough to say whether we'll keep it simple or will seek to make it more specialised to our environment. We haven't yet determined how far we're going to go with this... could it replace our WAF? Is it cheap enough for the DDoS layer? If we do go down those paths then it's obvious that yes we'd move the filtering to other places.

jeffdavis · on May 16, 2018

If I understand correctly, this is basically a new project/initiative, so the results aren't in yet, right?

Do you have in mind some success criteria for the choice of rust that you can evaluate maybe in 6 months?

[I'm not looking for scientific rigor here, just a "this is what we expected from rust" and you can look later and see if those expectations matched reality.]

buro9 · on May 16, 2018

Correct on the first part.

We chose Rust not just for the expected speed and safety but also because we needed to create a shared object that could provide the API (written in Go) with exactly the same parsing and matching engine as our edge (initially Nginx for web traffic written in C and Lua).

The key was to produce consistent behaviour in the way we work with filter expressions such that there is no difference in behaviour that can be leveraged by malicious users later. i.e. if a customer used a filter to create a security rule and that filter behaved even slightly differently later then that would be a security incident in the making and we would have failed our customer.

Rust stood out for being a safer language than C (we had that bug) that could produce a shared object we can use in our API (unlike Lua which does not make this easy), and didn't come with the garbage collection.
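A minimal sketch of the shared-object idea, assuming the crate is built with `crate-type = ["cdylib"]`: a Rust function exported over the C ABI can be loaded by a Go service (through cgo) and by an Nginx/Lua host (through LuaJIT's FFI), so both sides run the same validation code. The function name `filter_validate`, its return codes, the field name `http.host`, and the toy `field == value` check are hypothetical stand-ins for illustration, not Cloudflare's actual API.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int};

/// Hypothetical stand-in for the real parser: accepts only
/// expressions of the form `<field> == <value>`.
fn is_valid_filter(expr: &str) -> bool {
    let parts: Vec<&str> = expr.split("==").collect();
    parts.len() == 2 && parts.iter().all(|p| !p.trim().is_empty())
}

/// C ABI entry point shared by every host language.
/// Returns 1 for a valid filter, 0 for an invalid one,
/// and -1 for a null pointer or non-UTF-8 input.
///
/// # Safety
/// `expr` must be null or point to a NUL-terminated string.
#[no_mangle]
pub unsafe extern "C" fn filter_validate(expr: *const c_char) -> c_int {
    if expr.is_null() {
        return -1;
    }
    // Borrow the caller's bytes without taking ownership of them.
    let c_str = unsafe { CStr::from_ptr(expr) };
    match c_str.to_str() {
        Ok(s) => is_valid_filter(s) as c_int,
        Err(_) => -1,
    }
}
```

Because every caller goes through the same exported symbol, an expression such as `http.host == example.com` is accepted or rejected identically in the API and at the edge, which is the consistency property described above.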

We already have some other small bits of Rust in part of our dev pipeline so were comfortable selecting it, but this is the first time we would be shipping Rust globally and running it at the edge.

Our main expectation and hope is performance.

The matching engine that applies the filters is on the hot path for handling requests: it runs early in the request lifecycle, and all traffic on a zone (customer domain) would need to be evaluated to see whether it matches any existing filter. So the numbers we are looking to gather relate to the time it takes to execute expressions similar to what we have already, as well as more complex expressions, and what this does to CPU load - those two things will dictate how this new project affects the throughput.
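As a rough illustration of the first measurement (time spent executing expressions), here is a sketch using only the standard library. The helper `time_matches` and the idea of representing a request as a plain string are assumptions made for the example; a real measurement would use a proper benchmark harness and realistic traffic, and CPU load would need to be observed separately.

```rust
use std::time::{Duration, Instant};

/// Runs `predicate` once per request and returns the number of
/// matches together with the total wall-clock time taken.
pub fn time_matches<F>(requests: &[&str], predicate: F) -> (usize, Duration)
where
    F: Fn(&str) -> bool,
{
    // `Instant` is a monotonic clock, so the elapsed time cannot go backwards.
    let start = Instant::now();
    let mut hits = 0;
    for &request in requests {
        if predicate(request) {
            hits += 1;
        }
    }
    (hits, start.elapsed())
}
```

Dividing the returned duration by the number of requests gives a crude per-request cost that can be compared between a simple expression and a more complex one.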

The hope is that a more powerful matching engine that is fast and doesn't increase CPU load will allow us to remove code from our system whilst providing customers with fine grained control of all features. Today a lot of features implement their own matching on paths, headers, etc... and these are not always efficient and are implemented inconsistently.

Performance is therefore what we are focused on measuring and improving, and we hope that if the numbers are good Rust will provide us the opportunity to remove other code and increase throughput whilst not giving us fear that such a change has opened us up to other more fundamental risks.

zippitydoodah68 · on May 16, 2018

Once bitten, always shy, eh? Not saying you are overreacting because I remember the bug and it was bad news. I use C pretty much exclusively (and have for 20 years) but I would rather see new development for handling user input and frequent memory (re)allocation done in Rust.

As a daily systems language Rust is not quite there (for me) yet.

mooman219 · on May 16, 2018

Rust's allocation API is actually making great progress! The RFC process really speeds these things along so that their merits can be tested before being stabilized (through the unstable API). https://github.com/rust-lang/rfcs/pull/1398

wyldfire · on May 15, 2018

Do you have NICs/FPGAs that this code could target or would those filters run on a CPU?

buro9 · on May 15, 2018

CPU presently.

The filter expressions are based on the Wireshark Display Filters https://www.wireshark.org/docs/wsug_html_chunked/ChWorkBuild... and we support everything except the slice operator.

Rust handles the parsing, validation, AST creation, etc. That AST can then be applied to a trait table similar to the Wireshark implementation but without the necessity of a pcap step.
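To make the "AST applied to request fields" step concrete, here is a small sketch of evaluating an already-parsed filter against a request. The `Expr` variants, the `evaluate` function, and the use of a string map for request fields are assumptions for illustration; the parser is omitted, and the real engine's types are not described in this thread.

```rust
use std::collections::HashMap;

/// A tiny, hypothetical filter AST in the spirit of display filters.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Eq(String, String),       // field == value
    Contains(String, String), // field contains value
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// Evaluates the AST against the fields of one request.
/// A comparison on a field the request does not have is false.
pub fn evaluate(expr: &Expr, fields: &HashMap<String, String>) -> bool {
    match expr {
        Expr::Eq(field, value) => fields.get(field).map_or(false, |v| v == value),
        Expr::Contains(field, value) => fields
            .get(field)
            .map_or(false, |v| v.contains(value.as_str())),
        Expr::And(a, b) => evaluate(a, fields) && evaluate(b, fields),
        Expr::Or(a, b) => evaluate(a, fields) || evaluate(b, fields),
        Expr::Not(inner) => !evaluate(inner, fields),
    }
}
```

Since the fields come straight from the live request, nothing like a capture file is needed before the filter runs, which matches the "without the necessity of a pcap step" point.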

I hope that the filter becomes an invariant form of filter against traffic and that once we've got the AST we can apply that filter to different places. Initially just to itself within a Rust matching engine at the edge, but if you have columns on a DB why not ask for a SQL expression derived from the filter expression and then filter a ClickHouse store using the same filter, and likewise as per your suggestion if we can take some of the expressions that aren't L7 why can't we have these run in the network card, etc.
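The "derive a SQL expression from the filter" idea could be sketched like this. Everything named here is hypothetical: the `Filter` type, the `column_for` mapping from filter fields to column names, and the `?` placeholder style (placeholder syntax depends on the database client). Values are returned as bound parameters rather than spliced into the SQL text.

```rust
/// A tiny, hypothetical filter AST.
#[derive(Debug, Clone, PartialEq)]
pub enum Filter {
    Eq(String, String),
    And(Box<Filter>, Box<Filter>),
    Or(Box<Filter>, Box<Filter>),
    Not(Box<Filter>),
}

/// Hypothetical mapping from filter field names to column names.
/// Unknown fields are rejected rather than passed through.
fn column_for(field: &str) -> Option<&'static str> {
    match field {
        "http.host" => Some("host"),
        "ip.src" => Some("client_ip"),
        _ => None,
    }
}

/// Renders the filter as a parameterised WHERE-clause fragment,
/// appending each compared value to `params` in placeholder order.
pub fn to_sql(filter: &Filter, params: &mut Vec<String>) -> Result<String, String> {
    match filter {
        Filter::Eq(field, value) => {
            let column = column_for(field)
                .ok_or_else(|| format!("unknown field: {}", field))?;
            params.push(value.clone());
            Ok(format!("{} = ?", column))
        }
        Filter::And(a, b) => {
            let left = to_sql(a, params)?;
            let right = to_sql(b, params)?;
            Ok(format!("({} AND {})", left, right))
        }
        Filter::Or(a, b) => {
            let left = to_sql(a, params)?;
            let right = to_sql(b, params)?;
            Ok(format!("({} OR {})", left, right))
        }
        Filter::Not(inner) => Ok(format!("(NOT {})", to_sql(inner, params)?)),
    }
}
```

The same AST that drives matching at the edge would then also drive the query, so a log search and a live rule written from one expression select the same traffic.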

Right now... just CPU as it is early days. But eventually we can look at all places we match traffic and consider that a contender for the same filter to be applied there.

squiguy7 · on May 15, 2018

It's cool to see CloudFlare using Rust but I wish they went into a little more detail on the slides. I hope to see follow up blog posts or some of the code open sourced soon.

steven_pack · on May 15, 2018

That's what happens when Product Managers present the things the engineering team is doing. :) The bigger projects are in London. I'm going to try to get the engineers out to SF next time we host and go a bit deeper. The EM is on this thread answering questions also.

squiguy7 · on May 15, 2018

Ok, great. Thanks for sharing the information as it is! :)

Already__Taken · on May 15, 2018

2 videos up on the rust channel from this and hopefully more in future - https://www.youtube.com/channel/UCaYhcUwRBNscFNUKTjgPFiA/vid...

jiveturkey · on May 15, 2018

just me, or anyone else find it odd that this is a google docs deck? don't see those published much (usu. pdf and of course ppt).

anyway, another example of soft recruiting done right by cloudflare!

tormeh · on May 15, 2018

Can anyone offer some context? Redox is great, but why is it a pro for cloudflare?

steven_pack · on May 15, 2018

That was purely on the list of reasons why I personally got interested in Rust and I support redox on patreon. I love the idea of an OS built on Rust from the ground up as a way to eliminate a category of security bugs.

The second slide is about when Cloudflare chooses Rust as a language.

jontro · on May 15, 2018

It's listed under the "Why Rust (for me)" slide, i.e. it's the author's own reasons/pros.

weavie · on May 15, 2018

Does that mean it is actually being used in production, or is it just a potential? If so, I hadn't realised it had come on that far!

steven_pack · on May 15, 2018

Rust is being used in production at Cloudflare. Redox is not used here in any capacity.

brian_herman · on May 15, 2018

TrustDNS awesome for 1.1.1.1

bluejekyll · on May 15, 2018

We're putting the 0.9 release of the Resolver together now, hopefully for release in the next few days. Lots of good things, including DNS-over-TLS with Cloudflare and Quad9 configs available.

cortesoft · on May 15, 2018

Hmmm, they say only 10Tbps capacity, but they serve 10% of the internet's HTTP traffic? That doesn't seem right.

lossolo · on May 15, 2018

Only? 10 Tbps is a HUGE capacity. The biggest IX (Internet Exchange), AMS-IX, has 5 Tbps at peak. Total internet traffic in 2016 was around 160 Tbps. So it seems right.

cortesoft · on May 15, 2018

disclaimer: I work for a CDN, but am not representing them with this comment

Right, but this is talking about GLOBAL capacity, not capacity at a single datacenter. The CDN I work for has over 49 Tbps, and we wouldn't claim to be doing 10% of all HTTP traffic:

https://images.verizondigitalmedia.com/2015/12/VDMS_NetworkM...

Plus, capacity is always going to be greater than actual throughput, both for reliability reasons and traffic patterns (i.e. you need enough capacity for your peak traffic in a datacenter, not the average)

I really doubt the 10% claim.

lossolo · on May 16, 2018

Of course capacity is always greater, on average about twice as great, since that is the kind of deal you get from T1 providers. But it depends on how much commitment you have, and on what kind of deals and traffic patterns you have. CF has 10 Tbps capacity but probably a lot less throughput; they need high capacity because they are DDoSed a lot.

I wasn't taking those numbers from nowhere. Read this:

https://www.cisco.com/c/en/us/solutions/collateral/service-p...

I think people are misinterpreting what they claim. It's not about 10% of the internet's HTTP traffic by throughput, it's about 10% of all HTTP requests.

noselasd · on May 15, 2018

I read it as 120 datacenters, each with 10Tbps capacity

cortesoft · on May 15, 2018

Yeah, that isn’t possible.

majke · on May 15, 2018

The "10%" is always going to be a metric of something. If you count bytes in the internet, to some approximation it's 100% netflix + youtube (with some error :P).

Here, the metric probably means "unique domain names" used on the internet.

steven_pack · on May 15, 2018

The slide says 10% of requests. HTTP[S] requests, that is, not packets.

cortesoft · on May 15, 2018

And where are they getting the total requests on the internet number from?

bogomipz · on May 15, 2018

Exactly. Cloudflare throws these unsubstantiated numbers around all the time. It's just marketing puffery.

theweb1 · on May 16, 2018

http traffic from cloudflare is cutting edges.

qiqing · on May 16, 2018

Also, Cloudflare is hiring.