If you had asked me to make a wild guess as to how Cloudflare stores internal headers and then removes them, I would have come up with some options:
- An entire separate dictionary or other data structure.
- One single header containing all internal metadata.
- All headers have a prefix, and the internal ones start with I and the external ones start with E.
- All internal headers start with “CFInt”.
I would not have come up with a scheme in which headers in a particular list are internal. (What if someone else uses this name? What if something forgets to sanitize? What if different simultaneously running programs disagree in the list? What if the Connection header names a Cloudflare-internal header? What if the set-difference algorithm is annoyingly slow?)
The web is already full of obnoxiously ambiguous in-band signaling and header naming, and I find it bizarre that a company with Cloudflare’s scale uses such a tedious and error-prone mechanism internally.
Former employee here; the interesting (horrifying?) fact is that you can set some of these internal headers (I remember a notable bug with cf-cache-status) in workers (the serverless offering) and cause all kinds of bad things to happen :)
The first large-scale piece of software I worked on was for telcos, pre-smartphone. We used internal headers to offload authentication and terminate SSL. We also had to pressure F5 to fix about half a dozen bugs in BIG-IP to do so. Bugs that should in no universe have existed in version 9 of a product.
I used to joke that F5 owed me and my coworker 3 months of salary for all the free QA we did for them.
It helps if you realize that BIG-IP 9.0 was essentially a from-scratch rewrite of BIG-IP 4.5. Among other major redesigns, the data plane was moved from BSD kernel space to Linux user space. Internally, the joke was that it would be two times better when we were done (4.5 * 2 = 9.0). It probably was, but not on day zero.
Are you on a corporate network? Do you use a firewall at home?
You’re on enclaves all the time. This is just a different one. Separate networks per class of traffic used to be de rigueur before Cloud. Now it’s all munged together.
I understand that security needs to be pragmatic, but just because security does it doesn't make it right.
It isn't about being in the enclave; it is about having to keep track of which headers you set vs. external ones. It is fragile and error-prone, and it will absolutely break someone else when there is a name collision.
And what would you have them do instead? SSL to every service? Reauthenticate on every single call? What’s your clever solution?
There’s probably a dozen things you can terminate at a boundary that would cost so much to run through the entire system that it would bankrupt a company.
And then there’s tracing, which also uses headers.
Maybe I'm dumb or missing something, but why not just do what Google does -- convert all inbound HTTP requests into Protobufs, separate out the internal stuff from the external stuff, run it through your myriad of internal services, and then on the way out you still have your nicely delineated protobufs which you can convert back into HTTP. Why are we mucking about with headers in the first place?
(Also, sibling is right, I spaced on X-CF- being a header sent to CF customers’ servers. I don’t use Cloudflare, but CloudFront does the exact same thing)
X-CF- seems to be used by some other software. And Cloudflare has plenty of documented headers that don’t start with X.
The whole point is that a genuine user header should never conflict with a Cloudflare internal header, and a browser should never see an internal header.
Having re-read the thread, I totally agree. What you have listed is table stakes. I'd also say that internal headers should be encrypted to a narrow protection domain.
If all internal headers were prefixed with X-CF-, you could strip them all via SIMD that had no knowledge of any specific header. Hell, you could probably do it on the NIC.
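For illustration, a rough sketch of that kind of blanket prefix strip (no SIMD here, and the `x-cf-` prefix is hypothetical); the point is that no per-header knowledge is needed:

```rust
// Sketch: strip every internal header by prefix alone, with no knowledge
// of the individual header names.
fn strip_internal(headers: &mut Vec<(String, String)>) {
    headers.retain(|(name, _value)| !name.to_ascii_lowercase().starts_with("x-cf-"));
}
```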
Configure all machines in a way that they can survive on some random untrusted public wifi. This should be the obvious default stance for laptops and phones.
But even for workstations wired to a trusted LAN it still makes sense because you never know which of the various tunnels might assign and expose some IPv6 address to the internet.
For servers you might be able to make an exception if you have a vigilant IT and the people with admin access aren't actively trying to circumvent security, but even then that better not be your only layer of security.
I feel like I can point out lots of similar problems with the other solutions you suggest (heck, I think some of those problems you list even apply to those other solutions).
The list approach has some downsides, but it also has a bunch of upsides. I feel like when people like to point out potential flaws of these approaches, they're ignoring the history and difficulties that come with Cloudflare's scope. An enumerated list is the simplest and most flexible of the approaches, and it also doesn't require any a priori agreement on the structure of a header key - this is probably important when you think about the sheer number of teams at Cloudflare, potential technology acquisitions, etc.
> it also doesn't require any a priori agreement on the structure of a header key - this is probably important when you think about the sheer number of teams at Cloudflare
That’s more of an indictment of how little effort is spent aligning things. It’s not that hard to tell every team that any headers have to start with ‘CFint’ and then enforce that.
* Headers that are originally private, but are later changed to public
* Headers that are intended to be public, but then are changed to private
* Headers that were created in the very early days of the company, which are so deeply integrated into everything that renaming them is basically impossible (or at least very hard and risky)
* Headers that are set by some appliance/third party solution not controlled by CF, but which are technically internal and must be stripped.
* Newly acquired company has its own set of internal headers. Management is not amused when a huge, disrupting refactor is suggested by engineers.
And this is just the tip of the iceberg of possible problems when your scale is big enough.
> It’s not that hard to tell every team that any headers have to start with ‘CFint’ and then enforce that.
To be a little bit blunt, this just means you have never built software in a very large company that may have a lot of existing/completely independent software, or have made technology acquisitions after the fact. That is, if you are starting from "time 0" of course it's very easy to say "internal headers need to start with CFInt", but that's rarely the case.
I think if you work for such a very large company the only reason you are ok with it is because nobody at the top says ‘no more’.
If they do, it suddenly becomes very easy to get everything aligned, because there is likely a vast collection of bodies just collecting a paycheck that can be more productively assigned to manually renaming everything.
Lists, extensible protobufs, etc are indeed great in their extensibility. The issue I would see (if I worked at Cloudflare) isn’t using a list — it’s that the list is the same list that’s used for externally visible HTTP headers.
"Hey any teams out there, before you can add internal headers, you need to register them with some service." Then the list is loaded from static storage at startup time and/or refreshed on some periodic basis.
Yeah, me too, but systems grew over time and grew and grew and we were using HTTP headers for all sorts of stuff. This optimization makes the situation better, but the solution (which is underway) is to use a different mechanism for IPC and get rid of the use of HTTP headers completely.
I’m certainly familiar with systems growing over time and outgrowing their original schema.
Is this the same phenomenon that resulted in Cloudflare Tunnels apparently being domain names? Or why domains that are proxied by Cloudflare show up as “CNAME” in the DNS panel? (The latter one seems extra strange — wasn’t the reverse proxy service the very first service that Cloudflare offered? There must be some history here.)
All new products are full of scaffolding that has to be removed when robustness becomes a higher priority than existence. Where we fail is not calling it that. Instead we just call some code “good” while our thicker skinned coworkers call the rest “fine”. It’s not fine. I don’t want to be working overtime the week of thanksgiving because you can’t think past the horizon.
There is a very fine line between "good enough" and "not good enough" in any product beyond a certain complexity. Finding the pieces that cross that line can be hard and improving "good enough" parts is (sadly) mostly a waste of time in commercial settings.
You have to backdate starting a project so it happens before things stop being good enough. And take into account bad estimates and the five other deadlines between then and now.
Often it’s better to clean as you go. If you have the time and the inclination, shore up one of the least good enough things.
It’s not unheard of to open up new feature sets via refactoring. Something that previously would have taken an “impossible” amount of time now becomes profitable, not just possible.
I miss working at a place where that was encouraged. Now if I do that the testers complain that it’s not in scope for the ticket and to split the changes into another ticket with a separate test plan. The rot thus continues.
I think this is about when I discovered zone defense for performance optimization. QA doesn’t want to chase your code changes halfway across the codebase, and that’s fair. At some point it starts looking like you’re changing things just to change them, even if it ends up being a few percent better.
But if you limit yourself to one part of the interaction, they don’t get as mad. A dozen changes in the signup system piss them off less than five changes across the app.
Indeed. I am very happy with some of the tricks that I developed (speeding up, and indeed allowing, Solaris C++ linking by prefiltering on MD5 hashes of canonicalised .o files -> nm, anyone?) even though they were lost without trace (yep, Lehman Brothers, where are my Imakefiles now?)...
I really hope not, given that that very old Solaris compiler should have been laid to rest many many years ago. Yes, some of my stuff had kinda made it into BarCap (even onto my desktop dev machine!) when I turned up there later, and even some at Nomura when I rocked up there again IIRC!
I recall working at a broker/dealer on Broadway where a curious sysadmin noticed a power cord that ran underneath an internal wall that fenced off "dead space." Said space turned out to contain a "forgotten" minicomputer living just outside a machine room.
I recall returning to a bank and discovering that router and load balancer admin passwords hadn't changed in seven years..
My likely overly cynical concern here is that this suggests trie-hard will soon end up abandoned, as you're making it sound like it is a stop-gap solution for you.
You chiming in on this post makes me slightly bitter at having gone with CloudFront (I have everything on AWS so it seemed the obvious choice) instead of Cloudflare.
- Ensure all headers added internally that are not for export are at the front of the header list
- Preseed the hashtable of all requests with internal headers you plan to remove.
In fact, if you preseed, you are basically combining these ideas but fixing how many internal headers are on each request. At that point, you can use a linked hash table that preserves creation order and just remove the first N from the final list that you send back to clients.
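A rough sketch of that preseed-and-slice idea, assuming headers live in an insertion-ordered list (all names here are made up):

```rust
// Sketch: internal headers are seeded at the front of an insertion-ordered
// list; internal hops see everything, the client-facing write skips the
// first `internal_count` entries.
struct Headers {
    entries: Vec<(String, String)>, // preserves insertion order
    internal_count: usize,          // leading entries that are internal
}

impl Headers {
    fn preseeded(internal: &[(&str, &str)]) -> Self {
        let entries: Vec<_> = internal
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        let internal_count = entries.len();
        Headers { entries, internal_count }
    }

    fn push(&mut self, name: &str, value: &str) {
        self.entries.push((name.to_string(), value.to_string()));
    }

    // For requests to internal services: iterate everything.
    fn all(&self) -> impl Iterator<Item = &(String, String)> + '_ {
        self.entries.iter()
    }

    // For the client-facing response: start after the internal block.
    fn external(&self) -> impl Iterator<Item = &(String, String)> + '_ {
        self.entries[self.internal_count..].iter()
    }
}
```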
> At that point, you can use a linked hash table that preserves creation order and just remove the first N from the final list that you send back to clients.
While Python provides data structures like this out of the box, designing a big system to require a special, nonstandard data structure to identify private metadata seems like a mistake to me.
I'm not sure I follow? Does rust not have the equivalent of a LinkedHashMap from Java? All I'm proposing is to start each map of headers off with a sentinel value for all of the headers you don't intend to send back to the client. Then, have a method that iterates over all values, for the places you need to send to internal servers, and another (probably the default?) that starts iteration after all of the internal items, which should just be a different pointer. At this point, there is no need to inspect each header by name, as that was effectively done at compile time when you put the protected headers at the front of the list.
Is this somewhat special? I mean, sure. But it wasn't long ago that a hash table was already in the special territory of data structures.
Edit: I should add that I'm not certain this idea would be better. Just more ideas on what you could do.
There’s a third party crate for this. C++’s STL also doesn’t have it by default.
The creation-order-preserving hash map is basically two data structures in one, it’s more complex and slower than a normal hash map, and IMO it’s rather bizarre and I have never really understood why anyone wants one.
I would be surprised if it was dramatically slower, all told. Though, fair that I have not benchmarked in a long time, here.
My main "idea" here is to stop from having to check all of the headers on a "per header" basis. Yes, you can make each check faster, as the TRIE does. You can also remove the entire reason to check individual headers by starting a traversal after where the internal headers are.
You could also go with something like making sure the first header is "InternalHeaderCount: N" where you then keep the internal headers segregated from the others by a simple count. (This assumes you have solid control over what is adding the headers internally, of course.)
(I also forgot to give kudos on the blog title. Any Die Hard reference is a good reference. :D )
It's especially unfortunate that the intuitive name, `std::map`, is the ordered and generally least useful option among the standard library's map containers.
My last job did rely on ordered-ness of `QMap`, because the elements are listed in a Qt GUI and it could confuse users if those widgets randomly rearranged themselves. That's the only use-case I've personally encountered for an ordered hash map.
X years ago, the manager of the cache team raised the lack of data plane and control plane separation as an issue, and we came up with various plans to fix things, but I guess nothing has happened since.
This might of course introduce quite a bit of overhead, but the clean solution would arguably be to not mix requests to begin with: the external requests you are processing are the payload of the requests you are sending around internally. This would also allow you to cook up an optimized encapsulation protocol that might be more efficient to process than parsing and modifying HTTP. This admittedly comes from someone with essentially zero knowledge of what exactly Cloudflare is doing to the requests they receive.
Or perhaps even insert yet another header with just the list of internal headers being added to the request (assuming this happens in a single place; otherwise it's a recipe for disaster).
I have a slightly different example of this, where a rpc framework used in my company disallows the service owner from modifying certain headers (say request identifier), and will instead create a duplicate header with a suffix. In that scenario at least, I can see this as a fairly reasonable tradeoff, as the goal is to control certain headers from being modified, not because they are platform internal but because there are certain assumptions associated with those headers that are followed company-wide.
I'll go check what mechanism is used for this matching.
> yet another header with just the list of internal headers
Or the same but with a list of headers which AREN'T internal.
You'll probably have a custom header-adding function that people should always use instead of the regular one. And this way, if someone forgets to use it, their header will get stripped.
You can think of a header escaping the internal network as something that needs to be authorized. This is a deny by default approach.
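A hedged sketch of that deny-by-default shape (names are made up); only headers added through the explicit "public" method ever reach the client:

```rust
// Sketch: internal is the default; escaping the network requires the
// explicit add_public call, so forgetting the right function fails closed.
#[derive(Default)]
struct OutboundHeaders {
    public: Vec<(String, String)>,
    internal: Vec<(String, String)>,
}

impl OutboundHeaders {
    /// The function teams are told to use for anything the client may see.
    fn add_public(&mut self, name: &str, value: &str) {
        self.public.push((name.to_string(), value.to_string()));
    }

    /// The default path; anything added here never leaves the edge.
    fn add_internal(&mut self, name: &str, value: &str) {
        self.internal.push((name.to_string(), value.to_string()));
    }

    /// Only explicitly authorized headers are written to the client.
    fn for_client(&self) -> &[(String, String)] {
        &self.public
    }
}
```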
This doesn’t work if the front end is older than an internal node and a header gets added, at least not without extra logic to patch everything up and handle conflicts.
The HTTP Connection header works like this, and one should generally assume that almost nothing implements it correctly.
It took me a moment to ponder the effectiveness of mapping utf8 characters into a bitmask. At first, it seemed unwise. Then I realized that 32 bits would get you `a-z` plus six special characters, and 64 bits would get you 'A-Z' (uppercase) plus six more. That's plenty of room for HTTP headers, and for a blazing fast matching algorithm since it just masks and compares a couple integers. Even figuring out which bit corresponds to which character is a simple lookup into a 256-word table.
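Roughly the trick as I understand it, sketched for a 64-bit mask (the alphabet and table here are illustrative, not what trie-hard actually ships):

```rust
// Sketch: a 256-entry table maps each allowed byte to a bit position, so a
// string's character set becomes a single u64 that can be masked and compared.
const NO_BIT: u8 = 0xFF;

fn build_table(alphabet: &[u8]) -> [u8; 256] {
    let mut table = [NO_BIT; 256];
    for (bit, &byte) in alphabet.iter().enumerate() {
        table[byte as usize] = bit as u8; // assumes alphabet.len() <= 64
    }
    table
}

fn to_mask(s: &[u8], table: &[u8; 256]) -> Option<u64> {
    let mut mask = 0u64;
    for &b in s {
        let bit = table[b as usize];
        if bit == NO_BIT {
            return None; // byte outside the alphabet: instant reject
        }
        mask |= 1u64 << bit;
    }
    Some(mask)
}
```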
One thing the author leaves out is this technique is technically a Bloom Filter. I find these kinds of things fascinating since they came about in an era where computing was far more resource bound than today (1970 in this case). But here we are, still finding real-world corners to use the same old optimizations.
There are significant differences between trie-hard and a Bloom filter. Bloom filters are probabilistic, and use hashing. They're great for when rare false positives are acceptable in exchange for no false negatives, which isn't uncommon, but isn't what's needed here: this is exact, and hashing is the step to beat.
Rather, this is a refinement/variation on Liang's algorithm, used in TeX for storing a hyphenation dictionary. The thesis mentions Bloom's algorithm, which it calls "superimposed coding", and very much hearkens back to a time when memory was often the most precious commodity. I think you'll like it. ^_^
The problem is that bloom filters are good when your dataset is large and calculating several hashes per item is not as expensive as looking it up in the original dataset. Here, they say that even one hash is expensive, so bloom filters would be much slower than the original hash map.
The presented data for demonstrating a win doesn't have enough power to actually show this - not enough samples were taken.
A very simple analysis in R:
> prop.test(c(9, 4), c(1103,1171))
2-sample test for equality of proportions with continuity correction
data: c(9, 4) out of c(1103, 1171)
X-squared = 1.4915, df = 1, p-value = 0.222
alternative hypothesis: two.sided
95 percent confidence interval:
-0.002409831 0.011897193
sample estimates:
prop 1 prop 2
0.008159565 0.003415884
A p-value of 0.22 isn't below the magic 0.05 and the 95% confidence interval suggests that the trie might actually be slightly worse.
I imagine the trie is better, given the prior analysis, and there is weak evidence for this. But take (a lot) more samples and know with confidence how much better.
It's silly to use Big-O to describe the number of comparisons when the goal is to analyze performance. Comparisons cost 1 cycle, and you can do many of them per cycle with instruction level parallelism and SIMD. The main bottleneck, the actual source of slowness, is memory. It costs thousands of cycles to go to memory, sometimes tens of thousands or hundreds of thousands (if you need a TLB walk or OS interrupt). If you want to use Big-O, use it to estimate the number of cache misses.
I'd probably go with a custom perfect hash function. And Phil Bagwell's popcount trick. That would be faster than everyone else's solution that involves multiple lookups in memory.
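For anyone unfamiliar, the popcount trick looks roughly like this (a sketch of a Bagwell-style sparse node, not Cloudflare's code):

```rust
// Sketch: a sparse 64-slot node stored as a presence bitmap plus a dense Vec;
// the popcount of the bits below `bit` gives the child's index in the Vec.
struct SparseNode<T> {
    bitmap: u64,
    children: Vec<T>, // only the populated slots, in bit order
}

impl<T> SparseNode<T> {
    fn get(&self, bit: u32) -> Option<&T> {
        debug_assert!(bit < 64);
        if self.bitmap & (1u64 << bit) == 0 {
            return None; // slot not populated
        }
        let below = self.bitmap & ((1u64 << bit) - 1);
        Some(&self.children[below.count_ones() as usize])
    }
}
```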
I'm not very well versed in data structure optimization, but I was surprised they dismissed hash tables so quickly, especially when the table they are searching through is static. I find it somewhat hard to believe that a specially optimized hash table would not be faster than their trie implementation.
A hash function can't reject a string with a single test of the first byte. It just can't.
For this application, that's a leg up which a hash won't be able to touch. The rest of the art here is shaving down the constant factor of a trie far enough that the initial boost pays off in real-world performance.
The point I'm making is that a specially optimized hashing function would probably blow away any trie traversal. When you know the data being hashed beforehand, it is possible to make custom hash functions that are as simple as a few bitwise operations on the first, middle, and last character of the string (just an example).
Right. Out of curiosity I looked into what code gperf generates and it seems it would end up O(1) for misses (it's always three lookups into a generated table) and O(L) for hits, as a potential hit has to be confirmed by essentially a strcmp. Not sure how something like https://github.com/rust-phf/rust-phf would work, but probably similar?
If all your internal headers start with cf-, most non-internal headers don't (but not all), and most headers are non-internal, a trie might be hard to beat.
If they just used the default, that's SipHash, which is optimized for security and DoS resistance, not performance. XXH3 is ~10x faster than that, and there are some newer ones that are even faster than XXH3, like GxHash (another ~1.5x-2x faster).
You would precompute the hash for the constant string list & then compute the hash only for the set of strings that begin with C/c, since "cf-" is a common prefix for those internal headers if I recall correctly, using RawTable to find entries by hash & then comparing against the precomputed value to see if a collision matched before checking a string for equality.
As they said though, the hash function on a string type requires looking at every element in the string. Also, modulo is historically very slow. So, they still have to deal with O(n) operations, where n is the length of the input string. If they can improve the memory hopping problem associated with lists / graph structures (which they claim to have done in their library), then a trie could well be fast enough, which is what they observed. Combined with the fact that they claim 90% of the time there is a miss when querying the trie, you exit early a lot, whereas you always need to compute the whole hash of the input string with the hash map strategy.
The hash does not need to be computed on the whole string. I pointed this out in my other comment but just as a example: a hash function could be as simple as xoring the first 16 and last 16 bits of the string then indexing a 2^16 array. That means hashing is two pointer offsets and an xor (no modulo required). If there are 100 strings that need to be removed, then ~99% of rejections will be a very fast O(1).
And in the case of a match, a fast hash + memcmp will be way faster than a trie traversal. In fact, according to the trie-hard github readme, std::HashMap is already much faster than their trie implementation when searching for a match.
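To make that concrete, a purely illustrative sketch (a real table would be generated from the actual header list and checked for collisions at build time):

```rust
// Sketch: a 2^16-entry table indexed by XOR of the first and last two bytes.
// Most non-internal headers miss on this single lookup; hits fall back to a
// full comparison against the stored string.
struct CheapFilter {
    slots: Vec<Option<&'static str>>, // 65,536 entries
}

fn cheap_index(name: &[u8]) -> usize {
    let n = name.len();
    if n < 2 {
        return 0;
    }
    let first = u16::from_le_bytes([name[0], name[1]]);
    let last = u16::from_le_bytes([name[n - 2], name[n - 1]]);
    (first ^ last) as usize
}

impl CheapFilter {
    fn new(internal: &[&'static str]) -> Self {
        let mut slots = vec![None; 1 << 16];
        for &h in internal {
            // A real build step must verify there are no index collisions here.
            slots[cheap_index(h.as_bytes())] = Some(h);
        }
        CheapFilter { slots }
    }

    fn is_internal(&self, name: &str) -> bool {
        match self.slots[cheap_index(name.as_bytes())] {
            Some(candidate) => candidate == name, // confirm with a full compare
            None => false,                        // the common, O(1) rejection
        }
    }
}
```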
> As they said though, the hash function on a string type requires looking at every element in the string.
Maybe for the default hash function. As another commenter pointed out, your data may make the following hash very effective: s[0] + s[len(s)//2] + s[-1], which would be very fast. The point being that spending a day seeing whether such a hash exists is worth it.
CRC32 is the same speed as an integer multiply, going all the way back to Nehalem (2008). 3 cycles latency, and you can start a new one each cycle (or more than one, on Zen 5).
I’ve used them in real code! They are pretty intuitive for some use cases. The last time I used a trie was for address lookup. I even used a generator, haha. Double whammy leetcode in real life situation I guess.
It’s definitely not a common tool I reach for though.
Routing libraries as well. In Clojure, the reitit library uses a prefix tree implementation when the route table is large enough for linear search to be non-optimal.[1] The Go ecosystem also has httprouter, which uses prefix trees under the hood.
i find myself reaching for a trie/radix-tree from time to time in medium-complexity build tools or code analysis scripts, where I need to store some information about zillions of filepaths, and factoring out the common prefixes saves a ton of memory while being pretty intuitive
Tries are huge for Scrabble move generators and playing programs, of which I've written a few. The fastest structures are compressed tries with rotated representations of words.
Given that the set of items to match is static, I wonder if they tried a perfect hash table; that would reduce it to a few arithmetic operations followed by a single string compare. It would be interesting to see how it compares to the trie.
That is what I immediately thought as well. The problem though is that a perfect hash is still going to be O(key length) for the hash function, which they are trying to avoid.
In theory, if they used a regex it would be using a state machine to do the matching, which should have similar performance to the trie: only O(k) in the worst case. But from what I understand, regex libraries don't actually build the state machine; they use backtracking, so the performance is no longer O(k).
I'm surprised they couldn't find a performant regex library that already existed that uses state machines. It should have similar performance to the trie. But in reality, the performance is affected more deeply by things like memory access patterns and the performance of specific arithmetic operations, so it's impossible really to speculate.
I wonder how much faster it gets when you shorten the maximum internal key length. If you re-map everything to like "cf-1", "cf-2", .. "cf-;" then you'd only need to hash like 4 characters.
Did you try a (tiny) bloom filter? Doing a quick convolution of the header key and a test against a bloom filter (SIMD? Using builtin CRC ops? 256-bit bloom filter size?) might avoid walking the trie altogether in most cases at the cost of a few cycles.
A bloom filter will have the same weaknesses of needing to do a hash. For static filters like this though where you have the set of strings to check against known up front, something like a binary fuse filter could work really well as it's significantly faster than bloom filters when you don't need online construction (you still need to do a hash of course but the evaluation against std::HashMap is flawed since it uses SipHash instead of modern constructions like xxh3 or gxhash).
Yeah, for this problem you'd want a bloom filter that can entirely fit in vector registers during the computation, which should work well for a few hundred items.
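A toy version of that, assuming two cheap hashes and a 256-bit filter held in four u64 words (a real version would pick its hash functions against the actual key distribution, possibly touching only a few bytes of the key):

```rust
// Sketch: a 256-bit Bloom filter, small enough to live in registers/cache.
// A negative answer is definitive; a positive one must be confirmed exactly.
#[derive(Default)]
struct TinyBloom {
    words: [u64; 4], // 256 bits
}

// Two cheap, independent-ish hashes mapping a key to a bit index 0..=255.
fn h1(key: &[u8]) -> u8 {
    key.iter().fold(17u8, |acc, &b| acc.wrapping_mul(31).wrapping_add(b))
}
fn h2(key: &[u8]) -> u8 {
    key.iter().fold(5u8, |acc, &b| acc.wrapping_mul(131).wrapping_add(b))
}

impl TinyBloom {
    fn insert(&mut self, key: &[u8]) {
        for bit in [h1(key), h2(key)] {
            self.words[(bit >> 6) as usize] |= 1u64 << (bit & 63);
        }
    }

    fn maybe_contains(&self, key: &[u8]) -> bool {
        [h1(key), h2(key)]
            .iter()
            .all(|&bit| self.words[(bit >> 6) as usize] & (1u64 << (bit & 63)) != 0)
    }
}
```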
Considering that each node's filter in their trie implementation is bloom-like already, the solution is awfully close. I agree that testing wider swaths of the string at once might be a dramatic accelerator.
A more naive approach might also be warranted. Some frequency analysis of hit/miss for trie nodes might allow them to find specific character positions with a higher miss rate than the first one. Testing such special positions first would speed things up. That assumes, of course, that the header data is fairly regular in nature.
1) Is this worthwhile? Looks like ~500 CPU cores were saved (are these real cores, or does this include hyperthread cores?). I don't know cloudflare's costs, but this seems like single digit servers and probably savings only in the $XX,XXX range. Not nothing, but do you expect a positive ROI on engineering?
2) If you do want to go to this detail, did you consider in putting the filter at the deserialization step, preventing the headers from being created in the first place?
Nice to see a company that cares about making things 1% faster rather than trying to analyse headers with AI to sell me sandals or something stupid like that.
That's such a tiny corner case, yet something you can easily publish. Fuelling marketing content that ends with a hundred geeks discussing whether what CF is doing makes sense, or how much they care about perf.
I don't understand your point about the ROI. Let's say it's 40k a year; over 5 years that's 200k. That's a multi-year salary in Hungary/Poland (I have no idea if CF has offices there).
Not just that, but there's no time frame in the blog post. They mention that Pingora was released a few months ago (which was really 10); however, I doubt they spent 10 months on this single change.
So let's very conservatively assume 2 months. There are 12 months in a year, so that's 6 such changes at 40k in savings each. That's 240k, which is still below a FAANG salary, but it's per year, forever, and as Cloudflare gets bigger the 40k would turn into 50k without anybody acknowledging it.
I'm interested to know why the regex crate didn't do better. I believe it should compile a search for multiple literal strings into an Aho-Corasick automaton, which is structured very much like a trie.
regex crate author here. From a quick look at trie-hard, my guess is that trie-hard can do lookups on short strings much more quickly than the regex crate can do matches on short strings. The regex crate has a fair bit of overhead for selecting the right matching engine and applying other optimizations appropriate to general purpose matching.
But the aho-corasick crate, and especially the specific automatons, would be interesting to try. There's much less overhead there. But, there are some differences:
* The contiguous NFA in aho-corasick has a lot more going on inside its "next transition" routine: https://github.com/BurntSushi/aho-corasick/blob/cd400ad792d6... --- This is primarily because an Aho-Corasick automaton isn't just a trie. It also has failure transitions to support unanchored search. You can search without respecting failure transitions (and thus treat it as a trie), but the code is still there to deal with failure transitions. So that may have deleterious effects.
* The contiguous NFA also specializes its transition function depending on the size of the node. It would be interesting to see whether these techniques would be useful for trie-hard to play with. But there is definitely a benefit to keeping one simple and fast code path for all node types. (Less branching.)
* The aho-corasick crate does also expose a DFA directly: https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct.... --- In theory this should do less work per transition than trie-hard, and so it would be very interesting to measure its perf when compared to trie-hard. The main downside is that it's a DFA and can use up a lot more memory and take longer to build. It seems like Cloudflare cares less about that in favor of optimizing read perf though, so it's not clear why they didn't go this route.
* The aho-corasick crate doesn't let you store arbitrary values. You only get back a PatternID when a match occurs. So to get equivalent functionality as trie-hard, you'd need a separate map of PatternID->YourValue. Since that isn't stored with the data in the trie, this alone might penalize you quite a bit with poorer CPU cache performance.
* trie-hard seems to provide some customization options for its representation, i.e., it lets you use a `u8` to store its transitions versus a `u128` or `u256`. This might also help with cache usage for a small enough trie.
This is a gratifying read (along with the article) because I've been working on closely related things, and had come to the tentative conclusion that taking Aho-Corasick, anchoring it (so no failure transitions), and laying it out in contiguous memory, basically results in Liang's packed trie from TeX's hyphenation algorithm.
trie-hard seems to be effectively that, but with some clever masking to get mileage out of having a limited byte vocabulary to deal with.
The DFA might pull ahead, but I wouldn't just bet on that. Branch predictors don't like lookup tables much, and the memory locality hit is small but real: more of a packed trie fits in a cache line than would a DFA.
You seem to have landed on similar conclusions here: beating a packed trie for this specific use case is going to be tough.
I also like that they hit on the same popcount-and-shift technique I've used in a data structure for UTF-8 character sets: https://github.com/mnemnion/runeset. I admittedly haven't benchmarked this much, correctness first and all, but seeing that Cloudflare has seen good numbers for something closely related reinforces the few tests I have run.
As a text-matching connoisseur, you may find the runeset interesting. It turns out to be hard to search for papers on "UTF-8 character set", so if you're aware of any prior art here I'd love a heads-up. It's fundamentally a generalization of using u64 bitmasks for sets of ASCII, which is as old as the hills at this point.
RE DFA: possibly, but not completely obvious to me. The DFA in aho-corasick doesn't typically have 256 different transitions per state. It uses "byte classes" to represent a typically smaller number of equivalence classes. Two bytes are in the same equivalence class if a match/non-match cannot be distinguished between them.
To be clear, I'd bet trie-hard is still more space efficient. And using byte classes does require an extra look-up per transition.
But, worth measuring.
As for your RuneSet thing, I'd look at lightgrep and icgrep. I'm not 100% sure they have relevant work for you, but they might. In practice, I just combine sets of codepoints down into UTF-8 automata since that's needed anyway for lazy DFA matching.
> not completely obvious to me .. worth measuring.
I concur.
> lightgrep and icgrep
These are great tips for a "related work" section, but nothing like prior art as it turns out. So yes, relevant and appreciated.
> UTF-8 automata .. lazy DFA matching
I wouldn't expect RuneSet to be useful for any automata-based regex pattern matching approach, there's not a good way to incorporate the context switch, among other considerations, and while it's fast, I will be surprised (very pleased, but surprised) if it's faster. It's designed as a component of an LPeg style parsing machine, where it can be added inline and treated as an unusually large instruction. I haven't done more than time its test suite, which has shown that it's fast enough for the intended use, and that optimization time is best put into other parts of the design for now.
Nope. That doesn't usually help in practice because it's not hard to amortize construction with a std::sync::LazyLock.
(The `memchr` crate doesn't even support const construction of a single substring searcher. I haven't spent much time thinking about it because `const` is still pretty limited in what it supports, although progress is being made. For example, you can't use trait methods in `const`. So while you can undoubtedly use `const` Rust to build a single substring searcher, whether `const` Rust can be used to build the one in `memchr` specifically isn't clear to me. And if it could be done, it will undoubtedly come with development/code-clarity costs and it's not clear to me that those are worth paying. Because as I said, amortizing construction for a static set of strings is usually very simple to do.)
And in any case, "takes a long time to build and uses a lot of memory" isn't avoided by doing it at compile time. Instead, you get bloated binaries and longer compile times. Which costs matter more? Dunno, especially when you amortize construction at runtime.
Yeah I guess I was mainly just asking "is const powerful enough yet for this to be a trivial addition".
> Instead, you get bloated binaries and longer compile times
Longer compile times probably are solved by having the strings live in a standalone compilation unit which the compiler will end up caching and you can only dirty it by adding a new string. Needing to shuffle that extra data around might still cost extra compile time, but in a real world program I suspect that's likely to not matter (both in terms of binary size & that other things in the program will be more expensive to compile).
That being said I agree that doing a LazyLock is probably sufficient and given the amount of entries being compared against acquiring the lazy lock probably also gets amortized (it's cheap but not free).
On the internet, the origin is the server sending the response to the user. I suppose you can look at it from the perspective of the owner of the server -- from their frame of reference, their journey _starts_ when they receive, process, and respond to the request.
Granted, computer scientists are infamously known for being terrible at naming things.
So this trie uses a custom u256 type that is a wrapper around an array of 4 u64s. Is Rust smart enough to vectorize it into AVX instructions for the bit twiddling?
What about for the other integer sizes that the trie supports?
Am I missing something, or is the Big-O notation in this article wrong? For example, in "Sorted sets like BTreeSet use comparisons for searching, and that makes them logarithmic over key length O(log(L)), but they are also logarithmic in size too", how is the search done in logarithmic time? You could have a header that differs from an internal one only on the last character, and then you have to read the whole thing. Also, space-wise B-trees are linear, not O(n log n). Additionally, at the end, when talking about the trie, how do you achieve O(log(L)) for misses? Tries are not balanced; they do not halve the possible set on each comparison (as I think the author states), and even if they did, I don't see how that achieves logarithmic time.
> You could have a header that differs from an internal one only on the last character and then you have to read the whole thing
When people talk about Big-O they usually mean the average case, and worst/best-case Big-O is explicitly called out if mentioned, so you can't just think in terms of the worst possible case. Regardless, I think they made several mistakes here, as you correctly note.
BTreeSet is log(N) in number of nodes compared. But for strings that comparison is O(L) since you at least have to compare it to the string you find. So I believe it's at best O(L log(N)) where L is the average string length and N is the number of nodes which is worse than the hash table which is just O(L). It's tricky though if most strings don't match your string. In that case I believe you end up degrading to an O(L) + O(log(N)).
Similarly, you are correct that they can't be logarithmic in size and must be linear since you have to store the data. Tries can be sublinear depending on the input since it's a form of compression as well.
Similarly you are right about trie complexity not being O(log(L)) for trie misses. I think what they're observing is a massive speedup because mismatches error out on the first character usually. But it wouldn't be logarithmic as there's unlikely to be a logarithmic relation between headers that mismatch and the universe of matching words.
I'll note I could very well be wrong - very back of the paper first principles derivation and could easily have made a mistake.
If you're in a relevant course, might be a good opportunity to flag this to a professor or TA as a way to review. These kind of analyses can be tricky sometimes, especially if you need to prove it vs just do an answer by feel which is how engineers end up doing it (vs CS students that take the proofs much more seriously).
A way to sometimes double-check is to just simulate performance as input grows and see if it matches the expected log(N).
Why add and then remove HTTP headers on an existing request for routing, instead of adding a dedicated (custom) routing protocol on top of the HTTP stack? This would allow you to just drop the added protocol on the outbound side and be much more space efficient.
I will note that http requests and internal requests are distinct.
The Zanzibar white paper from Google in 2019, for example lists top-line rpcs to Zanzibar at around 10m qps, but internal (in memory) cache hits at 200m qps and non cache internal fan-out at >20m qps. Zanzibar of course isn't http, but is a good public example of internal rpc fan-out.
I'd expect Cloudflare similarly has some extremely high-QPS internal services.
Jumping on the bandwagon of other ideas... I wonder if it would be quicker to filter out the internal headers when you write the request to the network (buffer)? i.e. something like `request_headers.filter(not_is_internal).for_each(...)`; that way you don't have to remove anything from the data structure at all, but it does require you to break down a layer of abstraction for performance benefits.
No, you'd just be smearing the work at best. The expensive part is determining set containment not updating the data structure. It doesn't matter if you do it in a loop and update the data structure or do that check before writing.
I don't know Rust. Can someone explain how this data structure stores whether a node is itself a valid word or whether it only leads to more nodes? For example the node "do" in their (“and”, “ant”, “dad”, “do”, & “dot”) example. I guess it's a discriminated union provided by Rust algebraic data types or something like that, but why not show the full bit pattern in memory?
Rust doesn't guarantee a particular memory layout except for types explicitly requesting that you get a specific representation, so showing this would at least require a caveat that it's not guaranteed to be accurate.
Furthermore, this specific data structure's internals are generic, so to show the memory layout, it would have to be for a specific instantiation.
(They do also provide a structure that includes instantiations between u8 and u256, mentioning in a doc comment that the 256 size dominates overall size, and so if you want to slim it down, use the internals directly.)
Looking at the source code [1], they have a discriminated union with 3 variants representing the possible combinations of being a valid word and leading to more nodes (excluding the case where it is neither, which is impossible). So the node for the 'o' in "do" should be a `SearchOrLeaf` storing "do", its corresponding value (the `T`; in a set this would be empty) and the `SearchNode` containing the successors, which should only contain a 't' leading to "dot".
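A simplified paraphrase of that shape (not the crate's exact definitions; names and fields abbreviated):

```rust
// Sketch: each node is one of three cases, so "is this a valid word?" and
// "does this lead anywhere?" are encoded in the variant itself.
enum TrieNode<K, T> {
    // A complete key with no longer keys underneath it ("dot").
    Leaf(K, T),
    // A prefix that is not itself a key ("da" on the way to "dad").
    Search(SearchNode<K, T>),
    // A complete key that other keys continue past ("do" under "dot").
    SearchOrLeaf(K, T, SearchNode<K, T>),
}

// Stand-in for the child-lookup structure (the bitmask-plus-children part).
struct SearchNode<K, T> {
    children: Vec<(u8, TrieNode<K, T>)>,
}
```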
Alternatively at request ingress time you could take all original header names and save them, and at egress time you could use that list to drive which keys to emit?
Then you wouldn't need to maintain a list of what is internal and you wouldn't have to do any trie lookups.
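A rough sketch of that alternative, assuming header names can be captured cheaply at ingress (all names hypothetical):

```rust
// Sketch: remember which header names arrived with the original request,
// then at egress emit only headers whose names are on that list.
struct RequestContext {
    original_names: Vec<String>, // captured once at ingress
}

impl RequestContext {
    fn at_ingress(headers: &[(String, String)]) -> Self {
        RequestContext {
            original_names: headers.iter().map(|(name, _)| name.clone()).collect(),
        }
    }

    fn at_egress(&self, headers: &mut Vec<(String, String)>) {
        headers.retain(|(name, _)| self.original_names.contains(name));
    }
}
```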
I would have stopped at the first step, saving almost 1% of cpu time (0.993%). Creating a new fast trie data type for the other 0.287% seems inefficient. Surely there are other parts of the code that take more than 1% of the cpu time elsewhere to be looked at first?
> At the time of writing, the rate of requests leaving pingora-origin (globally) is 35 million requests per second. Any code that has to be run per-request is in the hottest of hot paths...function consumes more than 1.7% of pingora-origin’s total cpu time. To put that in perspective, the total cpu time consumed by pingora-origin is 40,000 compute-seconds per second. You can think of this as 40,000 saturated CPU cores fully dedicated to running pingora-origin. Of those 40,000, 1.7% (680) are only dedicated to evaluating clear_internal_headers. The function’s heavy usage and simplicity make it seem like a great place to start optimizing.
Given how many requests are going out, and just how hot the function is, every single gain matters. If my read of the article is correct, this is the hottest part of the hot path, so every gain here is super important.
680 cores is peanuts. That is less than 10 modern specced compute servers. It's not nothing, but you are probably talking about $20-30k/year in opex costs, which at the scale of Cloudflare is probably a drop in the bucket.
Obviously it is an improvement but a lot of time there is lower hanging fruit, like the fact that the cloudflare control plane is/was not globally distributed and could not tolerate data center outages.
That said: probably different teams with different priorities.
I also disagree that “it’s _only_ $20-30k” as some kind of reason to not do this kind of work. Think of it another way: that’s now another 680 cores that can do _net more work_, that’s another 20-30k we don’t have to spend when traffic increases. “I landed a small PR that wiped $30k net off our costs” is a statement that would make a lot of people in a business pretty happy.
I work in a tech company which is a bit bigger than cloudflare and anything less than 6 figures/yr opex reduction is a rounding error.
It is pretty regular to save the company 7 figures per year when opex is in the billions.
It is about the opportunity cost, nobody is going to disagree that small optimizations are a net positive but what else could you have done with your time? Maybe cloudflare is at the point of maturity where there are no easy wins and only 1% gains left but I kind of doubt it given their reliability track record. How many dollars do they lose for every hour they are down? Probably more than $20k.
They don't explicitly say that in the blog post but I assume that the engineering team behind this code does not run any workload performance experiments themselves but they only run the local (micro)benchmarks.
Because of this I think there's no way for them to answer your question or to say "hey, this could very well be a vector since we could not find any statistically significant impact between a vector, hash-map and trie in this code path for given workload".
This is a common fallacy I've seen in too many places. In other words, investing non-trivial engineering effort into optimizing a code-path that resembles .000001% of your workload is a waste of resources.
And due to Amdahl's law
> Optimizing operations that are already measured in microseconds may seem a little silly, but these small improvements add up.
this is not how things usually work out in the end. Making small improvements in insignificant code-paths will still remain insignificant in the overall runtime performance.
> If we compare the sampled performance of the different versions of clear_internal_headers, we can see that the results from the performance sampling closely match what our benchmarks predicted
Look at the end where they validate that the performance wins show up in production in a similar way that the microbenchmarks predict.
They only measure the CPU utilization of clear_internal_headers and they don't show what was the impact of the change to the overall runtime performance. The latter is what I was referring to.
The overall performance change for that specific software is minimal but at scale it frees up the cycles for other software since it’s a multi tenant system which is why they’re showing how much CPU time they shaved across the fleet. Or am I misunderstanding what point you’re making?
Do you perhaps know what the actual scale is? Is it 100 proxies around the world, or is it 1k, 10k, 100k ...? I imagine that number is not very high since the blog post says that pingora-origin can serve 35M+ requests per second. My wild guess would be not more than 100 deployments. If so, then the "at scale" argument loosens up quite a bit IMO.
> it frees up the cycles for other software
I see this more like a theoretical stance since with such argument we could micro-optimize the code all day every day. It's a reverse HTTP proxy intended to be high-performance and low-latency so I don't imagine that the server that is running it has more important software to run.
Literally every single Cloudflare edge server that processes requests runs this proxy. Based on other things you said, it seems like your mental model of how Cloudflare's network runs is inaccurate. Mapping pingora origin requests/s to number of servers is difficult because of how tiered caching works & not all traffic is tiered cached and from what I remember pingora origin refers to all origin requests and a good amount of them aren't tiered (e.g. Workers outbound requests to the origin go through pingora origin & don't typically tier) which means those requests are being made directly from the edge (also tiering isn't like there's 100 servers - it spreads the load out evenly to avoid overwhelming any single server).
[1] suggests ballpark of 164 data centers and each data center is multiple machines each with multiple cores; [2] says the next gen servers have 128 cores while [3] suggests that the current gen has 96 and the previous has 48. We can assume that on average each server in a data center has 64 cores. Let's assume there are 20 machines in a data center on average (I don't remember the exact number so it's a wild guess). That comes out to an estimate of ~30k machines with ~2M total cores and this reverse proxy being responsible for ~20% of CPU time on average for every single machine (40k CPU cores from the article divided by 2M total cores across the network).
> It's a reverse HTTP proxy intended to be high-performance and low-latency so I don't imagine that the server that is running it has more important software to run.
In fact, the server that's running the proxy is also the server that's handling the inbound request as well as the Workers runtime, Cloudflare tunnels, firewall product, image optimizations, DDOS protections, etc etc etc & any other internal software they have to run to maintain their network (observability, keeping track of routing graphs, etc). So yes, CPU cycles are largely fungible and will be freed up for use on other tasks / to absorb growth as they acquire more traffic (& even if your server is underutilized this can help because it means the CPUs are switching contexts less often given how many threads are on these machines resulting in potentially better latencies at the extremes for neighboring services). There's more nuance here but this is a good first order approximation.
And yes, it's a micro optimization but clearly helpful at scale when you add up many teams doing micro optimizations. For example, here's a sqlite blog post [4] explaining how micro optimizations that don't even show up in real world systems (unlike this one which does) over an extended period of time yielded an overall 50% improvement.
Now how much is it worth to optimize global CPU usage by 0.34%? I don't know but Cloudflare does have a good way of converting CPU cycles saved to how much it saved in terms of number of servers they had to buy. Back of the envelope this added ~1k machines worth of compute time capacity to the network (0.34% * 2 million cores / 64 cores per machine). I think it's interesting the 0.34% is the actual CPU usage from their blog post and also comes up if you multiply the 20% of CPU time I estimated for the server by the 1.7% they said that hot function takes up but I think that's mostly an accident - they predicted 0.43% which suggests that the network is larger than 2M (~31k servers) and maybe they have 2.5M cores (~39k servers) and between offline servers and workloads varying due to time of day in practice they're probably closer to using ~2M cores total of their overall capacity. That optimization could still be worth doing since SRE's are responsible for maintaining fleet health while the Pingora team is responsible for their corner.
> That comes out to an estimate of ~30k machines with ~2M total cores and this reverse proxy being responsible for ~20% of CPU time on average for every single machine (40k CPU cores from the article divided by 2M total cores across the network).
I think you wanted to say 2% of CPU time on average and not 20%, no?
> In fact, the server that's running the proxy is also the server that's handling the inbound request as well as the Workers runtime, Cloudflare tunnels, firewall product, image optimizations, DDOS protections, etc etc etc & any other internal software they have to run to maintain their network (observability, keeping track of routing graphs, etc).
Sure, they have plethora of software that they need to run but that wasn't my point. My point was rather that I cannot imagine that a single VM that runs the low-latency service such as the proxy is also running any type of heavier weight services. Thus an argument of freeing up the CPU cycles for some other software when in reality there probably isn't any didn't make much sense to me.
This would be a legitimate argument if there was a challenge to begin with. Such as "we noticed that our 64-core 128G VM that runs the proxy is becoming saturated and thus the tail-latency of our XYZ service running on the same VM is starting to degrade". Without the problem statement "freeing up cycles for other software" is otherwise purely theoretical goal that is difficult if not impossible to quantify.
> And yes, it's a micro optimization but clearly helpful at scale when you add up many teams doing micro optimizations.
Perhaps in companies such as Google where they have a dedicated team developing fundamental low-level data-structures and/or runtime which is then used by 1000's of their other internal services. I believe that's actually one of the reasons why GoogleBenchmark was designed. In majority of other companies micro-optimizations are only a very good brain gymnastics exercise, I think.
> For example, here's a sqlite blog post [4] explaining how micro optimizations that don't even show up in real world systems (unlike this one which does) over an extended period of time yielded an overall 50% improvement.
The hypothesis of
"100s of micro-optimizations add up to an overall 50% improvement"
is equally likely, or mathematically speaking less likely (due to Amdahl's law), than the hypothesis of
"there were a few changes but we don't know what are the exact changes that contributed to the majority of 50% improvement"
I intentionally say hypothesis because that's what it is. There's no evidence to support either of the claims. If I wanted to support my time spent working on the micro-optimizations at the XYZ company, this is also how I would frame it. This is not to say that I think that the author of sqlite was trying to do exactly that but only that the reasoning outlined in the post is not convincing to me.
That said, reducing the CPU cycles for certain workload by ~30% is a noble and quantifiable artifact especially if you're a database company hosting thousands of your database instances since this obviously directly translates to more $$$ due to the better hardware utilization.
However, the trie-hard micro-optimization is done in software that constitutes ~2% of their total CPU time (if your ballpark estimates are somewhat close). Since the micro-optimization cut the function's CPU utilization from 1.71% to 0.34% of that 2%, the total fleet-wide CPU utilization spent on it went from roughly 0.034% (before the change) to 0.0068% (after the change).
> My point was rather that I cannot imagine that a single VM that runs the low-latency service such as the proxy is also running any type of heavier weight services
I encourage you to imagine harder then. Nearly all of the software is running bare metal and as I said it’s all replicated on each machine. So the same server is running the reverse proxy and the CDN and the Workers runtime and Cloudflare tunnels, because the anycast routing they use intentionally means that every single inbound request always hits the nearest server, which generally processes the entire request right on the machine and uses Unimog to distribute the requests within a data center [1] and Plurimog [2] to distribute load between data centers. I’ll repeat again - every server runs the proxy along with every other internal and external product, and there are no VMs or containers*.
I’m not going to engage with the rest of your hypothesis that it’s not worth the effort until services are falling over or it’s the root cause for some tail latency since the team has indicated it clearly is worth their time and value to Cloudflare.
* technically there are VMs but it’s not worth thinking about for this specific discussion since the VM is closer to being just like any other application. It’s not virtualizing the cloudflare software stack but just providing a layer of isolation.
Given it's such a hot part of their code, I'm genuinely surprised they didn't go down the state machine route for as optimal a solution, or even further and made use of vector instructions. Maybe Rust can autovectorise the code though.
Shouldn't this line:
> "using regular expressions clocks in as taking about twice as long as the hashmap-based solution."
be written as the opposite? That the RE takes half as long/is twice as fast?
I think the point is that a naive regex is a very general-purpose tool, but it's still in the same ballpark. Having a custom optimized state machine for this specific use case could bring another 5x improvement on top, leading to 2.5x faster, potentially.
They did do that at first; the “static Vec” reference is to this. A Vec<T> is a growable array type.
Arrays have O(n) lookup in the worst case, and HashMaps have O(1), as the post mentions. Basically, you have to walk through the entire array if the header you’re looking for is at the end, whereas it’s always cheap with a HashMap. But the trie is even better!
Basically, none of this is Rust specific, it’s just algorithms stuff.
Neat optimization! Would it have been feasible to spend a bit of cpu/memory and tag headers as internal upon the request construction? That way filtering on output is trivial.
This solution reminds me of an old blog post, where two members of the F# team at Microsoft battled back and forth on making the best solver for some word game. Cloudflare's solution here looks a lot like the second-to-last solution; the final post involved collapsing the tree nodes into a 2D array of u32s for maximum cache-hits. It was a really neat solution!